Introduces PersistentBitVec, a dense, memory-mapped bit vector optimized for bulk u64-word operations and SIMD acceleration, complete with bitwise operators and Jaccard/Hamming distance metrics. Upgrades PersistentCompactIntVec to a unified .pciv format using 64-bit indices and offsets, consolidating the binary layout and updating builder/reader lifecycles accordingly. Adds corresponding documentation, updates MkDocs navigation, and implements a comprehensive test suite for persistence round-trips, edge cases, and metric accuracy.
PersistentCompactIntVec
Purpose
PersistentCompactIntVec stores a dense array of non-negative integers indexed by MPHF slot where the vast majority of values are small (0–254) and large values are rare. It is designed for mmap-compatible random and sequential access with minimal memory footprint and optimal cache behaviour.
The design is motivated by count distributions observed in genomics data: about 99.9% of k-mer counts fit in a u8. Overflow (count ≥ 255) affects only ~0.07% of distinct k-mers, but those counts can exceed 10⁶ (chloroplast, ribosomal repeats).
Design
Two-tier structure:
- Primary array: [u8; n], stored at offset 24 in the PCIV file and mmap'd. Values 0–254 are stored directly. Value 255 is a sentinel meaning "look in overflow".
- Overflow section: sorted list of (slot: u32, value: u32) pairs for all slots where the true value ≥ 255, with a sparse L1-fitting index for fast lookup.
primary[slot] < 255 → return primary[slot]
primary[slot] == 255 → binary search in overflow
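The lookup rule above can be sketched on plain slices. This is a minimal illustration, not the library's code: `lookup` and `SENTINEL` are hypothetical names, and in-memory slices stand in for the mmap'd sections.

```rust
/// Sentinel byte in the primary array meaning "look in overflow".
const SENTINEL: u8 = 255;

/// Hypothetical helper: returns the value stored at `slot`.
/// `overflow` must be sorted by slot, as in the PCIV data section.
fn lookup(primary: &[u8], overflow: &[(u32, u32)], slot: usize) -> u32 {
    let byte = primary[slot];
    if byte < SENTINEL {
        // Common case: the value is stored directly in the primary byte.
        return u32::from(byte);
    }
    // Rare case: the true value (>= 255) lives in the sorted overflow list.
    let pos = overflow
        .binary_search_by_key(&(slot as u32), |&(s, _)| s)
        .expect("a sentinel slot must have an overflow entry");
    overflow[pos].1
}
```

Note that a true value of exactly 255 also goes through the overflow tier, since the byte 255 is reserved as the sentinel.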
Single-file format
Everything is stored in a single .pciv file. Write order matches computation order: the header placeholder is written first, then primary (known at new()), then overflow data and index (known at close()), then the header is overwritten at offset 0.
offset 0:
magic: [u8; 4] = b"PCIV"
n: u64 number of slots
n_overflow: u32 number of overflow entries
step: u32 sparse index step (0 = no index)
n_index: u32 number of index entries
offset 24:
primary: [u8; n] one byte per slot, 255 = overflow sentinel
offset 24 + n:
data: [(slot: u32, value: u32); n_overflow] sorted by slot
offset 24 + n + n_overflow × 8:
index: [(slot: u32, pos: u32); n_index] sparse index
The index entries point into data: index[i] = (slot of data[i×step], i×step).
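As a rough sketch of the layout above, the 24-byte header can be encoded and decoded as follows. The `Header` struct and its methods are hypothetical, and little-endian byte order is an assumption: the layout above does not state the endianness.

```rust
/// Size of the fixed header: 4 (magic) + 8 (n) + 4 + 4 + 4.
const HEADER_SIZE: usize = 24;

/// Hypothetical in-memory view of the PCIV header fields.
#[derive(Debug, PartialEq)]
struct Header {
    n: u64,
    n_overflow: u32,
    step: u32,
    n_index: u32,
}

/// Reads a u32 at byte offset `at` (little-endian is an assumption).
fn read_u32(buf: &[u8], at: usize) -> u32 {
    let mut b = [0u8; 4];
    b.copy_from_slice(&buf[at..at + 4]);
    u32::from_le_bytes(b)
}

impl Header {
    fn to_bytes(&self) -> [u8; HEADER_SIZE] {
        let mut buf = [0u8; HEADER_SIZE];
        buf[0..4].copy_from_slice(b"PCIV");
        buf[4..12].copy_from_slice(&self.n.to_le_bytes());
        buf[12..16].copy_from_slice(&self.n_overflow.to_le_bytes());
        buf[16..20].copy_from_slice(&self.step.to_le_bytes());
        buf[20..24].copy_from_slice(&self.n_index.to_le_bytes());
        buf
    }

    /// Returns None when the magic bytes do not match.
    fn from_bytes(buf: &[u8; HEADER_SIZE]) -> Option<Header> {
        if buf[..4] != b"PCIV"[..] {
            return None;
        }
        let mut n = [0u8; 8];
        n.copy_from_slice(&buf[4..12]);
        Some(Header {
            n: u64::from_le_bytes(n),
            n_overflow: read_u32(buf, 12),
            step: read_u32(buf, 16),
            n_index: read_u32(buf, 20),
        })
    }

    /// Start of the overflow data section: 24 + n.
    fn data_offset(&self) -> u64 {
        HEADER_SIZE as u64 + self.n
    }

    /// Start of the sparse index: 24 + n + n_overflow × 8.
    fn index_offset(&self) -> u64 {
        self.data_offset() + u64::from(self.n_overflow) * 8
    }
}
```

The two offset helpers mirror the section offsets listed in the layout, so every section position follows from the header alone.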
Lifecycle
Builder (PersistentCompactIntVecBuilder)
Used during construction. The primary section is mmap'd immediately at construction time (both for new and build_from), so the file exists and is addressable from the start. The overflow is held in a HashMap<u64, u32> in RAM.
struct PersistentCompactIntVecBuilder {
path: PathBuf,
mmap: MmapMut, // primary section live in the file from the start
n: usize,
overflow: HashMap<u64, u32>, // values ≥ 255
}
new(n: usize, path: &Path) -> io::Result<Self>
Creates the file, pre-allocates HEADER_SIZE + n zero bytes, mmaps it. The primary is zero-initialised (all slots = 0). Returns immediately ready for set / get.
build_from(source: &PersistentCompactIntVec, path: &Path) -> io::Result<Self>
Copies the source PCIV file to path (OS-level copy — no per-slot iteration), mmaps the copy, then loads the overflow section into a HashMap. Initialisation cost: O(file copy) + O(n_overflow), not O(n).
At close(), the primary section is not rewritten: it is already in the file via mmap. Only the overflow data, the sparse index, and the header are updated.
set(slot: u64, value: u32) / get(slot: u64) -> u32
Direct mmap byte access for the primary; HashMap for the overflow. Both O(1). Mutations can move a slot between tiers freely (downward mutation removes the HashMap entry; upward mutation adds it).
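The tier movement described above can be illustrated with an in-memory stand-in. `BuilderSketch` is a hypothetical type: a `Vec<u8>` replaces the mmap'd primary section, and no file is involved.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for the builder's two tiers.
struct BuilderSketch {
    primary: Vec<u8>,            // one byte per slot, 255 = sentinel
    overflow: HashMap<u64, u32>, // true values >= 255
}

impl BuilderSketch {
    fn new(n: usize) -> Self {
        BuilderSketch { primary: vec![0; n], overflow: HashMap::new() }
    }

    fn set(&mut self, slot: u64, value: u32) {
        if value < 255 {
            // Downward mutation: store the byte and drop any overflow entry.
            self.primary[slot as usize] = value as u8;
            self.overflow.remove(&slot);
        } else {
            // Upward mutation: mark the sentinel and record the true value.
            self.primary[slot as usize] = 255;
            self.overflow.insert(slot, value);
        }
    }

    fn get(&self, slot: u64) -> u32 {
        match self.primary[slot as usize] {
            255 => self.overflow[&slot],
            byte => u32::from(byte),
        }
    }
}
```

Removing the HashMap entry on a downward mutation keeps the overflow map equal to the set of sentinel slots, which is what the sort at close() relies on.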
Element-wise operations — min, max, add, diff
Each takes a &PersistentCompactIntVec of equal length and updates self in place via set:
builder.min(&other); // self[i] = min(self[i], other[i])
builder.max(&other); // self[i] = max(self[i], other[i])
builder.add(&other); // self[i] = self[i].checked_add(other[i]) (panics on u32 overflow)
builder.diff(&other); // self[i] = self[i].saturating_sub(other[i])
All iterate other with other.iter() (merge-scan, O(n_other)).
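The overflow semantics of add and diff can be shown on plain u32 slices. This is only a sketch of the arithmetic: `add_in_place` and `diff_in_place` are hypothetical names, and the real methods operate on the builder via set rather than on slices.

```rust
/// self[i] = self[i] + other[i], panicking on u32 overflow.
fn add_in_place(a: &mut [u32], b: &[u32]) {
    assert_eq!(a.len(), b.len(), "vectors must have equal length");
    for (x, &y) in a.iter_mut().zip(b) {
        *x = x.checked_add(y).expect("u32 overflow in add");
    }
}

/// self[i] = self[i] - other[i], clamped at zero.
fn diff_in_place(a: &mut [u32], b: &[u32]) {
    assert_eq!(a.len(), b.len(), "vectors must have equal length");
    for (x, &y) in a.iter_mut().zip(b) {
        *x = x.saturating_sub(y);
    }
}
```

The asymmetry is deliberate in this reading of the text: a sum that exceeds u32 is treated as an error, while a negative difference is clamped to 0 because the vector stores non-negative integers only.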
close(self) -> io::Result<()>
- Flush and drop the mmap (primary changes are now on disk).
- Sort the overflow HashMap into Vec<(u32, u32)>.
- Truncate the file to HEADER_SIZE + n (removes old data+index if build_from was used).
- Append sorted overflow data, then sparse index.
- Seek to offset 0, overwrite the header with final values.
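The sort and index-building steps above might look like the following sketch, with file I/O left out. `finalize_overflow` is a hypothetical helper. Emitting floor(n_overflow / step) index entries is an assumption chosen because it reproduces the entry count quoted in the step computation section; the real implementation may treat a trailing partial block differently.

```rust
use std::collections::HashMap;

/// Hypothetical helper: turns the builder's overflow map into the sorted
/// data section and the sparse index (step == 0 means "no index").
fn finalize_overflow(
    overflow: &HashMap<u64, u32>,
    step: usize,
) -> (Vec<(u32, u32)>, Vec<(u32, u32)>) {
    // On-disk pairs use u32 slots, so this sketch assumes slots fit in u32.
    let mut data: Vec<(u32, u32)> = overflow
        .iter()
        .map(|(&slot, &value)| {
            debug_assert!(slot <= u64::from(u32::MAX));
            (slot as u32, value)
        })
        .collect();
    data.sort_unstable_by_key(|&(slot, _)| slot);

    // index[i] = (slot of data[i * step], i * step)
    let index = if step == 0 {
        Vec::new()
    } else {
        (0..data.len() / step)
            .map(|i| (data[i * step].0, (i * step) as u32))
            .collect()
    };
    (data, index)
}
```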
Reader (PersistentCompactIntVec)
Used at query time. The whole file is mmap'd; only the sparse index is copied into a Vec at open time (≤ 32 KB, L1-resident).
struct PersistentCompactIntVec {
mmap: Mmap,
n: usize,
n_overflow: usize,
step: u32,
index: Vec<(u32, u32)>, // L1-resident
primary_offset: usize, // = 24 (HEADER_SIZE)
data_offset: usize, // = 24 + n
path: PathBuf,
}
open(path: &Path) -> io::Result<Self>
Mmaps the file, parses the 24-byte header, copies the sparse index entries into a Vec. The primary and data sections stay mmap'd.
get(slot: u64) -> u32 — random access
primary[slot] < 255 → return it directly
step == 0:
binary_search(data[0..n_overflow], slot)
step > 0:
i = upper_bound(index[..].slot, slot) − 1 // in L1-resident Vec
binary_search(data[index[i].pos .. index[i+1].pos], slot)
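The two-level search above can be sketched on slices. `overflow_get` is a hypothetical name; an empty `index` slice plays the role of step == 0, and the last index block is assumed to extend to the end of the data section.

```rust
/// Hypothetical helper: finds `slot` in the sorted overflow data,
/// narrowing the search with the sparse index when one is present.
fn overflow_get(data: &[(u32, u32)], index: &[(u32, u32)], slot: u32) -> Option<u32> {
    let (lo, hi) = if index.is_empty() {
        // No index: binary search over the whole data section.
        (0, data.len())
    } else {
        // upper_bound: count of index entries whose slot is <= target.
        let ub = index.partition_point(|&(s, _)| s <= slot);
        if ub == 0 {
            return None; // target precedes the first overflow entry
        }
        let i = ub - 1;
        let lo = index[i].1 as usize;
        // The last block runs to the end of the data section.
        let hi = index.get(i + 1).map_or(data.len(), |&(_, pos)| pos as usize);
        (lo, hi)
    };
    data[lo..hi]
        .binary_search_by_key(&slot, |&(s, _)| s)
        .ok()
        .map(|p| data[lo + p].1)
}
```

The first search touches only the small index Vec; the second touches one block of at most roughly `step` entries in the data section, which is where the "≤ 2 memory regions" note in the complexity table comes from.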
iter() -> Iter<'_> — sequential scan, O(n)
Merge-scan: reads primary bytes in order; on sentinel 255, advances a sequential pointer into the sorted data section rather than doing a binary search. This gives O(n + n_overflow) with no random access into the data section.
Iter implements ExactSizeIterator. &PersistentCompactIntVec implements IntoIterator.
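A minimal sketch of the merge-scan, assuming in-memory slices in place of the mmap'd sections. `merge_scan` is a hypothetical free function rather than the actual Iter type.

```rust
/// Hypothetical merge-scan: yields one value per slot, in slot order.
/// `data` must be sorted by slot and contain exactly one entry per
/// sentinel byte in `primary`.
fn merge_scan<'a>(
    primary: &'a [u8],
    data: &'a [(u32, u32)],
) -> impl Iterator<Item = u32> + 'a {
    let mut next = 0usize; // sequential cursor into the overflow data
    primary.iter().map(move |&byte| {
        if byte < 255 {
            u32::from(byte)
        } else {
            // Sentinels appear in slot order, so the matching overflow
            // entry is always the next unread one.
            let value = data[next].1;
            next += 1;
            value
        }
    })
}
```

Because both sections are ordered by slot, the cursor never moves backwards, and an aggregate such as sum reduces to `merge_scan(primary, data).map(u64::from).sum::<u64>()`.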
Aggregate
fn sum(&self) -> u64 // Σ self[i] as u64, via iter()
Distance methods
All take &other of equal length, iterate both with zip(self.iter(), other.iter()), and return f64.
| Method | Formula |
|---|---|
| bray_dist | 1 − 2·Σmin(aᵢ,bᵢ) / (Σaᵢ + Σbᵢ) |
| relfreq_bray_dist | Bray-Curtis on relative frequencies: 1 − Σmin(pᵢ,qᵢ), where pᵢ = aᵢ/Σa and qᵢ = bᵢ/Σb |
| euclidean_dist | √Σ(aᵢ − bᵢ)² |
| relfreq_euclidean_dist | Euclidean on relative frequencies: √Σ(pᵢ − qᵢ)² |
| hellinger_euclidean_dist | √Σ(√pᵢ − √qᵢ)² (Euclidean on sqrt(relfreq)) |
| hellinger_dist | hellinger_euclidean_dist / √2 (standard Hellinger distance ∈ [0, 1]) |
| threshold_jaccard_dist(&other, threshold: u32) | 1 − \|A ∩ B\| / \|A ∪ B\|, where A = {i : aᵢ ≥ threshold} and B = {i : bᵢ ≥ threshold} |
| jaccard_dist | threshold_jaccard_dist(&other, 1) |
Edge cases (both vectors all-zero, or union empty for Jaccard): distance = 0.0.
Step computation
Chosen at close() once n_overflow is known:
L1_entries = 32 768 / 8 = 4096
step = 0 if n_overflow ≤ 4096
step = ⌈n_overflow / 4096⌉ otherwise
For the Betula nana reference (359 044 overflows): step = 88, n_index = 4 080 entries = 31.9 KB.
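The step rule can be checked with a few lines of integer arithmetic. `sparse_step` and `L1_ENTRIES` are hypothetical names for this sketch.

```rust
/// Number of 8-byte index entries that fit in a 32 KiB budget.
const L1_ENTRIES: usize = 32_768 / 8; // 4096

/// Hypothetical helper: 0 means "no index", otherwise ⌈n_overflow / 4096⌉.
fn sparse_step(n_overflow: usize) -> usize {
    if n_overflow <= L1_ENTRIES {
        0
    } else {
        // Ceiling division without floating point.
        (n_overflow + L1_ENTRIES - 1) / L1_ENTRIES
    }
}
```

For 359 044 overflow entries this gives 88. The quoted 4 080 index entries correspond to the integer division 359 044 / 88, and 4 080 × 8 bytes = 32 640 bytes, which is the 31.9 KB figure.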
Complexity
| Operation | Time | Notes |
|---|---|---|
| set / get (builder) | O(1) | mmap byte + HashMap |
| get (reader, no overflow) | O(1) | single mmap byte |
| get (reader, with index) | O(log step) | ≤ 2 memory regions |
| get (reader, no index) | O(log n_overflow) | data fits in a few cache lines |
| iter() full scan | O(n + n_overflow) | merge-scan, no binary search |
| sum, distances | O(n) | via iter() / zip(iter(), iter()) |
| min / max / add / diff | O(n) | via other.iter() + builder set |
| close | O(n_overflow log n_overflow) | sort + sequential write |
| open | O(n_index) | index copy into Vec |
| build_from | O(file_size) + O(n_overflow) | OS copy + HashMap load |