Formalize the two-phase MPHF indexing architecture and update Phase 6 to use `evidence.bin` for direct kmer extraction. Simplify the evidence and unitig storage layouts to flat packed formats enabling O(1) random access. Introduce aggregation traits (`ColumnWeights`, `CountPartials`, `BitPartials`) to support additive distance metric decomposition across partitions. Narrow the documented scope from metagenomic to individual genome datasets, and replace speculative open questions with concrete implementation specifications.
9.1 KiB
PersistentBitVec and PersistentBitMatrix
Purpose
PersistentBitVec stores a dense bit vector (presence/absence per slot) backed by a single mmap'd file. It is the binary counterpart of PersistentCompactIntVec and shares the same lifecycle pattern (builder → close → reader). All bulk operations work on u64 words rather than bytes, giving 8× fewer iterations and enabling the compiler to emit POPCNT and SIMD instructions.
Typical use: converting k-mer count vectors to presence/absence vectors (with optional threshold), then computing set-theoretic distances (Jaccard) or edit distances (Hamming) between samples.
PersistentBitMatrix wraps multiple PersistentBitVec columns in a directory, exposing a column-major binary matrix with row-access API. A single-column bit matrix is a vector at the API level.
PersistentBitVec — single-column file
File format
Single .pbiv file.
offset 0:
magic: [u8; 4] = b"PBIV"
_pad: [u8; 4] = 0 alignment padding
n: u64 number of bits
offset 16:
data: [u64; ⌈n/64⌉] bit words, LSB-first, zero-padded
Header is 16 bytes, so data starts at an offset divisible by 8. Since mmap returns page-aligned memory (≥ 4096-byte aligned), the data slice is u64-aligned, enabling a zero-copy &[u8] → &[u64] reinterpretation.
Bit layout: bit i is in data[i >> 6] at bit position i & 63 (LSB-first). Bits [n, ⌈n/64⌉×64) are always zero (padding). This invariant is maintained by all write operations and must be restored by not() after flipping.
Total file size: 16 + ⌈n/64⌉ × 8 bytes.
Lifecycle
Builder (PersistentBitVecBuilder)
struct PersistentBitVecBuilder {
mmap: MmapMut,
n: usize,
}
The file and mmap are created immediately at construction. The header is written once at new() or copied from the source at build_from*(). close() is a single flush — there is no tail to append, unlike PersistentCompactIntVec.
new(n: usize, path: &Path) -> io::Result<Self>
Creates the file, writes the header, zero-extends to 16 + ⌈n/64⌉×8 bytes, mmaps immediately. All bits default to 0.
build_from(source: &PersistentBitVec, path: &Path) -> io::Result<Self>
OS-level file copy (no per-bit iteration), then mmap. Initialisation cost: O(file_size).
build_from_counts(source: &PersistentCompactIntVec, threshold: u32, path: &Path) -> io::Result<Self>
Creates a new file, iterates source with its merge-scan iterator (O(n)), and writes bits directly into u64 words:
// bit i = 1 iff source[i] >= threshold
words[slot >> 6] |= 1u64 << (slot & 63);
Handles overflow values (≥ 255) transparently — the count iterator returns the true u32 value regardless.
build_from_presence(source: &PersistentCompactIntVec, path: &Path) -> io::Result<Self>
Shorthand for build_from_counts(source, 1, path).
Bit-level access
fn get(&self, slot: usize) -> bool
fn set(&mut self, slot: usize, value: bool)
Byte-level mmap access: mmap[16 + slot/8], bit slot % 8. O(1).
Word-level bulk operations
All operate on ⌈n/64⌉ u64 words. O(n/64) per call.
builder.and(&other); // self[i] &= other[i] for all i
builder.or(&other); // self[i] |= other[i]
builder.xor(&other); // self[i] ^= other[i]
builder.not(); // self[i] = !self[i], then re-zero padding bits
and/or/xor read other's word slice directly (no allocation). not() flips all words then masks the last word's padding bits to restore the invariant.
close(self) -> io::Result<()>
Flushes the mmap. The header was written at construction and is never rewritten. O(1) in Rust code.
Reader (PersistentBitVec)
struct PersistentBitVec {
mmap: Mmap,
n: usize,
path: PathBuf,
}
open(path: &Path) -> io::Result<Self>
Mmaps the file, validates magic, reads n from bytes [8..16]. O(1).
get(slot: usize) -> bool
Byte-level read from mmap[16 + slot/8]. O(1).
iter() -> BitIter<'_>
Sequential scan, byte by byte, yielding bool values in slot order. Implements ExactSizeIterator. O(n).
Aggregates
fn count_ones(&self) -> u64 // popcount over all words; padding bits are 0
fn count_zeros(&self) -> u64 // n - count_ones()
count_ones iterates ⌈n/64⌉ words and calls u64::count_ones() (maps to POPCNT). O(n/64).
Distance methods
Both operate word by word. O(n/64).
| Method | Formula | Notes |
|---|---|---|
jaccard_dist(&other) -> f64 |
1 − |A∩B| / |A∪B| |
(a&b).count_ones(), (a|b).count_ones() per word |
hamming_dist(&other) -> u64 |
number of differing bits | (a^b).count_ones() per word |
Edge case (both all-zero → union = 0): jaccard_dist returns 0.0.
Implementation notes
u64 word view
The unsafe cast from &[u8] to &[u64] is sound because:
mmapbase is page-aligned (≥ 4096-byte boundary).- Data offset = 16, and
16 % 8 == 0→ the data pointer is 8-byte aligned. - Data length =
⌈n/64⌉ × 8bytes — always a multiple of 8.
This gives zero-copy word-level access with no intermediate allocation.
Padding invariant
Writing not() without masking the last word would corrupt count_ones(), hamming_dist(), and jaccard_dist(). The mask applied after flipping is (1u64 << (n % 64)) - 1 (no-op if n % 64 == 0). All other operations (and, or, xor) preserve existing zero padding since they can only clear or preserve bits already set by not().
Complexity
| Operation | Time | Notes |
|---|---|---|
new / open |
O(1) | mmap setup + header parse |
get / set (builder or reader) |
O(1) | byte-level mmap |
iter() |
O(n) | byte-by-byte scan |
count_ones / count_zeros |
O(n/64) | POPCNT per u64 word |
and / or / xor / not |
O(n/64) | word-level bitwise ops |
jaccard_dist / hamming_dist |
O(n/64) | word AND/OR/XOR + POPCNT |
build_from |
O(file_size) | OS copy |
build_from_counts / build_from_presence |
O(n) | count iter + word fill |
close |
O(1) | flush only |
PersistentBitMatrix — column-major directory
Design
A directory containing meta.json and N column files col_000000.pbiv, col_000001.pbiv, …, each a PersistentBitVec. Used for presence/absence matrices: one column per genome, one bit per MPHF slot.
presence/
meta.json {"n": <n_slots>, "n_cols": <G>}
col_000000.pbiv genome 0
col_000001.pbiv genome 1
...
Column-major layout makes per-genome set operations (Jaccard, Hamming, AND/OR) cache-friendly — each genome is a contiguous file. Row access (which genomes contain a given kmer) requires one O(1) read per column.
Builder (PersistentBitMatrixBuilder)
struct PersistentBitMatrixBuilder {
dir: PathBuf,
n: usize,
n_cols: usize,
}
new(n: usize, dir: &Path) -> io::Result<Self>
Creates the directory (including parents).
add_col(&mut self) -> io::Result<PersistentBitVecBuilder>
Creates col_NNNNNN.pbiv for the next column and returns its builder. The caller fills the column and calls builder.close() before calling add_col again.
close(self) -> io::Result<()>
Writes meta.json with the final n and n_cols.
Reader (PersistentBitMatrix)
struct PersistentBitMatrix {
cols: Vec<PersistentBitVec>,
n: usize,
}
open(dir: &Path) -> io::Result<Self>
Reads meta.json, opens all col_NNNNNN.pbiv files.
row(slot: usize) -> Box<[bool]>
Returns the presence vector: [col_0[slot], col_1[slot], …, col_{G-1}[slot]]. One byte read per column. O(G).
col(c: usize) -> &PersistentBitVec
Direct access to a single column for column-oriented operations.
LayerData implementation
impl LayerData for PersistentBitMatrix {
type Item = Box<[bool]>;
fn open(layer_dir: &Path) -> OLMResult<Self> { /* opens layer_dir/presence/ */ }
fn read(&self, slot: usize) -> Box<[bool]> { self.row(slot) }
}
Aggregation traits — obicompactvec::traits
PersistentBitMatrix implements two aggregation traits used by LayeredStore<S> for cross-layer and cross-partition distance computations.
ColumnWeights
impl ColumnWeights for PersistentBitMatrix {
fn col_weights(&self) -> Array1<u64> // = self.count_ones()
}
col_weights()[c] = number of set bits in column c across all slots.
BitPartials
impl BitPartials for PersistentBitMatrix {
// Self-contained partials (additive across layers)
fn partial_jaccard(&self) -> (Array2<u64>, Array2<u64>) // (inter, union)
fn partial_hamming(&self) -> Array2<u64> // differing bits
// Provided finalisations
fn jaccard_dist_matrix(&self) -> Array2<f64>
fn hamming_dist_matrix(&self) -> Array2<u64>
}
partial_jaccard returns (inter, union) as a pair because union is not reconstructible from per-column count_ones() — it depends on both columns simultaneously. Both components are additively decomposable across (partition, layer) pairs; the final jaccard_dist_matrix() is computed from their element-wise sums.