Introduces PersistentBitVec, a dense, memory-mapped bit vector optimized for bulk u64-word operations and SIMD acceleration, complete with bitwise operators and Jaccard/Hamming distance metrics. Upgrades PersistentCompactIntVec to a unified .pciv format using 64-bit indices and offsets, consolidating the binary layout and updating builder/reader lifecycles accordingly. Adds corresponding documentation, updates MkDocs navigation, and implements a comprehensive test suite for persistence round-trips, edge cases, and metric accuracy.
5.8 KiB
PersistentBitVec
Purpose
PersistentBitVec stores a dense bit vector (presence/absence per slot) backed by a single mmap'd file. It is the binary counterpart of PersistentCompactIntVec and shares the same lifecycle pattern (builder → close → reader). All bulk operations work on u64 words rather than bytes, giving 8× fewer iterations and enabling the compiler to emit POPCNT and SIMD instructions.
Typical use: converting k-mer count vectors to presence/absence vectors (with optional threshold), then computing set-theoretic distances (Jaccard) or edit distances (Hamming) between samples.
File format
Single .pbiv file.
offset 0:
magic: [u8; 4] = b"PBIV"
_pad: [u8; 4] = 0 alignment padding
n: u64 number of bits
offset 16:
data: [u64; ⌈n/64⌉] bit words, LSB-first, zero-padded
Header is 16 bytes, so data starts at an offset divisible by 8. Since mmap returns page-aligned memory (≥ 4096-byte aligned), the data slice is u64-aligned, enabling a zero-copy &[u8] → &[u64] reinterpretation.
Bit layout: bit i is in data[i >> 6] at bit position i & 63 (LSB-first). Bits [n, ⌈n/64⌉×64) are always zero (padding). This invariant is maintained by all write operations and must be restored by not() after flipping.
Total file size: 16 + ⌈n/64⌉ × 8 bytes.
Lifecycle
Builder (PersistentBitVecBuilder)
struct PersistentBitVecBuilder {
mmap: MmapMut,
n: usize,
}
The file and mmap are created immediately at construction. The header is written once at new() or copied from the source at build_from*(). close() is a single flush — there is no tail to append, unlike PersistentCompactIntVec.
Constructors
new(n: usize, path: &Path) -> io::Result<Self>
Creates the file, writes the header, zero-extends to 16 + ⌈n/64⌉×8 bytes, mmaps immediately. All bits default to 0.
build_from(source: &PersistentBitVec, path: &Path) -> io::Result<Self>
OS-level file copy (no per-bit iteration), then mmap. Initialisation cost: O(file_size).
build_from_counts(source: &PersistentCompactIntVec, threshold: u32, path: &Path) -> io::Result<Self>
Creates a new file, iterates source with its merge-scan iterator (O(n)), and writes bits directly into u64 words:
// bit i = 1 iff source[i] >= threshold
words[slot >> 6] |= 1u64 << (slot & 63);
Handles overflow values (≥ 255) transparently — the count iterator returns the true u32 value regardless.
build_from_presence(source: &PersistentCompactIntVec, path: &Path) -> io::Result<Self>
Shorthand for build_from_counts(source, 1, path).
Bit-level access
fn get(&self, slot: u64) -> bool
fn set(&mut self, slot: u64, value: bool)
Byte-level mmap access: mmap[16 + slot/8], bit slot % 8. O(1).
Word-level bulk operations
All operate on ⌈n/64⌉ u64 words. O(n/64) per call.
builder.and(&other); // self[i] &= other[i] for all i
builder.or(&other); // self[i] |= other[i]
builder.xor(&other); // self[i] ^= other[i]
builder.not(); // self[i] = !self[i], then re-zero padding bits
and/or/xor read other's word slice directly (no allocation). not() flips all words then masks the last word's padding bits to restore the invariant.
close(self) -> io::Result<()>
Flushes the mmap. The header was written at construction and is never rewritten. O(1) in Rust code.
Reader (PersistentBitVec)
struct PersistentBitVec {
mmap: Mmap,
n: usize,
path: PathBuf,
}
open(path: &Path) -> io::Result<Self>
Mmaps the file, validates magic, reads n from bytes [8..16]. O(1).
get(slot: u64) -> bool
Byte-level read from mmap[16 + slot/8]. O(1).
iter() -> BitIter<'_>
Sequential scan, byte by byte, yielding bool values in slot order. Implements ExactSizeIterator. O(n).
Aggregates
fn count_ones(&self) -> u64 // popcount over all words; padding bits are 0
fn count_zeros(&self) -> u64 // n - count_ones()
count_ones iterates ⌈n/64⌉ words and calls u64::count_ones() (maps to POPCNT). O(n/64).
Distance methods
Both operate word by word. O(n/64).
| Method | Formula | Notes |
|---|---|---|
jaccard_dist(&other) -> f64 |
`1 − | A∩B |
hamming_dist(&other) -> u64 |
number of differing bits | (a^b).count_ones() per word |
Edge case (both all-zero → union = 0): jaccard_dist returns 0.0.
Implementation notes
u64 word view
The unsafe cast from &[u8] to &[u64] is sound because:
mmapbase is page-aligned (≥ 4096-byte boundary).- Data offset = 16, and
16 % 8 == 0→ the data pointer is 8-byte aligned. - Data length =
⌈n/64⌉ × 8bytes — always a multiple of 8.
This gives zero-copy word-level access with no intermediate allocation.
Padding invariant
Writing not() without masking the last word would corrupt count_ones(), hamming_dist(), and jaccard_dist(). The mask applied after flipping is (1u64 << (n % 64)) - 1 (no-op if n % 64 == 0). All other operations (and, or, xor) preserve existing zero padding since they can only clear or preserve bits already set by not().
Complexity
| Operation | Time | Notes |
|---|---|---|
new / open |
O(1) | mmap setup + header parse |
get / set (builder or reader) |
O(1) | byte-level mmap |
iter() |
O(n) | byte-by-byte scan |
count_ones / count_zeros |
O(n/64) | POPCNT per u64 word |
and / or / xor / not |
O(n/64) | word-level bitwise ops |
jaccard_dist / hamming_dist |
O(n/64) | word AND/OR/XOR + POPCNT |
build_from |
O(file_size) | OS copy |
build_from_counts / build_from_presence |
O(n) | count iter + word fill |
close |
O(1) | flush only |