Eric Coissac 0b3fcf3cf0 feat: add PersistentBitVec and upgrade PersistentCompactIntVec format
Introduces PersistentBitVec, a dense, memory-mapped bit vector optimized for bulk u64-word operations and SIMD acceleration, complete with bitwise operators and Jaccard/Hamming distance metrics. Upgrades PersistentCompactIntVec to a unified .pciv format using 64-bit indices and offsets, consolidating the binary layout and updating builder/reader lifecycles accordingly. Adds corresponding documentation, updates MkDocs navigation, and implements a comprehensive test suite for persistence round-trips, edge cases, and metric accuracy.
2026-05-14 09:01:36 +08:00

# PersistentBitVec
## Purpose
`PersistentBitVec` stores a dense bit vector (presence/absence per slot) backed by a single mmap'd file. It is the binary counterpart of `PersistentCompactIntVec` and shares the same lifecycle pattern (builder → close → reader). All bulk operations work on u64 words rather than bytes, giving 8× fewer iterations and enabling the compiler to emit POPCNT and SIMD instructions.
Typical use: converting k-mer count vectors to presence/absence vectors (with an optional threshold), then computing set-theoretic distances (Jaccard) or bit-difference counts (Hamming) between samples.
---
## File format
Single `.pbiv` file.
```
offset 0:
magic: [u8; 4] = b"PBIV"
_pad: [u8; 4] = 0 alignment padding
n: u64 number of bits
offset 16:
data: [u64; ⌈n/64⌉] bit words, LSB-first, zero-padded
```
**Header is 16 bytes**, so data starts at an offset divisible by 8. Since `mmap` returns page-aligned memory (≥ 4096-byte aligned), the data slice is u64-aligned, enabling a zero-copy `&[u8] → &[u64]` reinterpretation.
**Bit layout**: bit `i` is in `data[i >> 6]` at bit position `i & 63` (LSB-first). Bits `[n, ⌈n/64⌉×64)` are **always zero** (padding). This invariant is maintained by all write operations and must be restored by `not()` after flipping.
**Total file size**: `16 + ⌈n/64⌉ × 8` bytes.
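
The layout arithmetic above can be sketched as follows. This is a minimal illustration, not the crate's code: the helper names (`word_count`, `pbiv_file_size`, `encode_header`) are hypothetical, and the little-endian encoding of `n` is an assumption, since the format description does not state a byte order.

```rust
/// Size in bytes of the fixed header: magic (4) + padding (4) + n (8).
const HEADER_LEN: u64 = 16;

/// Number of u64 words needed to hold `n` bits, i.e. ceil(n / 64).
fn word_count(n: u64) -> u64 {
    n / 64 + u64::from(n % 64 != 0)
}

/// Total file size in bytes: 16 + ceil(n / 64) * 8.
fn pbiv_file_size(n: u64) -> u64 {
    HEADER_LEN + word_count(n) * 8
}

/// Builds the 16-byte header.
/// ASSUMPTION: `n` is stored little-endian (not stated in the format description).
fn encode_header(n: u64) -> [u8; 16] {
    let mut header = [0u8; 16];
    header[0..4].copy_from_slice(b"PBIV"); // magic
    // bytes 4..8 stay zero: alignment padding
    header[8..16].copy_from_slice(&n.to_le_bytes()); // number of bits
    header
}
```

For example, a vector of 65 bits needs two words, so the file would be 32 bytes long under this layout.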
---
## Lifecycle
### Builder (`PersistentBitVecBuilder`)
```rust
struct PersistentBitVecBuilder {
mmap: MmapMut,
n: usize,
}
```
The file and mmap are created immediately at construction. The header is written once at `new()` or copied from the source at `build_from*()`. `close()` is a single flush — there is no tail to append, unlike `PersistentCompactIntVec`.
#### Constructors
**`new(n: usize, path: &Path) -> io::Result<Self>`**
Creates the file, writes the header, zero-extends to `16 + ⌈n/64⌉×8` bytes, mmaps immediately. All bits default to 0.
**`build_from(source: &PersistentBitVec, path: &Path) -> io::Result<Self>`**
OS-level file copy (no per-bit iteration), then mmap. Initialisation cost: O(file_size).
**`build_from_counts(source: &PersistentCompactIntVec, threshold: u32, path: &Path) -> io::Result<Self>`**
Creates a new file, iterates `source` with its merge-scan iterator (O(n)), and writes bits directly into u64 words:
```rust
// bit i = 1 iff source[i] >= threshold
words[slot >> 6] |= 1u64 << (slot & 63);
```
Handles overflow values (≥ 255) transparently — the count iterator returns the true u32 value regardless.
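
As a rough sketch of the thresholding step, the function below packs `count >= threshold` into u64 words. The plain `&[u32]` slice is a stand-in for the merge-scan count iterator, and the name `presence_words` is hypothetical.

```rust
/// Packs `counts[i] >= threshold` into u64 words, LSB-first.
/// `counts` stands in for the count iterator of the source vector (an assumption
/// made so the example has no external dependencies).
fn presence_words(counts: &[u32], threshold: u32) -> Vec<u64> {
    // ceil(len / 64) words, all bits initially 0
    let mut words = vec![0u64; (counts.len() + 63) / 64];
    for (slot, &count) in counts.iter().enumerate() {
        if count >= threshold {
            // bit `slot` lives in word slot >> 6 at position slot & 63
            words[slot >> 6] |= 1u64 << (slot & 63);
        }
    }
    words
}
```

With `threshold = 1` this reduces to a plain presence/absence conversion, which is what `build_from_presence` is described as doing.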
**`build_from_presence(source: &PersistentCompactIntVec, path: &Path) -> io::Result<Self>`**
Shorthand for `build_from_counts(source, 1, path)`.
#### Bit-level access
```rust
fn get(&self, slot: u64) -> bool
fn set(&mut self, slot: u64, value: bool)
```
Byte-level mmap access: `mmap[16 + slot/8]`, bit `slot % 8`. O(1).
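
A minimal sketch of this byte-level addressing, operating on an in-memory buffer laid out like the file (16-byte header followed by data). The free functions `get_bit` and `set_bit` are illustrative stand-ins for the methods above. Byte addressing agrees with the word layout (`data[i >> 6]`, bit `i & 63`) only if words are stored little-endian, which is assumed here.

```rust
/// Offset of the data region: the header is 16 bytes.
const HEADER_LEN: usize = 16;

/// Reads bit `slot` from a buffer laid out like the file (header + data).
fn get_bit(buf: &[u8], slot: u64) -> bool {
    let byte = buf[HEADER_LEN + (slot / 8) as usize];
    (byte >> (slot % 8)) & 1 == 1
}

/// Sets or clears bit `slot` in a buffer laid out like the file.
fn set_bit(buf: &mut [u8], slot: u64, value: bool) {
    let byte = &mut buf[HEADER_LEN + (slot / 8) as usize];
    let mask = 1u8 << (slot % 8);
    if value {
        *byte |= mask;
    } else {
        *byte &= !mask;
    }
}
```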
#### Word-level bulk operations
All operate on `⌈n/64⌉` u64 words. O(n/64) per call.
```rust
builder.and(&other); // self[i] &= other[i] for all i
builder.or(&other); // self[i] |= other[i]
builder.xor(&other); // self[i] ^= other[i]
builder.not(); // self[i] = !self[i], then re-zero padding bits
```
`and`/`or`/`xor` read `other`'s word slice directly (no allocation). `not()` flips all words then masks the last word's padding bits to restore the invariant.
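
The word-level loop behind `and`/`or`/`xor` might look like the sketch below, written against plain `u64` slices rather than the mmap. The helper names are hypothetical, and the length check is an assumption: the source does not say how mismatched sizes are handled.

```rust
/// Applies `op` word by word: dst[i] = op(dst[i], src[i]). No allocation.
fn combine_words(dst: &mut [u64], src: &[u64], op: impl Fn(u64, u64) -> u64) {
    // ASSUMPTION: both vectors must span the same number of words.
    assert_eq!(dst.len(), src.len(), "bit vectors must have the same word count");
    for (d, &s) in dst.iter_mut().zip(src) {
        *d = op(*d, s);
    }
}

fn and_words(dst: &mut [u64], src: &[u64]) {
    combine_words(dst, src, |a, b| a & b);
}

fn or_words(dst: &mut [u64], src: &[u64]) {
    combine_words(dst, src, |a, b| a | b);
}

fn xor_words(dst: &mut [u64], src: &[u64]) {
    combine_words(dst, src, |a, b| a ^ b);
}
```

Because each iteration touches one whole word with no branching, loops of this shape are good candidates for auto-vectorisation.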
#### `close(self) -> io::Result<()>`
Flushes the mmap. The header was written at construction and is never rewritten. O(1) in Rust code.
---
### Reader (`PersistentBitVec`)
```rust
struct PersistentBitVec {
mmap: Mmap,
n: usize,
path: PathBuf,
}
```
#### `open(path: &Path) -> io::Result<Self>`
Mmaps the file, validates magic, reads `n` from bytes `[8..16]`. O(1).
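
The header validation can be sketched on a byte slice standing in for the mapped file. The function name `parse_header` is hypothetical, the little-endian decoding of `n` is an assumption, and the final size check is an extra safeguard that the description above does not mention.

```rust
use std::io;

/// Validates the magic and returns `n`, the number of bits.
fn parse_header(bytes: &[u8]) -> io::Result<u64> {
    if bytes.len() < 16 || &bytes[0..4] != &b"PBIV"[..] {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "not a .pbiv file"));
    }
    let mut raw = [0u8; 8];
    raw.copy_from_slice(&bytes[8..16]);
    // ASSUMPTION: `n` is stored little-endian.
    let n = u64::from_le_bytes(raw);
    // ASSUMPTION: extra check that the length matches 16 + ceil(n / 64) * 8.
    let expected = 16 + (n / 64 + u64::from(n % 64 != 0)) * 8;
    if bytes.len() as u64 != expected {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "size does not match n"));
    }
    Ok(n)
}
```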
#### `get(slot: u64) -> bool`
Byte-level read from `mmap[16 + slot/8]`. O(1).
#### `iter() -> BitIter<'_>`
Sequential scan, byte by byte, yielding `bool` values in slot order. Implements `ExactSizeIterator`. O(n).
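
A byte-by-byte iterator of this kind could be sketched as below. `ByteBitIter` is a hypothetical stand-in for `BitIter`, and its fields are assumptions; it reads from a slice holding only the data region.

```rust
/// Hypothetical stand-in for `BitIter`: yields one `bool` per slot, in slot order.
struct ByteBitIter<'a> {
    data: &'a [u8], // data region only (header already skipped)
    pos: usize,     // next slot to yield
    n: usize,       // number of valid bits
}

impl<'a> Iterator for ByteBitIter<'a> {
    type Item = bool;

    fn next(&mut self) -> Option<bool> {
        if self.pos >= self.n {
            return None; // padding bits past `n` are never yielded
        }
        let bit = (self.data[self.pos / 8] >> (self.pos % 8)) & 1 == 1;
        self.pos += 1;
        Some(bit)
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        let remaining = self.n - self.pos;
        (remaining, Some(remaining))
    }
}

// An exact `size_hint` is what makes `len()` available.
impl<'a> ExactSizeIterator for ByteBitIter<'a> {}
```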
#### Aggregates
```rust
fn count_ones(&self) -> u64 // popcount over all words; padding bits are 0
fn count_zeros(&self) -> u64 // n - count_ones()
```
`count_ones` iterates `⌈n/64⌉` words and calls `u64::count_ones()` (maps to `POPCNT`). O(n/64).
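
In sketch form, over a plain word slice (free functions used here for illustration; `n` is passed explicitly because there is no struct to hold it):

```rust
/// Population count over all words. Relies on padding bits being 0.
fn count_ones(words: &[u64]) -> u64 {
    words.iter().map(|w| u64::from(w.count_ones())).sum()
}

/// Number of zero bits among the `n` valid bits.
fn count_zeros(words: &[u64], n: u64) -> u64 {
    n - count_ones(words)
}
```

Note that `count_zeros` is derived from `n` rather than counted directly, so the zero padding bits never inflate the result.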
#### Distance methods
Both operate word by word. O(n/64).
| Method | Formula | Notes |
|---|---|---|
| `jaccard_dist(&other) -> f64` | `1 - \|A∩B\| / \|A∪B\|` | `(a&b).count_ones()`, `(a\|b).count_ones()` per word |
| `hamming_dist(&other) -> u64` | number of differing bits | `(a^b).count_ones()` per word |
Edge case (both all-zero → union = 0): `jaccard_dist` returns 0.0.
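
Both metrics can be sketched over plain word slices as follows. These are free functions for illustration, not the crate's methods, and the equal-length assertion is an assumption about how mismatched sizes are treated.

```rust
/// Jaccard distance: 1 - |A∩B| / |A∪B|, or 0.0 when the union is empty.
fn jaccard_dist(a: &[u64], b: &[u64]) -> f64 {
    assert_eq!(a.len(), b.len()); // ASSUMPTION: same word count required
    let (mut inter_bits, mut union_bits) = (0u64, 0u64);
    for (&x, &y) in a.iter().zip(b) {
        inter_bits += u64::from((x & y).count_ones());
        union_bits += u64::from((x | y).count_ones());
    }
    if union_bits == 0 {
        0.0 // both vectors all-zero
    } else {
        1.0 - inter_bits as f64 / union_bits as f64
    }
}

/// Hamming distance: number of positions where the two vectors differ.
fn hamming_dist(a: &[u64], b: &[u64]) -> u64 {
    assert_eq!(a.len(), b.len()); // ASSUMPTION: same word count required
    a.iter().zip(b).map(|(&x, &y)| u64::from((x ^ y).count_ones())).sum()
}
```

For `A = {2, 3}` and `B = {1, 3}` the intersection has one element and the union three, so the Jaccard distance is 2/3 and the Hamming distance is 2.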
---
## Implementation notes
### u64 word view
The unsafe cast from `&[u8]` to `&[u64]` is sound because:
1. `mmap` base is page-aligned (≥ 4096-byte boundary).
2. Data offset = 16, and `16 % 8 == 0` → the data pointer is 8-byte aligned.
3. Data length = `⌈n/64⌉ × 8` bytes — always a multiple of 8.
This gives zero-copy word-level access with no intermediate allocation.
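
A defensive variant of the cast is sketched below. It checks conditions 2 and 3 at run time instead of relying on the layout argument alone; the function name `as_words` and the `Option` return are assumptions made for illustration.

```rust
use std::mem::{align_of, size_of};
use std::slice;

/// Reinterprets a byte slice as u64 words without copying.
/// Returns `None` if the slice is misaligned or not a whole number of words.
fn as_words(data: &[u8]) -> Option<&[u64]> {
    let aligned = data.as_ptr() as usize % align_of::<u64>() == 0;
    let whole = data.len() % size_of::<u64>() == 0;
    if !(aligned && whole) {
        return None;
    }
    // SAFETY: the pointer is aligned for u64, the length covers whole words
    // inside the original slice, every bit pattern is a valid u64, and the
    // returned lifetime is tied to `data`.
    Some(unsafe {
        slice::from_raw_parts(data.as_ptr() as *const u64, data.len() / size_of::<u64>())
    })
}
```

The words are read in the host's native byte order, so the on-disk format is only portable between machines of the same endianness unless the reader converts explicitly.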
### Padding invariant
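Implementing `not()` without masking the last word would set the padding bits and corrupt `count_ones()`, `hamming_dist()`, and `jaccard_dist()`. The mask applied after flipping is `(1u64 << (n % 64)) - 1`. When `n % 64 == 0` there is no padding and masking is a no-op; the expression itself would evaluate to 0 in that case, so it must not be applied as-is. The other operations (`and`, `or`, `xor`) preserve zero padding as long as both operands satisfy the invariant, because `0 & 0`, `0 | 0`, and `0 ^ 0` are all 0.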
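
A minimal sketch of the flip-then-mask step, on a plain word slice. The helper names are hypothetical, and handling `n % 64 == 0` by returning an all-ones mask is one possible way to make that case a no-op.

```rust
/// Mask selecting the valid bits of the last word.
/// When `n` is a multiple of 64 there is no padding, so every bit is valid.
fn last_word_mask(n: usize) -> u64 {
    match n % 64 {
        0 => u64::MAX,
        r => (1u64 << r) - 1,
    }
}

/// Flips every bit, then clears the padding bits of the last word.
fn not_words(words: &mut [u64], n: usize) {
    for w in words.iter_mut() {
        *w = !*w;
    }
    if let Some(last) = words.last_mut() {
        *last &= last_word_mask(n);
    }
}
```

For `n = 70`, flipping an all-zero vector leaves 64 ones in the first word and 6 in the second, so a subsequent popcount reports 70 rather than 128.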
---
## Complexity
| Operation | Time | Notes |
|---|---|---|
| `new` / `open` | O(1) | mmap setup + header parse |
| `get` / `set` (builder or reader) | O(1) | byte-level mmap |
| `iter()` | O(n) | byte-by-byte scan |
| `count_ones` / `count_zeros` | O(n/64) | POPCNT per u64 word |
| `and` / `or` / `xor` / `not` | O(n/64) | word-level bitwise ops |
| `jaccard_dist` / `hamming_dist` | O(n/64) | word AND/OR/XOR + POPCNT |
| `build_from` | O(file_size) | OS copy |
| `build_from_counts` / `build_from_presence` | O(n) | count iter + word fill |
| `close` | O(1) | flush only |