# PersistentBitVec
## Purpose

`PersistentBitVec` stores a dense bit vector (presence/absence per slot) backed by a single mmap'd file. It is the binary counterpart of `PersistentCompactIntVec` and shares the same lifecycle pattern (builder → close → reader). All bulk operations work on u64 words rather than bytes, giving 8× fewer iterations and enabling the compiler to emit POPCNT and SIMD instructions.

Typical use: converting k-mer count vectors to presence/absence vectors (with optional threshold), then computing set-theoretic distances (Jaccard) or edit distances (Hamming) between samples.

---
## File format

Single `.pbiv` file.

```
offset 0:
  magic:  [u8; 4]        = b"PBIV"
  _pad:   [u8; 4]        = 0         alignment padding
  n:      u64                        number of bits

offset 16:
  data:   [u64; ⌈n/64⌉]              bit words, LSB-first, zero-padded
```

**Header is 16 bytes**, so data starts at an offset divisible by 8. Since `mmap` returns page-aligned memory (≥ 4096-byte aligned), the data slice is u64-aligned, enabling a zero-copy `&[u8] → &[u64]` reinterpretation.

**Bit layout**: bit `i` is in `data[i >> 6]` at bit position `i & 63` (LSB-first). Bits `[n, ⌈n/64⌉×64)` are **always zero** (padding). This invariant is maintained by all write operations and must be restored by `not()` after flipping.

**Total file size**: `16 + ⌈n/64⌉ × 8` bytes.

---
## Lifecycle

### Builder (`PersistentBitVecBuilder`)

```rust
struct PersistentBitVecBuilder {
    mmap: MmapMut,
    n: usize,
}
```

The file and mmap are created immediately at construction. The header is written once at `new()` or copied from the source at `build_from*()`. `close()` is a single flush: there is no tail to append, unlike `PersistentCompactIntVec`.
#### Constructors

**`new(n: usize, path: &Path) -> io::Result<Self>`**

Creates the file, writes the header, zero-extends it to `16 + ⌈n/64⌉×8` bytes, and mmaps it immediately. All bits default to 0.

**`build_from(source: &PersistentBitVec, path: &Path) -> io::Result<Self>`**

OS-level file copy (no per-bit iteration), then mmap. Initialisation cost: O(file_size).

**`build_from_counts(source: &PersistentCompactIntVec, threshold: u32, path: &Path) -> io::Result<Self>`**

Creates a new file, iterates `source` with its merge-scan iterator (O(n)), and writes bits directly into u64 words:

```rust
// bit `slot` = 1 iff source[slot] >= threshold
if count >= threshold {
    words[slot >> 6] |= 1u64 << (slot & 63);
}
```

Overflow values (≥ 255) are handled transparently: the count iterator returns the true u32 value regardless.

**`build_from_presence(source: &PersistentCompactIntVec, path: &Path) -> io::Result<Self>`**

Shorthand for `build_from_counts(source, 1, path)`.
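The thresholding step of `build_from_counts` can be sketched on plain in-memory data. This is a minimal illustration, not the actual implementation: `pack_presence` is a hypothetical name, and the `counts` argument stands in for the values yielded by the source's iterator, whose real type is not shown in this document.

```rust
/// Packs `count >= threshold` into LSB-first u64 words, one bit per slot.
/// `counts` is a stand-in (assumption) for the source's count iterator.
fn pack_presence<I>(counts: I, n: usize, threshold: u32) -> Vec<u64>
where
    I: IntoIterator<Item = u32>,
{
    // ceil(n / 64) words, all bits initially 0, so padding bits stay zero.
    let mut words = vec![0u64; (n + 63) / 64];
    for (slot, count) in counts.into_iter().take(n).enumerate() {
        if count >= threshold {
            // Word index is slot / 64, bit position is slot % 64.
            words[slot >> 6] |= 1u64 << (slot & 63);
        }
    }
    words
}
```

With `threshold = 1` this reduces to the presence/absence conversion performed by `build_from_presence`.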
#### Bit-level access

```rust
fn get(&self, slot: u64) -> bool
fn set(&mut self, slot: u64, value: bool)
```

Byte-level mmap access: `mmap[16 + slot/8]`, bit `slot % 8`. O(1). On little-endian targets this addresses the same bit as the word layout (`data[slot >> 6]`, bit `slot & 63`).
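As a rough sketch of the byte-level addressing, the functions below operate on a plain byte buffer that stands in for the mmap. The names `get_bit` and `set_bit` are hypothetical, and no bounds check against `n` is shown because this document does not say how out-of-range slots are handled.

```rust
const HEADER_LEN: usize = 16;

/// Reads bit `slot` from a buffer laid out like a `.pbiv` file.
fn get_bit(file: &[u8], slot: u64) -> bool {
    let byte = file[HEADER_LEN + (slot / 8) as usize];
    (byte >> (slot % 8)) & 1 == 1
}

/// Sets or clears bit `slot` in a buffer laid out like a `.pbiv` file.
fn set_bit(file: &mut [u8], slot: u64, value: bool) {
    let byte = &mut file[HEADER_LEN + (slot / 8) as usize];
    let mask = 1u8 << (slot % 8);
    if value {
        *byte |= mask;
    } else {
        *byte &= !mask;
    }
}
```

For example, slot 70 lands in byte `16 + 8 = 24` at bit 6, which is also bit `70 & 63 = 6` of word 1 when that word is read as little-endian.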
#### Word-level bulk operations

All operate on `⌈n/64⌉` u64 words. O(n/64) per call.

```rust
builder.and(&other);   // self[i] &= other[i] for all i
builder.or(&other);    // self[i] |= other[i]
builder.xor(&other);   // self[i] ^= other[i]
builder.not();         // self[i] = !self[i], then re-zero padding bits
```

`and`/`or`/`xor` read `other`'s word slice directly (no allocation). `not()` flips all words then masks the last word's padding bits to restore the invariant.
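The word-level operations can be illustrated on plain `u64` slices. The sketch below is written under assumptions: `tail_mask`, `and_words`, and `not_words` are hypothetical free functions, whereas the real methods work on the mmap'd word view. `or` and `xor` would follow the same loop as `and_words` with `|=` and `^=`.

```rust
/// Mask selecting the valid bits of the last word, or `None` when n is a
/// multiple of 64 and the last word therefore has no padding.
fn tail_mask(n: usize) -> Option<u64> {
    match n % 64 {
        0 => None,
        r => Some((1u64 << r) - 1),
    }
}

/// dst[i] &= src[i], word by word.
fn and_words(dst: &mut [u64], src: &[u64]) {
    assert_eq!(dst.len(), src.len(), "vectors must have the same word count");
    for (d, s) in dst.iter_mut().zip(src) {
        *d &= *s;
    }
}

/// Flips every bit, then clears the padding bits of the last word.
fn not_words(words: &mut [u64], n: usize) {
    for w in words.iter_mut() {
        *w = !*w;
    }
    if let (Some(mask), Some(last)) = (tail_mask(n), words.last_mut()) {
        *last &= mask;
    }
}
```

For `n = 70`, negating an all-zero vector leaves the second word equal to `0b11_1111`, so a popcount over both words gives exactly 70.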
#### `close(self) -> io::Result<()>`

Flushes the mmap. The header was written at construction and is never rewritten. O(1) in Rust code.

---
### Reader (`PersistentBitVec`)

```rust
struct PersistentBitVec {
    mmap: Mmap,
    n: usize,
    path: PathBuf,
}
```
#### `open(path: &Path) -> io::Result<Self>`

Mmaps the file, validates magic, reads `n` from bytes `[8..16]`. O(1).
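The header handling can be sketched on a byte slice standing in for the mapped file. Two things here are assumptions rather than documented behaviour: `n` is decoded as little-endian (the byte order is not stated in this document), and the file-size check is an extra sanity test that `open` may or may not perform. `encode_header`, `parse_header`, and `word_count` are hypothetical names.

```rust
use std::io;

const MAGIC: [u8; 4] = *b"PBIV";
const HEADER_LEN: usize = 16;

/// ceil(n / 64) without risking overflow for large n.
fn word_count(n: u64) -> u64 {
    n / 64 + u64::from(n % 64 != 0)
}

/// Builds the 16-byte header: magic, 4 zero padding bytes, then n.
fn encode_header(n: u64) -> [u8; HEADER_LEN] {
    let mut header = [0u8; HEADER_LEN];
    header[0..4].copy_from_slice(&MAGIC);
    // Assumption: n is stored little-endian.
    header[8..16].copy_from_slice(&n.to_le_bytes());
    header
}

/// Validates the magic and returns n, the number of bits.
fn parse_header(file: &[u8]) -> io::Result<u64> {
    if file.len() < HEADER_LEN || file[0..4] != MAGIC {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "not a PBIV file"));
    }
    let mut raw = [0u8; 8];
    raw.copy_from_slice(&file[8..16]);
    let n = u64::from_le_bytes(raw);
    // Extra check (assumption): total size must be 16 + ceil(n / 64) * 8.
    let expected = HEADER_LEN as u64 + word_count(n) * 8;
    if file.len() as u64 != expected {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "file size does not match n"));
    }
    Ok(n)
}
```

For `n = 70` the file holds two data words, so the expected size is `16 + 2 × 8 = 32` bytes.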
#### `get(slot: u64) -> bool`

Byte-level read from `mmap[16 + slot/8]`. O(1).
#### `iter() -> BitIter<'_>`

Sequential scan, byte by byte, yielding `bool` values in slot order. Implements `ExactSizeIterator`. O(n).
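One way such an iterator could look is sketched below. The field names (`data`, `pos`, `n`) are assumptions for illustration; the actual `BitIter` layout is not given in this document. `data` is the data region only, that is, the bytes after the 16-byte header.

```rust
/// Yields the first `n` bits of `data` in slot order, LSB-first within each byte.
struct BitIter<'a> {
    data: &'a [u8],
    pos: usize,
    n: usize,
}

impl<'a> Iterator for BitIter<'a> {
    type Item = bool;

    fn next(&mut self) -> Option<bool> {
        if self.pos >= self.n {
            return None;
        }
        let bit = (self.data[self.pos / 8] >> (self.pos % 8)) & 1 == 1;
        self.pos += 1;
        Some(bit)
    }

    // An exact size hint is what makes `ExactSizeIterator::len` correct.
    fn size_hint(&self) -> (usize, Option<usize>) {
        let remaining = self.n - self.pos;
        (remaining, Some(remaining))
    }
}

impl<'a> ExactSizeIterator for BitIter<'a> {}
```

Stopping at `n` rather than at the end of the buffer keeps the zero padding bits out of the output.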
#### Aggregates

```rust
fn count_ones(&self) -> u64   // popcount over all words; padding bits are 0
fn count_zeros(&self) -> u64  // n - count_ones()
```

`count_ones` iterates `⌈n/64⌉` words and calls `u64::count_ones()`, which compiles to a single `POPCNT` instruction when that CPU feature is enabled for the build target. O(n/64).
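On a plain word slice, the two aggregates reduce to a few lines. This is a sketch with hypothetical free functions; it relies on the padding invariant, because any stray 1 bit beyond `n` would be counted.

```rust
/// Number of set bits; correct only if padding bits beyond n are zero.
fn count_ones(words: &[u64]) -> u64 {
    words.iter().map(|w| u64::from(w.count_ones())).sum()
}

/// Number of clear bits among the first n slots.
fn count_zeros(words: &[u64], n: u64) -> u64 {
    n - count_ones(words)
}
```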
#### Distance methods

Both operate word by word. O(n/64).

| Method | Formula | Notes |
|---|---|---|
| `jaccard_dist(&other) -> f64` | `1 − \|A∩B\| / \|A∪B\|` | `(a&b).count_ones()`, `(a\|b).count_ones()` per word |
| `hamming_dist(&other) -> u64` | number of differing bits | `(a^b).count_ones()` per word |

Edge case (both all-zero → union = 0): `jaccard_dist` returns 0.0.
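The two formulas can be sketched on plain word slices as follows. The free functions are hypothetical stand-ins for the methods, and the equal-length assertion is an assumption, since this document does not say how vectors of different sizes are treated. As a worked example, `a = 0b1100` and `b = 0b1010` share one set bit out of three in the union, giving a Jaccard distance of `1 − 1/3`, and differ in two positions, giving a Hamming distance of 2.

```rust
/// Number of positions where the two vectors differ.
fn hamming_dist(a: &[u64], b: &[u64]) -> u64 {
    assert_eq!(a.len(), b.len(), "vectors must have the same word count");
    a.iter()
        .zip(b)
        .map(|(x, y)| u64::from((x ^ y).count_ones()))
        .sum()
}

/// 1 - |A ∩ B| / |A ∪ B|, with 0.0 when both vectors are all-zero.
fn jaccard_dist(a: &[u64], b: &[u64]) -> f64 {
    assert_eq!(a.len(), b.len(), "vectors must have the same word count");
    let (mut inter, mut uni) = (0u64, 0u64);
    for (x, y) in a.iter().zip(b) {
        inter += u64::from((x & y).count_ones());
        uni += u64::from((x | y).count_ones());
    }
    if uni == 0 {
        0.0
    } else {
        1.0 - inter as f64 / uni as f64
    }
}
```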
---

## Implementation notes

### u64 word view

The unsafe cast from `&[u8]` to `&[u64]` is sound because:

1. `mmap` base is page-aligned (≥ 4096-byte boundary).
2. Data offset = 16, and `16 % 8 == 0` → the data pointer is 8-byte aligned.
3. Data length = `⌈n/64⌉ × 8` bytes, always a multiple of 8.

This gives zero-copy word-level access with no intermediate allocation.
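The three conditions above can also be checked at runtime before casting. The sketch below is one possible shape for such a helper, not the code used by `PersistentBitVec`; `as_words` is a hypothetical name and `data` stands for the mapped bytes after the 16-byte header. The resulting words are read in native byte order, so they match the LSB-first bit layout only on little-endian targets.

```rust
use std::mem::{align_of, size_of};
use std::slice;

/// Reinterprets `data` as u64 words, or returns `None` if the pointer is not
/// 8-byte aligned or the length is not a multiple of 8.
fn as_words(data: &[u8]) -> Option<&[u64]> {
    let aligned = data.as_ptr() as usize % align_of::<u64>() == 0;
    let whole_words = data.len() % size_of::<u64>() == 0;
    if !(aligned && whole_words) {
        return None;
    }
    // SAFETY: the pointer is non-null and 8-byte aligned, the new slice covers
    // exactly the bytes of `data` and shares its lifetime, and every bit
    // pattern is a valid u64.
    Some(unsafe {
        slice::from_raw_parts(data.as_ptr() as *const u64, data.len() / size_of::<u64>())
    })
}
```

Returning `None` instead of casting unconditionally keeps the helper safe to call on arbitrary byte slices.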
### Padding invariant

Running `not()` without masking the last word would set the padding bits and corrupt `count_ones()`, `hamming_dist()`, and `jaccard_dist()`. The mask applied after flipping is `(1u64 << (n % 64)) - 1`. Masking is skipped when `n % 64 == 0`: the last word then has no padding, and the expression would evaluate to 0. The other operations (`and`, `or`, `xor`) preserve zero padding as long as both operands satisfy the invariant, since `0 & 0`, `0 | 0`, and `0 ^ 0` are all 0.

---
## Complexity

| Operation | Time | Notes |
|---|---|---|
| `new` / `open` | O(1) | mmap setup + header parse |
| `get` (builder or reader) / `set` (builder) | O(1) | byte-level mmap |
| `iter()` | O(n) | byte-by-byte scan |
| `count_ones` / `count_zeros` | O(n/64) | POPCNT per u64 word |
| `and` / `or` / `xor` / `not` | O(n/64) | word-level bitwise ops |
| `jaccard_dist` / `hamming_dist` | O(n/64) | word AND/OR/XOR + POPCNT |
| `build_from` | O(file_size) | OS copy |
| `build_from_counts` / `build_from_presence` | O(n) | count iter + word fill |
| `close` | O(1) | flush only |