# PersistentCompactIntVec

## Purpose

`PersistentCompactIntVec` stores a dense array of non-negative integers, indexed by MPHF slot, in which the vast majority of values are small (0–254) and large values are rare. It is designed for mmap-compatible random and sequential access with a minimal memory footprint and cache-friendly behaviour.

The design is motivated by count distributions observed in genomics data: 99.9% of k-mer counts fit in a `u8`; overflow (count ≥ 255) affects ~0.07% of distinct k-mers, but those counts can exceed 10⁶ (chloroplast, ribosomal repeats).

---
## Design

Two-tier structure:

1. **Primary array**: `[u8; n]`, stored at offset 24 in the PCIV file and mmap'd. Values 0–254 are stored directly. Value **255 is a sentinel** meaning "look in overflow".
2. **Overflow section**: sorted list of `(slot: u32, value: u32)` pairs for all slots where the true value ≥ 255, with a **sparse L1-fitting index** for fast lookup.

```
primary[slot] <  255  →  return primary[slot]
primary[slot] == 255  →  binary search in overflow
```
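The two-tier rule above can be sketched over plain slices. This is only an illustration: `lookup` and `SENTINEL` are hypothetical names, and the slices stand in for the mmap'd primary and overflow sections.

```rust
/// Sentinel byte in the primary array: the value lives in the overflow section.
const SENTINEL: u8 = 255;

/// Two-tier lookup over in-memory stand-ins for the mmap'd sections.
/// `overflow` must be sorted by slot (assumption of this sketch).
fn lookup(primary: &[u8], overflow: &[(u32, u32)], slot: usize) -> u32 {
    let byte = primary[slot];
    if byte < SENTINEL {
        // Small value: stored directly in the primary byte.
        return u32::from(byte);
    }
    // Sentinel: binary search the sorted (slot, value) pairs.
    let idx = overflow
        .binary_search_by_key(&(slot as u32), |&(s, _)| s)
        .expect("sentinel byte without a matching overflow entry");
    overflow[idx].1
}
```

For example, with `primary = [3, 255, 0]` and `overflow = [(1, 1000)]`, slot 0 resolves from the primary byte and slot 1 from the overflow list.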

---

## Single-file format

Everything is stored in a single `.pciv` file. Write order matches computation order: the header placeholder is written first, then the primary section (allocated at `new()`), then the overflow data and index (known at `close()`); finally the header is overwritten at offset 0.

```
offset 0:
  magic:      [u8; 4]   = b"PCIV"
  n:          u64       number of slots
  n_overflow: u32       number of overflow entries
  step:       u32       sparse index step (0 = no index)
  n_index:    u32       number of index entries

offset 24:
  primary:    [u8; n]   one byte per slot, 255 = overflow sentinel

offset 24 + n:
  data:       [(slot: u32, value: u32); n_overflow]   sorted by slot

offset 24 + n + n_overflow × 8:
  index:      [(slot: u32, pos: u32); n_index]        sparse index
```

The index entries point into `data`: `index[i] = (slot of data[i×step], i×step)`.
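As a rough illustration of the layout above, the 24-byte header can be packed and unpacked as follows. The `Header` type and its methods are hypothetical, and little-endian byte order is an assumption: the format description does not state the byte order.

```rust
const HEADER_SIZE: usize = 24;

/// Hypothetical in-memory view of the PCIV header fields.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Header {
    n: u64,
    n_overflow: u32,
    step: u32,
    n_index: u32,
}

impl Header {
    /// Serialise to 24 bytes: magic, n, n_overflow, step, n_index.
    /// Little-endian is an assumption of this sketch.
    fn to_bytes(self) -> [u8; HEADER_SIZE] {
        let mut buf = [0u8; HEADER_SIZE];
        buf[0..4].copy_from_slice(b"PCIV");
        buf[4..12].copy_from_slice(&self.n.to_le_bytes());
        buf[12..16].copy_from_slice(&self.n_overflow.to_le_bytes());
        buf[16..20].copy_from_slice(&self.step.to_le_bytes());
        buf[20..24].copy_from_slice(&self.n_index.to_le_bytes());
        buf
    }

    /// Parse a header; returns None if the magic bytes do not match.
    fn from_bytes(buf: &[u8; HEADER_SIZE]) -> Option<Self> {
        if &buf[0..4] != b"PCIV" {
            return None;
        }
        let u32_at =
            |off: usize| u32::from_le_bytes([buf[off], buf[off + 1], buf[off + 2], buf[off + 3]]);
        let mut n = [0u8; 8];
        n.copy_from_slice(&buf[4..12]);
        Some(Header {
            n: u64::from_le_bytes(n),
            n_overflow: u32_at(12),
            step: u32_at(16),
            n_index: u32_at(20),
        })
    }

    /// Byte offset of the overflow data section: 24 + n.
    fn data_offset(&self) -> u64 {
        HEADER_SIZE as u64 + self.n
    }

    /// Byte offset of the sparse index: 24 + n + n_overflow × 8.
    fn index_offset(&self) -> u64 {
        self.data_offset() + u64::from(self.n_overflow) * 8
    }
}
```

With `n = 1000` and `n_overflow = 3`, the data section starts at byte 1024 and the index at byte 1048, matching the offset formulas in the layout.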

---

## Lifecycle

### Builder (`PersistentCompactIntVecBuilder`)

Used during construction. The primary section is **mmap'd immediately** at construction time (for both `new` and `build_from`), so the file exists and is addressable from the start. The overflow is held in RAM in a `HashMap<u64, u32>`.

```rust
struct PersistentCompactIntVecBuilder {
    path:     PathBuf,
    mmap:     MmapMut,            // primary section, live in the file from the start
    n:        usize,
    overflow: HashMap<u64, u32>,  // values ≥ 255
}
```

#### `new(n: usize, path: &Path) -> io::Result<Self>`

Creates the file, pre-allocates `HEADER_SIZE + n` zero bytes, and mmaps it. The primary is zero-initialised (all slots = 0). The builder is immediately ready for `set` / `get`.
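The pre-allocation step can be approximated with the standard library alone. This sketch stops before the mmap call, which needs an external crate, and `preallocate` is a hypothetical helper rather than the builder's actual code.

```rust
use std::fs::{File, OpenOptions};
use std::io;
use std::path::Path;

const HEADER_SIZE: u64 = 24;

/// Create (or truncate) `path` and size it to HEADER_SIZE + n bytes.
fn preallocate(path: &Path, n: u64) -> io::Result<File> {
    let file = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .truncate(true)
        .open(path)?;
    // Extending a file with set_len fills the new range with zeros,
    // so the header placeholder and every primary slot start at 0.
    file.set_len(HEADER_SIZE + n)?;
    Ok(file)
}
```

The returned handle is opened read-write, which is what a mutable memory mapping of the file would require.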
#### `build_from(source: &PersistentCompactIntVec, path: &Path) -> io::Result<Self>`

Copies the source PCIV file to `path` (OS-level copy, no per-slot iteration), mmaps the copy, then loads the overflow section into a `HashMap`. Initialisation cost: O(file copy) + O(n_overflow), with no O(n) per-slot loop.

At `close()`, the primary section is **not rewritten**: it is already in the file via mmap. Only the overflow data, the sparse index, and the header are updated.

#### `set(slot: u64, value: u32)` / `get(slot: u64) -> u32`

Direct mmap byte access for the primary; HashMap for the overflow. Both are O(1) (expected time for the HashMap). Mutations can move a slot between tiers freely: a downward mutation (new value < 255) removes the HashMap entry; an upward mutation (new value ≥ 255) adds it.
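The tier transitions described above can be sketched with a `Vec<u8>` standing in for the mmap'd primary. `SketchBuilder` is a hypothetical type for illustration, and writing the sentinel byte on an upward mutation is inferred from the two-tier design rather than taken from the real builder.

```rust
use std::collections::HashMap;

const SENTINEL: u8 = 255;

/// In-memory stand-in for the builder: a Vec<u8> replaces the mmap'd primary.
struct SketchBuilder {
    primary: Vec<u8>,
    overflow: HashMap<u64, u32>,
}

impl SketchBuilder {
    fn new(n: usize) -> Self {
        SketchBuilder { primary: vec![0; n], overflow: HashMap::new() }
    }

    fn set(&mut self, slot: u64, value: u32) {
        if value < u32::from(SENTINEL) {
            // Small value: write the byte and drop any stale overflow entry.
            self.primary[slot as usize] = value as u8;
            self.overflow.remove(&slot);
        } else {
            // Large value: mark the slot with the sentinel, keep the value in RAM.
            self.primary[slot as usize] = SENTINEL;
            self.overflow.insert(slot, value);
        }
    }

    fn get(&self, slot: u64) -> u32 {
        match self.primary[slot as usize] {
            SENTINEL => self.overflow[&slot],
            small => u32::from(small),
        }
    }
}
```

Setting a slot to 1 000 000 and then to 254 leaves the overflow map empty again, which is the downward mutation case.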
#### Element-wise operations: `min`, `max`, `add`, `diff`

Each takes a `&PersistentCompactIntVec` of equal length and updates `self` in place via `set`:

```rust
builder.min(&other);   // self[i] = min(self[i], other[i])
builder.max(&other);   // self[i] = max(self[i], other[i])
builder.add(&other);   // self[i] = self[i] + other[i], checked (panics on u32 overflow)
builder.diff(&other);  // self[i] = self[i].saturating_sub(other[i])
```

All four iterate `other` with `other.iter()` (merge-scan, O(n_other)).
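The arithmetic of `add` and `diff` can be illustrated on plain `u32` slices. The function names below are hypothetical, and the real methods go through `other.iter()` and the builder's `set` rather than slices.

```rust
/// dst[i] = dst[i] + other[i]; panics if any sum exceeds u32::MAX.
fn add_in_place(dst: &mut [u32], other: &[u32]) {
    assert_eq!(dst.len(), other.len(), "vectors must have equal length");
    for (d, &o) in dst.iter_mut().zip(other) {
        *d = d.checked_add(o).expect("u32 overflow in add");
    }
}

/// dst[i] = dst[i] - other[i], clamped at 0 instead of wrapping.
fn diff_in_place(dst: &mut [u32], other: &[u32]) {
    assert_eq!(dst.len(), other.len(), "vectors must have equal length");
    for (d, &o) in dst.iter_mut().zip(other) {
        *d = d.saturating_sub(o);
    }
}
```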
#### `close(self) -> io::Result<()>`

1. Flush and drop the mmap (primary changes are now on disk).
2. Sort the overflow HashMap into a `Vec<(u32, u32)>` ordered by slot.
3. Truncate the file to `HEADER_SIZE + n` (removes the old data and index if `build_from` was used).
4. Append the sorted overflow data, then the sparse index.
5. Seek to offset 0 and overwrite the header with the final values.
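Steps 2 and 4 can be sketched in memory as follows. Both function names are hypothetical. The sketch emits one index entry per started block of `step` entries, which is an assumption: the real builder may treat a trailing partial block differently.

```rust
use std::collections::HashMap;

/// Step 2: turn the overflow map into (slot, value) pairs sorted by slot.
fn sorted_overflow(overflow: &HashMap<u64, u32>) -> Vec<(u32, u32)> {
    let mut data: Vec<(u32, u32)> = overflow
        .iter()
        .map(|(&slot, &value)| {
            // On-disk entries store the slot as u32.
            assert!(slot <= u64::from(u32::MAX), "slot does not fit in u32");
            (slot as u32, value)
        })
        .collect();
    data.sort_unstable_by_key(|&(slot, _)| slot);
    data
}

/// Step 4: index[i] = (slot of data[i * step], i * step); empty when step == 0.
fn sparse_index(data: &[(u32, u32)], step: usize) -> Vec<(u32, u32)> {
    if step == 0 {
        return Vec::new();
    }
    data.iter()
        .enumerate()
        .step_by(step)
        .map(|(pos, &(slot, _))| (slot, pos as u32))
        .collect()
}
```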

---

### Reader (`PersistentCompactIntVec`)

Used at query time. The whole file is mmap'd; only the sparse index is copied into a `Vec` at open time (≤ 32 KB, L1-resident).

```rust
struct PersistentCompactIntVec {
    mmap:           Mmap,
    n:              usize,
    n_overflow:     usize,
    step:           u32,
    index:          Vec<(u32, u32)>,  // L1-resident
    primary_offset: usize,            // = 24 (HEADER_SIZE)
    data_offset:    usize,            // = 24 + n
    path:           PathBuf,
}
```

#### `open(path: &Path) -> io::Result<Self>`

Mmaps the file, parses the 24-byte header, and copies the sparse index entries into a `Vec`. The primary and data sections stay mmap'd.
#### `get(slot: u64) -> u32`: random access

```
primary[slot] < 255  →  return it directly

step == 0:
    binary_search(data[0..n_overflow], slot)

step > 0:
    i  = upper_bound(index[..].slot, slot) − 1      // in L1-resident Vec
    hi = index[i+1].pos, or n_overflow if i is the last index entry
    binary_search(data[index[i].pos .. hi], slot)
```
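The overflow branch of this lookup might look like the sketch below, using slices in place of the mmap'd data section. `overflow_get` is a hypothetical name, and an empty `index` plays the role of `step == 0`.

```rust
/// Find `slot` in the sorted overflow `data`, narrowing the search with the
/// sparse `index` when one is present.
fn overflow_get(data: &[(u32, u32)], index: &[(u32, u32)], slot: u32) -> Option<u32> {
    let (lo, hi) = if index.is_empty() {
        // No index: search the whole data section.
        (0, data.len())
    } else {
        // upper_bound: count of index entries whose slot is <= the query.
        let ub = index.partition_point(|&(s, _)| s <= slot);
        if ub == 0 {
            return None; // query precedes the first overflow slot
        }
        let i = ub - 1;
        let lo = index[i].1 as usize;
        // The last block runs to the end of the data section.
        let hi = index.get(i + 1).map_or(data.len(), |&(_, pos)| pos as usize);
        (lo, hi)
    };
    data[lo..hi]
        .binary_search_by_key(&slot, |&(s, _)| s)
        .ok()
        .map(|k| data[lo + k].1)
}
```

Returning `Option` is a choice made for the sketch; in the reader, a sentinel byte in the primary implies that the entry exists.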
#### `iter() -> Iter<'_>`: sequential scan, O(n)

Merge-scan: reads the primary bytes in order; on sentinel 255, it advances a sequential pointer into the sorted data section rather than doing a binary search. The full scan costs O(n + n_overflow), with no random access into the data section.

`Iter` implements `ExactSizeIterator`. `&PersistentCompactIntVec` implements `IntoIterator`.
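A minimal sketch of the merge-scan, assuming slices in place of the mmap'd sections; `MergeScan` and `merge_scan` are hypothetical names, not the library's `Iter`.

```rust
/// Merge-scan over the primary bytes and the slot-sorted overflow pairs.
struct MergeScan<'a> {
    primary: std::slice::Iter<'a, u8>,
    overflow: std::slice::Iter<'a, (u32, u32)>,
}

impl<'a> Iterator for MergeScan<'a> {
    type Item = u32;

    fn next(&mut self) -> Option<u32> {
        let byte = *self.primary.next()?;
        if byte < 255 {
            Some(u32::from(byte))
        } else {
            // Sentinels are met in slot order, so the next overflow entry
            // is the one belonging to this slot: no search is needed.
            let &(_, value) = self.overflow.next().expect("missing overflow entry");
            Some(value)
        }
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        // One item per primary byte, so the slice iterator's hint is exact.
        self.primary.size_hint()
    }
}

impl<'a> ExactSizeIterator for MergeScan<'a> {}

fn merge_scan<'a>(primary: &'a [u8], overflow: &'a [(u32, u32)]) -> MergeScan<'a> {
    MergeScan { primary: primary.iter(), overflow: overflow.iter() }
}
```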
#### Aggregate

```rust
fn sum(&self) -> u64   // Σ self[i] as u64, via iter()
```
#### Distance methods

All take `&other` of equal length, iterate both with `zip(self.iter(), other.iter())`, and return `f64`.

| Method | Formula |
|---|---|
| `bray_dist` | `1 − 2·Σmin(aᵢ,bᵢ) / (Σaᵢ + Σbᵢ)` |
| `relfreq_bray_dist` | Bray-Curtis on relative frequencies: `1 − Σmin(pᵢ,qᵢ)` where `pᵢ = aᵢ/Σa` and `qᵢ = bᵢ/Σb` |
| `euclidean_dist` | `√Σ(aᵢ − bᵢ)²` |
| `relfreq_euclidean_dist` | Euclidean on relative frequencies: `√Σ(pᵢ − qᵢ)²` |
| `hellinger_euclidean_dist` | `√Σ(√pᵢ − √qᵢ)²`, i.e. Euclidean on sqrt(relfreq) |
| `hellinger_dist` | `hellinger_euclidean_dist / √2`, the standard Hellinger distance ∈ [0, 1] |
| `threshold_jaccard_dist(&other, threshold: u32)` | `1 − |A∩B| / |A∪B|` where a slot is present iff its count ≥ threshold |
| `jaccard_dist` | `threshold_jaccard_dist(&other, 1)` |

Edge cases (both vectors all-zero, or empty union for Jaccard): distance = 0.0.
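Two of these formulas, written over plain `u32` slices as a sketch. The free functions below are illustrative stand-ins for the methods, which iterate the two vectors with `zip` instead.

```rust
/// Bray-Curtis distance: 1 − 2·Σmin(a, b) / (Σa + Σb); 0.0 if both are all-zero.
fn bray_dist(a: &[u32], b: &[u32]) -> f64 {
    assert_eq!(a.len(), b.len(), "vectors must have equal length");
    let (mut sum_min, mut sum_all) = (0u64, 0u64);
    for (&x, &y) in a.iter().zip(b) {
        sum_min += u64::from(x.min(y));
        sum_all += u64::from(x) + u64::from(y);
    }
    if sum_all == 0 {
        return 0.0;
    }
    1.0 - 2.0 * sum_min as f64 / sum_all as f64
}

/// Jaccard distance on presence/absence, where present means count >= threshold.
fn threshold_jaccard_dist(a: &[u32], b: &[u32], threshold: u32) -> f64 {
    assert_eq!(a.len(), b.len(), "vectors must have equal length");
    let (mut n_inter, mut n_union) = (0u64, 0u64);
    for (&x, &y) in a.iter().zip(b) {
        let (in_a, in_b) = (x >= threshold, y >= threshold);
        if in_a && in_b {
            n_inter += 1;
        }
        if in_a || in_b {
            n_union += 1;
        }
    }
    if n_union == 0 {
        return 0.0; // empty union: defined as distance 0
    }
    1.0 - n_inter as f64 / n_union as f64
}
```

For `a = [1, 0, 3]` and `b = [1, 2, 0]`, Σmin = 1 and Σa + Σb = 7, so the Bray-Curtis distance is 1 − 2/7 = 5/7; with threshold 1 the two share one of three present slots, so the Jaccard distance is 2/3.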

---

## Step computation

Chosen at `close()` once `n_overflow` is known. The index budget is 32 768 bytes (the L1 target), and each index entry takes 8 bytes:

```
L1_entries = 32 768 / 8 = 4096

step = 0                        if n_overflow ≤ 4096
step = ⌈n_overflow / 4096⌉      otherwise
```

For the *Betula nana* reference (359 044 overflows): step = 88, n_index = 4 080 entries = 31.9 KB.
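The rule translates directly into integer arithmetic. The constant and function names in this sketch are hypothetical.

```rust
const L1_BUDGET_BYTES: usize = 32_768;
const INDEX_ENTRY_BYTES: usize = 8; // (slot: u32, pos: u32)
const L1_ENTRIES: usize = L1_BUDGET_BYTES / INDEX_ENTRY_BYTES; // 4096

/// Sparse index step: 0 (no index) when the data section is small enough
/// to be searched directly, otherwise ceil(n_overflow / 4096).
fn compute_step(n_overflow: usize) -> usize {
    if n_overflow <= L1_ENTRIES {
        0
    } else {
        // Ceiling division without floating point.
        (n_overflow + L1_ENTRIES - 1) / L1_ENTRIES
    }
}
```

With 359 044 overflow entries this gives ⌈359 044 / 4096⌉ = 88, the step quoted for the reference dataset.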

---

## Complexity

| Operation | Time | Notes |
|---|---|---|
| `set` / `get` (builder) | O(1) | mmap byte + HashMap (expected time) |
| `get` (reader, slot not in overflow) | O(1) | single mmap byte |
| `get` (reader, with index) | O(log n_index + log step) | ≤ 2 memory regions; n_index ≤ 4096, so the index search is bounded |
| `get` (reader, no index) | O(log n_overflow) | n_overflow ≤ 4096, so the data section is ≤ 32 KB |
| `iter()` full scan | O(n + n_overflow) | merge-scan, no binary search |
| `sum`, distances | O(n) | via `iter()` / `zip(iter(), iter())` |
| `min` / `max` / `add` / `diff` | O(n) | via `other.iter()` + builder `set` |
| `close` | O(n_overflow log n_overflow) | sort + sequential write |
| `open` | O(n_index) | index copy into Vec |
| `build_from` | O(file_size) + O(n_overflow) | OS copy + HashMap load |