Eric Coissac 0b3fcf3cf0 feat: add PersistentBitVec and upgrade PersistentCompactIntVec format
Introduces PersistentBitVec, a dense, memory-mapped bit vector optimized for bulk u64-word operations and SIMD acceleration, complete with bitwise operators and Jaccard/Hamming distance metrics. Upgrades PersistentCompactIntVec to a unified .pciv format using 64-bit indices and offsets, consolidating the binary layout and updating builder/reader lifecycles accordingly. Adds corresponding documentation, updates MkDocs navigation, and implements a comprehensive test suite for persistence round-trips, edge cases, and metric accuracy.
2026-05-14 09:01:36 +08:00

# PersistentCompactIntVec
## Purpose
`PersistentCompactIntVec` stores a dense array of non-negative integers indexed by MPHF slot, where the vast majority of values are small (0–254) and large values are rare. It is designed for mmap-compatible random and sequential access with minimal memory footprint and optimal cache behaviour.
The design is motivated by count distributions observed in genomics data: 99.9% of k-mer counts fit in a u8, while overflow (count ≥ 255) affects only ~0.07% of distinct k-mers but can reach values above 10⁶ (chloroplast, ribosomal repeats).
---
## Design
Two-tier structure:
1. **Primary array**: `[u8; n]`, stored at offset 24 in the PCIV file and mmap'd. Values 0–254 are stored directly. Value **255 is a sentinel** meaning "look in overflow".
2. **Overflow section**: sorted list of `(slot: u32, value: u32)` pairs for all slots where the true value ≥ 255, with a **sparse L1-fitting index** for fast lookup.
```
primary[slot] < 255 → return primary[slot]
primary[slot] == 255 → binary search in overflow
```
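The lookup rule above can be sketched over plain in-memory slices. This is only an illustration: `two_tier_get` and `SENTINEL` are hypothetical names, not part of the crate's API, and the slices stand in for the mmap'd sections.

```rust
/// Sentinel byte meaning "the true value lives in the overflow tier".
const SENTINEL: u8 = 255;

/// Two-tier lookup over in-memory slices (hypothetical helper).
/// `overflow` must be sorted by slot. Returns `None` for an out-of-range
/// slot or a sentinel without a matching overflow entry.
fn two_tier_get(primary: &[u8], overflow: &[(u32, u32)], slot: usize) -> Option<u32> {
    let byte = *primary.get(slot)?;
    if byte < SENTINEL {
        // Common case: the value is stored directly in the primary byte.
        return Some(u32::from(byte));
    }
    // Rare case: binary search the sorted (slot, value) pairs.
    let pos = overflow
        .binary_search_by_key(&(slot as u32), |&(s, _)| s)
        .ok()?;
    Some(overflow[pos].1)
}
```

Note that a true value of exactly 255 must also go through the overflow tier, since the byte 255 is reserved for the sentinel.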
---
## Single-file format
Everything is stored in a single `.pciv` file. Write order matches computation order: the header placeholder is written first, then primary (known at `new()`), then overflow data and index (known at `close()`), then the header is overwritten at offset 0.
```
offset 0:
  magic:      [u8; 4] = b"PCIV"
  n:          u64                                     number of slots
  n_overflow: u32                                     number of overflow entries
  step:       u32                                     sparse index step (0 = no index)
  n_index:    u32                                     number of index entries
offset 24:
  primary:    [u8; n]                                 one byte per slot, 255 = overflow sentinel
offset 24 + n:
  data:       [(slot: u32, value: u32); n_overflow]   sorted by slot
offset 24 + n + n_overflow × 8:
  index:      [(slot: u32, pos: u32); n_index]        sparse index
```
The index entries point into `data`: `index[i] = (slot of data[i×step], i×step)`.
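As a rough sketch of the 24-byte header and the section offsets derived from it, the following encodes and parses the fields in the order listed above. The byte order is not stated in this document, so little-endian is an assumption here, and `Header`, `encode_header`, `parse_header` and `section_offsets` are hypothetical names.

```rust
const HEADER_SIZE: usize = 24;

#[derive(Debug, PartialEq)]
struct Header {
    n: u64,
    n_overflow: u32,
    step: u32,
    n_index: u32,
}

/// Serialise the header. ASSUMPTION: little-endian fields.
fn encode_header(h: &Header) -> [u8; HEADER_SIZE] {
    let mut buf = [0u8; HEADER_SIZE];
    buf[0..4].copy_from_slice(b"PCIV");
    buf[4..12].copy_from_slice(&h.n.to_le_bytes());
    buf[12..16].copy_from_slice(&h.n_overflow.to_le_bytes());
    buf[16..20].copy_from_slice(&h.step.to_le_bytes());
    buf[20..24].copy_from_slice(&h.n_index.to_le_bytes());
    buf
}

fn read_u32(buf: &[u8], at: usize) -> u32 {
    let mut b = [0u8; 4];
    b.copy_from_slice(&buf[at..at + 4]);
    u32::from_le_bytes(b)
}

/// Parse the header; `None` if the buffer is too short or the magic is wrong.
fn parse_header(buf: &[u8]) -> Option<Header> {
    if buf.len() < HEADER_SIZE || !buf.starts_with(b"PCIV") {
        return None;
    }
    let mut n = [0u8; 8];
    n.copy_from_slice(&buf[4..12]);
    Some(Header {
        n: u64::from_le_bytes(n),
        n_overflow: read_u32(buf, 12),
        step: read_u32(buf, 16),
        n_index: read_u32(buf, 20),
    })
}

/// Byte offsets of the `data` and `index` sections (8 bytes per overflow entry).
fn section_offsets(h: &Header) -> (u64, u64) {
    let data = HEADER_SIZE as u64 + h.n;
    let index = data + u64::from(h.n_overflow) * 8;
    (data, index)
}
```

For example, with `n = 1000` and `n_overflow = 3`, the data section starts at byte 1024 and the index section at byte 1048.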
---
## Lifecycle
### Builder (`PersistentCompactIntVecBuilder`)
Used during construction. The primary section is **mmap'd immediately** at construction time (both for `new` and `build_from`), so the file exists and is addressable from the start. The overflow is held in a `HashMap<u64, u32>` in RAM.
```rust
struct PersistentCompactIntVecBuilder {
    path: PathBuf,
    mmap: MmapMut,               // primary section, live in the file from the start
    n: usize,
    overflow: HashMap<u64, u32>, // values ≥ 255
}
```
#### `new(n: usize, path: &Path) -> io::Result<Self>`
Creates the file, pre-allocates `HEADER_SIZE + n` zero bytes, mmaps it. The primary is zero-initialised (all slots = 0). Returns immediately ready for `set` / `get`.
#### `build_from(source: &PersistentCompactIntVec, path: &Path) -> io::Result<Self>`
Copies the source PCIV file to `path` (OS-level copy — no per-slot iteration), mmaps the copy, then loads the overflow section into a `HashMap`. Initialisation cost: O(file copy) + O(n_overflow), not O(n).
At `close()`, the primary section is **not rewritten**: it is already in the file via mmap. Only the overflow data, the sparse index, and the header are updated.
#### `set(slot: u64, value: u32)` / `get(slot: u64) -> u32`
Direct mmap byte access for the primary; HashMap for the overflow. Both O(1). Mutations can move a slot between tiers freely (downward mutation removes the HashMap entry; upward mutation adds it).
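The tier-switching behaviour of `set` and `get` can be illustrated with an in-memory stand-in, where a `Vec<u8>` replaces the mmap'd primary. `SketchBuilder` is a hypothetical type written for this example and does not touch any file.

```rust
use std::collections::HashMap;

/// In-memory stand-in for the builder (hypothetical, no mmap, no file).
struct SketchBuilder {
    primary: Vec<u8>,
    overflow: HashMap<u64, u32>,
}

impl SketchBuilder {
    fn new(n: usize) -> Self {
        SketchBuilder { primary: vec![0; n], overflow: HashMap::new() }
    }

    fn set(&mut self, slot: u64, value: u32) {
        let i = slot as usize;
        if value < 255 {
            // Downward mutation: store the byte and drop any stale overflow entry.
            self.primary[i] = value as u8;
            self.overflow.remove(&slot);
        } else {
            // Upward mutation: write the sentinel, keep the true value in RAM.
            self.primary[i] = 255;
            self.overflow.insert(slot, value);
        }
    }

    fn get(&self, slot: u64) -> u32 {
        match self.primary[slot as usize] {
            255 => self.overflow[&slot],
            b => u32::from(b),
        }
    }
}
```

Setting a slot to 70 000 and then to 12 leaves the overflow map empty again, which is what keeps the map proportional to the number of large values rather than to the number of mutations.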
#### Element-wise operations — `min`, `max`, `add`, `diff`
Each takes a `&PersistentCompactIntVec` of equal length and updates `self` in place via `set`:
```rust
builder.min(&other); // self[i] = min(self[i], other[i])
builder.max(&other); // self[i] = max(self[i], other[i])
builder.add(&other); // self[i] = self[i].checked_add(other[i]) (panics on u32 overflow)
builder.diff(&other); // self[i] = self[i].saturating_sub(other[i])
```
All iterate `other` with `other.iter()` (merge-scan, O(n_other)).
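The overflow semantics of `add` and `diff` differ, which a small sketch over plain `u32` slices makes concrete. `add_in_place` and `diff_in_place` are hypothetical free functions that only mirror the per-element arithmetic described above.

```rust
/// self[i] = self[i] + other[i], panicking on u32 overflow.
fn add_in_place(dst: &mut [u32], other: &[u32]) {
    assert_eq!(dst.len(), other.len(), "vectors must have equal length");
    for (a, &b) in dst.iter_mut().zip(other) {
        *a = a.checked_add(b).expect("u32 overflow in add");
    }
}

/// self[i] = self[i] - other[i], clamped at 0.
fn diff_in_place(dst: &mut [u32], other: &[u32]) {
    assert_eq!(dst.len(), other.len(), "vectors must have equal length");
    for (a, &b) in dst.iter_mut().zip(other) {
        *a = a.saturating_sub(b);
    }
}
```

With `[1, 300, 5]` plus `[2, 700, 0]` the result is `[3, 1000, 5]`; subtracting `[10, 1, 5]` from that gives `[0, 999, 0]` because the first element saturates at zero.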
#### `close(self) -> io::Result<()>`
1. Flush and drop the mmap (primary changes are now on disk).
2. Sort the overflow HashMap into `Vec<(u32, u32)>`.
3. Truncate the file to `HEADER_SIZE + n` (removes old data+index if `build_from` was used).
4. Append sorted overflow data, then sparse index.
5. Seek to offset 0, overwrite the header with final values.
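Steps 2 and 4 can be sketched without any file I/O: sort the overflow map by slot, then sample every `step`-th entry to form the sparse index. `finalize_overflow` is a hypothetical helper, and the truncating `as u32` casts simply follow the `(u32, u32)` entry layout given in this document.

```rust
use std::collections::HashMap;

/// Returns (data, index): `data` is sorted by slot, and
/// `index[i] = (slot of data[i * step], i * step)`. `step == 0` means no index.
fn finalize_overflow(
    overflow: &HashMap<u64, u32>,
    step: usize,
) -> (Vec<(u32, u32)>, Vec<(u32, u32)>) {
    let mut data: Vec<(u32, u32)> = overflow
        .iter()
        .map(|(&slot, &value)| (slot as u32, value))
        .collect();
    data.sort_unstable_by_key(|&(slot, _)| slot);

    let index = if step == 0 {
        Vec::new()
    } else {
        data.iter()
            .enumerate()
            .step_by(step)
            .map(|(pos, &(slot, _))| (slot, pos as u32))
            .collect()
    };
    (data, index)
}
```

With five overflow entries and `step = 2`, the index samples positions 0, 2 and 4 of the sorted data.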
---
### Reader (`PersistentCompactIntVec`)
Used at query time. The whole file is mmap'd; only the sparse index is copied into a `Vec` at open time (≤ 32 KB, L1-resident).
```rust
struct PersistentCompactIntVec {
    mmap: Mmap,
    n: usize,
    n_overflow: usize,
    step: u32,
    index: Vec<(u32, u32)>, // L1-resident
    primary_offset: usize,  // = 24 (HEADER_SIZE)
    data_offset: usize,     // = 24 + n
    path: PathBuf,
}
```
#### `open(path: &Path) -> io::Result<Self>`
Mmaps the file, parses the 24-byte header, copies the sparse index entries into a `Vec`. The primary and data sections stay mmap'd.
#### `get(slot: u64) -> u32` — random access
```
primary[slot] < 255 → return it directly
step == 0:
    binary_search(data[0..n_overflow], slot)
step > 0:
    i = upper_bound(index[..].slot, slot) - 1    // in L1-resident Vec
    binary_search(data[index[i].pos .. index[i+1].pos], slot)
```
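The overflow branch of this lookup might look as follows on in-memory slices. `overflow_lookup` is a hypothetical helper; an empty `index` plays the role of `step == 0`, and the last block is assumed to end at the end of `data`, a boundary case the pseudocode leaves implicit.

```rust
/// Find `slot` in the sorted overflow `data`, narrowing the search with the
/// sparse `index` when one is present.
fn overflow_lookup(data: &[(u32, u32)], index: &[(u32, u32)], slot: u32) -> Option<u32> {
    let (lo, hi) = if index.is_empty() {
        (0, data.len())
    } else {
        // upper_bound - 1: last index entry whose slot is <= the query.
        let ub = index.partition_point(|&(s, _)| s <= slot);
        if ub == 0 {
            return None; // query precedes the first overflow slot
        }
        let i = ub - 1;
        let lo = index[i].1 as usize;
        // ASSUMPTION: the final block runs to the end of the data section.
        let hi = index.get(i + 1).map_or(data.len(), |&(_, p)| p as usize);
        (lo, hi)
    };
    data[lo..hi]
        .binary_search_by_key(&slot, |&(s, _)| s)
        .ok()
        .map(|k| data[lo + k].1)
}
```

The first search touches only the small index, and the second is confined to one block of at most `step` entries, which is where the "≤ 2 memory regions" behaviour comes from.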
#### `iter() -> Iter<'_>` — sequential scan, O(n)
Merge-scan: reads primary bytes in order; on sentinel 255, advances a sequential pointer into the sorted data section rather than doing a binary search. This gives O(n + n_overflow) with no random access into the data section.
`Iter` implements `ExactSizeIterator`. `&PersistentCompactIntVec` implements `IntoIterator`.
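A minimal sketch of the merge-scan, assuming well-formed input (exactly one overflow entry per sentinel byte, in slot order). `MergeScan` and `merge_scan` are hypothetical names used only for this illustration.

```rust
/// Merge-scan over primary bytes and sorted overflow entries.
struct MergeScan<'a> {
    primary: std::slice::Iter<'a, u8>,
    overflow: std::slice::Iter<'a, (u32, u32)>,
}

impl<'a> Iterator for MergeScan<'a> {
    type Item = u32;

    fn next(&mut self) -> Option<u32> {
        let byte = *self.primary.next()?;
        if byte < 255 {
            Some(u32::from(byte))
        } else {
            // Sentinels occur in slot order, so the next overflow entry
            // is the one for this slot: no search needed.
            self.overflow.next().map(|&(_, value)| value)
        }
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        // One item per remaining primary byte (given well-formed input).
        self.primary.size_hint()
    }
}

impl<'a> ExactSizeIterator for MergeScan<'a> {}

fn merge_scan<'a>(primary: &'a [u8], overflow: &'a [(u32, u32)]) -> MergeScan<'a> {
    MergeScan { primary: primary.iter(), overflow: overflow.iter() }
}
```

Because the overflow pointer only ever moves forward, both sections are read strictly sequentially.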
#### Aggregate
```rust
fn sum(&self) -> u64 // Σ self[i] as u64, via iter()
```
#### Distance methods
All take `&other` of equal length, iterate both with `zip(self.iter(), other.iter())`, and return `f64`.
| Method | Formula |
|---|---|
| `bray_dist` | `1 − 2·Σmin(aᵢ,bᵢ) / (Σaᵢ + Σbᵢ)` |
| `relfreq_bray_dist` | Bray-Curtis on relative frequencies: `1 − Σmin(pᵢ,qᵢ)` where `pᵢ = aᵢ/Σa` |
| `euclidean_dist` | `√Σ(aᵢ − bᵢ)²` |
| `relfreq_euclidean_dist` | Euclidean on relative frequencies |
| `hellinger_euclidean_dist` | `√Σ(√pᵢ − √qᵢ)²`, i.e. Euclidean on sqrt(relfreq) |
| `hellinger_dist` | `hellinger_euclidean_dist / √2` — standard Hellinger distance ∈ [0, 1] |
| `threshold_jaccard_dist(&other, threshold: u32)` | `1 − |A∩B| / |A∪B|` where presence iff count ≥ threshold |
| `jaccard_dist` | `threshold_jaccard_dist(&other, 1)` |
Edge cases (both vectors all-zero, or union empty for Jaccard): distance = 0.0.
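Two of the formulas, written as a sketch over plain `u32` slices rather than the reader type. The function names mirror the table, but these free functions are illustrative only, and the zero-denominator branches follow the edge-case rule stated above.

```rust
/// Bray-Curtis: 1 - 2·Σmin(a, b) / (Σa + Σb); 0.0 when both are all-zero.
fn bray_dist(a: &[u32], b: &[u32]) -> f64 {
    assert_eq!(a.len(), b.len(), "vectors must have equal length");
    let (mut sum_min, mut sum_all) = (0u64, 0u64);
    for (&x, &y) in a.iter().zip(b) {
        sum_min += u64::from(x.min(y));
        sum_all += u64::from(x) + u64::from(y);
    }
    if sum_all == 0 {
        0.0
    } else {
        1.0 - 2.0 * sum_min as f64 / sum_all as f64
    }
}

/// Jaccard on presence/absence, where present means count >= threshold;
/// 0.0 when the union is empty.
fn threshold_jaccard_dist(a: &[u32], b: &[u32], threshold: u32) -> f64 {
    assert_eq!(a.len(), b.len(), "vectors must have equal length");
    let (mut inter, mut either) = (0u64, 0u64);
    for (&x, &y) in a.iter().zip(b) {
        let (pa, pb) = (x >= threshold, y >= threshold);
        if pa && pb {
            inter += 1;
        }
        if pa || pb {
            either += 1;
        }
    }
    if either == 0 {
        0.0
    } else {
        1.0 - inter as f64 / either as f64
    }
}
```

For `a = [2, 0, 4]` and `b = [1, 3, 4]`, Σmin = 5 and Σa + Σb = 14, so the Bray-Curtis distance is 1 − 10/14 = 2/7; with threshold 1 the intersection has 2 slots and the union 3, giving a Jaccard distance of 1/3.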
---
## Step computation
Chosen at `close()` once `n_overflow` is known:
```
L1_entries = 32 768 / 8 = 4096
step = 0 if n_overflow ≤ 4096
step = ⌈n_overflow / 4096⌉ otherwise
```
For the Betula nana reference (359 044 overflows): step = 88, n_index = 4 080 entries = 31.9 KB.
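The step rule translates directly into integer arithmetic. `choose_step` and `L1_ENTRIES` are hypothetical names for this sketch; the 32 768-byte budget and 8-byte entry size are taken from the block above.

```rust
/// Number of 8-byte index entries that fit in a 32 KiB budget.
const L1_ENTRIES: usize = 32_768 / 8; // 4096

/// 0 means "no index": the whole data section already fits the budget.
fn choose_step(n_overflow: usize) -> usize {
    if n_overflow <= L1_ENTRIES {
        0
    } else {
        // Ceiling division without floating point.
        (n_overflow + L1_ENTRIES - 1) / L1_ENTRIES
    }
}
```

`choose_step(359_044)` returns 88, matching the figure quoted for the reference dataset.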
---
## Complexity
| Operation | Time | Notes |
|---|---|---|
| `set` / `get` (builder) | O(1) | mmap byte + HashMap |
| `get` (reader, no overflow) | O(1) | single mmap byte |
| `get` (reader, with index) | O(log n_index + log step) | ≤ 2 memory regions; n_index ≤ 4096 |
| `get` (reader, no index) | O(log n_overflow) | n_overflow ≤ 4096, so data is at most 32 KB |
| `iter()` full scan | O(n + n_overflow) | merge-scan, no binary search |
| `sum`, distances | O(n) | via `iter()` / `zip(iter(), iter())` |
| `min` / `max` / `add` / `diff` | O(n) | via `other.iter()` + builder `set` |
| `close` | O(n_overflow log n_overflow) | sort + sequential write |
| `open` | O(n_index) | index copy into Vec |
| `build_from` | O(file_size) + O(n_overflow) | OS copy + HashMap load |