Eric Coissac f2de79acde Add persistent compact integer vector and cache-line-optimized MPHF
Introduce the `obicompactvec` crate, featuring a two-tier, memory-mapped integer vector that uses a primary `u8` array with a sentinel for overflow dispatch and a sparse L1-resident index for fast random access. Implement builder and reader modules with zero-copy serialization and comprehensive test coverage. Update `obilayeredmap` to replace the default hash function with a cache-line-optimized `Mphf`, adding explicit bounds checking and duplicate-slot detection. Add documentation for both modules and update project configuration files accordingly.
2026-05-13 10:09:46 +08:00

# PersistentCompactIntVec
## Purpose
`PersistentCompactIntVec` stores a dense array of non-negative integers indexed by MPHF slot, where the vast majority of values are small (0–254) and large values are rare. It is designed for mmap-compatible random access with a minimal memory footprint and good cache behaviour.
Motivation from observed count distributions in genomics data: 99.9% of k-mer counts fit in a u8; overflow (count ≥ 255) affects ~0.07% of distinct k-mers but can reach values above 10⁶ (chloroplast, ribosomal repeats).
---
## Design
Two-tier structure:
1. **Primary array**: `[u8; n]`, mmap'd as a flat file. Values 0–254 are stored directly. Value **255 is a sentinel** meaning "look in overflow".
2. **Overflow structure**: sorted list of `(slot: u32, value: u32)` pairs for all slots where the true value ≥ 255, with a **sparse L1-fitting index** for fast lookup.
```
primary[slot] < 255 → return primary[slot]
primary[slot] == 255 → binary search in overflow
```
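As a rough illustration of the footprint argument, the sketch below compares the two-tier layout with a flat `[u32; n]` array. The function names are hypothetical, and the estimate ignores the overflow header and sparse index (at most about 32 KB).

```rust
/// Approximate size of the two-tier layout: one byte per slot in the
/// primary array plus 8 bytes (slot: u32, value: u32) per overflow entry.
/// Hypothetical helper; header and sparse index are not counted.
fn two_tier_bytes(n_slots: u64, n_overflow: u64) -> u64 {
    n_slots + 8 * n_overflow
}

/// Size of a flat `[u32; n]` array able to hold the same values.
fn flat_u32_bytes(n_slots: u64) -> u64 {
    4 * n_slots
}
```

With one million slots and 0.07% of them overflowing (700 entries), the two-tier estimate is 1 005 600 bytes against 4 000 000 bytes for the flat array.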
---
## Lifecycle
The structure has two distinct runtime roles with different APIs.
### Builder (`PersistentCompactIntVecBuilder`)
Used during layer construction. Holds the primary array and overflow map in memory; supports arbitrary reads and writes before finalisation.
```rust
use std::collections::HashMap;

struct PersistentCompactIntVecBuilder {
    primary: Vec<u8>,            // in memory; written to disk at close()
    overflow: HashMap<u64, u32>, // O(1) get/set for values ≥ 255
}
```
**Phase 1 — `new(n: usize)`**
Allocates `primary` of length `n` initialised to 0. `overflow` is empty.
**Phase 2 — fill (repeated `set` / `get`)**
```rust
impl PersistentCompactIntVecBuilder {
    fn set(&mut self, slot: u64, value: u32) {
        let i = slot as usize; // Vec indexing requires usize
        if value < 255 {
            self.primary[i] = value as u8;
            self.overflow.remove(&slot); // in case of downward mutation
        } else {
            self.primary[i] = 255; // sentinel
            self.overflow.insert(slot, value);
        }
    }

    fn get(&self, slot: u64) -> u32 {
        match self.primary[slot as usize] {
            // invariant: every sentinel byte has a matching overflow entry
            255 => *self.overflow.get(&slot).unwrap(),
            v => v as u32,
        }
    }
}
```
Reads and mutations are both O(1). Overflow entries can be created, updated, or removed freely during this phase.
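To make the fill phase concrete, here is a minimal self-contained sketch of the builder that exercises the 254 → 255 boundary and a downward mutation. The shortened type name `Builder`, the `SENTINEL` constant, and the `increment` helper are assumptions for illustration and are not taken from the crate.

```rust
use std::collections::HashMap;

const SENTINEL: u8 = 255;

/// Shortened stand-in for the builder described above.
struct Builder {
    primary: Vec<u8>,
    overflow: HashMap<u64, u32>,
}

impl Builder {
    fn new(n: usize) -> Self {
        Builder { primary: vec![0; n], overflow: HashMap::new() }
    }

    fn set(&mut self, slot: u64, value: u32) {
        let i = slot as usize;
        if value < SENTINEL as u32 {
            self.primary[i] = value as u8;
            self.overflow.remove(&slot); // drop a stale overflow entry, if any
        } else {
            self.primary[i] = SENTINEL;
            self.overflow.insert(slot, value);
        }
    }

    fn get(&self, slot: u64) -> u32 {
        match self.primary[slot as usize] {
            // Panics if the sentinel has no overflow entry (broken invariant).
            SENTINEL => self.overflow[&slot],
            v => v as u32,
        }
    }

    /// Hypothetical helper (not in the source): add 1 to a slot's count.
    fn increment(&mut self, slot: u64) {
        let next = self.get(slot).saturating_add(1);
        self.set(slot, next);
    }
}
```

Incrementing a slot 255 times moves it from the primary array into the overflow map on the last step; setting it back to a small value removes the overflow entry again.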
**Phase 3 — `close(primary_path, overflow_path)`**
1. Write `primary` as raw bytes to `counts_primary.bin`.
2. Collect `overflow` into `Vec<(u32, u32)>`, sort by slot.
3. Compute `step` from `n_overflow` (see below).
4. Build sparse index.
5. Write `counts_overflow.bin`.
6. Drop all in-memory state.
The `HashMap` is the only extra allocation: bounded by `n_overflow × (8 + 4 + overhead)` bytes, typically a few MB in practice.
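Step 2 of `close` converts `u64` map keys into the `u32` slots used on disk. The sketch below shows one way this could be done; the function name and the choice to return `None` when a slot exceeds `u32::MAX` are assumptions, since the source does not say how that case is handled.

```rust
use std::collections::HashMap;
use std::convert::TryFrom;

/// Collect the builder's overflow map into a slot-sorted vector.
/// Returns None if a slot does not fit the on-disk u32 slot type
/// (assumed policy; the source does not specify it).
fn sorted_overflow(overflow: &HashMap<u64, u32>) -> Option<Vec<(u32, u32)>> {
    let mut entries = Vec::with_capacity(overflow.len());
    for (&slot, &value) in overflow {
        entries.push((u32::try_from(slot).ok()?, value));
    }
    // Keys are unique, so an unstable sort by slot is sufficient.
    entries.sort_unstable_by_key(|&(slot, _)| slot);
    Some(entries)
}
```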
---
### Reader (`PersistentCompactIntVec`)
Used at query time. Both files are mmap'd; the sparse index is loaded into a `Vec` at open time (≤ 32 KB, L1-resident).
```rust
struct PersistentCompactIntVec {
    primary: Mmap,          // mmap of counts_primary.bin
    index: Vec<(u32, u32)>, // sparse index, loaded into RAM at open
    data: Mmap,             // mmap of overflow data region
    n_overflow: u32,
    step: u32,
}
```
**`open(primary_path, overflow_path)`**
Mmaps both files. Parses the overflow file header; copies the sparse index into a `Vec` (tiny, warm in cache). The data region stays mmap'd.
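A minimal sketch of the header parsing done at open time, following the layout in the "Overflow file format" section. It works on a plain byte slice standing in for the mmap, and it assumes little-endian integers, which the source does not state. The names `Overflow`, `read_u32`, and `parse_overflow` are hypothetical.

```rust
/// Parsed view of an overflow file held in a byte slice (stand-in for the mmap).
struct Overflow<'a> {
    n_overflow: u32,
    step: u32,
    index: Vec<(u32, u32)>, // sparse index, copied into RAM
    data: &'a [u8],         // n_overflow * 8 bytes, left in place
}

/// Read a u32 at `offset`, assuming little-endian encoding.
fn read_u32(bytes: &[u8], offset: usize) -> Option<u32> {
    let raw = bytes.get(offset..offset + 4)?;
    Some(u32::from_le_bytes([raw[0], raw[1], raw[2], raw[3]]))
}

fn parse_overflow<'a>(bytes: &'a [u8]) -> Option<Overflow<'a>> {
    if !bytes.starts_with(b"PCIV") {
        return None; // wrong magic
    }
    let n_overflow = read_u32(bytes, 4)?;
    let step = read_u32(bytes, 8)?;
    let mut offset = 12;
    let mut index = Vec::new();
    if step > 0 {
        // The index section is only present when step > 0.
        let n_index = read_u32(bytes, offset)? as usize;
        offset += 4;
        for i in 0..n_index {
            let slot = read_u32(bytes, offset + 8 * i)?;
            let pos = read_u32(bytes, offset + 8 * i + 4)?;
            index.push((slot, pos));
        }
        offset += 8 * n_index;
    }
    let data = bytes.get(offset..offset + 8 * n_overflow as usize)?;
    Some(Overflow { n_overflow, step, index, data })
}
```

Returning `None` on a bad magic or a truncated file is a design choice of this sketch; a real reader would more likely report a typed error.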
**`get(slot: u64) -> u32`** — see Lookup section.
---
## Overflow file format
```
magic:       [u8; 4] = b"PCIV"
n_overflow:  u32
step:        u32      (0 if n_overflow ≤ L1_entries → no sparse index)

[if step > 0]
  n_index:   u32 = ⌈n_overflow / step⌉
  index:     [(slot: u32, pos: u32); n_index]       ← loaded into RAM at open

data:        [(slot: u32, value: u32); n_overflow]  sorted by slot, mmap'd
`index[i]` stores the slot value and data-array position of the `i × step`-th overflow entry.
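The layout above can be sketched as a serialiser that writes into a `Vec<u8>` instead of a file. Little-endian encoding is an assumption (the source does not specify byte order), and `encode_overflow` is a hypothetical name.

```rust
/// Serialise a slot-sorted overflow list using the PCIV layout.
/// `step == 0` means "no sparse index"; integers are written little-endian
/// (assumed, not stated in the source).
fn encode_overflow(data: &[(u32, u32)], step: u32) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(b"PCIV");
    out.extend_from_slice(&(data.len() as u32).to_le_bytes());
    out.extend_from_slice(&step.to_le_bytes());
    if step > 0 {
        // index[i] = (slot, position) of the (i * step)-th overflow entry
        let index: Vec<(u32, u32)> = data
            .iter()
            .enumerate()
            .step_by(step as usize)
            .map(|(pos, &(slot, _))| (slot, pos as u32))
            .collect();
        out.extend_from_slice(&(index.len() as u32).to_le_bytes());
        for (slot, pos) in index {
            out.extend_from_slice(&slot.to_le_bytes());
            out.extend_from_slice(&pos.to_le_bytes());
        }
    }
    for &(slot, value) in data {
        out.extend_from_slice(&slot.to_le_bytes());
        out.extend_from_slice(&value.to_le_bytes());
    }
    out
}
```

For three entries and `step = 2`, the output is 56 bytes: a 12-byte header, a 4-byte `n_index`, two 8-byte index entries, and three 8-byte data entries.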
---
## Step computation
The step is chosen at `close()` time, once `n_overflow` is known:
```
L1_SIZE     = 32 * 1024                // 32 KB conservative target
INDEX_ENTRY = 8                        // bytes: (u32, u32)
L1_entries  = L1_SIZE / INDEX_ENTRY    // = 4096

if n_overflow ≤ L1_entries:
    step = 0    // no sparse index; the data region itself is at most 32 KB
else:
    step = ⌈n_overflow / L1_entries⌉
For the Betula nana reference (359 044 overflows): step = ⌈359 044 / 4096⌉ = 88, index = ⌈359 044 / 88⌉ = 4 081 entries = 31.9 KB.
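The step rule and the resulting index size can be checked with a few lines of integer arithmetic. This is a sketch with hypothetical function names; ceiling division is written out by hand.

```rust
const L1_SIZE: usize = 32 * 1024; // 32 KB target
const INDEX_ENTRY: usize = 8; // bytes per (u32, u32) index entry
const L1_ENTRIES: usize = L1_SIZE / INDEX_ENTRY; // 4096

/// Step chosen at close(): 0 means "no sparse index".
fn compute_step(n_overflow: usize) -> usize {
    if n_overflow <= L1_ENTRIES {
        0
    } else {
        (n_overflow + L1_ENTRIES - 1) / L1_ENTRIES // ceiling division
    }
}

/// Number of sparse index entries: ceil(n_overflow / step), or 0 without index.
fn index_len(n_overflow: usize, step: usize) -> usize {
    if step == 0 {
        0
    } else {
        (n_overflow + step - 1) / step
    }
}
```

For 359 044 overflow entries this gives a step of 88 and 4 081 index entries, i.e. 32 648 bytes, which stays under the 32 KB target.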
---
## Lookup
```
fn get(slot: u64) -> u32:
    if primary[slot] < 255:
        return primary[slot] as u32

    if step == 0:
        return binary_search(data[0..n_overflow], slot)

    // 1. binary search in index (Vec, L1-resident)
    i = upper_bound(index[..].slot, slot) - 1
    pos_start = index[i].pos
    pos_end   = if i+1 < n_index { index[i+1].pos } else { n_overflow }

    // 2. binary search in contiguous block (mmap'd)
    return binary_search(data[pos_start..pos_end], slot)
```
Cache behaviour: step 1 is entirely within the L1-resident `Vec<(u32,u32)>`; step 2 loads a contiguous block of ≤ `step × 8` bytes from the mmap.
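The lookup pseudocode can be sketched in Rust over in-memory slices. This is an illustration, not the crate's reader: the slices stand in for the mmap'd regions, the slot is taken as `u32` to match the on-disk type, and returning `Option` for out-of-range or inconsistent input is an assumption.

```rust
/// Binary search for `slot` in a slot-sorted block of (slot, value) pairs.
fn find_in(block: &[(u32, u32)], slot: u32) -> Option<u32> {
    block
        .binary_search_by_key(&slot, |&(s, _)| s)
        .ok()
        .map(|i| block[i].1)
}

/// Two-tier lookup: primary byte first, then sparse index, then data block.
fn get(
    primary: &[u8],
    index: &[(u32, u32)], // (slot, pos) of every step-th data entry
    data: &[(u32, u32)],  // (slot, value), sorted by slot
    step: u32,
    slot: u32,
) -> Option<u32> {
    let byte = *primary.get(slot as usize)?;
    if byte < 255 {
        return Some(byte as u32);
    }
    if step == 0 {
        return find_in(data, slot); // no sparse index: search all data
    }
    // upper_bound(slot) - 1: last index entry whose slot is <= the query
    let i = index.partition_point(|&(s, _)| s <= slot).checked_sub(1)?;
    let start = index[i].1 as usize;
    let end = index.get(i + 1).map_or(data.len(), |&(_, pos)| pos as usize);
    find_in(data.get(start..end)?, slot)
}
```

`partition_point` plays the role of `upper_bound` here; `checked_sub(1)` guards the case where the queried slot precedes the first index entry, which should not occur when the sentinel invariant holds.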
---
## Files
```
layer_N/
    counts_primary.bin     [u8; n_slots], raw bytes
    counts_overflow.bin    PCIV header + sparse index + sorted data
                           (absent if n_overflow == 0)
```
If `counts_overflow.bin` is absent, no slot has value ≥ 255; all reads go directly to the primary array.
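A reader therefore has to treat the overflow file as optional. One possible way to detect it is sketched below; the function name and the use of `Path::is_file` are assumptions about how a reader might probe the layer directory.

```rust
use std::path::{Path, PathBuf};

/// Return the overflow file path for a layer directory, or None when the
/// file is absent (meaning no slot holds a value >= 255).
/// Hypothetical helper, not taken from the crate.
fn overflow_path(layer_dir: &Path) -> Option<PathBuf> {
    let path = layer_dir.join("counts_overflow.bin");
    if path.is_file() {
        Some(path)
    } else {
        None
    }
}
```

When this returns `None`, every `get` can be answered from the primary array alone.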
---
## Complexity
| Operation | Time | Notes |
|---|---|---|
| `set` / `get` (builder) | O(1) | HashMap for overflow |
| `get` (no overflow) | O(1) | single byte read |
| `get` (overflow, with index) | O(log n_index + log step) | ~2 memory regions: in-RAM index, then one mmap'd block |
| `get` (overflow, no index) | O(log n_overflow) | n_overflow ≤ 4096, so data is at most 32 KB |
| `close` | O(n_overflow log n_overflow) | sort + index build |
| `open` | O(n_index) | index copy into Vec |
---
## Generalisation
The sentinel (255) and primary type (u8) are fixed. The overflow value type is u32, sufficient for any realistic k-mer count. For a count matrix (mode 4), one `PersistentCompactIntVec` per genome column shares the primary array layout.