Add persistent compact integer vector and cache-line-optimized MPHF
Introduce the `obicompactvec` crate, featuring a two-tier, memory-mapped integer vector that uses a primary `u8` array with a sentinel for overflow dispatch and a sparse L1-resident index for fast random access. Implement builder and reader modules with zero-copy serialization and comprehensive test coverage. Update `obilayeredmap` to replace the default hash function with a cache-line-optimized `Mphf`, adding explicit bounds checking and duplicate-slot detection. Add documentation for both modules and update project configuration files accordingly.
@@ -0,0 +1,186 @@
# PersistentCompactIntVec

## Purpose

`PersistentCompactIntVec` stores a dense array of non-negative integers, indexed by MPHF slot, in which the vast majority of values are small (0–254) and large values are rare. It is designed for mmap-compatible random access with a minimal memory footprint and cache-friendly reads.

The design is motivated by count distributions observed in genomics data: about 99.9% of k-mer counts fit in a `u8`; overflow (count ≥ 255) affects ~0.07% of distinct k-mers, but those counts can exceed 10⁶ (chloroplast, ribosomal repeats).

---

## Design

Two-tier structure:

1. **Primary array**: `[u8; n]`, mmap'd as a flat file. Values 0–254 are stored directly. Value **255 is a sentinel** meaning "look in overflow".
2. **Overflow structure**: sorted list of `(slot: u32, value: u32)` pairs for all slots whose true value is ≥ 255, with a **sparse index small enough to fit in L1 cache** for fast lookup.

```
primary[slot] <  255  → return primary[slot]
primary[slot] == 255  → binary search in overflow
```

---

## Lifecycle

The structure has two distinct runtime roles with different APIs.

### Builder (`PersistentCompactIntVecBuilder`)

Used during layer construction. Holds the primary array and overflow map in memory; supports arbitrary reads and writes before finalisation.

```rust
use std::collections::HashMap;

struct PersistentCompactIntVecBuilder {
    primary: Vec<u8>,             // in memory; written to disk at close()
    overflow: HashMap<u64, u32>,  // O(1) get/set for values ≥ 255
}
```

**Phase 1: `new(n: usize)`**
Allocates `primary` of length `n` initialised to 0. `overflow` is empty.

**Phase 2: fill (repeated `set` / `get`)**

```rust
impl PersistentCompactIntVecBuilder {
    fn set(&mut self, slot: u64, value: u32) {
        let i = slot as usize;  // Vec indexing requires usize
        if value < 255 {
            self.primary[i] = value as u8;
            self.overflow.remove(&slot);  // in case of downward mutation
        } else {
            self.primary[i] = 255;  // sentinel
            self.overflow.insert(slot, value);
        }
    }

    fn get(&self, slot: u64) -> u32 {
        match self.primary[slot as usize] {
            // `set` keeps the invariant: every sentinel has an overflow entry
            255 => *self.overflow.get(&slot).expect("sentinel without overflow entry"),
            v => v as u32,
        }
    }
}
```

Reads and mutations are both O(1) (expected, since overflow access goes through the `HashMap`). Overflow entries can be created, updated, or removed freely during this phase.
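
To illustrate the fill phase, here is a minimal self-contained sketch that walks one slot through a small value, an overflow value, and back down again. The type name `Builder` and the helpers `n_overflow` and `demo_mutations` are hypothetical, introduced only for this example; they are not claimed to be the crate's API.

```rust
use std::collections::HashMap;

/// Illustrative stand-in for the builder described above.
pub struct Builder {
    primary: Vec<u8>,
    overflow: HashMap<u64, u32>,
}

impl Builder {
    pub fn new(n: usize) -> Self {
        Builder { primary: vec![0; n], overflow: HashMap::new() }
    }

    pub fn set(&mut self, slot: u64, value: u32) {
        let i = slot as usize;
        if value < 255 {
            self.primary[i] = value as u8;
            self.overflow.remove(&slot); // drop any stale overflow entry
        } else {
            self.primary[i] = 255; // sentinel
            self.overflow.insert(slot, value);
        }
    }

    pub fn get(&self, slot: u64) -> u32 {
        match self.primary[slot as usize] {
            255 => self.overflow[&slot],
            v => u32::from(v),
        }
    }

    /// Number of slots currently stored in the overflow map (hypothetical helper).
    pub fn n_overflow(&self) -> usize {
        self.overflow.len()
    }
}

/// Returns the values read after each mutation, plus the final overflow count.
pub fn demo_mutations() -> (u32, u32, u32, usize) {
    let mut b = Builder::new(4);
    b.set(2, 7);
    let small = b.get(2);
    b.set(2, 1_000_000); // upward mutation: slot moves to overflow
    let big = b.get(2);
    b.set(2, 254); // downward mutation: overflow entry is removed
    let back = b.get(2);
    (small, big, back, b.n_overflow())
}
```

Note that 255 itself is an overflow value: the primary byte cannot distinguish "count is 255" from "look in overflow", so the threshold is `value < 255`, not `value <= 255`.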

**Phase 3: `close(primary_path, overflow_path)`**

1. Write `primary` as raw bytes to `primary_path` (`counts_primary.bin`).
2. Collect `overflow` into `Vec<(u32, u32)>` and sort by slot. Slots are narrowed from `u64` to `u32` here, so every overflow slot must be below 2³².
3. Compute `step` from `n_overflow` (see below).
4. Build sparse index.
5. Write the overflow file to `overflow_path` (`counts_overflow.bin`).
6. Drop all in-memory state.
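
Steps 2 and 4 above can be sketched as two small functions. This is an illustration under stated assumptions, not the crate's code: the function names `sorted_overflow` and `build_sparse_index` are hypothetical, and `step` is taken as an input (0 meaning "no sparse index").

```rust
use std::collections::HashMap;

/// Step 2: flatten the overflow map into `(slot, value)` pairs sorted by slot.
/// Panics if a slot does not fit in the `u32` used by the on-disk format.
pub fn sorted_overflow(overflow: &HashMap<u64, u32>) -> Vec<(u32, u32)> {
    let mut data: Vec<(u32, u32)> = overflow
        .iter()
        .map(|(&slot, &value)| {
            assert!(slot <= u64::from(u32::MAX), "slot does not fit in u32");
            (slot as u32, value)
        })
        .collect();
    data.sort_unstable_by_key(|&(slot, _)| slot);
    data
}

/// Step 4: keep every `step`-th entry as `(slot, position in data)`.
/// Returns an empty index when `step == 0`.
pub fn build_sparse_index(data: &[(u32, u32)], step: u32) -> Vec<(u32, u32)> {
    if step == 0 {
        return Vec::new();
    }
    data.iter()
        .enumerate()
        .step_by(step as usize)
        .map(|(pos, &(slot, _))| (slot, pos as u32))
        .collect()
}
```

Sorting by slot is sufficient because slots are unique keys of the map, so an unstable sort cannot reorder equal keys.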

The `HashMap` is the only extra allocation: bounded by `n_overflow × (8 + 4 + overhead)` bytes, typically a few MB in practice.

---

### Reader (`PersistentCompactIntVec`)

Used at query time. Both files are mmap'd; the sparse index is loaded into a `Vec` at open time (≤ 32 KB, L1-resident).

```rust
struct PersistentCompactIntVec {
    primary: Mmap,           // mmap of counts_primary.bin
    index: Vec<(u32, u32)>,  // sparse index, loaded into RAM at open
    data: Mmap,              // mmap of overflow data region
    n_overflow: u32,
    step: u32,
}
```

**`open(primary_path, overflow_path)`**
Mmaps both files. Parses the overflow file header; copies the sparse index into a `Vec` (tiny, warm in cache). The data region stays mmap'd.
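
As a rough sketch of the header parsing done at open time, the function below reads the `PCIV` header and sparse index from a byte slice (which could be the contents of the mmap). Several things here are assumptions not stated in the document: the little-endian byte order, the `OverflowHeader` struct, and the names `read_u32` and `parse_overflow`.

```rust
/// Parsed header of the overflow file (hypothetical type for this example).
pub struct OverflowHeader {
    pub n_overflow: u32,
    pub step: u32,
    pub index: Vec<(u32, u32)>,
    /// Byte offset at which the sorted data region starts.
    pub data_offset: usize,
}

/// Reads a little-endian u32 at byte offset `at`, or None if out of bounds.
fn read_u32(bytes: &[u8], at: usize) -> Option<u32> {
    let chunk = bytes.get(at..at + 4)?;
    Some(u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]))
}

/// Returns None on a bad magic number or a truncated file.
pub fn parse_overflow(bytes: &[u8]) -> Option<OverflowHeader> {
    if bytes.get(..4)? != &b"PCIV"[..] {
        return None;
    }
    let n_overflow = read_u32(bytes, 4)?;
    let step = read_u32(bytes, 8)?;
    let mut offset = 12;
    let mut index = Vec::new();
    if step > 0 {
        // The index section is present only when step > 0.
        let n_index = read_u32(bytes, offset)? as usize;
        offset += 4;
        for _ in 0..n_index {
            let slot = read_u32(bytes, offset)?;
            let pos = read_u32(bytes, offset + 4)?;
            index.push((slot, pos));
            offset += 8;
        }
    }
    // The data region must hold n_overflow entries of 8 bytes each.
    let data_len = (n_overflow as usize).checked_mul(8)?;
    if bytes.len() < offset.checked_add(data_len)? {
        return None;
    }
    Some(OverflowHeader { n_overflow, step, index, data_offset: offset })
}
```

Only the index is copied; the data region is left in place and addressed through `data_offset`, which matches the idea of keeping the bulk of the file mmap'd.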

**`get(slot: u64) -> u32`**: see the Lookup section.

---

## Overflow file format

```
magic:       [u8; 4]  = b"PCIV"
n_overflow:  u32
step:        u32      (0 if n_overflow ≤ L1_entries → no sparse index)
[if step > 0]
  n_index:   u32      = ⌈n_overflow / step⌉
  index:     [(slot: u32, pos: u32); n_index]       ← loaded into RAM at open
data:        [(slot: u32, value: u32); n_overflow]  sorted by slot, mmap'd
```

`index[i]` stores the slot value and data-array position of the `i × step`-th overflow entry.
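
A minimal sketch of a serializer for this layout is shown below. The document does not specify a byte order, so little-endian is an assumption here, and `push_pair` and `encode_overflow` are hypothetical names. The caller is expected to pass data already sorted by slot and an index that is consistent with `step`.

```rust
/// Appends one `(u32, u32)` pair in little-endian byte order (assumed).
fn push_pair(out: &mut Vec<u8>, a: u32, b: u32) {
    out.extend_from_slice(&a.to_le_bytes());
    out.extend_from_slice(&b.to_le_bytes());
}

/// Encodes header, optional sparse index, and sorted data into one buffer.
pub fn encode_overflow(data: &[(u32, u32)], step: u32, index: &[(u32, u32)]) -> Vec<u8> {
    let mut out = Vec::with_capacity(16 + 8 * (index.len() + data.len()));
    out.extend_from_slice(b"PCIV");
    out.extend_from_slice(&(data.len() as u32).to_le_bytes()); // n_overflow
    out.extend_from_slice(&step.to_le_bytes());
    if step > 0 {
        // n_index and the index entries exist only when step > 0.
        out.extend_from_slice(&(index.len() as u32).to_le_bytes());
        for &(slot, pos) in index {
            push_pair(&mut out, slot, pos);
        }
    }
    for &(slot, value) in data {
        push_pair(&mut out, slot, value);
    }
    out
}
```

With this layout the file size is `12 + 8 × n_overflow` bytes without an index, and `16 + 8 × (n_index + n_overflow)` bytes with one.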

---

## Step computation

The step is chosen at `close()` time, once `n_overflow` is known:

```
L1_SIZE     = 32 * 1024                     // 32 KB conservative target
INDEX_ENTRY = 8                             // bytes: (u32, u32)
L1_entries  = L1_SIZE / INDEX_ENTRY = 4096

if n_overflow ≤ L1_entries:
    step = 0     // no sparse index; the data itself is at most 32 KB
else:
    step = ⌈n_overflow / L1_entries⌉
```

For the *Betula nana* reference (359 044 overflows): step = ⌈359 044 / 4 096⌉ = 88, index = ⌈359 044 / 88⌉ = 4 081 entries ≈ 31.9 KB.
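
The rule above translates directly into integer arithmetic. The sketch below mirrors the pseudocode; the constant and function names (`L1_ENTRIES`, `compute_step`, `index_len`) are chosen for this example, and `usize` is used for convenience even though the file format stores these fields as `u32`.

```rust
pub const L1_SIZE: usize = 32 * 1024; // 32 KB conservative target
pub const INDEX_ENTRY: usize = 8; // bytes per (u32, u32) entry
pub const L1_ENTRIES: usize = L1_SIZE / INDEX_ENTRY; // 4096

/// Returns 0 when no sparse index is needed, else ceil(n_overflow / L1_ENTRIES).
pub fn compute_step(n_overflow: usize) -> usize {
    if n_overflow <= L1_ENTRIES {
        0
    } else {
        (n_overflow + L1_ENTRIES - 1) / L1_ENTRIES
    }
}

/// Number of sparse index entries: ceil(n_overflow / step), or 0 without an index.
pub fn index_len(n_overflow: usize, step: usize) -> usize {
    if step == 0 {
        0
    } else {
        (n_overflow + step - 1) / step
    }
}
```

Because `step` is rounded up, the resulting index never exceeds `L1_ENTRIES` entries, so it stays within the 32 KB target.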

---

## Lookup

```
fn get(slot: u64) -> u32:
    if primary[slot] < 255:
        return primary[slot] as u32

    if step == 0:
        return binary_search(data[0..n_overflow], slot)

    // 1. binary search in index (Vec, L1-resident)
    i         = upper_bound(index[..].slot, slot) - 1
    pos_start = index[i].pos
    pos_end   = if i+1 < n_index { index[i+1].pos } else { n_overflow }

    // 2. binary search in contiguous block (mmap'd)
    return binary_search(data[pos_start..pos_end], slot)
```

Cache behaviour: stage 1 runs entirely within the L1-resident `Vec<(u32, u32)>`; stage 2 loads a contiguous block of at most `step × 8` bytes from the mmap.
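
The two-stage lookup can be sketched over plain slices, which stand in for the mmap'd regions. This is an illustration, not the crate's implementation: the function names and the `Option` return type (used to surface a malformed file instead of panicking) are assumptions of this example.

```rust
/// Finds `slot` in the sorted overflow data, using the sparse index if present.
pub fn find_overflow(
    index: &[(u32, u32)],
    data: &[(u32, u32)],
    step: u32,
    slot: u32,
) -> Option<u32> {
    let block = if step == 0 {
        data // no sparse index: search the whole data region
    } else {
        // Stage 1: upper bound in the index = count of entries with slot <= target.
        let ub = index.partition_point(|&(s, _)| s <= slot);
        if ub == 0 {
            return None; // target is below the first overflow slot
        }
        let i = ub - 1;
        let start = index[i].1 as usize;
        let end = index.get(i + 1).map_or(data.len(), |&(_, pos)| pos as usize);
        &data[start..end]
    };
    // Stage 2: binary search inside one contiguous block.
    block
        .binary_search_by_key(&slot, |&(s, _)| s)
        .ok()
        .map(|k| block[k].1)
}

/// Full read path: primary byte first, overflow only on the sentinel.
pub fn get(
    primary: &[u8],
    index: &[(u32, u32)],
    data: &[(u32, u32)],
    step: u32,
    slot: u64,
) -> Option<u32> {
    match primary[slot as usize] {
        // Narrowing is safe for well-formed files: overflow slots are stored as u32.
        255 => find_overflow(index, data, step, slot as u32),
        v => Some(u32::from(v)),
    }
}
```

For a well-formed file every sentinel has a matching overflow entry, so the `None` branches are reachable only when the two files are inconsistent.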

---

## Files

```
layer_N/
  counts_primary.bin     [u8; n_slots], raw bytes
  counts_overflow.bin    PCIV header + sparse index + sorted data
                         (absent if n_overflow == 0)
```

If `counts_overflow.bin` is absent, no slot has value ≥ 255; all reads go directly to the primary array.

---

## Complexity

| Operation | Time | Notes |
|---|---|---|
| `set` / `get` (builder) | O(1) expected | HashMap for overflow |
| `get` (no overflow) | O(1) | single byte read |
| `get` (overflow, with index) | O(log n_index + log step) | ~2 memory regions; n_index ≤ 4 096 |
| `get` (overflow, no index) | O(log n_overflow) | data is at most 32 KB (≤ 4 096 entries) |
| `close` | O(n_overflow log n_overflow) | sort + index build |
| `open` | O(n_index) | index copy into Vec |

---

## Generalisation

The sentinel (255) and primary type (u8) are fixed. The overflow value type is u32, sufficient for any realistic k-mer count. For a count matrix (mode 4), one `PersistentCompactIntVec` per genome column shares the primary array layout.