Add persistent compact integer vector and cache-line-optimized MPHF
Introduce the `obicompactvec` crate, featuring a two-tier, memory-mapped integer vector that uses a primary `u8` array with a sentinel for overflow dispatch and a sparse L1-resident index for fast random access. Implement builder and reader modules with zero-copy serialization and comprehensive test coverage. Update `obilayeredmap` to replace the default hash function with a cache-line-optimized `Mphf`, adding explicit bounds checking and duplicate-slot detection. Add documentation for both modules and update project configuration files accordingly.
This commit is contained in:
@@ -6,6 +6,59 @@
|
||||
|
||||
---
|
||||
|
||||
## Four usage modes
|
||||
|
||||
The MPHF + evidence infrastructure is fixed for all modes. The **payload** — data associated with each slot — is orthogonal and varies by mode.
|
||||
|
||||
| Mode | Description | Payload |
|
||||
|---|---|---|
|
||||
| 1. Set | membership test only | none |
|
||||
| 2. Set with count | occurrences per kmer per sample | compact integer vector |
|
||||
| 3. Presence/absence matrix | which genomes contain each kmer | bit matrix n_kmers × n_genomes |
|
||||
| 4. Count matrix | occurrences per kmer per genome | integer matrix n_kmers × n_genomes |
|
||||
|
||||
Mode 4 is architecturally identical to mode 3 but with counts instead of bits; the main open question is scale — a naive dense representation reaches hundreds of GB for large genome collections, which may require a sparse encoding.
|
||||
|
||||
### Payload for mode 2: compact count vector
|
||||
|
||||
Counts follow a roughly geometric distribution: the large majority are below 128, almost all below 16 000, with rare large values in highly covered regions. Options for a mmap-compatible random-access count vector:
|
||||
|
||||
- **`Vec<u16>`** — 2 bytes/slot, covers 0–65 535, O(1) access, trivially mmap-able. Practical upper bound sufficient for most metagenomic use cases.
|
||||
- **Bit-packed fixed width** — choose B = ⌈log₂(max\_count)⌉ globally (e.g. B=14 for 99.9% coverage at 1.75 bytes/slot). O(1) access via bit-shift arithmetic.
|
||||
- **Block-varint (PForDelta, StreamVByte)** — good compression but random access requires a separate offset index; no mature Rust crate for mmap use.
|
||||
|
||||
Decision not yet made. `Vec<u16>` is the default baseline pending profiling.
|
||||
|
||||
### Payload for mode 3: presence/absence matrix
|
||||
|
||||
Column-major bit matrix: column j (genome j) is a contiguous `n_slots`-bit array. This layout makes per-genome operations (union, intersection, diff) cache-friendly. For 10⁹ slots × 100 genomes ≈ 12.5 GB — large but mmap-able with no resident-memory cost at open time.
|
||||
|
||||
---
|
||||
|
||||
## Payload architecture
|
||||
|
||||
The payload is orthogonal to the MPHF + evidence layer. `Layer` will be parameterised by a payload type `D`:
|
||||
|
||||
```rust
|
||||
struct Layer<D = ()> {
|
||||
mphf: Mphf,
|
||||
evidence: Evidence,
|
||||
unitigs: UnitigFileReader,
|
||||
data: D,
|
||||
}
|
||||
```
|
||||
|
||||
Where `D` implements a `LayerData` trait covering open/close/get. Concrete instances:
|
||||
|
||||
- `()` — mode 1
|
||||
- `CountVec` — mode 2 (u16 or bit-packed)
|
||||
- `BitMatrix` — mode 3
|
||||
- `CountMatrix` — mode 4
|
||||
|
||||
This parameterisation is not yet implemented; the current code uses a fixed `Counts` field.
|
||||
|
||||
---
|
||||
|
||||
## Three-level hierarchy
|
||||
|
||||
```
|
||||
@@ -15,14 +68,14 @@ index_root/ ← LayeredMap (collection)
|
||||
layer_0/ ← Layer
|
||||
mphf.bin
|
||||
unitigs.bin
|
||||
unitigs.bin.idx
|
||||
evidence.bin
|
||||
counts.bin
|
||||
presence.bin
|
||||
counts.bin [mode 2 only]
|
||||
presence.bin [mode 3/4 only]
|
||||
layer_1/
|
||||
...
|
||||
part_00001/
|
||||
layer_0/
|
||||
layer_1/
|
||||
...
|
||||
```
|
||||
|
||||
@@ -38,19 +91,15 @@ index_root/ ← LayeredMap (collection)
|
||||
|
||||
```
|
||||
layer_N/
|
||||
mphf.bin — ptr_hash MPHF (exact key count, phase-2 construction)
|
||||
unitigs.bin — packed 2-bit nucleotide sequences (concatenated, variable-length)
|
||||
unitig_offsets.bin — u32 per unitig: nucleotide offset of unitig j in unitigs.bin
|
||||
mphf.bin — ptr_hash MPHF (epserde, ptr_hash native format)
|
||||
unitigs.bin — packed 2-bit nucleotide sequences (obiskio binary format)
|
||||
unitigs.bin.idx — UIDX index: n_unitigs, n_kmers, seqls[], packed_offsets[]
|
||||
evidence.bin — u32 per MPHF slot: (unitig_id: 25 | rank: 7)
|
||||
counts.bin — u32 per MPHF slot (total kmer occurrences)
|
||||
presence.bin — bit matrix: n_slots × n_samples [optional]
|
||||
counts.bin — [optional] u16 per MPHF slot (mode 2)
|
||||
presence.bin — [optional] bit matrix n_slots × n_samples (mode 3/4)
|
||||
```
|
||||
|
||||
Unitigs have variable lengths. Each record in `unitigs.bin` is self-delimiting: it begins with a varint `seql` (sequence length in nucleotides) followed by `(seql+3)/4` packed bytes — the streaming format defined in `obiskio`. Sequential scan is always possible using the in-record `seql`.
|
||||
|
||||
For O(1) random access, `unitig_offsets.bin` is a **precomputed derived index**: a u32 array of byte offsets into `unitigs.bin`, with n_unitigs + 1 entries (sentinel = total byte size). Built once at construction by a single sequential scan; reconstructible from `unitigs.bin` if lost. Access: `unitigs.bin[offsets[j] .. offsets[j+1]]`.
|
||||
|
||||
All files except `mphf.bin` are flat arrays of fixed-size elements, serialised with **rkyv** for zero-copy mmap access. `mphf.bin` uses ptr_hash's native serialisation; rkyv integration is deferred (see open questions).
|
||||
`unitigs.bin` is the packed-2-bit sequence file produced by `obiskio::UnitigFileWriter`. The companion `.idx` file stores: magic `UIDX`, `n_unitigs: u32`, `n_kmers: u64`, `seqls: [u8; n_unitigs]` (kmer count − 1 per chunk), and `packed_offsets: [u32; n_unitigs + 1]` (byte offsets into `unitigs.bin`, sentinel-terminated). This gives O(1) random access to any unitig and the total kmer count without scanning the sequence file.
|
||||
|
||||
### Evidence encoding
|
||||
|
||||
@@ -68,9 +117,29 @@ kmer = unitigs[unitig_id][rank .. rank + k] // 2-bit packed slice
|
||||
|
||||
`rank` is the kmer's 0-based index within the unitig (kmer units, not nucleotides). For k=31, m=11, the structural maximum is k − m + 1 = 21 kmers per unitig; the empirical maximum observed is ~46 kmers. A `u7` (0–127) is sufficient.
|
||||
|
||||
### Presence/absence matrix
|
||||
---
|
||||
|
||||
Column-major bit matrix: column j (sample j) is a contiguous `n_slots`-bit array. This layout makes per-sample operations (union, intersection, diff over a column) cache-friendly. For large matrices (e.g. 10 M slots × 1 000 samples ≈ 1.25 GB per partition), rkyv + mmap avoids loading the full matrix at open time.
|
||||
## ptr_hash configuration
|
||||
|
||||
The MPHF per layer is configured as:
|
||||
|
||||
```rust
|
||||
type Mphf = PtrHash<
|
||||
u64, // key type: canonical kmer raw encoding
|
||||
CubicEps, // bucket fn: balanced (2.4 bits/key, λ=3.5)
|
||||
CachelineEfVec<Vec<CachelineEf>>, // remap: 11.6 bits/entry vs 32 for Vec<u32>
|
||||
Xx64, // hasher: XXH3-64 with seed, handles structured keys
|
||||
Vec<u8>, // pilots
|
||||
>;
|
||||
```
|
||||
|
||||
**Hasher choice — `Xx64`:** k-mer raw values are left-aligned u64 with structural zeros in low bits (42 zeros for k=11, 2 zeros for k=31). `FxHash` (single multiply) distributes these poorly. `Xx64` (XXH3 64-bit, seeded) handles structured input correctly.
|
||||
|
||||
**Bucket function — `CubicEps` with `PtrHashParams::<CubicEps>::default()`:** λ=3.5, α=0.99. Balanced tradeoff: 2× slower construction than `Linear/λ=3.0` (the `default_fast` preset), 20% less space. `default_compact` (λ=4.0) saves a further 12.5% at 2× more construction time and reduced reliability — not chosen.
|
||||
|
||||
**Remap — `CachelineEfVec`:** Elias-Fano variant packing 44 sorted 40-bit values per 64-byte cacheline (11.6 bits/value vs 32 for `Vec<u32>`). Already a transitive dependency of `ptr_hash`. One cacheline per query vs one u32 read; space win dominates for billion-scale key sets.
|
||||
|
||||
**Construction:** `new_from_par_iter` avoids materialising all keys as `Vec<u64>`. MPHF correctness is verified inline during the second pass (evidence/counts fill) using an n/8-byte bitset; any duplicate slot or out-of-bounds index returns `OLMError::Mphf`.
|
||||
|
||||
---
|
||||
|
||||
@@ -79,14 +148,14 @@ Column-major bit matrix: column j (sample j) is a contiguous `n_slots`-bit array
|
||||
A kmer query routes through all three levels:
|
||||
|
||||
1. **Partition routing**: hash canonical minimiser of the query kmer → partition index → open `part_XXXXX/`.
|
||||
2. **Layer probing**: iterate layers in order within the partition; for each layer compute `slot = mphf.query(kmer)`, then verify `evidence.decode(slot) == kmer`. First match wins.
|
||||
3. **Data access**: read `counts[slot]` and/or `presence[slot]` from the matching layer.
|
||||
2. **Layer probing**: iterate layers in order within the partition; for each layer compute `slot = mphf.index(kmer)`, then verify `evidence.decode(slot) == kmer`. First match wins.
|
||||
3. **Data access**: read payload at `slot` from the matching layer (counts, presence row, etc.).
|
||||
|
||||
```
|
||||
fn query(kmer) → Option<Hit>:
|
||||
part = partition_of(kmer)
|
||||
for (i, layer) in part.layers.iter().enumerate():
|
||||
slot = layer.mphf.query(kmer)
|
||||
slot = layer.mphf.index(&kmer.raw())
|
||||
if layer.evidence.decode(slot) == kmer:
|
||||
return Some(Hit { layer: i, slot })
|
||||
return None
|
||||
@@ -102,7 +171,7 @@ When adding dataset B to an existing index:
|
||||
|
||||
1. For each partition, iterate kmers of B routed to that partition.
|
||||
2. Probe existing layers; collect kmers absent from all layers → `B \ index`.
|
||||
3. Build a new layer from `B \ index` using the two-phase pipeline (FMPHGO provisional → ptr_hash definitive).
|
||||
3. Build a new layer from `B \ index`.
|
||||
4. Append the new layer directory under each `part_XXXXX/`.
|
||||
5. Update `meta.json` (layer count, sample registry).
|
||||
|
||||
@@ -110,59 +179,25 @@ Each partition's new layer is built independently; the operation is fully parall
|
||||
|
||||
---
|
||||
|
||||
## Core API (sketch)
|
||||
|
||||
```rust
|
||||
// Open an existing index
|
||||
let map = LayeredMap::open(path)?;
|
||||
|
||||
// Query a canonical kmer across all partitions and layers
|
||||
match map.query(kmer) {
|
||||
Some(hit) => {
|
||||
let count = hit.count();
|
||||
let present = hit.presence_row(); // bit slice over samples
|
||||
}
|
||||
None => { /* absent */ }
|
||||
}
|
||||
|
||||
// Non-destructive extension with a new dataset
|
||||
// unitigs produced by the two-phase pipeline, one per partition
|
||||
let layer_idx = map.add_layer(unitigs_dir, counts_dir, presence_path)?;
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Dependencies
|
||||
|
||||
| crate | role |
|
||||
|---|---|
|
||||
| `ptr_hash` | phase-2 MPHF per layer |
|
||||
| `ph` (FMPHGO) | phase-1 provisional MPHF during layer construction |
|
||||
| `rkyv` | zero-copy serialisation of flat arrays (evidence, counts, presence) |
|
||||
| `ptr_hash 1.1` | MPHF per layer (epserde serialisation) |
|
||||
| `cacheline-ef 1.1` | compact remap storage inside ptr_hash |
|
||||
| `epserde 0.8` | zero-copy serialisation of MPHF |
|
||||
| `memmap2` | mmap of layer files |
|
||||
| `bitm` | bit-packed presence matrix |
|
||||
| `obiskio` | unitig file writer/reader |
|
||||
|
||||
---
|
||||
|
||||
## Serialisation strategy
|
||||
|
||||
All flat arrays use `rkyv::Archive`:
|
||||
|
||||
```rust
|
||||
#[derive(Archive, Serialize, Deserialize)]
|
||||
struct Evidence { slots: Vec<u32> } // packed (unitig_id: 25 | rank: 7)
|
||||
|
||||
#[derive(Archive, Serialize, Deserialize)]
|
||||
struct Counts { data: Vec<u32> }
|
||||
```
|
||||
|
||||
At open time, each file is mmapped and cast to its archived type — no allocation, no copy. The MPHF is loaded via ptr_hash's own API; a rkyv wrapper is a future refinement.
|
||||
Flat arrays (evidence, counts, presence) use a simple custom binary format (raw bytes, fixed element size) opened via mmap — no additional serialisation crate required.
|
||||
|
||||
---
|
||||
|
||||
## Open questions
|
||||
|
||||
- **ptr_hash + rkyv**: ptr_hash's internals are flat bit arrays; a rkyv-compatible wrapper is structurally feasible. Assess upstream willingness or implement a thin newtype wrapper.
|
||||
- **Presence matrix layout**: column-major favours per-sample operations; row-major favours per-kmer queries. Decide based on dominant access pattern.
|
||||
- **Mode 4 scale**: count matrix (n_kmers × n_genomes × bytes_per_count) reaches hundreds of GB for large collections. A sparse representation (store only non-zero entries) may be required; the access pattern and density threshold are not yet defined.
|
||||
- **Count vector encoding (mode 2)**: `Vec<u16>` vs bit-packed (B=14). Decision pending; depends on whether counts > 65 535 occur in practice for the target datasets.
|
||||
- **Presence matrix layout**: column-major favours per-genome operations; row-major favours per-kmer queries. Decide based on dominant access pattern.
|
||||
- **Layer merge**: merging two `LayeredMap` instances into a single-layer index requires full rebuild. Define API and cost model; maintenance operation, not query-path.
|
||||
- **Canonical kmer orientation**: evidence stores canonical kmer; strand recovery requires one 64-bit revcomp comparison at query time.
|
||||
- **`try_new_from_par_iter`**: `ptr_hash::new_from_par_iter` silently discards construction failure. Post-construction verification (current workaround) is correct but does not allow retry. A `try_new_from_par_iter` PR upstream would close this gap.
|
||||
|
||||
@@ -0,0 +1,186 @@
|
||||
# PersistentCompactIntVec
|
||||
|
||||
## Purpose
|
||||
|
||||
`PersistentCompactIntVec` stores a dense array of non-negative integers indexed by MPHF slot where the vast majority of values are small (0–254) and large values are rare. It is designed for mmap-compatible random access with minimal memory footprint and optimal cache behaviour.
|
||||
|
||||
Motivation from observed count distributions in genomics data: 99.9% of k-mer counts fit in a u8; overflow (count ≥ 255) affects ~0.07% of distinct k-mers but can reach values above 10⁶ (chloroplast, ribosomal repeats).
|
||||
|
||||
---
|
||||
|
||||
## Design
|
||||
|
||||
Two-tier structure:
|
||||
|
||||
1. **Primary array** — `[u8; n]`, mmap'd as a flat file. Values 0–254 are stored directly. Value **255 is a sentinel** meaning "look in overflow".
|
||||
2. **Overflow structure** — sorted list of `(slot: u32, value: u32)` pairs for all slots where the true value ≥ 255, with a **sparse L1-fitting index** for fast lookup.
|
||||
|
||||
```
|
||||
primary[slot] < 255 → return primary[slot]
|
||||
primary[slot] == 255 → binary search in overflow
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Lifecycle
|
||||
|
||||
The structure has two distinct runtime roles with different APIs.
|
||||
|
||||
### Builder (`PersistentCompactIntVecBuilder`)
|
||||
|
||||
Used during layer construction. Holds the primary array and overflow map in memory; supports arbitrary reads and writes before finalisation.
|
||||
|
||||
```rust
|
||||
struct PersistentCompactIntVecBuilder {
|
||||
primary: Vec<u8>, // in memory; written to disk at close()
|
||||
overflow: HashMap<u64, u32>, // O(1) get/set for values ≥ 255
|
||||
}
|
||||
```
|
||||
|
||||
**Phase 1 — `new(n: usize)`**
|
||||
Allocates `primary` of length `n` initialised to 0. `overflow` is empty.
|
||||
|
||||
**Phase 2 — fill (repeated `set` / `get`)**
|
||||
|
||||
```rust
|
||||
fn set(&mut self, slot: u64, value: u32) {
|
||||
if value < 255 {
|
||||
self.primary[slot] = value as u8;
|
||||
self.overflow.remove(&slot); // in case of downward mutation
|
||||
} else {
|
||||
self.primary[slot] = 255; // sentinel
|
||||
self.overflow.insert(slot, value);
|
||||
}
|
||||
}
|
||||
|
||||
fn get(&self, slot: u64) -> u32 {
|
||||
match self.primary[slot] {
|
||||
255 => *self.overflow.get(&slot).unwrap(),
|
||||
v => v as u32,
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Reads and mutations are both O(1). Overflow entries can be created, updated, or removed freely during this phase.
|
||||
|
||||
**Phase 3 — `close(primary_path, overflow_path)`**
|
||||
|
||||
1. Write `primary` as raw bytes to `counts_primary.bin`.
|
||||
2. Collect `overflow` into `Vec<(u32, u32)>`, sort by slot.
|
||||
3. Compute `step` from `n_overflow` (see below).
|
||||
4. Build sparse index.
|
||||
5. Write `counts_overflow.bin`.
|
||||
6. Drop all in-memory state.
|
||||
|
||||
The `HashMap` is the only extra allocation: bounded by `n_overflow × (8 + 4 + overhead)` bytes, typically a few MB in practice.
|
||||
|
||||
---
|
||||
|
||||
### Reader (`PersistentCompactIntVec`)
|
||||
|
||||
Used at query time. Both files are mmap'd; the sparse index is loaded into a `Vec` at open time (≤ 32 KB, L1-resident).
|
||||
|
||||
```rust
|
||||
struct PersistentCompactIntVec {
|
||||
primary: Mmap, // mmap of counts_primary.bin
|
||||
index: Vec<(u32, u32)>, // sparse index, loaded into RAM at open
|
||||
data: Mmap, // mmap of overflow data region
|
||||
n_overflow: u32,
|
||||
step: u32,
|
||||
}
|
||||
```
|
||||
|
||||
**`open(primary_path, overflow_path)`**
|
||||
Mmaps both files. Parses the overflow file header; copies the sparse index into a `Vec` (tiny, warm in cache). The data region stays mmap'd.
|
||||
|
||||
**`get(slot: u64) -> u32`** — see Lookup section.
|
||||
|
||||
---
|
||||
|
||||
## Overflow file format
|
||||
|
||||
```
|
||||
magic: [u8; 4] = b"PCIV"
|
||||
n_overflow: u32
|
||||
step: u32 (0 if n_overflow ≤ L1_entries → no sparse index)
|
||||
[if step > 0]
|
||||
n_index: u32 = ⌈n_overflow / step⌉
|
||||
index: [(slot: u32, pos: u32); n_index] ← loaded into RAM at open
|
||||
data: [(slot: u32, value: u32); n_overflow] sorted by slot, mmap'd
|
||||
```
|
||||
|
||||
`index[i]` stores the slot value and data-array position of the `i × step`-th overflow entry.
|
||||
|
||||
---
|
||||
|
||||
## Step computation
|
||||
|
||||
The step is chosen at `close()` time, once `n_overflow` is known:
|
||||
|
||||
```
|
||||
L1_SIZE = 32 * 1024 // 32 KB conservative target
|
||||
INDEX_ENTRY = 8 // bytes: (u32, u32)
|
||||
L1_entries = L1_SIZE / INDEX_ENTRY = 4096
|
||||
|
||||
if n_overflow ≤ L1_entries:
|
||||
step = 0 // no sparse index; data itself fits in a few cache lines
|
||||
else:
|
||||
step = ⌈n_overflow / L1_entries⌉
|
||||
```
|
||||
|
||||
For the Betula nana reference (359 044 overflows): step = 88, index = 4 080 entries = 31.9 KB.
|
||||
|
||||
---
|
||||
|
||||
## Lookup
|
||||
|
||||
```
|
||||
fn get(slot: u64) -> u32:
|
||||
if primary[slot] < 255:
|
||||
return primary[slot] as u32
|
||||
|
||||
if step == 0:
|
||||
return binary_search(data[0..n_overflow], slot)
|
||||
|
||||
// 1. binary search in index (Vec, L1-resident)
|
||||
i = upper_bound(index[..].slot, slot) - 1
|
||||
pos_start = index[i].pos
|
||||
pos_end = if i+1 < n_index { index[i+1].pos } else { n_overflow }
|
||||
|
||||
// 2. binary search in contiguous block (mmap'd)
|
||||
return binary_search(data[pos_start..pos_end], slot)
|
||||
```
|
||||
|
||||
Cache behaviour: step 1 is entirely within the L1-resident `Vec<(u32,u32)>`; step 2 loads a contiguous block of ≤ `step × 8` bytes from the mmap.
|
||||
|
||||
---
|
||||
|
||||
## Files
|
||||
|
||||
```
|
||||
layer_N/
|
||||
counts_primary.bin — [u8; n_slots], raw bytes
|
||||
counts_overflow.bin — PCIV header + sparse index + sorted data
|
||||
(absent if n_overflow == 0)
|
||||
```
|
||||
|
||||
If `counts_overflow.bin` is absent, no slot has value ≥ 255; all reads go directly to the primary array.
|
||||
|
||||
---
|
||||
|
||||
## Complexity
|
||||
|
||||
| Operation | Time | Notes |
|
||||
|---|---|---|
|
||||
| `set` / `get` (builder) | O(1) | HashMap for overflow |
|
||||
| `get` (no overflow) | O(1) | single byte read |
|
||||
| `get` (overflow, with index) | O(log step) | ~2 memory regions |
|
||||
| `get` (overflow, no index) | O(log n_overflow) | data fits in a few cache lines |
|
||||
| `close` | O(n_overflow log n_overflow) | sort + index build |
|
||||
| `open` | O(n_index) | index copy into Vec |
|
||||
|
||||
---
|
||||
|
||||
## Generalisation
|
||||
|
||||
The sentinel (255) and primary type (u8) are fixed. The overflow value type is u32, sufficient for any realistic k-mer count. For a count matrix (mode 4), one `PersistentCompactIntVec` per genome column shares the primary array layout.
|
||||
Reference in New Issue
Block a user