feat: implement persistent layered index and chunked binary format
Introduce the `obilayeredmap` specification and persistent MPHF-based index architecture for incremental multi-dataset indexing. Implement chunked binary serialization with a fixed `u8` k-mer count limit (256) and overlapping super-kmer segments. Add memory-mapped I/O and a companion `.idx` index file for allocation-free, O(1) unitig access. Update MkDocs navigation, enhance the k-mer comparison script, and add comprehensive tests for serialization, partitioning, and file I/O pipelines.
@@ -49,7 +49,7 @@ Build a new MPHF over the filtered kmer set only, with the exact key count avail
**Phase 1** (provisional, discarded after spectrum computation): FMPHGO. Tolerates overestimated capacity, compact, no need to optimise for query speed on a temporary structure.

**Phase 2** (persistent, queried repeatedly): open between FMPHGO and ptr_hash. Exact key count is available, so both operate optimally. ptr_hash's query speed advantage (2.1–3.3×) is meaningful for the persistent index but carries the risk of a very young crate. FMPHGO is the conservative default; ptr_hash is worth revisiting once it has broader production use.
**Phase 2** (persistent, queried repeatedly): **ptr_hash**. Exact key count is available at phase 2, so ptr_hash operates optimally. Its query speed (≥2.1× over FMPHGO) and construction speed (≥3.1×) are meaningful for the persistent index; the space overhead at 2.4 bits/key is acceptable. The crate's youth (Feb 2025) was previously a concern; it is now accepted given the performance profile and the fact that each layer MPHF is independently rebuildable from its unitig file if needed.

boomphf is effectively eliminated: its space overhead is the largest and its streaming-construction advantage does not apply here.
@@ -73,9 +73,64 @@ All three are in-memory structures. Their internal representation is flat bit ar
No established Rust crate provides a natively on-disk MPHF. **SSHash** (Sparse and Skew Hash) is a complete kmer dictionary designed for disk access and is order-preserving (overlapping kmers receive consecutive indices → cache-friendly count access), but it is C++-only and covers more than just the MPHF layer.

---

## Multilayer index architecture

### Motivation

An index built from a single dataset A can be extended with a new dataset B without rebuilding. This supports incremental construction (adding species, samples, or sequencing runs) and enables set operations across heterogeneous sources.

### Layer structure

Each layer is a self-contained unit:

```
layer_i/
  unitigs.bin     # packed 2-bit nucleotide sequences
  mphf.bin        # ptr_hash index (phase-2, exact key count)
  evidence.bin    # [(unitig_id, rank)] per MPHF slot (see unitig_evidence.md)
  counts.bin      # [u32] per MPHF slot
```

Layers are **disjoint**: a canonical kmer belongs to exactly one layer. Layer 0 is built from dataset A. Adding dataset B proceeds as follows:

1. For each kmer in B: query layer 0; if found, accumulate count into `counts_0[MPHF_0(kmer)]`.
2. Collect all kmers of B not present in any existing layer → set `B \ A`.
3. Build layer 1 from `B \ A` using the standard two-phase pipeline (spectrum, filter, ptr_hash).

Adding a third dataset C repeats the process: probe layer 0, then layer 1, then build layer 2 from `C \ A \ B`.
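The three steps above can be sketched in a few lines of Rust. This is only an illustration under loud assumptions: a `HashMap` stands in for the whole layer (MPHF, evidence, and counts), the `Layer` and `add_dataset` names are hypothetical, the input is assumed to be a slice of canonical kmers already encoded as `u64`, and the spectrum filtering of the real pipeline is omitted.

```rust
use std::collections::HashMap;

/// Toy layer: a HashMap stands in for unitigs.bin + mphf.bin + evidence.bin + counts.bin.
struct Layer {
    counts: HashMap<u64, u32>,
}

/// Add one dataset, given as a slice of canonical kmer occurrences (hypothetical input shape).
fn add_dataset(layers: &mut Vec<Layer>, kmers: &[u64]) {
    let mut novel: HashMap<u64, u32> = HashMap::new();
    for &kmer in kmers {
        // Step 1: probe existing layers in order and accumulate into the first hit.
        match layers.iter_mut().find_map(|l| l.counts.get_mut(&kmer)) {
            Some(count) => *count = count.saturating_add(1),
            // Step 2: kmers absent from every existing layer form the new set.
            None => *novel.entry(kmer).or_insert(0) += 1,
        }
    }
    // Step 3: the new set becomes the next layer (spectrum filtering omitted here).
    if !novel.is_empty() {
        layers.push(Layer { counts: novel });
    }
}
```

Calling `add_dataset` on an empty chain produces layer 0; each later call updates counts in place for known kmers and appends at most one new layer, so the layers stay disjoint by construction.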
### Membership verification

ptr_hash maps any input to a valid slot; it does not natively detect absent keys. Membership is verified using the evidence entry: decode the kmer from `(unitig_id, rank)` and compare to the query. A mismatch means the kmer is absent from this layer; probe the next layer.

This makes the evidence layer load-bearing for correctness, not only for locality.
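A minimal sketch of the verification step, to make the decode-and-compare concrete. Everything about the layout here is an assumption rather than the format defined in `unitig_evidence.md`: bases are encoded A=0, C=1, G=2, T=3, packed four per byte with the first base in the high bits, `k ≤ 32` so a kmer fits in a `u64`, and both sides are canonicalised (minimum of the kmer and its reverse complement) before comparison. All function names are hypothetical.

```rust
/// Read base `i` from a 2-bit packed unitig (assumed layout: first base in the high bits).
fn base_at(packed: &[u8], i: usize) -> u64 {
    let shift = 6 - 2 * (i % 4);
    ((packed[i / 4] >> shift) & 0b11) as u64
}

/// Decode the kmer of length `k` starting at position `rank` of a packed unitig.
fn decode_kmer(packed: &[u8], rank: usize, k: usize) -> u64 {
    (rank..rank + k).fold(0u64, |acc, i| (acc << 2) | base_at(packed, i))
}

/// Reverse complement under the A=0, C=1, G=2, T=3 encoding (complement is 3 - base).
fn revcomp(kmer: u64, k: usize) -> u64 {
    let mut x = !kmer; // bitwise NOT complements every 2-bit base
    let mut rc = 0u64;
    for _ in 0..k {
        rc = (rc << 2) | (x & 0b11); // last base of the input becomes the first of the output
        x >>= 2;
    }
    rc
}

/// Canonical form: the smaller of a kmer and its reverse complement.
fn canonical(kmer: u64, k: usize) -> u64 {
    kmer.min(revcomp(kmer, k))
}

/// True if the evidence entry of a slot decodes to the queried kmer.
fn verify(unitigs: &[Vec<u8>], evidence: (u32, u32), query: u64, k: usize) -> bool {
    let (unitig_id, rank) = evidence;
    let stored = decode_kmer(&unitigs[unitig_id as usize], rank as usize, k);
    canonical(stored, k) == canonical(query, k)
}
```

A `false` result from `verify` is the signal to move on to the next layer; the MPHF slot on its own carries no membership information.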
### Query algorithm

```
fn query(kmer) → Option<count>:
    for layer in layers:
        slot = layer.mphf.query(kmer)
        if layer.evidence.decode(slot) == kmer:
            return Some(layer.counts[slot])
    return None
```

Expected probe depth: 1 for kmers present in layer 0, increasing for rare kmers added in later layers. In practice, the dominant dataset (largest A) should be layer 0 to minimise average probe depth.
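The pseudocode translates almost directly into Rust. The sketch below is a toy, not the real index: `kmer % slots` stands in for the ptr_hash lookup (like a real MPHF, it returns a valid slot for any input), the `stored` vector stands in for `evidence.decode(slot)`, and the `Layer` fields are hypothetical names.

```rust
/// Toy layer. `stored[slot]` plays the role of evidence.decode(slot).
struct Layer {
    stored: Vec<u64>,
    counts: Vec<u32>,
}

impl Layer {
    /// Stand-in for the MPHF: always returns a valid slot, even for absent kmers.
    fn mphf(&self, kmer: u64) -> usize {
        (kmer % self.stored.len() as u64) as usize
    }
}

/// Probe the layers in order; the first layer whose evidence matches wins.
fn query(layers: &[Layer], kmer: u64) -> Option<u32> {
    layers.iter().find_map(|layer| {
        let slot = layer.mphf(kmer);
        // Verification step: a mismatch means "absent from this layer".
        (layer.stored[slot] == kmer).then(|| layer.counts[slot])
    })
}
```

With layer 0 holding kmers `[3, 7, 11]` and layer 1 holding `[4, 9]`, a query for `7` resolves after one probe, `9` after two, and `10` falls through both layers and yields `None`.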
### Layer count and probe cost

Each probe is a ptr_hash lookup (~10 ns) plus one evidence decode (two array accesses). For L layers, the worst case is a kmer absent from every layer: L failed probes before `None` is returned. In practice L is small (2–5 for typical multi-species databases). No global data structure is needed to route queries; the layer chain is traversed in order.
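The effect of layer order on average cost can be estimated with a small calculation. The sketch below assumes the fraction of queries resolved by each layer is known (a hypothetical input; the source gives no such measurements): a query resolved at layer i costs i + 1 probes, and a query that misses every layer costs L probes.

```rust
/// Expected number of probes per query.
/// `hit_fraction[i]` is the (assumed known) fraction of queries resolved by layer i;
/// the remainder, 1 - sum, is the fraction of queries absent from every layer.
fn expected_probes(hit_fraction: &[f64]) -> f64 {
    let layers = hit_fraction.len() as f64;
    let hits: f64 = hit_fraction.iter().sum();
    let resolved: f64 = hit_fraction
        .iter()
        .enumerate()
        .map(|(i, p)| (i as f64 + 1.0) * p) // layer i is reached after i + 1 probes
        .sum();
    resolved + layers * (1.0 - hits) // absent kmers pay for every layer
}
```

For two layers resolving 80% and 15% of queries (5% absent), the expectation is about 1.2 probes; swapping the layers raises it to about 1.85, which illustrates why the dominant dataset belongs in layer 0.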
### Merging layers

Two layer chains can be merged by re-indexing their union through the standard pipeline. This is expensive (full rebuild) but produces an optimal single-layer index. Merge is a maintenance operation, not a query-path requirement.

## Open questions

- Confirm actual partition sizes and overestimation factor on representative metagenomic datasets.
- Revisit ptr_hash for phase 2 once the crate has broader production track record.
- Assess rkyv integration cost for FMPHGO if true zero-copy mmap becomes necessary for the persistent index.
- **rkyv integration**: all flat arrays in a layer (evidence, counts, presence/absence matrix) map trivially to `rkyv::Archive`: fixed-size element types, no heap indirection. The presence/absence matrix is the strongest case: at 10 M kmers × 1 000 samples and one bit per cell, it takes ≈ 1.25 GB per partition, so zero-copy mmap via rkyv avoids loading the entire matrix at open time, letting the OS page cache serve only accessed pages. ptr_hash itself is internally a flat bit array and is structurally compatible with rkyv, but requires either native crate support or a wrapper. Assess the wrapper cost and whether the ptr_hash maintainers are willing to adopt rkyv upstream.
- Keep SSHash in mind if the indexing architecture is reconsidered at a higher level.
- Determine optimal layer ordering heuristic (by kmer count? by query frequency?) for multi-species databases.
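As a quick check of the presence/absence matrix figure quoted in the rkyv item, the size can be computed directly. The sketch assumes one bit per (kmer, sample) cell with each kmer row padded to a whole byte; the actual on-disk layout is not specified here, and `matrix_bytes` is a hypothetical helper.

```rust
/// Size in bytes of a bit-packed presence/absence matrix,
/// assuming one bit per cell and rows padded to a whole number of bytes.
fn matrix_bytes(kmers: u64, samples: u64) -> u64 {
    let row_bytes = (samples + 7) / 8; // round each row up to a full byte
    kmers * row_bytes
}
```

`matrix_bytes(10_000_000, 1_000)` gives 1 250 000 000 bytes, matching the ≈ 1.25 GB per partition estimate.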