feat: implement persistent layered index and chunked binary format

Introduce the `obilayeredmap` specification and persistent MPHF-based index architecture for incremental multi-dataset indexing. Implement chunked binary serialization with a fixed `u8` k-mer count limit (255) and overlapping super-kmer segments. Add memory-mapped I/O and a companion `.idx` index file for allocation-free, O(1) unitig access. Update MkDocs navigation, enhance the k-mer comparison script, and add comprehensive tests for serialization, partitioning, and file I/O pipelines.
2026-05-09 17:20:08 +08:00
parent 8c17bf958b
commit 5169f65dc9
24 changed files with 1342 additions and 382 deletions
@@ -49,7 +49,7 @@ Build a new MPHF over the filtered kmer set only, with the exact key count avail

 **Phase 1** (provisional, discarded after spectrum computation): FMPHGO. Tolerates overestimated capacity, compact, no need to optimise for query speed on a temporary structure.

-**Phase 2** (persistent, queried repeatedly): open between FMPHGO and ptr_hash. Exact key count is available, so both operate optimally. ptr_hash's query speed advantage (2.1–3.3×) is meaningful for the persistent index but carries the risk of a very young crate. FMPHGO is the conservative default; ptr_hash is worth revisiting once it has broader production use.
+**Phase 2** (persistent, queried repeatedly): **ptr_hash**. Exact key count is available at phase 2, so ptr_hash operates optimally. Its query speed (≥2.1× over FMPHGO) and construction speed (≥3.1×) are meaningful for the persistent index; the space overhead at 2.4 bits/key is acceptable. The crate's youth (Feb 2025) was previously a concern; it is now accepted given the performance profile and the fact that each layer MPHF is independently rebuildable from its unitig file if needed.

 boomphf is effectively eliminated: its space overhead is the largest and its streaming-construction advantage does not apply here.

@@ -73,9 +73,64 @@ All three are in-memory structures. Their internal representation is flat bit ar

 No established Rust crate provides a natively on-disk MPHF. **SSHash** (Sparse and Skew Hash) is a complete kmer dictionary designed for disk access and is order-preserving (overlapping kmers receive consecutive indices → cache-friendly count access), but it is C++-only and covers more than just the MPHF layer.

+---
+
+## Multilayer index architecture
+
+### Motivation
+
+An index built from a single dataset A can be extended with a new dataset B without rebuilding. This supports incremental construction (adding species, samples, or sequencing runs) and enables set operations across heterogeneous sources.
+
+### Layer structure
+
+Each layer is a self-contained unit:
+
+```
+layer_i/
+  unitigs.bin     — packed 2-bit nucleotide sequences
+  mphf.bin        — ptr_hash index (phase-2, exact key count)
+  evidence.bin    — [(unitig_id, rank)] per MPHF slot  (see unitig_evidence.md)
+  counts.bin      — [u32] per MPHF slot
+```
+
+Layers are **disjoint**: a canonical kmer belongs to exactly one layer. Layer 0 is built from dataset A. Adding dataset B proceeds as follows:
+
+1. For each kmer in B: query layer 0 — if found, accumulate count into `counts_0[MPHF_0(kmer)]`.
+2. Collect all kmers of B not present in any existing layer → set `B \ A`.
+3. Build layer 1 from `B \ A` using the standard two-phase pipeline (spectrum, filter, ptr_hash).
+
+Adding a third dataset C repeats the process: probe layer 0, then layer 1, then build layer 2 from `C \ A \ B`.
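
The three steps above can be sketched as follows. This is a minimal model under stated assumptions: a `HashMap` stands in for each layer's MPHF, evidence and counts, the spectrum and filter stages are omitted, and `Layer` and `add_dataset` are hypothetical names rather than the crate's API.

```rust
use std::collections::HashMap;

/// Toy layer: a HashMap stands in for MPHF + evidence + counts (assumption).
type Layer = HashMap<u64, u32>;

/// Add a dataset (a multiset of canonical kmers) to a chain of disjoint layers.
/// Returns the number of distinct kmers placed in the newly appended layer.
fn add_dataset(layers: &mut Vec<Layer>, dataset: &[u64]) -> usize {
    let mut novel: Layer = HashMap::new();
    for &kmer in dataset {
        // Step 1: probe existing layers in order; on a hit, accumulate the count.
        match layers.iter_mut().find_map(|layer| layer.get_mut(&kmer)) {
            Some(count) => *count += 1,
            // Step 2: absent from every layer, so the kmer goes to the new set.
            None => *novel.entry(kmer).or_insert(0) += 1,
        }
    }
    let n_novel = novel.len();
    // Step 3: the novel set becomes the next layer; earlier layers keep their keys.
    layers.push(novel);
    n_novel
}
```

Because a kmer is only inserted into `novel` after every existing layer has missed, the layers stay disjoint by construction.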
+
+### Membership verification
+
+ptr_hash maps any input to a valid slot — it does not natively detect absent keys. Membership is verified using the evidence entry: decode the kmer from `(unitig_id, rank)` and compare to the query. A mismatch means the kmer is absent from this layer; probe the next layer.
+
+This makes the evidence layer load-bearing for correctness, not only for locality.
+
+### Query algorithm
+
+```
+fn query(kmer) → Option<count>:
+    for layer in layers:
+        slot = layer.mphf.query(kmer)
+        if layer.evidence.decode(slot) == kmer:
+            return Some(layer.counts[slot])
+    return None
+```
+
+Expected probe depth: 1 for kmers present in layer 0, and deeper for rare kmers that were only added in later layers. In practice, the largest dataset should be layer 0 to minimise average probe depth.
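
A runnable toy version of the query loop, to make the role of the evidence check concrete. Everything here is an assumption for illustration: `ToyLayer` is hypothetical, a `HashMap` plays the MPHF for indexed keys, absent keys fall back to `kmer % n_slots` to mimic an MPHF returning an arbitrary valid slot, and the kmer itself is stored as evidence instead of being decoded from a unitig.

```rust
use std::collections::HashMap;

/// Toy stand-in for one layer (hypothetical type, not the crate's API).
struct ToyLayer {
    slot_of: HashMap<u64, usize>, // plays the MPHF for indexed keys
    evidence: Vec<u64>,           // kmer stored at each slot
    counts: Vec<u32>,             // count stored at each slot
}

impl ToyLayer {
    fn build(kmers: &[(u64, u32)]) -> Self {
        let mut slot_of = HashMap::new();
        let mut evidence = Vec::new();
        let mut counts = Vec::new();
        for (slot, &(kmer, count)) in kmers.iter().enumerate() {
            slot_of.insert(kmer, slot);
            evidence.push(kmer);
            counts.push(count);
        }
        ToyLayer { slot_of, evidence, counts }
    }

    /// Like an MPHF, always returns a valid slot, even for absent keys.
    /// Must not be called on an empty layer.
    fn mphf_query(&self, kmer: u64) -> usize {
        match self.slot_of.get(&kmer) {
            Some(&slot) => slot,
            None => (kmer % self.evidence.len() as u64) as usize,
        }
    }
}

/// Probe layers in order; the evidence comparison rejects spurious slots.
/// Returns (layer index, count) for the first layer that really holds the kmer.
fn query(layers: &[ToyLayer], kmer: u64) -> Option<(usize, u32)> {
    for (i, layer) in layers.iter().enumerate() {
        if layer.evidence.is_empty() {
            continue; // nothing to probe in an empty layer
        }
        let slot = layer.mphf_query(kmer);
        if layer.evidence[slot] == kmer {
            return Some((i, layer.counts[slot]));
        }
    }
    None
}
```

Without the `evidence[slot] == kmer` comparison, an absent kmer would silently inherit the count of whichever key owns the slot it lands on.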
+
+### Layer count and probe cost
+
+Each probe is a ptr_hash lookup (~10 ns) plus one evidence decode (two array accesses). For L layers, the worst case is a kmer absent from every layer: L failed probes before `None` is returned. In practice L is small (2–5 for typical multi-species databases). No global data structure is needed to route queries; the layer chain is traversed in order.
+
+### Merging layers
+
+Two layer chains can be merged by re-indexing their union through the standard pipeline. This is expensive (full rebuild) but produces an optimal single-layer index. Merge is a maintenance operation, not a query-path requirement.
+
 ## Open questions

 - Confirm actual partition sizes and overestimation factor on representative metagenomic datasets.
- Revisit ptr_hash for phase 2 once the crate has broader production track record.
- Assess rkyv integration cost for FMPHGO if true zero-copy mmap becomes necessary for the persistent index.
+- **rkyv integration**: all flat arrays in a layer (evidence, counts, presence/absence matrix) map trivially to `rkyv::Archive` — fixed-size element types, no heap indirection. The presence/absence matrix is the strongest case: at 10 M kmers × 1 000 samples ≈ 1.25 GB per partition, zero-copy mmap via rkyv avoids loading the entire matrix at open time, letting the OS page cache serve only accessed pages. ptr_hash itself is internally a flat bit array and is structurally compatible with rkyv, but requires either native crate support or a wrapper. Assess the wrapper cost and whether ptr_hash is willing to adopt rkyv upstream.
 - Keep SSHash in mind if the indexing architecture is reconsidered at a higher level.
+- Determine optimal layer ordering heuristic (by kmer count? by query frequency?) for multi-species databases.
@@ -0,0 +1,168 @@
+# obilayeredmap — layered kmer index crate
+
+## Purpose
+
+`obilayeredmap` implements a persistent, incrementally extensible kmer index. The index is organised in three levels: **collection → partition → layer**. Each layer covers a disjoint kmer set (kmers absent from all earlier layers), wrapping a `ptr_hash` MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.
+
+---
+
+## Three-level hierarchy
+
+```
+index_root/                        ← LayeredMap (collection)
+  meta.json
+  part_00000/                      ← Partition
+    layer_0/                       ← Layer
+      mphf.bin
+      unitigs.bin
+      evidence.bin
+      counts.bin
+      presence.bin
+    layer_1/
+      ...
+  part_00001/
+    layer_0/
+    layer_1/
+    ...
+```
+
+**Collection** (`index_root/`): global metadata — kmer size k, number of partitions, layer count, sample registry.
+
+**Partition** (`part_XXXXX/`): one directory per hash bucket. All kmers whose canonical minimiser hashes to bucket X land in `part_XXXXX`. Partitions are independent and can be processed in parallel. The partition count and routing scheme (minimiser → bucket) are fixed at collection creation and recorded in `meta.json`.
+
+**Layer** (`layer_N/`): within a partition, a layer is the MPHF and its associated data for one dataset addition. Layer 0 is built from the first dataset A; layer 1 covers kmers in B not present in layer 0; and so on. Layers within a partition are disjoint: each kmer belongs to exactly one layer.
+
+---
+
+## Layer file layout
+
+```
+layer_N/
+  mphf.bin            — ptr_hash MPHF (exact key count, phase-2 construction)
+  unitigs.bin         — packed 2-bit nucleotide sequences (concatenated, variable-length)
+  unitig_offsets.bin  — u32 per unitig (+1 sentinel): byte offset of unitig j in unitigs.bin
+  evidence.bin        — u32 per MPHF slot: (unitig_id: 25 | rank: 7)
+  counts.bin          — u32 per MPHF slot (total kmer occurrences)
+  presence.bin        — bit matrix: n_slots × n_samples  [optional]
+```
+
+Unitigs have variable lengths. Each record in `unitigs.bin` is self-delimiting: it begins with a varint `seql` (sequence length in nucleotides) followed by `(seql+3)/4` packed bytes — the streaming format defined in `obiskio`. Sequential scan is always possible using the in-record `seql`.
+
+For O(1) random access, `unitig_offsets.bin` is a **precomputed derived index**: a u32 array of byte offsets into `unitigs.bin`, with n_unitigs + 1 entries (sentinel = total byte size). Built once at construction by a single sequential scan; reconstructible from `unitigs.bin` if lost. Access: `unitigs.bin[offsets[j] .. offsets[j+1]]`.
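
A sketch of how the offsets array could be derived by one sequential scan. The varint flavour is an assumption (unsigned LEB128 is used here; `obiskio` may define a different encoding), and `read_varint`, `build_offsets` and `record` are hypothetical helper names.

```rust
use std::convert::TryFrom;

/// Decode an unsigned LEB128 varint; returns (value, bytes consumed).
/// Assumption: the real `seql` encoding may differ.
fn read_varint(buf: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    for (i, &byte) in buf.iter().enumerate().take(10) {
        value |= u64::from(byte & 0x7f) << (7 * i);
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None // truncated or over-long varint
}

/// Scan self-delimiting records and return n_unitigs + 1 byte offsets;
/// the last entry is the sentinel (total byte size).
fn build_offsets(unitigs: &[u8]) -> Option<Vec<u32>> {
    let mut offsets = Vec::new();
    let mut pos = 0usize;
    while pos < unitigs.len() {
        offsets.push(u32::try_from(pos).ok()?);
        let (seql, header_len) = read_varint(&unitigs[pos..])?;
        // Payload is (seql + 3) / 4 bytes of 2-bit packed nucleotides.
        let payload_len = usize::try_from(seql).ok()?.checked_add(3)? / 4;
        pos = pos.checked_add(header_len)?.checked_add(payload_len)?;
        if pos > unitigs.len() {
            return None; // last record is truncated
        }
    }
    offsets.push(u32::try_from(pos).ok()?);
    Some(offsets)
}

/// O(1) access to record j (varint header included).
fn record<'a>(unitigs: &'a [u8], offsets: &[u32], j: usize) -> &'a [u8] {
    &unitigs[offsets[j] as usize..offsets[j + 1] as usize]
}
```

Note that `u32` byte offsets cap a single `unitigs.bin` at 4 GiB; the sketch returns `None` rather than wrapping if that bound is exceeded.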
+
+The fixed-size arrays (`unitig_offsets.bin`, `evidence.bin`, `counts.bin`, `presence.bin`) are serialised with **rkyv** for zero-copy mmap access. `unitigs.bin` keeps the variable-length `obiskio` record format, and `mphf.bin` uses ptr_hash's native serialisation; rkyv integration for the MPHF is deferred (see open questions).
+
+### Evidence encoding
+
+Evidence maps each MPHF slot to its kmer's location in the unitig file. It serves two roles: membership verification (ptr_hash maps any input to a valid slot; decoding evidence and comparing to the query detects absent keys) and kmer reconstruction.
+
+```
+slot s  →  unitig_id: u25  |  rank: u7
+```
+
+Packed into a `u32` (25 + 7 = 32 bits, no spare bits). Decoding:
+
+```
+kmer = unitigs[unitig_id][rank .. rank + k]   // 2-bit packed slice
+```
+
+`rank` is the kmer's 0-based index within the unitig (kmer units, not nucleotides). For k=31, m=11, a single superkmer holds at most k − m + 1 = 21 kmers; unitigs that span more than one superkmer are longer, with an empirical maximum of ~46 kmers. A `u7` (0–127) is sufficient.
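
The 25 | 7 split can be packed and unpacked with two shifts and a mask. A minimal sketch, assuming `unitig_id` occupies the high 25 bits and `rank` the low 7 bits (the bit order is not fixed by the text above); the function names are hypothetical.

```rust
const RANK_BITS: u32 = 7;
const RANK_MASK: u32 = (1 << RANK_BITS) - 1; // 0x7f
const MAX_UNITIG_ID: u32 = (1 << 25) - 1;

/// Pack (unitig_id, rank) into one u32; None if either field overflows.
fn pack_evidence(unitig_id: u32, rank: u32) -> Option<u32> {
    if unitig_id > MAX_UNITIG_ID || rank > RANK_MASK {
        return None;
    }
    Some((unitig_id << RANK_BITS) | rank)
}

/// Inverse of `pack_evidence`: returns (unitig_id, rank).
fn unpack_evidence(packed: u32) -> (u32, u32) {
    (packed >> RANK_BITS, packed & RANK_MASK)
}
```

With 25 bits, a layer in one partition can address up to 33 554 432 unitigs; the range check makes an overflow explicit at construction time instead of corrupting a neighbouring field.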
+
+### Presence/absence matrix
+
+Column-major bit matrix: column j (sample j) is a contiguous `n_slots`-bit array. This layout makes per-sample operations (union, intersection, diff over a column) cache-friendly. For large matrices (e.g. 10 M slots × 1 000 samples ≈ 1.25 GB per partition), rkyv + mmap avoids loading the full matrix at open time.
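
A sketch of the column-major indexing, to show why per-sample operations reduce to a linear pass over contiguous words. Assumptions: each column is padded to whole `u64` words (the on-disk layout may pack columns more tightly), and `PresenceMatrix` here is a hypothetical in-memory type, not the `bitm`-backed one.

```rust
/// Column-major bit matrix: column j (sample j) holds n_slots bits,
/// padded to a whole number of u64 words (assumption).
struct PresenceMatrix {
    n_slots: usize,
    words_per_col: usize,
    words: Vec<u64>,
}

impl PresenceMatrix {
    fn new(n_slots: usize, n_samples: usize) -> Self {
        let words_per_col = (n_slots + 63) / 64;
        PresenceMatrix {
            n_slots,
            words_per_col,
            words: vec![0; words_per_col * n_samples],
        }
    }

    fn set(&mut self, slot: usize, sample: usize) {
        assert!(slot < self.n_slots);
        self.words[sample * self.words_per_col + slot / 64] |= 1u64 << (slot % 64);
    }

    fn get(&self, slot: usize, sample: usize) -> bool {
        assert!(slot < self.n_slots);
        (self.words[sample * self.words_per_col + slot / 64] >> (slot % 64)) & 1 == 1
    }

    /// Contiguous words of one sample's column.
    fn column(&self, sample: usize) -> &[u64] {
        let start = sample * self.words_per_col;
        &self.words[start..start + self.words_per_col]
    }

    /// Number of slots present in both samples: one pass over two columns.
    fn intersection_size(&self, a: usize, b: usize) -> u32 {
        self.column(a)
            .iter()
            .zip(self.column(b))
            .map(|(x, y)| (x & y).count_ones())
            .sum()
    }
}
```

The size figure follows from the same layout: 10 M slots × 1 000 samples = 10^10 bits, i.e. 1.25 × 10^9 bytes. A per-kmer lookup across all samples, by contrast, touches one word in each of the 1 000 columns.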
+
+---
+
+## Query path
+
+A kmer query routes through all three levels:
+
+1. **Partition routing**: hash canonical minimiser of the query kmer → partition index → open `part_XXXXX/`.
+2. **Layer probing**: iterate layers in order within the partition; for each layer compute `slot = mphf.query(kmer)`, then verify `evidence.decode(slot) == kmer`. First match wins.
+3. **Data access**: read `counts[slot]` and/or `presence[slot]` from the matching layer.
+
+```
+fn query(kmer) → Option<Hit>:
+    part = partition_of(kmer)
+    for (i, layer) in part.layers.iter().enumerate():
+        slot = layer.mphf.query(kmer)
+        if layer.evidence.decode(slot) == kmer:
+            return Some(Hit { layer: i, slot })
+    return None
+```
+
+Expected probe depth: 1 for kmers in layer 0, increasing for later layers. In practice the dominant dataset should be layer 0.
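
Step 1 (partition routing) can be sketched as below. Several choices are assumptions, not the recorded scheme from `meta.json`: kmers are ASCII `ACGT` byte strings, the minimiser is the lexicographically smallest canonical m-mer, and the bucket hash is FNV-1a (chosen only because it is deterministic across builds). All function names are hypothetical.

```rust
/// Reverse complement of an ASCII nucleotide sequence.
fn revcomp(seq: &[u8]) -> Vec<u8> {
    seq.iter()
        .rev()
        .map(|&b| match b {
            b'A' => b'T',
            b'C' => b'G',
            b'G' => b'C',
            b'T' => b'A',
            other => other,
        })
        .collect()
}

/// Smallest canonical m-mer of the kmer (lexicographic order, an assumption).
/// Panics if m == 0 or the kmer is shorter than m.
fn canonical_minimiser(kmer: &[u8], m: usize) -> Vec<u8> {
    kmer.windows(m)
        .map(|w| {
            let rc = revcomp(w);
            if rc.as_slice() < w { rc } else { w.to_vec() }
        })
        .min()
        .expect("kmer must be at least m nucleotides long")
}

/// FNV-1a, 64-bit: a simple hash that is stable across runs and platforms.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        h ^= u64::from(b);
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Partition index of a kmer: hash of its canonical minimiser, modulo bucket count.
fn partition_of(kmer: &[u8], m: usize, n_partitions: u64) -> u64 {
    fnv1a64(&canonical_minimiser(kmer, m)) % n_partitions
}
```

The property that matters for correctness is that a kmer and its reverse complement route to the same partition, which holds here because both yield the same set of canonical m-mers.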
+
+---
+
+## Add-layer algorithm
+
+When adding dataset B to an existing index:
+
+1. For each partition, iterate kmers of B routed to that partition.
+2. Probe existing layers; collect kmers absent from all layers → `B \ index`.
+3. Build a new layer from `B \ index` using the two-phase pipeline (FMPHGO provisional → ptr_hash definitive).
+4. Append the new layer directory under each `part_XXXXX/`.
+5. Update `meta.json` (layer count, sample registry).
+
+Each partition's new layer is built independently; the operation is fully parallel across partitions.
+
+---
+
+## Core API (sketch)
+
+```rust
+// Open an existing index
+let map = LayeredMap::open(path)?;
+
+// Query a canonical kmer across all partitions and layers
+match map.query(kmer) {
+    Some(hit) => {
+        let count   = hit.count();
+        let present = hit.presence_row();  // bit slice over samples
+    }
+    None => { /* absent */ }
+}
+
+// Non-destructive extension with a new dataset
+// unitigs produced by the two-phase pipeline, one per partition
+let layer_idx = map.add_layer(unitigs_dir, counts_dir, presence_path)?;
+```
+
+---
+
+## Dependencies
+
+| crate | role |
+|---|---|
+| `ptr_hash` | phase-2 MPHF per layer |
+| `ph` (FMPHGO) | phase-1 provisional MPHF during layer construction |
+| `rkyv` | zero-copy serialisation of flat arrays (evidence, counts, presence) |
+| `memmap2` | mmap of layer files |
+| `bitm` | bit-packed presence matrix |
+
+---
+
+## Serialisation strategy
+
+All flat arrays use `rkyv::Archive`:
+
+```rust
+#[derive(Archive, Serialize, Deserialize)]
+struct Evidence { slots: Vec<u32> }   // packed (unitig_id: 25 | rank: 7)
+
+#[derive(Archive, Serialize, Deserialize)]
+struct Counts { data: Vec<u32> }
+```
+
+At open time, each file is mmapped and cast to its archived type — no allocation, no copy. The MPHF is loaded via ptr_hash's own API; a rkyv wrapper is a future refinement.
+
+---
+
+## Open questions
+
+- **ptr_hash + rkyv**: ptr_hash's internals are flat bit arrays; a rkyv-compatible wrapper is structurally feasible. Assess upstream willingness or implement a thin newtype wrapper.
+- **Presence matrix layout**: column-major favours per-sample operations; row-major favours per-kmer queries. Decide based on dominant access pattern.
+- **Layer merge**: merging two `LayeredMap` instances into a single-layer index requires full rebuild. Define API and cost model; maintenance operation, not query-path.
+- **Canonical kmer orientation**: evidence stores canonical kmer; strand recovery requires one 64-bit revcomp comparison at query time.
@@ -69,7 +69,9 @@ Consequence for `u8` capacity:
 | nucleotides | 255 nuc | 225 kmers |
 | **kmers** | **255 kmers** | **285 nuc** |

-On *Betula nana* (k=31, 256 partitions), m_u ≈ 37.9 kmers/unitig on average; no unitig length distribution data measured yet. The `rank` field (kmer index within the unitig) fits in a `u8` as long as no unitig exceeds 255 kmers — guaranteed by the split strategy below.
+**Structural maximum from superkmer construction.** For k=31 and m=11, the maximum number of consecutive kmers sharing the same minimiser is k − m + 1 = **21 kmers** (the minimiser traverses from position k−m to 0 as the window slides). A unitig that is a single full superkmer therefore has exactly 21 kmers. This is confirmed by a bimodal distribution in empirical data: a sharp peak at 21 kmers appears in all partitions, including the anomalous partition 145. The observed maximum is ~46 kmers (unitigs spanning more than one superkmer), well within u8 range.
+
+On *Betula nana* (k=31, 256 partitions), m_u ≈ 37.9 kmers/unitig on average. The `rank` field (kmer index within the unitig) fits in a `u8` as long as no unitig exceeds 255 kmers — guaranteed by the split strategy below and amply satisfied by empirical maximums (~46 kmers observed).
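
The arithmetic behind the capacity table and the 21-kmer bound is small enough to check directly. A sketch with hypothetical helper names:

```rust
/// Number of kmers in a sequence of `nuc_len` nucleotides (0 if shorter than k).
fn kmers_in(nuc_len: u32, k: u32) -> u32 {
    (nuc_len + 1).saturating_sub(k)
}

/// Sequence length in nucleotides needed to hold `n_kmers` overlapping kmers.
fn nuc_len_for(n_kmers: u32, k: u32) -> u32 {
    n_kmers + k - 1
}

/// Maximum number of consecutive kmers that can share one minimiser of length m.
fn max_kmers_per_superkmer(k: u32, m: u32) -> u32 {
    k - m + 1
}
```

For k=31 these reproduce the table rows (255 nucleotides hold 225 kmers; 255 kmers need 285 nucleotides) and, with m=11, the structural bound of 21 kmers per superkmer.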

 ### Split strategy for long unitigs