docs: clarify MPHF indexing, storage layout, and distance traits

Formalize the two-phase MPHF indexing architecture and update Phase 6 to use `evidence.bin` for direct kmer extraction. Simplify the evidence and unitig storage layouts to flat packed formats enabling O(1) random access. Introduce aggregation traits (`ColumnWeights`, `CountPartials`, `BitPartials`) to support additive distance metric decomposition across partitions. Narrow the documented scope from metagenomic to individual genome datasets, and replace speculative open questions with concrete implementation specifications.
2026-05-17 10:20:22 +08:00
parent cf693f17f2
commit f36b095ce2
17 changed files with 916 additions and 1031 deletions
@@ -2,40 +2,66 @@

 ## Purpose

-`obilayeredmap` implements a persistent, incrementally extensible kmer index. The index is organised in three levels: **collection → partition → layer**. Each layer covers a disjoint kmer set (kmers absent from all earlier layers), wrapping a `ptr_hash` MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.
+`obilayeredmap` implements a persistent, incrementally extensible kmer index. The index is organised in three levels: **index root → partition → layer**. Each layer covers a disjoint kmer set and wraps a `ptr_hash` MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.

 ---

-## Four usage modes
+## Three usage modes

-The MPHF + evidence infrastructure is fixed for all modes. The **payload** — data associated with each slot — is orthogonal and varies by mode.
+The MPHF + evidence infrastructure is the same for all modes. The **payload** varies.

 | Mode | Description | Payload type | Storage |
 |---|---|---|---|
 | 1. Set | membership test only | `()` | — |
 | 2. Count | occurrences per kmer per sample | `PersistentCompactIntMatrix` | `counts/` directory |
-| 3. Presence/absence matrix | which genomes contain each kmer | `PersistentBitMatrix` | `presence/` directory |
-| 4. Count matrix | occurrences per kmer per genome | `PersistentCompactIntMatrix` | `counts/` directory |
+| 3. Presence/absence | which genomes contain each kmer | `PersistentBitMatrix` | `presence/` directory |

-Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` come from the `obicompactvec` crate. Mode 3 has a build path (`Layer::<PersistentBitMatrix>::build_presence`); mode 4 is not yet implemented.
-
-### Payload for modes 2/4: PersistentCompactIntMatrix
-
-`PersistentCompactIntMatrix` is a column-major matrix stored in a directory: one `col_NNNNNN.pciv` file per column, plus a `meta.json`. Each column is a `PersistentCompactIntVec` — a mmap'd PCIV file with a `u8` primary array (255 = overflow sentinel), a sorted overflow section of `(slot: u64, value: u32)` entries, and a sparse L1-fitting index.
-
-Mode 2 writes 1 column per layer (one sample). Mode 4 writes G columns (one per genome). `read(slot)` returns `Box<[u32]>` — the full row across all columns.
-
-### Payload for mode 3: PersistentBitMatrix
-
-`PersistentBitMatrix` is a column-major bit matrix stored in a directory: one `col_NNNNNN.pbiv` per genome, plus `meta.json`. Each column is a `PersistentBitVec` — a mmap'd PBIV file with u64 word-level bulk operations (AND, OR, XOR, NOT, POPCNT, Jaccard, Hamming). `read(slot)` returns `Box<[bool]>` — the presence vector across all genomes.
-
-Column-major layout makes per-genome set operations cache-friendly; the full row is assembled on demand at query time.
+Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` come from the `obicompactvec` crate.

 ---

-## Payload architecture
+## MphfLayer — autonomous kmer → slot mapping

-The payload is orthogonal to the MPHF + evidence layer. `Layer` is parameterised by `D: LayerData`:
+`MphfLayer` encapsulates the MPHF + evidence + unitig spine for one layer. It is independent of any payload data.
+
+```rust
+pub struct MphfLayer {
+    mphf:     Mphf,
+    evidence: Evidence,
+    unitigs:  UnitigFileReader,
+    n:        usize,   // number of indexed kmers = number of MPHF slots
+}
+```
+
+Public API:
+
+```rust
+impl MphfLayer {
+    pub fn open(dir: &Path) -> OLMResult<Self>
+    pub fn find(&self, kmer: CanonicalKmer) -> Option<usize>   // Some(slot) or None
+    pub fn n(&self) -> usize
+    pub fn unitig_writer(dir: &Path) -> OLMResult<UnitigFileWriter>
+    pub(crate) fn build(
+        dir: &Path,
+        fill_slot: &mut impl FnMut(usize, CanonicalKmer) -> OLMResult<()>,
+    ) -> OLMResult<usize>
+}
+```
+
+`find` returns `Some(slot)` only after verifying via evidence that the kmer is actually indexed, and `None` otherwise. The check is required because ptr_hash maps any input, including absent keys, to a valid slot; comparing the kmer decoded from evidence against the query is the only reliable membership test.
+
+`build` runs two sequential passes over `unitigs.bin`:
+
+1. **Pass 1**: iterate all canonical kmers in parallel via rayon, construct and store `mphf.bin`. `new_from_par_iter` avoids materialising a full key `Vec`.
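+2. **Pass 2**: iterate again sequentially, fill `evidence.bin`, and call `fill_slot(slot, kmer)` once per kmer for payload population. A compact seen-bitset (one bit per slot, about `n/8` bytes) verifies MPHF injectivity inline: a slot reached twice, or a slot index outside `0..n`, is reported as an error.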
+
+For empty layers (n = 0), `build` returns `Ok(0)` immediately after creating empty `mphf.bin` and `evidence.bin`.
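
As a sketch of the seen-bitset check described for Pass 2 (not code from this commit: `verify_injective` and its `String` error are hypothetical stand-ins for whatever `MphfLayer::build` does internally), one bit per slot is enough to detect duplicate or out-of-range slots in a single sequential pass:

```rust
/// Hypothetical helper: checks that the slots produced for n keys form a
/// bijection onto 0..n, using one bit per slot (about n/8 bytes).
fn verify_injective(
    slots: impl IntoIterator<Item = usize>,
    n: usize,
) -> Result<(), String> {
    // One bit per slot, rounded up to whole bytes.
    let mut seen = vec![0u8; (n + 7) / 8];
    let mut filled = 0usize;

    for slot in slots {
        if slot >= n {
            return Err(format!("slot {} out of bounds (n = {})", slot, n));
        }
        let mask = 1u8 << (slot % 8);
        if seen[slot / 8] & mask != 0 {
            // Two distinct keys landed on the same slot: not injective.
            return Err(format!("slot {} assigned twice", slot));
        }
        seen[slot / 8] |= mask;
        filled += 1;
    }

    if filled == n {
        Ok(())
    } else {
        Err(format!("only {} of {} slots filled", filled, n))
    }
}
```

For n = 0 the bitset is empty and the check passes trivially, which matches the early `Ok(0)` return for empty layers.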
+
+---
+
+## Layer\<D: LayerData\> — MPHF + payload
+
+`Layer<D>` pairs an `MphfLayer` with one payload store.

 ```rust
 pub trait LayerData: Sized {
@@ -45,10 +71,8 @@ pub trait LayerData: Sized {
 }

 pub struct Layer<D: LayerData = ()> {
-    mphf:     Mphf,
-    evidence: Evidence,
-    unitigs:  UnitigFileReader,
-    data:     D,
+    mphf: MphfLayer,
+    data: D,
 }

 pub struct Hit<T = ()> {
@@ -57,115 +81,15 @@ pub struct Hit<T = ()> {
 }
 ```

-`LayerData` covers the **read path only** (`open` + `read`). The write path (build) is intentionally not in the trait — build signatures differ between modes and forcing this into a trait would require an associated `Context` type with no benefit over specialized `impl` blocks.
-
-Implemented concrete types:
+`LayerData` covers the **read path only** (`open` + `read`). Build signatures differ between modes and are not in the trait.

 | Type | `Item` | Description |
 |---|---|---|
 | `()` | `()` | mode 1 — membership only |
-| `PersistentCompactIntMatrix` | `Box<[u32]>` | modes 2/4 — one count per column |
-| `PersistentBitMatrix` | `Box<[bool]>` | mode 3 — one presence bit per column |
+| `PersistentCompactIntMatrix` | `Box<[u32]>` | mode 2 — count matrix (one u32 per column per slot) |
+| `PersistentBitMatrix` | `Box<[bool]>` | mode 3 — presence matrix (one bit per genome per slot) |

-`LayeredMap` mirrors the same parameterisation: `LayeredMap<D: LayerData = ()>`.
-
---
-
-## Three-level hierarchy
-
-```
-index_root/                        ← LayeredMap (collection)
-  meta.json
-  part_00000/                      ← Partition
-    layer_0/                       ← Layer
-      mphf.bin
-      unitigs.bin
-      unitigs.bin.idx
-      evidence.bin
-      counts/              [modes 2/4]
-        meta.json          {"n": N, "n_cols": 1}
-        col_000000.pciv
-      presence/            [mode 3]
-        meta.json          {"n": N, "n_cols": G}
-        col_000000.pbiv
-        col_000001.pbiv
-        ...
-    layer_1/
-      ...
-  part_00001/
-    layer_0/
-    ...
-```
-
-**Collection** (`index_root/`): global metadata — kmer size k, number of partitions, layer count, sample registry.
-
-**Partition** (`part_XXXXX/`): one directory per hash bucket. All kmers whose canonical minimiser hashes to bucket X land in `part_XXXXX`. Partitions are independent and can be processed in parallel. The partition count and routing scheme (minimiser → bucket) are fixed at collection creation and recorded in `meta.json`.
-
-**Layer** (`layer_N/`): within a partition, a layer is the MPHF and its associated data for one dataset addition. Layer 0 is built from the first dataset A; layer 1 covers kmers in B not present in layer 0; and so on. Layers within a partition are disjoint: each kmer belongs to exactly one layer.
-
---
-
-## Layer file layout
-
-```
-layer_N/
-  mphf.bin            — ptr_hash MPHF (epserde, ptr_hash native format)
-  unitigs.bin         — packed 2-bit nucleotide sequences (obiskio binary format)
-  unitigs.bin.idx     — UIDX index: n_unitigs, n_kmers, seqls[], packed_offsets[]
-  evidence.bin        — u32 per MPHF slot: (unitig_id: 25 | rank: 7)
-  counts/             — [modes 2/4] PersistentCompactIntMatrix
-  presence/           — [mode 3] PersistentBitMatrix
-```
-
-`unitigs.bin` is the packed-2-bit sequence file produced by `obiskio::UnitigFileWriter`. The companion `.idx` file stores: magic `UIDX`, `n_unitigs: u32`, `n_kmers: u64`, `seqls: [u8; n_unitigs]` (kmer count − 1 per chunk), and `packed_offsets: [u32; n_unitigs + 1]` (byte offsets into `unitigs.bin`, sentinel-terminated). This gives O(1) random access to any unitig and the total kmer count without scanning the sequence file.
-
-### Evidence encoding
-
-Evidence maps each MPHF slot to its kmer's location in the unitig file. It serves two roles: membership verification (ptr_hash maps any input to a valid slot; decoding evidence and comparing to the query detects absent keys) and kmer reconstruction.
-
-```
-slot s  →  unitig_id: u25  |  rank: u7
-```
-
-Packed into a `u32` (29 bits used, 3 spare). Decoding:
-
-```
-kmer = unitigs[unitig_id][rank .. rank + k]   // 2-bit packed slice
-```
-
-`rank` is the kmer's 0-based index within the unitig (kmer units, not nucleotides). For k=31, m=11, the structural maximum is k − m + 1 = 21 kmers per unitig; the empirical maximum observed is ~46 kmers. A `u7` (0–127) is sufficient.
-
---
-
-## ptr_hash configuration
-
-The MPHF per layer is configured as:
-
-```rust
-type Mphf = PtrHash<
-    u64,                              // key type: canonical kmer raw encoding
-    CubicEps,                         // bucket fn: balanced (2.4 bits/key, λ=3.5)
-    CachelineEfVec<Vec<CachelineEf>>, // remap: 11.6 bits/entry vs 32 for Vec<u32>
-    Xx64,                             // hasher: XXH3-64 with seed, handles structured keys
-    Vec<u8>,                          // pilots
->;
-```
-
-**Hasher choice — `Xx64`:** k-mer raw values are left-aligned u64 with structural zeros in low bits (42 zeros for k=11, 2 zeros for k=31). `FxHash` (single multiply) distributes these poorly. `Xx64` (XXH3 64-bit, seeded) handles structured input correctly.
-
-**Bucket function — `CubicEps` with `PtrHashParams::<CubicEps>::default()`:** λ=3.5, α=0.99. Balanced tradeoff: 2× slower construction than `Linear/λ=3.0` (the `default_fast` preset), 20% less space. `default_compact` (λ=4.0) saves a further 12.5% at 2× more construction time and reduced reliability — not chosen.
-
-**Remap — `CachelineEfVec`:** Elias-Fano variant packing 44 sorted 40-bit values per 64-byte cacheline (11.6 bits/value vs 32 for `Vec<u32>`). Already a transitive dependency of `ptr_hash`. One cacheline per query vs one u32 read; space win dominates for billion-scale key sets.
-
---
-
-## Build path
-
-The build path is not part of `LayerData`. Each mode exposes its own `impl Layer<D>::build` with the exact signature it needs. Two private module-level helpers avoid code duplication:
-
-**`build_mphf(out_dir, n) -> OLMResult<Mphf>`**: first pass — opens `unitigs.bin`, iterates all canonical kmers in parallel via `new_from_par_iter`, stores `mphf.bin`. O(n).
-
-**`build_second_pass(out_dir, n, mphf, fill_slot) -> OLMResult<()>`**: second pass — opens `unitigs.bin` again, fills `evidence.bin` and a compact n/8-byte seen-bitset (MPHF correctness check inline), calls `fill_slot(slot, kmer)` once per kmer for the mode-specific payload. O(n).
+**Build signatures:**

 ```rust
 // mode 1
@@ -173,7 +97,7 @@ impl Layer<()> {
    pub fn build(out_dir: &Path) -> OLMResult<usize>
 }

-// modes 2/4
+// mode 2
 impl Layer<PersistentCompactIntMatrix> {
    pub fn build(out_dir: &Path, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
    pub fn build_from_map(out_dir: &Path, counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
@@ -189,33 +113,119 @@ impl Layer<PersistentBitMatrix> {
 }
 ```

-Mode 2 creates a `PersistentCompactIntMatrixBuilder` with 1 column and fills it via `build_second_pass`. Mode 3 creates a `PersistentBitMatrixBuilder` with `n_genomes` columns and fills all columns in a single pass.
+All build impls delegate MPHF + evidence construction to `MphfLayer::build` via a mode-specific `fill_slot` callback. Mode 2 pre-reads `n_kmers` (recorded in `unitigs.bin.idx`) to size the `PersistentCompactIntMatrixBuilder` before calling `MphfLayer::build`. Mode 3 does the same for `PersistentBitMatrixBuilder`.

-Any duplicate slot or out-of-bounds index detected during `build_second_pass` returns `OLMError::Mphf`. `new_from_par_iter` avoids materialising all keys as `Vec<u64>`.
+---
+
+## LayeredStore\<S\> and aggregation traits
+
+`LayeredStore<S>` is a generic aggregation wrapper over `Vec<S>`. It propagates three traits from `obicompactvec::traits` up the hierarchy via blanket impls:
+
+```rust
+pub struct LayeredStore<S>(pub Vec<S>);
+
+impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> { … }  // Σ col_weights across inner stores
+impl<S: CountPartials> CountPartials  for LayeredStore<S> { … }  // element-wise Σ partials
+impl<S: BitPartials>   BitPartials    for LayeredStore<S> { … }  // element-wise Σ partials
+```
+
+Because blanket impls compose, `LayeredStore<LayeredStore<S>>` automatically inherits all three traits when `S` does — providing the partitioned level without a separate type.
+
+**Aggregation hierarchy:**
+
+```
+PersistentCompactIntMatrix                  implements CountPartials
+LayeredStore<PersistentCompactIntMatrix>         via blanket impl  (one partition)
+LayeredStore<LayeredStore<…>>                    via blanket impl  (partitioned index)
+```
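
The composition of blanket impls can be illustrated with a reduced model. This is only a sketch under stated assumptions: the real `ColumnWeights` trait lives in `obicompactvec::traits` and its method signature is not shown in this document, so the single `col_weights(&self) -> Vec<u64>` method and the `Leaf` type below are hypothetical.

```rust
/// Assumed, simplified shape of the trait (the real one may differ).
trait ColumnWeights {
    fn col_weights(&self) -> Vec<u64>;
}

/// Same shape as the wrapper described above: a newtype over Vec<S>.
struct LayeredStore<S>(pub Vec<S>);

/// Blanket impl: per-column weights are summed across the inner stores.
impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> {
    fn col_weights(&self) -> Vec<u64> {
        let mut total: Vec<u64> = Vec::new();
        for store in &self.0 {
            let w = store.col_weights();
            if total.len() < w.len() {
                total.resize(w.len(), 0);
            }
            for (t, x) in total.iter_mut().zip(w) {
                *t += x;
            }
        }
        total
    }
}

/// Hypothetical leaf standing in for a matrix with precomputed column sums.
struct Leaf(Vec<u64>);

impl ColumnWeights for Leaf {
    fn col_weights(&self) -> Vec<u64> {
        self.0.clone()
    }
}

/// Two partitions, the first with two layers, the second with one.
fn example_index() -> LayeredStore<LayeredStore<Leaf>> {
    let part0 = LayeredStore(vec![Leaf(vec![1, 2]), Leaf(vec![3, 4])]);
    let part1 = LayeredStore(vec![Leaf(vec![10, 20])]);
    LayeredStore(vec![part0, part1])
}
```

Because the impl is generic over `S: ColumnWeights`, the nested `LayeredStore<LayeredStore<Leaf>>` needs no extra code: the outer call sums the per-partition results, which are themselves sums over layers.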
+
+**Leaf implementors** (in `obicompactvec`):
+
+| Type | Traits |
+|---|---|
+| `PersistentCompactIntMatrix` | `ColumnWeights` (via `sum()`) + `CountPartials` |
+| `PersistentBitMatrix` | `ColumnWeights` (via `count_ones()`) + `BitPartials` |
+
+`PersistentCompactIntVec` and `PersistentBitVec` do not implement these traits — they are single-column primitives, not matrix-level aggregators.
+
+See [Kmer index architecture](../architecture/index_architecture.md) for the full trait API and the two-pass normalised-metric pattern.
+
+---
+
+## On-disk structure
+
+```
+index_root/                        ← LayeredMap (collection)
+  meta.json
+  part_00000/                      ← Partition
+    layer_0/                       ← Layer
+      mphf.bin           — ptr_hash MPHF (epserde format)
+      unitigs.bin        — packed 2-bit nucleotide sequences
+      unitigs.bin.idx    — UIDX index: n_unitigs, n_kmers, seqls[], packed_offsets[]
+      evidence.bin       — n × u32, each = (chunk_id: 25 bits | rank: 7 bits), LE
+      counts/            [mode 2] PersistentCompactIntMatrix
+        meta.json          {"n": N, "n_cols": 1}
+        col_000000.pciv
+      presence/          [mode 3] PersistentBitMatrix
+        meta.json          {"n": N, "n_cols": G}
+        col_000000.pbiv
+        …
+    layer_1/
+      …
+  part_00001/
+    …
+```
+
+**Partition** (`part_XXXXX/`): all kmers whose canonical minimiser hashes to this bucket. Partitions are independent and can be processed in parallel.
+
+**Layer** (`layer_N/`): one `MphfLayer` plus optional payload. Layer 0 covers dataset A; layer 1 covers kmers in B absent from A; etc. Layers within a partition are always disjoint.
+
+---
+
+## Evidence encoding
+
+`evidence.bin` is a flat `[u32; n]` array with no header. Each u32 encodes one slot:
+
+```
+bits [31:7] = chunk_id (25 bits) — index of the unitig chunk
+bits [6:0]  = rank     (7 bits)  — kmer index within the chunk (0-based)
+```
+
+Decoding: `chunk_id = raw >> 7`, `rank = raw & 0x7F`. Reconstructing the kmer: read k nucleotides at position `rank` within unitig `chunk_id`.
+
+For k=31, m=11, the observed maximum is ~46 kmers per chunk, well within the 128 kmers (ranks 0 to 127) that a 7-bit rank can address. The structural maximum from superkmer construction is k − m + 1 = 21 kmers/unitig; longer unitigs arise from paths spanning more than one superkmer.
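
The bit layout can be sketched as a pair of pack/unpack helpers plus a headerless little-endian slot read. The function names are hypothetical and this is not the crate's `Evidence` API; only the layout (`chunk_id` in bits 31..7, `rank` in bits 6..0, 4 bytes per slot, LE) comes from the description above.

```rust
const RANK_BITS: u32 = 7;
const RANK_MASK: u32 = (1 << RANK_BITS) - 1; // 0x7F
const MAX_CHUNK_ID: u32 = (1 << 25) - 1;

/// Packs (chunk_id, rank) into one u32, or None if a field overflows.
fn pack_evidence(chunk_id: u32, rank: u32) -> Option<u32> {
    if chunk_id > MAX_CHUNK_ID || rank > RANK_MASK {
        return None;
    }
    Some((chunk_id << RANK_BITS) | rank)
}

/// Inverse of `pack_evidence`: returns (chunk_id, rank).
fn unpack_evidence(raw: u32) -> (u32, u32) {
    (raw >> RANK_BITS, raw & RANK_MASK)
}

/// Reads slot `slot` from a flat, headerless little-endian `[u32; n]` buffer.
fn read_slot(evidence: &[u8], slot: usize) -> Option<(u32, u32)> {
    let start = slot.checked_mul(4)?;
    let end = start.checked_add(4)?;
    let b = evidence.get(start..end)?;
    Some(unpack_evidence(u32::from_le_bytes([b[0], b[1], b[2], b[3]])))
}
```

With no header, the byte offset of a slot is simply `4 * slot`, which is what makes the lookup O(1) on a memory-mapped file.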
+
+---
+
+## ptr_hash configuration
+
+```rust
+type Mphf = PtrHash<
+    u64,                              // key type: canonical kmer raw encoding
+    CubicEps,                         // bucket fn: 2.4 bits/key, λ=3.5, α=0.99
+    CachelineEfVec<Vec<CachelineEf>>, // remap: 11.6 bits/entry (Elias-Fano)
+    Xx64,                             // hasher: XXH3-64 with seed
+    Vec<u8>,                          // pilots
+>;
+```
+
+`Xx64` is chosen over `FxHash` because canonical kmer raw values are left-aligned u64 with structural zeros in the low bits (42 zeros for k=11, 2 zeros for k=31), which single-multiply hashes distribute poorly.
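
The zero counts quoted above follow from 2 bits per nucleotide in a 64-bit word: a left-aligned kmer leaves 64 − 2k low bits unused. A minimal sketch of that arithmetic (the helper names are hypothetical, and the left-aligned encoding is taken from the sentence above rather than from the crate's source):

```rust
/// Number of always-zero low bits in a left-aligned 2-bit-packed kmer.
/// Returns None when k does not fit in a u64 (k = 0 or k > 32).
fn low_zero_bits(k: u32) -> Option<u32> {
    if k == 0 || k > 32 {
        None
    } else {
        Some(64 - 2 * k)
    }
}

/// Shifts a right-aligned 2k-bit value so it occupies the high bits.
fn left_align(right_aligned: u64, k: u32) -> Option<u64> {
    low_zero_bits(k).map(|z| right_aligned << z)
}
```

For k=11 every key shares 42 trailing zero bits, so a hash that mixes input bits weakly sees far less entropy than the 64-bit key width suggests.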
+
+`CubicEps` with `PtrHashParams::<CubicEps>::default()` (λ=3.5, α=0.99) is a balanced tradeoff: about 2× slower construction than `Linear` with λ=3.0 (the `default_fast` preset), for 20% less space.

 ---

 ## Query path

-A kmer query routes through all three levels:
-
-1. **Partition routing**: hash canonical minimiser of the query kmer → partition index → open `part_XXXXX/`.
-2. **Layer probing**: iterate layers in order; for each layer compute `slot = mphf.index(kmer)`, decode evidence, compare to query. First match wins.
-3. **Data access**: `layer.data.read(slot)` returns `D::Item`.
-
 ```rust
-// pseudo-code
-fn query(kmer) -> Option<(usize, Hit<D::Item>)>:
-    for (i, layer) in self.layers.iter().enumerate():
-        slot = layer.mphf.index(&kmer.raw())
-        if layer.evidence.decode(slot) == kmer:
-            return Some((i, Hit { slot, data: layer.data.read(slot) }))
-    return None
+pub fn query(&self, kmer: CanonicalKmer) -> Option<Hit<D::Item>> {
+    self.mphf.find(kmer).map(|slot| Hit { slot, data: self.data.read(slot) })
+}
 ```

-Expected probe depth: 1 for kmers in layer 0, increasing for later layers.
+`MphfLayer::find` probes the MPHF, decodes evidence, and verifies the kmer — returning `Some(slot)` on match, `None` otherwise. `data.read(slot)` is called only on a confirmed hit.

-For mode 2, `hit.data` is `Box<[u32]>` with 1 element; `hit.data[0]` is the count. For mode 3, `hit.data` is `Box<[bool]>` with G elements, one per genome.
+In `LayeredMap`, layers are probed in order and the first match wins. Expected probe depth: 1 for kmers in layer 0 and i + 1 for kmers stored in layer i; a kmer absent from the index probes every layer.
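
First-match probing across layers can be sketched generically. `probe_layers` is a hypothetical helper, not the `LayeredMap` API; the `find` closure stands in for `Layer::query`, and the test data models each layer as a plain list of keys.

```rust
/// Probes layers in order and returns (layer index, hit) for the first
/// layer whose `find` succeeds, or None if every layer misses.
fn probe_layers<L, T>(
    layers: &[L],
    find: impl Fn(&L) -> Option<T>,
) -> Option<(usize, T)> {
    layers
        .iter()
        .enumerate()
        // `find_map` stops at the first Some, so later layers are not probed.
        .find_map(|(i, layer)| find(layer).map(|hit| (i, hit)))
}

/// Toy layers: disjoint key lists, where the "slot" is the key's position.
fn toy_layers() -> Vec<Vec<u64>> {
    vec![vec![1, 2, 3], vec![7, 8], vec![42]]
}

/// Looks up `key` in the toy layers, returning (layer index, slot).
fn toy_query(layers: &[Vec<u64>], key: u64) -> Option<(usize, usize)> {
    probe_layers(layers, |layer: &Vec<u64>| {
        layer.iter().position(|&x| x == key)
    })
}
```

Since layers are disjoint, the first match is also the only match; probing in order matters for cost rather than for correctness.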

 ---

@@ -223,11 +233,11 @@ For mode 2, `hit.data` is `Box<[u32]>` with 1 element; `hit.data[0]` is the coun

 When adding dataset B to an existing index:

-1. For each partition, iterate kmers of B routed to that partition.
-2. Probe existing layers; collect kmers absent from all layers → `B \ index`.
-3. Build a new layer from `B \ index`.
-4. Append the new layer directory under each `part_XXXXX/`.
-5. Update `meta.json` (layer count, sample registry).
+1. For each partition, probe existing layers for kmers of B routed to that partition.
+2. Collect kmers absent from all layers → `B \ index`.
+3. Write `B \ index` to a new `unitigs.bin` via `MphfLayer::unitig_writer`.
+4. Call `Layer<D>::build` on the new directory.
+5. Update `meta.json`.

 Each partition's new layer is built independently; the operation is fully parallel across partitions.

@@ -237,24 +247,11 @@ Each partition's new layer is built independently; the operation is fully parall

 | crate | role |
 |---|---|
-| `ptr_hash 1.1` | MPHF per layer (epserde serialisation) |
-| `cacheline-ef 1.1` | compact remap storage inside ptr_hash |
-| `epserde 0.8` | zero-copy serialisation of MPHF |
-| `memmap2` | mmap of layer files |
+| `ptr_hash 1.1` | MPHF per layer |
+| `cacheline-ef 1.1` | compact remap inside ptr_hash |
+| `epserde 0.8` | zero-copy MPHF serialisation |
+| `memmap2 0.9` | mmap of evidence and payload files |
 | `obiskio` | unitig file writer/reader |
-| `obicompactvec` | payload types: `PersistentCompactIntMatrix`, `PersistentBitMatrix` |
-
---
-
-## Relationship to target architecture
-
-The target architecture (see [Kmer index architecture](../architecture/index_architecture.md)) separates `MphfLayer` from data stores entirely and introduces a `PartitionedIndex` with parallel dispatch and an `Aggregator` pattern. The current implementation is a stepping stone: `obicompactvec` types are already fully decoupled from the MPHF; the remaining refactoring is within `obilayeredmap` itself.
-
---
-
-## Open questions
-
- **Mode 4**: count matrix (n_kmers × n_genomes × bytes_per_count) is structurally identical to mode 3 but uses `PersistentCompactIntMatrix` with G columns. Build API not yet implemented. Scale concern: hundreds of GB for large collections — a sparse representation may be required at high genome counts.
- **Layer merge**: merging two `LayeredMap` instances into a single-layer index requires full rebuild. Define API and cost model.
- **Canonical kmer orientation**: evidence stores canonical kmer; strand recovery requires one 64-bit revcomp comparison at query time.
- **`try_new_from_par_iter`**: `ptr_hash::new_from_par_iter` silently discards construction failure. Post-construction verification (current workaround) is correct but does not allow retry. A `try_new_from_par_iter` PR upstream would close this gap.
+| `obicompactvec` | payload types + aggregation traits |
+| `rayon 1` | parallel MPHF construction pass |
+| `ndarray 0.16` | aggregation output arrays |