Files
obikmer/docmd/implementation/obilayeredmap.md
T
Eric Coissac f36b095ce2 docs: clarify MPHF indexing, storage layout, and distance traits
Formalize the two-phase MPHF indexing architecture and update Phase 6 to use `evidence.bin` for direct kmer extraction. Simplify the evidence and unitig storage layouts to flat packed formats enabling O(1) random access. Introduce aggregation traits (`ColumnWeights`, `CountPartials`, `BitPartials`) to support additive distance metric decomposition across partitions. Narrow the documented scope from metagenomic to individual genome datasets, and replace speculative open questions with concrete implementation specifications.
2026-05-17 15:59:10 +08:00

258 lines
9.4 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# obilayeredmap — layered kmer index crate
## Purpose
`obilayeredmap` implements a persistent, incrementally extensible kmer index. The index is organised in three levels: **index root → partition → layer**. Each layer covers a disjoint kmer set and wraps a `ptr_hash` MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.
---
## Three usage modes
The MPHF + evidence infrastructure is the same for all modes. The **payload** varies.
| Mode | Description | Payload type | Storage |
|---|---|---|---|
| 1. Set | membership test only | `()` | — |
| 2. Count | occurrences per kmer per sample | `PersistentCompactIntMatrix` | `counts/` directory |
| 3. Presence/absence | which genomes contain each kmer | `PersistentBitMatrix` | `presence/` directory |
Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` come from the `obicompactvec` crate.
---
## MphfLayer — autonomous kmer → slot mapping
`MphfLayer` encapsulates the MPHF + evidence + unitig spine for one layer. It is independent of any payload data.
```rust
pub struct MphfLayer {
mphf: Mphf,
evidence: Evidence,
unitigs: UnitigFileReader,
n: usize, // number of indexed kmers = number of MPHF slots
}
```
Public API:
```rust
impl MphfLayer {
pub fn open(dir: &Path) -> OLMResult<Self>
pub fn find(&self, kmer: CanonicalKmer) -> Option<usize> // Some(slot) or None
pub fn n(&self) -> usize
pub fn unitig_writer(dir: &Path) -> OLMResult<UnitigFileWriter>
pub(crate) fn build(
dir: &Path,
fill_slot: &mut impl FnMut(usize, CanonicalKmer) -> OLMResult<()>,
) -> OLMResult<usize>
}
```
`find` returns `Some(slot)` only after verifying via evidence that the kmer is actually indexed. It returns `None` for absent keys (ptr_hash maps any input to a valid slot; evidence verification is the only correct-membership test).
`build` runs two sequential passes over `unitigs.bin`:
1. **Pass 1**: iterate all canonical kmers in parallel via rayon, construct and store `mphf.bin`. `new_from_par_iter` avoids materialising a full key `Vec`.
2. **Pass 2**: iterate again sequentially, fill `evidence.bin`, call `fill_slot(slot, kmer)` once per kmer for payload population. A compact `n/8`-byte seen-bitset verifies MPHF injectivity inline.
For empty layers (n = 0), `build` returns `Ok(0)` immediately after creating empty `mphf.bin` and `evidence.bin`.
---
## Layer\<D: LayerData\> — MPHF + payload
`Layer<D>` pairs an `MphfLayer` with one payload store.
```rust
pub trait LayerData: Sized {
type Item;
fn open(layer_dir: &Path) -> OLMResult<Self>;
fn read(&self, slot: usize) -> Self::Item;
}
pub struct Layer<D: LayerData = ()> {
mphf: MphfLayer,
data: D,
}
pub struct Hit<T = ()> {
pub slot: usize,
pub data: T,
}
```
`LayerData` covers the **read path only** (`open` + `read`). Build signatures differ between modes and are not in the trait.
| Type | `Item` | Description |
|---|---|---|
| `()` | `()` | mode 1 — membership only |
| `PersistentCompactIntMatrix` | `Box<[u32]>` | mode 2 — count matrix (one u32 per column per slot) |
| `PersistentBitMatrix` | `Box<[bool]>` | mode 3 — presence matrix (one bit per genome per slot) |
**Build signatures:**
```rust
// mode 1
impl Layer<()> {
pub fn build(out_dir: &Path) -> OLMResult<usize>
}
// mode 2
impl Layer<PersistentCompactIntMatrix> {
pub fn build(out_dir: &Path, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
pub fn build_from_map(out_dir: &Path, counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
}
// mode 3
impl Layer<PersistentBitMatrix> {
pub fn build_presence(
out_dir: &Path,
n_genomes: usize,
present_in: impl Fn(CanonicalKmer, usize) -> bool,
) -> OLMResult<usize>
}
```
All build impls delegate MPHF + evidence construction to `MphfLayer::build` via a mode-specific `fill_slot` callback. Mode 2 pre-reads `n_kmers` from `unitigs.bin` to size the `PersistentCompactIntMatrixBuilder` before calling `MphfLayer::build`. Mode 3 does the same for `PersistentBitMatrixBuilder`.
---
## LayeredStore\<S\> and aggregation traits
`LayeredStore<S>` is a generic aggregation wrapper over `Vec<S>`. It propagates three traits from `obicompactvec::traits` up the hierarchy via blanket impls:
```rust
pub struct LayeredStore<S>(pub Vec<S>);
impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> { } // Σ col_weights across inner stores
impl<S: CountPartials> CountPartials for LayeredStore<S> { } // element-wise Σ partials
impl<S: BitPartials> BitPartials for LayeredStore<S> { } // element-wise Σ partials
```
Because blanket impls compose, `LayeredStore<LayeredStore<S>>` automatically inherits all three traits when `S` does — providing the partitioned level without a separate type.
**Aggregation hierarchy:**
```
PersistentCompactIntMatrix implements CountPartials
LayeredStore<PersistentCompactIntMatrix> via blanket impl (one partition)
LayeredStore<LayeredStore<…>> via blanket impl (partitioned index)
```
**Leaf implementors** (in `obicompactvec`):
| Type | Traits |
|---|---|
| `PersistentCompactIntMatrix` | `ColumnWeights` (via `sum()`) + `CountPartials` |
| `PersistentBitMatrix` | `ColumnWeights` (via `count_ones()`) + `BitPartials` |
`PersistentCompactIntVec` and `PersistentBitVec` do not implement these traits — they are single-column primitives, not matrix-level aggregators.
See [Kmer index architecture](../architecture/index_architecture.md) for the full trait API and the two-pass normalised-metric pattern.
---
## On-disk structure
```
index_root/ ← LayeredMap (collection)
meta.json
part_00000/ ← Partition
layer_0/ ← Layer
mphf.bin — ptr_hash MPHF (epserde format)
unitigs.bin — packed 2-bit nucleotide sequences
unitigs.bin.idx — UIDX index: n_unitigs, n_kmers, seqls[], packed_offsets[]
evidence.bin — n × u32, each = (chunk_id: 25 bits | rank: 7 bits), LE
counts/ [mode 2] PersistentCompactIntMatrix
meta.json {"n": N, "n_cols": 1}
col_000000.pciv
presence/ [mode 3] PersistentBitMatrix
meta.json {"n": N, "n_cols": G}
col_000000.pbiv
layer_1/
part_00001/
```
**Partition** (`part_XXXXX/`): all kmers whose canonical minimiser hashes to this bucket. Partitions are independent and can be processed in parallel.
**Layer** (`layer_N/`): one `MphfLayer` plus optional payload. Layer 0 covers dataset A; layer 1 covers kmers in B absent from A; etc. Layers within a partition are always disjoint.
---
## Evidence encoding
`evidence.bin` is a flat `[u32; n]` array with no header. Each u32 encodes one slot:
```
bits [31:7] = chunk_id (25 bits) — index of the unitig chunk
bits [6:0] = rank (7 bits) — kmer index within the chunk (0-based)
```
Decoding: `chunk_id = raw >> 7`, `rank = raw & 0x7F`. Reconstructing the kmer: read k nucleotides at position `rank` within unitig `chunk_id`.
For k=31, m=11, the observed maximum is ~46 kmers per chunk — well within the 127-kmer u7 capacity. The structural maximum from superkmer construction is k m + 1 = 21 kmers/unitig; longer unitigs arise from paths spanning more than one superkmer.
---
## ptr_hash configuration
```rust
type Mphf = PtrHash<
u64, // key type: canonical kmer raw encoding
CubicEps, // bucket fn: 2.4 bits/key, λ=3.5, α=0.99
CachelineEfVec<Vec<CachelineEf>>, // remap: 11.6 bits/entry (Elias-Fano)
Xx64, // hasher: XXH3-64 with seed
Vec<u8>, // pilots
>;
```
`Xx64` is chosen over `FxHash` because canonical kmer raw values are left-aligned u64 with structural zeros in the low bits (42 zeros for k=11, 2 zeros for k=31), which single-multiply hashes distribute poorly.
`CubicEps` with `PtrHashParams::<CubicEps>::default()` (λ=3.5) is a balanced tradeoff: 2× slower construction than `Linear/λ=3.0`, 20% less space.
---
## Query path
```rust
pub fn query(&self, kmer: CanonicalKmer) -> Option<Hit<D::Item>> {
self.mphf.find(kmer).map(|slot| Hit { slot, data: self.data.read(slot) })
}
```
`MphfLayer::find` probes the MPHF, decodes evidence, and verifies the kmer — returning `Some(slot)` on match, `None` otherwise. `data.read(slot)` is called only on a confirmed hit.
In `LayeredMap`, layers are probed in order; the first match wins. Expected probe depth: 1 for kmers in layer 0.
---
## Add-layer algorithm
When adding dataset B to an existing index:
1. For each partition, probe existing layers for kmers of B routed to that partition.
2. Collect kmers absent from all layers → `B \ index`.
3. Write `B \ index` to a new `unitigs.bin` via `MphfLayer::unitig_writer`.
4. Call `Layer<D>::build` on the new directory.
5. Update `meta.json`.
Each partition's new layer is built independently; the operation is fully parallel across partitions.
---
## Dependencies
| crate | role |
|---|---|
| `ptr_hash 1.1` | MPHF per layer |
| `cacheline-ef 1.1` | compact remap inside ptr_hash |
| `epserde 0.8` | zero-copy MPHF serialisation |
| `memmap2 0.9` | mmap of evidence and payload files |
| `obiskio` | unitig file writer/reader |
| `obicompactvec` | payload types + aggregation traits |
| `rayon 1` | parallel MPHF construction pass |
| `ndarray 0.16` | aggregation output arrays |