f36b095ce2
Formalize the two-phase MPHF indexing architecture and update Phase 6 to use `evidence.bin` for direct kmer extraction. Simplify the evidence and unitig storage layouts to flat packed formats enabling O(1) random access. Introduce aggregation traits (`ColumnWeights`, `CountPartials`, `BitPartials`) to support additive distance metric decomposition across partitions. Narrow the documented scope from metagenomic to individual genome datasets, and replace speculative open questions with concrete implementation specifications.
258 lines
9.4 KiB
Markdown
258 lines
9.4 KiB
Markdown
# obilayeredmap — layered kmer index crate
|
||
|
||
## Purpose
|
||
|
||
`obilayeredmap` implements a persistent, incrementally extensible kmer index. The index is organised in three levels: **index root → partition → layer**. Each layer covers a disjoint kmer set and wraps a `ptr_hash` MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.
|
||
|
||
---
|
||
|
||
## Three usage modes
|
||
|
||
The MPHF + evidence infrastructure is the same for all modes. The **payload** varies.
|
||
|
||
| Mode | Description | Payload type | Storage |
|
||
|---|---|---|---|
|
||
| 1. Set | membership test only | `()` | — |
|
||
| 2. Count | occurrences per kmer per sample | `PersistentCompactIntMatrix` | `counts/` directory |
|
||
| 3. Presence/absence | which genomes contain each kmer | `PersistentBitMatrix` | `presence/` directory |
|
||
|
||
Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` come from the `obicompactvec` crate.
|
||
|
||
---
|
||
|
||
## MphfLayer — autonomous kmer → slot mapping
|
||
|
||
`MphfLayer` encapsulates the MPHF + evidence + unitig spine for one layer. It is independent of any payload data.
|
||
|
||
```rust
|
||
pub struct MphfLayer {
|
||
mphf: Mphf,
|
||
evidence: Evidence,
|
||
unitigs: UnitigFileReader,
|
||
n: usize, // number of indexed kmers = number of MPHF slots
|
||
}
|
||
```
|
||
|
||
Public API:
|
||
|
||
```rust
|
||
impl MphfLayer {
|
||
pub fn open(dir: &Path) -> OLMResult<Self>
|
||
pub fn find(&self, kmer: CanonicalKmer) -> Option<usize> // Some(slot) or None
|
||
pub fn n(&self) -> usize
|
||
pub fn unitig_writer(dir: &Path) -> OLMResult<UnitigFileWriter>
|
||
pub(crate) fn build(
|
||
dir: &Path,
|
||
fill_slot: &mut impl FnMut(usize, CanonicalKmer) -> OLMResult<()>,
|
||
) -> OLMResult<usize>
|
||
}
|
||
```
|
||
|
||
`find` returns `Some(slot)` only after verifying via evidence that the kmer is actually indexed. It returns `None` for absent keys (ptr_hash maps any input to a valid slot; evidence verification is the only correct-membership test).
|
||
|
||
`build` runs two sequential passes over `unitigs.bin`:
|
||
|
||
1. **Pass 1**: iterate all canonical kmers in parallel via rayon, construct and store `mphf.bin`. `new_from_par_iter` avoids materialising a full key `Vec`.
|
||
2. **Pass 2**: iterate again sequentially, fill `evidence.bin`, call `fill_slot(slot, kmer)` once per kmer for payload population. A compact `n/8`-byte seen-bitset verifies MPHF injectivity inline.
|
||
|
||
For empty layers (n = 0), `build` returns `Ok(0)` immediately after creating empty `mphf.bin` and `evidence.bin`.
|
||
|
||
---
|
||
|
||
## Layer\<D: LayerData\> — MPHF + payload
|
||
|
||
`Layer<D>` pairs an `MphfLayer` with one payload store.
|
||
|
||
```rust
|
||
pub trait LayerData: Sized {
|
||
type Item;
|
||
fn open(layer_dir: &Path) -> OLMResult<Self>;
|
||
fn read(&self, slot: usize) -> Self::Item;
|
||
}
|
||
|
||
pub struct Layer<D: LayerData = ()> {
|
||
mphf: MphfLayer,
|
||
data: D,
|
||
}
|
||
|
||
pub struct Hit<T = ()> {
|
||
pub slot: usize,
|
||
pub data: T,
|
||
}
|
||
```
|
||
|
||
`LayerData` covers the **read path only** (`open` + `read`). Build signatures differ between modes and are not in the trait.
|
||
|
||
| Type | `Item` | Description |
|
||
|---|---|---|
|
||
| `()` | `()` | mode 1 — membership only |
|
||
| `PersistentCompactIntMatrix` | `Box<[u32]>` | mode 2 — count matrix (one u32 per column per slot) |
|
||
| `PersistentBitMatrix` | `Box<[bool]>` | mode 3 — presence matrix (one bit per genome per slot) |
|
||
|
||
**Build signatures:**
|
||
|
||
```rust
|
||
// mode 1
|
||
impl Layer<()> {
|
||
pub fn build(out_dir: &Path) -> OLMResult<usize>
|
||
}
|
||
|
||
// mode 2
|
||
impl Layer<PersistentCompactIntMatrix> {
|
||
pub fn build(out_dir: &Path, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
|
||
pub fn build_from_map(out_dir: &Path, counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
|
||
}
|
||
|
||
// mode 3
|
||
impl Layer<PersistentBitMatrix> {
|
||
pub fn build_presence(
|
||
out_dir: &Path,
|
||
n_genomes: usize,
|
||
present_in: impl Fn(CanonicalKmer, usize) -> bool,
|
||
) -> OLMResult<usize>
|
||
}
|
||
```
|
||
|
||
All build impls delegate MPHF + evidence construction to `MphfLayer::build` via a mode-specific `fill_slot` callback. Mode 2 pre-reads `n_kmers` from `unitigs.bin` to size the `PersistentCompactIntMatrixBuilder` before calling `MphfLayer::build`. Mode 3 does the same for `PersistentBitMatrixBuilder`.
|
||
|
||
---
|
||
|
||
## LayeredStore\<S\> and aggregation traits
|
||
|
||
`LayeredStore<S>` is a generic aggregation wrapper over `Vec<S>`. It propagates three traits from `obicompactvec::traits` up the hierarchy via blanket impls:
|
||
|
||
```rust
|
||
pub struct LayeredStore<S>(pub Vec<S>);
|
||
|
||
impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> { … } // Σ col_weights across inner stores
|
||
impl<S: CountPartials> CountPartials for LayeredStore<S> { … } // element-wise Σ partials
|
||
impl<S: BitPartials> BitPartials for LayeredStore<S> { … } // element-wise Σ partials
|
||
```
|
||
|
||
Because blanket impls compose, `LayeredStore<LayeredStore<S>>` automatically inherits all three traits when `S` does — providing the partitioned level without a separate type.
|
||
|
||
**Aggregation hierarchy:**
|
||
|
||
```
|
||
PersistentCompactIntMatrix implements CountPartials
|
||
LayeredStore<PersistentCompactIntMatrix> via blanket impl (one partition)
|
||
LayeredStore<LayeredStore<…>> via blanket impl (partitioned index)
|
||
```
|
||
|
||
**Leaf implementors** (in `obicompactvec`):
|
||
|
||
| Type | Traits |
|
||
|---|---|
|
||
| `PersistentCompactIntMatrix` | `ColumnWeights` (via `sum()`) + `CountPartials` |
|
||
| `PersistentBitMatrix` | `ColumnWeights` (via `count_ones()`) + `BitPartials` |
|
||
|
||
`PersistentCompactIntVec` and `PersistentBitVec` do not implement these traits — they are single-column primitives, not matrix-level aggregators.
|
||
|
||
See [Kmer index architecture](../architecture/index_architecture.md) for the full trait API and the two-pass normalised-metric pattern.
|
||
|
||
---
|
||
|
||
## On-disk structure
|
||
|
||
```
|
||
index_root/ ← LayeredMap (collection)
|
||
meta.json
|
||
part_00000/ ← Partition
|
||
layer_0/ ← Layer
|
||
mphf.bin — ptr_hash MPHF (epserde format)
|
||
unitigs.bin — packed 2-bit nucleotide sequences
|
||
unitigs.bin.idx — UIDX index: n_unitigs, n_kmers, seqls[], packed_offsets[]
|
||
evidence.bin — n × u32, each = (chunk_id: 25 bits | rank: 7 bits), LE
|
||
counts/ [mode 2] PersistentCompactIntMatrix
|
||
meta.json {"n": N, "n_cols": 1}
|
||
col_000000.pciv
|
||
presence/ [mode 3] PersistentBitMatrix
|
||
meta.json {"n": N, "n_cols": G}
|
||
col_000000.pbiv
|
||
…
|
||
layer_1/
|
||
…
|
||
part_00001/
|
||
…
|
||
```
|
||
|
||
**Partition** (`part_XXXXX/`): all kmers whose canonical minimiser hashes to this bucket. Partitions are independent and can be processed in parallel.
|
||
|
||
**Layer** (`layer_N/`): one `MphfLayer` plus optional payload. Layer 0 covers dataset A; layer 1 covers kmers in B absent from A; etc. Layers within a partition are always disjoint.
|
||
|
||
---
|
||
|
||
## Evidence encoding
|
||
|
||
`evidence.bin` is a flat `[u32; n]` array with no header. Each u32 encodes one slot:
|
||
|
||
```
|
||
bits [31:7] = chunk_id (25 bits) — index of the unitig chunk
|
||
bits [6:0] = rank (7 bits) — kmer index within the chunk (0-based)
|
||
```
|
||
|
||
Decoding: `chunk_id = raw >> 7`, `rank = raw & 0x7F`. Reconstructing the kmer: read k nucleotides at position `rank` within unitig `chunk_id`.
|
||
|
||
For k=31, m=11, the observed maximum is ~46 kmers per chunk — well within the 127-kmer u7 capacity. The structural maximum from superkmer construction is k − m + 1 = 21 kmers/unitig; longer unitigs arise from paths spanning more than one superkmer.
|
||
|
||
---
|
||
|
||
## ptr_hash configuration
|
||
|
||
```rust
|
||
type Mphf = PtrHash<
|
||
u64, // key type: canonical kmer raw encoding
|
||
CubicEps, // bucket fn: 2.4 bits/key, λ=3.5, α=0.99
|
||
CachelineEfVec<Vec<CachelineEf>>, // remap: 11.6 bits/entry (Elias-Fano)
|
||
Xx64, // hasher: XXH3-64 with seed
|
||
Vec<u8>, // pilots
|
||
>;
|
||
```
|
||
|
||
`Xx64` is chosen over `FxHash` because canonical kmer raw values are left-aligned u64 with structural zeros in the low bits (42 zeros for k=11, 2 zeros for k=31), which single-multiply hashes distribute poorly.
|
||
|
||
`CubicEps` with `PtrHashParams::<CubicEps>::default()` (λ=3.5) is a balanced tradeoff: 2× slower construction than `Linear/λ=3.0`, 20% less space.
|
||
|
||
---
|
||
|
||
## Query path
|
||
|
||
```rust
|
||
pub fn query(&self, kmer: CanonicalKmer) -> Option<Hit<D::Item>> {
|
||
self.mphf.find(kmer).map(|slot| Hit { slot, data: self.data.read(slot) })
|
||
}
|
||
```
|
||
|
||
`MphfLayer::find` probes the MPHF, decodes evidence, and verifies the kmer — returning `Some(slot)` on match, `None` otherwise. `data.read(slot)` is called only on a confirmed hit.
|
||
|
||
In `LayeredMap`, layers are probed in order; the first match wins. Expected probe depth: 1 for kmers in layer 0.
|
||
|
||
---
|
||
|
||
## Add-layer algorithm
|
||
|
||
When adding dataset B to an existing index:
|
||
|
||
1. For each partition, probe existing layers for kmers of B routed to that partition.
|
||
2. Collect kmers absent from all layers → `B \ index`.
|
||
3. Write `B \ index` to a new `unitigs.bin` via `MphfLayer::unitig_writer`.
|
||
4. Call `Layer<D>::build` on the new directory.
|
||
5. Update `meta.json`.
|
||
|
||
Each partition's new layer is built independently; the operation is fully parallel across partitions.
|
||
|
||
---
|
||
|
||
## Dependencies
|
||
|
||
| crate | role |
|
||
|---|---|
|
||
| `ptr_hash 1.1` | MPHF per layer |
|
||
| `cacheline-ef 1.1` | compact remap inside ptr_hash |
|
||
| `epserde 0.8` | zero-copy MPHF serialisation |
|
||
| `memmap2 0.9` | mmap of evidence and payload files |
|
||
| `obiskio` | unitig file writer/reader |
|
||
| `obicompactvec` | payload types + aggregation traits |
|
||
| `rayon 1` | parallel MPHF construction pass |
|
||
| `ndarray 0.16` | aggregation output arrays |
|