Eric Coissac da56c3e290 docs: update architecture and storage specs for approximate index
Restructure architecture documentation to reflect the decoupled `MphfLayer` design wrapped by `LayeredStore<S>` and enforce strict multi-genome column invariants. Introduce the approximate index architecture, replacing exact `evidence.bin` with compact `fingerprint.bin` using B-bit fingerprints and z-consecutive k-mer matching. Update CLI flags, add `reindex`/`estimate` workflows, and refactor APIs to support separate exact/approximate evidence handling. Finally, provide a comprehensive on-disk layout specification, including the pipeline state machine, JSON schemas, binary formats, and refined Strategy B unitig evidence details.
2026-05-23 13:54:31 +02:00

# obilayeredmap — layered kmer index crate
## Purpose
`obilayeredmap` implements a persistent, incrementally extensible kmer index. Each layer covers a disjoint kmer set and wraps a `ptr_hash` MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.
---
## Three usage modes
The MPHF + evidence infrastructure is the same for all modes. The **payload** varies.
| Mode | Description | Payload type | Storage |
|---|---|---|---|
| 1. Set | membership test only | `()` | — |
| 2. Count | occurrences per kmer per sample | `PersistentCompactIntMatrix` | `counts/` directory |
| 3. Presence/absence | which genomes contain each kmer | `PersistentBitMatrix` | `presence/` directory |
Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` come from the `obicompactvec` crate.
---
## Evidence kinds
Each layer carries one of two evidence bundles, recorded in `layer_meta.json` at build time:
```rust
pub enum EvidenceKind {
    Exact,
    Approx { b: u8, z: u8 },
}
```
`EvidenceKind` is stored in `LayerMeta` (one per layer directory). `open()` reads it to decide which evidence files to load.
- **Exact**: writes `evidence.bin` + `unitigs.bin.idx`. Zero false positives. Requires random-access `.idx` at query time.
- **Approx**: writes `fingerprint.bin` only. False-positive rate per kmer query = 1/2^b. `z` is the Findere consecutive-kmer parameter: `z` consecutive kmers must all match, reducing the effective FP rate per read to approximately W / 2^(b·z), where W = L - k - z + 2 is the number of windows of `z` consecutive kmers in a read of length L. No `.idx` written or required.
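As a worked illustration of the two rates above, the following sketch evaluates 1/2^b and W / 2^(b·z) for given parameters. The function names are hypothetical and not part of the crate; the formulas are the ones stated in the list above.

```rust
/// Per-kmer false-positive rate of a b-bit fingerprint: 1 / 2^b.
pub fn kmer_fp_rate(b: u8) -> f64 {
    0.5f64.powi(b as i32)
}

/// Number of windows of `z` consecutive kmers in a read of length `read_len`:
/// W = L - k - z + 2, or 0 when the read is too short to hold one window.
pub fn n_windows(read_len: usize, k: usize, z: usize) -> usize {
    (read_len + 2).saturating_sub(k + z)
}

/// Approximate per-read false-positive rate: W / 2^(b*z).
pub fn read_fp_rate(read_len: usize, k: usize, b: u8, z: u8) -> f64 {
    let w = n_windows(read_len, k, z as usize) as f64;
    w * 0.5f64.powi(b as i32 * z as i32)
}
```

For example, with L = 150, k = 31, b = 8 and z = 3 there are W = 118 windows, so the per-read estimate is 118 / 2^24, roughly 7 × 10^-6, against a per-kmer rate of 1/256.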
---
## MphfLayer — autonomous kmer → slot mapping
`MphfLayer` encapsulates the MPHF and evidence store for one layer. It is independent of any payload.
```rust
pub struct MphfLayer {
    mphf: Mphf,
    ev: LayerEvidence, // loaded at open() time
    n: usize,
}
```
`LayerEvidence` is an internal enum, not public:
```rust
enum LayerEvidence {
    Exact { evidence: Evidence, unitigs: UnitigFileReader },
    Approx { fingerprint: FingerprintVec },
}
```
### Query API
Three public query methods, all returning `Option<usize>` (slot index):
```rust
pub fn find(&self, kmer: CanonicalKmer) -> Option<usize>
pub fn find_exact(&self, kmer: CanonicalKmer) -> Option<usize>
pub fn find_approx(&self, kmer: CanonicalKmer) -> Option<usize>
```
- `find` dispatches transparently to `find_exact` or `find_approx` based on the evidence variant loaded at `open()`.
- `find_exact` panics if the layer holds approximate evidence; zero false positives.
- `find_approx` panics if the layer holds exact evidence; FP rate 1/2^b per kmer.
For exact-evidence layers, `open()` requires `unitigs.bin.idx` (random access into unitigs); approximate layers are opened without it. `open_sequential()` on `UnitigFileReader` does not require the `.idx` and is used during build passes.
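The dispatch rules above can be sketched with a toy, in-memory layer. Everything here is a stand-in: `ToyMphfLayer`, `toy_hash`, and the `kmer % n` "MPHF" are assumptions made for illustration, kmers are plain `u64` values, and the real crate verifies exact hits against the unitig store instead of a stored key array.

```rust
/// Toy stand-in for the two evidence variants (not the crate's real types).
enum ToyEvidence {
    /// Exact: the key stored at each slot, so a hit can be fully verified.
    Exact { kmers: Vec<u64> },
    /// Approx: only the low `b` bits of a hash of the key at each slot.
    Approx { fingerprints: Vec<u64>, b: u8 },
}

pub struct ToyMphfLayer {
    ev: ToyEvidence,
    n: usize,
}

/// Hypothetical mixing function, used only for this sketch.
fn toy_hash(kmer: u64) -> u64 {
    kmer.wrapping_mul(0x9E37_79B9_7F4A_7C15).rotate_left(31)
}

fn low_bits(x: u64, b: u8) -> u64 {
    if b >= 64 { x } else { x & ((1u64 << b) - 1) }
}

impl ToyMphfLayer {
    /// Stand-in for the MPHF: `kmer % n`. The key set must be non-empty and
    /// this mapping must be injective on it.
    fn slot_of(&self, kmer: u64) -> usize {
        (kmer % self.n as u64) as usize
    }

    pub fn exact(keys: &[u64]) -> Self {
        let n = keys.len();
        let mut kmers = vec![0u64; n];
        for &k in keys {
            kmers[(k % n as u64) as usize] = k;
        }
        Self { ev: ToyEvidence::Exact { kmers }, n }
    }

    pub fn approx(keys: &[u64], b: u8) -> Self {
        let n = keys.len();
        let mut fingerprints = vec![0u64; n];
        for &k in keys {
            fingerprints[(k % n as u64) as usize] = low_bits(toy_hash(k), b);
        }
        Self { ev: ToyEvidence::Approx { fingerprints, b }, n }
    }

    /// Dispatches on the evidence variant chosen at construction time.
    pub fn find(&self, kmer: u64) -> Option<usize> {
        match &self.ev {
            ToyEvidence::Exact { .. } => self.find_exact(kmer),
            ToyEvidence::Approx { .. } => self.find_approx(kmer),
        }
    }

    /// Zero false positives: the stored key must equal the query.
    pub fn find_exact(&self, kmer: u64) -> Option<usize> {
        let ToyEvidence::Exact { kmers } = &self.ev else {
            panic!("find_exact called on a layer with approximate evidence");
        };
        let slot = self.slot_of(kmer);
        (kmers[slot] == kmer).then_some(slot)
    }

    /// May accept a non-member with probability about 1/2^b.
    pub fn find_approx(&self, kmer: u64) -> Option<usize> {
        let ToyEvidence::Approx { fingerprints, b } = &self.ev else {
            panic!("find_approx called on a layer with exact evidence");
        };
        let slot = self.slot_of(kmer);
        (fingerprints[slot] == low_bits(toy_hash(kmer), *b)).then_some(slot)
    }
}
```

In both variants the MPHF alone cannot reject a non-member, since it maps any key to some slot; the evidence check is what turns a slot into a hit or a miss.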
### Build surface
```rust
// Full MPHF + exact evidence build (two-pass, parallel)
pub(crate) fn build(dir, block_bits, fill_slot) -> OLMResult<usize>
// Evidence-only builds (MPHF already present in dir)
pub fn build_exact_evidence(dir, block_bits) -> OLMResult<usize>
pub fn build_approx_evidence(dir, b, z) -> OLMResult<usize>
pub fn build_evidence(dir, kind, block_bits) -> OLMResult<usize> // dispatch
```
`MphfLayer::build` runs two sequential passes over `unitigs.bin`:
1. **Pass 1** (parallel via rayon): iterate all canonical kmers, construct and store `mphf.bin`. `new_from_par_iter` avoids materialising a full key `Vec`.
2. **Pass 2** (sequential): iterate again, fill `evidence.bin`, call `fill_slot(slot, kmer)` once per kmer for payload population. A compact `n/8`-byte seen-bitset verifies MPHF injectivity inline.
`build` always produces exact evidence. For approximate evidence, use `build_approx_evidence` after MPHF construction.
For empty layers (n = 0), all build variants return `Ok(0)` immediately after creating empty output files.
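The second pass and its inline injectivity check can be sketched as follows. This is a simplified model under stated assumptions: `fill_evidence` is a hypothetical name, `slot_of` stands in for the already-built MPHF, and the toy evidence value is the kmer's position in the input, not the real `(chunk_id, rank)` encoding.

```rust
/// Fill per-slot evidence and verify that `slot_of` is injective on `kmers`,
/// using a seen-bitset of ceil(n/8) bytes (one bit per slot).
pub fn fill_evidence(
    kmers: &[u64],
    slot_of: impl Fn(u64) -> usize,
    mut fill_slot: impl FnMut(usize, u64),
) -> Result<Vec<u32>, String> {
    let n = kmers.len();
    let mut seen = vec![0u8; (n + 7) / 8];
    let mut evidence = vec![0u32; n];

    for (i, &kmer) in kmers.iter().enumerate() {
        let slot = slot_of(kmer);
        if slot >= n {
            return Err(format!("slot {slot} out of range for n = {n}"));
        }
        let (byte, bit) = (slot / 8, 1u8 << (slot % 8));
        if seen[byte] & bit != 0 {
            return Err(format!("slot {slot} assigned twice: mapping is not injective"));
        }
        seen[byte] |= bit;

        evidence[slot] = i as u32; // toy evidence: input position of the kmer
        fill_slot(slot, kmer); // payload hook, called once per kmer
    }
    Ok(evidence)
}
```

An empty input yields `Ok` with an empty vector, which mirrors the n = 0 shortcut described above.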
---
## Layer\<D: LayerData\> — MPHF + payload
`Layer<D>` pairs an `MphfLayer` with one payload store.
```rust
pub trait LayerData: Sized {
    type Item;
    fn open(layer_dir: &Path) -> OLMResult<Self>;
    fn read(&self, slot: usize) -> Self::Item;
}

pub struct Layer<D: LayerData = ()> {
    mphf: MphfLayer,
    data: D,
}

pub struct Hit<T = ()> {
    pub slot: usize,
    pub data: T,
}
```
`LayerData` covers the **read path only** (`open` + `read`). Build signatures differ between modes and are not part of the trait.
| Type | `Item` | Description |
|---|---|---|
| `()` | `()` | mode 1 — membership only |
| `PersistentCompactIntMatrix` | `Box<[u32]>` | mode 2 — count matrix (one u32 per column per slot) |
| `PersistentBitMatrix` | `Box<[bool]>` | mode 3 — presence matrix (one bit per genome per slot) |
### Build signatures
```rust
// mode 1
impl Layer<()> {
    pub fn build(out_dir: &Path, block_bits: u8) -> OLMResult<usize>
}

// mode 2
impl Layer<PersistentCompactIntMatrix> {
    pub fn build(out_dir: &Path, block_bits: u8,
                 count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
    pub fn build_from_map(out_dir: &Path, block_bits: u8,
                          counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
}

// mode 3
impl Layer<PersistentBitMatrix> {
    pub fn build_presence(out_dir: &Path, block_bits: u8,
                          n_genomes: usize,
                          present_in: impl Fn(CanonicalKmer, usize) -> bool) -> OLMResult<usize>
}
```
All build impls delegate MPHF + evidence construction to `MphfLayer::build` via a mode-specific `fill_slot` callback. Modes 2 and 3 pre-read `n_kmers` from `unitigs.bin` via `UnitigFileReader::open_sequential` to size the matrix builder before calling `MphfLayer::build`.
### Evidence build helpers on Layer
```rust
impl<D: LayerData> Layer<D> {
    pub fn build_exact_evidence(layer_dir: &Path, block_bits: u8) -> OLMResult<usize>
    pub fn build_approx_evidence(layer_dir: &Path, b: u8, z: u8) -> OLMResult<usize>
    pub fn build_evidence(layer_dir: &Path, kind: &EvidenceKind, block_bits: u8) -> OLMResult<usize>
}
```
These delegate directly to the corresponding `MphfLayer` methods and are provided so call sites can remain typed at the `Layer<D>` level.
---
## FingerprintVec and FingerprintVecWriter
Approximate evidence is stored as a packed b-bit array, one fingerprint per MPHF slot.
```
fingerprint.bin format:
  magic:   b"FPVF"   (4 bytes)
  b:       u8        (bits per fingerprint, 1..=64)
  padding: [0u8; 3]
  n:       u64 LE    (number of slots)
  data:    packed bits, ceil(n*b/8) bytes, Lsb0 order
```
```rust
impl FingerprintVec {
    pub fn open(path: &Path) -> OLMResult<Self>
    pub fn get(&self, slot: usize) -> u64
    pub fn matches(&self, slot: usize, fingerprint: u64) -> bool
    pub fn n(&self) -> usize
    pub fn b(&self) -> u8
}
```
`matches(slot, hash)` extracts the b-bit fingerprint stored at `slot` and compares it to the low b bits of `hash`. It is the core operation of `find_approx`.
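To make the layout concrete, here is a minimal in-memory sketch of packing and reading b-bit fingerprints with the header described above. It assumes that "Lsb0 order" means bit j of the stream is bit `j % 8` of byte `j / 8`, with each fingerprint's least-significant bit first; the real crate relies on `bitvec` and memory mapping, and `ToyFingerprints` and `pack` are hypothetical names.

```rust
/// Build a buffer with the FPVF header followed by packed b-bit values.
pub fn pack(b: u8, values: &[u64]) -> Vec<u8> {
    let n = values.len();
    let data_len = (n * b as usize + 7) / 8;
    let mut out = Vec::with_capacity(16 + data_len);
    out.extend_from_slice(b"FPVF"); // magic
    out.push(b); // bits per fingerprint
    out.extend_from_slice(&[0u8; 3]); // padding
    out.extend_from_slice(&(n as u64).to_le_bytes()); // slot count
    out.resize(16 + data_len, 0);
    for (slot, &v) in values.iter().enumerate() {
        for i in 0..b as usize {
            if (v >> i) & 1 == 1 {
                let bit = slot * b as usize + i;
                out[16 + bit / 8] |= 1 << (bit % 8);
            }
        }
    }
    out
}

/// Read-only view over such a buffer.
pub struct ToyFingerprints<'a> {
    b: u8,
    n: usize,
    data: &'a [u8],
}

impl<'a> ToyFingerprints<'a> {
    pub fn parse(buf: &'a [u8]) -> Result<Self, String> {
        if buf.len() < 16 || &buf[0..4] != b"FPVF" {
            return Err("bad magic or short header".to_string());
        }
        let b = buf[4];
        if !(1..=64).contains(&b) {
            return Err(format!("invalid b = {b}"));
        }
        let mut le = [0u8; 8];
        le.copy_from_slice(&buf[8..16]);
        let n = u64::from_le_bytes(le) as usize;
        let data = &buf[16..];
        if data.len() < (n * b as usize + 7) / 8 {
            return Err("truncated data".to_string());
        }
        Ok(Self { b, n, data })
    }

    /// Extract the b-bit fingerprint stored at `slot`.
    pub fn get(&self, slot: usize) -> u64 {
        assert!(slot < self.n, "slot out of range");
        let start = slot * self.b as usize;
        let mut v = 0u64;
        for i in 0..self.b as usize {
            let bit = start + i;
            v |= (((self.data[bit / 8] >> (bit % 8)) & 1) as u64) << i;
        }
        v
    }

    /// Compare the stored fingerprint with the low b bits of `hash`.
    pub fn matches(&self, slot: usize, hash: u64) -> bool {
        let mask = if self.b == 64 { u64::MAX } else { (1u64 << self.b) - 1 };
        self.get(slot) == hash & mask
    }
}
```

The bit-by-bit loop favours clarity over speed; a production reader would typically load one or two machine words per lookup.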
---
## LayeredMap\<D\> — collection of layers
`LayeredMap<D>` wraps `Vec<Layer<D>>` for a single partition directory.
```rust
pub struct LayeredMap<D: LayerData = ()> {
    root: PathBuf,
    meta: PartitionMeta,
    layers: Vec<Layer<D>>,
}
```
`PartitionMeta` (`meta.json` at the partition root) stores `n_layers`.
### Common methods
```rust
pub fn open(root: &Path) -> OLMResult<Self>
pub fn create(root: &Path) -> OLMResult<Self>
pub fn n_layers(&self) -> usize
pub fn layer(&self, i: usize) -> &Layer<D>
pub fn query(&self, kmer: CanonicalKmer) -> Option<(usize, Hit<D::Item>)>
pub fn next_layer_writer(&self) -> OLMResult<UnitigFileWriter>
```
`query` probes layers in order and returns `(layer_index, Hit)` on the first match. Expected probe depth: 1 for kmers in layer 0.
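The first-match probing order can be illustrated with a toy model in which each layer is a `HashMap` from kmer to slot. This is only a sketch: real layers answer through the MPHF and evidence check, and `layer_from` and this free-standing `query` are hypothetical helpers.

```rust
use std::collections::HashMap;

/// Toy layer: assigns slots in input order.
pub fn layer_from(kmers: &[u64]) -> HashMap<u64, usize> {
    kmers.iter().enumerate().map(|(slot, &k)| (k, slot)).collect()
}

/// Probe layers in order and return (layer_index, slot) for the first hit.
pub fn query(layers: &[HashMap<u64, usize>], kmer: u64) -> Option<(usize, usize)> {
    layers
        .iter()
        .enumerate()
        .find_map(|(i, layer)| layer.get(&kmer).map(|&slot| (i, slot)))
}
```

Because layers cover disjoint kmer sets, at most one layer can hold a given kmer, so stopping at the first match loses nothing.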
### push_layer
`push_layer` builds the next layer from a `unitigs.bin` already written via `next_layer_writer`, using `DEFAULT_BLOCK_BITS`:
```rust
// mode 1
impl LayeredMap<()> {
    pub fn push_layer(&mut self) -> OLMResult<usize>
}

// mode 2
impl LayeredMap<PersistentCompactIntMatrix> {
    pub fn push_layer(&mut self, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
    pub fn push_layer_from_map(&mut self, counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
}
```
Mode 3 (`PersistentBitMatrix`) has no `push_layer` on `LayeredMap`; callers build directly via `Layer<PersistentBitMatrix>::build_presence`.
---
## LayeredStore\<S\> and aggregation traits
`LayeredStore<S>` is a generic aggregation wrapper over `Vec<S>`. It propagates three traits from `obicompactvec::traits` up the hierarchy via blanket impls:
```rust
pub struct LayeredStore<S>(pub Vec<S>);
impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> { } // Σ col_weights across inner stores
impl<S: CountPartials> CountPartials for LayeredStore<S> { } // element-wise Σ partials
impl<S: BitPartials> BitPartials for LayeredStore<S> { } // element-wise Σ partials
```
Because blanket impls compose, `LayeredStore<LayeredStore<S>>` automatically inherits all three traits when `S` does — providing the partitioned level without a separate type.
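The composition argument can be checked with a small self-contained model. The trait method signature `col_weights(&self) -> Vec<u64>` and the `ToyCounts` leaf are assumptions made for this sketch; the actual trait definitions live in `obicompactvec::traits` and may differ.

```rust
/// Hypothetical, simplified version of the aggregation trait.
pub trait ColumnWeights {
    /// One weight per column.
    fn col_weights(&self) -> Vec<u64>;
}

pub struct LayeredStore<S>(pub Vec<S>);

/// Blanket impl: element-wise sum of the inner stores' column weights.
impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> {
    fn col_weights(&self) -> Vec<u64> {
        let mut total: Vec<u64> = Vec::new();
        for store in &self.0 {
            let w = store.col_weights();
            if total.len() < w.len() {
                total.resize(w.len(), 0);
            }
            for (t, x) in total.iter_mut().zip(w) {
                *t += x;
            }
        }
        total
    }
}

/// Toy leaf: a column-major count matrix held in memory.
pub struct ToyCounts {
    pub cols: Vec<Vec<u32>>,
}

impl ColumnWeights for ToyCounts {
    fn col_weights(&self) -> Vec<u64> {
        self.cols
            .iter()
            .map(|c| c.iter().map(|&x| x as u64).sum::<u64>())
            .collect()
    }
}
```

With this single impl, `LayeredStore<ToyCounts>` models one partition's layers and `LayeredStore<LayeredStore<ToyCounts>>` models a set of partitions, with no extra code for the outer level.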
**Leaf implementors** (in `obicompactvec`):
| Type | Traits |
|---|---|
| `PersistentCompactIntMatrix` | `ColumnWeights` (via `sum()`) + `CountPartials` |
| `PersistentBitMatrix` | `ColumnWeights` (via `count_ones()`) + `BitPartials` |
See [Kmer index architecture](../architecture/index_architecture.md) for the full trait API and the two-pass normalised-metric pattern.
---
## On-disk structure
```
partition_root/                  ← LayeredMap (one partition)
  meta.json                      — {"n_layers": N}
  layer_0/                       ← Layer
    layer_meta.json              — {"type": "exact"} or {"type": "approx", "b": B, "z": Z}
    mphf.bin                     — ptr_hash MPHF (epserde format)
    unitigs.bin                  — packed 2-bit nucleotide sequences
    unitigs.bin.idx              — UIDX index (exact evidence only)
    evidence.bin                 — [u32; n], LE (exact evidence only)
    fingerprint.bin              — packed b-bit array (approx evidence only)
    counts/                      [mode 2] PersistentCompactIntMatrix
      meta.json
      col_000000.pciv
    presence/                    [mode 3] PersistentBitMatrix
      meta.json
      col_000000.pbiv …
  layer_1/
```
`unitigs.bin.idx` is required by `open()` on exact-evidence layers (random access). `open_sequential()` on `UnitigFileReader` omits it and is used during build passes and approx-evidence construction.
---
## Evidence encoding (exact)
`evidence.bin` is a flat `[u32; n]` array with no header. Each u32 encodes one slot:
```
bits [31:7] = chunk_id (25 bits) — index of the unitig chunk
bits [6:0] = rank (7 bits) — kmer index within the chunk (0-based)
```
`chunk_id = raw >> 7`, `rank = raw & 0x7F`. Reconstructing the kmer: read k nucleotides at position `rank` within unitig `chunk_id` (requires `unitigs.bin.idx` for random access).
For k=31, m=11, the observed maximum is ~46 kmers per chunk, well within the 128 kmers (ranks 0 to 127) that the 7-bit rank field can address.
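The bit layout translates directly into a pair of helpers. The names `encode` and `decode` are hypothetical; the shifts and masks follow the layout given above.

```rust
pub const RANK_BITS: u32 = 7;
pub const RANK_MASK: u32 = 0x7F; // 7 low bits
pub const MAX_CHUNK_ID: u32 = (1 << 25) - 1; // 25 high bits

/// Pack (chunk_id, rank) into one u32, or None if either field overflows.
pub fn encode(chunk_id: u32, rank: u32) -> Option<u32> {
    if chunk_id > MAX_CHUNK_ID || rank > RANK_MASK {
        return None;
    }
    Some((chunk_id << RANK_BITS) | rank)
}

/// Unpack a raw evidence word into (chunk_id, rank).
pub fn decode(raw: u32) -> (u32, u32) {
    (raw >> RANK_BITS, raw & RANK_MASK)
}
```

For instance, chunk 3 with rank 5 encodes to 3 × 128 + 5 = 389, and decoding 389 gives back (3, 5).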
---
## ptr_hash configuration
```rust
type Mphf = PtrHash<
    u64,                               // key type: canonical kmer raw encoding
    CubicEps,                          // bucket fn: 2.4 bits/key, λ=3.5, α=0.99
    CachelineEfVec<Vec<CachelineEf>>,  // remap: Elias-Fano
    Xx64,                              // hasher: XXH3-64 with seed
    Vec<u8>,                           // pilots
>;
```
`Xx64` is chosen over `FxHash` because canonical kmer raw values are left-aligned u64 with structural zeros in the low bits (42 zeros for k=11, 2 zeros for k=31), which single-multiply hashes distribute poorly.
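The zero counts quoted above follow from the encoding width: a kmer of length k uses 2k bits, so a left-aligned u64 leaves 64 - 2k zero bits at the bottom. A small sketch, assuming 2 bits per nucleotide and 1 ≤ k ≤ 32 (the helper names are hypothetical):

```rust
/// Number of structurally zero low bits for a left-aligned 2-bit encoding.
pub fn low_zero_bits(k: u32) -> u32 {
    assert!((1..=32).contains(&k), "k must be in 1..=32");
    64 - 2 * k
}

/// Left-align a right-packed kmer value (2k significant bits) in a u64.
pub fn left_align(packed: u64, k: u32) -> u64 {
    packed << low_zero_bits(k)
}
```

Every key for a given k therefore shares the same run of trailing zeros, which is the regularity that a weak hash fails to mix away.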
`CubicEps` with `PtrHashParams::<CubicEps>::default()` (λ=3.5) trades construction time for space: construction is 2× slower than with `Linear` at λ=3.0, for ~20% less space.
---
## Column append and merge support
These methods extend existing layers with new genome columns without touching the MPHF.
### Layer-level genome column append
```rust
impl Layer<PersistentBitMatrix> {
    pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> bool) -> OLMResult<()>
}

impl Layer<PersistentCompactIntMatrix> {
    pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> u32) -> OLMResult<()>
}
```
Both delegate to the corresponding `PersistentBitMatrix::append_column` / `PersistentCompactIntMatrix::append_column`. They write a new column file (`col_NNNNNN.pbiv` / `col_NNNNNN.pciv`) and update `meta.json` to increment `n_cols`. `value_of` is called once per slot (0..n).
### Presence matrix initialisation
```rust
impl Layer<()> {
    pub fn init_presence_matrix(layer_dir: &Path, n_kmers: usize) -> OLMResult<()>
}
```
Called on the first merge of a Presence-mode index. Creates `presence/` with `meta.json {"n": n_kmers, "n_cols": 1}` and `col_000000.pbiv` set entirely to `true`. This retroactively records genome 0 (the original source) as present in every slot, satisfying the column-count invariant before any new-source column is appended.
### Why the MPHF is never rebuilt
The MPHF, evidence, and unitigs are built once from the kmer set of a layer and are immutable for the lifetime of that layer. Adding a genome column does not change the kmer set — it only appends a new data column indexed by the same slot numbers. The only disk writes are one new `.pciv`/`.pbiv` file and a single `meta.json` update.
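The initialise-then-append flow can be modelled in memory. `ToyPresence` is a hypothetical stand-in for the on-disk `presence/` directory: each inner `Vec<bool>` plays the role of one column file, and slot indices are never recomputed.

```rust
/// Toy column-major presence matrix indexed by MPHF slot.
pub struct ToyPresence {
    n: usize,
    cols: Vec<Vec<bool>>,
}

impl ToyPresence {
    /// First merge: one column, all true (genome 0 is present in every slot).
    pub fn init(n_kmers: usize) -> Self {
        Self { n: n_kmers, cols: vec![vec![true; n_kmers]] }
    }

    /// Append one genome column; `value_of` is called once per slot in 0..n.
    pub fn append_column(&mut self, value_of: impl Fn(usize) -> bool) {
        let col: Vec<bool> = (0..self.n).map(value_of).collect();
        self.cols.push(col);
    }

    pub fn n_cols(&self) -> usize {
        self.cols.len()
    }

    /// One flag per genome for the given slot.
    pub fn read(&self, slot: usize) -> Box<[bool]> {
        self.cols.iter().map(|c| c[slot]).collect()
    }
}
```

Appending a column only pushes new data; existing columns and the slot numbering are untouched, which is why the MPHF never needs rebuilding.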
---
## Add-layer algorithm
When adding dataset B to an existing index:
1. For each partition, probe existing layers for kmers of B routed to that partition.
2. Collect kmers absent from all layers → `B \ index`.
3. Write `B \ index` to a new `unitigs.bin` via `next_layer_writer()`.
4. Call `Layer<D>::build` (or `build_presence`) on the new layer directory.
5. Call `push_layer` (or `append_layer`) to register the new layer in `meta.json`.
Each partition's new layer is built independently; the operation is fully parallel across partitions.
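Steps 1 to 3 amount to a set difference against the union of existing layers. A minimal sketch for one partition, with layers modelled as `HashSet<u64>` and a hypothetical `new_layer_kmers` helper (the real code probes through `LayeredMap::query` and writes unitigs, not a `Vec`):

```rust
use std::collections::HashSet;

/// Return the kmers of `dataset` absent from every existing layer,
/// deduplicated and in first-seen order: this is `B \ index`.
pub fn new_layer_kmers(layers: &[HashSet<u64>], dataset: &[u64]) -> Vec<u64> {
    let mut seen = HashSet::new();
    dataset
        .iter()
        .copied()
        .filter(|k| !layers.iter().any(|layer| layer.contains(k)))
        .filter(|k| seen.insert(*k))
        .collect()
}
```

Since the result excludes everything already indexed, the new layer is disjoint from all earlier ones, preserving the invariant that each layer covers a disjoint kmer set.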
---
## Dependencies
| crate | role |
|---|---|
| `ptr_hash 1.1` | MPHF per layer |
| `cacheline-ef 1.1` | compact remap inside ptr_hash |
| `epserde 0.8` | zero-copy MPHF serialisation |
| `memmap2 0.9` | mmap of evidence and fingerprint files |
| `bitvec` | packed b-bit fingerprint storage |
| `obiskio` | unitig file writer/reader + `.idx` build |
| `obicompactvec` | payload types + aggregation traits |
| `rayon 1` | parallel MPHF construction pass |
| `serde / serde_json` | `LayerMeta` + `PartitionMeta` serialisation |