98c14aade9
Centralizes index configuration by storing a single `IndexMode` (`Exact`, `Approx`, or `Hybrid`) in `PartitionMeta`, eliminating per-layer metadata files. Introduces a `Hybrid` evidence mode and an `--approx` CLI flag to toggle between exact and probabilistic indexing. Refactors the build and query pipelines to dynamically dispatch based on the configured mode, deferring `.idx` generation to Pass 2 and only requiring it for Exact/Hybrid modes. Updates layer opening to load appropriate data structures, enforces strict parameter validation during merges, and clarifies performance trade-offs in documentation.
397 lines
14 KiB
Markdown
397 lines
14 KiB
Markdown
# obilayeredmap — layered kmer index crate
|
||
|
||
## Purpose
|
||
|
||
`obilayeredmap` implements a persistent, incrementally extensible kmer index. Each layer covers a disjoint kmer set and wraps a `ptr_hash` MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.
|
||
|
||
---
|
||
|
||
## Three usage modes
|
||
|
||
The MPHF + evidence infrastructure is the same for all modes. The **payload** varies.
|
||
|
||
| Mode | Description | Payload type | Storage |
|
||
|---|---|---|---|
|
||
| 1. Set | membership test only | `()` | — |
|
||
| 2. Count | occurrences per kmer per sample | `PersistentCompactIntMatrix` | `counts/` directory |
|
||
| 3. Presence/absence | which genomes contain each kmer | `PersistentBitMatrix` | `presence/` directory |
|
||
|
||
Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` come from the `obicompactvec` crate.
|
||
|
||
---
|
||
|
||
## Index mode (homogeneity invariant)
|
||
|
||
A partitioned index is homogeneous: every layer within a partition shares the same mode. The mode is determined once at `LayeredMap::open()` from `PartitionMeta.mode` and passed to each `Layer::open()` — no per-layer file is read.
|
||
|
||
```rust
|
||
#[derive(Serialize, Deserialize, Default)]
|
||
#[serde(tag = "type", rename_all = "snake_case")]
|
||
pub enum IndexMode {
|
||
#[default]
|
||
Exact,
|
||
Approx { b: u8, z: u8 },
|
||
Hybrid { b: u8, z: u8 },
|
||
}
|
||
```
|
||
|
||
`IndexMode` is stored once in `PartitionMeta` (`meta.json` at partition root). There is no `layer_meta.json`.
|
||
|
||
- **Exact**: writes `evidence.bin` + `unitigs.bin.idx`. Zero false positives.
|
||
- **Approx**: writes `fingerprint.bin` only. FP rate per kmer = 1/2^b; with Findere z-parameter, z consecutive kmers must all match → effective window FP ≈ 1/2^(b·z). No `.idx` written or required.
|
||
- **Hybrid**: writes both `fingerprint.bin` and `evidence.bin` + `.idx`. `find()` uses the fingerprint (fast, O(1)); `find_strict()` uses exact evidence.
|
||
|
||
---
|
||
|
||
## MphfLayer — autonomous kmer → slot mapping
|
||
|
||
`MphfLayer` encapsulates the MPHF and evidence store for one layer. It is independent of any payload.
|
||
|
||
```rust
|
||
pub struct MphfLayer {
|
||
mphf: Mphf,
|
||
ev: LayerEvidence, // loaded at open() time
|
||
n: usize,
|
||
}
|
||
```
|
||
|
||
`LayerEvidence` is an internal enum, not public:
|
||
|
||
```rust
|
||
enum LayerEvidence {
|
||
Exact { evidence: Evidence, unitigs: UnitigFileReader },
|
||
Approx { fingerprint: FingerprintVec, unitigs_path: PathBuf },
|
||
Hybrid { evidence: Evidence, unitigs: UnitigFileReader, fingerprint: FingerprintVec },
|
||
}
|
||
```
|
||
|
||
`MphfLayer::open(dir, mode: &IndexMode)` receives the mode from `PartitionMeta` — no per-layer file is read.
|
||
|
||
### Query API
|
||
|
||
Two public query methods, both returning `Option<usize>` (slot index):
|
||
|
||
```rust
|
||
pub fn find(&self, kmer: CanonicalKmer) -> Option<usize>
|
||
pub fn find_strict(&self, kmer: CanonicalKmer) -> Option<usize>
|
||
```
|
||
|
||
- `find`: O(1) auto-dispatch. Exact/Hybrid → exact evidence check. Approx/Hybrid → fingerprint comparison.
|
||
- `find_strict`: always exact. Exact/Hybrid → O(1) evidence check. Approx → O(n) sequential scan (no `.idx`).
|
||
|
||
There are no `find_exact`/`find_approx` methods; panicking dispatch is eliminated.
|
||
|
||
### Build surface
|
||
|
||
```rust
|
||
// Full MPHF + evidence build (two-pass)
|
||
pub(crate) fn build(dir, block_bits, mode: &IndexMode, fill_slot) -> OLMResult<usize>
|
||
|
||
// Evidence-only post-hoc builds (MPHF already present)
|
||
pub fn build_exact_evidence(dir, block_bits) -> OLMResult<usize>
|
||
pub fn build_approx_evidence(dir, b, z) -> OLMResult<usize>
|
||
```
|
||
|
||
`MphfLayer::build` runs two passes over `unitigs.bin`:
|
||
|
||
1. **Pass 1** (parallel via rayon): a `CanonicalKmerIter` (clonable, `Arc<Mmap>`, no file reopening) is passed to `new_from_par_iter` via `par_bridge()`. Produces `mphf.bin`. No `.idx` is read or created at this stage.
|
||
2. **Pass 2** (sequential): fill evidence files; call `fill_slot(slot, kmer)` per kmer. `.idx` is written last for Exact/Hybrid modes (query-time only).
|
||
|
||
There is no `build_evidence` dispatch wrapper — callers invoke `build_exact_evidence` or `build_approx_evidence` directly.
|
||
|
||
For empty layers (n = 0), all build variants return `Ok(0)` immediately after creating empty output files.
|
||
|
||
---
|
||
|
||
## Layer\<D: LayerData\> — MPHF + payload
|
||
|
||
`Layer<D>` pairs an `MphfLayer` with one payload store.
|
||
|
||
```rust
|
||
pub trait LayerData: Sized {
|
||
type Item;
|
||
fn open(layer_dir: &Path) -> OLMResult<Self>;
|
||
fn read(&self, slot: usize) -> Self::Item;
|
||
}
|
||
|
||
pub struct Layer<D: LayerData = ()> {
|
||
mphf: MphfLayer,
|
||
data: D,
|
||
}
|
||
|
||
pub struct Hit<T = ()> {
|
||
pub slot: usize,
|
||
pub data: T,
|
||
}
|
||
```
|
||
|
||
`LayerData` covers the **read path only** (`open` + `read`). Build signatures differ between modes and are not part of the trait.
|
||
|
||
| Type | `Item` | Description |
|
||
|---|---|---|
|
||
| `()` | `()` | mode 1 — membership only |
|
||
| `PersistentCompactIntMatrix` | `Box<[u32]>` | mode 2 — count matrix (one u32 per column per slot) |
|
||
| `PersistentBitMatrix` | `Box<[bool]>` | mode 3 — presence matrix (one bit per genome per slot) |
|
||
|
||
### Build signatures
|
||
|
||
```rust
|
||
// mode 1
|
||
impl Layer<()> {
|
||
pub fn build(out_dir: &Path, block_bits: u8, mode: &IndexMode) -> OLMResult<usize>
|
||
}
|
||
|
||
// mode 2
|
||
impl Layer<PersistentCompactIntMatrix> {
|
||
pub fn build(out_dir: &Path, block_bits: u8, mode: &IndexMode,
|
||
count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
|
||
pub fn build_from_map(out_dir: &Path, block_bits: u8, mode: &IndexMode,
|
||
counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
|
||
}
|
||
|
||
// mode 3
|
||
impl Layer<PersistentBitMatrix> {
|
||
pub fn build_presence(out_dir: &Path, block_bits: u8, mode: &IndexMode,
|
||
n_genomes: usize,
|
||
present_in: impl Fn(CanonicalKmer, usize) -> bool) -> OLMResult<usize>
|
||
}
|
||
```
|
||
|
||
All build impls delegate to `MphfLayer::build` via a mode-specific `fill_slot` callback. The `mode` parameter is forwarded directly — no `LayerMeta` is written.
|
||
|
||
Evidence-only post-hoc builds are accessible directly on `Layer<D>`:
|
||
|
||
```rust
|
||
impl<D: LayerData> Layer<D> {
|
||
pub fn build_exact_evidence(layer_dir: &Path, block_bits: u8) -> OLMResult<usize>
|
||
pub fn build_approx_evidence(layer_dir: &Path, b: u8, z: u8) -> OLMResult<usize>
|
||
}
|
||
```
|
||
|
||
There is no `build_evidence` dispatch wrapper.
|
||
|
||
---
|
||
|
||
## FingerprintVec and FingerprintVecWriter
|
||
|
||
Approximate evidence is stored as a packed b-bit array, one fingerprint per MPHF slot.
|
||
|
||
```
|
||
fingerprint.bin format:
|
||
magic: b"FPVF" (4 bytes)
|
||
b: u8 (bits per fingerprint, 1..=64)
|
||
padding: [0u8; 3]
|
||
n: u64 LE (number of slots)
|
||
data: packed bits, ceil(n*b/8) bytes, Lsb0 order
|
||
```
|
||
|
||
```rust
|
||
impl FingerprintVec {
|
||
pub fn open(path: &Path) -> OLMResult<Self>
|
||
pub fn get(&self, slot: usize) -> u64
|
||
pub fn matches(&self, slot: usize, fingerprint: u64) -> bool
|
||
pub fn n(&self) -> usize
|
||
pub fn b(&self) -> u8
|
||
}
|
||
```
|
||
|
||
`matches(slot, hash)` extracts the b-bit fingerprint stored at `slot` and compares it to the low b bits of `hash`. It is the core operation of `find_approx`.
|
||
|
||
---
|
||
|
||
## LayeredMap\<D\> — collection of layers
|
||
|
||
`LayeredMap<D>` wraps `Vec<Layer<D>>` for a single partition directory.
|
||
|
||
```rust
|
||
pub struct LayeredMap<D: LayerData = ()> {
|
||
root: PathBuf,
|
||
meta: PartitionMeta,
|
||
layers: Vec<Layer<D>>,
|
||
}
|
||
```
|
||
|
||
`PartitionMeta` (`meta.json` at the partition root) stores `n_layers`.
|
||
|
||
### Common methods
|
||
|
||
```rust
|
||
pub fn open(root: &Path) -> OLMResult<Self>
|
||
pub fn create(root: &Path, mode: IndexMode) -> OLMResult<Self>
|
||
pub fn n_layers(&self) -> usize
|
||
pub fn layer(&self, i: usize) -> &Layer<D>
|
||
pub fn mode(&self) -> &IndexMode
|
||
pub fn query(&self, kmer: CanonicalKmer) -> Option<(usize, Hit<D::Item>)>
|
||
pub fn next_layer_writer(&self) -> OLMResult<UnitigFileWriter>
|
||
```
|
||
|
||
`open` reads `PartitionMeta` once, extracts `mode`, and passes it to every `Layer::open` — no per-layer file is read. `create` stores the given mode in `PartitionMeta`.
|
||
|
||
`query` probes layers in order and returns `(layer_index, Hit)` on the first match. Expected probe depth: 1 for kmers in layer 0.
|
||
|
||
### push_layer
|
||
|
||
`push_layer` builds the next layer from a `unitigs.bin` already written via `next_layer_writer`, using `DEFAULT_BLOCK_BITS`:
|
||
|
||
```rust
|
||
// mode 1
|
||
impl LayeredMap<()> {
|
||
pub fn push_layer(&mut self) -> OLMResult<usize>
|
||
}
|
||
|
||
// mode 2
|
||
impl LayeredMap<PersistentCompactIntMatrix> {
|
||
pub fn push_layer(&mut self, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
|
||
pub fn push_layer_from_map(&mut self, counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
|
||
}
|
||
```
|
||
|
||
Mode 3 (`PersistentBitMatrix`) has no `push_layer` on `LayeredMap`; callers build directly via `Layer<PersistentBitMatrix>::build_presence`.
|
||
|
||
---
|
||
|
||
## LayeredStore\<S\> and aggregation traits
|
||
|
||
`LayeredStore<S>` is a generic aggregation wrapper over `Vec<S>`. It propagates three traits from `obicompactvec::traits` up the hierarchy via blanket impls:
|
||
|
||
```rust
|
||
pub struct LayeredStore<S>(pub Vec<S>);
|
||
|
||
impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> { … } // Σ col_weights across inner stores
|
||
impl<S: CountPartials> CountPartials for LayeredStore<S> { … } // element-wise Σ partials
|
||
impl<S: BitPartials> BitPartials for LayeredStore<S> { … } // element-wise Σ partials
|
||
```
|
||
|
||
Because blanket impls compose, `LayeredStore<LayeredStore<S>>` automatically inherits all three traits when `S` does — providing the partitioned level without a separate type.
|
||
|
||
**Leaf implementors** (in `obicompactvec`):
|
||
|
||
| Type | Traits |
|
||
|---|---|
|
||
| `PersistentCompactIntMatrix` | `ColumnWeights` (via `sum()`) + `CountPartials` |
|
||
| `PersistentBitMatrix` | `ColumnWeights` (via `count_ones()`) + `BitPartials` |
|
||
|
||
See [Kmer index architecture](../architecture/index_architecture.md) for the full trait API and the two-pass normalised-metric pattern.
|
||
|
||
---
|
||
|
||
## On-disk structure
|
||
|
||
```
|
||
partition_root/ ← LayeredMap (one partition)
|
||
meta.json — {"n_layers": N, "mode": {"type": "exact"|"approx"|"hybrid", ...}}
|
||
layer_0/ ← Layer
|
||
mphf.bin — ptr_hash MPHF (epserde format)
|
||
unitigs.bin — packed 2-bit nucleotide sequences
|
||
unitigs.bin.idx — UIDX index (Exact/Hybrid only; query-time, never built during MPHF construction)
|
||
evidence.bin — [u32; n], LE (Exact/Hybrid only)
|
||
fingerprint.bin — packed b-bit array (Approx/Hybrid only)
|
||
counts/ [mode 2] PersistentCompactIntMatrix
|
||
meta.json
|
||
col_000000.pciv
|
||
presence/ [mode 3] PersistentBitMatrix
|
||
meta.json
|
||
col_000000.pbiv …
|
||
layer_1/
|
||
…
|
||
```
|
||
|
||
There is no `layer_meta.json`. The mode is stored once in `PartitionMeta` and is valid for all layers. `unitigs.bin.idx` is built at the end of `build_exact_evidence` — never during MPHF construction — and is consumed at query time only.
|
||
|
||
---
|
||
|
||
## Evidence encoding (exact)
|
||
|
||
`evidence.bin` is a flat `[u32; n]` array with no header. Each u32 encodes one slot:
|
||
|
||
```
|
||
bits [31:7] = chunk_id (25 bits) — index of the unitig chunk
|
||
bits [6:0] = rank (7 bits) — kmer index within the chunk (0-based)
|
||
```
|
||
|
||
`chunk_id = raw >> 7`, `rank = raw & 0x7F`. Reconstructing the kmer: read k nucleotides at position `rank` within unitig `chunk_id` (requires `unitigs.bin.idx` for random access).
|
||
|
||
For k=31, m=11, the observed maximum is ~46 kmers per chunk — well within the 127-kmer u7 capacity.
|
||
|
||
---
|
||
|
||
## ptr_hash configuration
|
||
|
||
```rust
|
||
type Mphf = PtrHash<
|
||
u64, // key type: canonical kmer raw encoding
|
||
CubicEps, // bucket fn: 2.4 bits/key, λ=3.5, α=0.99
|
||
CachelineEfVec<Vec<CachelineEf>>, // remap: Elias-Fano
|
||
Xx64, // hasher: XXH3-64 with seed
|
||
Vec<u8>, // pilots
|
||
>;
|
||
```
|
||
|
||
`Xx64` is chosen over `FxHash` because canonical kmer raw values are left-aligned u64 with structural zeros in the low bits (42 zeros for k=11, 2 zeros for k=31), which single-multiply hashes distribute poorly.
|
||
|
||
`CubicEps` with `PtrHashParams::<CubicEps>::default()` (λ=3.5): 2× slower construction than `Linear/λ=3.0`, ~20% less space.
|
||
|
||
---
|
||
|
||
## Column append and merge support
|
||
|
||
These methods extend existing layers with new genome columns without touching the MPHF.
|
||
|
||
### Layer-level genome column append
|
||
|
||
```rust
|
||
impl Layer<PersistentBitMatrix> {
|
||
pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> bool) -> OLMResult<()>
|
||
}
|
||
|
||
impl Layer<PersistentCompactIntMatrix> {
|
||
pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> u32) -> OLMResult<()>
|
||
}
|
||
```
|
||
|
||
Both delegate to the corresponding `PersistentBitMatrix::append_column` / `PersistentCompactIntMatrix::append_column`. They write a new column file (`col_NNNNNN.pbiv` / `col_NNNNNN.pciv`) and update `meta.json` to increment `n_cols`. `value_of` is called once per slot (0..n).
|
||
|
||
### Presence matrix initialisation
|
||
|
||
```rust
|
||
impl Layer<()> {
|
||
pub fn init_presence_matrix(layer_dir: &Path, n_kmers: usize) -> OLMResult<()>
|
||
}
|
||
```
|
||
|
||
Called on the first merge of a Presence-mode index. Creates `presence/` with `meta.json {"n": n_kmers, "n_cols": 1}` and `col_000000.pbiv` set entirely to `true`. This retroactively records genome 0 (the original source) as present in every slot, satisfying the column-count invariant before any new-source column is appended.
|
||
|
||
### Why the MPHF is never rebuilt
|
||
|
||
The MPHF, evidence, and unitigs are built once from the kmer set of a layer and are immutable for the lifetime of that layer. Adding a genome column does not change the kmer set — it only appends a new data column indexed by the same slot numbers. The only disk writes are one new `.pciv`/`.pbiv` file and a single `meta.json` update.
|
||
|
||
---
|
||
|
||
## Add-layer algorithm
|
||
|
||
When adding dataset B to an existing index:
|
||
|
||
1. For each partition, probe existing layers for kmers of B routed to that partition.
|
||
2. Collect kmers absent from all layers → `B \ index`.
|
||
3. Write `B \ index` to a new `unitigs.bin` via `next_layer_writer()`.
|
||
4. Call `Layer<D>::build` (or `build_presence`) on the new layer directory.
|
||
5. Call `push_layer` (or `append_layer`) to register the new layer in `meta.json`.
|
||
|
||
Each partition's new layer is built independently; the operation is fully parallel across partitions.
|
||
|
||
---
|
||
|
||
## Dependencies
|
||
|
||
| crate | role |
|
||
|---|---|
|
||
| `ptr_hash 1.1` | MPHF per layer |
|
||
| `cacheline-ef 1.1` | compact remap inside ptr_hash |
|
||
| `epserde 0.8` | zero-copy MPHF serialisation |
|
||
| `memmap2 0.9` | mmap of evidence and fingerprint files |
|
||
| `bitvec` | packed b-bit fingerprint storage |
|
||
| `obiskio` | unitig file writer/reader + `.idx` build |
|
||
| `obicompactvec` | payload types + aggregation traits |
|
||
| `rayon 1` | parallel MPHF construction pass |
|
||
| `serde / serde_json` | `PartitionMeta` serialisation |
|