Files
obikmer/docmd/implementation/obilayeredmap.md
T
Eric Coissac f48f7500cd refactor(obilayeredmap): support generic payload types
Replace the hardcoded `Counts` module with a generic `LayerData` trait, parameterizing `Layer` and `LayeredMap` over arbitrary payload types. This decouples read-path access from build-path logic, enabling both set membership and count-based indexing via `PersistentCompactIntVec`. Adds the `obicompactvec` dependency, implements streaming layer construction, and expands test coverage for persistence and multi-layer resolution.
2026-05-14 09:33:18 +08:00

236 lines
11 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# obilayeredmap — layered kmer index crate
## Purpose
`obilayeredmap` implements a persistent, incrementally extensible kmer index. The index is organised in three levels: **collection → partition → layer**. Each layer covers a disjoint kmer set (kmers absent from all earlier layers), wrapping a `ptr_hash` MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.
---
## Four usage modes
The MPHF + evidence infrastructure is fixed for all modes. The **payload** — data associated with each slot — is orthogonal and varies by mode.
| Mode | Description | Payload type | File |
|---|---|---|---|
| 1. Set | membership test only | `()` | — |
| 2. Set with count | occurrences per kmer per sample | `PersistentCompactIntVec` | `counts.pciv` |
| 3. Presence/absence matrix | which genomes contain each kmer | `PersistentBitVec` per genome | `presence_N.pbiv` |
| 4. Count matrix | occurrences per kmer per genome | `PersistentCompactIntVec` per genome | `counts_N.pciv` |
Both `PersistentCompactIntVec` and `PersistentBitVec` come from the `obicompactvec` crate. Modes 3 and 4 are not yet implemented; the per-genome multi-file layout and query API remain to be designed.
### Payload for mode 2: PersistentCompactIntVec
`PersistentCompactIntVec` (PCIV) stores one `u32` count per MPHF slot in a single mmap'd `.pciv` file. Its encoding: a primary `u8` array (value 255 = overflow sentinel) backed by a sorted overflow section of `(slot: u64, value: u32)` entries and a sparse L1-fitting index for fast binary search. This handles the geometric count distribution efficiently — most values fit in 1 byte, overflow entries are rare.
Capacity: 0 to u32::MAX per slot. No separate decision needed on bit-width: PCIV adapts to the data.
### Payload for mode 3/4: PersistentBitVec / PersistentCompactIntVec
`PersistentBitVec` (PBIV) stores one bit per MPHF slot in a mmap'd `.pbiv` file with u64 word-level bulk operations (AND, OR, XOR, NOT, POPCNT, Jaccard, Hamming). One PBIV per genome gives a column-major presence/absence matrix, making per-genome set operations cache-friendly.
Mode 4 replaces PBIV with PCIV per genome. Multi-file layout and query API are not yet designed.
---
## Payload architecture
The payload is orthogonal to the MPHF + evidence layer. `Layer` is parameterised by `D: LayerData`:
```rust
pub trait LayerData: Sized {
type Item;
fn open(layer_dir: &Path) -> OLMResult<Self>;
fn read(&self, slot: usize) -> Self::Item;
}
pub struct Layer<D: LayerData = ()> {
mphf: Mphf,
evidence: Evidence,
unitigs: UnitigFileReader,
data: D,
}
pub struct Hit<T = ()> {
pub slot: usize,
pub data: T,
}
```
`LayerData` covers the **read path only** (`open` + `read`). The write path (build) is intentionally not in the trait — build signatures differ between modes (mode 1 takes no extra argument, mode 2 takes a `count_of` closure) and forcing this into a trait would require an associated `Context` type with no benefit over specialized `impl` blocks.
Implemented concrete types:
| Type | `Item` | Description |
|---|---|---|
| `()` | `()` | mode 1 — membership only |
| `PersistentCompactIntVec` | `u32` | mode 2 — per-slot count |
`LayeredMap` mirrors the same parameterisation: `LayeredMap<D: LayerData = ()>`.
---
## Three-level hierarchy
```
index_root/ ← LayeredMap (collection)
meta.json
part_00000/ ← Partition
layer_0/ ← Layer
mphf.bin
unitigs.bin
unitigs.bin.idx
evidence.bin
counts.pciv [mode 2 only]
presence_N.pbiv [mode 3/4, one per genome — not yet implemented]
layer_1/
...
part_00001/
layer_0/
...
```
**Collection** (`index_root/`): global metadata — kmer size k, number of partitions, layer count, sample registry.
**Partition** (`part_XXXXX/`): one directory per hash bucket. All kmers whose canonical minimiser hashes to bucket X land in `part_XXXXX`. Partitions are independent and can be processed in parallel. The partition count and routing scheme (minimiser → bucket) are fixed at collection creation and recorded in `meta.json`.
**Layer** (`layer_N/`): within a partition, a layer is the MPHF and its associated data for one dataset addition. Layer 0 is built from the first dataset A; layer 1 covers kmers in B not present in layer 0; and so on. Layers within a partition are disjoint: each kmer belongs to exactly one layer.
---
## Layer file layout
```
layer_N/
mphf.bin — ptr_hash MPHF (epserde, ptr_hash native format)
unitigs.bin — packed 2-bit nucleotide sequences (obiskio binary format)
unitigs.bin.idx — UIDX index: n_unitigs, n_kmers, seqls[], packed_offsets[]
evidence.bin — u32 per MPHF slot: (unitig_id: 25 | rank: 7)
counts.pciv — [mode 2] PersistentCompactIntVec: one u32 per slot
```
`unitigs.bin` is the packed-2-bit sequence file produced by `obiskio::UnitigFileWriter`. The companion `.idx` file stores: magic `UIDX`, `n_unitigs: u32`, `n_kmers: u64`, `seqls: [u8; n_unitigs]` (kmer count 1 per chunk), and `packed_offsets: [u32; n_unitigs + 1]` (byte offsets into `unitigs.bin`, sentinel-terminated). This gives O(1) random access to any unitig and the total kmer count without scanning the sequence file.
### Evidence encoding
Evidence maps each MPHF slot to its kmer's location in the unitig file. It serves two roles: membership verification (ptr_hash maps any input to a valid slot; decoding evidence and comparing to the query detects absent keys) and kmer reconstruction.
```
slot s → unitig_id: u25 | rank: u7
```
Packed into a `u32` (29 bits used, 3 spare). Decoding:
```
kmer = unitigs[unitig_id][rank .. rank + k] // 2-bit packed slice
```
`rank` is the kmer's 0-based index within the unitig (kmer units, not nucleotides). For k=31, m=11, the structural maximum is k m + 1 = 21 kmers per unitig; the empirical maximum observed is ~46 kmers. A `u7` (0127) is sufficient.
---
## ptr_hash configuration
The MPHF per layer is configured as:
```rust
type Mphf = PtrHash<
u64, // key type: canonical kmer raw encoding
CubicEps, // bucket fn: balanced (2.4 bits/key, λ=3.5)
CachelineEfVec<Vec<CachelineEf>>, // remap: 11.6 bits/entry vs 32 for Vec<u32>
Xx64, // hasher: XXH3-64 with seed, handles structured keys
Vec<u8>, // pilots
>;
```
**Hasher choice — `Xx64`:** k-mer raw values are left-aligned u64 with structural zeros in low bits (42 zeros for k=11, 2 zeros for k=31). `FxHash` (single multiply) distributes these poorly. `Xx64` (XXH3 64-bit, seeded) handles structured input correctly.
**Bucket function — `CubicEps` with `PtrHashParams::<CubicEps>::default()`:** λ=3.5, α=0.99. Balanced tradeoff: 2× slower construction than `Linear/λ=3.0` (the `default_fast` preset), 20% less space. `default_compact` (λ=4.0) saves a further 12.5% at 2× more construction time and reduced reliability — not chosen.
**Remap — `CachelineEfVec`:** Elias-Fano variant packing 44 sorted 40-bit values per 64-byte cacheline (11.6 bits/value vs 32 for `Vec<u32>`). Already a transitive dependency of `ptr_hash`. One cacheline per query vs one u32 read; space win dominates for billion-scale key sets.
---
## Build path
The build path is not part of `LayerData`. Each mode exposes its own `impl Layer<D>::build` with the exact signature it needs. Two private module-level helpers avoid code duplication:
**`build_mphf(out_dir, n) -> OLMResult<Mphf>`**: first pass — opens `unitigs.bin`, iterates all canonical kmers in parallel via `new_from_par_iter`, stores `mphf.bin`. O(n).
**`build_second_pass(out_dir, n, mphf, fill_slot) -> OLMResult<()>`**: second pass — opens `unitigs.bin` again, fills `evidence.bin` and a compact n/8-byte seen-bitset (MPHF correctness check inline), calls `fill_slot(slot, kmer)` once per kmer for the mode-specific payload. O(n).
```rust
// mode 1
impl Layer<()> {
pub fn build(out_dir: &Path) -> OLMResult<usize>
}
// mode 2
impl Layer<PersistentCompactIntVec> {
pub fn build(out_dir: &Path, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
pub fn build_from_map(out_dir: &Path, counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
}
```
Any duplicate slot or out-of-bounds index detected during `build_second_pass` returns `OLMError::Mphf`. `new_from_par_iter` avoids materialising all keys as `Vec<u64>`.
---
## Query path
A kmer query routes through all three levels:
1. **Partition routing**: hash canonical minimiser of the query kmer → partition index → open `part_XXXXX/`.
2. **Layer probing**: iterate layers in order; for each layer compute `slot = mphf.index(kmer)`, decode evidence, compare to query. First match wins.
3. **Data access**: `layer.data.read(slot)` returns `D::Item`.
```rust
// pseudo-code
fn query(kmer) -> Option<(usize, Hit<D::Item>)>:
for (i, layer) in self.layers.iter().enumerate():
slot = layer.mphf.index(&kmer.raw())
if layer.evidence.decode(slot) == kmer:
return Some((i, Hit { slot, data: layer.data.read(slot) }))
return None
```
Expected probe depth: 1 for kmers in layer 0, increasing for later layers.
---
## Add-layer algorithm
When adding dataset B to an existing index:
1. For each partition, iterate kmers of B routed to that partition.
2. Probe existing layers; collect kmers absent from all layers → `B \ index`.
3. Build a new layer from `B \ index`.
4. Append the new layer directory under each `part_XXXXX/`.
5. Update `meta.json` (layer count, sample registry).
Each partition's new layer is built independently; the operation is fully parallel across partitions.
---
## Dependencies
| crate | role |
|---|---|
| `ptr_hash 1.1` | MPHF per layer (epserde serialisation) |
| `cacheline-ef 1.1` | compact remap storage inside ptr_hash |
| `epserde 0.8` | zero-copy serialisation of MPHF |
| `memmap2` | mmap of layer files |
| `obiskio` | unitig file writer/reader |
| `obicompactvec` | payload types: `PersistentCompactIntVec`, `PersistentBitVec` |
---
## Open questions
- **Mode 3/4 multi-file layout**: one PBIV/PCIV per genome per layer means O(n_layers × n_genomes) files. Directory layout, open strategy, and query API are not yet designed.
- **Mode 4 scale**: count matrix (n_kmers × n_genomes × bytes_per_count) reaches hundreds of GB for large collections. A sparse representation may be required; access pattern and density threshold are not yet defined.
- **Presence matrix layout**: column-major (one PBIV per genome) favours per-genome operations; row-major favours per-kmer queries. Dominant access pattern not yet characterised.
- **Layer merge**: merging two `LayeredMap` instances into a single-layer index requires full rebuild. Define API and cost model.
- **Canonical kmer orientation**: evidence stores canonical kmer; strand recovery requires one 64-bit revcomp comparison at query time.
- **`try_new_from_par_iter`**: `ptr_hash::new_from_par_iter` silently discards construction failure. Post-construction verification (current workaround) is correct but does not allow retry. A `try_new_from_par_iter` PR upstream would close this gap.