Files
obikmer/docmd/implementation/obilayeredmap.md
T
Eric Coissac 5169f65dc9 feat: implement persistent layered index and chunked binary format
Introduce the `obilayeredmap` specification and persistent MPHF-based index architecture for incremental multi-dataset indexing. Implement chunked binary serialization with a fixed `u8` k-mer count limit (256) and overlapping super-kmer segments. Add memory-mapped I/O and a companion `.idx` index file for allocation-free, O(1) unitig access. Update MkDocs navigation, enhance the k-mer comparison script, and add comprehensive tests for serialization, partitioning, and file I/O pipelines.
2026-05-09 17:38:29 +08:00

169 lines
7.1 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# obilayeredmap — layered kmer index crate
## Purpose
`obilayeredmap` implements a persistent, incrementally extensible kmer index. The index is organised in three levels: **collection → partition → layer**. Each layer covers a disjoint kmer set (kmers absent from all earlier layers), wrapping a `ptr_hash` MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.
---
## Three-level hierarchy
```
index_root/ ← LayeredMap (collection)
meta.json
part_00000/ ← Partition
layer_0/ ← Layer
mphf.bin
unitigs.bin
evidence.bin
counts.bin
presence.bin
layer_1/
...
part_00001/
layer_0/
layer_1/
...
```
**Collection** (`index_root/`): global metadata — kmer size k, number of partitions, layer count, sample registry.
**Partition** (`part_XXXXX/`): one directory per hash bucket. All kmers whose canonical minimiser hashes to bucket X land in `part_XXXXX`. Partitions are independent and can be processed in parallel. The partition count and routing scheme (minimiser → bucket) are fixed at collection creation and recorded in `meta.json`.
**Layer** (`layer_N/`): within a partition, a layer is the MPHF and its associated data for one dataset addition. Layer 0 is built from the first dataset A; layer 1 covers kmers in B not present in layer 0; and so on. Layers within a partition are disjoint: each kmer belongs to exactly one layer.
---
## Layer file layout
```
layer_N/
mphf.bin — ptr_hash MPHF (exact key count, phase-2 construction)
unitigs.bin — packed 2-bit nucleotide sequences (concatenated, variable-length)
unitig_offsets.bin — u32 per unitig: nucleotide offset of unitig j in unitigs.bin
evidence.bin — u32 per MPHF slot: (unitig_id: 25 | rank: 7)
counts.bin — u32 per MPHF slot (total kmer occurrences)
presence.bin — bit matrix: n_slots × n_samples [optional]
```
Unitigs have variable lengths. Each record in `unitigs.bin` is self-delimiting: it begins with a varint `seql` (sequence length in nucleotides) followed by `(seql+3)/4` packed bytes — the streaming format defined in `obiskio`. Sequential scan is always possible using the in-record `seql`.
For O(1) random access, `unitig_offsets.bin` is a **precomputed derived index**: a u32 array of byte offsets into `unitigs.bin`, with n_unitigs + 1 entries (sentinel = total byte size). Built once at construction by a single sequential scan; reconstructible from `unitigs.bin` if lost. Access: `unitigs.bin[offsets[j] .. offsets[j+1]]`.
All files except `mphf.bin` are flat arrays of fixed-size elements, serialised with **rkyv** for zero-copy mmap access. `mphf.bin` uses ptr_hash's native serialisation; rkyv integration is deferred (see open questions).
### Evidence encoding
Evidence maps each MPHF slot to its kmer's location in the unitig file. It serves two roles: membership verification (ptr_hash maps any input to a valid slot; decoding evidence and comparing to the query detects absent keys) and kmer reconstruction.
```
slot s → unitig_id: u25 | rank: u7
```
Packed into a `u32` (29 bits used, 3 spare). Decoding:
```
kmer = unitigs[unitig_id][rank .. rank + k] // 2-bit packed slice
```
`rank` is the kmer's 0-based index within the unitig (kmer units, not nucleotides). For k=31, m=11, the structural maximum is k m + 1 = 21 kmers per unitig; the empirical maximum observed is ~46 kmers. A `u7` (0127) is sufficient.
### Presence/absence matrix
Column-major bit matrix: column j (sample j) is a contiguous `n_slots`-bit array. This layout makes per-sample operations (union, intersection, diff over a column) cache-friendly. For large matrices (e.g. 10 M slots × 1 000 samples ≈ 1.25 GB per partition), rkyv + mmap avoids loading the full matrix at open time.
---
## Query path
A kmer query routes through all three levels:
1. **Partition routing**: hash canonical minimiser of the query kmer → partition index → open `part_XXXXX/`.
2. **Layer probing**: iterate layers in order within the partition; for each layer compute `slot = mphf.query(kmer)`, then verify `evidence.decode(slot) == kmer`. First match wins.
3. **Data access**: read `counts[slot]` and/or `presence[slot]` from the matching layer.
```
fn query(kmer) → Option<Hit>:
part = partition_of(kmer)
for (i, layer) in part.layers.iter().enumerate():
slot = layer.mphf.query(kmer)
if layer.evidence.decode(slot) == kmer:
return Some(Hit { layer: i, slot })
return None
```
Expected probe depth: 1 for kmers in layer 0, increasing for later layers. In practice the dominant dataset should be layer 0.
---
## Add-layer algorithm
When adding dataset B to an existing index:
1. For each partition, iterate kmers of B routed to that partition.
2. Probe existing layers; collect kmers absent from all layers → `B \ index`.
3. Build a new layer from `B \ index` using the two-phase pipeline (FMPHGO provisional → ptr_hash definitive).
4. Append the new layer directory under each `part_XXXXX/`.
5. Update `meta.json` (layer count, sample registry).
Each partition's new layer is built independently; the operation is fully parallel across partitions.
---
## Core API (sketch)
```rust
// Open an existing index
let map = LayeredMap::open(path)?;
// Query a canonical kmer across all partitions and layers
match map.query(kmer) {
Some(hit) => {
let count = hit.count();
let present = hit.presence_row(); // bit slice over samples
}
None => { /* absent */ }
}
// Non-destructive extension with a new dataset
// unitigs produced by the two-phase pipeline, one per partition
let layer_idx = map.add_layer(unitigs_dir, counts_dir, presence_path)?;
```
---
## Dependencies
| crate | role |
|---|---|
| `ptr_hash` | phase-2 MPHF per layer |
| `ph` (FMPHGO) | phase-1 provisional MPHF during layer construction |
| `rkyv` | zero-copy serialisation of flat arrays (evidence, counts, presence) |
| `memmap2` | mmap of layer files |
| `bitm` | bit-packed presence matrix |
---
## Serialisation strategy
All flat arrays use `rkyv::Archive`:
```rust
#[derive(Archive, Serialize, Deserialize)]
struct Evidence { slots: Vec<u32> } // packed (unitig_id: 25 | rank: 7)
#[derive(Archive, Serialize, Deserialize)]
struct Counts { data: Vec<u32> }
```
At open time, each file is mmapped and cast to its archived type — no allocation, no copy. The MPHF is loaded via ptr_hash's own API; a rkyv wrapper is a future refinement.
---
## Open questions
- **ptr_hash + rkyv**: ptr_hash's internals are flat bit arrays; a rkyv-compatible wrapper is structurally feasible. Assess upstream willingness or implement a thin newtype wrapper.
- **Presence matrix layout**: column-major favours per-sample operations; row-major favours per-kmer queries. Decide based on dominant access pattern.
- **Layer merge**: merging two `LayeredMap` instances into a single-layer index requires full rebuild. Define API and cost model; maintenance operation, not query-path.
- **Canonical kmer orientation**: evidence stores canonical kmer; strand recovery requires one 64-bit revcomp comparison at query time.