Files
obikmer/docmd/implementation/unitig_evidence.md
T
Eric Coissac 036d044291 refactor: update core types and add approximate evidence support
Refactor `Kmer`, `SuperKmer`, and chunk reader into optimized, generic representations with compile-time length parameters and bitwise operations. Update the pipeline and scheduler to support batch processing, 1→N flat transformations, and multi-source merging. Introduce an approximate evidence mode using b-bit fingerprints and `.idx` files, alongside existing exact mode. Update CLI documentation, minimizer selection, and query output schema accordingly.
2026-05-26 10:04:25 +02:00

182 lines
9.1 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Unitig-based MPHF evidence encoding
## Role of unitigs in the index
The MPHF maps each canonical kmer to an integer slot but provides no inverse: a slot index alone cannot reconstruct the kmer. The **evidence file** supplies this inverse: for each MPHF slot it stores a pointer into the unitig sequence file, from which k nucleotides can be extracted.
Unitigs are the natural compact representation: a run of L nucleotides encodes L k + 1 consecutive canonical kmers. The entire kmer set of a partition is reconstructible from its unitig binary file.
---
## Binary file formats
### `unitigs.bin` — sequence chunks
A sequence of binary records. Each record:
```
[u8: seql k] [ceil(seql / 4) bytes: 2-bit packed nucleotides]
```
- `seql k` (0255): nucleotide length minus k, so `seql = byte[0] + k` and `n_kmers = byte[0] + 1`.
- Packed nucleotides: A=00, C=01, G=10, T=11, MSB-first within each byte; last byte zero-padded.
- Byte count for packed sequence: `ceil(seql / 4)`.
Unitigs with more than `MAX_KMERS_PER_CHUNK = 256` k-mers are transparently split into overlapping chunks. Each chunk has at most 256 k-mers (= `seql k + 1 ≤ 256`); consecutive chunks overlap by k1 nucleotides so no kmer is lost:
```
chunk 1: nucleotides [0, MAX_KMERS_PER_CHUNK + k 2] (256 kmers)
chunk 2: nucleotides [256, end] (remaining kmers)
overlap: k1 nucleotides shared between the two chunks
```
### `unitigs.bin.idx` — block-sampled offset index
```
magic : 4 bytes = "UIX3"
block_bits: u32 LE — granularity parameter (031)
n_unitigs : u32 LE — total number of chunks in unitigs.bin
n_kmers : u64 LE — total number of kmers across all chunks
offsets : [u32 LE] — byte offsets into unitigs.bin, one per 2^block_bits chunks + sentinel
```
One offset entry is stored every `2^block_bits` chunks; the array is sentinel-terminated (last entry = file size). `DEFAULT_BLOCK_BITS = 0` stores one offset per chunk (exact table, no scan).
### `evidence.bin` — per-slot MPHF evidence
A flat array of u32 values, one per MPHF slot, no header:
```
bits [31:7] = chunk_id (25 bits)
bits [6:0] = rank (7 bits, 0127)
```
File size = `n_slots × 4` bytes. `chunk_id` is the 0-based index of the record in `unitigs.bin`; `rank` is the position of the canonical kmer within that chunk (counting only canonical kmers). Encoding: `raw = (chunk_id << 7) | (rank & 0x7F)`. Decoding: `chunk_id = raw >> 7`, `rank = raw & 0x7F`.
---
## Building and reading the index
### `build_unitig_idx(path, block_bits)`
Scans `unitigs.bin` sequentially: for each chunk at byte offset `offset`, if `chunk_count & mask == 0` (where `mask = (1 << block_bits) 1`), appends `offset as u32` to `block_offsets`. After the scan, appends a sentinel (= total file size), then writes the `.idx` file. Called after the unitig file is fully written and closed.
### `open()` vs `open_sequential()`
`UnitigFileReader::open(path)` loads the `.idx` file into `block_offsets: Vec<u32>` and memory-maps `unitigs.bin`. Enables random access via `chunk_start(i)`, `unitig(i)`, `raw_kmer(i, j)`, and `verify_canonical_kmer(i, j, q)`.
`UnitigFileReader::open_sequential(path)` does not read `.idx`. It scans `unitigs.bin` once to count chunks and kmers, then leaves `block_offsets` empty. Only sequential iterators work: `iter_unitigs`, `iter_kmers`, `iter_canonical_kmers`, `iter_indexed_canonical_kmers`. Any call to `chunk_start()` panics with a diagnostic message.
### `chunk_start(i)` — random access
```rust
fn chunk_start(&self, i: usize) -> usize {
// block_bits=0: single table lookup, O(1) — hot path
if self.block_bits == 0 {
return self.block_offsets[i] as usize;
}
// block_bits>0: lookup block, then scan at most 2^block_bits 1 records
let block = i >> self.block_bits;
let rem = i & self.mask;
let mut offset = self.block_offsets[block] as usize;
for _ in 0..rem {
let seql_minus_k = self.mmap[offset] as usize;
offset += 1 + (seql_minus_k + self.k + 3) / 4;
}
offset
}
```
With `block_bits = 0` (the default), every chunk has a direct entry in `block_offsets`: lookup is a single array index, O(1), with no sequential scan. The `if self.block_bits == 0` branch is explicit in the code and handles this hot path first.
With `block_bits > 0`, one offset covers `2^block_bits` consecutive chunks; access cost is O(`2^block_bits`) sequential mmap reads.
### Decoding a kmer from slot `s`
```rust
let (chunk_id, rank) = evidence.decode(s); // u32 → (chunk_id: u32, rank: u8)
let kmer = unitigs.raw_kmer(chunk_id, rank); // 2-bit packed slice → left-aligned u64
```
Two memory accesses: one 4-byte read from `evidence.bin`, one packed-bit extraction from `unitigs.bin` via the mmap. The retrieved sequence is already canonical (only canonical kmers are inserted into the De Bruijn graph).
---
## Field widths and capacity
| field | bits | range | capacity check (*B. nana*, 256 partitions) |
|------------|------|---------------|---------------------------------------------|
| `seql k` | 8 | 0255 | max `n_kmers` per chunk = 256 = `MAX_KMERS_PER_CHUNK` |
| `rank` | 7 | 0127 | observed max ~46 kmers/chunk; structural max km+1 = 21 |
| `chunk_id` | 25 | 033 554 431 | avg U ≈ 275 k chunks/partition |
The rank field is 7 bits (max 127) even though chunks can contain up to 256 k-mers, because rank counts only canonical kmers within the chunk, and the canonical kmer count is at most half the total.
---
## Evidence bit-cost
Strategy B (chunk_id + rank) is the implemented strategy. For *B. nana* (k=31, 256 partitions, P ≈ 10.4 M unique kmers/partition, U ≈ 275 k chunks/partition, m_u ≈ 37.9 kmers/chunk):
| field | theoretical cost | value |
|------------|-------------------------|---------|
| chunk_id | ⌈log₂ U⌉ | 19 bits |
| rank | ⌈log₂ m_u⌉ (≈ fixed) | 6 bits |
| **stored** | aligned u32 | **32 bits/slot** |
The u32 layout is chosen for alignment and simplicity; no bit-addressing arithmetic is needed.
Comparison with strategy A (global nucleotide offset): `⌈log₂(P · (1 + (k1)/m_u))⌉ = 25 bits`. Strategy A is theoretically 2 bits cheaper; strategy B's advantage is **locality** (decoding touches one chunk's cache lines) and a bounded, constant-width rank field independent of partition size.
---
## Unitig decomposition non-determinism
The unitig extraction from `GraphDeBruijn` is **not deterministic**: two runs on identical input can produce different unitig counts and sequences while covering exactly the same canonical kmer set.
The hash map (`hashbrown::HashMap` with `Xxh3Builder`) has run-dependent iteration order. The `start_iter` first pass emits every node where `can_extend_left` is false — this includes true dead-ends and branch points (nodes with ≥2 left neighbours). When a branch point is encountered before its upstream neighbours, it claims the downstream chain and those upstream neighbours later produce length-k degenerate unitigs. When upstream neighbours appear first, they extend through the branch point.
**Example** — fork topology (k = 31):
```
A → B ← C
D
```
B has two left neighbours, so `can_extend_left = false`. Two valid tilings:
| iteration order | unitigs | count |
|---|---|---|
| A first | ABD, C | 2 |
| B first | BD, A, C | 3 |
Both cover the same 4 canonical kmers. Pure cycles are unaffected: all cycle nodes have both extensions present, so none are emitted in the first pass; each cycle produces exactly one unitig regardless of entry point (only the cut point varies).
This non-determinism is benign for MPHF construction: the MPHF is built from the kmer set, which is identical across tilings.
---
## Partition-size tradeoff
Measured on *B. nana* (k=31, m=11), summing across all partitions:
| N partitions | m_u |
|---|---|
| 1 | 41.89 |
| 16 | 38.19 |
| 256 | 37.90 |
| 1 024 | 37.89 |
`m_u` is set by De Bruijn graph topology (heterozygosity, repeats, sequencing errors), not partition count. The variation from 1 to 1024 partitions is under 10%; within 161024 it is under 1%. Unitigs provide ~3.1× nucleotide compaction over super-kmers at 256 partitions.
Evidence cost decreases by 1 bit/kmer with each doubling of partition count (via `log₂ U = log₂(P/m_u)`). The sequence storage term `2 · (1 + (k1)/m_u) ≈ 3.6 bits/kmer` is approximately constant.
---
## Alternative: fingerprint evidence
`evidence.bin` can be replaced by `fingerprint.bin` at index build time (`--approx`) or after the fact (`reindex --approx`). The fingerprint stores b bits per MPHF slot (the low b bits of `kmer.seq_hash()`); verification becomes a single bitfield comparison instead of a unitig dereference. False-positive rate per k-mer query: 1/2^b. With the Findere z parameter, z consecutive k-mers must all match, reducing the effective window FP rate to 1/2^(b·z) while skipping z1 of every z k-mers. No `.idx` file is written or read in approx mode.
See [Approximate evidence (Findere fingerprint)](evidence_elimination.md) for the full design and CLI parameters.