obikmer/docmd/implementation/unitig_evidence.md

# Unitig-based MPHF evidence encoding

## Role of unitigs in the index

The MPHF maps each canonical kmer to an integer slot but provides no inverse: a slot index alone cannot reconstruct the kmer. The **evidence file** supplies this inverse: for each MPHF slot it stores a pointer into the unitig sequence file, from which k nucleotides can be extracted.

Unitigs are the natural compact representation: a run of L nucleotides encodes L − k + 1 consecutive canonical kmers. The entire kmer set of a partition is reconstructible from its unitig binary file.

---

## Binary file formats

### `unitigs.bin` — sequence chunks

A sequence of binary records. Each record:

```
[u8: seql − k]  [ceil(seql / 4) bytes: 2-bit packed nucleotides]
```

- `seql − k` (0–255): nucleotide length minus k, so `seql = byte[0] + k` and `n_kmers = byte[0] + 1`.
- Packed nucleotides: A=00, C=01, G=10, T=11, MSB-first within each byte; last byte zero-padded.
- Byte count for packed sequence: `ceil(seql / 4)`.

Unitigs with more than `MAX_KMERS_PER_CHUNK = 256` k-mers are transparently split into overlapping chunks. Each chunk has at most 256 k-mers (= `seql − k + 1 ≤ 256`); consecutive chunks overlap by k−1 nucleotides so no kmer is lost:

```
chunk 1: nucleotides [0,   MAX_KMERS_PER_CHUNK + k − 2]   (256 kmers)
chunk 2: nucleotides [256, end]                             (remaining kmers)
overlap: k−1 nucleotides shared between the two chunks
```

### `unitigs.bin.idx` — block-sampled offset index

```
magic     : 4 bytes  = "UIX3"
block_bits: u32 LE   — granularity parameter (0–31)
n_unitigs : u32 LE   — total number of chunks in unitigs.bin
n_kmers   : u64 LE   — total number of kmers across all chunks
offsets   : [u32 LE] — byte offsets into unitigs.bin, one per 2^block_bits chunks + sentinel
```

One offset entry is stored every `2^block_bits` chunks; the array is sentinel-terminated (last entry = file size). `DEFAULT_BLOCK_BITS = 0` stores one offset per chunk (exact table, no scan).

### `evidence.bin` — per-slot MPHF evidence

A flat array of u32 values, one per MPHF slot, no header:

```
bits [31:7] = chunk_id  (25 bits)
bits  [6:0] = rank      (7 bits, 0–127)
```

File size = `n_slots × 4` bytes. `chunk_id` is the 0-based index of the record in `unitigs.bin`; `rank` is the position of the canonical kmer within that chunk (counting only canonical kmers). Encoding: `raw = (chunk_id << 7) | (rank & 0x7F)`. Decoding: `chunk_id = raw >> 7`, `rank = raw & 0x7F`.

---

## Building and reading the index

### `build_unitig_idx(path, block_bits)`

Scans `unitigs.bin` sequentially: for each chunk at byte offset `offset`, if `chunk_count & mask == 0` (where `mask = (1 << block_bits) − 1`), appends `offset as u32` to `block_offsets`. After the scan, appends a sentinel (= total file size), then writes the `.idx` file. Called after the unitig file is fully written and closed.

### `open()`, `open_sequential()`, `open_direct_access()`

`UnitigFileReader` has three constructors:

- `open(path)` — smart default: if `unitigs.bin.idx` exists, delegates to `open_direct_access`; otherwise delegates to `open_sequential`. Prefer this in call sites that don't require one specific mode.
- `open_sequential(path)` — never reads `.idx`. Sequential iterators only; `chunk_start(i)` falls back to an O(i) mmap scan rather than panicking.
- `open_direct_access(path)` — requires `.idx` to be present. Enables O(1) or O(2^block_bits) `chunk_start(i)`, used by `verify_canonical_kmer` at query time.

`CanonicalKmerIter` — a clonable sequential iterator returned by `UnitigFileReader::iter_canonical_kmers()`. It holds an `Arc<Mmap>` so cloning resets the cursor to the start without reopening the file. This makes it usable with `par_bridge()` for parallel MPHF construction without random access.

### `chunk_start(i)` — access modes

When `.idx` is loaded (`open_direct_access`):

- `block_bits = 0`: single array lookup, O(1).
- `block_bits > 0`: lookup block, then scan ≤ 2^block_bits records, O(2^block_bits).

When `.idx` is absent (`open_sequential`): `chunk_start(i)` performs an O(i) sequential mmap scan from offset 0. No panic — the function degrades gracefully. This degraded path is used by `find_strict()` on Approx layers (sequential scan of all canonical kmers).

### Decoding a kmer from slot `s`

```rust
let (chunk_id, rank) = evidence.decode(s);      // u32 → (chunk_id: u32, rank: u8)
let kmer = unitigs.raw_kmer(chunk_id, rank);    // 2-bit packed slice → left-aligned u64
```

Two memory accesses: one 4-byte read from `evidence.bin`, one packed-bit extraction from `unitigs.bin` via the mmap. The retrieved sequence is already canonical (only canonical kmers are inserted into the De Bruijn graph).

---

## Field widths and capacity

| field      | bits | range         | capacity check (*B. nana*, 256 partitions) |
|------------|------|---------------|---------------------------------------------|
| `seql − k` | 8    | 0–255         | max `n_kmers` per chunk = 256 = `MAX_KMERS_PER_CHUNK` |
| `rank`     | 7    | 0–127         | observed max ~46 kmers/chunk; structural max k−m+1 = 21 |
| `chunk_id` | 25   | 0–33 554 431  | avg U ≈ 275 k chunks/partition |

The rank field is 7 bits (max 127) even though chunks can contain up to 256 k-mers, because rank counts only canonical kmers within the chunk, and the canonical kmer count is at most half the total.

---

## Evidence bit-cost

Strategy B (chunk_id + rank) is the implemented strategy. For *B. nana* (k=31, 256 partitions, P ≈ 10.4 M unique kmers/partition, U ≈ 275 k chunks/partition, m_u ≈ 37.9 kmers/chunk):

| field      | theoretical cost        | value   |
|------------|-------------------------|---------|
| chunk_id   | ⌈log₂ U⌉               | 19 bits |
| rank       | ⌈log₂ m_u⌉ (≈ fixed)   | 6 bits  |
| **stored** | aligned u32             | **32 bits/slot** |

The u32 layout is chosen for alignment and simplicity; no bit-addressing arithmetic is needed.

Comparison with strategy A (global nucleotide offset): `⌈log₂(P · (1 + (k−1)/m_u))⌉ = 25 bits`. Strategy A is theoretically 2 bits cheaper; strategy B's advantage is **locality** (decoding touches one chunk's cache lines) and a bounded, constant-width rank field independent of partition size.

---

## Unitig decomposition non-determinism

The unitig extraction from `GraphDeBruijn` is **not deterministic**: two runs on identical input can produce different unitig counts and sequences while covering exactly the same canonical kmer set.

The hash map (`hashbrown::HashMap` with `Xxh3Builder`) has run-dependent iteration order. The `start_iter` first pass emits every node where `can_extend_left` is false — this includes true dead-ends and branch points (nodes with ≥2 left neighbours). When a branch point is encountered before its upstream neighbours, it claims the downstream chain and those upstream neighbours later produce length-k degenerate unitigs. When upstream neighbours appear first, they extend through the branch point.

**Example** — fork topology (k = 31):

```
A → B ← C
    ↓
    D
```

B has two left neighbours, so `can_extend_left = false`. Two valid tilings:

| iteration order | unitigs | count |
|---|---|---|
| A first | ABD, C | 2 |
| B first | BD, A, C | 3 |

Both cover the same 4 canonical kmers. Pure cycles are unaffected: all cycle nodes have both extensions present, so none are emitted in the first pass; each cycle produces exactly one unitig regardless of entry point (only the cut point varies).

This non-determinism is benign for MPHF construction: the MPHF is built from the kmer set, which is identical across tilings.

---

## Partition-size tradeoff

Measured on *B. nana* (k=31, m=11), summing across all partitions:

| N partitions | m_u  |
|---|---|
| 1            | 41.89 |
| 16           | 38.19 |
| 256          | 37.90 |
| 1 024        | 37.89 |

`m_u` is set by De Bruijn graph topology (heterozygosity, repeats, sequencing errors), not partition count. The variation from 1 to 1024 partitions is under 10%; within 16–1024 it is under 1%. Unitigs provide ~3.1× nucleotide compaction over super-kmers at 256 partitions.

Evidence cost decreases by 1 bit/kmer with each doubling of partition count (via `log₂ U = log₂(P/m_u)`). The sequence storage term `2 · (1 + (k−1)/m_u) ≈ 3.6 bits/kmer` is approximately constant.

---

## Alternative: fingerprint evidence

`evidence.bin` can be replaced by `fingerprint.bin` at index build time (`--approx`) or after the fact (`reindex --approx`). The fingerprint stores b bits per MPHF slot (the low b bits of `kmer.seq_hash()`); verification becomes a single bitfield comparison instead of a unitig dereference. False-positive rate per k-mer query: 1/2^b. With the Findere z parameter, z consecutive k-mers must all match, reducing the effective window FP rate to 1/2^(b·z) while skipping z−1 of every z k-mers. No `.idx` file is written or read in approx mode.

See [Approximate evidence (Findere fingerprint)](evidence_elimination.md) for the full design and CLI parameters.