Files
obikmer/docmd/implementation/unitig_evidence.md
T
Eric Coissac da56c3e290 docs: update architecture and storage specs for approximate index
Restructure architecture documentation to reflect the decoupled `MphfLayer` design wrapped by `LayeredStore<S>` and enforce strict multi-genome column invariants. Introduce the approximate index architecture, replacing exact `evidence.bin` with compact `fingerprint.bin` using B-bit fingerprints and z-consecutive k-mer matching. Update CLI flags, add `reindex`/`estimate` workflows, and refactor APIs to support separate exact/approximate evidence handling. Finally, provide a comprehensive on-disk layout specification, including the pipeline state machine, JSON schemas, binary formats, and refined Strategy B unitig evidence details.
2026-05-23 13:54:31 +02:00

8.8 KiB
Raw Blame History

Unitig-based MPHF evidence encoding

Role of unitigs in the index

The MPHF maps each canonical kmer to an integer slot but provides no inverse: a slot index alone cannot reconstruct the kmer. The evidence file supplies this inverse: for each MPHF slot it stores a pointer into the unitig sequence file, from which k nucleotides can be extracted.

Unitigs are the natural compact representation: a run of L nucleotides encodes L k + 1 consecutive canonical kmers. The entire kmer set of a partition is reconstructible from its unitig binary file.


Binary file formats

unitigs.bin — sequence chunks

A sequence of binary records. Each record:

[u8: seql  k]  [ceil(seql / 4) bytes: 2-bit packed nucleotides]
  • seql k (0255): nucleotide length minus k, so seql = byte[0] + k and n_kmers = byte[0] + 1.
  • Packed nucleotides: A=00, C=01, G=10, T=11, MSB-first within each byte; last byte zero-padded.
  • Byte count for packed sequence: ceil(seql / 4).

Unitigs with more than MAX_KMERS_PER_CHUNK = 256 k-mers are transparently split into overlapping chunks. Each chunk has at most 256 k-mers (= seql k + 1 ≤ 256); consecutive chunks overlap by k1 nucleotides so no kmer is lost:

chunk 1: nucleotides [0,   MAX_KMERS_PER_CHUNK + k  2]   (256 kmers)
chunk 2: nucleotides [256, end]                             (remaining kmers)
overlap: k1 nucleotides shared between the two chunks

unitigs.bin.idx — block-sampled offset index

magic     : 4 bytes  = "UIX3"
block_bits: u32 LE   — granularity parameter (031)
n_unitigs : u32 LE   — total number of chunks in unitigs.bin
n_kmers   : u64 LE   — total number of kmers across all chunks
offsets   : [u32 LE] — byte offsets into unitigs.bin, one per 2^block_bits chunks + sentinel

One offset entry is stored every 2^block_bits chunks; the array is sentinel-terminated (last entry = file size). DEFAULT_BLOCK_BITS = 0 stores one offset per chunk (exact table, no scan).

evidence.bin — per-slot MPHF evidence

A flat array of u32 values, one per MPHF slot, no header:

bits [31:7] = chunk_id  (25 bits)
bits  [6:0] = rank      (7 bits, 0127)

File size = n_slots × 4 bytes. chunk_id is the 0-based index of the record in unitigs.bin; rank is the position of the canonical kmer within that chunk (counting only canonical kmers). Encoding: raw = (chunk_id << 7) | (rank & 0x7F). Decoding: chunk_id = raw >> 7, rank = raw & 0x7F.


Building and reading the index

build_unitig_idx(path, block_bits)

Scans unitigs.bin sequentially: for each chunk at byte offset offset, if chunk_count & mask == 0 (where mask = (1 << block_bits) 1), appends offset as u32 to block_offsets. After the scan, appends a sentinel (= total file size), then writes the .idx file. Called after the unitig file is fully written and closed.

open() vs open_sequential()

UnitigFileReader::open(path) loads the .idx file into block_offsets: Vec<u32> and memory-maps unitigs.bin. Enables random access via chunk_start(i), unitig(i), raw_kmer(i, j), and verify_canonical_kmer(i, j, q).

UnitigFileReader::open_sequential(path) does not read .idx. It scans unitigs.bin once to count chunks and kmers, then leaves block_offsets empty. Only sequential iterators work: iter_unitigs, iter_kmers, iter_canonical_kmers, iter_indexed_canonical_kmers. Any call to chunk_start() panics with a diagnostic message.

chunk_start(i) — random access

fn chunk_start(&self, i: usize) -> usize {
    // block_bits=0: single table lookup, O(1) — hot path
    if self.block_bits == 0 {
        return self.block_offsets[i] as usize;
    }
    // block_bits>0: lookup block, then scan at most 2^block_bits  1 records
    let block = i >> self.block_bits;
    let rem   = i &  self.mask;
    let mut offset = self.block_offsets[block] as usize;
    for _ in 0..rem {
        let seql_minus_k = self.mmap[offset] as usize;
        offset += 1 + (seql_minus_k + self.k + 3) / 4;
    }
    offset
}

With block_bits = 0 (the default), every chunk has a direct entry in block_offsets: lookup is a single array index, O(1), with no sequential scan. The if self.block_bits == 0 branch is explicit in the code and handles this hot path first.

With block_bits > 0, one offset covers 2^block_bits consecutive chunks; access cost is O(2^block_bits) sequential mmap reads.

Decoding a kmer from slot s

let (chunk_id, rank) = evidence.decode(s);      // u32 → (chunk_id: u32, rank: u8)
let kmer = unitigs.raw_kmer(chunk_id, rank);    // 2-bit packed slice → left-aligned u64

Two memory accesses: one 4-byte read from evidence.bin, one packed-bit extraction from unitigs.bin via the mmap. The retrieved sequence is already canonical (only canonical kmers are inserted into the De Bruijn graph).


Field widths and capacity

field bits range capacity check (B. nana, 256 partitions)
seql k 8 0255 max n_kmers per chunk = 256 = MAX_KMERS_PER_CHUNK
rank 7 0127 observed max ~46 kmers/chunk; structural max km+1 = 21
chunk_id 25 033 554 431 avg U ≈ 275 k chunks/partition

The rank field is 7 bits (max 127) even though chunks can contain up to 256 k-mers, because rank counts only canonical kmers within the chunk, and the canonical kmer count is at most half the total.


Evidence bit-cost

Strategy B (chunk_id + rank) is the implemented strategy. For B. nana (k=31, 256 partitions, P ≈ 10.4 M unique kmers/partition, U ≈ 275 k chunks/partition, m_u ≈ 37.9 kmers/chunk):

field theoretical cost value
chunk_id ⌈log₂ U⌉ 19 bits
rank ⌈log₂ m_u⌉ (≈ fixed) 6 bits
stored aligned u32 32 bits/slot

The u32 layout is chosen for alignment and simplicity; no bit-addressing arithmetic is needed.

Comparison with strategy A (global nucleotide offset): ⌈log₂(P · (1 + (k1)/m_u))⌉ = 25 bits. Strategy A is theoretically 2 bits cheaper; strategy B's advantage is locality (decoding touches one chunk's cache lines) and a bounded, constant-width rank field independent of partition size.


Unitig decomposition non-determinism

The unitig extraction from GraphDeBruijn is not deterministic: two runs on identical input can produce different unitig counts and sequences while covering exactly the same canonical kmer set.

The hash map (hashbrown::HashMap with Xxh3Builder) has run-dependent iteration order. The start_iter first pass emits every node where can_extend_left is false — this includes true dead-ends and branch points (nodes with ≥2 left neighbours). When a branch point is encountered before its upstream neighbours, it claims the downstream chain and those upstream neighbours later produce length-k degenerate unitigs. When upstream neighbours appear first, they extend through the branch point.

Example — fork topology (k = 31):

A → B ← C
    ↓
    D

B has two left neighbours, so can_extend_left = false. Two valid tilings:

iteration order unitigs count
A first ABD, C 2
B first BD, A, C 3

Both cover the same 4 canonical kmers. Pure cycles are unaffected: all cycle nodes have both extensions present, so none are emitted in the first pass; each cycle produces exactly one unitig regardless of entry point (only the cut point varies).

This non-determinism is benign for MPHF construction: the MPHF is built from the kmer set, which is identical across tilings.


Partition-size tradeoff

Measured on B. nana (k=31, m=11), summing across all partitions:

N partitions m_u
1 41.89
16 38.19
256 37.90
1 024 37.89

m_u is set by De Bruijn graph topology (heterozygosity, repeats, sequencing errors), not partition count. The variation from 1 to 1024 partitions is under 10%; within 161024 it is under 1%. Unitigs provide ~3.1× nucleotide compaction over super-kmers at 256 partitions.

Evidence cost decreases by 1 bit/kmer with each doubling of partition count (via log₂ U = log₂(P/m_u)). The sequence storage term 2 · (1 + (k1)/m_u) ≈ 3.6 bits/kmer is approximately constant.


Open questions

  • Cross-partition set operations: strategy B allows unitig-level operations (mark entire chunks present/absent) rather than kmer-level, reducing cost by a factor of m_u.
  • Eliminating evidence.bin: at ~66% of per-layer lookup footprint, evidence.bin dominates index size. See Evidence elimination design discussion.