Centralizes index configuration by storing a single `IndexMode` (`Exact`, `Approx`, or `Hybrid`) in `PartitionMeta`, eliminating per-layer metadata files. Introduces a `Hybrid` evidence mode and an `--approx` CLI flag to toggle between exact and probabilistic indexing. Refactors the build and query pipelines to dynamically dispatch based on the configured mode, deferring `.idx` generation to Pass 2 and only requiring it for Exact/Hybrid modes. Updates layer opening to load appropriate data structures, enforces strict parameter validation during merges, and clarifies performance trade-offs in documentation.
8.9 KiB
Unitig-based MPHF evidence encoding
Role of unitigs in the index
The MPHF maps each canonical kmer to an integer slot but provides no inverse: a slot index alone cannot reconstruct the kmer. The evidence file supplies this inverse: for each MPHF slot it stores a pointer into the unitig sequence file, from which k nucleotides can be extracted.
Unitigs are the natural compact representation: a run of L nucleotides encodes L − k + 1 consecutive canonical kmers. The entire kmer set of a partition is reconstructible from its unitig binary file.
Binary file formats
unitigs.bin — sequence chunks
A sequence of binary records. Each record:
[u8: seql − k] [ceil(seql / 4) bytes: 2-bit packed nucleotides]
seql − k(0–255): nucleotide length minus k, soseql = byte[0] + kandn_kmers = byte[0] + 1.- Packed nucleotides: A=00, C=01, G=10, T=11, MSB-first within each byte; last byte zero-padded.
- Byte count for packed sequence:
ceil(seql / 4).
Unitigs with more than MAX_KMERS_PER_CHUNK = 256 k-mers are transparently split into overlapping chunks. Each chunk has at most 256 k-mers (= seql − k + 1 ≤ 256); consecutive chunks overlap by k−1 nucleotides so no kmer is lost:
chunk 1: nucleotides [0, MAX_KMERS_PER_CHUNK + k − 2] (256 kmers)
chunk 2: nucleotides [256, end] (remaining kmers)
overlap: k−1 nucleotides shared between the two chunks
unitigs.bin.idx — block-sampled offset index
magic : 4 bytes = "UIX3"
block_bits: u32 LE — granularity parameter (0–31)
n_unitigs : u32 LE — total number of chunks in unitigs.bin
n_kmers : u64 LE — total number of kmers across all chunks
offsets : [u32 LE] — byte offsets into unitigs.bin, one per 2^block_bits chunks + sentinel
One offset entry is stored every 2^block_bits chunks; the array is sentinel-terminated (last entry = file size). DEFAULT_BLOCK_BITS = 0 stores one offset per chunk (exact table, no scan).
evidence.bin — per-slot MPHF evidence
A flat array of u32 values, one per MPHF slot, no header:
bits [31:7] = chunk_id (25 bits)
bits [6:0] = rank (7 bits, 0–127)
File size = n_slots × 4 bytes. chunk_id is the 0-based index of the record in unitigs.bin; rank is the position of the canonical kmer within that chunk (counting only canonical kmers). Encoding: raw = (chunk_id << 7) | (rank & 0x7F). Decoding: chunk_id = raw >> 7, rank = raw & 0x7F.
Building and reading the index
build_unitig_idx(path, block_bits)
Scans unitigs.bin sequentially: for each chunk at byte offset offset, if chunk_count & mask == 0 (where mask = (1 << block_bits) − 1), appends offset as u32 to block_offsets. After the scan, appends a sentinel (= total file size), then writes the .idx file. Called after the unitig file is fully written and closed.
open(), open_sequential(), open_direct_access()
UnitigFileReader has three constructors:
open(path)— smart default: ifunitigs.bin.idxexists, delegates toopen_direct_access; otherwise delegates toopen_sequential. Prefer this in call sites that don't require one specific mode.open_sequential(path)— never reads.idx. Sequential iterators only;chunk_start(i)falls back to an O(i) mmap scan rather than panicking.open_direct_access(path)— requires.idxto be present. Enables O(1) or O(2^block_bits)chunk_start(i), used byverify_canonical_kmerat query time.
CanonicalKmerIter — a clonable sequential iterator returned by UnitigFileReader::iter_canonical_kmers(). It holds an Arc<Mmap> so cloning resets the cursor to the start without reopening the file. This makes it usable with par_bridge() for parallel MPHF construction without random access.
chunk_start(i) — access modes
When .idx is loaded (open_direct_access):
block_bits = 0: single array lookup, O(1).block_bits > 0: lookup block, then scan ≤ 2^block_bits records, O(2^block_bits).
When .idx is absent (open_sequential): chunk_start(i) performs an O(i) sequential mmap scan from offset 0. No panic — the function degrades gracefully. This degraded path is used by find_strict() on Approx layers (sequential scan of all canonical kmers).
Decoding a kmer from slot s
let (chunk_id, rank) = evidence.decode(s); // u32 → (chunk_id: u32, rank: u8)
let kmer = unitigs.raw_kmer(chunk_id, rank); // 2-bit packed slice → left-aligned u64
Two memory accesses: one 4-byte read from evidence.bin, one packed-bit extraction from unitigs.bin via the mmap. The retrieved sequence is already canonical (only canonical kmers are inserted into the De Bruijn graph).
Field widths and capacity
| field | bits | range | capacity check (B. nana, 256 partitions) |
|---|---|---|---|
seql − k |
8 | 0–255 | max n_kmers per chunk = 256 = MAX_KMERS_PER_CHUNK |
rank |
7 | 0–127 | observed max ~46 kmers/chunk; structural max k−m+1 = 21 |
chunk_id |
25 | 0–33 554 431 | avg U ≈ 275 k chunks/partition |
The rank field is 7 bits (max 127) even though chunks can contain up to 256 k-mers, because rank counts only canonical kmers within the chunk, and the canonical kmer count is at most half the total.
Evidence bit-cost
Strategy B (chunk_id + rank) is the implemented strategy. For B. nana (k=31, 256 partitions, P ≈ 10.4 M unique kmers/partition, U ≈ 275 k chunks/partition, m_u ≈ 37.9 kmers/chunk):
| field | theoretical cost | value |
|---|---|---|
| chunk_id | ⌈log₂ U⌉ | 19 bits |
| rank | ⌈log₂ m_u⌉ (≈ fixed) | 6 bits |
| stored | aligned u32 | 32 bits/slot |
The u32 layout is chosen for alignment and simplicity; no bit-addressing arithmetic is needed.
Comparison with strategy A (global nucleotide offset): ⌈log₂(P · (1 + (k−1)/m_u))⌉ = 25 bits. Strategy A is theoretically 2 bits cheaper; strategy B's advantage is locality (decoding touches one chunk's cache lines) and a bounded, constant-width rank field independent of partition size.
Unitig decomposition non-determinism
The unitig extraction from GraphDeBruijn is not deterministic: two runs on identical input can produce different unitig counts and sequences while covering exactly the same canonical kmer set.
The hash map (hashbrown::HashMap with Xxh3Builder) has run-dependent iteration order. The start_iter first pass emits every node where can_extend_left is false — this includes true dead-ends and branch points (nodes with ≥2 left neighbours). When a branch point is encountered before its upstream neighbours, it claims the downstream chain and those upstream neighbours later produce length-k degenerate unitigs. When upstream neighbours appear first, they extend through the branch point.
Example — fork topology (k = 31):
A → B ← C
↓
D
B has two left neighbours, so can_extend_left = false. Two valid tilings:
| iteration order | unitigs | count |
|---|---|---|
| A first | ABD, C | 2 |
| B first | BD, A, C | 3 |
Both cover the same 4 canonical kmers. Pure cycles are unaffected: all cycle nodes have both extensions present, so none are emitted in the first pass; each cycle produces exactly one unitig regardless of entry point (only the cut point varies).
This non-determinism is benign for MPHF construction: the MPHF is built from the kmer set, which is identical across tilings.
Partition-size tradeoff
Measured on B. nana (k=31, m=11), summing across all partitions:
| N partitions | m_u |
|---|---|
| 1 | 41.89 |
| 16 | 38.19 |
| 256 | 37.90 |
| 1 024 | 37.89 |
m_u is set by De Bruijn graph topology (heterozygosity, repeats, sequencing errors), not partition count. The variation from 1 to 1024 partitions is under 10%; within 16–1024 it is under 1%. Unitigs provide ~3.1× nucleotide compaction over super-kmers at 256 partitions.
Evidence cost decreases by 1 bit/kmer with each doubling of partition count (via log₂ U = log₂(P/m_u)). The sequence storage term 2 · (1 + (k−1)/m_u) ≈ 3.6 bits/kmer is approximately constant.
Alternative: fingerprint evidence
evidence.bin can be replaced by fingerprint.bin at index build time (--approx) or after the fact (reindex --approx). The fingerprint stores b bits per MPHF slot (the low b bits of kmer.seq_hash()); verification becomes a single bitfield comparison instead of a unitig dereference. False-positive rate per k-mer query: 1/2^b. With the Findere z parameter, z consecutive k-mers must all match, reducing the effective window FP rate to 1/2^(b·z) while skipping z−1 of every z k-mers. No .idx file is written or read in approx mode.
See Approximate evidence (Findere fingerprint) for the full design and CLI parameters.