Skip to content

Unitig-based MPHF evidence encoding

Role of unitigs in the index

The MPHF maps each canonical kmer to an integer slot but provides no inverse: a slot index alone cannot reconstruct the kmer. The evidence file supplies this inverse: for each MPHF slot it stores a pointer into the unitig sequence file, from which k nucleotides can be extracted.

Unitigs are the natural compact representation: a run of L nucleotides encodes L − k + 1 consecutive canonical kmers. The entire kmer set of a partition is reconstructible from its unitig binary file.


Binary file formats

unitigs.bin — sequence chunks

A sequence of binary records. Each record:

[u8: seql − k]  [ceil(seql / 4) bytes: 2-bit packed nucleotides]
  • seql − k (0–255): nucleotide length minus k, so seql = byte[0] + k and n_kmers = byte[0] + 1.
  • Packed nucleotides: A=00, C=01, G=10, T=11, MSB-first within each byte; last byte zero-padded.
  • Byte count for packed sequence: ceil(seql / 4).

Unitigs with more than MAX_KMERS_PER_CHUNK = 256 k-mers are transparently split into overlapping chunks. Each chunk has at most 256 k-mers (= seql − k + 1 ≤ 256); consecutive chunks overlap by k−1 nucleotides so no kmer is lost:

chunk 1: nucleotides [0,   MAX_KMERS_PER_CHUNK + k − 2]   (256 kmers)
chunk 2: nucleotides [256, end]                             (remaining kmers)
overlap: k−1 nucleotides shared between the two chunks

unitigs.bin.idx — block-sampled offset index

magic     : 4 bytes  = "UIX3"
block_bits: u32 LE   — granularity parameter (0–31)
n_unitigs : u32 LE   — total number of chunks in unitigs.bin
n_kmers   : u64 LE   — total number of kmers across all chunks
offsets   : [u32 LE] — byte offsets into unitigs.bin, one per 2^block_bits chunks + sentinel

One offset entry is stored every 2^block_bits chunks; the array is sentinel-terminated (last entry = file size). DEFAULT_BLOCK_BITS = 0 stores one offset per chunk (exact table, no scan).

evidence.bin — per-slot MPHF evidence

A flat array of u32 values, one per MPHF slot, no header:

bits [31:7] = chunk_id  (25 bits)
bits  [6:0] = rank      (7 bits, 0–127)

File size = n_slots × 4 bytes. chunk_id is the 0-based index of the record in unitigs.bin; rank is the position of the canonical kmer within that chunk (counting only canonical kmers). Encoding: raw = (chunk_id << 7) | (rank & 0x7F). Decoding: chunk_id = raw >> 7, rank = raw & 0x7F.


Building and reading the index

build_unitig_idx(path, block_bits)

Scans unitigs.bin sequentially: for each chunk at byte offset offset, if chunk_count & mask == 0 (where mask = (1 << block_bits) − 1), appends offset as u32 to block_offsets. After the scan, appends a sentinel (= total file size), then writes the .idx file. Called after the unitig file is fully written and closed.

open(), open_sequential(), open_direct_access()

UnitigFileReader has three constructors:

  • open(path) — smart default: if unitigs.bin.idx exists, delegates to open_direct_access; otherwise delegates to open_sequential. Prefer this in call sites that don't require one specific mode.
  • open_sequential(path) — never reads .idx. Sequential iterators only; chunk_start(i) falls back to an O(i) mmap scan rather than panicking.
  • open_direct_access(path) — requires .idx to be present. Enables O(1) or O(2^block_bits) chunk_start(i), used by verify_canonical_kmer at query time.

CanonicalKmerIter — a clonable sequential iterator returned by UnitigFileReader::iter_canonical_kmers(). It holds an Arc<Mmap> so cloning resets the cursor to the start without reopening the file. This makes it usable with par_bridge() for parallel MPHF construction without random access.

chunk_start(i) — access modes

When .idx is loaded (open_direct_access):

  • block_bits = 0: single array lookup, O(1).
  • block_bits > 0: lookup block, then scan ≤ 2^block_bits records, O(2^block_bits).

When .idx is absent (open_sequential): chunk_start(i) performs an O(i) sequential mmap scan from offset 0. No panic — the function degrades gracefully. This degraded path is used by find_strict() on Approx layers (sequential scan of all canonical kmers).

Decoding a kmer from slot s

let (chunk_id, rank) = evidence.decode(s);      // u32 → (chunk_id: u32, rank: u8)
let kmer = unitigs.raw_kmer(chunk_id, rank);    // 2-bit packed slice → left-aligned u64

Two memory accesses: one 4-byte read from evidence.bin, one packed-bit extraction from unitigs.bin via the mmap. The retrieved sequence is already canonical (only canonical kmers are inserted into the De Bruijn graph).


Field widths and capacity

field bits range capacity check (B. nana, 256 partitions)
seql − k 8 0–255 max n_kmers per chunk = 256 = MAX_KMERS_PER_CHUNK
rank 7 0–127 observed max ~46 kmers/chunk; structural max k−m+1 = 21
chunk_id 25 0–33 554 431 avg U ≈ 275 k chunks/partition

The rank field is 7 bits (max 127) even though chunks can contain up to 256 k-mers, because rank counts only canonical kmers within the chunk, and the canonical kmer count is at most half the total.


Evidence bit-cost

Strategy B (chunk_id + rank) is the implemented strategy. For B. nana (k=31, 256 partitions, P ≈ 10.4 M unique kmers/partition, U ≈ 275 k chunks/partition, m_u ≈ 37.9 kmers/chunk):

field theoretical cost value
chunk_id ⌈log₂ U⌉ 19 bits
rank ⌈log₂ m_u⌉ (≈ fixed) 6 bits
stored aligned u32 32 bits/slot

The u32 layout is chosen for alignment and simplicity; no bit-addressing arithmetic is needed.

Comparison with strategy A (global nucleotide offset): ⌈log₂(P · (1 + (k−1)/m_u))⌉ = 25 bits. Strategy A is theoretically 2 bits cheaper; strategy B's advantage is locality (decoding touches one chunk's cache lines) and a bounded, constant-width rank field independent of partition size.


Unitig decomposition non-determinism

The unitig extraction from GraphDeBruijn is not deterministic: two runs on identical input can produce different unitig counts and sequences while covering exactly the same canonical kmer set.

The hash map (hashbrown::HashMap with Xxh3Builder) has run-dependent iteration order. The start_iter first pass emits every node where can_extend_left is false — this includes true dead-ends and branch points (nodes with ≥2 left neighbours). When a branch point is encountered before its upstream neighbours, it claims the downstream chain and those upstream neighbours later produce length-k degenerate unitigs. When upstream neighbours appear first, they extend through the branch point.

Example — fork topology (k = 31):

A → B ← C
    ↓
    D

B has two left neighbours, so can_extend_left = false. Two valid tilings:

iteration order unitigs count
A first ABD, C 2
B first BD, A, C 3

Both cover the same 4 canonical kmers. Pure cycles are unaffected: all cycle nodes have both extensions present, so none are emitted in the first pass; each cycle produces exactly one unitig regardless of entry point (only the cut point varies).

This non-determinism is benign for MPHF construction: the MPHF is built from the kmer set, which is identical across tilings.


Partition-size tradeoff

Measured on B. nana (k=31, m=11), summing across all partitions:

N partitions m_u
1 41.89
16 38.19
256 37.90
1 024 37.89

m_u is set by De Bruijn graph topology (heterozygosity, repeats, sequencing errors), not partition count. The variation from 1 to 1024 partitions is under 10%; within 16–1024 it is under 1%. Unitigs provide ~3.1× nucleotide compaction over super-kmers at 256 partitions.

Evidence cost decreases by 1 bit/kmer with each doubling of partition count (via log₂ U = log₂(P/m_u)). The sequence storage term 2 · (1 + (k−1)/m_u) ≈ 3.6 bits/kmer is approximately constant.


Alternative: fingerprint evidence

evidence.bin can be replaced by fingerprint.bin at index build time (--approx) or after the fact (reindex --approx). The fingerprint stores b bits per MPHF slot (the low b bits of kmer.seq_hash()); verification becomes a single bitfield comparison instead of a unitig dereference. False-positive rate per k-mer query: 1/2^b. With the Findere z parameter, z consecutive k-mers must all match, reducing the effective window FP rate to 1/2^(b·z) while skipping z−1 of every z k-mers. No .idx file is written or read in approx mode.

See Approximate evidence (Findere fingerprint) for the full design and CLI parameters.