Construction pipeline
All phases after scatter are embarrassingly parallel across partitions.
Phase 0 — Parameter estimation
The construction parameters p, n, and min_count depend on the kmer frequency spectrum of the dataset. Estimating this spectrum before construction avoids costly re-partitioning if p is badly chosen.
Two approaches are supported:
- External estimation (preferred): run NT-CARD on the input files and pass its histogram output to
obikmer build. NT-CARD produces a kmer frequency histogram in a single streaming pass using ntHash and a Flajolet-Martin-style estimator; obikmer reads this file and derives p, n, and min_count automatically. - Internal estimation (future): an
obikmer estimatesubcommand for users who prefer a single-tool workflow. The implementation would combine two components: (1) ntHash, a rolling hash that updates the kmer hash in O(1) per nucleotide by incrementally adding the incoming base and removing the outgoing one — Rust crates exist; (2) a Flajolet-Martin-style streaming estimator that maintains a small table of minimum hash values and infers the frequency histogram from their statistical distribution, as described in the NT-CARD paper (Mohamadi et al. 2017)1.
The histogram gives:
- F0 (number of distinct kmers) → sets p (target ~10M kmers/partition → p = ⌈log₂(F0 / 10M)⌉)
- frequency distribution → sets n (choose n so that fewer than 1% of kmers overflow)
- error valley → suggests min_count (typically the local minimum between the error peak and the coverage peak)
Phase 1 — Scatter
Single streaming pass over raw input files (FASTA/FASTQ, gzip). FASTQ quality scores are ignored.
Input files are read via open_nuc_stream, which opens and decompresses the file, auto-detects the format (FASTA / FASTQ / GenBank), and yields a sequence of NucPage buffers. Each NucPage is a flat 64 KB buffer of normalised bytes (ACGT + \x00 separators), carrying a k−1 byte overlap from the preceding page so that no k-mer is lost at page boundaries. Per-record identity (sequence id, raw bytes) is not preserved; this is intentional — the scatter phase only needs normalised bases to produce superkmers.
For each read fragment within a page:
- Ambiguous base filter: cut at any non-ACGT base; discard fragments shorter than k.
- Entropy filter: scan each fragment with a sliding window of size k. When the kmer \(K_i = S[i \mathinner{..} i+k-1]\) ended by nucleotide \(S[j]\) (with \(j = i+k-1\)) has entropy below threshold \(\theta\), emit the current segment and start a new one (see algorithm below). \(K_i\) belongs to neither segment, and no valid kmer is lost.
- Length filter: discard any segment shorter than k produced by step 2.
- Super-kmer extraction: for each clean segment, slide a minimizer window and group consecutive kmers sharing the same canonical minimizer; canonise each super-kmer by lexicographic comparison with its reverse complement (early exit).
- Partition routing:
hash(canonical_minimizer) → PART→ append super-kmer topartition/superkmers.bin.gz.
Segmentation behavior:
When \(K_i\) (ended by \(S[j]\), \(j = i+k-1\)) fails the entropy threshold:
- Current segment \(S[\textit{seg_start} \mathinner{..} j-1]\) is emitted (last valid kmer = \(K_{i-1}\))
- New segment starts at \(S[i+1]\) (first new kmer = \(K_{i+1}\))
- \(K_i\) is excluded: current segment lacks \(S[j]\), new segment lacks \(S[i]\)
- Overlap = \(S[i+1 \mathinner{..} j-1]\) = \(k-2\) nucleotides
Algorithm — Entropy filter: sliding window segmentation
procedure EntropyFilter(S, N, k, θ):
seg_start ← 0
window ← []
for j ← 0 to N−1:
window.push(S[j])
if |window| < k: continue
i ← j − k + 1
if entropy(window) ≤ θ:
emit S[seg_start .. j−1]
seg_start ← i + 1
window ← S[i+1 .. j]
else:
window.pop_front()
emit S[seg_start .. N−1]
Writes are sequential and append-only — IO-friendly. Gzip applied at write time. Data volume ≈ raw genome size (2 bits/nt compaction offsets header overhead).
Phase 2 — Dereplication
Performed independently per partition. Identical super-kmers are consolidated and their COUNT accumulated — analogous to amplicon dereplication in metabarcoding. Uses external bucket sort to stay within RAM bounds:
Pass 1 (streaming): hash the nucleotide payload of each super-kmer, route to one of B bucket files:
hash(sequence) % B → bucket_i.bin
Pass 2: for each bucket, load into an in-memory HashMap<sequence, COUNT>, dereplicate by summing COUNT values, write consolidated super-kmers.
After dereplication: at Nx coverage the partition shrinks by ~x (errors aside). The COUNT field in each super-kmer header = number of times that exact super-kmer sequence was observed across all input reads.
Important: super-kmer COUNT ≠ individual kmer count. A kmer can appear in multiple distinct super-kmers (same partition, different flanking context); its true count = sum of COUNT of all super-kmers containing it. A super-kmer with COUNT=1 may contain only high-abundance kmers, each appearing in many other super-kmers. Abundance filtering therefore cannot be applied at this phase.
Phase 3 — Per-kmer count aggregation and quorum filtering
For each dereplicated super-kmer, enumerate its kmers and accumulate counts:
for each super-kmer (sequence, COUNT):
for each kmer in sequence:
kmer_counts[canonical(kmer)] += COUNT
Implemented as a three-step pipeline in count_partition():
- External sort (
kmer_sort::sort_unique_kmers): read dereplicated superkmers, extract canonical kmer rawu64values, sort in RAM-bounded chunks (adaptive: 40% of available RAM ÷ n_threads, min 1 M kmers/chunk), k-way merge with inline dedup →sorted_unique.bin. f0 is now known exactly. - Provisional MPHF (ptr_hash): built from
sorted_unique.binvianew_from_par_iter(f0, ...). Stored tomphf1.bin;sorted_unique.bindeleted immediately. - Accumulation pass: re-read dereplicated superkmers; for each kmer,
slot = mphf.index(kmer.raw()), incrementcounts1[slot]by the superkmer COUNT. Stored in aPersistentCompactIntVec(counts1.bin).
At the end of this phase, each distinct canonical kmer has its exact total count, and the frequency spectrum (spectrums/{label}.json) is written to the index root.
No pre-filter on super-kmer COUNT is possible at phase 2: a super-kmer with COUNT=1 may contain only high-abundance kmers, each present in many other super-kmers across the partition.
Phase 4 — Super-kmer compaction
The valid kmer set from phase 3 is used as a mask to rewrite the super-kmer files:
for each dereplicated super-kmer:
scan kmer by kmer
kmer not in valid set → break point (terminates current super-kmer)
kmer in valid set → extend current super-kmer
Three cases per super-kmer:
- All kmers valid → copied as-is
- No kmer valid → discarded
- Mixed → split into sub-super-kmers at invalid boundaries; each sub-super-kmer inherits the original COUNT
After splitting, re-apply dereplication (bucket sort, phase 2 method) — splitting can produce new identical super-kmers. This re-dereplication is cheap: the volume is already greatly reduced.
Output: a clean super-kmer file where every kmer passes quorum. This file feeds phase 5.
Phase 5 — Local de Bruijn graph and unitig construction
Within each partition, build a local de Bruijn graph from the valid kmer set and compute its unitigs. All operations are local to the partition — no cross-partition communication.
valid kmers → HashSet<u64>
for each kmer K:
out_degree = |{K[1:]+b | b ∈ {A,C,G,T}} ∩ HashSet|
in_degree = |{b+K[:-1] | b ∈ {A,C,G,T}} ∩ HashSet|
internal node ↔ in_degree=1 AND out_degree=1
branching / dead-end → unitig start or end
Traverse non-branching paths to assemble unitigs. Kmers whose neighbours fall in other partitions appear as dead ends locally — they terminate the unitig. The result: each kmer appears in exactly one unitig within the partition.
The partition size (controlled by p) must be calibrated so that the HashSet fits in RAM during this phase.
Output: unitigs.bin — the permanent evidence structure for the partition. Each kmer in the partition appears at exactly one (unitig_id, offset) location.
Scope of local unitigs: these are unitigs of the partition's local de Bruijn graph, not global unitigs. A kmer whose k-1 successor or predecessor falls in another partition appears as a dead end locally and terminates the unitig. This does not affect correctness of verification but means partition-local unitigs cannot be directly reused for global assembly.
Phase 6 — MPHF construction and index finalisation
build_index_layer is called per partition (in parallel via build_layers) with the following parameters sourced from IndexConfig:
block_bits— fromIndexConfig::block_bits; controls the.idxblock size (2^block_bits unitig chunks per block) for exact evidenceevidence—EvidenceKind::ExactorEvidenceKind::Approx { b, z }; propagated unchanged fromIndexConfig::evidencemin_ab/max_ab— abundance bounds applied before graph constructionwith_counts— whether to store kmer counts alongside set membership
Abundance filtering: when min_ab > 1 or max_ab.is_some(), the provisional mphf1.bin and counts1.bin produced in phase 3 are memory-mapped. Each canonical kmer is accepted only if its count in counts1 satisfies the bounds. If either file is absent, filtering is skipped (all kmers accepted).
for each kmer in dereplicated super-kmer:
ab = counts1[mphf1.index(kmer.raw())]
if ab < min_ab || ab > max_ab: skip
graph.push(kmer)
Graph build and unitig write:
The surviving kmers are fed into GraphDeBruijn, which computes degrees and yields unitigs. Unitigs are written to layer_0/unitigs.bin via a UnitigFileWriter.
MPHF and evidence build:
Layer::build (membership-only) or Layer::<PersistentCompactIntMatrix>::build (with counts) is called next. Internally, MphfLayer::build performs two passes:
- Pass 1 (parallel): build
unitigs.bin.idx(block size = 2^block_bits) then construct the MPHF from all canonical kmers inunitigs.bin; store tomphf.bin. - Pass 2 (sequential): for each kmer in
unitigs.bin, compute its slot and writeevidence.bin(chunk_id: 25 bits | rank: 7 bitspacked into au32); also invoke the payload callback (fill_slot) to populatecounts/ifwith_counts.
After Layer::build completes, layer_meta.json records EvidenceKind::Exact.
Approximate evidence override:
If evidence is EvidenceKind::Approx { b, z }, build_approx_evidence is called immediately after Layer::build. It overwrites the exact evidence bundle with fingerprint.bin (b-bit hash per slot) and rewrites layer_meta.json with EvidenceKind::Approx { b, z }. No .idx file is needed at query time in this mode.
// Exact path → evidence.bin + unitigs.bin.idx + layer_meta.json(Exact)
// Approx path → fingerprint.bin + layer_meta.json(Approx{b,z})
// (evidence.bin left on disk but not used)
Partition metadata:
After all layer files are written, PartitionMeta { n_layers: 1 } is serialised to index/meta.json inside the partition directory. This file is required by LayeredMap::open for subsequent merge operations.
File layout per partition after phase 6:
part_XXXXX/
index/
meta.json ← PartitionMeta { n_layers: 1 }
layer_0/
unitigs.bin ← permanent evidence (all modes)
unitigs.bin.idx ← block index (exact mode only)
mphf.bin ← MPHF
evidence.bin ← exact evidence (exact mode)
fingerprint.bin ← b-bit fingerprints (approx mode)
layer_meta.json ← EvidenceKind tag
counts/ ← PersistentCompactIntMatrix (with_counts only)
Cleanup: unless --keep-intermediate is set, remove_build_artifacts deletes dereplicated.skmer.zst, mphf1.bin, and counts1.bin after all partitions are indexed.
See obilayeredmap and MPHF selection for data structure details.
Query path (exact evidence):
query kmer q
→ canonical_minimizer(q) → hash → PART → part_XXXXX/
→ MPHF(q) → slot s
→ evidence[s] = (chunk_id, rank)
→ read k nucleotides at rank in unitigs[chunk_id] → compare with q
→ match : return payload[s] ← exact hit
→ no match: kmer absent ← MPHF collision on absent kmer
Query path (approximate evidence):
query kmer q
→ MPHF(q) → slot s
→ fingerprint[s] matches seq_hash(q)?
→ yes : probable hit (FP rate = 1/2^b per kmer, 1/2^(b·z) per z-window)
→ no : kmer absent
-
Mohamadi, H., Khan, H. & Birol, I. (2017). ntCard: A streaming algorithm for cardinality estimation in genomics data. Bioinformatics (Oxford, England), 33, 1324--1330. ↩