<p>The construction parameters p, n, and min_count depend on the kmer frequency spectrum of the dataset. Estimating this spectrum before construction avoids costly re-partitioning if p is badly chosen.</p>
<p>Two approaches are supported:</p>
<ul>
<li><strong>External estimation (preferred):</strong> run <ahref="https://github.com/bcgsc/ntCard">NT-CARD</a> on the input files and pass its histogram output to <code>obikmer build</code>. NT-CARD produces a kmer frequency histogram in a single streaming pass using ntHash and a Flajolet-Martin-style estimator; obikmer reads this file and derives p, n, and min_count automatically.</li>
<li><strong>Internal estimation (future):</strong> an <code>obikmer estimate</code> subcommand for users who prefer a single-tool workflow. The implementation would combine two components: (1) <strong>ntHash</strong>, a rolling hash that updates the kmer hash in O(1) per nucleotide by incrementally adding the incoming base and removing the outgoing one — Rust crates exist; (2) a <strong>Flajolet-Martin-style streaming estimator</strong> that maintains a small table of minimum hash values and infers the frequency histogram from their statistical distribution, as described in the NT-CARD paper (Mohamadi <em>et al.</em> 2017)<supid="fnref:Mohamadi2017-ok"><aclass="footnote-ref"href="#fn:Mohamadi2017-ok">1</a></sup>.</li>
</ul>
<p>The histogram gives:</p>
<ul>
<li><strong>F0</strong> (number of distinct kmers) → sets p (target ~10M kmers/partition → p = ⌈log₂(F0 / 10M)⌉)</li>
<li><strong>frequency distribution</strong> → sets n (choose n so that fewer than 1% of kmers overflow)</li>
<li><strong>error valley</strong> → suggests min_count (typically the local minimum between the error peak and the coverage peak)</li>
</ul>
<h2id="phase-1-scatter">Phase 1 — Scatter</h2>
<p>Single streaming pass over raw input files (FASTA/FASTQ, gzip). FASTQ quality scores are ignored. For each read:</p>
<ol>
<li><strong>Ambiguous base filter</strong>: cut at any non-ACGT base; discard fragments shorter than k.</li>
<li><strong>Entropy filter</strong>: scan each fragment with a sliding window of size k. When the kmer <spanclass="arithmatex">\(K_i = S[i \mathinner{..} i+k-1]\)</span> ended by nucleotide <spanclass="arithmatex">\(S[j]\)</span> (with <spanclass="arithmatex">\(j = i+k-1\)</span>) has entropy below threshold <spanclass="arithmatex">\(\theta\)</span>, emit the current segment and start a new one (see algorithm below). <spanclass="arithmatex">\(K_i\)</span> belongs to neither segment, and no valid kmer is lost.</li>
<li><strong>Length filter</strong>: discard any segment shorter than k produced by step 2.</li>
<li><strong>Super-kmer extraction</strong>: for each clean segment, slide a minimizer window and group consecutive kmers sharing the same canonical minimizer; canonise each super-kmer by lexicographic comparison with its reverse complement (early exit).</li>
<li><strong>Partition routing</strong>: <code>hash(canonical_minimizer) → PART</code> → append super-kmer to <code>partition/superkmers.bin.gz</code>.</li>
</ol>
<p><strong>Segmentation behavior:</strong></p>
<p>When <spanclass="arithmatex">\(K_i\)</span> (ended by <spanclass="arithmatex">\(S[j]\)</span>, <spanclass="arithmatex">\(j = i+k-1\)</span>) fails the entropy threshold:</p>
<li>New segment starts at <spanclass="arithmatex">\(S[i+1]\)</span> (first new kmer = <spanclass="arithmatex">\(K_{i+1}\)</span>)</li>
<li><spanclass="arithmatex">\(K_i\)</span> is excluded: current segment lacks <spanclass="arithmatex">\(S[j]\)</span>, new segment lacks <spanclass="arithmatex">\(S[i]\)</span></li>
<divclass="highlight"><pre><span></span><code>procedure EntropyFilter(S, N, k, θ):
seg_start ← 0
window ← []
for j ← 0 to N−1:
window.push(S[j])
if |window| < k: continue
i ← j − k + 1
if entropy(window) ≤ θ:
emit S[seg_start .. j−1]
seg_start ← i + 1
window ← S[i+1 .. j]
else:
window.pop_front()
emit S[seg_start .. N−1]
</code></pre></div>
</div>
<p>Writes are sequential and append-only — IO-friendly. Gzip applied at write time. Data volume ≈ raw genome size (2 bits/nt compaction offsets header overhead).</p>
<p>Performed independently per partition. Identical super-kmers are consolidated and their COUNT accumulated — analogous to amplicon dereplication in metabarcoding. Uses external bucket sort to stay within RAM bounds:</p>
<p><strong>Pass 1</strong> (streaming): hash the nucleotide payload of each super-kmer, route to one of B bucket files:
<divclass="highlight"><pre><span></span><code>hash(sequence) % B → bucket_i.bin
</code></pre></div>
B ≈ 100 is tunable; RAM needed ≈ partition_size / B.</p>
<p><strong>Pass 2</strong>: for each bucket, load into an in-memory <code>HashMap<sequence, COUNT></code>, dereplicate by summing COUNT values, write consolidated super-kmers.</p>
<p>After dereplication: at Nx coverage the partition shrinks by ~x (errors aside). The COUNT field in each super-kmer header = number of times that exact super-kmer sequence was observed across all input reads.</p>
<p><strong>Important:</strong> super-kmer COUNT ≠ individual kmer count. A kmer can appear in multiple distinct super-kmers (same partition, different flanking context); its true count = sum of COUNT of all super-kmers containing it. A super-kmer with COUNT=1 may contain only high-abundance kmers, each appearing in many other super-kmers. Abundance filtering therefore cannot be applied at this phase.</p>
<h2id="phase-3-per-kmer-count-aggregation-and-quorum-filtering">Phase 3 — Per-kmer count aggregation and quorum filtering</h2>
<p>For each dereplicated super-kmer, enumerate its kmers and accumulate counts:</p>
<divclass="highlight"><pre><span></span><code>for each super-kmer (sequence, COUNT):
for each kmer in sequence:
kmer_counts[canonical(kmer)] += COUNT
</code></pre></div>
<p>Implemented as an external sort or a temporary HashMap, depending on partition size. At the end of this phase, each distinct canonical kmer has its exact total count.</p>
<p>Abundance filter applied here: kmers with <code>total_count < q</code> are discarded. <code>q</code> is a collection parameter (0 = keep all, including singletons for ≤1x data).</p>
<p>No pre-filter on super-kmer COUNT is possible at phase 2: a super-kmer with COUNT=1 may contain only high-abundance kmers, each present in many other super-kmers across the partition.</p>
<li><strong>Mixed</strong> → split into sub-super-kmers at invalid boundaries; each sub-super-kmer inherits the original COUNT</li>
</ul>
<p>After splitting, re-apply dereplication (bucket sort, phase 2 method) — splitting can produce new identical super-kmers. This re-dereplication is cheap: the volume is already greatly reduced.</p>
<p>Output: a clean super-kmer file where every kmer passes quorum. This file feeds phase 5.</p>
<h2id="phase-5-local-de-bruijn-graph-and-unitig-construction">Phase 5 — Local de Bruijn graph and unitig construction</h2>
<p>Within each partition, build a <strong>local de Bruijn graph</strong> from the valid kmer set and compute its unitigs. All operations are local to the partition — no cross-partition communication.</p>
out_degree = |{K[1:]+b | b ∈ {A,C,G,T}} ∩ HashSet|
in_degree = |{b+K[:-1] | b ∈ {A,C,G,T}} ∩ HashSet|
internal node ↔ in_degree=1 AND out_degree=1
branching / dead-end → unitig start or end
</code></pre></div>
<p>Traverse non-branching paths to assemble unitigs. Kmers whose neighbours fall in other partitions appear as dead ends locally — they terminate the unitig. The result: <strong>each kmer appears in exactly one unitig</strong> within the partition.</p>
<p>The partition size (controlled by p) must be calibrated so that the HashSet fits in RAM during this phase.</p>
<p>Output: <code>unitigs.bin</code> — the permanent evidence structure for the partition. Each kmer in the partition appears at exactly one (unitig_id, offset) location.</p>
<p><strong>Scope of local unitigs:</strong> these are unitigs of the partition's local de Bruijn graph, not global unitigs. A kmer whose k-1 successor or predecessor falls in another partition appears as a dead end locally and terminates the unitig. This does not affect correctness of verification but means partition-local unitigs cannot be directly reused for global assembly.</p>
<h2id="phase-6-mphf-construction-and-index-finalisation">Phase 6 — MPHF construction and index finalisation</h2>
<p>Built once on the definitive kmer set (all kmers in all unitigs of the partition). See <ahref="../obilayeredmap/">obilayeredmap</a> and <ahref="../mphf/">MPHF selection</a> for the current implementation.</p>
<p>The MPHF is built in two passes over <code>unitigs.bin</code>: parallel pass for <code>mphf.bin</code>, sequential pass for <code>evidence.bin</code> and payload. The exact kmer count is available from the unitig index (<code>unitigs.bin.idx</code>) before the passes begin.</p>
<p><code>unitigs.bin</code> serves as the evidence structure. The MPHF maps every input to <code>[0, N)</code> including absent kmers — the unitig read-back (via <code>evidence.bin</code>) is the only correct membership test.</p>
<p>Mohamadi, H., Khan, H. & Birol, I. (2017). <ahref="https://doi.org/10.1093/bioinformatics/btw832">ntCard: A streaming algorithm for cardinality estimation in genomics data</a>. <em>Bioinformatics (Oxford, England)</em>, 33, 1324--1330. <aclass="footnote-backref"href="#fnref:Mohamadi2017-ok"title="Jump back to footnote 1 in the text">↩</a></p>
<scriptid="__config"type="application/json">{"annotate":null,"base":"../..","features":[],"search":"../../assets/javascripts/workers/search.2c215733.min.js","tags":null,"translations":{"clipboard.copied":"Copied to clipboard","clipboard.copy":"Copy to clipboard","search.result.more.one":"1 more on this page","search.result.more.other":"# more on this page","search.result.none":"No matching documents","search.result.one":"1 matching document","search.result.other":"# matching documents","search.result.placeholder":"Type to start searching","search.result.term.missing":"Missing","select.version":"Select version"},"version":null}</script>