Files

661 lines
33 KiB
HTML
Raw Permalink Normal View History

2026-04-16 22:38:20 +02:00
<h1 id="obikmer">obikmer</h1>
<p><code>obikmer</code> is a Rust tool for manipulation, counting,
indexing, and set operations on DNA sequences represented as kmer
sets.</p>
<h2 id="constraints">Constraints</h2>
<ul>
<li>Target scale: metagenomic data, tens of Gbases, billions of
kmers</li>
<li>Maximum efficiency in computation, memory, and disk usage</li>
<li>k is odd, k ∈ [11, 31], fixed at runtime</li>
<li>Input formats: FASTA, FASTQ, gzip, streaming stdin</li>
</ul>
<h2 id="input-processing">Input processing</h2>
<p>For sequence files (FASTA/FASTQ):</p>
<ul>
<li>Cut sequence on non-ACGT nucleotides.</li>
<li>Discard fragments &lt; k.</li>
<li>For sequences &gt; 256 nt: cut at 256, start new with k-1 overlap
(preserves all kmers, no duplicates).</li>
</ul>
<h2 id="super-kmers">Super-kmers</h2>
<p>The fundamental stored unit is a <strong>canonical
super-kmer</strong>: a maximal run of consecutive kmers sharing the same
<strong>non-canonical</strong> (raw forward) minimizer, canonized at the
end.</p>
<h3 id="construction">Construction</h3>
<p>Super-kmers are built using the <strong>non-canonical
minimizer</strong> (minimum forward m-mer, no revcomp comparison). Two
consecutive kmers belong to the same super-kmer if and only if their raw
forward minimizer is identical.</p>
<p>At the end of each super-kmer, the minimizer is canonized:
<code>canonical(M) = min(M, revcomp(M))</code>.</p>
<ul>
<li>If <code>canonical(M) == M</code>: the super-kmer is already in
canonical orientation — keep as-is.</li>
<li>If <code>canonical(M) == revcomp(M)</code>: reverse-complement the
entire super-kmer (and its minimizer).</li>
</ul>
<p>The reverse-complement step does not cause any additional splitting —
it only flips the orientation.</p>
<h3 id="why-non-canonical-minimizer">Why non-canonical minimizer?</h3>
<p>Using the raw forward minimizer (rather than the canonical minimizer)
means fewer candidates are considered at each position → the minimizer
changes more often → super-kmers are <strong>shorter or equal</strong>,
never longer, compared to the canonical-minimizer definition.</p>
<p>The key benefit is at dereplication: a forward read and its
reverse-complement covering the same genomic region both produce the
<strong>same canonical super-kmer</strong> after the canonization step.
Under the canonical-minimizer definition, they would share the same
canonical minimizer but have different sequences (one being the revcomp
of the other) — they would not dereplicate. Here they do, giving a
compression factor of ~2× on paired or double-stranded data.</p>
<h3 id="length">Length</h3>
<p>For a random minimizer of length m over k-mers of length k, the
density of minimizer positions is approximately 2/(km+2) <span
class="citation" data-cites="Zheng2020-ji Golan2025-xf">[@Zheng2020-ji;
@Golan2025-xf]</span>, giving an expected super-kmer length of (k+m)/2
nucleotides. For k=31, m=13: expected ≈ 22 nt. In practice super-kmers
rarely exceed a few dozen nucleotides — the 256 nt cap is almost never
reached. When it is, the super-kmer is split at 256 nt with a k1
overlap, preserving all kmers without duplication.</p>
<h2 id="binary-storage-format">Binary storage format</h2>
<p>Memory-mappable binary format for sequence collections: store
super-kmers end-to-end with u64-aligned headers (add padding if needed
for 8-byte alignment). Include index blocks every B super-kmers
(B=1000-10000) for fast jumping to position J (O(N/B + B) scan
vs. separate index O(1) direct access), where N is total super-kmers in
file. Represent header as union/struct for field access.</p>
<h2 id="partitioning-and-indexing-architecture">Partitioning and
indexing architecture</h2>
<p>The canonical minimizer of a super-kmer is hashed to produce a
<strong>p-bit routing value</strong> (p is a collection-level
parameter):</p>
<pre><code>canonical minimizer → hash(minimizer) → p-bit value → PART → partition directory</code></pre>
<p>PART is computed once at phase 1 to open the correct partition file,
then discarded. It is recomputed on the fly at query time. It is never
stored in the super-kmer header.</p>
<p>Each partition holds one BBHash instance (phase 6) that indexes kmers
as plain u64 values — the minimizer plays no role inside the
partition.</p>
<p>Properties: - <strong>2^p partitions</strong>: p freely tunable at
collection creation; not constrained to powers of 4 - <strong>Uniform
distribution</strong>: hashing the canonical minimizer corrects the
lexicographic bias inherent in minimizer selection <span
class="citation"
data-cites="Zheng2020-ji Zheng2021-cc Pan2024-hb Kille2023-px Golan2025-xf">[@Zheng2020-ji;
@Zheng2021-cc; @Pan2024-hb; @Kille2023-px; @Golan2025-xf]</span> - The
odd-length constraint on m only ensures canonical form correctness
during minimizer computation — it does not affect the partition
layout</p>
<p><strong>Entropy constraint:</strong> the canonical minimizer has 2m
bits of entropy. Requesting p bits from the hash requires 2m ≥ p,
i.e. <strong>p ≤ 2m</strong>:</p>
<table>
<thead>
<tr>
<th>m</th>
<th>p max</th>
<th>Partitions max</th>
</tr>
</thead>
<tbody>
<tr>
<td>9</td>
<td>2</td>
<td>4</td>
</tr>
<tr>
<td>11</td>
<td>6</td>
<td>64</td>
</tr>
<tr>
<td>13</td>
<td>10</td>
<td>1 024</td>
</tr>
<tr>
<td>15</td>
<td>14</td>
<td>16 384</td>
</tr>
</tbody>
</table>
<p>For the typical configuration k=31, m=13: up to 1 024 partitions.
Choosing p &gt; 2m16 is possible but produces artificially colliding
PART values with no gain in routing granularity.</p>
<h2 id="construction-pipeline">Construction pipeline</h2>
<p>All phases after scatter are embarrassingly parallel across
partitions.</p>
<h3 id="phase-1-scatter">Phase 1 — Scatter</h3>
<p>Single streaming pass over raw input files (FASTA/FASTQ, gzip). FASTQ
quality scores are ignored. For each read:</p>
<ol type="1">
<li><strong>Ambiguous base filter</strong>: cut at any non-ACGT base;
discard fragments shorter than k.</li>
<li><strong>Entropy filter</strong>: scan each fragment with a sliding
window of size k. When the kmer <span class="math inline">$K_i = S[i
\mathinner{..} i+k-1]$</span> ended by nucleotide <span
class="math inline"><em>S</em>[<em>j</em>]</span> (with <span
class="math inline"><em>j</em>=<em>i</em>+<em>k</em> 1</span>) has
entropy below threshold <span class="math inline"><em>θ</em></span>,
emit the current segment and start a new one (see algorithm below).
<span class="math inline"><em>K</em><sub><em>i</em></sub></span> belongs
to neither segment, and no valid kmer is lost.</li>
<li><strong>Super-kmer extraction</strong>: for each clean segment,
slide a minimizer window and group consecutive kmers sharing the same
raw minimizer; canonise at the end.</li>
<li><strong>Partition routing</strong>:
<code>hash(canonical_minimizer) → PART</code> → append super-kmer to
<code>partition/superkmers.bin.gz</code>.</li>
</ol>
<p><strong>Entropy filter algorithm:</strong></p>
<p>When <span class="math inline"><em>K</em><sub><em>i</em></sub></span>
(ended by <span class="math inline"><em>S</em>[<em>j</em>]</span>, <span
class="math inline"><em>j</em>=<em>i</em>+<em>k</em> 1</span>)
fails the entropy threshold:</p>
<ul>
<li>Current segment <span class="math inline">$S[\textit{seg_start}
\mathinner{..} j-1]$</span> is emitted (last valid kmer = <span
class="math inline"><em>K</em><sub><em>i</em> 1</sub></span>)</li>
<li>New segment starts at <span
class="math inline"><em>S</em>[<em>i</em>+1]</span> (first new kmer =
<span
class="math inline"><em>K</em><sub><em>i</em>+1</sub></span>)</li>
<li><span class="math inline"><em>K</em><sub><em>i</em></sub></span> is
excluded: current segment lacks <span
class="math inline"><em>S</em>[<em>j</em>]</span>, new segment lacks
<span class="math inline"><em>S</em>[<em>i</em>]</span></li>
<li>Overlap = <span class="math inline">$S[i+1 \mathinner{..}
j-1]$</span> = <span class="math inline"><em>k</em> 2</span>
nucleotides</li>
</ul>
<pre class="pseudocode"><code>\begin{algorithm}
\caption{Entropy filter — sliding window segmentation}
\begin{algorithmic}
\PROCEDURE{EntropyFilter}{S, N, k, theta}
\STATE seg_start &lt;- 0
\STATE window &lt;- []
\FOR{j &lt;- 0 \TO N-1}
\STATE window.push(S[j])
\IF{|window| &lt; k}
\STATE continue
\ENDIF
\STATE i &lt;- j - k + 1
\IF{entropy(window) &lt;= theta}
\STATE emit S[seg_start .. j-1]
\STATE seg_start &lt;- i + 1
\STATE window &lt;- S[i+1 .. j]
\ELSE
\STATE window.pop_front()
\ENDIF
\ENDFOR
\STATE emit S[seg_start .. N-1]
\ENDPROCEDURE
\end{algorithmic}
\end{algorithm}</code></pre>
<p>Writes are sequential and append-only — IO-friendly. Gzip applied at
write time. Data volume ≈ raw genome size (2 bits/nt compaction offsets
header overhead).</p>
<h3 id="phase-2-dereplication">Phase 2 — Dereplication</h3>
<p>Performed independently per partition. Identical super-kmers are
consolidated and their COUNT accumulated — analogous to amplicon
dereplication in metabarcoding. Uses external bucket sort to stay within
RAM bounds:</p>
<p><strong>Pass 1</strong> (streaming): hash the nucleotide payload of
each super-kmer, route to one of B bucket files:</p>
<pre><code>hash(sequence) % B → bucket_i.bin</code></pre>
<p>B ≈ 100 is tunable; RAM needed ≈ partition_size / B.</p>
<p><strong>Pass 2</strong>: for each bucket, load into an in-memory
<code>HashMap&lt;sequence, COUNT&gt;</code>, dereplicate by summing
COUNT values, write consolidated super-kmers.</p>
<p>After dereplication: at Nx coverage the partition shrinks by ~x
(errors aside). The COUNT field in each super-kmer header = number of
times that exact super-kmer sequence was observed across all input
reads.</p>
<p><strong>Important:</strong> super-kmer COUNT ≠ individual kmer count.
A kmer can appear in multiple distinct super-kmers (same partition,
different flanking context); its true count = sum of COUNT of all
super-kmers containing it. A super-kmer with COUNT=1 may contain only
high-abundance kmers, each appearing in many other super-kmers.
Abundance filtering therefore cannot be applied at this phase.</p>
<h3 id="phase-3-per-kmer-count-aggregation-and-quorum-filtering">Phase 3
— Per-kmer count aggregation and quorum filtering</h3>
<p>For each dereplicated super-kmer, enumerate its kmers and accumulate
counts:</p>
<pre><code>for each super-kmer (sequence, COUNT):
for each kmer in sequence:
kmer_counts[canonical(kmer)] += COUNT</code></pre>
<p>Implemented as an external sort or a temporary HashMap, depending on
partition size. At the end of this phase, each distinct canonical kmer
has its exact total count.</p>
<p>Abundance filter applied here: kmers with
<code>total_count &lt; q</code> are discarded. <code>q</code> is a
collection parameter (0 = keep all, including singletons for ≤1x
data).</p>
<p>No pre-filter on super-kmer COUNT is possible at phase 2: a
super-kmer with COUNT=1 may contain only high-abundance kmers, each
present in many other super-kmers across the partition.</p>
<h3 id="phase-4-super-kmer-compaction">Phase 4 — Super-kmer
compaction</h3>
<p>The valid kmer set from phase 3 is used as a mask to rewrite the
super-kmer files:</p>
<pre><code>for each dereplicated super-kmer:
scan kmer by kmer
kmer not in valid set → break point (terminates current super-kmer)
kmer in valid set → extend current super-kmer</code></pre>
<p>Three cases per super-kmer: - <strong>All kmers valid</strong>
copied as-is - <strong>No kmer valid</strong> → discarded -
<strong>Mixed</strong> → split into sub-super-kmers at invalid
boundaries; each sub-super-kmer inherits the original COUNT</p>
<p>After splitting, re-apply dereplication (bucket sort, phase 2 method)
— splitting can produce new identical super-kmers. This re-dereplication
is cheap: the volume is already greatly reduced.</p>
<p>Output: a clean super-kmer file where every kmer passes quorum. This
file feeds phase 5.</p>
<h3 id="phase-5-local-de-bruijn-graph-and-unitig-construction">Phase 5 —
Local de Bruijn graph and unitig construction</h3>
<p>Within each partition, build a <strong>local de Bruijn graph</strong>
from the valid kmer set and compute its unitigs. All operations are
local to the partition — no cross-partition communication.</p>
<pre><code>valid kmers → HashSet&lt;u64&gt;
for each kmer K:
out_degree = |{K[1:]+b | b ∈ {A,C,G,T}} ∩ HashSet|
in_degree = |{b+K[:-1] | b ∈ {A,C,G,T}} ∩ HashSet|
internal node ↔ in_degree=1 AND out_degree=1
branching / dead-end → unitig start or end</code></pre>
<p>Traverse non-branching paths to assemble unitigs. Kmers whose
neighbours fall in other partitions appear as dead ends locally — they
terminate the unitig. The result: <strong>each kmer appears in exactly
one unitig</strong> within the partition.</p>
<p>The partition size (controlled by p) must be calibrated so that the
HashSet fits in RAM during this phase.</p>
<p>Output: <code>unitigs.bin</code> — the permanent evidence structure
for the partition. Each kmer in the partition appears at exactly one
(unitig_id, offset) location.</p>
<p><strong>Scope of local unitigs:</strong> these are unitigs of the
partitions local de Bruijn graph, not global unitigs. A kmer whose k-1
successor or predecessor falls in another partition appears as a dead
end locally and terminates the unitig. This does not affect correctness
of verification but means partition-local unitigs cannot be directly
reused for global assembly.</p>
<h3 id="phase-6-bbhash-construction-and-index-finalisation">Phase 6 —
BBHash construction and index finalisation</h3>
<p>Built once on the definitive kmer set (all kmers in all unitigs of
the partition):</p>
<pre><code>kmers from unitigs → BBHash → bbhash.bin
→ counts.bin : packed n-bit array (or 1-bit for presence mode)
→ refs.bin : (unitig_id: u32, kmer_offset: u8) per kmer</code></pre>
<p>BBHash is built once — no rebuild. The n-bit width for
<code>counts.bin</code> is chosen from the observed count distribution
(n=5 covers ~97% of kmers at 15x; n=1 for presence mode). Counts
exceeding 2ⁿ−1 go into <code>overflow.bin</code> as sorted
<code>(bbhash_index: u32, count: u32)</code> pairs.</p>
<p><strong>Exact verification via unitig evidence:</strong></p>
<p><code>unitigs.bin</code> serves as the evidence structure: for any
query kmer, the stored unitig provides the ground truth to confirm or
deny its presence. BBHash maps every input to [0, N) including absent
kmers — the unitig read-back is the only way to guarantee exactness.</p>
<pre><code>query kmer q
→ minimizer(q) → hash → PART, KEY → partition
→ BBHash(q) → index i
→ refs[i] = (unitig_id, offset)
→ read unitig unitig_id from unitigs.bin
→ extract kmer at offset → compare with q
→ match : return counts[i] ← exact hit
→ no match: kmer absent ← BBHash collision on absent kmer</code></pre>
<p>One random disk access into <code>unitigs.bin</code> per query; the
unitig is the minimal, non-redundant evidence (each kmer stored
once).</p>
<h2 id="on-disk-collection-structure">On-disk collection structure</h2>
<p>Collections are too large to hold in RAM (hundreds of genomes,
billions of kmers). The collection lives on disk as a directory of
memory-mapped files:</p>
<pre><code>collection/
metadata.toml — collection parameters (see below)
superkmers.bin — super-kmer sequences, end-to-end, memory-mappable
superkmers.idx — index blocks every B super-kmers for fast seeking
part_XXXX/
bbhash.bin — BBHash function for this partition
counts.bin — packed n-bit count array (or 1-bit presence array)
refs.bin — back-references for exact verification</code></pre>
<p><strong>Collection parameters</strong> (stored in
<code>metadata.toml</code>):</p>
<table>
<thead>
<tr>
<th>Parameter</th>
<th>Role</th>
</tr>
</thead>
<tbody>
<tr>
<td>k</td>
<td>kmer length</td>
</tr>
<tr>
<td>m</td>
<td>minimizer length (odd, &lt; k)</td>
</tr>
<tr>
<td>p</td>
<td>partition bits (0 ≤ p ≤ min(14, 2m16))</td>
</tr>
<tr>
<td>mode</td>
<td><code>presence</code> (1 bit/kmer) or <code>count</code> (n
bits/kmer)</td>
</tr>
<tr>
<td>n</td>
<td>bits per kmer in count mode (chosen at construction)</td>
</tr>
<tr>
<td>min_count</td>
<td>singleton filtering threshold (0 = keep all)</td>
</tr>
<tr>
<td>hash_fn</td>
<td>hash function identifier</td>
</tr>
<tr>
<td>hash_seed</td>
<td>seed for the hash function</td>
</tr>
</tbody>
</table>
<p><strong>Exact verification via back-reference:</strong></p>
<p>BBHash maps any input to [0, N) — including absent kmers. To avoid
false positives, <code>refs.bin</code> stores at each BBHash index a
back-reference <code>(superkmer_id: u32, kmer_offset: u8)</code> = 5
bytes. Query protocol:</p>
<pre><code>query kmer q
→ minimizer(q) → hash → PART, KEY → partition
→ BBHash(q) → index i
→ refs[i] = (sk_id, offset)
→ read super-kmer sk_id from superkmers.bin
→ extract kmer at offset → compare with q
→ if match: return counts[i] (exact)
→ if no match: kmer absent</code></pre>
<p><strong>Count storage — two modes:</strong></p>
<p><em>Presence mode</em> (coverage ≤ 1x, or when only presence/absence
matters): - <code>counts.bin</code> is a packed 1-bit array — all bits
set to 1 for indexed kmers - Singletons are the signal, not filtered</p>
<p><em>Count mode</em> (coverage &gt; 1x): - <code>counts.bin</code> is
a packed n-bit array; n chosen at construction from the observed
distribution - Value 0: absent sentinel; values 1..2ⁿ−2: direct count;
value 2ⁿ−1: overflow - Overflow counts stored in a separate
<code>overflow.bin</code> as sorted
<code>(index: u32, count: u32)</code> pairs - Empirically (k=31, 15x
coverage): n=5 covers 97% of real kmers, n=6 covers 99% - min_count
threshold filters low-frequency kmers (errors) before indexing; for ≤1x,
min_count=0</p>
<p><strong>BBHash:</strong> operates on kmer u64 values directly within
each partition. The minimizer plays no role inside the partition —
BBHash receives the kmer and builds the perfect hash naturally.</p>
<h2 id="kmer-entropy-filter">Kmer entropy filter</h2>
<p>Low-complexity kmers (polyA, polyT, tandem repeats) are detected and
excluded during phase 1. The filter computes a <strong>normalized
Shannon entropy</strong> over sub-words of multiple sizes, corrected for
two sources of bias: the small number of observations within a single
kmer, and the unequal sizes of circular equivalence classes.</p>
<h3 id="sub-word-frequencies">Sub-word frequencies</h3>
<p>For a kmer of length k and a sub-word size ws (1 ≤ ws ≤ ws_max,
typically ws_max = 6), extract the <span
class="math inline"><em>n</em><sub>words</sub>=<em>k</em><em>w</em><em>s</em>+1</span>
overlapping sub-words by sliding a window of length ws:</p>
<p><span class="math display">$$w_i = \text{kmer}[i \mathinner{..}
i+ws-1], \quad i = 0, \ldots, n_{\text{words}}-1$$</span></p>
<p>Each sub-word is mapped to its <strong>circular canonical
form</strong>: the lexicographic minimum among all cyclic rotations of
the word. Let <span
class="math inline"><em>s</em><sub><em>j</em></sub></span> be the size
of equivalence class <span class="math inline"><em>j</em></span> (number
of distinct raw words mapping to canonical form <span
class="math inline"><em>j</em></span>), and <span
class="math inline"><em>f</em><sub><em>j</em></sub></span> the count of
canonical form <span class="math inline"><em>j</em></span> among the
<span class="math inline"><em>n</em><sub>words</sub></span> sub-words
(<span
class="math inline"><sub><em>j</em></sub><em>f</em><sub><em>j</em></sub>=<em>n</em><sub>words</sub></span>).</p>
<h3 id="corrected-shannon-entropy">Corrected Shannon entropy</h3>
<p>The circular equivalence classes have unequal sizes: under a uniform
distribution over all <span
class="math inline">4<sup><em>w</em><em>s</em></sup></span> raw words,
class <span class="math inline"><em>j</em></span> is visited with
probability <span
class="math inline"><em>s</em><sub><em>j</em></sub>/4<sup><em>w</em><em>s</em></sup></span>,
not <span class="math inline">1/<em>n</em><sub><em>a</em></sub></span>.
Computing entropy directly over canonical classes therefore
underestimates the entropy of a random sequence.</p>
<p>The correction “unfolds” each canonical class back to its member raw
words, redistributing each observation of class <span
class="math inline"><em>j</em></span> equally among its <span
class="math inline"><em>s</em><sub><em>j</em></sub></span> members:</p>
<p><span class="math display">$$H_{\text{corr}} = \log(n_{\text{words}})
- \frac{1}{n_{\text{words}}} \sum_j f_j \log f_j +
\frac{1}{n_{\text{words}}} \sum_j f_j \log s_j$$</span></p>
<p>The last term is the correction for unequal class sizes. For a
uniformly random sequence (<span
class="math inline"><em>f</em><sub><em>j</em></sub> ≈ <em>n</em><sub>words</sub> ⋅ <em>s</em><sub><em>j</em></sub>/4<sup><em>w</em><em>s</em></sup></span>),
this gives <span
class="math inline"><em>H</em><sub>corr</sub> ≈ log(4<sup><em>w</em><em>s</em></sup>)=2 ⋅ <em>w</em><em>s</em> ⋅ log2</span>,
the maximum entropy over raw words.</p>
<h3 id="maximum-entropy-correction-for-small-samples">Maximum entropy
correction for small samples</h3>
<p>With only <span class="math inline"><em>n</em><sub>words</sub></span>
observations over <span
class="math inline">4<sup><em>w</em><em>s</em></sup></span> possible raw
words, the achievable maximum entropy is bounded by the most uniform
integer distribution over <span
class="math inline">4<sup><em>w</em><em>s</em></sup></span>
categories.</p>
<p>Let <span
class="math inline"><em>c</em>= ⌊<em>n</em><sub>words</sub>/4<sup><em>w</em><em>s</em></sup></span>
and <span
class="math inline"><em>r</em>=<em>n</em><sub>words</sub> mod 4<sup><em>w</em><em>s</em></sup></span>.
The most uniform integer distribution assigns frequency <span
class="math inline"><em>c</em>+1</span> to <span
class="math inline"><em>r</em></span> categories and <span
class="math inline"><em>c</em></span> to the remaining <span
class="math inline">4<sup><em>w</em><em>s</em></sup><em>r</em></span>,
with the convention <span class="math inline">0log0=0</span>:</p>
<p><span class="math display">$$H_{\max} = -\left[(4^{ws} -
r)\,\frac{c}{n_{\text{words}}}\log\frac{c}{n_{\text{words}}} +
r\,\frac{c+1}{n_{\text{words}}}\log\frac{c+1}{n_{\text{words}}}\right]$$</span></p>
<p>When <span
class="math inline"><em>n</em><sub>words</sub>&lt;4<sup><em>w</em><em>s</em></sup></span>:
<span class="math inline"><em>c</em>=0</span>, <span
class="math inline"><em>r</em>=<em>n</em><sub>words</sub></span>, and
the formula reduces to <span
class="math inline"><em>H</em><sub>max</sub>=log(<em>n</em><sub>words</sub>)</span>
— a single unified expression covers both regimes. A truly random
sequence achieves <span
class="math inline"><em>H</em><sub>corr</sub> ≈ <em>H</em><sub>max</sub></span>.</p>
<h3 id="normalized-entropy">Normalized entropy</h3>
<p><span class="math display">$$\hat{H}(ws) =
\frac{H_{\text{corr}}}{H_{\max}} \in [0, 1]$$</span></p>
<h3 id="final-score">Final score</h3>
<p>The filter computes <span
class="math inline"><em></em>(<em>w</em><em>s</em>)</span> for each
word size ws from 1 to ws_max and returns the
<strong>minimum</strong>:</p>
<p><span class="math display">$$\text{entropy}(kmer) =
\min_{ws=1}^{ws_{\max}} \hat{H}(ws)$$</span></p>
<p>A value near 0 indicates low complexity (e.g. AAAA…); near 1
indicates high complexity. A kmer is rejected if <span
class="math inline">entropy(<em>k</em><em>m</em><em>e</em><em>r</em>) ≤ <em>θ</em></span>,
where <span class="math inline"><em>θ</em></span> is a collection
parameter. The minimum across word sizes ensures that any scale of
repetition is detected independently: polyA is caught at ws=1,
dinucleotide repeats at ws=2, etc.</p>
<h2 id="priority-operations">Priority operations</h2>
<ul>
<li>Kmer counting (frequencies)</li>
<li>Fast search / query</li>
<li>Set operations: union, intersection, difference</li>
</ul>
<h2 id="data-types">Data types</h2>
<p>Two types of DNA are represented:</p>
<ul>
<li><strong>Super-kmers</strong>: stored as a nucleotide sequence (2
bits/base, byte-aligned, no padding). Max length: 256 nucleotides (512
bits, 64 bytes). Each super-kmer has a <strong>32-bit
header</strong>:</li>
</ul>
<table>
<thead>
<tr>
<th>Field</th>
<th>Bits</th>
<th>Role</th>
</tr>
</thead>
<tbody>
<tr>
<td>COUNT</td>
<td>24</td>
<td>Occurrence count (≤ 16 M)</td>
</tr>
<tr>
<td>SEQL</td>
<td>8</td>
<td>Sequence length in nucleotides (≤ 256)</td>
</tr>
</tbody>
</table>
<p>Bit layout (MSB to LSB):</p>
<pre><code>[31:8] COUNT [7:0] SEQL</code></pre>
<p>Getters:</p>
<div class="sourceCode" id="cb12"><pre
class="sourceCode rust"><code class="sourceCode rust"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="kw">fn</span> seql(<span class="op">&amp;</span><span class="kw">self</span>) <span class="op">-&gt;</span> <span class="dt">u8</span> <span class="op">{</span> <span class="kw">self</span><span class="op">.</span><span class="dv">0</span> <span class="kw">as</span> <span class="dt">u8</span> <span class="op">}</span></span>
<span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a><span class="kw">fn</span> count(<span class="op">&amp;</span><span class="kw">self</span>) <span class="op">-&gt;</span> <span class="dt">u32</span> <span class="op">{</span> <span class="kw">self</span><span class="op">.</span><span class="dv">0</span> <span class="op">&gt;&gt;</span> <span class="dv">8</span> <span class="op">}</span></span>
<span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a><span class="kw">fn</span> increment(<span class="op">&amp;</span><span class="kw">mut</span> <span class="kw">self</span>) <span class="op">{</span> <span class="kw">self</span><span class="op">.</span><span class="dv">0</span> <span class="op">+=</span> <span class="dv">1</span> <span class="op">&lt;&lt;</span> <span class="dv">8</span><span class="op">;</span> <span class="op">}</span></span></code></pre></div>
<p>PART and KEY were considered as routing fields derived from
<code>hash(minimizer)</code> but are single-use at phase 1 only —
neither is stored. The nucleotide sequence is always byte-aligned (no
offset): after reverse-complementing a super-kmer, a single bit-shift
realigns the sequence so that nucleotide 0 always starts at bit 0 of
word[0]. This shift is paid once at canonisation (phase 1, at most half
of super-kmers) and eliminates any need for an alignment offset field.
The direct consequence for dereplication: the byte array can be hashed
directly without any offset adjustment.</p>
<p>Note: minimizer selection (window minimum) biases the raw minimizer
distribution toward lexicographically small values <span
class="citation"
data-cites="Zheng2020-ji Zheng2021-cc Pan2024-hb Kille2023-px Golan2025-xf">[@Zheng2020-ji;
@Zheng2021-cc; @Pan2024-hb; @Kille2023-px; @Golan2025-xf]</span>;
hashing corrects this before partition routing.</p>
<ul>
<li><strong>Kmers</strong>: short DNA subsequences (k bases),
represented in <code>u64</code>.</li>
</ul>
<h2 id="kmer-representation">Kmer representation</h2>
<p>Each base is encoded on 2 bits, so any kmer with k ≤ 31 fits in a
<code>u64</code>.</p>
<h3 id="bit-layout">Bit layout</h3>
<p>Nucleotide <code>i</code> occupies bits <code>64-i</code> and
<code>63-i</code>. For a kmer of length k, 2k bits are used; the
low-order 64 - 2k bits of the u64 are kept at 0 and have no meaning.</p>
<ul>
<li>Extraction of nuc[i]:
<code>(word &gt;&gt; (63 - 2*i)) &amp; 0b11</code>.</li>
</ul>
<p>For kmers (k ≤ 31), the whole sequence fits in a single
<code>u64</code>. For longer sequences up to 256 nucleotides (512 bits),
the representation extends across multiple words.</p>
<h3 id="complement-and-reverse-complement">Complement and reverse
complement</h3>
<p>The encoding is chosen so that the Watson-Crick complement of any
base is its bitwise NOT (on 2 bits):</p>
<table>
<thead>
<tr>
<th>Base</th>
<th>Encoding</th>
<th>Complement</th>
<th>NOT</th>
</tr>
</thead>
<tbody>
<tr>
<td>A</td>
<td><code>00</code></td>
<td>T</td>
<td><code>11</code></td>
</tr>
<tr>
<td>C</td>
<td><code>01</code></td>
<td>G</td>
<td><code>10</code></td>
</tr>
<tr>
<td>G</td>
<td><code>10</code></td>
<td>C</td>
<td><code>01</code></td>
</tr>
<tr>
<td>T</td>
<td><code>11</code></td>
<td>A</td>
<td><code>00</code></td>
</tr>
</tbody>
</table>
<p>This means <code>complement(base) = ~base &amp; 0b11</code>, and the
reverse complement of a kmer reduces to two cheap operations:</p>
<pre><code>revcomp(kmer) = reverse_bases(~kmer &amp; mask_2k)</code></pre>
<p>where <code>mask_2k = (1 &lt;&lt; 2k) - 1</code> zeros the unused
high bits, and <code>reverse_bases</code> reorders the 2-bit chunks from
LSB to MSB. No lookup table needed.</p>
<p>DNA being double-stranded, each kmer and its reverse complement are
equivalent. The canonical form is defined as:</p>
<pre><code>canonical(kmer) = min(kmer, revcomp(kmer))</code></pre>
<p>This halves the kmer space and ensures strand-independent
counting.</p>
<h2 id="sequence-operations">Sequence operations</h2>
<h3 id="reverse-complement">Reverse complement</h3>
<p>Reverse complement of a super-kmer sequence (used at most once per
super-kmer, during phase 1 canonisation):</p>
<ol type="1">
<li>Reverse byte order</li>
<li>Apply 256-entry lookup table (precomputed reverse-complemented
bytes) to each byte</li>
<li>Bit-shift the result to realign nucleotide 0 to bit 0 of
word[0]</li>
</ol>
<p>The shift amount depends only on the sequence length:
<code>shift = (len % 4) * 2</code> bits. After the shift, the sequence
is byte-aligned with no offset — the same invariant as the original.</p>
<div class="sourceCode" id="cb15"><pre
class="sourceCode rust"><code class="sourceCode rust"><span id="cb15-1"><a href="#cb15-1" aria-hidden="true" tabindex="-1"></a><span class="kw">fn</span> reverse_complement(words<span class="op">:</span> <span class="op">&amp;</span>[<span class="dt">u64</span>]<span class="op">,</span> len<span class="op">:</span> <span class="dt">usize</span>) <span class="op">-&gt;</span> <span class="dt">Vec</span><span class="op">&lt;</span><span class="dt">u64</span><span class="op">&gt;</span> <span class="op">{</span></span>
<span id="cb15-2"><a href="#cb15-2" aria-hidden="true" tabindex="-1"></a> <span class="kw">let</span> bytes <span class="op">=</span> words_to_bytes(words)<span class="op">;</span></span>
<span id="cb15-3"><a href="#cb15-3" aria-hidden="true" tabindex="-1"></a> <span class="kw">let</span> <span class="kw">mut</span> rev<span class="op">:</span> <span class="dt">Vec</span><span class="op">&lt;</span><span class="dt">u8</span><span class="op">&gt;</span> <span class="op">=</span> bytes<span class="op">.</span>iter()<span class="op">.</span>rev()<span class="op">.</span>map(<span class="op">|&amp;</span>b<span class="op">|</span> LOOKUP_TABLE[b <span class="kw">as</span> <span class="dt">usize</span>])<span class="op">.</span>collect()<span class="op">;</span></span>
<span id="cb15-4"><a href="#cb15-4" aria-hidden="true" tabindex="-1"></a> <span class="kw">let</span> shift <span class="op">=</span> (len <span class="op">%</span> <span class="dv">4</span>) <span class="op">*</span> <span class="dv">2</span><span class="op">;</span></span>
<span id="cb15-5"><a href="#cb15-5" aria-hidden="true" tabindex="-1"></a> <span class="cf">if</span> shift <span class="op">&gt;</span> <span class="dv">0</span> <span class="op">{</span> bit_shift_left(<span class="op">&amp;</span><span class="kw">mut</span> rev<span class="op">,</span> shift)<span class="op">;</span> <span class="op">}</span></span>
<span id="cb15-6"><a href="#cb15-6" aria-hidden="true" tabindex="-1"></a> bytes_to_words(rev)</span>
<span id="cb15-7"><a href="#cb15-7" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<h2 id="multithreading">Multithreading</h2>
<p>For HPC with 100+ cores: use Rayon for data parallelism (parallel
iterators), std::thread for custom threading, or async with Tokio for
I/O-bound tasks. Equivalent to Gos goroutines but with
ownership/borrowing safety.</p>
<p>Rayon: Rust crate for automatic data parallelism. Provides parallel
iterators (par_iter) that distribute work across CPU cores, e.g.,
vec.par_iter().map(|x| x*2).collect(). Handles thread
spawning/synchronization transparently. Widely used and maintained by
the Rust project. Orthogonal to Tokio (async runtime): Rayon for CPU
parallelism, Tokio for async I/O.</p>
<p>Data pipelines: use Crossbeam channels (MPMC) with threads/async,
equivalent to Gos channels + goroutines.</p>
<p>Channels: std::sync::mpsc (built-in, MPSC, simple); crossbeam (crate,
MPMC, faster, more features like select). Use Crossbeam for
multi-producer/multi-consumer scenarios.</p>
<h1 id="bibliographie">Bibliographie</h1>
<p>\bibliography</p>