Files
obikmer/xxx.HTML
T
2026-04-19 12:17:16 +02:00

661 lines
33 KiB
HTML
Raw Blame History

This file contains invisible Unicode characters
This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
<h1 id="obikmer">obikmer</h1>
<p><code>obikmer</code> is a Rust tool for manipulation, counting,
indexing, and set operations on DNA sequences represented as kmer
sets.</p>
<h2 id="constraints">Constraints</h2>
<ul>
<li>Target scale: metagenomic data, tens of Gbases, billions of
kmers</li>
<li>Maximum efficiency in computation, memory, and disk usage</li>
<li>k is odd, k ∈ [11, 31], fixed at runtime</li>
<li>Input formats: FASTA, FASTQ, gzip, streaming stdin</li>
</ul>
<h2 id="input-processing">Input processing</h2>
<p>For sequence files (FASTA/FASTQ):</p>
<ul>
<li>Cut sequence on non-ACGT nucleotides.</li>
<li>Discard fragments &lt; k.</li>
<li>For sequences &gt; 256 nt: cut at 256, start new with k-1 overlap
(preserves all kmers, no duplicates).</li>
</ul>
<h2 id="super-kmers">Super-kmers</h2>
<p>The fundamental stored unit is a <strong>canonical
super-kmer</strong>: a maximal run of consecutive kmers sharing the same
<strong>non-canonical</strong> (raw forward) minimizer, canonized at the
end.</p>
<h3 id="construction">Construction</h3>
<p>Super-kmers are built using the <strong>non-canonical
minimizer</strong> (minimum forward m-mer, no revcomp comparison). Two
consecutive kmers belong to the same super-kmer if and only if their raw
forward minimizer is identical.</p>
<p>At the end of each super-kmer, the minimizer is canonized:
<code>canonical(M) = min(M, revcomp(M))</code>.</p>
<ul>
<li>If <code>canonical(M) == M</code>: the super-kmer is already in
canonical orientation — keep as-is.</li>
<li>If <code>canonical(M) == revcomp(M)</code>: reverse-complement the
entire super-kmer (and its minimizer).</li>
</ul>
<p>The reverse-complement step does not cause any additional splitting —
it only flips the orientation.</p>
<h3 id="why-non-canonical-minimizer">Why non-canonical minimizer?</h3>
<p>Using the raw forward minimizer (rather than the canonical minimizer)
means fewer candidates are considered at each position → the minimizer
changes more often → super-kmers are <strong>shorter or equal</strong>,
never longer, compared to the canonical-minimizer definition.</p>
<p>The key benefit is at dereplication: a forward read and its
reverse-complement covering the same genomic region both produce the
<strong>same canonical super-kmer</strong> after the canonization step.
Under the canonical-minimizer definition, they would share the same
canonical minimizer but have different sequences (one being the revcomp
of the other) — they would not dereplicate. Here they do, giving a
compression factor of ~2× on paired or double-stranded data.</p>
<h3 id="length">Length</h3>
<p>For a random minimizer of length m over k-mers of length k, the
density of minimizer positions is approximately 2/(km+2) <span
class="citation" data-cites="Zheng2020-ji Golan2025-xf">[@Zheng2020-ji;
@Golan2025-xf]</span>, giving an expected super-kmer length of (k+m)/2
nucleotides. For k=31, m=13: expected ≈ 22 nt. In practice super-kmers
rarely exceed a few dozen nucleotides — the 256 nt cap is almost never
reached. When it is, the super-kmer is split at 256 nt with a k1
overlap, preserving all kmers without duplication.</p>
<h2 id="binary-storage-format">Binary storage format</h2>
<p>Memory-mappable binary format for sequence collections: store
super-kmers end-to-end with u64-aligned headers (add padding if needed
for 8-byte alignment). Include index blocks every B super-kmers
(B=1000-10000) for fast jumping to position J (O(N/B + B) scan
vs. separate index O(1) direct access), where N is total super-kmers in
file. Represent header as union/struct for field access.</p>
<h2 id="partitioning-and-indexing-architecture">Partitioning and
indexing architecture</h2>
<p>The canonical minimizer of a super-kmer is hashed to produce a
<strong>p-bit routing value</strong> (p is a collection-level
parameter):</p>
<pre><code>canonical minimizer → hash(minimizer) → p-bit value → PART → partition directory</code></pre>
<p>PART is computed once at phase 1 to open the correct partition file,
then discarded. It is recomputed on the fly at query time. It is never
stored in the super-kmer header.</p>
<p>Each partition holds one BBHash instance (phase 6) that indexes kmers
as plain u64 values — the minimizer plays no role inside the
partition.</p>
<p>Properties: - <strong>2^p partitions</strong>: p freely tunable at
collection creation; not constrained to powers of 4 - <strong>Uniform
distribution</strong>: hashing the canonical minimizer corrects the
lexicographic bias inherent in minimizer selection <span
class="citation"
data-cites="Zheng2020-ji Zheng2021-cc Pan2024-hb Kille2023-px Golan2025-xf">[@Zheng2020-ji;
@Zheng2021-cc; @Pan2024-hb; @Kille2023-px; @Golan2025-xf]</span> - The
odd-length constraint on m only ensures canonical form correctness
during minimizer computation — it does not affect the partition
layout</p>
<p><strong>Entropy constraint:</strong> the canonical minimizer has 2m
bits of entropy. Requesting p bits from the hash requires 2m ≥ p,
i.e. <strong>p ≤ 2m</strong>:</p>
<table>
<thead>
<tr>
<th>m</th>
<th>p max</th>
<th>Partitions max</th>
</tr>
</thead>
<tbody>
<tr>
<td>9</td>
<td>2</td>
<td>4</td>
</tr>
<tr>
<td>11</td>
<td>6</td>
<td>64</td>
</tr>
<tr>
<td>13</td>
<td>10</td>
<td>1 024</td>
</tr>
<tr>
<td>15</td>
<td>14</td>
<td>16 384</td>
</tr>
</tbody>
</table>
<p>For the typical configuration k=31, m=13: up to 1 024 partitions.
Choosing p &gt; 2m16 is possible but produces artificially colliding
PART values with no gain in routing granularity.</p>
<h2 id="construction-pipeline">Construction pipeline</h2>
<p>All phases after scatter are embarrassingly parallel across
partitions.</p>
<h3 id="phase-1-scatter">Phase 1 — Scatter</h3>
<p>Single streaming pass over raw input files (FASTA/FASTQ, gzip). FASTQ
quality scores are ignored. For each read:</p>
<ol type="1">
<li><strong>Ambiguous base filter</strong>: cut at any non-ACGT base;
discard fragments shorter than k.</li>
<li><strong>Entropy filter</strong>: scan each fragment with a sliding
window of size k. When the kmer <span class="math inline">$K_i = S[i
\mathinner{..} i+k-1]$</span> ended by nucleotide <span
class="math inline"><em>S</em>[<em>j</em>]</span> (with <span
class="math inline"><em>j</em>=<em>i</em>+<em>k</em> 1</span>) has
entropy below threshold <span class="math inline"><em>θ</em></span>,
emit the current segment and start a new one (see algorithm below).
<span class="math inline"><em>K</em><sub><em>i</em></sub></span> belongs
to neither segment, and no valid kmer is lost.</li>
<li><strong>Super-kmer extraction</strong>: for each clean segment,
slide a minimizer window and group consecutive kmers sharing the same
raw minimizer; canonise at the end.</li>
<li><strong>Partition routing</strong>:
<code>hash(canonical_minimizer) → PART</code> → append super-kmer to
<code>partition/superkmers.bin.gz</code>.</li>
</ol>
<p><strong>Entropy filter algorithm:</strong></p>
<p>When <span class="math inline"><em>K</em><sub><em>i</em></sub></span>
(ended by <span class="math inline"><em>S</em>[<em>j</em>]</span>, <span
class="math inline"><em>j</em>=<em>i</em>+<em>k</em> 1</span>)
fails the entropy threshold:</p>
<ul>
<li>Current segment <span class="math inline">$S[\textit{seg_start}
\mathinner{..} j-1]$</span> is emitted (last valid kmer = <span
class="math inline"><em>K</em><sub><em>i</em> 1</sub></span>)</li>
<li>New segment starts at <span
class="math inline"><em>S</em>[<em>i</em>+1]</span> (first new kmer =
<span
class="math inline"><em>K</em><sub><em>i</em>+1</sub></span>)</li>
<li><span class="math inline"><em>K</em><sub><em>i</em></sub></span> is
excluded: current segment lacks <span
class="math inline"><em>S</em>[<em>j</em>]</span>, new segment lacks
<span class="math inline"><em>S</em>[<em>i</em>]</span></li>
<li>Overlap = <span class="math inline">$S[i+1 \mathinner{..}
j-1]$</span> = <span class="math inline"><em>k</em> 2</span>
nucleotides</li>
</ul>
<pre class="pseudocode"><code>\begin{algorithm}
\caption{Entropy filter — sliding window segmentation}
\begin{algorithmic}
\PROCEDURE{EntropyFilter}{S, N, k, theta}
\STATE seg_start &lt;- 0
\STATE window &lt;- []
\FOR{j &lt;- 0 \TO N-1}
\STATE window.push(S[j])
\IF{|window| &lt; k}
\STATE continue
\ENDIF
\STATE i &lt;- j - k + 1
\IF{entropy(window) &lt;= theta}
\STATE emit S[seg_start .. j-1]
\STATE seg_start &lt;- i + 1
\STATE window &lt;- S[i+1 .. j]
\ELSE
\STATE window.pop_front()
\ENDIF
\ENDFOR
\STATE emit S[seg_start .. N-1]
\ENDPROCEDURE
\end{algorithmic}
\end{algorithm}</code></pre>
<p>Writes are sequential and append-only — IO-friendly. Gzip applied at
write time. Data volume ≈ raw genome size (2 bits/nt compaction offsets
header overhead).</p>
<h3 id="phase-2-dereplication">Phase 2 — Dereplication</h3>
<p>Performed independently per partition. Identical super-kmers are
consolidated and their COUNT accumulated — analogous to amplicon
dereplication in metabarcoding. Uses external bucket sort to stay within
RAM bounds:</p>
<p><strong>Pass 1</strong> (streaming): hash the nucleotide payload of
each super-kmer, route to one of B bucket files:</p>
<pre><code>hash(sequence) % B → bucket_i.bin</code></pre>
<p>B ≈ 100 is tunable; RAM needed ≈ partition_size / B.</p>
<p><strong>Pass 2</strong>: for each bucket, load into an in-memory
<code>HashMap&lt;sequence, COUNT&gt;</code>, dereplicate by summing
COUNT values, write consolidated super-kmers.</p>
<p>After dereplication: at Nx coverage the partition shrinks by ~x
(errors aside). The COUNT field in each super-kmer header = number of
times that exact super-kmer sequence was observed across all input
reads.</p>
<p><strong>Important:</strong> super-kmer COUNT ≠ individual kmer count.
A kmer can appear in multiple distinct super-kmers (same partition,
different flanking context); its true count = sum of COUNT of all
super-kmers containing it. A super-kmer with COUNT=1 may contain only
high-abundance kmers, each appearing in many other super-kmers.
Abundance filtering therefore cannot be applied at this phase.</p>
<h3 id="phase-3-per-kmer-count-aggregation-and-quorum-filtering">Phase 3
— Per-kmer count aggregation and quorum filtering</h3>
<p>For each dereplicated super-kmer, enumerate its kmers and accumulate
counts:</p>
<pre><code>for each super-kmer (sequence, COUNT):
for each kmer in sequence:
kmer_counts[canonical(kmer)] += COUNT</code></pre>
<p>Implemented as an external sort or a temporary HashMap, depending on
partition size. At the end of this phase, each distinct canonical kmer
has its exact total count.</p>
<p>Abundance filter applied here: kmers with
<code>total_count &lt; q</code> are discarded. <code>q</code> is a
collection parameter (0 = keep all, including singletons for ≤1x
data).</p>
<p>No pre-filter on super-kmer COUNT is possible at phase 2: a
super-kmer with COUNT=1 may contain only high-abundance kmers, each
present in many other super-kmers across the partition.</p>
<h3 id="phase-4-super-kmer-compaction">Phase 4 — Super-kmer
compaction</h3>
<p>The valid kmer set from phase 3 is used as a mask to rewrite the
super-kmer files:</p>
<pre><code>for each dereplicated super-kmer:
scan kmer by kmer
kmer not in valid set → break point (terminates current super-kmer)
kmer in valid set → extend current super-kmer</code></pre>
<p>Three cases per super-kmer: - <strong>All kmers valid</strong>
copied as-is - <strong>No kmer valid</strong> → discarded -
<strong>Mixed</strong> → split into sub-super-kmers at invalid
boundaries; each sub-super-kmer inherits the original COUNT</p>
<p>After splitting, re-apply dereplication (bucket sort, phase 2 method)
— splitting can produce new identical super-kmers. This re-dereplication
is cheap: the volume is already greatly reduced.</p>
<p>Output: a clean super-kmer file where every kmer passes quorum. This
file feeds phase 5.</p>
<h3 id="phase-5-local-de-bruijn-graph-and-unitig-construction">Phase 5 —
Local de Bruijn graph and unitig construction</h3>
<p>Within each partition, build a <strong>local de Bruijn graph</strong>
from the valid kmer set and compute its unitigs. All operations are
local to the partition — no cross-partition communication.</p>
<pre><code>valid kmers → HashSet&lt;u64&gt;
for each kmer K:
out_degree = |{K[1:]+b | b ∈ {A,C,G,T}} ∩ HashSet|
in_degree = |{b+K[:-1] | b ∈ {A,C,G,T}} ∩ HashSet|
internal node ↔ in_degree=1 AND out_degree=1
branching / dead-end → unitig start or end</code></pre>
<p>Traverse non-branching paths to assemble unitigs. Kmers whose
neighbours fall in other partitions appear as dead ends locally — they
terminate the unitig. The result: <strong>each kmer appears in exactly
one unitig</strong> within the partition.</p>
<p>The partition size (controlled by p) must be calibrated so that the
HashSet fits in RAM during this phase.</p>
<p>Output: <code>unitigs.bin</code> — the permanent evidence structure
for the partition. Each kmer in the partition appears at exactly one
(unitig_id, offset) location.</p>
<p><strong>Scope of local unitigs:</strong> these are unitigs of the
partitions local de Bruijn graph, not global unitigs. A kmer whose k-1
successor or predecessor falls in another partition appears as a dead
end locally and terminates the unitig. This does not affect correctness
of verification but means partition-local unitigs cannot be directly
reused for global assembly.</p>
<h3 id="phase-6-bbhash-construction-and-index-finalisation">Phase 6 —
BBHash construction and index finalisation</h3>
<p>Built once on the definitive kmer set (all kmers in all unitigs of
the partition):</p>
<pre><code>kmers from unitigs → BBHash → bbhash.bin
→ counts.bin : packed n-bit array (or 1-bit for presence mode)
→ refs.bin : (unitig_id: u32, kmer_offset: u8) per kmer</code></pre>
<p>BBHash is built once — no rebuild. The n-bit width for
<code>counts.bin</code> is chosen from the observed count distribution
(n=5 covers ~97% of kmers at 15x; n=1 for presence mode). Counts
exceeding 2ⁿ−1 go into <code>overflow.bin</code> as sorted
<code>(bbhash_index: u32, count: u32)</code> pairs.</p>
<p><strong>Exact verification via unitig evidence:</strong></p>
<p><code>unitigs.bin</code> serves as the evidence structure: for any
query kmer, the stored unitig provides the ground truth to confirm or
deny its presence. BBHash maps every input to [0, N) including absent
kmers — the unitig read-back is the only way to guarantee exactness.</p>
<pre><code>query kmer q
→ minimizer(q) → hash → PART, KEY → partition
→ BBHash(q) → index i
→ refs[i] = (unitig_id, offset)
→ read unitig unitig_id from unitigs.bin
→ extract kmer at offset → compare with q
→ match : return counts[i] ← exact hit
→ no match: kmer absent ← BBHash collision on absent kmer</code></pre>
<p>One random disk access into <code>unitigs.bin</code> per query; the
unitig is the minimal, non-redundant evidence (each kmer stored
once).</p>
<h2 id="on-disk-collection-structure">On-disk collection structure</h2>
<p>Collections are too large to hold in RAM (hundreds of genomes,
billions of kmers). The collection lives on disk as a directory of
memory-mapped files:</p>
<pre><code>collection/
metadata.toml — collection parameters (see below)
superkmers.bin — super-kmer sequences, end-to-end, memory-mappable
superkmers.idx — index blocks every B super-kmers for fast seeking
part_XXXX/
bbhash.bin — BBHash function for this partition
counts.bin — packed n-bit count array (or 1-bit presence array)
refs.bin — back-references for exact verification</code></pre>
<p><strong>Collection parameters</strong> (stored in
<code>metadata.toml</code>):</p>
<table>
<thead>
<tr>
<th>Parameter</th>
<th>Role</th>
</tr>
</thead>
<tbody>
<tr>
<td>k</td>
<td>kmer length</td>
</tr>
<tr>
<td>m</td>
<td>minimizer length (odd, &lt; k)</td>
</tr>
<tr>
<td>p</td>
<td>partition bits (0 ≤ p ≤ min(14, 2m16))</td>
</tr>
<tr>
<td>mode</td>
<td><code>presence</code> (1 bit/kmer) or <code>count</code> (n
bits/kmer)</td>
</tr>
<tr>
<td>n</td>
<td>bits per kmer in count mode (chosen at construction)</td>
</tr>
<tr>
<td>min_count</td>
<td>singleton filtering threshold (0 = keep all)</td>
</tr>
<tr>
<td>hash_fn</td>
<td>hash function identifier</td>
</tr>
<tr>
<td>hash_seed</td>
<td>seed for the hash function</td>
</tr>
</tbody>
</table>
<p><strong>Exact verification via back-reference:</strong></p>
<p>BBHash maps any input to [0, N) — including absent kmers. To avoid
false positives, <code>refs.bin</code> stores at each BBHash index a
back-reference <code>(superkmer_id: u32, kmer_offset: u8)</code> = 5
bytes. Query protocol:</p>
<pre><code>query kmer q
→ minimizer(q) → hash → PART, KEY → partition
→ BBHash(q) → index i
→ refs[i] = (sk_id, offset)
→ read super-kmer sk_id from superkmers.bin
→ extract kmer at offset → compare with q
→ if match: return counts[i] (exact)
→ if no match: kmer absent</code></pre>
<p><strong>Count storage — two modes:</strong></p>
<p><em>Presence mode</em> (coverage ≤ 1x, or when only presence/absence
matters): - <code>counts.bin</code> is a packed 1-bit array — all bits
set to 1 for indexed kmers - Singletons are the signal, not filtered</p>
<p><em>Count mode</em> (coverage &gt; 1x): - <code>counts.bin</code> is
a packed n-bit array; n chosen at construction from the observed
distribution - Value 0: absent sentinel; values 1..2ⁿ−2: direct count;
value 2ⁿ−1: overflow - Overflow counts stored in a separate
<code>overflow.bin</code> as sorted
<code>(index: u32, count: u32)</code> pairs - Empirically (k=31, 15x
coverage): n=5 covers 97% of real kmers, n=6 covers 99% - min_count
threshold filters low-frequency kmers (errors) before indexing; for ≤1x,
min_count=0</p>
<p><strong>BBHash:</strong> operates on kmer u64 values directly within
each partition. The minimizer plays no role inside the partition —
BBHash receives the kmer and builds the perfect hash naturally.</p>
<h2 id="kmer-entropy-filter">Kmer entropy filter</h2>
<p>Low-complexity kmers (polyA, polyT, tandem repeats) are detected and
excluded during phase 1. The filter computes a <strong>normalized
Shannon entropy</strong> over sub-words of multiple sizes, corrected for
two sources of bias: the small number of observations within a single
kmer, and the unequal sizes of circular equivalence classes.</p>
<h3 id="sub-word-frequencies">Sub-word frequencies</h3>
<p>For a kmer of length k and a sub-word size ws (1 ≤ ws ≤ ws_max,
typically ws_max = 6), extract the <span
class="math inline"><em>n</em><sub>words</sub>=<em>k</em><em>w</em><em>s</em>+1</span>
overlapping sub-words by sliding a window of length ws:</p>
<p><span class="math display">$$w_i = \text{kmer}[i \mathinner{..}
i+ws-1], \quad i = 0, \ldots, n_{\text{words}}-1$$</span></p>
<p>Each sub-word is mapped to its <strong>circular canonical
form</strong>: the lexicographic minimum among all cyclic rotations of
the word. Let <span
class="math inline"><em>s</em><sub><em>j</em></sub></span> be the size
of equivalence class <span class="math inline"><em>j</em></span> (number
of distinct raw words mapping to canonical form <span
class="math inline"><em>j</em></span>), and <span
class="math inline"><em>f</em><sub><em>j</em></sub></span> the count of
canonical form <span class="math inline"><em>j</em></span> among the
<span class="math inline"><em>n</em><sub>words</sub></span> sub-words
(<span
class="math inline"><sub><em>j</em></sub><em>f</em><sub><em>j</em></sub>=<em>n</em><sub>words</sub></span>).</p>
<h3 id="corrected-shannon-entropy">Corrected Shannon entropy</h3>
<p>The circular equivalence classes have unequal sizes: under a uniform
distribution over all <span
class="math inline">4<sup><em>w</em><em>s</em></sup></span> raw words,
class <span class="math inline"><em>j</em></span> is visited with
probability <span
class="math inline"><em>s</em><sub><em>j</em></sub>/4<sup><em>w</em><em>s</em></sup></span>,
not <span class="math inline">1/<em>n</em><sub><em>a</em></sub></span>.
Computing entropy directly over canonical classes therefore
underestimates the entropy of a random sequence.</p>
<p>The correction “unfolds” each canonical class back to its member raw
words, redistributing each observation of class <span
class="math inline"><em>j</em></span> equally among its <span
class="math inline"><em>s</em><sub><em>j</em></sub></span> members:</p>
<p><span class="math display">$$H_{\text{corr}} = \log(n_{\text{words}})
- \frac{1}{n_{\text{words}}} \sum_j f_j \log f_j +
\frac{1}{n_{\text{words}}} \sum_j f_j \log s_j$$</span></p>
<p>The last term is the correction for unequal class sizes. For a
uniformly random sequence (<span
class="math inline"><em>f</em><sub><em>j</em></sub> ≈ <em>n</em><sub>words</sub> ⋅ <em>s</em><sub><em>j</em></sub>/4<sup><em>w</em><em>s</em></sup></span>),
this gives <span
class="math inline"><em>H</em><sub>corr</sub> ≈ log(4<sup><em>w</em><em>s</em></sup>)=2 ⋅ <em>w</em><em>s</em> ⋅ log2</span>,
the maximum entropy over raw words.</p>
<h3 id="maximum-entropy-correction-for-small-samples">Maximum entropy
correction for small samples</h3>
<p>With only <span class="math inline"><em>n</em><sub>words</sub></span>
observations over <span
class="math inline">4<sup><em>w</em><em>s</em></sup></span> possible raw
words, the achievable maximum entropy is bounded by the most uniform
integer distribution over <span
class="math inline">4<sup><em>w</em><em>s</em></sup></span>
categories.</p>
<p>Let <span
class="math inline"><em>c</em>= ⌊<em>n</em><sub>words</sub>/4<sup><em>w</em><em>s</em></sup></span>
and <span
class="math inline"><em>r</em>=<em>n</em><sub>words</sub> mod 4<sup><em>w</em><em>s</em></sup></span>.
The most uniform integer distribution assigns frequency <span
class="math inline"><em>c</em>+1</span> to <span
class="math inline"><em>r</em></span> categories and <span
class="math inline"><em>c</em></span> to the remaining <span
class="math inline">4<sup><em>w</em><em>s</em></sup><em>r</em></span>,
with the convention <span class="math inline">0log0=0</span>:</p>
<p><span class="math display">$$H_{\max} = -\left[(4^{ws} -
r)\,\frac{c}{n_{\text{words}}}\log\frac{c}{n_{\text{words}}} +
r\,\frac{c+1}{n_{\text{words}}}\log\frac{c+1}{n_{\text{words}}}\right]$$</span></p>
<p>When <span
class="math inline"><em>n</em><sub>words</sub>&lt;4<sup><em>w</em><em>s</em></sup></span>:
<span class="math inline"><em>c</em>=0</span>, <span
class="math inline"><em>r</em>=<em>n</em><sub>words</sub></span>, and
the formula reduces to <span
class="math inline"><em>H</em><sub>max</sub>=log(<em>n</em><sub>words</sub>)</span>
— a single unified expression covers both regimes. A truly random
sequence achieves <span
class="math inline"><em>H</em><sub>corr</sub> ≈ <em>H</em><sub>max</sub></span>.</p>
<h3 id="normalized-entropy">Normalized entropy</h3>
<p><span class="math display">$$\hat{H}(ws) =
\frac{H_{\text{corr}}}{H_{\max}} \in [0, 1]$$</span></p>
<h3 id="final-score">Final score</h3>
<p>The filter computes <span
class="math inline"><em></em>(<em>w</em><em>s</em>)</span> for each
word size ws from 1 to ws_max and returns the
<strong>minimum</strong>:</p>
<p><span class="math display">$$\text{entropy}(kmer) =
\min_{ws=1}^{ws_{\max}} \hat{H}(ws)$$</span></p>
<p>A value near 0 indicates low complexity (e.g. AAAA…); near 1
indicates high complexity. A kmer is rejected if <span
class="math inline">entropy(<em>k</em><em>m</em><em>e</em><em>r</em>) ≤ <em>θ</em></span>,
where <span class="math inline"><em>θ</em></span> is a collection
parameter. The minimum across word sizes ensures that any scale of
repetition is detected independently: polyA is caught at ws=1,
dinucleotide repeats at ws=2, etc.</p>
<h2 id="priority-operations">Priority operations</h2>
<ul>
<li>Kmer counting (frequencies)</li>
<li>Fast search / query</li>
<li>Set operations: union, intersection, difference</li>
</ul>
<h2 id="data-types">Data types</h2>
<p>Two types of DNA are represented:</p>
<ul>
<li><strong>Super-kmers</strong>: stored as a nucleotide sequence (2
bits/base, byte-aligned, no padding). Max length: 256 nucleotides (512
bits, 64 bytes). Each super-kmer has a <strong>32-bit
header</strong>:</li>
</ul>
<table>
<thead>
<tr>
<th>Field</th>
<th>Bits</th>
<th>Role</th>
</tr>
</thead>
<tbody>
<tr>
<td>COUNT</td>
<td>24</td>
<td>Occurrence count (≤ 16 M)</td>
</tr>
<tr>
<td>SEQL</td>
<td>8</td>
<td>Sequence length in nucleotides (≤ 256)</td>
</tr>
</tbody>
</table>
<p>Bit layout (MSB to LSB):</p>
<pre><code>[31:8] COUNT [7:0] SEQL</code></pre>
<p>Getters:</p>
<div class="sourceCode" id="cb12"><pre
class="sourceCode rust"><code class="sourceCode rust"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="kw">fn</span> seql(<span class="op">&amp;</span><span class="kw">self</span>) <span class="op">-&gt;</span> <span class="dt">u8</span> <span class="op">{</span> <span class="kw">self</span><span class="op">.</span><span class="dv">0</span> <span class="kw">as</span> <span class="dt">u8</span> <span class="op">}</span></span>
<span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a><span class="kw">fn</span> count(<span class="op">&amp;</span><span class="kw">self</span>) <span class="op">-&gt;</span> <span class="dt">u32</span> <span class="op">{</span> <span class="kw">self</span><span class="op">.</span><span class="dv">0</span> <span class="op">&gt;&gt;</span> <span class="dv">8</span> <span class="op">}</span></span>
<span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a><span class="kw">fn</span> increment(<span class="op">&amp;</span><span class="kw">mut</span> <span class="kw">self</span>) <span class="op">{</span> <span class="kw">self</span><span class="op">.</span><span class="dv">0</span> <span class="op">+=</span> <span class="dv">1</span> <span class="op">&lt;&lt;</span> <span class="dv">8</span><span class="op">;</span> <span class="op">}</span></span></code></pre></div>
<p>PART and KEY were considered as routing fields derived from
<code>hash(minimizer)</code> but are single-use at phase 1 only —
neither is stored. The nucleotide sequence is always byte-aligned (no
offset): after reverse-complementing a super-kmer, a single bit-shift
realigns the sequence so that nucleotide 0 always starts at bit 0 of
word[0]. This shift is paid once at canonisation (phase 1, at most half
of super-kmers) and eliminates any need for an alignment offset field.
The direct consequence for dereplication: the byte array can be hashed
directly without any offset adjustment.</p>
<p>Note: minimizer selection (window minimum) biases the raw minimizer
distribution toward lexicographically small values <span
class="citation"
data-cites="Zheng2020-ji Zheng2021-cc Pan2024-hb Kille2023-px Golan2025-xf">[@Zheng2020-ji;
@Zheng2021-cc; @Pan2024-hb; @Kille2023-px; @Golan2025-xf]</span>;
hashing corrects this before partition routing.</p>
<ul>
<li><strong>Kmers</strong>: short DNA subsequences (k bases),
represented in <code>u64</code>.</li>
</ul>
<h2 id="kmer-representation">Kmer representation</h2>
<p>Each base is encoded on 2 bits, so any kmer with k ≤ 31 fits in a
<code>u64</code>.</p>
<h3 id="bit-layout">Bit layout</h3>
<p>Nucleotide <code>i</code> occupies bits <code>64-i</code> and
<code>63-i</code>. For a kmer of length k, 2k bits are used; the
low-order 64 - 2k bits of the u64 are kept at 0 and have no meaning.</p>
<ul>
<li>Extraction of nuc[i]:
<code>(word &gt;&gt; (63 - 2*i)) &amp; 0b11</code>.</li>
</ul>
<p>For kmers (k ≤ 31), the whole sequence fits in a single
<code>u64</code>. For longer sequences up to 256 nucleotides (512 bits),
the representation extends across multiple words.</p>
<h3 id="complement-and-reverse-complement">Complement and reverse
complement</h3>
<p>The encoding is chosen so that the Watson-Crick complement of any
base is its bitwise NOT (on 2 bits):</p>
<table>
<thead>
<tr>
<th>Base</th>
<th>Encoding</th>
<th>Complement</th>
<th>NOT</th>
</tr>
</thead>
<tbody>
<tr>
<td>A</td>
<td><code>00</code></td>
<td>T</td>
<td><code>11</code></td>
</tr>
<tr>
<td>C</td>
<td><code>01</code></td>
<td>G</td>
<td><code>10</code></td>
</tr>
<tr>
<td>G</td>
<td><code>10</code></td>
<td>C</td>
<td><code>01</code></td>
</tr>
<tr>
<td>T</td>
<td><code>11</code></td>
<td>A</td>
<td><code>00</code></td>
</tr>
</tbody>
</table>
<p>This means <code>complement(base) = ~base &amp; 0b11</code>, and the
reverse complement of a kmer reduces to two cheap operations:</p>
<pre><code>revcomp(kmer) = reverse_bases(~kmer &amp; mask_2k)</code></pre>
<p>where <code>mask_2k = (1 &lt;&lt; 2k) - 1</code> zeros the unused
high bits, and <code>reverse_bases</code> reorders the 2-bit chunks from
LSB to MSB. No lookup table needed.</p>
<p>DNA being double-stranded, each kmer and its reverse complement are
equivalent. The canonical form is defined as:</p>
<pre><code>canonical(kmer) = min(kmer, revcomp(kmer))</code></pre>
<p>This halves the kmer space and ensures strand-independent
counting.</p>
<h2 id="sequence-operations">Sequence operations</h2>
<h3 id="reverse-complement">Reverse complement</h3>
<p>Reverse complement of a super-kmer sequence (used at most once per
super-kmer, during phase 1 canonisation):</p>
<ol type="1">
<li>Reverse byte order</li>
<li>Apply 256-entry lookup table (precomputed reverse-complemented
bytes) to each byte</li>
<li>Bit-shift the result to realign nucleotide 0 to bit 0 of
word[0]</li>
</ol>
<p>The shift amount depends only on the sequence length:
<code>shift = (len % 4) * 2</code> bits. After the shift, the sequence
is byte-aligned with no offset — the same invariant as the original.</p>
<div class="sourceCode" id="cb15"><pre
class="sourceCode rust"><code class="sourceCode rust"><span id="cb15-1"><a href="#cb15-1" aria-hidden="true" tabindex="-1"></a><span class="kw">fn</span> reverse_complement(words<span class="op">:</span> <span class="op">&amp;</span>[<span class="dt">u64</span>]<span class="op">,</span> len<span class="op">:</span> <span class="dt">usize</span>) <span class="op">-&gt;</span> <span class="dt">Vec</span><span class="op">&lt;</span><span class="dt">u64</span><span class="op">&gt;</span> <span class="op">{</span></span>
<span id="cb15-2"><a href="#cb15-2" aria-hidden="true" tabindex="-1"></a> <span class="kw">let</span> bytes <span class="op">=</span> words_to_bytes(words)<span class="op">;</span></span>
<span id="cb15-3"><a href="#cb15-3" aria-hidden="true" tabindex="-1"></a> <span class="kw">let</span> <span class="kw">mut</span> rev<span class="op">:</span> <span class="dt">Vec</span><span class="op">&lt;</span><span class="dt">u8</span><span class="op">&gt;</span> <span class="op">=</span> bytes<span class="op">.</span>iter()<span class="op">.</span>rev()<span class="op">.</span>map(<span class="op">|&amp;</span>b<span class="op">|</span> LOOKUP_TABLE[b <span class="kw">as</span> <span class="dt">usize</span>])<span class="op">.</span>collect()<span class="op">;</span></span>
<span id="cb15-4"><a href="#cb15-4" aria-hidden="true" tabindex="-1"></a> <span class="kw">let</span> shift <span class="op">=</span> (len <span class="op">%</span> <span class="dv">4</span>) <span class="op">*</span> <span class="dv">2</span><span class="op">;</span></span>
<span id="cb15-5"><a href="#cb15-5" aria-hidden="true" tabindex="-1"></a> <span class="cf">if</span> shift <span class="op">&gt;</span> <span class="dv">0</span> <span class="op">{</span> bit_shift_left(<span class="op">&amp;</span><span class="kw">mut</span> rev<span class="op">,</span> shift)<span class="op">;</span> <span class="op">}</span></span>
<span id="cb15-6"><a href="#cb15-6" aria-hidden="true" tabindex="-1"></a> bytes_to_words(rev)</span>
<span id="cb15-7"><a href="#cb15-7" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<h2 id="multithreading">Multithreading</h2>
<p>For HPC with 100+ cores: use Rayon for data parallelism (parallel
iterators), std::thread for custom threading, or async with Tokio for
I/O-bound tasks. Equivalent to Gos goroutines but with
ownership/borrowing safety.</p>
<p>Rayon: Rust crate for automatic data parallelism. Provides parallel
iterators (par_iter) that distribute work across CPU cores, e.g.,
vec.par_iter().map(|x| x*2).collect(). Handles thread
spawning/synchronization transparently. Widely used and maintained by
the Rust project. Orthogonal to Tokio (async runtime): Rayon for CPU
parallelism, Tokio for async I/O.</p>
<p>Data pipelines: use Crossbeam channels (MPMC) with threads/async,
equivalent to Gos channels + goroutines.</p>
<p>Channels: std::sync::mpsc (built-in, MPSC, simple); crossbeam (crate,
MPMC, faster, more features like select). Use Crossbeam for
multi-producer/multi-consumer scenarios.</p>
<h1 id="bibliographie">Bibliographie</h1>
<p>\bibliography</p>