661 lines
33 KiB
HTML
661 lines
33 KiB
HTML
|
|
<h1 id="obikmer">obikmer</h1>
|
|||
|
|
<p><code>obikmer</code> is a Rust tool for manipulation, counting,
|
|||
|
|
indexing, and set operations on DNA sequences represented as kmer
|
|||
|
|
sets.</p>
|
|||
|
|
<h2 id="constraints">Constraints</h2>
|
|||
|
|
<ul>
|
|||
|
|
<li>Target scale: metagenomic data, tens of Gbases, billions of
|
|||
|
|
kmers</li>
|
|||
|
|
<li>Maximum efficiency in computation, memory, and disk usage</li>
|
|||
|
|
<li>k is odd, k ∈ [11, 31], fixed at runtime</li>
|
|||
|
|
<li>Input formats: FASTA, FASTQ, gzip, streaming stdin</li>
|
|||
|
|
</ul>
|
|||
|
|
<h2 id="input-processing">Input processing</h2>
|
|||
|
|
<p>For sequence files (FASTA/FASTQ):</p>
|
|||
|
|
<ul>
|
|||
|
|
<li>Cut sequence on non-ACGT nucleotides.</li>
|
|||
|
|
<li>Discard fragments < k.</li>
|
|||
|
|
<li>For sequences > 256 nt: cut at 256, start new with k-1 overlap
|
|||
|
|
(preserves all kmers, no duplicates).</li>
|
|||
|
|
</ul>
|
|||
|
|
<h2 id="super-kmers">Super-kmers</h2>
|
|||
|
|
<p>The fundamental stored unit is a <strong>canonical
|
|||
|
|
super-kmer</strong>: a maximal run of consecutive kmers sharing the same
|
|||
|
|
<strong>non-canonical</strong> (raw forward) minimizer, canonized at the
|
|||
|
|
end.</p>
|
|||
|
|
<h3 id="construction">Construction</h3>
|
|||
|
|
<p>Super-kmers are built using the <strong>non-canonical
|
|||
|
|
minimizer</strong> (minimum forward m-mer, no revcomp comparison). Two
|
|||
|
|
consecutive kmers belong to the same super-kmer if and only if their raw
|
|||
|
|
forward minimizer is identical.</p>
|
|||
|
|
<p>At the end of each super-kmer, the minimizer is canonized:
|
|||
|
|
<code>canonical(M) = min(M, revcomp(M))</code>.</p>
|
|||
|
|
<ul>
|
|||
|
|
<li>If <code>canonical(M) == M</code>: the super-kmer is already in
|
|||
|
|
canonical orientation — keep as-is.</li>
|
|||
|
|
<li>If <code>canonical(M) == revcomp(M)</code>: reverse-complement the
|
|||
|
|
entire super-kmer (and its minimizer).</li>
|
|||
|
|
</ul>
|
|||
|
|
<p>The reverse-complement step does not cause any additional splitting —
|
|||
|
|
it only flips the orientation.</p>
|
|||
|
|
<h3 id="why-non-canonical-minimizer">Why non-canonical minimizer?</h3>
|
|||
|
|
<p>Using the raw forward minimizer (rather than the canonical minimizer)
|
|||
|
|
means fewer candidates are considered at each position → the minimizer
|
|||
|
|
changes more often → super-kmers are <strong>shorter or equal</strong>,
|
|||
|
|
never longer, compared to the canonical-minimizer definition.</p>
|
|||
|
|
<p>The key benefit is at dereplication: a forward read and its
|
|||
|
|
reverse-complement covering the same genomic region both produce the
|
|||
|
|
<strong>same canonical super-kmer</strong> after the canonization step.
|
|||
|
|
Under the canonical-minimizer definition, they would share the same
|
|||
|
|
canonical minimizer but have different sequences (one being the revcomp
|
|||
|
|
of the other) — they would not dereplicate. Here they do, giving a
|
|||
|
|
compression factor of ~2× on paired or double-stranded data.</p>
|
|||
|
|
<h3 id="length">Length</h3>
|
|||
|
|
<p>For a random minimizer of length m over k-mers of length k, the
|
|||
|
|
density of minimizer positions is approximately 2/(k−m+2) <span
|
|||
|
|
class="citation" data-cites="Zheng2020-ji Golan2025-xf">[@Zheng2020-ji;
|
|||
|
|
@Golan2025-xf]</span>, giving an expected super-kmer length of (k+m)/2
|
|||
|
|
nucleotides. For k=31, m=13: expected ≈ 22 nt. In practice super-kmers
|
|||
|
|
rarely exceed a few dozen nucleotides — the 256 nt cap is almost never
|
|||
|
|
reached. When it is, the super-kmer is split at 256 nt with a k−1
|
|||
|
|
overlap, preserving all kmers without duplication.</p>
|
|||
|
|
<h2 id="binary-storage-format">Binary storage format</h2>
|
|||
|
|
<p>Memory-mappable binary format for sequence collections: store
|
|||
|
|
super-kmers end-to-end with u64-aligned headers (add padding if needed
|
|||
|
|
for 8-byte alignment). Include index blocks every B super-kmers
|
|||
|
|
(B=1000-10000) for fast jumping to position J (O(N/B + B) scan
|
|||
|
|
vs. separate index O(1) direct access), where N is total super-kmers in
|
|||
|
|
file. Represent header as union/struct for field access.</p>
|
|||
|
|
<h2 id="partitioning-and-indexing-architecture">Partitioning and
|
|||
|
|
indexing architecture</h2>
|
|||
|
|
<p>The canonical minimizer of a super-kmer is hashed to produce a
|
|||
|
|
<strong>p-bit routing value</strong> (p is a collection-level
|
|||
|
|
parameter):</p>
|
|||
|
|
<pre><code>canonical minimizer → hash(minimizer) → p-bit value → PART → partition directory</code></pre>
|
|||
|
|
<p>PART is computed once at phase 1 to open the correct partition file,
|
|||
|
|
then discarded. It is recomputed on the fly at query time. It is never
|
|||
|
|
stored in the super-kmer header.</p>
|
|||
|
|
<p>Each partition holds one BBHash instance (phase 6) that indexes kmers
|
|||
|
|
as plain u64 values — the minimizer plays no role inside the
|
|||
|
|
partition.</p>
|
|||
|
|
<p>Properties: - <strong>2^p partitions</strong>: p freely tunable at
|
|||
|
|
collection creation; not constrained to powers of 4 - <strong>Uniform
|
|||
|
|
distribution</strong>: hashing the canonical minimizer corrects the
|
|||
|
|
lexicographic bias inherent in minimizer selection <span
|
|||
|
|
class="citation"
|
|||
|
|
data-cites="Zheng2020-ji Zheng2021-cc Pan2024-hb Kille2023-px Golan2025-xf">[@Zheng2020-ji;
|
|||
|
|
@Zheng2021-cc; @Pan2024-hb; @Kille2023-px; @Golan2025-xf]</span> - The
|
|||
|
|
odd-length constraint on m only ensures canonical form correctness
|
|||
|
|
during minimizer computation — it does not affect the partition
|
|||
|
|
layout</p>
|
|||
|
|
<p><strong>Entropy constraint:</strong> the canonical minimizer has 2m
|
|||
|
|
bits of entropy. Requesting p bits from the hash requires 2m ≥ p,
|
|||
|
|
i.e. <strong>p ≤ 2m</strong>:</p>
|
|||
|
|
<table>
|
|||
|
|
<thead>
|
|||
|
|
<tr>
|
|||
|
|
<th>m</th>
|
|||
|
|
<th>p max</th>
|
|||
|
|
<th>Partitions max</th>
|
|||
|
|
</tr>
|
|||
|
|
</thead>
|
|||
|
|
<tbody>
|
|||
|
|
<tr>
|
|||
|
|
<td>9</td>
|
|||
|
|
<td>2</td>
|
|||
|
|
<td>4</td>
|
|||
|
|
</tr>
|
|||
|
|
<tr>
|
|||
|
|
<td>11</td>
|
|||
|
|
<td>6</td>
|
|||
|
|
<td>64</td>
|
|||
|
|
</tr>
|
|||
|
|
<tr>
|
|||
|
|
<td>13</td>
|
|||
|
|
<td>10</td>
|
|||
|
|
<td>1 024</td>
|
|||
|
|
</tr>
|
|||
|
|
<tr>
|
|||
|
|
<td>15</td>
|
|||
|
|
<td>14</td>
|
|||
|
|
<td>16 384</td>
|
|||
|
|
</tr>
|
|||
|
|
</tbody>
|
|||
|
|
</table>
|
|||
|
|
<p>For the typical configuration k=31, m=13: up to 1 024 partitions.
|
|||
|
|
Choosing p > 2m−16 is possible but produces artificially colliding
|
|||
|
|
PART values with no gain in routing granularity.</p>
|
|||
|
|
<h2 id="construction-pipeline">Construction pipeline</h2>
|
|||
|
|
<p>All phases after scatter are embarrassingly parallel across
|
|||
|
|
partitions.</p>
|
|||
|
|
<h3 id="phase-1-scatter">Phase 1 — Scatter</h3>
|
|||
|
|
<p>Single streaming pass over raw input files (FASTA/FASTQ, gzip). FASTQ
|
|||
|
|
quality scores are ignored. For each read:</p>
|
|||
|
|
<ol type="1">
|
|||
|
|
<li><strong>Ambiguous base filter</strong>: cut at any non-ACGT base;
|
|||
|
|
discard fragments shorter than k.</li>
|
|||
|
|
<li><strong>Entropy filter</strong>: scan each fragment with a sliding
|
|||
|
|
window of size k. When the kmer <span class="math inline">$K_i = S[i
|
|||
|
|
\mathinner{..} i+k-1]$</span> ended by nucleotide <span
|
|||
|
|
class="math inline"><em>S</em>[<em>j</em>]</span> (with <span
|
|||
|
|
class="math inline"><em>j</em> = <em>i</em> + <em>k</em> − 1</span>) has
|
|||
|
|
entropy below threshold <span class="math inline"><em>θ</em></span>,
|
|||
|
|
emit the current segment and start a new one (see algorithm below).
|
|||
|
|
<span class="math inline"><em>K</em><sub><em>i</em></sub></span> belongs
|
|||
|
|
to neither segment, and no valid kmer is lost.</li>
|
|||
|
|
<li><strong>Super-kmer extraction</strong>: for each clean segment,
|
|||
|
|
slide a minimizer window and group consecutive kmers sharing the same
|
|||
|
|
raw minimizer; canonise at the end.</li>
|
|||
|
|
<li><strong>Partition routing</strong>:
|
|||
|
|
<code>hash(canonical_minimizer) → PART</code> → append super-kmer to
|
|||
|
|
<code>partition/superkmers.bin.gz</code>.</li>
|
|||
|
|
</ol>
|
|||
|
|
<p><strong>Entropy filter algorithm:</strong></p>
|
|||
|
|
<p>When <span class="math inline"><em>K</em><sub><em>i</em></sub></span>
|
|||
|
|
(ended by <span class="math inline"><em>S</em>[<em>j</em>]</span>, <span
|
|||
|
|
class="math inline"><em>j</em> = <em>i</em> + <em>k</em> − 1</span>)
|
|||
|
|
fails the entropy threshold:</p>
|
|||
|
|
<ul>
|
|||
|
|
<li>Current segment <span class="math inline">$S[\textit{seg_start}
|
|||
|
|
\mathinner{..} j-1]$</span> is emitted (last valid kmer = <span
|
|||
|
|
class="math inline"><em>K</em><sub><em>i</em> − 1</sub></span>)</li>
|
|||
|
|
<li>New segment starts at <span
|
|||
|
|
class="math inline"><em>S</em>[<em>i</em> + 1]</span> (first new kmer =
|
|||
|
|
<span
|
|||
|
|
class="math inline"><em>K</em><sub><em>i</em> + 1</sub></span>)</li>
|
|||
|
|
<li><span class="math inline"><em>K</em><sub><em>i</em></sub></span> is
|
|||
|
|
excluded: current segment lacks <span
|
|||
|
|
class="math inline"><em>S</em>[<em>j</em>]</span>, new segment lacks
|
|||
|
|
<span class="math inline"><em>S</em>[<em>i</em>]</span></li>
|
|||
|
|
<li>Overlap = <span class="math inline">$S[i+1 \mathinner{..}
|
|||
|
|
j-1]$</span> = <span class="math inline"><em>k</em> − 2</span>
|
|||
|
|
nucleotides</li>
|
|||
|
|
</ul>
|
|||
|
|
<pre class="pseudocode"><code>\begin{algorithm}
|
|||
|
|
\caption{Entropy filter — sliding window segmentation}
|
|||
|
|
\begin{algorithmic}
|
|||
|
|
\PROCEDURE{EntropyFilter}{S, N, k, theta}
|
|||
|
|
\STATE seg_start <- 0
|
|||
|
|
\STATE window <- []
|
|||
|
|
\FOR{j <- 0 \TO N-1}
|
|||
|
|
\STATE window.push(S[j])
|
|||
|
|
\IF{|window| < k}
|
|||
|
|
\STATE continue
|
|||
|
|
\ENDIF
|
|||
|
|
\STATE i <- j - k + 1
|
|||
|
|
\IF{entropy(window) <= theta}
|
|||
|
|
\STATE emit S[seg_start .. j-1]
|
|||
|
|
\STATE seg_start <- i + 1
|
|||
|
|
\STATE window <- S[i+1 .. j]
|
|||
|
|
\ELSE
|
|||
|
|
\STATE window.pop_front()
|
|||
|
|
\ENDIF
|
|||
|
|
\ENDFOR
|
|||
|
|
\STATE emit S[seg_start .. N-1]
|
|||
|
|
\ENDPROCEDURE
|
|||
|
|
\end{algorithmic}
|
|||
|
|
\end{algorithm}</code></pre>
|
|||
|
|
<p>Writes are sequential and append-only — IO-friendly. Gzip applied at
|
|||
|
|
write time. Data volume ≈ raw genome size (2 bits/nt compaction offsets
|
|||
|
|
header overhead).</p>
|
|||
|
|
<h3 id="phase-2-dereplication">Phase 2 — Dereplication</h3>
|
|||
|
|
<p>Performed independently per partition. Identical super-kmers are
|
|||
|
|
consolidated and their COUNT accumulated — analogous to amplicon
|
|||
|
|
dereplication in metabarcoding. Uses external bucket sort to stay within
|
|||
|
|
RAM bounds:</p>
|
|||
|
|
<p><strong>Pass 1</strong> (streaming): hash the nucleotide payload of
|
|||
|
|
each super-kmer, route to one of B bucket files:</p>
|
|||
|
|
<pre><code>hash(sequence) % B → bucket_i.bin</code></pre>
|
|||
|
|
<p>B ≈ 100 is tunable; RAM needed ≈ partition_size / B.</p>
|
|||
|
|
<p><strong>Pass 2</strong>: for each bucket, load into an in-memory
|
|||
|
|
<code>HashMap<sequence, COUNT></code>, dereplicate by summing
|
|||
|
|
COUNT values, write consolidated super-kmers.</p>
|
|||
|
|
<p>After dereplication: at Nx coverage the partition shrinks by ~x
|
|||
|
|
(errors aside). The COUNT field in each super-kmer header = number of
|
|||
|
|
times that exact super-kmer sequence was observed across all input
|
|||
|
|
reads.</p>
|
|||
|
|
<p><strong>Important:</strong> super-kmer COUNT ≠ individual kmer count.
|
|||
|
|
A kmer can appear in multiple distinct super-kmers (same partition,
|
|||
|
|
different flanking context); its true count = sum of COUNT of all
|
|||
|
|
super-kmers containing it. A super-kmer with COUNT=1 may contain only
|
|||
|
|
high-abundance kmers, each appearing in many other super-kmers.
|
|||
|
|
Abundance filtering therefore cannot be applied at this phase.</p>
|
|||
|
|
<h3 id="phase-3-per-kmer-count-aggregation-and-quorum-filtering">Phase 3
|
|||
|
|
— Per-kmer count aggregation and quorum filtering</h3>
|
|||
|
|
<p>For each dereplicated super-kmer, enumerate its kmers and accumulate
|
|||
|
|
counts:</p>
|
|||
|
|
<pre><code>for each super-kmer (sequence, COUNT):
|
|||
|
|
for each kmer in sequence:
|
|||
|
|
kmer_counts[canonical(kmer)] += COUNT</code></pre>
|
|||
|
|
<p>Implemented as an external sort or a temporary HashMap, depending on
|
|||
|
|
partition size. At the end of this phase, each distinct canonical kmer
|
|||
|
|
has its exact total count.</p>
|
|||
|
|
<p>Abundance filter applied here: kmers with
|
|||
|
|
<code>total_count < q</code> are discarded. <code>q</code> is a
|
|||
|
|
collection parameter (0 = keep all, including singletons for ≤1x
|
|||
|
|
data).</p>
|
|||
|
|
<p>No pre-filter on super-kmer COUNT is possible at phase 2: a
|
|||
|
|
super-kmer with COUNT=1 may contain only high-abundance kmers, each
|
|||
|
|
present in many other super-kmers across the partition.</p>
|
|||
|
|
<h3 id="phase-4-super-kmer-compaction">Phase 4 — Super-kmer
|
|||
|
|
compaction</h3>
|
|||
|
|
<p>The valid kmer set from phase 3 is used as a mask to rewrite the
|
|||
|
|
super-kmer files:</p>
|
|||
|
|
<pre><code>for each dereplicated super-kmer:
|
|||
|
|
scan kmer by kmer
|
|||
|
|
kmer not in valid set → break point (terminates current super-kmer)
|
|||
|
|
kmer in valid set → extend current super-kmer</code></pre>
|
|||
|
|
<p>Three cases per super-kmer: - <strong>All kmers valid</strong> →
|
|||
|
|
copied as-is - <strong>No kmer valid</strong> → discarded -
|
|||
|
|
<strong>Mixed</strong> → split into sub-super-kmers at invalid
|
|||
|
|
boundaries; each sub-super-kmer inherits the original COUNT</p>
|
|||
|
|
<p>After splitting, re-apply dereplication (bucket sort, phase 2 method)
|
|||
|
|
— splitting can produce new identical super-kmers. This re-dereplication
|
|||
|
|
is cheap: the volume is already greatly reduced.</p>
|
|||
|
|
<p>Output: a clean super-kmer file where every kmer passes quorum. This
|
|||
|
|
file feeds phase 5.</p>
|
|||
|
|
<h3 id="phase-5-local-de-bruijn-graph-and-unitig-construction">Phase 5 —
|
|||
|
|
Local de Bruijn graph and unitig construction</h3>
|
|||
|
|
<p>Within each partition, build a <strong>local de Bruijn graph</strong>
|
|||
|
|
from the valid kmer set and compute its unitigs. All operations are
|
|||
|
|
local to the partition — no cross-partition communication.</p>
|
|||
|
|
<pre><code>valid kmers → HashSet<u64>
|
|||
|
|
|
|||
|
|
for each kmer K:
|
|||
|
|
out_degree = |{K[1:]+b | b ∈ {A,C,G,T}} ∩ HashSet|
|
|||
|
|
in_degree = |{b+K[:-1] | b ∈ {A,C,G,T}} ∩ HashSet|
|
|||
|
|
|
|||
|
|
internal node ↔ in_degree=1 AND out_degree=1
|
|||
|
|
branching / dead-end → unitig start or end</code></pre>
|
|||
|
|
<p>Traverse non-branching paths to assemble unitigs. Kmers whose
|
|||
|
|
neighbours fall in other partitions appear as dead ends locally — they
|
|||
|
|
terminate the unitig. The result: <strong>each kmer appears in exactly
|
|||
|
|
one unitig</strong> within the partition.</p>
|
|||
|
|
<p>The partition size (controlled by p) must be calibrated so that the
|
|||
|
|
HashSet fits in RAM during this phase.</p>
|
|||
|
|
<p>Output: <code>unitigs.bin</code> — the permanent evidence structure
|
|||
|
|
for the partition. Each kmer in the partition appears at exactly one
|
|||
|
|
(unitig_id, offset) location.</p>
|
|||
|
|
<p><strong>Scope of local unitigs:</strong> these are unitigs of the
|
|||
|
|
partition’s local de Bruijn graph, not global unitigs. A kmer whose k-1
|
|||
|
|
successor or predecessor falls in another partition appears as a dead
|
|||
|
|
end locally and terminates the unitig. This does not affect correctness
|
|||
|
|
of verification but means partition-local unitigs cannot be directly
|
|||
|
|
reused for global assembly.</p>
|
|||
|
|
<h3 id="phase-6-bbhash-construction-and-index-finalisation">Phase 6 —
|
|||
|
|
BBHash construction and index finalisation</h3>
|
|||
|
|
<p>Built once on the definitive kmer set (all kmers in all unitigs of
|
|||
|
|
the partition):</p>
|
|||
|
|
<pre><code>kmers from unitigs → BBHash → bbhash.bin
|
|||
|
|
→ counts.bin : packed n-bit array (or 1-bit for presence mode)
|
|||
|
|
→ refs.bin : (unitig_id: u32, kmer_offset: u8) per kmer</code></pre>
|
|||
|
|
<p>BBHash is built once — no rebuild. The n-bit width for
|
|||
|
|
<code>counts.bin</code> is chosen from the observed count distribution
|
|||
|
|
(n=5 covers ~97% of kmers at 15x; n=1 for presence mode). Counts
|
|||
|
|
exceeding 2ⁿ−1 go into <code>overflow.bin</code> as sorted
|
|||
|
|
<code>(bbhash_index: u32, count: u32)</code> pairs.</p>
|
|||
|
|
<p><strong>Exact verification via unitig evidence:</strong></p>
|
|||
|
|
<p><code>unitigs.bin</code> serves as the evidence structure: for any
|
|||
|
|
query kmer, the stored unitig provides the ground truth to confirm or
|
|||
|
|
deny its presence. BBHash maps every input to [0, N) including absent
|
|||
|
|
kmers — the unitig read-back is the only way to guarantee exactness.</p>
|
|||
|
|
<pre><code>query kmer q
|
|||
|
|
→ minimizer(q) → hash → PART, KEY → partition
|
|||
|
|
→ BBHash(q) → index i
|
|||
|
|
→ refs[i] = (unitig_id, offset)
|
|||
|
|
→ read unitig unitig_id from unitigs.bin
|
|||
|
|
→ extract kmer at offset → compare with q
|
|||
|
|
→ match : return counts[i] ← exact hit
|
|||
|
|
→ no match: kmer absent ← BBHash collision on absent kmer</code></pre>
|
|||
|
|
<p>One random disk access into <code>unitigs.bin</code> per query; the
|
|||
|
|
unitig is the minimal, non-redundant evidence (each kmer stored
|
|||
|
|
once).</p>
|
|||
|
|
<h2 id="on-disk-collection-structure">On-disk collection structure</h2>
|
|||
|
|
<p>Collections are too large to hold in RAM (hundreds of genomes,
|
|||
|
|
billions of kmers). The collection lives on disk as a directory of
|
|||
|
|
memory-mapped files:</p>
|
|||
|
|
<pre><code>collection/
|
|||
|
|
metadata.toml — collection parameters (see below)
|
|||
|
|
superkmers.bin — super-kmer sequences, end-to-end, memory-mappable
|
|||
|
|
superkmers.idx — index blocks every B super-kmers for fast seeking
|
|||
|
|
part_XXXX/
|
|||
|
|
bbhash.bin — BBHash function for this partition
|
|||
|
|
counts.bin — packed n-bit count array (or 1-bit presence array)
|
|||
|
|
refs.bin — back-references for exact verification</code></pre>
|
|||
|
|
<p><strong>Collection parameters</strong> (stored in
|
|||
|
|
<code>metadata.toml</code>):</p>
|
|||
|
|
<table>
|
|||
|
|
<thead>
|
|||
|
|
<tr>
|
|||
|
|
<th>Parameter</th>
|
|||
|
|
<th>Role</th>
|
|||
|
|
</tr>
|
|||
|
|
</thead>
|
|||
|
|
<tbody>
|
|||
|
|
<tr>
|
|||
|
|
<td>k</td>
|
|||
|
|
<td>kmer length</td>
|
|||
|
|
</tr>
|
|||
|
|
<tr>
|
|||
|
|
<td>m</td>
|
|||
|
|
<td>minimizer length (odd, < k)</td>
|
|||
|
|
</tr>
|
|||
|
|
<tr>
|
|||
|
|
<td>p</td>
|
|||
|
|
<td>partition bits (0 ≤ p ≤ min(14, 2m−16))</td>
|
|||
|
|
</tr>
|
|||
|
|
<tr>
|
|||
|
|
<td>mode</td>
|
|||
|
|
<td><code>presence</code> (1 bit/kmer) or <code>count</code> (n
|
|||
|
|
bits/kmer)</td>
|
|||
|
|
</tr>
|
|||
|
|
<tr>
|
|||
|
|
<td>n</td>
|
|||
|
|
<td>bits per kmer in count mode (chosen at construction)</td>
|
|||
|
|
</tr>
|
|||
|
|
<tr>
|
|||
|
|
<td>min_count</td>
|
|||
|
|
<td>singleton filtering threshold (0 = keep all)</td>
|
|||
|
|
</tr>
|
|||
|
|
<tr>
|
|||
|
|
<td>hash_fn</td>
|
|||
|
|
<td>hash function identifier</td>
|
|||
|
|
</tr>
|
|||
|
|
<tr>
|
|||
|
|
<td>hash_seed</td>
|
|||
|
|
<td>seed for the hash function</td>
|
|||
|
|
</tr>
|
|||
|
|
</tbody>
|
|||
|
|
</table>
|
|||
|
|
<p><strong>Exact verification via back-reference:</strong></p>
|
|||
|
|
<p>BBHash maps any input to [0, N) — including absent kmers. To avoid
|
|||
|
|
false positives, <code>refs.bin</code> stores at each BBHash index a
|
|||
|
|
back-reference <code>(superkmer_id: u32, kmer_offset: u8)</code> = 5
|
|||
|
|
bytes. Query protocol:</p>
|
|||
|
|
<pre><code>query kmer q
|
|||
|
|
→ minimizer(q) → hash → PART, KEY → partition
|
|||
|
|
→ BBHash(q) → index i
|
|||
|
|
→ refs[i] = (sk_id, offset)
|
|||
|
|
→ read super-kmer sk_id from superkmers.bin
|
|||
|
|
→ extract kmer at offset → compare with q
|
|||
|
|
→ if match: return counts[i] (exact)
|
|||
|
|
→ if no match: kmer absent</code></pre>
|
|||
|
|
<p><strong>Count storage — two modes:</strong></p>
|
|||
|
|
<p><em>Presence mode</em> (coverage ≤ 1x, or when only presence/absence
|
|||
|
|
matters): - <code>counts.bin</code> is a packed 1-bit array — all bits
|
|||
|
|
set to 1 for indexed kmers - Singletons are the signal, not filtered</p>
|
|||
|
|
<p><em>Count mode</em> (coverage > 1x): - <code>counts.bin</code> is
|
|||
|
|
a packed n-bit array; n chosen at construction from the observed
|
|||
|
|
distribution - Value 0: absent sentinel; values 1..2ⁿ−2: direct count;
|
|||
|
|
value 2ⁿ−1: overflow - Overflow counts stored in a separate
|
|||
|
|
<code>overflow.bin</code> as sorted
|
|||
|
|
<code>(index: u32, count: u32)</code> pairs - Empirically (k=31, 15x
|
|||
|
|
coverage): n=5 covers 97% of real kmers, n=6 covers 99% - min_count
|
|||
|
|
threshold filters low-frequency kmers (errors) before indexing; for ≤1x,
|
|||
|
|
min_count=0</p>
|
|||
|
|
<p><strong>BBHash:</strong> operates on kmer u64 values directly within
|
|||
|
|
each partition. The minimizer plays no role inside the partition —
|
|||
|
|
BBHash receives the kmer and builds the perfect hash naturally.</p>
|
|||
|
|
<h2 id="kmer-entropy-filter">Kmer entropy filter</h2>
|
|||
|
|
<p>Low-complexity kmers (polyA, polyT, tandem repeats) are detected and
|
|||
|
|
excluded during phase 1. The filter computes a <strong>normalized
|
|||
|
|
Shannon entropy</strong> over sub-words of multiple sizes, corrected for
|
|||
|
|
two sources of bias: the small number of observations within a single
|
|||
|
|
kmer, and the unequal sizes of circular equivalence classes.</p>
|
|||
|
|
<h3 id="sub-word-frequencies">Sub-word frequencies</h3>
|
|||
|
|
<p>For a kmer of length k and a sub-word size ws (1 ≤ ws ≤ ws_max,
|
|||
|
|
typically ws_max = 6), extract the <span
|
|||
|
|
class="math inline"><em>n</em><sub>words</sub> = <em>k</em> − <em>w</em><em>s</em> + 1</span>
|
|||
|
|
overlapping sub-words by sliding a window of length ws:</p>
|
|||
|
|
<p><span class="math display">$$w_i = \text{kmer}[i \mathinner{..}
|
|||
|
|
i+ws-1], \quad i = 0, \ldots, n_{\text{words}}-1$$</span></p>
|
|||
|
|
<p>Each sub-word is mapped to its <strong>circular canonical
|
|||
|
|
form</strong>: the lexicographic minimum among all cyclic rotations of
|
|||
|
|
the word. Let <span
|
|||
|
|
class="math inline"><em>s</em><sub><em>j</em></sub></span> be the size
|
|||
|
|
of equivalence class <span class="math inline"><em>j</em></span> (number
|
|||
|
|
of distinct raw words mapping to canonical form <span
|
|||
|
|
class="math inline"><em>j</em></span>), and <span
|
|||
|
|
class="math inline"><em>f</em><sub><em>j</em></sub></span> the count of
|
|||
|
|
canonical form <span class="math inline"><em>j</em></span> among the
|
|||
|
|
<span class="math inline"><em>n</em><sub>words</sub></span> sub-words
|
|||
|
|
(<span
|
|||
|
|
class="math inline">∑<sub><em>j</em></sub><em>f</em><sub><em>j</em></sub> = <em>n</em><sub>words</sub></span>).</p>
|
|||
|
|
<h3 id="corrected-shannon-entropy">Corrected Shannon entropy</h3>
|
|||
|
|
<p>The circular equivalence classes have unequal sizes: under a uniform
|
|||
|
|
distribution over all <span
|
|||
|
|
class="math inline">4<sup><em>w</em><em>s</em></sup></span> raw words,
|
|||
|
|
class <span class="math inline"><em>j</em></span> is visited with
|
|||
|
|
probability <span
|
|||
|
|
class="math inline"><em>s</em><sub><em>j</em></sub>/4<sup><em>w</em><em>s</em></sup></span>,
|
|||
|
|
not <span class="math inline">1/<em>n</em><sub><em>a</em></sub></span>.
|
|||
|
|
Computing entropy directly over canonical classes therefore
|
|||
|
|
underestimates the entropy of a random sequence.</p>
|
|||
|
|
<p>The correction “unfolds” each canonical class back to its member raw
|
|||
|
|
words, redistributing each observation of class <span
|
|||
|
|
class="math inline"><em>j</em></span> equally among its <span
|
|||
|
|
class="math inline"><em>s</em><sub><em>j</em></sub></span> members:</p>
|
|||
|
|
<p><span class="math display">$$H_{\text{corr}} = \log(n_{\text{words}})
|
|||
|
|
- \frac{1}{n_{\text{words}}} \sum_j f_j \log f_j +
|
|||
|
|
\frac{1}{n_{\text{words}}} \sum_j f_j \log s_j$$</span></p>
|
|||
|
|
<p>The last term is the correction for unequal class sizes. For a
|
|||
|
|
uniformly random sequence (<span
|
|||
|
|
class="math inline"><em>f</em><sub><em>j</em></sub> ≈ <em>n</em><sub>words</sub> ⋅ <em>s</em><sub><em>j</em></sub>/4<sup><em>w</em><em>s</em></sup></span>),
|
|||
|
|
this gives <span
|
|||
|
|
class="math inline"><em>H</em><sub>corr</sub> ≈ log (4<sup><em>w</em><em>s</em></sup>) = 2 ⋅ <em>w</em><em>s</em> ⋅ log 2</span>,
|
|||
|
|
the maximum entropy over raw words.</p>
|
|||
|
|
<h3 id="maximum-entropy-correction-for-small-samples">Maximum entropy
|
|||
|
|
correction for small samples</h3>
|
|||
|
|
<p>With only <span class="math inline"><em>n</em><sub>words</sub></span>
|
|||
|
|
observations over <span
|
|||
|
|
class="math inline">4<sup><em>w</em><em>s</em></sup></span> possible raw
|
|||
|
|
words, the achievable maximum entropy is bounded by the most uniform
|
|||
|
|
integer distribution over <span
|
|||
|
|
class="math inline">4<sup><em>w</em><em>s</em></sup></span>
|
|||
|
|
categories.</p>
|
|||
|
|
<p>Let <span
|
|||
|
|
class="math inline"><em>c</em> = ⌊<em>n</em><sub>words</sub>/4<sup><em>w</em><em>s</em></sup>⌋</span>
|
|||
|
|
and <span
|
|||
|
|
class="math inline"><em>r</em> = <em>n</em><sub>words</sub> mod 4<sup><em>w</em><em>s</em></sup></span>.
|
|||
|
|
The most uniform integer distribution assigns frequency <span
|
|||
|
|
class="math inline"><em>c</em> + 1</span> to <span
|
|||
|
|
class="math inline"><em>r</em></span> categories and <span
|
|||
|
|
class="math inline"><em>c</em></span> to the remaining <span
|
|||
|
|
class="math inline">4<sup><em>w</em><em>s</em></sup> − <em>r</em></span>,
|
|||
|
|
with the convention <span class="math inline">0log 0 = 0</span>:</p>
|
|||
|
|
<p><span class="math display">$$H_{\max} = -\left[(4^{ws} -
|
|||
|
|
r)\,\frac{c}{n_{\text{words}}}\log\frac{c}{n_{\text{words}}} +
|
|||
|
|
r\,\frac{c+1}{n_{\text{words}}}\log\frac{c+1}{n_{\text{words}}}\right]$$</span></p>
|
|||
|
|
<p>When <span
|
|||
|
|
class="math inline"><em>n</em><sub>words</sub> < 4<sup><em>w</em><em>s</em></sup></span>:
|
|||
|
|
<span class="math inline"><em>c</em> = 0</span>, <span
|
|||
|
|
class="math inline"><em>r</em> = <em>n</em><sub>words</sub></span>, and
|
|||
|
|
the formula reduces to <span
|
|||
|
|
class="math inline"><em>H</em><sub>max</sub> = log (<em>n</em><sub>words</sub>)</span>
|
|||
|
|
— a single unified expression covers both regimes. A truly random
|
|||
|
|
sequence achieves <span
|
|||
|
|
class="math inline"><em>H</em><sub>corr</sub> ≈ <em>H</em><sub>max</sub></span>.</p>
|
|||
|
|
<h3 id="normalized-entropy">Normalized entropy</h3>
|
|||
|
|
<p><span class="math display">$$\hat{H}(ws) =
|
|||
|
|
\frac{H_{\text{corr}}}{H_{\max}} \in [0, 1]$$</span></p>
|
|||
|
|
<h3 id="final-score">Final score</h3>
|
|||
|
|
<p>The filter computes <span
|
|||
|
|
class="math inline"><em>Ĥ</em>(<em>w</em><em>s</em>)</span> for each
|
|||
|
|
word size ws from 1 to ws_max and returns the
|
|||
|
|
<strong>minimum</strong>:</p>
|
|||
|
|
<p><span class="math display">$$\text{entropy}(kmer) =
|
|||
|
|
\min_{ws=1}^{ws_{\max}} \hat{H}(ws)$$</span></p>
|
|||
|
|
<p>A value near 0 indicates low complexity (e.g. AAAA…); near 1
|
|||
|
|
indicates high complexity. A kmer is rejected if <span
|
|||
|
|
class="math inline">entropy(<em>k</em><em>m</em><em>e</em><em>r</em>) ≤ <em>θ</em></span>,
|
|||
|
|
where <span class="math inline"><em>θ</em></span> is a collection
|
|||
|
|
parameter. The minimum across word sizes ensures that any scale of
|
|||
|
|
repetition is detected independently: polyA is caught at ws=1,
|
|||
|
|
dinucleotide repeats at ws=2, etc.</p>
|
|||
|
|
<h2 id="priority-operations">Priority operations</h2>
|
|||
|
|
<ul>
|
|||
|
|
<li>Kmer counting (frequencies)</li>
|
|||
|
|
<li>Fast search / query</li>
|
|||
|
|
<li>Set operations: union, intersection, difference</li>
|
|||
|
|
</ul>
|
|||
|
|
<h2 id="data-types">Data types</h2>
|
|||
|
|
<p>Two types of DNA are represented:</p>
|
|||
|
|
<ul>
|
|||
|
|
<li><strong>Super-kmers</strong>: stored as a nucleotide sequence (2
|
|||
|
|
bits/base, byte-aligned, no padding). Max length: 256 nucleotides (512
|
|||
|
|
bits, 64 bytes). Each super-kmer has a <strong>32-bit
|
|||
|
|
header</strong>:</li>
|
|||
|
|
</ul>
|
|||
|
|
<table>
|
|||
|
|
<thead>
|
|||
|
|
<tr>
|
|||
|
|
<th>Field</th>
|
|||
|
|
<th>Bits</th>
|
|||
|
|
<th>Role</th>
|
|||
|
|
</tr>
|
|||
|
|
</thead>
|
|||
|
|
<tbody>
|
|||
|
|
<tr>
|
|||
|
|
<td>COUNT</td>
|
|||
|
|
<td>24</td>
|
|||
|
|
<td>Occurrence count (≤ 16 M)</td>
|
|||
|
|
</tr>
|
|||
|
|
<tr>
|
|||
|
|
<td>SEQL</td>
|
|||
|
|
<td>8</td>
|
|||
|
|
<td>Sequence length in nucleotides (≤ 256)</td>
|
|||
|
|
</tr>
|
|||
|
|
</tbody>
|
|||
|
|
</table>
|
|||
|
|
<p>Bit layout (MSB to LSB):</p>
|
|||
|
|
<pre><code>[31:8] COUNT [7:0] SEQL</code></pre>
|
|||
|
|
<p>Getters:</p>
|
|||
|
|
<div class="sourceCode" id="cb12"><pre
|
|||
|
|
class="sourceCode rust"><code class="sourceCode rust"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="kw">fn</span> seql(<span class="op">&</span><span class="kw">self</span>) <span class="op">-></span> <span class="dt">u8</span> <span class="op">{</span> <span class="kw">self</span><span class="op">.</span><span class="dv">0</span> <span class="kw">as</span> <span class="dt">u8</span> <span class="op">}</span></span>
|
|||
|
|
<span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a><span class="kw">fn</span> count(<span class="op">&</span><span class="kw">self</span>) <span class="op">-></span> <span class="dt">u32</span> <span class="op">{</span> <span class="kw">self</span><span class="op">.</span><span class="dv">0</span> <span class="op">>></span> <span class="dv">8</span> <span class="op">}</span></span>
|
|||
|
|
<span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a><span class="kw">fn</span> increment(<span class="op">&</span><span class="kw">mut</span> <span class="kw">self</span>) <span class="op">{</span> <span class="kw">self</span><span class="op">.</span><span class="dv">0</span> <span class="op">+=</span> <span class="dv">1</span> <span class="op"><<</span> <span class="dv">8</span><span class="op">;</span> <span class="op">}</span></span></code></pre></div>
|
|||
|
|
<p>PART and KEY were considered as routing fields derived from
|
|||
|
|
<code>hash(minimizer)</code> but are single-use at phase 1 only —
|
|||
|
|
neither is stored. The nucleotide sequence is always byte-aligned (no
|
|||
|
|
offset): after reverse-complementing a super-kmer, a single bit-shift
|
|||
|
|
realigns the sequence so that nucleotide 0 always starts at bit 0 of
|
|||
|
|
word[0]. This shift is paid once at canonisation (phase 1, at most half
|
|||
|
|
of super-kmers) and eliminates any need for an alignment offset field.
|
|||
|
|
The direct consequence for dereplication: the byte array can be hashed
|
|||
|
|
directly without any offset adjustment.</p>
|
|||
|
|
<p>Note: minimizer selection (window minimum) biases the raw minimizer
|
|||
|
|
distribution toward lexicographically small values <span
|
|||
|
|
class="citation"
|
|||
|
|
data-cites="Zheng2020-ji Zheng2021-cc Pan2024-hb Kille2023-px Golan2025-xf">[@Zheng2020-ji;
|
|||
|
|
@Zheng2021-cc; @Pan2024-hb; @Kille2023-px; @Golan2025-xf]</span>;
|
|||
|
|
hashing corrects this before partition routing.</p>
|
|||
|
|
<ul>
|
|||
|
|
<li><strong>Kmers</strong>: short DNA subsequences (k bases),
|
|||
|
|
represented in <code>u64</code>.</li>
|
|||
|
|
</ul>
|
|||
|
|
<h2 id="kmer-representation">Kmer representation</h2>
|
|||
|
|
<p>Each base is encoded on 2 bits, so any kmer with k ≤ 31 fits in a
|
|||
|
|
<code>u64</code>.</p>
|
|||
|
|
<h3 id="bit-layout">Bit layout</h3>
|
|||
|
|
<p>Nucleotide <code>i</code> occupies bits <code>64-i</code> and
|
|||
|
|
<code>63-i</code>. For a kmer of length k, 2k bits are used; the
|
|||
|
|
low-order 64 - 2k bits of the u64 are kept at 0 and have no meaning.</p>
|
|||
|
|
<ul>
|
|||
|
|
<li>Extraction of nuc[i]:
|
|||
|
|
<code>(word >> (63 - 2*i)) & 0b11</code>.</li>
|
|||
|
|
</ul>
|
|||
|
|
<p>For kmers (k ≤ 31), the whole sequence fits in a single
|
|||
|
|
<code>u64</code>. For longer sequences up to 256 nucleotides (512 bits),
|
|||
|
|
the representation extends across multiple words.</p>
|
|||
|
|
<h3 id="complement-and-reverse-complement">Complement and reverse
|
|||
|
|
complement</h3>
|
|||
|
|
<p>The encoding is chosen so that the Watson-Crick complement of any
|
|||
|
|
base is its bitwise NOT (on 2 bits):</p>
|
|||
|
|
<table>
|
|||
|
|
<thead>
|
|||
|
|
<tr>
|
|||
|
|
<th>Base</th>
|
|||
|
|
<th>Encoding</th>
|
|||
|
|
<th>Complement</th>
|
|||
|
|
<th>NOT</th>
|
|||
|
|
</tr>
|
|||
|
|
</thead>
|
|||
|
|
<tbody>
|
|||
|
|
<tr>
|
|||
|
|
<td>A</td>
|
|||
|
|
<td><code>00</code></td>
|
|||
|
|
<td>T</td>
|
|||
|
|
<td><code>11</code></td>
|
|||
|
|
</tr>
|
|||
|
|
<tr>
|
|||
|
|
<td>C</td>
|
|||
|
|
<td><code>01</code></td>
|
|||
|
|
<td>G</td>
|
|||
|
|
<td><code>10</code></td>
|
|||
|
|
</tr>
|
|||
|
|
<tr>
|
|||
|
|
<td>G</td>
|
|||
|
|
<td><code>10</code></td>
|
|||
|
|
<td>C</td>
|
|||
|
|
<td><code>01</code></td>
|
|||
|
|
</tr>
|
|||
|
|
<tr>
|
|||
|
|
<td>T</td>
|
|||
|
|
<td><code>11</code></td>
|
|||
|
|
<td>A</td>
|
|||
|
|
<td><code>00</code></td>
|
|||
|
|
</tr>
|
|||
|
|
</tbody>
|
|||
|
|
</table>
|
|||
|
|
<p>This means <code>complement(base) = ~base & 0b11</code>, and the
|
|||
|
|
reverse complement of a kmer reduces to two cheap operations:</p>
|
|||
|
|
<pre><code>revcomp(kmer) = reverse_bases(~kmer & mask_2k)</code></pre>
|
|||
|
|
<p>where <code>mask_2k = (1 << 2k) - 1</code> zeros the unused
|
|||
|
|
high bits, and <code>reverse_bases</code> reorders the 2-bit chunks from
|
|||
|
|
LSB to MSB. No lookup table needed.</p>
|
|||
|
|
<p>DNA being double-stranded, each kmer and its reverse complement are
|
|||
|
|
equivalent. The canonical form is defined as:</p>
|
|||
|
|
<pre><code>canonical(kmer) = min(kmer, revcomp(kmer))</code></pre>
|
|||
|
|
<p>This halves the kmer space and ensures strand-independent
|
|||
|
|
counting.</p>
|
|||
|
|
<h2 id="sequence-operations">Sequence operations</h2>
|
|||
|
|
<h3 id="reverse-complement">Reverse complement</h3>
|
|||
|
|
<p>Reverse complement of a super-kmer sequence (used at most once per
|
|||
|
|
super-kmer, during phase 1 canonisation):</p>
|
|||
|
|
<ol type="1">
|
|||
|
|
<li>Reverse byte order</li>
|
|||
|
|
<li>Apply 256-entry lookup table (precomputed reverse-complemented
|
|||
|
|
bytes) to each byte</li>
|
|||
|
|
<li>Bit-shift the result to realign nucleotide 0 to bit 0 of
|
|||
|
|
word[0]</li>
|
|||
|
|
</ol>
|
|||
|
|
<p>The shift amount depends only on the sequence length:
|
|||
|
|
<code>shift = (len % 4) * 2</code> bits. After the shift, the sequence
|
|||
|
|
is byte-aligned with no offset — the same invariant as the original.</p>
|
|||
|
|
<div class="sourceCode" id="cb15"><pre
|
|||
|
|
class="sourceCode rust"><code class="sourceCode rust"><span id="cb15-1"><a href="#cb15-1" aria-hidden="true" tabindex="-1"></a><span class="kw">fn</span> reverse_complement(words<span class="op">:</span> <span class="op">&</span>[<span class="dt">u64</span>]<span class="op">,</span> len<span class="op">:</span> <span class="dt">usize</span>) <span class="op">-></span> <span class="dt">Vec</span><span class="op"><</span><span class="dt">u64</span><span class="op">></span> <span class="op">{</span></span>
|
|||
|
|
<span id="cb15-2"><a href="#cb15-2" aria-hidden="true" tabindex="-1"></a> <span class="kw">let</span> bytes <span class="op">=</span> words_to_bytes(words)<span class="op">;</span></span>
|
|||
|
|
<span id="cb15-3"><a href="#cb15-3" aria-hidden="true" tabindex="-1"></a> <span class="kw">let</span> <span class="kw">mut</span> rev<span class="op">:</span> <span class="dt">Vec</span><span class="op"><</span><span class="dt">u8</span><span class="op">></span> <span class="op">=</span> bytes<span class="op">.</span>iter()<span class="op">.</span>rev()<span class="op">.</span>map(<span class="op">|&</span>b<span class="op">|</span> LOOKUP_TABLE[b <span class="kw">as</span> <span class="dt">usize</span>])<span class="op">.</span>collect()<span class="op">;</span></span>
|
|||
|
|
<span id="cb15-4"><a href="#cb15-4" aria-hidden="true" tabindex="-1"></a> <span class="kw">let</span> shift <span class="op">=</span> (len <span class="op">%</span> <span class="dv">4</span>) <span class="op">*</span> <span class="dv">2</span><span class="op">;</span></span>
|
|||
|
|
<span id="cb15-5"><a href="#cb15-5" aria-hidden="true" tabindex="-1"></a> <span class="cf">if</span> shift <span class="op">></span> <span class="dv">0</span> <span class="op">{</span> bit_shift_left(<span class="op">&</span><span class="kw">mut</span> rev<span class="op">,</span> shift)<span class="op">;</span> <span class="op">}</span></span>
|
|||
|
|
<span id="cb15-6"><a href="#cb15-6" aria-hidden="true" tabindex="-1"></a> bytes_to_words(rev)</span>
|
|||
|
|
<span id="cb15-7"><a href="#cb15-7" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
|
|||
|
|
<h2 id="multithreading">Multithreading</h2>
|
|||
|
|
<p>For HPC with 100+ cores: use Rayon for data parallelism (parallel
|
|||
|
|
iterators), std::thread for custom threading, or async with Tokio for
|
|||
|
|
I/O-bound tasks. Equivalent to Go’s goroutines but with
|
|||
|
|
ownership/borrowing safety.</p>
|
|||
|
|
<p>Rayon: Rust crate for automatic data parallelism. Provides parallel
|
|||
|
|
iterators (par_iter) that distribute work across CPU cores, e.g.,
|
|||
|
|
vec.par_iter().map(|x| x*2).collect(). Handles thread
|
|||
|
|
spawning/synchronization transparently. Widely used and maintained by
|
|||
|
|
the Rust project. Orthogonal to Tokio (async runtime): Rayon for CPU
|
|||
|
|
parallelism, Tokio for async I/O.</p>
|
|||
|
|
<p>Data pipelines: use Crossbeam channels (MPMC) with threads/async,
|
|||
|
|
equivalent to Go’s channels + goroutines.</p>
|
|||
|
|
<p>Channels: std::sync::mpsc (built-in, MPSC, simple); crossbeam (crate,
|
|||
|
|
MPMC, faster, more features like select). Use Crossbeam for
|
|||
|
|
multi-producer/multi-consumer scenarios.</p>
|
|||
|
|
<h1 id="bibliographie">Bibliographie</h1>
|
|||
|
|
<p>\bibliography</p>
|