obikmer is a Rust tool for manipulation, counting,
indexing, and set operations on DNA sequences represented as kmer
sets.
For sequence files (FASTA/FASTQ):
The fundamental stored unit is a canonical super-kmer: a maximal run of consecutive kmers sharing the same non-canonical (raw forward) minimizer, canonized at the end.
Super-kmers are built using the non-canonical minimizer (minimum forward m-mer, no revcomp comparison). Two consecutive kmers belong to the same super-kmer if and only if their raw forward minimizer is identical.
At the end of each super-kmer, the minimizer is canonized:
canonical(M) = min(M, revcomp(M)).
canonical(M) == M: the super-kmer is already in
canonical orientation — keep as-is.canonical(M) == revcomp(M): reverse-complement the
entire super-kmer (and its minimizer).The reverse-complement step does not cause any additional splitting — it only flips the orientation.
Using the raw forward minimizer (rather than the canonical minimizer) means fewer candidates are considered at each position → the minimizer changes more often → super-kmers are shorter or equal, never longer, compared to the canonical-minimizer definition.
The key benefit is at dereplication: a forward read and its reverse-complement covering the same genomic region both produce the same canonical super-kmer after the canonization step. Under the canonical-minimizer definition, they would share the same canonical minimizer but have different sequences (one being the revcomp of the other) — they would not dereplicate. Here they do, giving a compression factor of ~2× on paired or double-stranded data.
For a random minimizer of length m over k-mers of length k, the density of minimizer positions is approximately 2/(k−m+2) [@Zheng2020-ji; @Golan2025-xf], giving an expected super-kmer length of (k+m)/2 nucleotides. For k=31, m=13: expected ≈ 22 nt. In practice super-kmers rarely exceed a few dozen nucleotides — the 256 nt cap is almost never reached. When it is, the super-kmer is split at 256 nt with a k−1 overlap, preserving all kmers without duplication.
Memory-mappable binary format for sequence collections: store super-kmers end-to-end with u64-aligned headers (add padding if needed for 8-byte alignment). Include index blocks every B super-kmers (B=1000-10000) for fast jumping to position J (O(N/B + B) scan vs. separate index O(1) direct access), where N is total super-kmers in file. Represent header as union/struct for field access.
The canonical minimizer of a super-kmer is hashed to produce a p-bit routing value (p is a collection-level parameter):
canonical minimizer → hash(minimizer) → p-bit value → PART → partition directory
PART is computed once at phase 1 to open the correct partition file, then discarded. It is recomputed on the fly at query time. It is never stored in the super-kmer header.
Each partition holds one BBHash instance (phase 6) that indexes kmers as plain u64 values — the minimizer plays no role inside the partition.
Properties: - 2^p partitions: p freely tunable at collection creation; not constrained to powers of 4 - Uniform distribution: hashing the canonical minimizer corrects the lexicographic bias inherent in minimizer selection [@Zheng2020-ji; @Zheng2021-cc; @Pan2024-hb; @Kille2023-px; @Golan2025-xf] - The odd-length constraint on m only ensures canonical form correctness during minimizer computation — it does not affect the partition layout
Entropy constraint: the canonical minimizer has 2m bits of entropy. Requesting p bits from the hash requires 2m ≥ p, i.e. p ≤ 2m:
| m | p max | Partitions max |
|---|---|---|
| 9 | 2 | 4 |
| 11 | 6 | 64 |
| 13 | 10 | 1 024 |
| 15 | 14 | 16 384 |
For the typical configuration k=31, m=13: up to 1 024 partitions. Choosing p > 2m−16 is possible but produces artificially colliding PART values with no gain in routing granularity.
All phases after scatter are embarrassingly parallel across partitions.
Single streaming pass over raw input files (FASTA/FASTQ, gzip). FASTQ quality scores are ignored. For each read:
hash(canonical_minimizer) → PART → append super-kmer to
partition/superkmers.bin.gz.Entropy filter algorithm:
When Ki (ended by S[j], j = i + k − 1) fails the entropy threshold:
\begin{algorithm}
\caption{Entropy filter — sliding window segmentation}
\begin{algorithmic}
\PROCEDURE{EntropyFilter}{S, N, k, theta}
\STATE seg_start <- 0
\STATE window <- []
\FOR{j <- 0 \TO N-1}
\STATE window.push(S[j])
\IF{|window| < k}
\STATE continue
\ENDIF
\STATE i <- j - k + 1
\IF{entropy(window) <= theta}
\STATE emit S[seg_start .. j-1]
\STATE seg_start <- i + 1
\STATE window <- S[i+1 .. j]
\ELSE
\STATE window.pop_front()
\ENDIF
\ENDFOR
\STATE emit S[seg_start .. N-1]
\ENDPROCEDURE
\end{algorithmic}
\end{algorithm}
Writes are sequential and append-only — IO-friendly. Gzip applied at write time. Data volume ≈ raw genome size (2 bits/nt compaction offsets header overhead).
Performed independently per partition. Identical super-kmers are consolidated and their COUNT accumulated — analogous to amplicon dereplication in metabarcoding. Uses external bucket sort to stay within RAM bounds:
Pass 1 (streaming): hash the nucleotide payload of each super-kmer, route to one of B bucket files:
hash(sequence) % B → bucket_i.bin
B ≈ 100 is tunable; RAM needed ≈ partition_size / B.
Pass 2: for each bucket, load into an in-memory
HashMap<sequence, COUNT>, dereplicate by summing
COUNT values, write consolidated super-kmers.
After dereplication: at Nx coverage the partition shrinks by ~x (errors aside). The COUNT field in each super-kmer header = number of times that exact super-kmer sequence was observed across all input reads.
Important: super-kmer COUNT ≠ individual kmer count. A kmer can appear in multiple distinct super-kmers (same partition, different flanking context); its true count = sum of COUNT of all super-kmers containing it. A super-kmer with COUNT=1 may contain only high-abundance kmers, each appearing in many other super-kmers. Abundance filtering therefore cannot be applied at this phase.
For each dereplicated super-kmer, enumerate its kmers and accumulate counts:
for each super-kmer (sequence, COUNT):
for each kmer in sequence:
kmer_counts[canonical(kmer)] += COUNT
Implemented as an external sort or a temporary HashMap, depending on partition size. At the end of this phase, each distinct canonical kmer has its exact total count.
Abundance filter applied here: kmers with
total_count < q are discarded. q is a
collection parameter (0 = keep all, including singletons for ≤1x
data).
No pre-filter on super-kmer COUNT is possible at phase 2: a super-kmer with COUNT=1 may contain only high-abundance kmers, each present in many other super-kmers across the partition.
The valid kmer set from phase 3 is used as a mask to rewrite the super-kmer files:
for each dereplicated super-kmer:
scan kmer by kmer
kmer not in valid set → break point (terminates current super-kmer)
kmer in valid set → extend current super-kmer
Three cases per super-kmer: - All kmers valid → copied as-is - No kmer valid → discarded - Mixed → split into sub-super-kmers at invalid boundaries; each sub-super-kmer inherits the original COUNT
After splitting, re-apply dereplication (bucket sort, phase 2 method) — splitting can produce new identical super-kmers. This re-dereplication is cheap: the volume is already greatly reduced.
Output: a clean super-kmer file where every kmer passes quorum. This file feeds phase 5.
Within each partition, build a local de Bruijn graph from the valid kmer set and compute its unitigs. All operations are local to the partition — no cross-partition communication.
valid kmers → HashSet<u64>
for each kmer K:
out_degree = |{K[1:]+b | b ∈ {A,C,G,T}} ∩ HashSet|
in_degree = |{b+K[:-1] | b ∈ {A,C,G,T}} ∩ HashSet|
internal node ↔ in_degree=1 AND out_degree=1
branching / dead-end → unitig start or end
Traverse non-branching paths to assemble unitigs. Kmers whose neighbours fall in other partitions appear as dead ends locally — they terminate the unitig. The result: each kmer appears in exactly one unitig within the partition.
The partition size (controlled by p) must be calibrated so that the HashSet fits in RAM during this phase.
Output: unitigs.bin — the permanent evidence structure
for the partition. Each kmer in the partition appears at exactly one
(unitig_id, offset) location.
Scope of local unitigs: these are unitigs of the partition’s local de Bruijn graph, not global unitigs. A kmer whose k-1 successor or predecessor falls in another partition appears as a dead end locally and terminates the unitig. This does not affect correctness of verification but means partition-local unitigs cannot be directly reused for global assembly.
Built once on the definitive kmer set (all kmers in all unitigs of the partition):
kmers from unitigs → BBHash → bbhash.bin
→ counts.bin : packed n-bit array (or 1-bit for presence mode)
→ refs.bin : (unitig_id: u32, kmer_offset: u8) per kmer
BBHash is built once — no rebuild. The n-bit width for
counts.bin is chosen from the observed count distribution
(n=5 covers ~97% of kmers at 15x; n=1 for presence mode). Counts
exceeding 2ⁿ−1 go into overflow.bin as sorted
(bbhash_index: u32, count: u32) pairs.
Exact verification via unitig evidence:
unitigs.bin serves as the evidence structure: for any
query kmer, the stored unitig provides the ground truth to confirm or
deny its presence. BBHash maps every input to [0, N) including absent
kmers — the unitig read-back is the only way to guarantee exactness.
query kmer q
→ minimizer(q) → hash → PART, KEY → partition
→ BBHash(q) → index i
→ refs[i] = (unitig_id, offset)
→ read unitig unitig_id from unitigs.bin
→ extract kmer at offset → compare with q
→ match : return counts[i] ← exact hit
→ no match: kmer absent ← BBHash collision on absent kmer
One random disk access into unitigs.bin per query; the
unitig is the minimal, non-redundant evidence (each kmer stored
once).
Collections are too large to hold in RAM (hundreds of genomes, billions of kmers). The collection lives on disk as a directory of memory-mapped files:
collection/
metadata.toml — collection parameters (see below)
superkmers.bin — super-kmer sequences, end-to-end, memory-mappable
superkmers.idx — index blocks every B super-kmers for fast seeking
part_XXXX/
bbhash.bin — BBHash function for this partition
counts.bin — packed n-bit count array (or 1-bit presence array)
refs.bin — back-references for exact verification
Collection parameters (stored in
metadata.toml):
| Parameter | Role |
|---|---|
| k | kmer length |
| m | minimizer length (odd, < k) |
| p | partition bits (0 ≤ p ≤ min(14, 2m−16)) |
| mode | presence (1 bit/kmer) or count (n
bits/kmer) |
| n | bits per kmer in count mode (chosen at construction) |
| min_count | singleton filtering threshold (0 = keep all) |
| hash_fn | hash function identifier |
| hash_seed | seed for the hash function |
Exact verification via back-reference:
BBHash maps any input to [0, N) — including absent kmers. To avoid
false positives, refs.bin stores at each BBHash index a
back-reference (superkmer_id: u32, kmer_offset: u8) = 5
bytes. Query protocol:
query kmer q
→ minimizer(q) → hash → PART, KEY → partition
→ BBHash(q) → index i
→ refs[i] = (sk_id, offset)
→ read super-kmer sk_id from superkmers.bin
→ extract kmer at offset → compare with q
→ if match: return counts[i] (exact)
→ if no match: kmer absent
Count storage — two modes:
Presence mode (coverage ≤ 1x, or when only presence/absence
matters): - counts.bin is a packed 1-bit array — all bits
set to 1 for indexed kmers - Singletons are the signal, not filtered
Count mode (coverage > 1x): - counts.bin is
a packed n-bit array; n chosen at construction from the observed
distribution - Value 0: absent sentinel; values 1..2ⁿ−2: direct count;
value 2ⁿ−1: overflow - Overflow counts stored in a separate
overflow.bin as sorted
(index: u32, count: u32) pairs - Empirically (k=31, 15x
coverage): n=5 covers 97% of real kmers, n=6 covers 99% - min_count
threshold filters low-frequency kmers (errors) before indexing; for ≤1x,
min_count=0
BBHash: operates on kmer u64 values directly within each partition. The minimizer plays no role inside the partition — BBHash receives the kmer and builds the perfect hash naturally.
Low-complexity kmers (polyA, polyT, tandem repeats) are detected and excluded during phase 1. The filter computes a normalized Shannon entropy over sub-words of multiple sizes, corrected for two sources of bias: the small number of observations within a single kmer, and the unequal sizes of circular equivalence classes.
For a kmer of length k and a sub-word size ws (1 ≤ ws ≤ ws_max, typically ws_max = 6), extract the nwords = k − ws + 1 overlapping sub-words by sliding a window of length ws:
$$w_i = \text{kmer}[i \mathinner{..} i+ws-1], \quad i = 0, \ldots, n_{\text{words}}-1$$
Each sub-word is mapped to its circular canonical form: the lexicographic minimum among all cyclic rotations of the word. Let sj be the size of equivalence class j (number of distinct raw words mapping to canonical form j), and fj the count of canonical form j among the nwords sub-words (∑jfj = nwords).
The circular equivalence classes have unequal sizes: under a uniform distribution over all 4ws raw words, class j is visited with probability sj/4ws, not 1/na. Computing entropy directly over canonical classes therefore underestimates the entropy of a random sequence.
The correction “unfolds” each canonical class back to its member raw words, redistributing each observation of class j equally among its sj members:
$$H_{\text{corr}} = \log(n_{\text{words}}) - \frac{1}{n_{\text{words}}} \sum_j f_j \log f_j + \frac{1}{n_{\text{words}}} \sum_j f_j \log s_j$$
The last term is the correction for unequal class sizes. For a uniformly random sequence (fj ≈ nwords ⋅ sj/4ws), this gives Hcorr ≈ log (4ws) = 2 ⋅ ws ⋅ log 2, the maximum entropy over raw words.
With only nwords observations over 4ws possible raw words, the achievable maximum entropy is bounded by the most uniform integer distribution over 4ws categories.
Let c = ⌊nwords/4ws⌋ and r = nwords mod 4ws. The most uniform integer distribution assigns frequency c + 1 to r categories and c to the remaining 4ws − r, with the convention 0log 0 = 0:
$$H_{\max} = -\left[(4^{ws} - r)\,\frac{c}{n_{\text{words}}}\log\frac{c}{n_{\text{words}}} + r\,\frac{c+1}{n_{\text{words}}}\log\frac{c+1}{n_{\text{words}}}\right]$$
When nwords < 4ws: c = 0, r = nwords, and the formula reduces to Hmax = log (nwords) — a single unified expression covers both regimes. A truly random sequence achieves Hcorr ≈ Hmax.
$$\hat{H}(ws) = \frac{H_{\text{corr}}}{H_{\max}} \in [0, 1]$$
The filter computes Ĥ(ws) for each word size ws from 1 to ws_max and returns the minimum:
$$\text{entropy}(kmer) = \min_{ws=1}^{ws_{\max}} \hat{H}(ws)$$
A value near 0 indicates low complexity (e.g. AAAA…); near 1 indicates high complexity. A kmer is rejected if entropy(kmer) ≤ θ, where θ is a collection parameter. The minimum across word sizes ensures that any scale of repetition is detected independently: polyA is caught at ws=1, dinucleotide repeats at ws=2, etc.
Two types of DNA are represented:
| Field | Bits | Role |
|---|---|---|
| COUNT | 24 | Occurrence count (≤ 16 M) |
| SEQL | 8 | Sequence length in nucleotides (≤ 256) |
Bit layout (MSB to LSB):
[31:8] COUNT [7:0] SEQL
Getters:
fn seql(&self) -> u8 { self.0 as u8 }
fn count(&self) -> u32 { self.0 >> 8 }
fn increment(&mut self) { self.0 += 1 << 8; }PART and KEY were considered as routing fields derived from
hash(minimizer) but are single-use at phase 1 only —
neither is stored. The nucleotide sequence is always byte-aligned (no
offset): after reverse-complementing a super-kmer, a single bit-shift
realigns the sequence so that nucleotide 0 always starts at bit 0 of
word[0]. This shift is paid once at canonisation (phase 1, at most half
of super-kmers) and eliminates any need for an alignment offset field.
The direct consequence for dereplication: the byte array can be hashed
directly without any offset adjustment.
Note: minimizer selection (window minimum) biases the raw minimizer distribution toward lexicographically small values [@Zheng2020-ji; @Zheng2021-cc; @Pan2024-hb; @Kille2023-px; @Golan2025-xf]; hashing corrects this before partition routing.
u64.Each base is encoded on 2 bits, so any kmer with k ≤ 31 fits in a
u64.
Nucleotide i occupies bits 64-i and
63-i. For a kmer of length k, 2k bits are used; the
low-order 64 - 2k bits of the u64 are kept at 0 and have no meaning.
(word >> (63 - 2*i)) & 0b11.For kmers (k ≤ 31), the whole sequence fits in a single
u64. For longer sequences up to 256 nucleotides (512 bits),
the representation extends across multiple words.
The encoding is chosen so that the Watson-Crick complement of any base is its bitwise NOT (on 2 bits):
| Base | Encoding | Complement | NOT |
|---|---|---|---|
| A | 00 |
T | 11 |
| C | 01 |
G | 10 |
| G | 10 |
C | 01 |
| T | 11 |
A | 00 |
This means complement(base) = ~base & 0b11, and the
reverse complement of a kmer reduces to two cheap operations:
revcomp(kmer) = reverse_bases(~kmer & mask_2k)
where mask_2k = (1 << 2k) - 1 zeros the unused
high bits, and reverse_bases reorders the 2-bit chunks from
LSB to MSB. No lookup table needed.
DNA being double-stranded, each kmer and its reverse complement are equivalent. The canonical form is defined as:
canonical(kmer) = min(kmer, revcomp(kmer))
This halves the kmer space and ensures strand-independent counting.
Reverse complement of a super-kmer sequence (used at most once per super-kmer, during phase 1 canonisation):
The shift amount depends only on the sequence length:
shift = (len % 4) * 2 bits. After the shift, the sequence
is byte-aligned with no offset — the same invariant as the original.
fn reverse_complement(words: &[u64], len: usize) -> Vec<u64> {
let bytes = words_to_bytes(words);
let mut rev: Vec<u8> = bytes.iter().rev().map(|&b| LOOKUP_TABLE[b as usize]).collect();
let shift = (len % 4) * 2;
if shift > 0 { bit_shift_left(&mut rev, shift); }
bytes_to_words(rev)
}For HPC with 100+ cores: use Rayon for data parallelism (parallel iterators), std::thread for custom threading, or async with Tokio for I/O-bound tasks. Equivalent to Go’s goroutines but with ownership/borrowing safety.
Rayon: Rust crate for automatic data parallelism. Provides parallel iterators (par_iter) that distribute work across CPU cores, e.g., vec.par_iter().map(|x| x*2).collect(). Handles thread spawning/synchronization transparently. Widely used and maintained by the Rust project. Orthogonal to Tokio (async runtime): Rayon for CPU parallelism, Tokio for async I/O.
Data pipelines: use Crossbeam channels (MPMC) with threads/async, equivalent to Go’s channels + goroutines.
Channels: std::sync::mpsc (built-in, MPSC, simple); crossbeam (crate, MPMC, faster, more features like select). Use Crossbeam for multi-producer/multi-consumer scenarios.
\bibliography