obikmer

obikmer is a Rust tool for manipulation, counting, indexing, and set operations on DNA sequences represented as kmer sets.

Constraints

Input processing

For sequence files (FASTA/FASTQ):

Super-kmers

The fundamental stored unit is a canonical super-kmer: a maximal run of consecutive kmers sharing the same non-canonical (raw forward) minimizer, canonized at the end.

Construction

Super-kmers are built using the non-canonical minimizer (minimum forward m-mer, no revcomp comparison). Two consecutive kmers belong to the same super-kmer if and only if their raw forward minimizer is identical.

At the end of each super-kmer, the minimizer is canonized: canonical(M) = min(M, revcomp(M)).

The reverse-complement step does not cause any additional splitting — it only flips the orientation.

Why non-canonical minimizer?

Using the raw forward minimizer (rather than the canonical minimizer) means fewer candidates are considered at each position → the minimizer changes more often → super-kmers are shorter or equal, never longer, compared to the canonical-minimizer definition.

The key benefit is at dereplication: a forward read and its reverse-complement covering the same genomic region both produce the same canonical super-kmer after the canonization step. Under the canonical-minimizer definition, they would share the same canonical minimizer but have different sequences (one being the revcomp of the other) — they would not dereplicate. Here they do, giving a compression factor of ~2× on paired or double-stranded data.

Length

For a random minimizer of length m over k-mers of length k, the density of minimizer positions is approximately 2/(k−m+2) [@Zheng2020-ji; @Golan2025-xf], giving an expected super-kmer length of (k+m)/2 nucleotides. For k=31, m=13: expected ≈ 22 nt. In practice super-kmers rarely exceed a few dozen nucleotides — the 256 nt cap is almost never reached. When it is, the super-kmer is split at 256 nt with a k−1 overlap, preserving all kmers without duplication.

Binary storage format

Memory-mappable binary format for sequence collections: store super-kmers end-to-end with u64-aligned headers (add padding if needed for 8-byte alignment). Include index blocks every B super-kmers (B=1000-10000) for fast jumping to position J (O(N/B + B) scan vs. separate index O(1) direct access), where N is total super-kmers in file. Represent header as union/struct for field access.

Partitioning and indexing architecture

The canonical minimizer of a super-kmer is hashed to produce a p-bit routing value (p is a collection-level parameter):

canonical minimizer → hash(minimizer) → p-bit value → PART → partition directory

PART is computed once at phase 1 to open the correct partition file, then discarded. It is recomputed on the fly at query time. It is never stored in the super-kmer header.

Each partition holds one BBHash instance (phase 6) that indexes kmers as plain u64 values — the minimizer plays no role inside the partition.

Properties: - 2^p partitions: p freely tunable at collection creation; not constrained to powers of 4 - Uniform distribution: hashing the canonical minimizer corrects the lexicographic bias inherent in minimizer selection [@Zheng2020-ji; @Zheng2021-cc; @Pan2024-hb; @Kille2023-px; @Golan2025-xf] - The odd-length constraint on m only ensures canonical form correctness during minimizer computation — it does not affect the partition layout

Entropy constraint: the canonical minimizer has 2m bits of entropy. Requesting p bits from the hash requires 2m ≥ p, i.e. p ≤ 2m:

m p max Partitions max
9 2 4
11 6 64
13 10 1 024
15 14 16 384

For the typical configuration k=31, m=13: up to 1 024 partitions. Choosing p > 2m−16 is possible but produces artificially colliding PART values with no gain in routing granularity.

Construction pipeline

All phases after scatter are embarrassingly parallel across partitions.

Phase 1 — Scatter

Single streaming pass over raw input files (FASTA/FASTQ, gzip). FASTQ quality scores are ignored. For each read:

  1. Ambiguous base filter: cut at any non-ACGT base; discard fragments shorter than k.
  2. Entropy filter: scan each fragment with a sliding window of size k. When the kmer $K_i = S[i \mathinner{..} i+k-1]$ ended by nucleotide S[j] (with j = i + k − 1) has entropy below threshold θ, emit the current segment and start a new one (see algorithm below). Ki belongs to neither segment, and no valid kmer is lost.
  3. Super-kmer extraction: for each clean segment, slide a minimizer window and group consecutive kmers sharing the same raw minimizer; canonise at the end.
  4. Partition routing: hash(canonical_minimizer) → PART → append super-kmer to partition/superkmers.bin.gz.

Entropy filter algorithm:

When Ki (ended by S[j], j = i + k − 1) fails the entropy threshold:

\begin{algorithm}
\caption{Entropy filter — sliding window segmentation}
\begin{algorithmic}
\PROCEDURE{EntropyFilter}{S, N, k, theta}
    \STATE seg_start <- 0
    \STATE window <- []
    \FOR{j <- 0 \TO N-1}
        \STATE window.push(S[j])
        \IF{|window| < k}
            \STATE continue
        \ENDIF
        \STATE i <- j - k + 1
        \IF{entropy(window) <= theta}
            \STATE emit S[seg_start .. j-1]
            \STATE seg_start <- i + 1
            \STATE window <- S[i+1 .. j]
        \ELSE
            \STATE window.pop_front()
        \ENDIF
    \ENDFOR
    \STATE emit S[seg_start .. N-1]
\ENDPROCEDURE
\end{algorithmic}
\end{algorithm}

Writes are sequential and append-only — IO-friendly. Gzip applied at write time. Data volume ≈ raw genome size (2 bits/nt compaction offsets header overhead).

Phase 2 — Dereplication

Performed independently per partition. Identical super-kmers are consolidated and their COUNT accumulated — analogous to amplicon dereplication in metabarcoding. Uses external bucket sort to stay within RAM bounds:

Pass 1 (streaming): hash the nucleotide payload of each super-kmer, route to one of B bucket files:

hash(sequence) % B → bucket_i.bin

B ≈ 100 is tunable; RAM needed ≈ partition_size / B.

Pass 2: for each bucket, load into an in-memory HashMap<sequence, COUNT>, dereplicate by summing COUNT values, write consolidated super-kmers.

After dereplication: at Nx coverage the partition shrinks by ~x (errors aside). The COUNT field in each super-kmer header = number of times that exact super-kmer sequence was observed across all input reads.

Important: super-kmer COUNT ≠ individual kmer count. A kmer can appear in multiple distinct super-kmers (same partition, different flanking context); its true count = sum of COUNT of all super-kmers containing it. A super-kmer with COUNT=1 may contain only high-abundance kmers, each appearing in many other super-kmers. Abundance filtering therefore cannot be applied at this phase.

Phase 3 — Per-kmer count aggregation and quorum filtering

For each dereplicated super-kmer, enumerate its kmers and accumulate counts:

for each super-kmer (sequence, COUNT):
    for each kmer in sequence:
        kmer_counts[canonical(kmer)] += COUNT

Implemented as an external sort or a temporary HashMap, depending on partition size. At the end of this phase, each distinct canonical kmer has its exact total count.

Abundance filter applied here: kmers with total_count < q are discarded. q is a collection parameter (0 = keep all, including singletons for ≤1x data).

No pre-filter on super-kmer COUNT is possible at phase 2: a super-kmer with COUNT=1 may contain only high-abundance kmers, each present in many other super-kmers across the partition.

Phase 4 — Super-kmer compaction

The valid kmer set from phase 3 is used as a mask to rewrite the super-kmer files:

for each dereplicated super-kmer:
    scan kmer by kmer
    kmer not in valid set → break point (terminates current super-kmer)
    kmer in valid set     → extend current super-kmer

Three cases per super-kmer: - All kmers valid → copied as-is - No kmer valid → discarded - Mixed → split into sub-super-kmers at invalid boundaries; each sub-super-kmer inherits the original COUNT

After splitting, re-apply dereplication (bucket sort, phase 2 method) — splitting can produce new identical super-kmers. This re-dereplication is cheap: the volume is already greatly reduced.

Output: a clean super-kmer file where every kmer passes quorum. This file feeds phase 5.

Phase 5 — Local de Bruijn graph and unitig construction

Within each partition, build a local de Bruijn graph from the valid kmer set and compute its unitigs. All operations are local to the partition — no cross-partition communication.

valid kmers → HashSet<u64>

for each kmer K:
    out_degree = |{K[1:]+b | b ∈ {A,C,G,T}} ∩ HashSet|
    in_degree  = |{b+K[:-1] | b ∈ {A,C,G,T}} ∩ HashSet|

internal node ↔ in_degree=1 AND out_degree=1
branching / dead-end → unitig start or end

Traverse non-branching paths to assemble unitigs. Kmers whose neighbours fall in other partitions appear as dead ends locally — they terminate the unitig. The result: each kmer appears in exactly one unitig within the partition.

The partition size (controlled by p) must be calibrated so that the HashSet fits in RAM during this phase.

Output: unitigs.bin — the permanent evidence structure for the partition. Each kmer in the partition appears at exactly one (unitig_id, offset) location.

Scope of local unitigs: these are unitigs of the partition’s local de Bruijn graph, not global unitigs. A kmer whose k-1 successor or predecessor falls in another partition appears as a dead end locally and terminates the unitig. This does not affect correctness of verification but means partition-local unitigs cannot be directly reused for global assembly.

Phase 6 — BBHash construction and index finalisation

Built once on the definitive kmer set (all kmers in all unitigs of the partition):

kmers from unitigs → BBHash → bbhash.bin
                   → counts.bin : packed n-bit array (or 1-bit for presence mode)
                   → refs.bin   : (unitig_id: u32, kmer_offset: u8) per kmer

BBHash is built once — no rebuild. The n-bit width for counts.bin is chosen from the observed count distribution (n=5 covers ~97% of kmers at 15x; n=1 for presence mode). Counts exceeding 2ⁿ−1 go into overflow.bin as sorted (bbhash_index: u32, count: u32) pairs.

Exact verification via unitig evidence:

unitigs.bin serves as the evidence structure: for any query kmer, the stored unitig provides the ground truth to confirm or deny its presence. BBHash maps every input to [0, N) including absent kmers — the unitig read-back is the only way to guarantee exactness.

query kmer q
  → minimizer(q) → hash → PART, KEY → partition
  → BBHash(q) → index i
  → refs[i] = (unitig_id, offset)
  → read unitig unitig_id from unitigs.bin
  → extract kmer at offset → compare with q
  → match   : return counts[i]   ← exact hit
  → no match: kmer absent        ← BBHash collision on absent kmer

One random disk access into unitigs.bin per query; the unitig is the minimal, non-redundant evidence (each kmer stored once).

On-disk collection structure

Collections are too large to hold in RAM (hundreds of genomes, billions of kmers). The collection lives on disk as a directory of memory-mapped files:

collection/
  metadata.toml          — collection parameters (see below)
  superkmers.bin         — super-kmer sequences, end-to-end, memory-mappable
  superkmers.idx         — index blocks every B super-kmers for fast seeking
  part_XXXX/
    bbhash.bin           — BBHash function for this partition
    counts.bin           — packed n-bit count array (or 1-bit presence array)
    refs.bin             — back-references for exact verification

Collection parameters (stored in metadata.toml):

Parameter Role
k kmer length
m minimizer length (odd, < k)
p partition bits (0 ≤ p ≤ min(14, 2m−16))
mode presence (1 bit/kmer) or count (n bits/kmer)
n bits per kmer in count mode (chosen at construction)
min_count singleton filtering threshold (0 = keep all)
hash_fn hash function identifier
hash_seed seed for the hash function

Exact verification via back-reference:

BBHash maps any input to [0, N) — including absent kmers. To avoid false positives, refs.bin stores at each BBHash index a back-reference (superkmer_id: u32, kmer_offset: u8) = 5 bytes. Query protocol:

query kmer q
  → minimizer(q) → hash → PART, KEY → partition
  → BBHash(q) → index i
  → refs[i] = (sk_id, offset)
  → read super-kmer sk_id from superkmers.bin
  → extract kmer at offset → compare with q
  → if match: return counts[i]   (exact)
  → if no match: kmer absent

Count storage — two modes:

Presence mode (coverage ≤ 1x, or when only presence/absence matters): - counts.bin is a packed 1-bit array — all bits set to 1 for indexed kmers - Singletons are the signal, not filtered

Count mode (coverage > 1x): - counts.bin is a packed n-bit array; n chosen at construction from the observed distribution - Value 0: absent sentinel; values 1..2ⁿ−2: direct count; value 2ⁿ−1: overflow - Overflow counts stored in a separate overflow.bin as sorted (index: u32, count: u32) pairs - Empirically (k=31, 15x coverage): n=5 covers 97% of real kmers, n=6 covers 99% - min_count threshold filters low-frequency kmers (errors) before indexing; for ≤1x, min_count=0

BBHash: operates on kmer u64 values directly within each partition. The minimizer plays no role inside the partition — BBHash receives the kmer and builds the perfect hash naturally.

Kmer entropy filter

Low-complexity kmers (polyA, polyT, tandem repeats) are detected and excluded during phase 1. The filter computes a normalized Shannon entropy over sub-words of multiple sizes, corrected for two sources of bias: the small number of observations within a single kmer, and the unequal sizes of circular equivalence classes.

Sub-word frequencies

For a kmer of length k and a sub-word size ws (1 ≤ ws ≤ ws_max, typically ws_max = 6), extract the nwords = k − ws + 1 overlapping sub-words by sliding a window of length ws:

$$w_i = \text{kmer}[i \mathinner{..} i+ws-1], \quad i = 0, \ldots, n_{\text{words}}-1$$

Each sub-word is mapped to its circular canonical form: the lexicographic minimum among all cyclic rotations of the word. Let sj be the size of equivalence class j (number of distinct raw words mapping to canonical form j), and fj the count of canonical form j among the nwords sub-words (jfj = nwords).

Corrected Shannon entropy

The circular equivalence classes have unequal sizes: under a uniform distribution over all 4ws raw words, class j is visited with probability sj/4ws, not 1/na. Computing entropy directly over canonical classes therefore underestimates the entropy of a random sequence.

The correction “unfolds” each canonical class back to its member raw words, redistributing each observation of class j equally among its sj members:

$$H_{\text{corr}} = \log(n_{\text{words}}) - \frac{1}{n_{\text{words}}} \sum_j f_j \log f_j + \frac{1}{n_{\text{words}}} \sum_j f_j \log s_j$$

The last term is the correction for unequal class sizes. For a uniformly random sequence (fj ≈ nwords ⋅ sj/4ws), this gives Hcorr ≈ log (4ws) = 2 ⋅ ws ⋅ log 2, the maximum entropy over raw words.

Maximum entropy correction for small samples

With only nwords observations over 4ws possible raw words, the achievable maximum entropy is bounded by the most uniform integer distribution over 4ws categories.

Let c = ⌊nwords/4ws and r = nwords mod  4ws. The most uniform integer distribution assigns frequency c + 1 to r categories and c to the remaining 4ws − r, with the convention 0log 0 = 0:

$$H_{\max} = -\left[(4^{ws} - r)\,\frac{c}{n_{\text{words}}}\log\frac{c}{n_{\text{words}}} + r\,\frac{c+1}{n_{\text{words}}}\log\frac{c+1}{n_{\text{words}}}\right]$$

When nwords < 4ws: c = 0, r = nwords, and the formula reduces to Hmax = log (nwords) — a single unified expression covers both regimes. A truly random sequence achieves Hcorr ≈ Hmax.

Normalized entropy

$$\hat{H}(ws) = \frac{H_{\text{corr}}}{H_{\max}} \in [0, 1]$$

Final score

The filter computes (ws) for each word size ws from 1 to ws_max and returns the minimum:

$$\text{entropy}(kmer) = \min_{ws=1}^{ws_{\max}} \hat{H}(ws)$$

A value near 0 indicates low complexity (e.g. AAAA…); near 1 indicates high complexity. A kmer is rejected if entropy(kmer) ≤ θ, where θ is a collection parameter. The minimum across word sizes ensures that any scale of repetition is detected independently: polyA is caught at ws=1, dinucleotide repeats at ws=2, etc.

Priority operations

Data types

Two types of DNA are represented:

Field Bits Role
COUNT 24 Occurrence count (≤ 16 M)
SEQL 8 Sequence length in nucleotides (≤ 256)

Bit layout (MSB to LSB):

[31:8] COUNT  [7:0] SEQL

Getters:

fn seql(&self)         -> u8  { self.0 as u8 }
fn count(&self)        -> u32 { self.0 >> 8 }
fn increment(&mut self)       { self.0 += 1 << 8; }

PART and KEY were considered as routing fields derived from hash(minimizer) but are single-use at phase 1 only — neither is stored. The nucleotide sequence is always byte-aligned (no offset): after reverse-complementing a super-kmer, a single bit-shift realigns the sequence so that nucleotide 0 always starts at bit 0 of word[0]. This shift is paid once at canonisation (phase 1, at most half of super-kmers) and eliminates any need for an alignment offset field. The direct consequence for dereplication: the byte array can be hashed directly without any offset adjustment.

Note: minimizer selection (window minimum) biases the raw minimizer distribution toward lexicographically small values [@Zheng2020-ji; @Zheng2021-cc; @Pan2024-hb; @Kille2023-px; @Golan2025-xf]; hashing corrects this before partition routing.

Kmer representation

Each base is encoded on 2 bits, so any kmer with k ≤ 31 fits in a u64.

Bit layout

Nucleotide i occupies bits 64-i and 63-i. For a kmer of length k, 2k bits are used; the low-order 64 - 2k bits of the u64 are kept at 0 and have no meaning.

For kmers (k ≤ 31), the whole sequence fits in a single u64. For longer sequences up to 256 nucleotides (512 bits), the representation extends across multiple words.

Complement and reverse complement

The encoding is chosen so that the Watson-Crick complement of any base is its bitwise NOT (on 2 bits):

Base Encoding Complement NOT
A 00 T 11
C 01 G 10
G 10 C 01
T 11 A 00

This means complement(base) = ~base & 0b11, and the reverse complement of a kmer reduces to two cheap operations:

revcomp(kmer) = reverse_bases(~kmer & mask_2k)

where mask_2k = (1 << 2k) - 1 zeros the unused high bits, and reverse_bases reorders the 2-bit chunks from LSB to MSB. No lookup table needed.

DNA being double-stranded, each kmer and its reverse complement are equivalent. The canonical form is defined as:

canonical(kmer) = min(kmer, revcomp(kmer))

This halves the kmer space and ensures strand-independent counting.

Sequence operations

Reverse complement

Reverse complement of a super-kmer sequence (used at most once per super-kmer, during phase 1 canonisation):

  1. Reverse byte order
  2. Apply 256-entry lookup table (precomputed reverse-complemented bytes) to each byte
  3. Bit-shift the result to realign nucleotide 0 to bit 0 of word[0]

The shift amount depends only on the sequence length: shift = (len % 4) * 2 bits. After the shift, the sequence is byte-aligned with no offset — the same invariant as the original.

fn reverse_complement(words: &[u64], len: usize) -> Vec<u64> {
    let bytes = words_to_bytes(words);
    let mut rev: Vec<u8> = bytes.iter().rev().map(|&b| LOOKUP_TABLE[b as usize]).collect();
    let shift = (len % 4) * 2;
    if shift > 0 { bit_shift_left(&mut rev, shift); }
    bytes_to_words(rev)
}

Multithreading

For HPC with 100+ cores: use Rayon for data parallelism (parallel iterators), std::thread for custom threading, or async with Tokio for I/O-bound tasks. Equivalent to Go’s goroutines but with ownership/borrowing safety.

Rayon: Rust crate for automatic data parallelism. Provides parallel iterators (par_iter) that distribute work across CPU cores, e.g., vec.par_iter().map(|x| x*2).collect(). Handles thread spawning/synchronization transparently. Widely used and maintained by the Rust project. Orthogonal to Tokio (async runtime): Rayon for CPU parallelism, Tokio for async I/O.

Data pipelines: use Crossbeam channels (MPMC) with threads/async, equivalent to Go’s channels + goroutines.

Channels: std::sync::mpsc (built-in, MPSC, simple); crossbeam (crate, MPMC, faster, more features like select). Use Crossbeam for multi-producer/multi-consumer scenarios.

Bibliographie

\bibliography