first implementation but far to be optimal

2026-04-16 22:38:20 +02:00
commit de3f9b16cf
19336 changed files with 380276 additions and 0 deletions
@@ -0,0 +1,100 @@
+# Chunk reader — implementation
+
+The `obiread` crate provides a streaming iterator that reads FASTA or FASTQ files in fixed-size blocks and yields self-contained chunks, each ending on a complete sequence record boundary. Chunks are consumed in parallel by downstream workers.
+
+## Output type: rope
+
+Each chunk is a `Vec<Bytes>` — a **rope**: a list of reference-counted byte slices that are not necessarily contiguous in memory. The consumer iterates over the slices in order.
+
+Using `bytes::Bytes` means the split at the record boundary is O(1): `Bytes::split_to(n)` adjusts a reference counter, not memory. No `memcpy` in the common case.
+
+## Allocation policy
+
+| Case | Cost |
+|------|------|
+| Boundary found in the current block (common) | zero extra allocation — `split_to` only |
+| Boundary straddles multiple blocks (sequence > block size, rare) | one allocation to pack the rope into a flat buffer |
+| EOF flush | zero extra allocation |
+
+## SeqChunkIter
+
+```rust
+pub struct SeqChunkIter<R: Read> { /* private */ }
+
+impl<R: Read> Iterator for SeqChunkIter<R> {
+    type Item = io::Result<Vec<Bytes>>;
+}
+
+pub fn fasta_chunks<R: Read>(source: R) -> SeqChunkIter<R>
+pub fn fastq_chunks<R: Read>(source: R) -> SeqChunkIter<R>
+```
+
+`next()` loop:
+
+```text
+1. read one block of block_size bytes → push onto rope
+2. probe check: if the boundary marker ("\n>" or "\n@") is absent from the
+   last block, skip the splitter (avoids a full backward scan for nothing)
+3. call splitter on last block
+   if found at offset n:
+       remainder = last_block.split_to(n)    ← O(1), zero copy
+       return std::mem::take(&mut self.rope)  ← the chunk
+4. if rope.len() > 1 (multi-block accumulation):
+       pack rope into one flat buffer         ← one alloc
+       retry splitter on flat buffer
+5. if EOF: flush remaining rope as final chunk
+```
+
+## Boundary detection — FASTA
+
+Backward scan with a 2-state machine. Searches for `>` immediately preceded by `\n` or `\r`:
+
+```mermaid
+stateDiagram-v2
+    direction LR
+    [*]      --> Scanning
+    Scanning --> FoundGt  : '>'
+    FoundGt  --> Scanning : other
+    FoundGt  --> [*]      : '\\n' / '\\r' ✓
+```
+
+Returns the byte offset of the `>` that starts the last complete record.
+
+## Boundary detection — FASTQ
+
+FASTQ records have a rigid 4-line structure (`@header`, sequence, `+`, quality). The `@` character (ASCII 64, Phred score 31) can appear legitimately in quality lines, making any forward heuristic unreliable. The backward scanner verifies the full structural context before accepting a candidate `@`.
+
+7-state machine (port of Go's `EndOfLastFastqEntry`), scanning from **right to left**. Each time a `+` is found, its position is saved as `restart`; any state mismatch resets the scan to that position.
+
+```mermaid
+stateDiagram-v2
+    direction LR
+
+    [*]          --> Scanning
+
+    Scanning     --> FoundPlus    : '+' (save restart)
+    FoundPlus    --> AfterNlPlus  : '\\n' / '\\r'
+    FoundPlus    --> Scanning     : other → backtrack
+
+    AfterNlPlus  --> AfterNlPlus  : séparateur
+    AfterNlPlus  --> InSequence   : lettre / - / . / [ / ]
+    AfterNlPlus  --> Scanning     : other → backtrack
+
+    InSequence   --> AfterSequence : '\\n' / '\\r'
+    InSequence   --> InSequence    : lettre / - / . / [ / ]
+    InSequence   --> Scanning      : other → backtrack
+
+    AfterSequence --> AfterSequence : '\\n' / '\\r'
+    AfterSequence --> InHeader      : other
+
+    InHeader     --> FoundAt    : '@' (save cut)
+    InHeader     --> Scanning   : '\\n' / '\\r' → backtrack
+    InHeader     --> InHeader   : other
+
+    FoundAt      --> [*]       : '\\n' / '\\r' ✓
+    FoundAt      --> InHeader  : other
+```
+
+`restart` is updated each time a `+` is found. When any state fails its expected input, the scan jumps back to `restart` and continues from there — guaranteeing that a `@` in a quality line cannot be accepted as a record start, because the `\n+\n` structure immediately following it (going backward) will not be found.
+
+Returns the byte offset of the `@` that starts the last complete record.
@@ -0,0 +1,60 @@
+# Kmer — implementation
+
+## Memory layout
+
+`Kmer` is a `#[repr(transparent)]` newtype over `u64`:
+
+```rust
+#[repr(transparent)]
+pub struct Kmer(u64);
+```
+
+Nucleotides are packed 2 bits each, **left-aligned**, MSB-first. Nucleotide 0 occupies bits 63–62; nucleotide i occupies bits 63−2i and 62−2i. The low 64−2k bits are always zero. k is **not stored** — it is a parameter of every operation that needs it, and will be owned by the future collection-level indexer.
+
+| 63–62 | 61–60 | … | 63−2(k−1)−1 to 63−2(k−1) | 63−2k down to 0 |
+|-------|-------|---|--------------------------|-----------------|
+| nt 0  | nt 1  | … | nt k−1                   | zero padding    |
+
+## Encoding
+
+`Kmer::from_ascii(ascii, k)` encodes the first k bytes of an ASCII slice using the shared `ENC` table (see [SuperKmer — ASCII encoding](superkmer.md#ascii-encoding-and-decoding)):
+
+```rust
+for i in 0..k {
+    val = (val << 2) | encode_base(ascii[i]) as u64;
+}
+Kmer(val << (64 - 2 * k))
+```
+
+Zero allocation — result lives on the stack.
+
+## Decoding
+
+`write_ascii(k, buf)` appends k ASCII characters to a caller-supplied `Vec<u8>` using the shared `DEC4` table: one lookup per 4 nucleotides, two partial-byte lookups for the remainder. No allocation in the hot path.
+
+`to_ascii(k)` is a convenience wrapper that allocates and returns a `Vec<u8>`; intended for tests and display only.
+
+## Reverse complement
+
+Computed as pure arithmetic — no lookup table, no memory access:
+
+```rust
+let x = !self.0;                                                               // complement
+let x = x.swap_bytes();                                                        // reverse bytes
+let x = ((x >> 4) & 0x0F0F0F0F0F0F0F0F) | ((x & 0x0F0F0F0F0F0F0F0F) << 4); // swap nibbles
+let x = ((x >> 2) & 0x3333333333333333) | ((x & 0x3333333333333333) << 2);   // swap 2-bit groups
+Kmer(x << (64 - 2 * k))
+```
+
+After complementing, bytes are reversed (`swap_bytes`), then nibbles, then 2-bit groups — restoring 2-bit nucleotides to their correct positions in reverse order. A final left-shift realigns to MSB. Zero allocation — result lives on the stack.
+
+## Canonical form
+
+```rust
+pub fn canonical(&self, k: usize) -> Self {
+    let rc = self.revcomp(k);
+    if self.0 <= rc.0 { *self } else { rc }
+}
+```
+
+Lexicographic minimum of forward and reverse-complement, comparing the raw `u64` values directly (left-aligned encoding makes this equivalent to nucleotide-wise comparison). Zero allocation — result lives on the stack.
@@ -0,0 +1,48 @@
+# MPHF selection — analysis in progress
+
+The choice of Minimal Perfect Hash Function for phase 6 is not yet settled. Three candidates were evaluated.
+
+## Candidates
+
+**boomphf** (BBHash algorithm, maintained by 10X Genomics):
+
+- ~3.7 bits/key; mature crate, used in production bioinformatics (Pufferfish, Piscem)
+- Parallel construction; well-tested with DNA kmer data at scale
+- Drawback: largest space footprint of the three
+
+**ptr_hash** (PtrHash algorithm, Groot Koerkamp, SEA 2025):
+
+- ~2.4 bits/key; fastest queries (≥2.1× over alternatives, 8–12 ns/key for u64 in tight loops) and fastest construction (≥3.1×)
+- Theoretical foundation solid; paper and Rust crate from the same author
+- Drawback: published February 2025 — very young, no production track record
+
+**FMPHGO** (`ph` crate, Beling, ACM JEA 2023):
+
+- ~2.1 bits/key — most compact of the three; good query speed; parallelisable construction
+- More established than ptr_hash; actively maintained
+- Currently preferred candidate
+
+## Space at scale
+
+For 1 024 partitions × 100 M kmers/partition:
+
+| MPHF    | bits/key | Total MPHF size |
+|---------|----------|-----------------|
+| boomphf | 3.7      | ~47 GB          |
+| ptr_hash | 2.4     | ~31 GB          |
+| FMPHGO  | 2.1      | ~27 GB          |
+
+In practice, partition sizes depend on the dataset. For a human genome at 30× coverage with p=10 (1 024 partitions), realistic partition sizes are 3–30 M kmers → 1–8 MB per MPHF, well within RAM.
+
+## On-disk and mmap considerations
+
+All three are in-memory structures. Their internal representation is flat bit arrays (no heap pointers), making them serialisable as contiguous byte blobs and mmappable per partition. True zero-copy access would require rkyv integration; the `ph` crate currently uses serde, so loading involves a copy. Given per-partition MPHF sizes of 1–8 MB, the OS page cache handles this transparently — strict zero-copy is a refinement, not a blocker.
+
+No established Rust crate provides a natively on-disk MPHF. **SSHash** (Sparse and Skew Hash) is a complete kmer dictionary designed for disk access and is order-preserving (overlapping kmers receive consecutive indices → cache-friendly count access), but it is C++-only and covers more than just the MPHF layer.
+
+## Open questions
+
+- Confirm actual partition sizes on representative metagenomic datasets before fixing the choice.
+- Evaluate whether ptr_hash's query speed advantage (2.1–3.3×) justifies adopting a crate that is less than a year old.
+- Assess rkyv integration cost for FMPHGO if true zero-copy mmap becomes necessary.
+- Keep SSHash in mind if the indexing architecture is reconsidered at a higher level.
@@ -0,0 +1,161 @@
+# Construction pipeline
+
+All phases after scatter are embarrassingly parallel across partitions.
+
+## Phase 0 — Parameter estimation
+
+The construction parameters p, n, and min_count depend on the kmer frequency spectrum of the dataset. Estimating this spectrum before construction avoids costly re-partitioning if p is badly chosen.
+
+Two approaches are supported:
+
+- **External estimation (preferred):** run [NT-CARD](https://github.com/bcgsc/ntCard) on the input files and pass its histogram output to `obikmer build`. NT-CARD produces a kmer frequency histogram in a single streaming pass using ntHash and a Flajolet-Martin-style estimator; obikmer reads this file and derives p, n, and min_count automatically.
+- **Internal estimation (future):** an `obikmer estimate` subcommand for users who prefer a single-tool workflow. The implementation would combine two components: (1) **ntHash**, a rolling hash that updates the kmer hash in O(1) per nucleotide by incrementally adding the incoming base and removing the outgoing one — Rust crates exist; (2) a **Flajolet-Martin-style streaming estimator** that maintains a small table of minimum hash values and infers the frequency histogram from their statistical distribution, as described in the NT-CARD paper [@Mohamadi2017-ok].
+
+The histogram gives:
+
+- **F0** (number of distinct kmers) → sets p (target ~10M kmers/partition → p = ⌈log₂(F0 / 10M)⌉)
+- **frequency distribution** → sets n (choose n so that fewer than 1% of kmers overflow)
+- **error valley** → suggests min_count (typically the local minimum between the error peak and the coverage peak)
+
+## Phase 1 — Scatter
+
+Single streaming pass over raw input files (FASTA/FASTQ, gzip). FASTQ quality scores are ignored. For each read:
+
+1. **Ambiguous base filter**: cut at any non-ACGT base; discard fragments shorter than k.
+2. **Entropy filter**: scan each fragment with a sliding window of size k. When the kmer $K_i = S[i \mathinner{..} i+k-1]$ ended by nucleotide $S[j]$ (with $j = i+k-1$) has entropy below threshold $\theta$, emit the current segment and start a new one (see algorithm below). $K_i$ belongs to neither segment, and no valid kmer is lost.
+3. **Length filter**: discard any segment shorter than k produced by step 2.
+4. **Super-kmer extraction**: for each clean segment, slide a minimizer window and group consecutive kmers sharing the same canonical minimizer; canonise each super-kmer by lexicographic comparison with its reverse complement (early exit).
+5. **Partition routing**: `hash(canonical_minimizer) → PART` → append super-kmer to `partition/superkmers.bin.gz`.
+
+**Segmentation behavior:**
+
+When $K_i$ (ended by $S[j]$, $j = i+k-1$) fails the entropy threshold:
+
+- Current segment $S[\textit{seg_start} \mathinner{..} j-1]$ is emitted (last valid kmer = $K_{i-1}$)
+- New segment starts at $S[i+1]$ (first new kmer = $K_{i+1}$)
+- $K_i$ is excluded: current segment lacks $S[j]$, new segment lacks $S[i]$
+- Overlap = $S[i+1 \mathinner{..} j-1]$ = $k-2$ nucleotides
+
+!!! abstract "Algorithm — Entropy filter: sliding window segmentation"
+    ```text
+    procedure EntropyFilter(S, N, k, θ):
+        seg_start ← 0
+        window ← []
+        for j ← 0 to N−1:
+            window.push(S[j])
+            if |window| < k: continue
+            i ← j − k + 1
+            if entropy(window) ≤ θ:
+                emit S[seg_start .. j−1]
+                seg_start ← i + 1
+                window ← S[i+1 .. j]
+            else:
+                window.pop_front()
+        emit S[seg_start .. N−1]
+    ```
+
+Writes are sequential and append-only — IO-friendly. Gzip applied at write time. Data volume ≈ raw genome size (2 bits/nt compaction offsets header overhead).
+
+## Phase 2 — Dereplication
+
+Performed independently per partition. Identical super-kmers are consolidated and their COUNT accumulated — analogous to amplicon dereplication in metabarcoding. Uses external bucket sort to stay within RAM bounds:
+
+**Pass 1** (streaming): hash the nucleotide payload of each super-kmer, route to one of B bucket files:
+```
+hash(sequence) % B → bucket_i.bin
+```
+B ≈ 100 is tunable; RAM needed ≈ partition_size / B.
+
+**Pass 2**: for each bucket, load into an in-memory `HashMap<sequence, COUNT>`, dereplicate by summing COUNT values, write consolidated super-kmers.
+
+After dereplication: at Nx coverage the partition shrinks by ~x (errors aside). The COUNT field in each super-kmer header = number of times that exact super-kmer sequence was observed across all input reads.
+
+**Important:** super-kmer COUNT ≠ individual kmer count. A kmer can appear in multiple distinct super-kmers (same partition, different flanking context); its true count = sum of COUNT of all super-kmers containing it. A super-kmer with COUNT=1 may contain only high-abundance kmers, each appearing in many other super-kmers. Abundance filtering therefore cannot be applied at this phase.
+
+## Phase 3 — Per-kmer count aggregation and quorum filtering
+
+For each dereplicated super-kmer, enumerate its kmers and accumulate counts:
+
+```
+for each super-kmer (sequence, COUNT):
+    for each kmer in sequence:
+        kmer_counts[canonical(kmer)] += COUNT
+```
+
+Implemented as an external sort or a temporary HashMap, depending on partition size. At the end of this phase, each distinct canonical kmer has its exact total count.
+
+Abundance filter applied here: kmers with `total_count < q` are discarded. `q` is a collection parameter (0 = keep all, including singletons for ≤1x data).
+
+No pre-filter on super-kmer COUNT is possible at phase 2: a super-kmer with COUNT=1 may contain only high-abundance kmers, each present in many other super-kmers across the partition.
+
+## Phase 4 — Super-kmer compaction
+
+The valid kmer set from phase 3 is used as a mask to rewrite the super-kmer files:
+
+```
+for each dereplicated super-kmer:
+    scan kmer by kmer
+    kmer not in valid set → break point (terminates current super-kmer)
+    kmer in valid set     → extend current super-kmer
+```
+
+Three cases per super-kmer:
+
+- **All kmers valid** → copied as-is
+- **No kmer valid** → discarded
+- **Mixed** → split into sub-super-kmers at invalid boundaries; each sub-super-kmer inherits the original COUNT
+
+After splitting, re-apply dereplication (bucket sort, phase 2 method) — splitting can produce new identical super-kmers. This re-dereplication is cheap: the volume is already greatly reduced.
+
+Output: a clean super-kmer file where every kmer passes quorum. This file feeds phase 5.
+
+## Phase 5 — Local de Bruijn graph and unitig construction
+
+Within each partition, build a **local de Bruijn graph** from the valid kmer set and compute its unitigs. All operations are local to the partition — no cross-partition communication.
+
+```
+valid kmers → HashSet<u64>
+
+for each kmer K:
+    out_degree = |{K[1:]+b | b ∈ {A,C,G,T}} ∩ HashSet|
+    in_degree  = |{b+K[:-1] | b ∈ {A,C,G,T}} ∩ HashSet|
+
+internal node ↔ in_degree=1 AND out_degree=1
+branching / dead-end → unitig start or end
+```
+
+Traverse non-branching paths to assemble unitigs. Kmers whose neighbours fall in other partitions appear as dead ends locally — they terminate the unitig. The result: **each kmer appears in exactly one unitig** within the partition.
+
+The partition size (controlled by p) must be calibrated so that the HashSet fits in RAM during this phase.
+
+Output: `unitigs.bin` — the permanent evidence structure for the partition. Each kmer in the partition appears at exactly one (unitig_id, offset) location.
+
+**Scope of local unitigs:** these are unitigs of the partition's local de Bruijn graph, not global unitigs. A kmer whose k-1 successor or predecessor falls in another partition appears as a dead end locally and terminates the unitig. This does not affect correctness of verification but means partition-local unitigs cannot be directly reused for global assembly.
+
+## Phase 6 — MPHF construction and index finalisation
+
+Built once on the definitive kmer set (all kmers in all unitigs of the partition):
+
+```
+kmers from unitigs → MPHF → mphf.bin
+                   → counts.bin : packed n-bit array (or 1-bit for presence mode)
+                   → refs.bin   : u32 nucleotide offset into unitigs.bin per kmer
+```
+
+The MPHF is built once — no rebuild. The n-bit width for `counts.bin` is chosen from the observed count distribution (n=5 covers ~97% of kmers at 15x; n=1 for presence mode). Counts exceeding 2ⁿ−1 go into `overflow.bin` as sorted `(mphf_index: u32, count: u32)` pairs.
+
+**Exact verification via unitig evidence:**
+
+`unitigs.bin` serves as the evidence structure: for any query kmer, the stored unitig provides the ground truth to confirm or deny its presence. The MPHF maps every input to [0, N) including absent kmers — the unitig read-back is the only way to guarantee exactness.
+
+```
+query kmer q
+  → canonical_minimizer(q) → hash → PART → part_XXXX/
+  → MPHF(q) → index i
+  → refs[i] = (unitig_id, kmer_offset)
+  → read unitig from unitigs.bin → extract kmer at kmer_offset → compare with q
+  → match   : return counts[i]   ← exact hit
+  → no match: kmer absent        ← MPHF collision on absent kmer
+```
+
+One random disk access into `unitigs.bin` per query; the unitig is the minimal, non-redundant evidence (each kmer stored once). `superkmers.bin.gz` is no longer needed at this point and can be deleted.
@@ -0,0 +1,61 @@
+# On-disk collection structure
+
+Collections are too large to hold in RAM (hundreds of genomes, billions of kmers). The collection lives on disk as a directory of memory-mapped files:
+
+```
+collection/
+  metadata.toml          — collection parameters (see below)
+  part_XXXX/
+    superkmers.bin.gz    — dereplicated super-kmers for this partition (construction artifact)
+    mphf.bin             — minimal perfect hash function for this partition
+    counts.bin           — packed n-bit count array (or 1-bit presence array)
+    refs.bin             — back-references u32 nucleotide offset into unitigs.bin per kmer
+    unitigs.bin          — local de Bruijn unitigs (permanent evidence structure)
+    overflow.bin         — counts exceeding the packed range (optional)
+```
+
+`superkmers.bin.gz` is produced during phase 1 and consumed through phases 2–4. It can be deleted after phase 5 — it is not needed for querying. The permanent query structure is `mphf.bin + counts.bin + refs.bin + unitigs.bin`.
+
+## Collection parameters
+
+Stored in `metadata.toml`:
+
+| Parameter | Role |
+|-----------|------|
+| k | kmer length |
+| m | minimizer length (odd, < k) |
+| p | partition bits (0 ≤ p ≤ min(14, 2m−16)) |
+| mode | `presence` (1 bit/kmer) or `count` (n bits/kmer) |
+| n | bits per kmer in count mode (chosen at construction) |
+| min_count | singleton filtering threshold (0 = keep all) |
+| hash_fn | hash function identifier |
+| hash_seed | seed for the hash function |
+
+## Count storage
+
+**refs.bin capacity:** `unitigs.bin` is a flat 2-bit-packed nucleotide stream with no separators. Each entry in `refs.bin` is a u32 nucleotide offset pointing to the first base of the kmer. A u32 covers 4 billion nucleotide positions = 1 GB of sequence per partition. In the worst case (all unitigs of length 1 kmer, offsets spaced k apart), this supports 4 billion / k ≈ 130 million kmers per partition at k=31. In the typical case (long unitigs, consecutive kmers at offset +1), the limit approaches 4 billion kmers — well beyond any realistic partition size.
+
+*Presence mode* (coverage ≤ 1x, or when only presence/absence matters):
+
+- `counts.bin` is a packed 1-bit array — all bits set to 1 for indexed kmers
+- Singletons are the signal, not filtered
+
+*Count mode* (coverage > 1x):
+
+- `counts.bin` is a packed n-bit array; n chosen at construction from the observed distribution
+- Value 0: absent sentinel; values 1..2ⁿ−2: direct count; value 2ⁿ−1: overflow
+- Overflow counts stored in a separate `overflow.bin` as sorted `(index: u32, count: u32)` pairs
+- Empirically (k=31, 15x coverage): n=5 covers 97% of real kmers, n=6 covers 99%
+- min_count threshold filters low-frequency kmers (errors) before indexing; for ≤1x, min_count=0
+
+## Query protocol
+
+```
+query kmer q
+  → canonical_minimizer(q) → hash → PART → part_XXXX/
+  → MPHF(q) → index i
+  → refs[i] = (unitig_id, kmer_offset)
+  → read unitig from unitigs.bin → extract kmer at kmer_offset → compare with q
+  → match   : return counts[i]
+  → no match: kmer absent
+```
@@ -0,0 +1,114 @@
+# SuperKmer — implementation
+
+## Memory layout
+
+A super-kmer is stored as a **32-bit header** followed by a **byte-aligned nucleotide sequence** (2 bits/base, nucleotide 0 at the MSB of the first byte, max 256 nt):
+
+| Field | Bits | Role |
+|-------|------|------|
+| COUNT | 24   | Occurrence count (≤ 16 M) |
+| SEQL  | 8    | Sequence length in nucleotides (1–256) |
+
+Bit layout (MSB to LSB): `[31:8] COUNT  [7:0] SEQL`
+
+SEQL is stored as a raw `u8`: values 1–255 represent lengths 1–255; **0 represents 256** (wrapping convention). The public accessor returns a `usize` and performs the conversion:
+
+```rust
+fn seql(&self)               -> usize { if s == 0 { 256 } else { s as usize } }
+fn count(&self)              -> u32   { self.0 >> 8 }
+fn increment(&mut self)               { self.0 += 1 << 8; }
+fn add(&mut self, n: u32)             { self.0 += n << 8; }
+fn set_count(&mut self, n: u32)       { self.0 = (self.0 & 0xFF) | (n << 8); }
+```
+
+The SEQL field is 8 bits, capping the stored sequence at 256 nt. Given the expected length of ~40 nt, this cap is almost never reached; when it is, the super-kmer is split at 256 nt with a k−1 overlap, preserving all kmers without duplication.
+
+The sequence is always stored in canonical form (lexicographic minimum of forward and reverse complement), with nucleotide 0 at the MSB of the first byte. The byte array can be hashed directly without any adjustment.
+
+## ASCII encoding and decoding
+
+Two lookup tables handle ASCII ↔ 2-bit conversion:
+
+- **`ENC: [u8; 32]`** — indexed by `b & 0x1F` (lower 5 bits of the ASCII byte). Maps A/a→0, C/c→1, G/g→2, T/t and U/u→3; ambiguous bases and unknowns silently map to 0 (A). 32 entries, fits entirely in L1 cache. Upper- and lowercase are handled identically.
+- **`DEC4: [u32; 256]`** — maps a packed byte (4 nucleotides) to 4 ASCII characters packed as a big-endian `u32`. 1 KB total, fits in L1 cache. One lookup per output byte yields 4 decoded characters.
+
+Encoding 4 nucleotides into one byte:
+
+```rust
+byte = ENC[c0 & 0x1F] << 6 | ENC[c1 & 0x1F] << 4 | ENC[c2 & 0x1F] << 2 | ENC[c3 & 0x1F]
+```
+
+Decoding one byte into 4 ASCII characters:
+
+```rust
+DEC4[byte].to_be_bytes()   // [nuc0, nuc1, nuc2, nuc3] in ASCII
+```
+
+## Reverse complement
+
+The reverse complement is computed **in place** with zero allocation in two steps.
+
+**Step 1 — byte swap with `REVCOMP4`.**  A 256-byte lookup table `REVCOMP4` maps each byte (4 nucleotides) to its reverse complement. Bytes are swapped from the outside in, applying `REVCOMP4` to each:
+
+```rust
+const fn revcomp4(x: u8) -> u8 {
+    let x = !x;                                       // complement all bases
+    let x = (x >> 4) | (x << 4);                     // swap nibbles
+    let x = ((x >> 2) & 0x33) | ((x & 0x33) << 2);  // swap 2-bit groups
+    x
+}
+```
+
+`REVCOMP4` is 256 bytes (fits in L1 cache), computed at compile time. No endianness dependency — all operations are pure arithmetic on byte values.
+
+**Step 2 — realignment.**  After step 1, `padding = n × 8 − SEQL × 2` spurious bits (complements of the original padding A's) appear at the start of the array. They are flushed left using `BitSlice<u8, Msb0>::rotate_left(padding)` from the `bitvec` crate, which is SIMD-accelerated. The trailing `padding` bits are then zeroed:
+
+```rust
+shift = n * 8 - SEQL * 2          // number of padding bits
+bits.rotate_left(shift)
+bits[len - shift..].fill(false)
+```
+
+`Msb0` ordering makes the bit layout hardware-independent.
+
+!!! abstract "Algorithm — Super-kmer canonisation"
+    ```text
+    procedure SuperKmerCanonical(seq, SEQL):
+        for i ← 0 to SEQL − 1:
+            fwd ← nucleotide(seq, i)
+            rev ← complement(nucleotide(seq, SEQL − 1 − i))
+            if fwd < rev: return seq                          -- forward is canonical
+            if fwd > rev: return SuperKmerRevcomp(seq, SEQL) -- revcomp is canonical
+        return seq                                            -- palindrome: either orientation valid
+    ```
+
+## Kmer extraction
+
+A k-mer is extracted from a super-kmer with `SuperKmer::kmer(i, k)`, which returns a `Kmer` — a left-aligned `u64` newtype (see [Kmer implementation](kmer.md)):
+
+```rust
+pub fn kmer(&self, i: usize, k: usize) -> Result<Kmer, KmerError>
+```
+
+The bit slice `seq[i*2 .. (i+k)*2]` (Msb0 order) is loaded as a big-endian `u64` via `bitvec::load_be`, then left-shifted to produce the canonical left-aligned layout. One call — no loop, no allocation.
+
+---
+
+!!! abstract "Algorithm — Super-kmer reverse complement"
+    ```text
+    procedure SuperKmerRevcomp(seq, SEQL):
+        n     ← ⌈SEQL / 4⌉                  -- number of bytes
+        shift ← n × 8 − SEQL × 2            -- padding bits to flush
+
+        -- step 1: swap bytes outside-in, applying REVCOMP4 to each (256-byte L1 table)
+        lo ← 0 ; hi ← n − 1
+        while lo < hi:
+            seq[lo], seq[hi] ← REVCOMP4[seq[hi]], REVCOMP4[seq[lo]]
+            lo ← lo + 1 ; hi ← hi − 1
+        if lo == hi: seq[lo] ← REVCOMP4[seq[lo]]
+
+        -- step 2: left-rotate entire bit array by shift, zero trailing bits (SIMD via bitvec)
+        if shift > 0:
+            bits.rotate_left(shift)
+            bits[n×8 − shift .. n×8].fill(0)
+    ```