first implementation but far to be optimal
This commit is contained in:
@@ -0,0 +1,100 @@
|
||||
# Chunk reader — implementation
|
||||
|
||||
The `obiread` crate provides a streaming iterator that reads FASTA or FASTQ files in fixed-size blocks and yields self-contained chunks, each ending on a complete sequence record boundary. Chunks are consumed in parallel by downstream workers.
|
||||
|
||||
## Output type: rope
|
||||
|
||||
Each chunk is a `Vec<Bytes>` — a **rope**: a list of reference-counted byte slices that are not necessarily contiguous in memory. The consumer iterates over the slices in order.
|
||||
|
||||
Using `bytes::Bytes` means the split at the record boundary is O(1): `Bytes::split_to(n)` adjusts a reference counter, not memory. No `memcpy` in the common case.
|
||||
|
||||
## Allocation policy
|
||||
|
||||
| Case | Cost |
|
||||
|------|------|
|
||||
| Boundary found in the current block (common) | zero extra allocation — `split_to` only |
|
||||
| Boundary straddles multiple blocks (sequence > block size, rare) | one allocation to pack the rope into a flat buffer |
|
||||
| EOF flush | zero extra allocation |
|
||||
|
||||
## SeqChunkIter
|
||||
|
||||
```rust
|
||||
pub struct SeqChunkIter<R: Read> { /* private */ }
|
||||
|
||||
impl<R: Read> Iterator for SeqChunkIter<R> {
|
||||
type Item = io::Result<Vec<Bytes>>;
|
||||
}
|
||||
|
||||
pub fn fasta_chunks<R: Read>(source: R) -> SeqChunkIter<R>
|
||||
pub fn fastq_chunks<R: Read>(source: R) -> SeqChunkIter<R>
|
||||
```
|
||||
|
||||
`next()` loop:
|
||||
|
||||
```text
|
||||
1. read one block of block_size bytes → push onto rope
|
||||
2. probe check: if the boundary marker ("\n>" or "\n@") is absent from the
|
||||
last block, skip the splitter (avoids a full backward scan for nothing)
|
||||
3. call splitter on last block
|
||||
if found at offset n:
|
||||
remainder = last_block.split_to(n) ← O(1), zero copy
|
||||
return std::mem::take(&mut self.rope) ← the chunk
|
||||
4. if rope.len() > 1 (multi-block accumulation):
|
||||
pack rope into one flat buffer ← one alloc
|
||||
retry splitter on flat buffer
|
||||
5. if EOF: flush remaining rope as final chunk
|
||||
```
|
||||
|
||||
## Boundary detection — FASTA
|
||||
|
||||
Backward scan with a 2-state machine. Searches for `>` immediately preceded by `\n` or `\r`:
|
||||
|
||||
```mermaid
|
||||
stateDiagram-v2
|
||||
direction LR
|
||||
[*] --> Scanning
|
||||
Scanning --> FoundGt : '>'
|
||||
FoundGt --> Scanning : other
|
||||
FoundGt --> [*] : '\\n' / '\\r' ✓
|
||||
```
|
||||
|
||||
Returns the byte offset of the `>` that starts the last complete record.
|
||||
|
||||
## Boundary detection — FASTQ
|
||||
|
||||
FASTQ records have a rigid 4-line structure (`@header`, sequence, `+`, quality). The `@` character (ASCII 64, Phred score 31) can appear legitimately in quality lines, making any forward heuristic unreliable. The backward scanner verifies the full structural context before accepting a candidate `@`.
|
||||
|
||||
7-state machine (port of Go's `EndOfLastFastqEntry`), scanning from **right to left**. Each time a `+` is found, its position is saved as `restart`; any state mismatch resets the scan to that position.
|
||||
|
||||
```mermaid
|
||||
stateDiagram-v2
|
||||
direction LR
|
||||
|
||||
[*] --> Scanning
|
||||
|
||||
Scanning --> FoundPlus : '+' (save restart)
|
||||
FoundPlus --> AfterNlPlus : '\\n' / '\\r'
|
||||
FoundPlus --> Scanning : other → backtrack
|
||||
|
||||
AfterNlPlus --> AfterNlPlus : séparateur
|
||||
AfterNlPlus --> InSequence : lettre / - / . / [ / ]
|
||||
AfterNlPlus --> Scanning : other → backtrack
|
||||
|
||||
InSequence --> AfterSequence : '\\n' / '\\r'
|
||||
InSequence --> InSequence : lettre / - / . / [ / ]
|
||||
InSequence --> Scanning : other → backtrack
|
||||
|
||||
AfterSequence --> AfterSequence : '\\n' / '\\r'
|
||||
AfterSequence --> InHeader : other
|
||||
|
||||
InHeader --> FoundAt : '@' (save cut)
|
||||
InHeader --> Scanning : '\\n' / '\\r' → backtrack
|
||||
InHeader --> InHeader : other
|
||||
|
||||
FoundAt --> [*] : '\\n' / '\\r' ✓
|
||||
FoundAt --> InHeader : other
|
||||
```
|
||||
|
||||
`restart` is updated each time a `+` is found. When any state fails its expected input, the scan jumps back to `restart` and continues from there — guaranteeing that a `@` in a quality line cannot be accepted as a record start, because the `\n+\n` structure immediately following it (going backward) will not be found.
|
||||
|
||||
Returns the byte offset of the `@` that starts the last complete record.
|
||||
@@ -0,0 +1,60 @@
|
||||
# Kmer — implementation
|
||||
|
||||
## Memory layout
|
||||
|
||||
`Kmer` is a `#[repr(transparent)]` newtype over `u64`:
|
||||
|
||||
```rust
|
||||
#[repr(transparent)]
|
||||
pub struct Kmer(u64);
|
||||
```
|
||||
|
||||
Nucleotides are packed 2 bits each, **left-aligned**, MSB-first. Nucleotide 0 occupies bits 63–62; nucleotide i occupies bits 63−2i and 62−2i. The low 64−2k bits are always zero. k is **not stored** — it is a parameter of every operation that needs it, and will be owned by the future collection-level indexer.
|
||||
|
||||
| 63–62 | 61–60 | … | 63−2(k−1)−1 to 63−2(k−1) | 63−2k down to 0 |
|
||||
|-------|-------|---|--------------------------|-----------------|
|
||||
| nt 0 | nt 1 | … | nt k−1 | zero padding |
|
||||
|
||||
## Encoding
|
||||
|
||||
`Kmer::from_ascii(ascii, k)` encodes the first k bytes of an ASCII slice using the shared `ENC` table (see [SuperKmer — ASCII encoding](superkmer.md#ascii-encoding-and-decoding)):
|
||||
|
||||
```rust
|
||||
for i in 0..k {
|
||||
val = (val << 2) | encode_base(ascii[i]) as u64;
|
||||
}
|
||||
Kmer(val << (64 - 2 * k))
|
||||
```
|
||||
|
||||
Zero allocation — result lives on the stack.
|
||||
|
||||
## Decoding
|
||||
|
||||
`write_ascii(k, buf)` appends k ASCII characters to a caller-supplied `Vec<u8>` using the shared `DEC4` table: one lookup per 4 nucleotides, two partial-byte lookups for the remainder. No allocation in the hot path.
|
||||
|
||||
`to_ascii(k)` is a convenience wrapper that allocates and returns a `Vec<u8>`; intended for tests and display only.
|
||||
|
||||
## Reverse complement
|
||||
|
||||
Computed as pure arithmetic — no lookup table, no memory access:
|
||||
|
||||
```rust
|
||||
let x = !self.0; // complement
|
||||
let x = x.swap_bytes(); // reverse bytes
|
||||
let x = ((x >> 4) & 0x0F0F0F0F0F0F0F0F) | ((x & 0x0F0F0F0F0F0F0F0F) << 4); // swap nibbles
|
||||
let x = ((x >> 2) & 0x3333333333333333) | ((x & 0x3333333333333333) << 2); // swap 2-bit groups
|
||||
Kmer(x << (64 - 2 * k))
|
||||
```
|
||||
|
||||
After complementing, bytes are reversed (`swap_bytes`), then nibbles, then 2-bit groups — restoring 2-bit nucleotides to their correct positions in reverse order. A final left-shift realigns to MSB. Zero allocation — result lives on the stack.
|
||||
|
||||
## Canonical form
|
||||
|
||||
```rust
|
||||
pub fn canonical(&self, k: usize) -> Self {
|
||||
let rc = self.revcomp(k);
|
||||
if self.0 <= rc.0 { *self } else { rc }
|
||||
}
|
||||
```
|
||||
|
||||
Lexicographic minimum of forward and reverse-complement, comparing the raw `u64` values directly (left-aligned encoding makes this equivalent to nucleotide-wise comparison). Zero allocation — result lives on the stack.
|
||||
@@ -0,0 +1,48 @@
|
||||
# MPHF selection — analysis in progress
|
||||
|
||||
The choice of Minimal Perfect Hash Function for phase 6 is not yet settled. Three candidates were evaluated.
|
||||
|
||||
## Candidates
|
||||
|
||||
**boomphf** (BBHash algorithm, maintained by 10X Genomics):
|
||||
|
||||
- ~3.7 bits/key; mature crate, used in production bioinformatics (Pufferfish, Piscem)
|
||||
- Parallel construction; well-tested with DNA kmer data at scale
|
||||
- Drawback: largest space footprint of the three
|
||||
|
||||
**ptr_hash** (PtrHash algorithm, Groot Koerkamp, SEA 2025):
|
||||
|
||||
- ~2.4 bits/key; fastest queries (≥2.1× over alternatives, 8–12 ns/key for u64 in tight loops) and fastest construction (≥3.1×)
|
||||
- Theoretical foundation solid; paper and Rust crate from the same author
|
||||
- Drawback: published February 2025 — very young, no production track record
|
||||
|
||||
**FMPHGO** (`ph` crate, Beling, ACM JEA 2023):
|
||||
|
||||
- ~2.1 bits/key — most compact of the three; good query speed; parallelisable construction
|
||||
- More established than ptr_hash; actively maintained
|
||||
- Currently preferred candidate
|
||||
|
||||
## Space at scale
|
||||
|
||||
For 1 024 partitions × 100 M kmers/partition:
|
||||
|
||||
| MPHF | bits/key | Total MPHF size |
|
||||
|---------|----------|-----------------|
|
||||
| boomphf | 3.7 | ~47 GB |
|
||||
| ptr_hash | 2.4 | ~31 GB |
|
||||
| FMPHGO | 2.1 | ~27 GB |
|
||||
|
||||
In practice, partition sizes depend on the dataset. For a human genome at 30× coverage with p=10 (1 024 partitions), realistic partition sizes are 3–30 M kmers → 1–8 MB per MPHF, well within RAM.
|
||||
|
||||
## On-disk and mmap considerations
|
||||
|
||||
All three are in-memory structures. Their internal representation is flat bit arrays (no heap pointers), making them serialisable as contiguous byte blobs and mmappable per partition. True zero-copy access would require rkyv integration; the `ph` crate currently uses serde, so loading involves a copy. Given per-partition MPHF sizes of 1–8 MB, the OS page cache handles this transparently — strict zero-copy is a refinement, not a blocker.
|
||||
|
||||
No established Rust crate provides a natively on-disk MPHF. **SSHash** (Sparse and Skew Hash) is a complete kmer dictionary designed for disk access and is order-preserving (overlapping kmers receive consecutive indices → cache-friendly count access), but it is C++-only and covers more than just the MPHF layer.
|
||||
|
||||
## Open questions
|
||||
|
||||
- Confirm actual partition sizes on representative metagenomic datasets before fixing the choice.
|
||||
- Evaluate whether ptr_hash's query speed advantage (2.1–3.3×) justifies adopting a crate that is less than a year old.
|
||||
- Assess rkyv integration cost for FMPHGO if true zero-copy mmap becomes necessary.
|
||||
- Keep SSHash in mind if the indexing architecture is reconsidered at a higher level.
|
||||
@@ -0,0 +1,161 @@
|
||||
# Construction pipeline
|
||||
|
||||
All phases after scatter are embarrassingly parallel across partitions.
|
||||
|
||||
## Phase 0 — Parameter estimation
|
||||
|
||||
The construction parameters p, n, and min_count depend on the kmer frequency spectrum of the dataset. Estimating this spectrum before construction avoids costly re-partitioning if p is badly chosen.
|
||||
|
||||
Two approaches are supported:
|
||||
|
||||
- **External estimation (preferred):** run [NT-CARD](https://github.com/bcgsc/ntCard) on the input files and pass its histogram output to `obikmer build`. NT-CARD produces a kmer frequency histogram in a single streaming pass using ntHash and a Flajolet-Martin-style estimator; obikmer reads this file and derives p, n, and min_count automatically.
|
||||
- **Internal estimation (future):** an `obikmer estimate` subcommand for users who prefer a single-tool workflow. The implementation would combine two components: (1) **ntHash**, a rolling hash that updates the kmer hash in O(1) per nucleotide by incrementally adding the incoming base and removing the outgoing one — Rust crates exist; (2) a **Flajolet-Martin-style streaming estimator** that maintains a small table of minimum hash values and infers the frequency histogram from their statistical distribution, as described in the NT-CARD paper [@Mohamadi2017-ok].
|
||||
|
||||
The histogram gives:
|
||||
|
||||
- **F0** (number of distinct kmers) → sets p (target ~10M kmers/partition → p = ⌈log₂(F0 / 10M)⌉)
|
||||
- **frequency distribution** → sets n (choose n so that fewer than 1% of kmers overflow)
|
||||
- **error valley** → suggests min_count (typically the local minimum between the error peak and the coverage peak)
|
||||
|
||||
## Phase 1 — Scatter
|
||||
|
||||
Single streaming pass over raw input files (FASTA/FASTQ, gzip). FASTQ quality scores are ignored. For each read:
|
||||
|
||||
1. **Ambiguous base filter**: cut at any non-ACGT base; discard fragments shorter than k.
|
||||
2. **Entropy filter**: scan each fragment with a sliding window of size k. When the kmer $K_i = S[i \mathinner{..} i+k-1]$ ended by nucleotide $S[j]$ (with $j = i+k-1$) has entropy below threshold $\theta$, emit the current segment and start a new one (see algorithm below). $K_i$ belongs to neither segment, and no valid kmer is lost.
|
||||
3. **Length filter**: discard any segment shorter than k produced by step 2.
|
||||
4. **Super-kmer extraction**: for each clean segment, slide a minimizer window and group consecutive kmers sharing the same canonical minimizer; canonise each super-kmer by lexicographic comparison with its reverse complement (early exit).
|
||||
5. **Partition routing**: `hash(canonical_minimizer) → PART` → append super-kmer to `partition/superkmers.bin.gz`.
|
||||
|
||||
**Segmentation behavior:**
|
||||
|
||||
When $K_i$ (ended by $S[j]$, $j = i+k-1$) fails the entropy threshold:
|
||||
|
||||
- Current segment $S[\textit{seg_start} \mathinner{..} j-1]$ is emitted (last valid kmer = $K_{i-1}$)
|
||||
- New segment starts at $S[i+1]$ (first new kmer = $K_{i+1}$)
|
||||
- $K_i$ is excluded: current segment lacks $S[j]$, new segment lacks $S[i]$
|
||||
- Overlap = $S[i+1 \mathinner{..} j-1]$ = $k-2$ nucleotides
|
||||
|
||||
!!! abstract "Algorithm — Entropy filter: sliding window segmentation"
|
||||
```text
|
||||
procedure EntropyFilter(S, N, k, θ):
|
||||
seg_start ← 0
|
||||
window ← []
|
||||
for j ← 0 to N−1:
|
||||
window.push(S[j])
|
||||
if |window| < k: continue
|
||||
i ← j − k + 1
|
||||
if entropy(window) ≤ θ:
|
||||
emit S[seg_start .. j−1]
|
||||
seg_start ← i + 1
|
||||
window ← S[i+1 .. j]
|
||||
else:
|
||||
window.pop_front()
|
||||
emit S[seg_start .. N−1]
|
||||
```
|
||||
|
||||
Writes are sequential and append-only — IO-friendly. Gzip applied at write time. Data volume ≈ raw genome size (2 bits/nt compaction offsets header overhead).
|
||||
|
||||
## Phase 2 — Dereplication
|
||||
|
||||
Performed independently per partition. Identical super-kmers are consolidated and their COUNT accumulated — analogous to amplicon dereplication in metabarcoding. Uses external bucket sort to stay within RAM bounds:
|
||||
|
||||
**Pass 1** (streaming): hash the nucleotide payload of each super-kmer, route to one of B bucket files:
|
||||
```
|
||||
hash(sequence) % B → bucket_i.bin
|
||||
```
|
||||
B ≈ 100 is tunable; RAM needed ≈ partition_size / B.
|
||||
|
||||
**Pass 2**: for each bucket, load into an in-memory `HashMap<sequence, COUNT>`, dereplicate by summing COUNT values, write consolidated super-kmers.
|
||||
|
||||
After dereplication: at Nx coverage the partition shrinks by ~x (errors aside). The COUNT field in each super-kmer header = number of times that exact super-kmer sequence was observed across all input reads.
|
||||
|
||||
**Important:** super-kmer COUNT ≠ individual kmer count. A kmer can appear in multiple distinct super-kmers (same partition, different flanking context); its true count = sum of COUNT of all super-kmers containing it. A super-kmer with COUNT=1 may contain only high-abundance kmers, each appearing in many other super-kmers. Abundance filtering therefore cannot be applied at this phase.
|
||||
|
||||
## Phase 3 — Per-kmer count aggregation and quorum filtering
|
||||
|
||||
For each dereplicated super-kmer, enumerate its kmers and accumulate counts:
|
||||
|
||||
```
|
||||
for each super-kmer (sequence, COUNT):
|
||||
for each kmer in sequence:
|
||||
kmer_counts[canonical(kmer)] += COUNT
|
||||
```
|
||||
|
||||
Implemented as an external sort or a temporary HashMap, depending on partition size. At the end of this phase, each distinct canonical kmer has its exact total count.
|
||||
|
||||
Abundance filter applied here: kmers with `total_count < q` are discarded. `q` is a collection parameter (0 = keep all, including singletons for ≤1x data).
|
||||
|
||||
No pre-filter on super-kmer COUNT is possible at phase 2: a super-kmer with COUNT=1 may contain only high-abundance kmers, each present in many other super-kmers across the partition.
|
||||
|
||||
## Phase 4 — Super-kmer compaction
|
||||
|
||||
The valid kmer set from phase 3 is used as a mask to rewrite the super-kmer files:
|
||||
|
||||
```
|
||||
for each dereplicated super-kmer:
|
||||
scan kmer by kmer
|
||||
kmer not in valid set → break point (terminates current super-kmer)
|
||||
kmer in valid set → extend current super-kmer
|
||||
```
|
||||
|
||||
Three cases per super-kmer:
|
||||
|
||||
- **All kmers valid** → copied as-is
|
||||
- **No kmer valid** → discarded
|
||||
- **Mixed** → split into sub-super-kmers at invalid boundaries; each sub-super-kmer inherits the original COUNT
|
||||
|
||||
After splitting, re-apply dereplication (bucket sort, phase 2 method) — splitting can produce new identical super-kmers. This re-dereplication is cheap: the volume is already greatly reduced.
|
||||
|
||||
Output: a clean super-kmer file where every kmer passes quorum. This file feeds phase 5.
|
||||
|
||||
## Phase 5 — Local de Bruijn graph and unitig construction
|
||||
|
||||
Within each partition, build a **local de Bruijn graph** from the valid kmer set and compute its unitigs. All operations are local to the partition — no cross-partition communication.
|
||||
|
||||
```
|
||||
valid kmers → HashSet<u64>
|
||||
|
||||
for each kmer K:
|
||||
out_degree = |{K[1:]+b | b ∈ {A,C,G,T}} ∩ HashSet|
|
||||
in_degree = |{b+K[:-1] | b ∈ {A,C,G,T}} ∩ HashSet|
|
||||
|
||||
internal node ↔ in_degree=1 AND out_degree=1
|
||||
branching / dead-end → unitig start or end
|
||||
```
|
||||
|
||||
Traverse non-branching paths to assemble unitigs. Kmers whose neighbours fall in other partitions appear as dead ends locally — they terminate the unitig. The result: **each kmer appears in exactly one unitig** within the partition.
|
||||
|
||||
The partition size (controlled by p) must be calibrated so that the HashSet fits in RAM during this phase.
|
||||
|
||||
Output: `unitigs.bin` — the permanent evidence structure for the partition. Each kmer in the partition appears at exactly one (unitig_id, offset) location.
|
||||
|
||||
**Scope of local unitigs:** these are unitigs of the partition's local de Bruijn graph, not global unitigs. A kmer whose k-1 successor or predecessor falls in another partition appears as a dead end locally and terminates the unitig. This does not affect correctness of verification but means partition-local unitigs cannot be directly reused for global assembly.
|
||||
|
||||
## Phase 6 — MPHF construction and index finalisation
|
||||
|
||||
Built once on the definitive kmer set (all kmers in all unitigs of the partition):
|
||||
|
||||
```
|
||||
kmers from unitigs → MPHF → mphf.bin
|
||||
→ counts.bin : packed n-bit array (or 1-bit for presence mode)
|
||||
→ refs.bin : u32 nucleotide offset into unitigs.bin per kmer
|
||||
```
|
||||
|
||||
The MPHF is built once — no rebuild. The n-bit width for `counts.bin` is chosen from the observed count distribution (n=5 covers ~97% of kmers at 15x; n=1 for presence mode). Counts exceeding 2ⁿ−1 go into `overflow.bin` as sorted `(mphf_index: u32, count: u32)` pairs.
|
||||
|
||||
**Exact verification via unitig evidence:**
|
||||
|
||||
`unitigs.bin` serves as the evidence structure: for any query kmer, the stored unitig provides the ground truth to confirm or deny its presence. The MPHF maps every input to [0, N) including absent kmers — the unitig read-back is the only way to guarantee exactness.
|
||||
|
||||
```
|
||||
query kmer q
|
||||
→ canonical_minimizer(q) → hash → PART → part_XXXX/
|
||||
→ MPHF(q) → index i
|
||||
→ refs[i] = (unitig_id, kmer_offset)
|
||||
→ read unitig from unitigs.bin → extract kmer at kmer_offset → compare with q
|
||||
→ match : return counts[i] ← exact hit
|
||||
→ no match: kmer absent ← MPHF collision on absent kmer
|
||||
```
|
||||
|
||||
One random disk access into `unitigs.bin` per query; the unitig is the minimal, non-redundant evidence (each kmer stored once). `superkmers.bin.gz` is no longer needed at this point and can be deleted.
|
||||
@@ -0,0 +1,61 @@
|
||||
# On-disk collection structure
|
||||
|
||||
Collections are too large to hold in RAM (hundreds of genomes, billions of kmers). The collection lives on disk as a directory of memory-mapped files:
|
||||
|
||||
```
|
||||
collection/
|
||||
metadata.toml — collection parameters (see below)
|
||||
part_XXXX/
|
||||
superkmers.bin.gz — dereplicated super-kmers for this partition (construction artifact)
|
||||
mphf.bin — minimal perfect hash function for this partition
|
||||
counts.bin — packed n-bit count array (or 1-bit presence array)
|
||||
refs.bin — back-references u32 nucleotide offset into unitigs.bin per kmer
|
||||
unitigs.bin — local de Bruijn unitigs (permanent evidence structure)
|
||||
overflow.bin — counts exceeding the packed range (optional)
|
||||
```
|
||||
|
||||
`superkmers.bin.gz` is produced during phase 1 and consumed through phases 2–4. It can be deleted after phase 5 — it is not needed for querying. The permanent query structure is `mphf.bin + counts.bin + refs.bin + unitigs.bin`.
|
||||
|
||||
## Collection parameters
|
||||
|
||||
Stored in `metadata.toml`:
|
||||
|
||||
| Parameter | Role |
|
||||
|-----------|------|
|
||||
| k | kmer length |
|
||||
| m | minimizer length (odd, < k) |
|
||||
| p | partition bits (0 ≤ p ≤ min(14, 2m−16)) |
|
||||
| mode | `presence` (1 bit/kmer) or `count` (n bits/kmer) |
|
||||
| n | bits per kmer in count mode (chosen at construction) |
|
||||
| min_count | singleton filtering threshold (0 = keep all) |
|
||||
| hash_fn | hash function identifier |
|
||||
| hash_seed | seed for the hash function |
|
||||
|
||||
## Count storage
|
||||
|
||||
**refs.bin capacity:** `unitigs.bin` is a flat 2-bit-packed nucleotide stream with no separators. Each entry in `refs.bin` is a u32 nucleotide offset pointing to the first base of the kmer. A u32 covers 4 billion nucleotide positions = 1 GB of sequence per partition. In the worst case (all unitigs of length 1 kmer, offsets spaced k apart), this supports 4 billion / k ≈ 130 million kmers per partition at k=31. In the typical case (long unitigs, consecutive kmers at offset +1), the limit approaches 4 billion kmers — well beyond any realistic partition size.
|
||||
|
||||
*Presence mode* (coverage ≤ 1x, or when only presence/absence matters):
|
||||
|
||||
- `counts.bin` is a packed 1-bit array — all bits set to 1 for indexed kmers
|
||||
- Singletons are the signal, not filtered
|
||||
|
||||
*Count mode* (coverage > 1x):
|
||||
|
||||
- `counts.bin` is a packed n-bit array; n chosen at construction from the observed distribution
|
||||
- Value 0: absent sentinel; values 1..2ⁿ−2: direct count; value 2ⁿ−1: overflow
|
||||
- Overflow counts stored in a separate `overflow.bin` as sorted `(index: u32, count: u32)` pairs
|
||||
- Empirically (k=31, 15x coverage): n=5 covers 97% of real kmers, n=6 covers 99%
|
||||
- min_count threshold filters low-frequency kmers (errors) before indexing; for ≤1x, min_count=0
|
||||
|
||||
## Query protocol
|
||||
|
||||
```
|
||||
query kmer q
|
||||
→ canonical_minimizer(q) → hash → PART → part_XXXX/
|
||||
→ MPHF(q) → index i
|
||||
→ refs[i] = (unitig_id, kmer_offset)
|
||||
→ read unitig from unitigs.bin → extract kmer at kmer_offset → compare with q
|
||||
→ match : return counts[i]
|
||||
→ no match: kmer absent
|
||||
```
|
||||
@@ -0,0 +1,114 @@
|
||||
# SuperKmer — implementation
|
||||
|
||||
## Memory layout
|
||||
|
||||
A super-kmer is stored as a **32-bit header** followed by a **byte-aligned nucleotide sequence** (2 bits/base, nucleotide 0 at the MSB of the first byte, max 256 nt):
|
||||
|
||||
| Field | Bits | Role |
|
||||
|-------|------|------|
|
||||
| COUNT | 24 | Occurrence count (≤ 16 M) |
|
||||
| SEQL | 8 | Sequence length in nucleotides (1–256) |
|
||||
|
||||
Bit layout (MSB to LSB): `[31:8] COUNT [7:0] SEQL`
|
||||
|
||||
SEQL is stored as a raw `u8`: values 1–255 represent lengths 1–255; **0 represents 256** (wrapping convention). The public accessor returns a `usize` and performs the conversion:
|
||||
|
||||
```rust
|
||||
fn seql(&self) -> usize { if s == 0 { 256 } else { s as usize } }
|
||||
fn count(&self) -> u32 { self.0 >> 8 }
|
||||
fn increment(&mut self) { self.0 += 1 << 8; }
|
||||
fn add(&mut self, n: u32) { self.0 += n << 8; }
|
||||
fn set_count(&mut self, n: u32) { self.0 = (self.0 & 0xFF) | (n << 8); }
|
||||
```
|
||||
|
||||
The SEQL field is 8 bits, capping the stored sequence at 256 nt. Given the expected length of ~40 nt, this cap is almost never reached; when it is, the super-kmer is split at 256 nt with a k−1 overlap, preserving all kmers without duplication.
|
||||
|
||||
The sequence is always stored in canonical form (lexicographic minimum of forward and reverse complement), with nucleotide 0 at the MSB of the first byte. The byte array can be hashed directly without any adjustment.
|
||||
|
||||
## ASCII encoding and decoding
|
||||
|
||||
Two lookup tables handle ASCII ↔ 2-bit conversion:
|
||||
|
||||
- **`ENC: [u8; 32]`** — indexed by `b & 0x1F` (lower 5 bits of the ASCII byte). Maps A/a→0, C/c→1, G/g→2, T/t and U/u→3; ambiguous bases and unknowns silently map to 0 (A). 32 entries, fits entirely in L1 cache. Upper- and lowercase are handled identically.
|
||||
- **`DEC4: [u32; 256]`** — maps a packed byte (4 nucleotides) to 4 ASCII characters packed as a big-endian `u32`. 1 KB total, fits in L1 cache. One lookup per output byte yields 4 decoded characters.
|
||||
|
||||
Encoding 4 nucleotides into one byte:
|
||||
|
||||
```rust
|
||||
byte = ENC[c0 & 0x1F] << 6 | ENC[c1 & 0x1F] << 4 | ENC[c2 & 0x1F] << 2 | ENC[c3 & 0x1F]
|
||||
```
|
||||
|
||||
Decoding one byte into 4 ASCII characters:
|
||||
|
||||
```rust
|
||||
DEC4[byte].to_be_bytes() // [nuc0, nuc1, nuc2, nuc3] in ASCII
|
||||
```
|
||||
|
||||
## Reverse complement
|
||||
|
||||
The reverse complement is computed **in place** with zero allocation in two steps.
|
||||
|
||||
**Step 1 — byte swap with `REVCOMP4`.** A 256-byte lookup table `REVCOMP4` maps each byte (4 nucleotides) to its reverse complement. Bytes are swapped from the outside in, applying `REVCOMP4` to each:
|
||||
|
||||
```rust
|
||||
const fn revcomp4(x: u8) -> u8 {
|
||||
let x = !x; // complement all bases
|
||||
let x = (x >> 4) | (x << 4); // swap nibbles
|
||||
let x = ((x >> 2) & 0x33) | ((x & 0x33) << 2); // swap 2-bit groups
|
||||
x
|
||||
}
|
||||
```
|
||||
|
||||
`REVCOMP4` is 256 bytes (fits in L1 cache), computed at compile time. No endianness dependency — all operations are pure arithmetic on byte values.
|
||||
|
||||
**Step 2 — realignment.** After step 1, `padding = n × 8 − SEQL × 2` spurious bits (complements of the original padding A's) appear at the start of the array. They are flushed left using `BitSlice<u8, Msb0>::rotate_left(padding)` from the `bitvec` crate, which is SIMD-accelerated. The trailing `padding` bits are then zeroed:
|
||||
|
||||
```rust
|
||||
shift = n * 8 - SEQL * 2 // number of padding bits
|
||||
bits.rotate_left(shift)
|
||||
bits[len - shift..].fill(false)
|
||||
```
|
||||
|
||||
`Msb0` ordering makes the bit layout hardware-independent.
|
||||
|
||||
!!! abstract "Algorithm — Super-kmer canonisation"
|
||||
```text
|
||||
procedure SuperKmerCanonical(seq, SEQL):
|
||||
for i ← 0 to SEQL − 1:
|
||||
fwd ← nucleotide(seq, i)
|
||||
rev ← complement(nucleotide(seq, SEQL − 1 − i))
|
||||
if fwd < rev: return seq -- forward is canonical
|
||||
if fwd > rev: return SuperKmerRevcomp(seq, SEQL) -- revcomp is canonical
|
||||
return seq -- palindrome: either orientation valid
|
||||
```
|
||||
|
||||
## Kmer extraction
|
||||
|
||||
A k-mer is extracted from a super-kmer with `SuperKmer::kmer(i, k)`, which returns a `Kmer` — a left-aligned `u64` newtype (see [Kmer implementation](kmer.md)):
|
||||
|
||||
```rust
|
||||
pub fn kmer(&self, i: usize, k: usize) -> Result<Kmer, KmerError>
|
||||
```
|
||||
|
||||
The bit slice `seq[i*2 .. (i+k)*2]` (Msb0 order) is loaded as a big-endian `u64` via `bitvec::load_be`, then left-shifted to produce the canonical left-aligned layout. One call — no loop, no allocation.
|
||||
|
||||
---
|
||||
|
||||
!!! abstract "Algorithm — Super-kmer reverse complement"
|
||||
```text
|
||||
procedure SuperKmerRevcomp(seq, SEQL):
|
||||
n ← ⌈SEQL / 4⌉ -- number of bytes
|
||||
shift ← n × 8 − SEQL × 2 -- padding bits to flush
|
||||
|
||||
-- step 1: swap bytes outside-in, applying REVCOMP4 to each (256-byte L1 table)
|
||||
lo ← 0 ; hi ← n − 1
|
||||
while lo < hi:
|
||||
seq[lo], seq[hi] ← REVCOMP4[seq[hi]], REVCOMP4[seq[lo]]
|
||||
lo ← lo + 1 ; hi ← hi − 1
|
||||
if lo == hi: seq[lo] ← REVCOMP4[seq[lo]]
|
||||
|
||||
-- step 2: left-rotate entire bit array by shift, zero trailing bits (SIMD via bitvec)
|
||||
if shift > 0:
|
||||
bits.rotate_left(shift)
|
||||
bits[n×8 − shift .. n×8].fill(0)
|
||||
```
|
||||
Reference in New Issue
Block a user