cfadf63bbc
Replace the existing chunk and Rope-based processing pipeline with a fixed-size NucPage architecture. Introduce a new nucstream module featuring buffer-pooled, in-place parsing that auto-detects and decompresses FASTA/FASTQ/GenBank inputs into normalized ACGT streams with k-mer overlap preservation. Update obikmer scatter and superkmer stages to consume NucPage iterators and cursor-based navigation, eliminating std::io::Read dependencies and optimizing memory management. Add a configurable max_open_files CLI argument and update implementation documentation to reflect the new record vs. stream reading paths.
229 lines
13 KiB
Markdown
229 lines
13 KiB
Markdown
# Construction pipeline
|
||
|
||
All phases after scatter are embarrassingly parallel across partitions.
|
||
|
||
## Phase 0 — Parameter estimation
|
||
|
||
The construction parameters p, n, and min_count depend on the kmer frequency spectrum of the dataset. Estimating this spectrum before construction avoids costly re-partitioning if p is badly chosen.
|
||
|
||
Two approaches are supported:
|
||
|
||
- **External estimation (preferred):** run [NT-CARD](https://github.com/bcgsc/ntCard) on the input files and pass its histogram output to `obikmer build`. NT-CARD produces a kmer frequency histogram in a single streaming pass using ntHash and a Flajolet-Martin-style estimator; obikmer reads this file and derives p, n, and min_count automatically.
|
||
- **Internal estimation (future):** an `obikmer estimate` subcommand for users who prefer a single-tool workflow. The implementation would combine two components: (1) **ntHash**, a rolling hash that updates the kmer hash in O(1) per nucleotide by incrementally adding the incoming base and removing the outgoing one — Rust crates exist; (2) a **Flajolet-Martin-style streaming estimator** that maintains a small table of minimum hash values and infers the frequency histogram from their statistical distribution, as described in the NT-CARD paper [@Mohamadi2017-ok].
|
||
|
||
The histogram gives:
|
||
|
||
- **F0** (number of distinct kmers) → sets p (target ~10M kmers/partition → p = ⌈log₂(F0 / 10M)⌉)
|
||
- **frequency distribution** → sets n (choose n so that fewer than 1% of kmers overflow)
|
||
- **error valley** → suggests min_count (typically the local minimum between the error peak and the coverage peak)
|
||
|
||
## Phase 1 — Scatter
|
||
|
||
Single streaming pass over raw input files (FASTA/FASTQ, gzip). FASTQ quality scores are ignored.
|
||
|
||
Input files are read via `open_nuc_stream`, which opens and decompresses the file, auto-detects the format (FASTA / FASTQ / GenBank), and yields a sequence of `NucPage` buffers. Each `NucPage` is a flat 64 KB buffer of normalised bytes (`ACGT` + `\x00` separators), carrying a k−1 byte overlap from the preceding page so that no k-mer is lost at page boundaries. Per-record identity (sequence id, raw bytes) is not preserved; this is intentional — the scatter phase only needs normalised bases to produce superkmers.
|
||
|
||
For each read fragment within a page:
|
||
|
||
1. **Ambiguous base filter**: cut at any non-ACGT base; discard fragments shorter than k.
|
||
2. **Entropy filter**: scan each fragment with a sliding window of size k. When the kmer $K_i = S[i \mathinner{..} i+k-1]$ ended by nucleotide $S[j]$ (with $j = i+k-1$) has entropy below threshold $\theta$, emit the current segment and start a new one (see algorithm below). $K_i$ belongs to neither segment, and no valid kmer is lost.
|
||
3. **Length filter**: discard any segment shorter than k produced by step 2.
|
||
4. **Super-kmer extraction**: for each clean segment, slide a minimizer window and group consecutive kmers sharing the same canonical minimizer; canonise each super-kmer by lexicographic comparison with its reverse complement (early exit).
|
||
5. **Partition routing**: `hash(canonical_minimizer) → PART` → append super-kmer to `partition/superkmers.bin.gz`.
|
||
|
||
**Segmentation behavior:**
|
||
|
||
When $K_i$ (ended by $S[j]$, $j = i+k-1$) fails the entropy threshold:
|
||
|
||
- Current segment $S[\textit{seg_start} \mathinner{..} j-1]$ is emitted (last valid kmer = $K_{i-1}$)
|
||
- New segment starts at $S[i+1]$ (first new kmer = $K_{i+1}$)
|
||
- $K_i$ is excluded: current segment lacks $S[j]$, new segment lacks $S[i]$
|
||
- Overlap = $S[i+1 \mathinner{..} j-1]$ = $k-2$ nucleotides
|
||
|
||
!!! abstract "Algorithm — Entropy filter: sliding window segmentation"
|
||
```text
|
||
procedure EntropyFilter(S, N, k, θ):
|
||
seg_start ← 0
|
||
window ← []
|
||
for j ← 0 to N−1:
|
||
window.push(S[j])
|
||
if |window| < k: continue
|
||
i ← j − k + 1
|
||
if entropy(window) ≤ θ:
|
||
emit S[seg_start .. j−1]
|
||
seg_start ← i + 1
|
||
window ← S[i+1 .. j]
|
||
else:
|
||
window.pop_front()
|
||
emit S[seg_start .. N−1]
|
||
```
|
||
|
||
Writes are sequential and append-only — IO-friendly. Gzip applied at write time. Data volume ≈ raw genome size (2 bits/nt compaction offsets header overhead).
|
||
|
||
## Phase 2 — Dereplication
|
||
|
||
Performed independently per partition. Identical super-kmers are consolidated and their COUNT accumulated — analogous to amplicon dereplication in metabarcoding. Uses external bucket sort to stay within RAM bounds:
|
||
|
||
**Pass 1** (streaming): hash the nucleotide payload of each super-kmer, route to one of B bucket files:
|
||
```
|
||
hash(sequence) % B → bucket_i.bin
|
||
```
|
||
B ≈ 100 is tunable; RAM needed ≈ partition_size / B.
|
||
|
||
**Pass 2**: for each bucket, load into an in-memory `HashMap<sequence, COUNT>`, dereplicate by summing COUNT values, write consolidated super-kmers.
|
||
|
||
After dereplication: at Nx coverage the partition shrinks by ~x (errors aside). The COUNT field in each super-kmer header = number of times that exact super-kmer sequence was observed across all input reads.
|
||
|
||
**Important:** super-kmer COUNT ≠ individual kmer count. A kmer can appear in multiple distinct super-kmers (same partition, different flanking context); its true count = sum of COUNT of all super-kmers containing it. A super-kmer with COUNT=1 may contain only high-abundance kmers, each appearing in many other super-kmers. Abundance filtering therefore cannot be applied at this phase.
|
||
|
||
## Phase 3 — Per-kmer count aggregation and quorum filtering
|
||
|
||
For each dereplicated super-kmer, enumerate its kmers and accumulate counts:
|
||
|
||
```
|
||
for each super-kmer (sequence, COUNT):
|
||
for each kmer in sequence:
|
||
kmer_counts[canonical(kmer)] += COUNT
|
||
```
|
||
|
||
Implemented as a three-step pipeline in `count_partition()`:
|
||
|
||
1. **External sort** (`kmer_sort::sort_unique_kmers`): read dereplicated superkmers, extract canonical kmer raw `u64` values, sort in RAM-bounded chunks (adaptive: 40% of available RAM ÷ n_threads, min 1 M kmers/chunk), k-way merge with inline dedup → `sorted_unique.bin`. f0 is now known exactly.
|
||
2. **Provisional MPHF** (ptr_hash): built from `sorted_unique.bin` via `new_from_par_iter(f0, ...)`. Stored to `mphf1.bin`; `sorted_unique.bin` deleted immediately.
|
||
3. **Accumulation pass**: re-read dereplicated superkmers; for each kmer, `slot = mphf.index(kmer.raw())`, increment `counts1[slot]` by the superkmer COUNT. Stored in a `PersistentCompactIntVec` (`counts1.bin`).
|
||
|
||
At the end of this phase, each distinct canonical kmer has its exact total count, and the frequency spectrum (`spectrums/{label}.json`) is written to the index root.
|
||
|
||
No pre-filter on super-kmer COUNT is possible at phase 2: a super-kmer with COUNT=1 may contain only high-abundance kmers, each present in many other super-kmers across the partition.
|
||
|
||
## Phase 4 — Super-kmer compaction
|
||
|
||
The valid kmer set from phase 3 is used as a mask to rewrite the super-kmer files:
|
||
|
||
```
|
||
for each dereplicated super-kmer:
|
||
scan kmer by kmer
|
||
kmer not in valid set → break point (terminates current super-kmer)
|
||
kmer in valid set → extend current super-kmer
|
||
```
|
||
|
||
Three cases per super-kmer:
|
||
|
||
- **All kmers valid** → copied as-is
|
||
- **No kmer valid** → discarded
|
||
- **Mixed** → split into sub-super-kmers at invalid boundaries; each sub-super-kmer inherits the original COUNT
|
||
|
||
After splitting, re-apply dereplication (bucket sort, phase 2 method) — splitting can produce new identical super-kmers. This re-dereplication is cheap: the volume is already greatly reduced.
|
||
|
||
Output: a clean super-kmer file where every kmer passes quorum. This file feeds phase 5.
|
||
|
||
## Phase 5 — Local de Bruijn graph and unitig construction
|
||
|
||
Within each partition, build a **local de Bruijn graph** from the valid kmer set and compute its unitigs. All operations are local to the partition — no cross-partition communication.
|
||
|
||
```
|
||
valid kmers → HashSet<u64>
|
||
|
||
for each kmer K:
|
||
out_degree = |{K[1:]+b | b ∈ {A,C,G,T}} ∩ HashSet|
|
||
in_degree = |{b+K[:-1] | b ∈ {A,C,G,T}} ∩ HashSet|
|
||
|
||
internal node ↔ in_degree=1 AND out_degree=1
|
||
branching / dead-end → unitig start or end
|
||
```
|
||
|
||
Traverse non-branching paths to assemble unitigs. Kmers whose neighbours fall in other partitions appear as dead ends locally — they terminate the unitig. The result: **each kmer appears in exactly one unitig** within the partition.
|
||
|
||
The partition size (controlled by p) must be calibrated so that the HashSet fits in RAM during this phase.
|
||
|
||
Output: `unitigs.bin` — the permanent evidence structure for the partition. Each kmer in the partition appears at exactly one (unitig_id, offset) location.
|
||
|
||
**Scope of local unitigs:** these are unitigs of the partition's local de Bruijn graph, not global unitigs. A kmer whose k-1 successor or predecessor falls in another partition appears as a dead end locally and terminates the unitig. This does not affect correctness of verification but means partition-local unitigs cannot be directly reused for global assembly.
|
||
|
||
## Phase 6 — MPHF construction and index finalisation
|
||
|
||
`build_index_layer` is called per partition (in parallel via `build_layers`) with the following parameters sourced from `IndexConfig`:
|
||
|
||
- `block_bits` — from `IndexConfig::block_bits`; controls the `.idx` block size (2^block_bits unitig chunks per block) for exact evidence
|
||
- `evidence` — `EvidenceKind::Exact` or `EvidenceKind::Approx { b, z }`; propagated unchanged from `IndexConfig::evidence`
|
||
- `min_ab` / `max_ab` — abundance bounds applied before graph construction
|
||
- `with_counts` — whether to store kmer counts alongside set membership
|
||
|
||
**Abundance filtering:** when `min_ab > 1` or `max_ab.is_some()`, the provisional `mphf1.bin` and `counts1.bin` produced in phase 3 are memory-mapped. Each canonical kmer is accepted only if its count in `counts1` satisfies the bounds. If either file is absent, filtering is skipped (all kmers accepted).
|
||
|
||
```
|
||
for each kmer in dereplicated super-kmer:
|
||
ab = counts1[mphf1.index(kmer.raw())]
|
||
if ab < min_ab || ab > max_ab: skip
|
||
graph.push(kmer)
|
||
```
|
||
|
||
**Graph build and unitig write:**
|
||
|
||
The surviving kmers are fed into `GraphDeBruijn`, which computes degrees and yields unitigs. Unitigs are written to `layer_0/unitigs.bin` via a `UnitigFileWriter`.
|
||
|
||
**MPHF and evidence build:**
|
||
|
||
`Layer::build` (membership-only) or `Layer::<PersistentCompactIntMatrix>::build` (with counts) is called next. Internally, `MphfLayer::build` performs two passes:
|
||
|
||
1. **Pass 1 (parallel):** build `unitigs.bin.idx` (block size = 2^`block_bits`) then construct the MPHF from all canonical kmers in `unitigs.bin`; store to `mphf.bin`.
|
||
2. **Pass 2 (sequential):** for each kmer in `unitigs.bin`, compute its slot and write `evidence.bin` (`chunk_id: 25 bits | rank: 7 bits` packed into a `u32`); also invoke the payload callback (`fill_slot`) to populate `counts/` if `with_counts`.
|
||
|
||
After `Layer::build` completes, `layer_meta.json` records `EvidenceKind::Exact`.
|
||
|
||
**Approximate evidence override:**
|
||
|
||
If `evidence` is `EvidenceKind::Approx { b, z }`, `build_approx_evidence` is called immediately after `Layer::build`. It overwrites the exact evidence bundle with `fingerprint.bin` (b-bit hash per slot) and rewrites `layer_meta.json` with `EvidenceKind::Approx { b, z }`. No `.idx` file is needed at query time in this mode.
|
||
|
||
```
|
||
// Exact path → evidence.bin + unitigs.bin.idx + layer_meta.json(Exact)
|
||
// Approx path → fingerprint.bin + layer_meta.json(Approx{b,z})
|
||
// (evidence.bin left on disk but not used)
|
||
```
|
||
|
||
**Partition metadata:**
|
||
|
||
After all layer files are written, `PartitionMeta { n_layers: 1 }` is serialised to `index/meta.json` inside the partition directory. This file is required by `LayeredMap::open` for subsequent merge operations.
|
||
|
||
**File layout per partition after phase 6:**
|
||
|
||
```
|
||
part_XXXXX/
|
||
index/
|
||
meta.json ← PartitionMeta { n_layers: 1 }
|
||
layer_0/
|
||
unitigs.bin ← permanent evidence (all modes)
|
||
unitigs.bin.idx ← block index (exact mode only)
|
||
mphf.bin ← MPHF
|
||
evidence.bin ← exact evidence (exact mode)
|
||
fingerprint.bin ← b-bit fingerprints (approx mode)
|
||
layer_meta.json ← EvidenceKind tag
|
||
counts/ ← PersistentCompactIntMatrix (with_counts only)
|
||
```
|
||
|
||
**Cleanup:** unless `--keep-intermediate` is set, `remove_build_artifacts` deletes `dereplicated.skmer.zst`, `mphf1.bin`, and `counts1.bin` after all partitions are indexed.
|
||
|
||
See [obilayeredmap](obilayeredmap.md) and [MPHF selection](mphf.md) for data structure details.
|
||
|
||
**Query path (exact evidence):**
|
||
|
||
```
|
||
query kmer q
|
||
→ canonical_minimizer(q) → hash → PART → part_XXXXX/
|
||
→ MPHF(q) → slot s
|
||
→ evidence[s] = (chunk_id, rank)
|
||
→ read k nucleotides at rank in unitigs[chunk_id] → compare with q
|
||
→ match : return payload[s] ← exact hit
|
||
→ no match: kmer absent ← MPHF collision on absent kmer
|
||
```
|
||
|
||
**Query path (approximate evidence):**
|
||
|
||
```
|
||
query kmer q
|
||
→ MPHF(q) → slot s
|
||
→ fingerprint[s] matches seq_hash(q)?
|
||
→ yes : probable hit (FP rate = 1/2^b per kmer, 1/2^(b·z) per z-window)
|
||
→ no : kmer absent
|
||
```
|