refactor: migrate pipeline to NucPage-based stream processing
Replace the existing chunk and Rope-based processing pipeline with a fixed-size NucPage architecture. Introduce a new nucstream module featuring buffer-pooled, in-place parsing that auto-detects and decompresses FASTA/FASTQ/GenBank inputs into normalized ACGT streams with k-mer overlap preservation. Update obikmer scatter and superkmer stages to consume NucPage iterators and cursor-based navigation, eliminating std::io::Read dependencies and optimizing memory management. Add a configurable max_open_files CLI argument and update implementation documentation to reflect the new record vs. stream reading paths.
@@ -1,6 +1,24 @@
# Chunk reader — implementation
The `obiread` crate provides a streaming iterator that reads FASTA or FASTQ files in fixed-size blocks and yields self-contained chunks, each ending on a complete sequence record boundary. Chunks are consumed in parallel by downstream workers.

`obiread` exposes two distinct sequence reading paths, each optimised for a different use case.
## Two reading paths
| Path | API | Output unit | Per-record identity | Use case |
|------|-----|-------------|---------------------|----------|
| **Record path** | `read_sequence_chunks` → `parse_chunk` | `SeqRecord` (id + raw sequence + normalised rope) | yes | `query`: must read complete records |
| **Stream path** | `open_nuc_stream` | `NucPage` (flat normalised byte buffer) | no | `index`, `superkmer`: bulk throughput |

The record path uses `Rope`-backed chunks and is described in detail below.
The stream path (`NucStream` / `NucPage`) is described in the scatter section of [pipeline](pipeline.md).

---
## Record path: chunk reader
The chunk reader reads FASTA or FASTQ files in fixed-size blocks and yields self-contained chunks, each ending on a complete sequence record boundary. `parse_chunk` then converts each chunk into a `Vec<SeqRecord>`, where each record carries its identifier, raw sequence bytes, and a normalised rope ready for superkmer building.

This path is mandatory for `query`, where superkmers must be traced back to their originating sequence (id, k-mer offset) for output annotation.
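To make "each chunk ends on a complete record boundary" concrete, here is a minimal sketch for the FASTA case only. The helper name `split_at_last_record` is hypothetical and is not part of the `obiread` API. It assumes a record starts at a `>` that directly follows a newline, and that the caller prepends the returned remainder to the next block. FASTQ needs a stricter test, since `@` can also appear in quality lines.

```rust
/// Hypothetical helper, not the `obiread` API: split a FASTA block so that
/// the first part contains only complete records.
///
/// Returns `(complete, remainder)`. The remainder starts at the last record
/// header found in the block and may be truncated, so the caller is expected
/// to carry it over to the next block.
fn split_at_last_record(block: &[u8]) -> (&[u8], &[u8]) {
    // Search backwards for the last "\n>" pair: the '>' after it opens the
    // last record of the block, which may be incomplete.
    match block.windows(2).rposition(|w| w == b"\n>") {
        // Keep the newline with the complete part; the remainder starts at '>'.
        Some(pos) => block.split_at(pos + 1),
        // No record start after the first byte: nothing is known to be complete.
        None => (&block[..0], block),
    }
}
```

With the input `>a\nACGT\n>b\nAC`, the complete part is `>a\nACGT\n` and the remainder `>b\nAC` waits for the next block.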
## Output type: Rope
@@ -19,7 +19,11 @@ The histogram gives:
## Phase 1 — Scatter
Single streaming pass over raw input files (FASTA/FASTQ, gzip). FASTQ quality scores are ignored.

Input files are read via `open_nuc_stream`, which opens and decompresses the file, auto-detects the format (FASTA / FASTQ / GenBank), and yields a sequence of `NucPage` buffers. Each `NucPage` is a flat 64 KB buffer of normalised bytes (`ACGT` + `\x00` separators), carrying a k−1 byte overlap from the preceding page so that no k-mer is lost at page boundaries. Per-record identity (sequence id, raw bytes) is not preserved; this is intentional, because the scatter phase only needs normalised bases to produce superkmers.
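The k−1 overlap can be illustrated with a small sketch. This is not the `NucPage` implementation: `pages_with_overlap` is a hypothetical function that returns borrowed slices of an in-memory buffer, whereas the text describes fixed-size buffers that carry the overlap forward. The arithmetic is the same: each page after the first repeats the last k−1 bytes of its predecessor, so every k-mer falls entirely within exactly one page.

```rust
/// Hypothetical sketch, not the `NucPage` implementation: cut `data` into
/// pages of at most `page_size` bytes, where each page after the first
/// starts with the last k-1 bytes of the previous page.
fn pages_with_overlap(data: &[u8], page_size: usize, k: usize) -> Vec<&[u8]> {
    assert!(k >= 1 && page_size >= k, "a page must hold at least one k-mer");
    // Each page contributes `page_size - (k - 1)` bytes not seen before.
    let step = page_size - (k - 1);
    let mut pages = Vec::new();
    let mut start = 0;
    loop {
        let end = (start + page_size).min(data.len());
        pages.push(&data[start..end]);
        if end == data.len() {
            break;
        }
        // The next page begins k-1 bytes before the end of this one.
        start += step;
    }
    pages
}
```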
For each read fragment within a page:
1. **Ambiguous base filter**: cut at any non-ACGT base; discard fragments shorter than k.
2. **Entropy filter**: scan each fragment with a sliding window of size k. When the k-mer $K_i = S[i \mathinner{..} i+k-1]$, which ends at nucleotide $S[j]$ (with $j = i+k-1$), has entropy below the threshold $\theta$, emit the current segment and start a new one (see algorithm below). $K_i$ belongs to neither segment, and no valid k-mer is lost.
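Step 1 above can be sketched in a few lines. The function name `acgt_fragments` is hypothetical, and the sketch assumes the input is already normalised to upper-case bytes, so that anything other than `A`, `C`, `G`, `T` (including the `\x00` separators) acts as a cut point. The entropy filter is not shown, since its definition lives in the algorithm referenced in step 2.

```rust
/// Hypothetical sketch of the ambiguous base filter: cut `seq` at every
/// byte that is not A, C, G or T, and keep only fragments of length >= k.
fn acgt_fragments(seq: &[u8], k: usize) -> Vec<&[u8]> {
    seq.split(|&b| !matches!(b, b'A' | b'C' | b'G' | b'T'))
        // A fragment shorter than k cannot contain any k-mer.
        .filter(|frag| frag.len() >= k)
        .collect()
}
```

For `ACGTNACNGGGTTT` with k = 4, the cut produces `ACGT`, `AC`, and `GGGTTT`; the middle fragment is discarded because it is shorter than k.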