refactor: migrate pipeline to NucPage-based stream processing

Replace the existing chunk and Rope-based processing pipeline with a fixed-size NucPage architecture. Introduce a new nucstream module featuring buffer-pooled, in-place parsing that auto-detects and decompresses FASTA/FASTQ/GenBank inputs into normalized ACGT streams with k-mer overlap preservation. Update obikmer scatter and superkmer stages to consume NucPage iterators and cursor-based navigation, eliminating std::io::Read dependencies and optimizing memory management. Add a configurable max_open_files CLI argument and update implementation documentation to reflect the new record vs. stream reading paths.
Eric Coissac
2026-05-28 22:16:31 +02:00
parent a4b57a96de
commit cfadf63bbc
12 changed files with 876 additions and 609 deletions
+19 -1
@@ -1,6 +1,24 @@
# Chunk reader — implementation
The `obiread` crate provides a streaming iterator that reads FASTA or FASTQ files in fixed-size blocks and yields self-contained chunks, each ending on a complete sequence record boundary. Chunks are consumed in parallel by downstream workers.
`obiread` exposes two distinct sequence reading paths, each optimised for a different use case.
## Two reading paths
| Path | API | Output unit | Per-record identity | Use case |
|------|-----|-------------|---------------------|----------|
| **Record path** | `read_sequence_chunks` → `parse_chunk` | `SeqRecord` (id + raw sequence + normalised rope) | yes | `query` — must read complete records |
| **Stream path** | `open_nuc_stream` | `NucPage` (flat normalised byte buffer) | no | `index`, `superkmer` — bulk throughput |
The record path uses `Rope`-backed chunks and is described in detail below.
The stream path (`NucStream` / `NucPage`) is described in the scatter section of [pipeline](pipeline.md).
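To give a feel for the stream path's output unit, here is a minimal sketch of splitting a normalised sequence into fixed-size pages that share `k - 1` bytes, so no k-mer spanning a page boundary is lost. The function `pages_with_overlap` and its parameters are hypothetical illustrations, not the actual `NucPage` layout or the `open_nuc_stream` API.

```rust
/// Hypothetical sketch, not the real `NucPage` type: cut a normalised
/// sequence into pages of at most `page_size` bytes, where each page
/// starts with the last `k - 1` bytes of the previous one.
fn pages_with_overlap(seq: &[u8], page_size: usize, k: usize) -> Vec<&[u8]> {
    assert!(k >= 1 && page_size >= k, "a page must hold at least one k-mer");
    // Advance by fewer bytes than a page holds, leaving a k-1 overlap.
    let step = page_size - (k - 1);
    let mut pages = Vec::new();
    let mut start = 0;
    // A sequence shorter than k contains no k-mer and yields no page.
    while start + k <= seq.len() {
        let end = (start + page_size).min(seq.len());
        pages.push(&seq[start..end]);
        if end == seq.len() {
            break;
        }
        start += step;
    }
    pages
}
```

Under these assumptions, enumerating the k-mers of each page in order yields every k-mer of the original sequence exactly once, which is the property the overlap is meant to preserve.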
---
## Record path: chunk reader
The chunk reader reads FASTA or FASTQ files in fixed-size blocks and yields self-contained chunks, each ending on a complete sequence record boundary. `parse_chunk` then converts each chunk into a `Vec<SeqRecord>`, where each record carries its identifier, raw sequence bytes, and a normalised rope ready for superkmer building.
This path is mandatory for `query`, where superkmers must be tracked back to their originating sequence (id, kmer offset) for output annotation.
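The boundary rule can be sketched as follows for the FASTA case. This is an illustration under stated assumptions, not the `read_sequence_chunks` implementation: `last_record_start` and `split_fasta_chunks` are hypothetical names, the input is held entirely in memory, and a record is assumed to start at any `>` that follows a newline. FASTQ needs a stricter test, because `@` can also appear in quality lines.

```rust
/// Hypothetical helper (not part of `obiread`): offset of the last FASTA
/// record start in `block`, i.e. the index of the last '>' preceded by '\n'.
fn last_record_start(block: &[u8]) -> Option<usize> {
    block
        .windows(2)
        .rposition(|w| w[0] == b'\n' && w[1] == b'>')
        .map(|i| i + 1)
}

/// Hypothetical sketch: split `data` into chunks of roughly `block_size`
/// bytes, each ending on a complete record boundary.
fn split_fasta_chunks(data: &[u8], block_size: usize) -> Vec<&[u8]> {
    assert!(block_size > 0, "block_size must be positive");
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < data.len() {
        let end = (start + block_size).min(data.len());
        let cut = if end == data.len() {
            // Last block: take everything that remains.
            end
        } else if let Some(n) = last_record_start(&data[start..end]) {
            // Cut just before the last record that starts inside the block.
            start + n
        } else {
            // The record is longer than one block: extend to the next
            // record start, or to the end of the data if there is none.
            data[end - 1..]
                .windows(2)
                .position(|w| w[0] == b'\n' && w[1] == b'>')
                .map(|i| end + i)
                .unwrap_or(data.len())
        };
        chunks.push(&data[start..cut]);
        start = cut;
    }
    chunks
}
```

Concatenating the chunks returns the original bytes unchanged, and each chunk can be parsed independently, which is what allows downstream workers to consume them in parallel.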
## Output type: Rope