
Eric Coissac cfadf63bbc refactor: migrate pipeline to NucPage-based stream processing

Replace the existing chunk and Rope-based processing pipeline with a fixed-size NucPage architecture. Introduce a new nucstream module featuring buffer-pooled, in-place parsing that auto-detects and decompresses FASTA/FASTQ/GenBank inputs into normalized ACGT streams with k-mer overlap preservation. Update obikmer scatter and superkmer stages to consume NucPage iterators and cursor-based navigation, eliminating std::io::Read dependencies and optimizing memory management. Add a configurable max_open_files CLI argument and update implementation documentation to reflect the new record vs. stream reading paths.

2026-05-29 09:10:25 +02:00


Chunk reader — implementation

obiread exposes two distinct sequence reading paths, each optimised for a different use case.

Two reading paths

| Path | API | Output unit | Per-record identity | Use case |
| --- | --- | --- | --- | --- |
| Record path | `read_sequence_chunks` → `parse_chunk` | `SeqRecord` (id + raw sequence + normalised rope) | yes | `query`: must read complete records |
| Stream path | `open_nuc_stream` | `NucPage` (flat normalised byte buffer) | no | `index`, `superkmer`: bulk throughput |

The record path uses Rope-backed chunks and is described in detail below. The stream path (NucStream / NucPage) is described in the scatter section of pipeline.

Record path: chunk reader

The chunk reader reads FASTA or FASTQ files in fixed-size blocks and yields self-contained chunks, each ending on a complete sequence record boundary. parse_chunk then converts each chunk into a Vec<SeqRecord>, where each record carries its identifier, raw sequence bytes, and a normalised rope ready for superkmer building.

This path is mandatory for query, where superkmers must be tracked back to their originating sequence (id, kmer offset) for output annotation.

Output type: Rope

Each chunk is a Rope — a segmented byte sequence: a Vec of blocks, where each block is a Vec<Cell<u8>>. The consumer iterates over the blocks via a forward or backward cursor.

Rope::split_off(pos) splits at an absolute byte offset in O(log n) (binary search over block-start index). If pos falls inside a block, that block is split in two via Vec::split_off — no memcpy in the common case.
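The split described above can be sketched as follows. This is a minimal illustration, not the obiread implementation: `MiniRope` and its fields are hypothetical names, plain `u8` cells stand in for `Cell<u8>`, and only the block lookup is logarithmic here (rebuilding the tail's index is linear in the number of tail blocks).

```rust
/// Hypothetical miniature rope: a list of byte blocks plus the absolute
/// start offset of each block. Not the real obiread `Rope`.
#[derive(Debug, Default)]
pub struct MiniRope {
    blocks: Vec<Vec<u8>>,
    starts: Vec<usize>, // starts[i] = absolute offset of blocks[i][0]
    len: usize,
}

impl MiniRope {
    /// Append a block at the end of the rope (empty blocks are ignored).
    pub fn push(&mut self, block: Vec<u8>) {
        if block.is_empty() {
            return;
        }
        self.starts.push(self.len);
        self.len += block.len();
        self.blocks.push(block);
    }

    pub fn len(&self) -> usize {
        self.len
    }

    /// Flatten the rope into one contiguous buffer (used here for checking).
    pub fn to_vec(&self) -> Vec<u8> {
        self.blocks.concat()
    }

    /// Keep bytes [0, pos) in `self` and return bytes [pos, len) as a new rope.
    pub fn split_off(&mut self, pos: usize) -> MiniRope {
        assert!(pos <= self.len, "split position out of range");
        // Binary search: index of the first block starting at or after `pos`.
        let idx = self.starts.partition_point(|&s| s < pos);
        // If `pos` falls strictly inside the previous block, split that block.
        if idx > 0 {
            let prev = idx - 1;
            let inner = pos - self.starts[prev];
            if inner < self.blocks[prev].len() {
                let right = self.blocks[prev].split_off(inner);
                self.blocks.insert(idx, right);
            }
        }
        // Everything from `idx` onwards belongs to the tail.
        let tail_blocks = self.blocks.split_off(idx);
        self.starts.truncate(idx);
        self.len = pos;
        let mut tail = MiniRope::default();
        for block in tail_blocks {
            tail.push(block);
        }
        tail
    }
}
```

When `pos` coincides with a block start, no block is split and whole blocks simply move to the tail.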

SeqChunkIter

pub struct SeqChunkIter<R: Read> { /* private */ }

impl<R: Read> Iterator for SeqChunkIter<R> {
    type Item = io::Result<Rope>;
}

pub fn fasta_chunks<R: Read>(source: R) -> SeqChunkIter<R>
pub fn fastq_chunks<R: Read>(source: R) -> SeqChunkIter<R>

next() loop:

1. read one block of block_size bytes → push onto Rope
2. call splitter(rope) → Option<abs_offset>
   if Some(pos):
       tail = rope.split_off(pos)    ← O(log n), may split one block
       chunk = mem::replace(&mut rope, tail)
       return Some(Ok(chunk))
   if None: go back to step 1 (more data needed)
3. if EOF and rope non-empty: return Some(Ok(rope)) as final chunk
4. if EOF and rope empty: return None

The Splitter function signature is fn(&Rope) -> Option<usize>. It returns the absolute byte offset of the start of the last complete record, or None if no boundary was found in the accumulated rope, in which case the iterator reads another block and tries again.
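The read-and-split loop above can be sketched with a plain `Vec<u8>` standing in for the Rope. `ChunkIter`, the `Splitter` alias over `&[u8]`, and `last_header` are hypothetical names introduced for this illustration; the real `SeqChunkIter` operates on a `Rope` and may differ in its error handling.

```rust
use std::io::{self, Read};
use std::mem;

/// Hypothetical splitter type: returns the cut offset, or None if more data is needed.
pub type Splitter = fn(&[u8]) -> Option<usize>;

pub struct ChunkIter<R: Read> {
    source: R,
    buf: Vec<u8>, // bytes read but not yet emitted (stands in for the Rope)
    block_size: usize,
    splitter: Splitter,
    eof: bool,
}

impl<R: Read> ChunkIter<R> {
    pub fn new(source: R, block_size: usize, splitter: Splitter) -> Self {
        assert!(block_size > 0, "block_size must be positive");
        ChunkIter { source, buf: Vec::new(), block_size, splitter, eof: false }
    }
}

impl<R: Read> Iterator for ChunkIter<R> {
    type Item = io::Result<Vec<u8>>;

    fn next(&mut self) -> Option<Self::Item> {
        while !self.eof {
            // Step 1: read up to block_size bytes and append them.
            let old_len = self.buf.len();
            self.buf.resize(old_len + self.block_size, 0);
            match self.source.read(&mut self.buf[old_len..]) {
                Ok(n) => {
                    self.buf.truncate(old_len + n);
                    if n == 0 {
                        self.eof = true;
                        break;
                    }
                    // Step 2: ask the splitter for a cut point (a cut at 0 is ignored).
                    if let Some(pos) = (self.splitter)(&self.buf).filter(|&p| p > 0) {
                        let tail = self.buf.split_off(pos);
                        return Some(Ok(mem::replace(&mut self.buf, tail)));
                    }
                }
                Err(e) => {
                    self.buf.truncate(old_len);
                    if e.kind() != io::ErrorKind::Interrupted {
                        // Assumption of this sketch: stop and drop pending bytes on error.
                        self.eof = true;
                        self.buf.clear();
                        return Some(Err(e));
                    }
                }
            }
        }
        // Steps 3 and 4: flush whatever is left, then stop.
        if self.buf.is_empty() { None } else { Some(Ok(mem::take(&mut self.buf))) }
    }
}

/// Illustrative splitter: offset of the last '>' that directly follows a '\n'.
pub fn last_header(buf: &[u8]) -> Option<usize> {
    buf.windows(2)
        .rposition(|w| w[0] == b'\n' && w[1] == b'>')
        .map(|i| i + 1)
}
```

For example, `ChunkIter::new(&data[..], 4, last_header)` over a three-record FASTA buffer yields one chunk per record when the block size is smaller than a record; with larger blocks a chunk may hold several records.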

Boundary detection — FASTA

The FASTA splitter is a backward scan driven by a 2-state machine. Scanning right to left, it looks for a > immediately followed by \n or \r, that is, a > that is preceded by a newline in forward order and therefore starts a header line:

stateDiagram-v2
    direction LR
    [*]      --> Scanning
    Scanning --> FoundGt  : '>'
    FoundGt  --> Scanning : other
    FoundGt  --> [*]      : '\\n' / '\\r' ✓

Returns the byte offset of the > that starts the last complete record. Returns None if only one > is found (cannot confirm there is a prior complete record).
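A minimal sketch of this backward scan over a flat byte slice, assuming a `&[u8]` buffer instead of a Rope cursor. The function name `fasta_last_boundary` is hypothetical.

```rust
/// Backward scan for the last '>' that is preceded by '\n' or '\r'.
/// Returns the offset of that '>', or None if no such boundary exists.
pub fn fasta_last_boundary(buf: &[u8]) -> Option<usize> {
    // false = Scanning, true = FoundGt (the byte to the right was '>').
    let mut found_gt = false;
    for i in (0..buf.len()).rev() {
        let b = buf[i];
        if found_gt && (b == b'\n' || b == b'\r') {
            // The '>' sits one byte to the right of this newline.
            return Some(i + 1);
        }
        // Any other byte returns to Scanning, unless it is itself a '>'.
        found_gt = b == b'>';
    }
    // A '>' at offset 0 has nothing before it, so it is never confirmed.
    None
}
```

A `>` inside a header line is skipped because the byte before it is not a newline, and a buffer holding a single record yields `None`.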

Boundary detection — FASTQ

FASTQ records have a rigid 4-line structure (@header, sequence, +, quality). The @ character (ASCII 64, i.e. Phred score 31 in the Phred+33 encoding) can legitimately appear in quality lines, including as their first character, which makes any forward heuristic unreliable. The backward scanner verifies the full structural context before accepting a candidate @.
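To make the problem concrete, here is a small sketch of the kind of forward heuristic that fails. Both function names are hypothetical and the heuristic is shown only as a counter-example.

```rust
/// Naive forward heuristic (counter-example): every '@' at the start of a
/// line is taken to be a record start.
pub fn naive_record_starts(buf: &[u8]) -> Vec<usize> {
    (0..buf.len())
        .filter(|&i| buf[i] == b'@' && (i == 0 || buf[i - 1] == b'\n'))
        .collect()
}

/// Phred+33 decoding of one quality byte (saturates at 0 for bytes below 33).
pub fn phred33(q: u8) -> u8 {
    q.saturating_sub(33)
}
```

On the single record `@r1\nACGT\n+\n@III\n`, the heuristic reports two starts, offsets 0 and 11, because the quality line begins with `@`, a perfectly valid score of `phred33(b'@') == 31`.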

7-state machine (states 0–6), scanning from right to left. Each time a + is found, its position is saved as restart; any state mismatch resets the scan to that position.

stateDiagram-v2
    direction LR

    [*]          --> Scanning

    Scanning     --> FoundPlus    : '+' (save restart)
    FoundPlus    --> AfterNlPlus  : '\\n' / '\\r'
    FoundPlus    --> Scanning     : other → backtrack

    AfterNlPlus  --> AfterNlPlus  : separator
    AfterNlPlus  --> InSequence   : letter / - / . / [ / ]
    AfterNlPlus  --> Scanning     : other → backtrack

    InSequence   --> AfterSequence : '\\n' / '\\r'
    InSequence   --> InSequence    : letter / - / . / [ / ]
    InSequence   --> Scanning      : other → backtrack

    AfterSequence --> AfterSequence : '\\n' / '\\r'
    AfterSequence --> InHeader      : other

    InHeader     --> FoundAt    : '@' (save cut)
    InHeader     --> Scanning   : '\\n' / '\\r' → backtrack
    InHeader     --> InHeader   : other

    FoundAt      --> [*]       : '\\n' / '\\r' ✓
    FoundAt      --> InHeader  : other

restart is updated each time a + is found. When any state fails its expected input, the scan jumps back to restart and continues from there. This guarantees that a @ in a quality line cannot be accepted as a record start: a @ is accepted only once the scanner has already traversed, from right to left, a + line, a sequence line, and the header text that follows the @ in forward order.

Returns the byte offset of the @ that starts the last complete record.
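The state machine can be sketched over a flat byte slice as follows. This is an illustration under stated assumptions rather than the obiread code: `fastq_last_boundary` is a hypothetical name, the buffer is a `&[u8]` instead of a Rope cursor, and the "separator" self-loop of AfterNlPlus is assumed to mean `\n` or `\r`.

```rust
#[derive(Clone, Copy)]
enum State { Scanning, FoundPlus, AfterNlPlus, InSequence, AfterSequence, InHeader, FoundAt }

fn is_newline(b: u8) -> bool {
    b == b'\n' || b == b'\r'
}

fn is_seq_char(b: u8) -> bool {
    b.is_ascii_alphabetic() || matches!(b, b'-' | b'.' | b'[' | b']')
}

/// Backward scan: offset of the '@' starting the last record whose header,
/// sequence and '+' lines are all confirmed, or None if there is none.
pub fn fastq_last_boundary(buf: &[u8]) -> Option<usize> {
    use State::*;
    let mut state = Scanning;
    let mut restart = 0usize; // offset of the most recent '+' candidate
    let mut cut = 0usize;     // offset of the candidate '@'
    let mut i = buf.len();
    while i > 0 {
        i -= 1;
        let b = buf[i];
        state = match state {
            Scanning => {
                if b == b'+' { restart = i; FoundPlus } else { Scanning }
            }
            FoundPlus => {
                if is_newline(b) { AfterNlPlus } else { i = restart; Scanning }
            }
            AfterNlPlus => {
                if is_newline(b) { AfterNlPlus }
                else if is_seq_char(b) { InSequence }
                else { i = restart; Scanning }
            }
            InSequence => {
                if is_newline(b) { AfterSequence }
                else if is_seq_char(b) { InSequence }
                else { i = restart; Scanning }
            }
            AfterSequence => {
                if is_newline(b) { AfterSequence } else { InHeader }
            }
            InHeader => {
                if b == b'@' { cut = i; FoundAt }
                else if is_newline(b) { i = restart; Scanning }
                else { InHeader }
            }
            FoundAt => {
                if is_newline(b) { return Some(cut); } else { InHeader }
            }
        };
    }
    None
}
```

Backtracking sets `i = restart`, so the next iteration resumes just left of the rejected `+`. Each rejected `+` lies further left than the previous one, so the scan always terminates. A `@` at offset 0 is never confirmed, because no newline precedes it.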
