Chunk reader — implementation

The obiread crate provides a streaming iterator that reads FASTA or FASTQ files in fixed-size blocks and yields self-contained chunks, each ending on a complete sequence record boundary. Chunks are consumed in parallel by downstream workers.

Output type: rope

Each chunk is a Vec<Bytes> — a rope: a list of reference-counted byte slices that are not necessarily contiguous in memory. The consumer iterates over the slices in order.
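
As a rough illustration of what iterating over the slices in order means for a consumer, the sketch below counts FASTA records in a rope. It carries the previous byte across slice edges, so a line break at the end of one slice followed by a > at the start of the next is still recognised as a record start. The name count_fasta_records is hypothetical and not part of obiread. The function is generic over AsRef<[u8]>, which bytes::Bytes implements, so it should accept a Vec<Bytes> as well as plain byte slices.

```rust
/// Counts FASTA records in a rope by walking its slices in order.
/// Hypothetical helper for illustration; not part of the obiread API.
fn count_fasta_records<T: AsRef<[u8]>>(rope: &[T]) -> usize {
    // Treat the start of the chunk as the start of a line.
    let mut prev = b'\n';
    let mut count = 0;
    for slice in rope {
        for &byte in slice.as_ref() {
            // A record starts at a '>' that directly follows a line break,
            // even when the line break sits in the previous slice.
            if byte == b'>' && (prev == b'\n' || prev == b'\r') {
                count += 1;
            }
            prev = byte;
        }
    }
    count
}
```

The only state that has to survive a slice edge here is a single byte; a real parser would carry its own parser state across slices in the same way.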

Because every slice is a bytes::Bytes, the split at the record boundary is O(1): Bytes::split_to(n) only adjusts offsets and increments a reference count, and both halves keep pointing at the same underlying buffer. In the common case no bytes are copied.
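
To make the O(1) claim concrete, here is a deliberately simplified model of a reference-counted slice built on std::sync::Arc. It is not how bytes::Bytes is implemented, and the type SharedSlice is hypothetical; it only shows why a split needs an offset update and a reference-count increment rather than a copy. The constructor copies the input once to create the shared buffer; the split itself copies nothing.

```rust
use std::sync::Arc;

/// Simplified stand-in for a reference-counted byte slice.
/// Hypothetical type for illustration; NOT bytes::Bytes.
struct SharedSlice {
    buf: Arc<[u8]>, // shared, immutable backing buffer
    start: usize,   // view is buf[start..end]
    end: usize,
}

impl SharedSlice {
    /// Copies `data` once into a shared buffer.
    fn new(data: &[u8]) -> Self {
        let buf: Arc<[u8]> = Arc::from(data);
        let end = buf.len();
        SharedSlice { buf, start: 0, end }
    }

    fn as_slice(&self) -> &[u8] {
        &self.buf[self.start..self.end]
    }

    /// Returns a view of the first `n` bytes; `self` keeps the rest.
    /// Only offsets and the reference count change: no bytes are copied.
    fn split_to(&mut self, n: usize) -> SharedSlice {
        assert!(n <= self.end - self.start, "split point out of range");
        let head = SharedSlice {
            buf: Arc::clone(&self.buf),
            start: self.start,
            end: self.start + n,
        };
        self.start += n;
        head
    }
}
```

As with Bytes::split_to, the returned value is the head of the block and the receiver is left holding the tail.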

Allocation policy

| Case | Cost |
|---|---|
| Boundary found in the current block (common) | zero extra allocation (split_to only) |
| Boundary straddles multiple blocks (sequence > block size, rare) | one allocation to pack the rope into a flat buffer |
| EOF flush | zero extra allocation |
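
The "one allocation" of the rare case can be sketched as follows, assuming the packing step simply sums the slice lengths, reserves that capacity once, and copies each slice in order. The function name pack_rope is hypothetical. It is generic over AsRef<[u8]>, so it should work on a Vec<Bytes> as well as on the plain slices used here.

```rust
/// Copies every slice of a rope into one contiguous buffer.
/// Hypothetical helper for illustration; not part of the obiread API.
fn pack_rope<T: AsRef<[u8]>>(rope: &[T]) -> Vec<u8> {
    // Size the buffer up front so that it is allocated exactly once.
    let total: usize = rope.iter().map(|s| s.as_ref().len()).sum();
    let mut flat = Vec::with_capacity(total);
    for slice in rope {
        flat.extend_from_slice(slice.as_ref());
    }
    flat
}
```

Reserving the full capacity first is what keeps this to a single allocation; letting the Vec grow on demand could reallocate several times for a long record.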

SeqChunkIter

pub struct SeqChunkIter<R: Read> { /* private */ }

impl<R: Read> Iterator for SeqChunkIter<R> {
    type Item = io::Result<Vec<Bytes>>;
}

pub fn fasta_chunks<R: Read>(source: R) -> SeqChunkIter<R>
pub fn fastq_chunks<R: Read>(source: R) -> SeqChunkIter<R>

next() loop:

1. read one block of block_size bytes → push onto rope
2. probe check: if the boundary marker ("\n>" or "\n@") is absent from the
   last block, skip the splitter (avoids a full backward scan for nothing)
3. call splitter on last block
   if found at offset n:
       head = last_block.split_to(n)          ← O(1), zero copy; last_block keeps the remainder
       return std::mem::take(&mut self.rope)  ← the chunk, ending with head; the remainder starts the next rope
4. if not found and rope.len() > 1 (multi-block accumulation):
       pack rope into one flat buffer         ← one alloc
       retry splitter on flat buffer
5. if EOF: flush remaining rope as final chunk
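
The loop above can be sketched with the standard library alone. This is a minimal model under stated assumptions, not the obiread implementation: blocks are plain Vec<u8> instead of Bytes, so the split copies the tail instead of being zero-copy; only the FASTA marker is handled; the probe of step 2 is folded into the splitter; and interrupted reads are not retried. The names ChunkSketch and last_boundary are hypothetical.

```rust
use std::io::{self, Read};

/// Hypothetical stand-in for SeqChunkIter: blocks are Vec<u8>, not Bytes.
struct ChunkSketch<R: Read> {
    source: R,
    block_size: usize,
    rope: Vec<Vec<u8>>,
    eof: bool,
}

impl<R: Read> ChunkSketch<R> {
    fn new(source: R, block_size: usize) -> Self {
        ChunkSketch { source, block_size, rope: Vec::new(), eof: false }
    }
}

/// Offset of the last '>' that directly follows a line break (FASTA only).
/// A '>' at offset 0 is ignored: cutting there would yield an empty chunk.
fn last_boundary(buf: &[u8]) -> Option<usize> {
    (1..buf.len())
        .rev()
        .find(|&i| buf[i] == b'>' && matches!(buf[i - 1], b'\n' | b'\r'))
}

impl<R: Read> Iterator for ChunkSketch<R> {
    type Item = io::Result<Vec<Vec<u8>>>;

    fn next(&mut self) -> Option<Self::Item> {
        while !self.eof {
            // 1. Read one block and push it onto the rope.
            let mut block = vec![0u8; self.block_size];
            let n = match self.source.read(&mut block) {
                Ok(n) => n,
                Err(e) => return Some(Err(e)),
            };
            if n == 0 {
                self.eof = true;
                break;
            }
            block.truncate(n);
            self.rope.push(block);

            // 2-3. Look for a boundary in the last block only.
            let last = self.rope.last_mut().expect("rope is non-empty");
            if let Some(cut) = last_boundary(last.as_slice()) {
                // Vec::split_off copies the tail; Bytes would not.
                let remainder = last.split_off(cut);
                let chunk = std::mem::replace(&mut self.rope, vec![remainder]);
                return Some(Ok(chunk));
            }

            // 4. No boundary in the last block: pack the rope and retry.
            if self.rope.len() > 1 {
                let mut flat = self.rope.concat();
                if let Some(cut) = last_boundary(&flat) {
                    self.rope = vec![flat.split_off(cut)];
                    return Some(Ok(vec![flat]));
                }
                self.rope = vec![flat];
            }
        }
        // 5. EOF: flush whatever is left as the final chunk.
        if self.rope.is_empty() {
            None
        } else {
            Some(Ok(std::mem::take(&mut self.rope)))
        }
    }
}
```

The retry on the packed buffer is what catches a marker split across two blocks, with the line break at the end of one block and the > at the start of the next. Concatenating the yielded chunks in order should reproduce the input byte for byte.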

Boundary detection — FASTA

Backward scan with a 2-state machine. Searches for > immediately preceded by \n or \r:

stateDiagram-v2
    direction LR
    [*]      --> Scanning
    Scanning --> FoundGt  : '>'
    FoundGt  --> Scanning : other
    FoundGt  --> [*]      : '\\n' / '\\r' ✓

Returns the byte offset of the > that starts the last complete record.
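
A minimal sketch of this two-state backward scan is given below. The function name last_fasta_record_start is hypothetical. It departs from the diagram in one small way, which is an assumption of this sketch: a > seen while already in FoundGt becomes the new candidate instead of dropping back to Scanning. A > at offset 0 has no preceding byte and is therefore never reported.

```rust
#[derive(Clone, Copy)]
enum State {
    Scanning,
    FoundGt,
}

/// Scans `buf` from right to left and returns the offset of the last '>'
/// that is immediately preceded by '\n' or '\r'.
/// Hypothetical helper for illustration; not part of the obiread API.
fn last_fasta_record_start(buf: &[u8]) -> Option<usize> {
    let mut state = State::Scanning;
    for i in (0..buf.len()).rev() {
        state = match (state, buf[i]) {
            // The byte to the left of the candidate '>' is a line break:
            // the '>' sits at i + 1 and starts a record.
            (State::FoundGt, b'\n' | b'\r') => return Some(i + 1),
            // Any '>' is a (new) candidate.
            (_, b'>') => State::FoundGt,
            // Anything else invalidates the candidate.
            _ => State::Scanning,
        };
    }
    None
}
```

For the input ">a\nAC\n>b\nGT" this returns Some(6), the offset of the second >. A > inside a line, as in ">a\nAC>GT", is rejected and the function returns None.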

Boundary detection — FASTQ

FASTQ records have a rigid 4-line structure (@header, sequence, +, quality). The @ character (ASCII 64, Phred score 31 in the Phred+33 encoding) can appear legitimately in quality lines, including as their first character, which makes any forward heuristic unreliable. The backward scanner verifies the full structural context before accepting a candidate @.

7-state machine (port of Go's EndOfLastFastqEntry), scanning from right to left. Each time a + is found, its position is saved as restart; any state mismatch resets the scan to that position.

stateDiagram-v2
    direction LR

    [*]          --> Scanning

    Scanning     --> FoundPlus    : '+' (save restart)
    FoundPlus    --> AfterNlPlus  : '\\n' / '\\r'
    FoundPlus    --> Scanning     : other → backtrack

    AfterNlPlus  --> AfterNlPlus  : separator
    AfterNlPlus  --> InSequence   : letter / - / . / [ / ]
    AfterNlPlus  --> Scanning     : other → backtrack

    InSequence   --> AfterSequence : '\\n' / '\\r'
    InSequence   --> InSequence    : letter / - / . / [ / ]
    InSequence   --> Scanning      : other → backtrack

    AfterSequence --> AfterSequence : '\\n' / '\\r'
    AfterSequence --> InHeader      : other

    InHeader     --> FoundAt    : '@' (save cut)
    InHeader     --> Scanning   : '\\n' / '\\r' → backtrack
    InHeader     --> InHeader   : other

    FoundAt      --> [*]       : '\\n' / '\\r' ✓
    FoundAt      --> InHeader  : other

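restart is updated each time a + is found. When a state receives a byte it does not expect, the scan jumps back to restart, returns to Scanning, and continues leftward from there. A @ is therefore accepted only after the scanner has walked, from right to left, through a + line, a sequence line and a header line, and only if the @ itself directly follows a line break. A @ in a quality line is never reached in that context, so it cannot be accepted as a record start.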

Returns the byte offset of the @ that starts the last complete record.
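
The state diagram can be transcribed into a short Rust sketch. This follows the transitions as drawn above and is not the obiread source. Two points are assumptions: "separator" is taken to mean space, tab, \n or \r, and a candidate @ at offset 0 is not reported because no preceding line break can confirm it. The names State, is_seq_char and last_fastq_record_start are hypothetical.

```rust
#[derive(Clone, Copy)]
enum State {
    Scanning,
    FoundPlus,
    AfterNlPlus,
    InSequence,
    AfterSequence,
    InHeader,
    FoundAt,
}

/// Bytes allowed on a sequence line: ASCII letters and - . [ ]
fn is_seq_char(b: u8) -> bool {
    b.is_ascii_alphabetic() || matches!(b, b'-' | b'.' | b'[' | b']')
}

/// Right-to-left scan for the '@' of the last record whose header,
/// sequence and '+' lines can all be verified inside `buf`.
/// Hypothetical helper for illustration; not part of the obiread API.
fn last_fastq_record_start(buf: &[u8]) -> Option<usize> {
    use State::*;
    let (mut state, mut restart, mut cut) = (Scanning, 0usize, 0usize);
    let mut i = buf.len();
    while i > 0 {
        i -= 1;
        let b = buf[i];
        let eol = b == b'\n' || b == b'\r';
        let sep = eol || b == b' ' || b == b'\t'; // assumed meaning of "separator"
        // `None` means "mismatch: backtrack to the last '+'".
        let next = match state {
            Scanning if b == b'+' => {
                restart = i;
                Some(FoundPlus)
            }
            Scanning => Some(Scanning),
            FoundPlus if eol => Some(AfterNlPlus),
            FoundPlus => None,
            AfterNlPlus if sep => Some(AfterNlPlus),
            AfterNlPlus | InSequence if is_seq_char(b) => Some(InSequence),
            InSequence if eol => Some(AfterSequence),
            AfterNlPlus | InSequence => None,
            AfterSequence if eol => Some(AfterSequence),
            AfterSequence => Some(InHeader),
            InHeader if b == b'@' => {
                cut = i;
                Some(FoundAt)
            }
            InHeader if eol => None,
            InHeader => Some(InHeader),
            FoundAt if eol => return Some(cut),
            FoundAt => Some(InHeader),
        };
        match next {
            Some(s) => state = s,
            None => {
                // Resume in Scanning just left of the last '+'.
                state = Scanning;
                i = restart;
            }
        }
    }
    None
}
```

Each backtrack resumes strictly to the left of the previous restart position, so the scan always terminates. With the input "@r1\nACGT\n+\nII@I\n@r2\nGGCC\n+\n@III\n" the @ characters inside the two quality lines are passed over and the function returns Some(16), the offset of @r2.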