Chunk reader — implementation
The obiread crate provides a streaming iterator that reads FASTA or FASTQ files in fixed-size blocks and yields self-contained chunks, each ending on a complete sequence record boundary. Chunks are consumed in parallel by downstream workers.
Output type: rope
Each chunk is a Vec<Bytes> — a rope: a list of reference-counted byte slices that are not necessarily contiguous in memory. The consumer iterates over the slices in order.
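Using bytes::Bytes means the split at the record boundary is O(1): Bytes::split_to(n) returns the first n bytes as a new handle onto the same buffer and leaves the rest in the original handle. It only increments a reference count and adjusts offsets, so no bytes are copied in the common case.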
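To make the O(1) split concrete, here is a minimal std-only model of a reference-counted slice. `SharedSlice` is a hypothetical stand-in written for illustration; it is not how the bytes crate is implemented, only a sketch of the same idea (shared buffer plus offsets).

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a reference-counted byte slice (not bytes::Bytes).
struct SharedSlice {
    buf: Arc<[u8]>, // shared backing buffer
    start: usize,   // view is buf[start..end]
    end: usize,
}

impl SharedSlice {
    /// Copies `data` once into a shared buffer.
    fn new(data: &[u8]) -> Self {
        SharedSlice { buf: Arc::from(data), start: 0, end: data.len() }
    }

    fn as_bytes(&self) -> &[u8] {
        &self.buf[self.start..self.end]
    }

    /// Returns the first `n` bytes as a new view; `self` keeps the rest.
    /// Only the reference count and two offsets change, no bytes move.
    fn split_to(&mut self, n: usize) -> SharedSlice {
        assert!(n <= self.end - self.start, "split point out of range");
        let head = SharedSlice {
            buf: Arc::clone(&self.buf),
            start: self.start,
            end: self.start + n,
        };
        self.start += n;
        head
    }
}
```

Calling `split_to(6)` on a view of `">a\nAC\n>b\nGG"` yields a head holding the first record and leaves the second record in the original view; both views share one allocation.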
Allocation policy
| Case | Cost |
|---|---|
| Boundary found in the current block (common) | zero extra allocation — split_to only |
| Boundary straddles multiple blocks (sequence > block size, rare) | one allocation to pack the rope into a flat buffer |
| EOF flush | zero extra allocation |
SeqChunkIter
pub struct SeqChunkIter<R: Read> { /* private */ }
impl<R: Read> Iterator for SeqChunkIter<R> {
    type Item = io::Result<Vec<Bytes>>;
}
pub fn fasta_chunks<R: Read>(source: R) -> SeqChunkIter<R>
pub fn fastq_chunks<R: Read>(source: R) -> SeqChunkIter<R>
next() loop:
1. read one block of block_size bytes → push onto rope
2. probe check: if the boundary marker ("\n>" or "\n@") is absent from the
   last block, skip the splitter (avoids a full backward scan for nothing)
3. call splitter on last block
   if found at offset n:
       head = last_block.split_to(n)            ← O(1), zero copy; head closes the chunk, last_block keeps the remainder
       return std::mem::take(&mut self.rope)    ← the chunk
4. if rope.len() > 1 (multi-block accumulation):
       pack rope into one flat buffer           ← one alloc
       retry splitter on flat buffer
5. if EOF: flush remaining rope as final chunk
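The control flow of the loop above can be sketched with the standard library alone. This is a simplified model under stated assumptions, not the crate's code: `BlockChunks`, `fasta_cut`, and `chunk_all` are hypothetical names, blocks are plain `Vec<u8>` (so the split copies the remainder instead of being zero-copy), the probe of step 2 is omitted, `block_size` is assumed to be non-zero, and the splitter is a naive search for `"\n>"`.

```rust
use std::io::{self, Read};

/// Simplified, copying stand-in for the chunk iterator (hypothetical name).
struct BlockChunks<R: Read> {
    source: R,
    block_size: usize,
    splitter: fn(&[u8]) -> Option<usize>,
    rope: Vec<Vec<u8>>, // blocks accumulated since the last emitted chunk
    eof: bool,
}

impl<R: Read> BlockChunks<R> {
    fn new(source: R, block_size: usize, splitter: fn(&[u8]) -> Option<usize>) -> Self {
        BlockChunks { source, block_size, splitter, rope: Vec::new(), eof: false }
    }
}

/// Naive FASTA splitter: offset of the last '>' that directly follows '\n'.
fn fasta_cut(buf: &[u8]) -> Option<usize> {
    buf.windows(2).rposition(|w| w == b"\n>").map(|p| p + 1)
}

impl<R: Read> Iterator for BlockChunks<R> {
    type Item = io::Result<Vec<Vec<u8>>>;

    fn next(&mut self) -> Option<Self::Item> {
        while !self.eof {
            // Step 1: read one block and push it onto the rope.
            let mut block = vec![0u8; self.block_size];
            let n = match self.source.read(&mut block) {
                Ok(n) => n,
                Err(e) => return Some(Err(e)),
            };
            if n == 0 {
                self.eof = true;
                break;
            }
            block.truncate(n);
            self.rope.push(block);

            // Step 3: look for a boundary in the last block only.
            let last = self.rope.len() - 1;
            if let Some(cut) = (self.splitter)(&self.rope[last]) {
                // Vec::split_off copies the tail; Bytes would not.
                let remainder = self.rope[last].split_off(cut);
                let chunk = std::mem::replace(&mut self.rope, vec![remainder]);
                return Some(Ok(chunk));
            }

            // Step 4: the boundary may straddle blocks, so pack and retry.
            if self.rope.len() > 1 {
                let mut flat = self.rope.concat();
                if let Some(cut) = (self.splitter)(&flat) {
                    self.rope = vec![flat.split_off(cut)];
                    return Some(Ok(vec![flat]));
                }
                self.rope = vec![flat];
            }
        }
        // Step 5: at EOF, flush whatever is left as the final chunk.
        if self.rope.is_empty() {
            None
        } else {
            Some(Ok(std::mem::take(&mut self.rope)))
        }
    }
}

/// Usage helper: chunk an in-memory FASTA buffer and flatten each rope.
fn chunk_all(data: &[u8], block_size: usize) -> io::Result<Vec<Vec<u8>>> {
    BlockChunks::new(data, block_size, fasta_cut)
        .map(|rope| rope.map(|blocks| blocks.concat()))
        .collect()
}
```

With a deliberately tiny block size, every chunk still begins with `>` and the chunks concatenate back to the input, which is the invariant downstream workers rely on.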
Boundary detection — FASTA
Backward scan with a 2-state machine. Searches for > immediately preceded by \n or \r:
stateDiagram-v2
direction LR
[*] --> Scanning
Scanning --> FoundGt : '>'
FoundGt --> Scanning : other
FoundGt --> [*] : '\\n' / '\\r' ✓
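Returns the byte offset of the last > that starts a record in the buffer. The chunk is cut at that offset, so it contains only complete records and the record beginning at > is carried over to the next chunk.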
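A minimal sketch of this two-state backward scan, following the diagram. The function name `last_fasta_record_start` is hypothetical; the `Option<usize>` candidate plays the role of the FoundGt state.

```rust
/// Scans `buf` from right to left and returns the offset of the last '>'
/// that is immediately preceded by '\n' or '\r', or None if there is none.
fn last_fasta_record_start(buf: &[u8]) -> Option<usize> {
    // None = Scanning, Some(offset) = FoundGt with the offset of that '>'.
    let mut candidate: Option<usize> = None;
    for i in (0..buf.len()).rev() {
        let c = buf[i];
        match candidate {
            None => {
                if c == b'>' {
                    candidate = Some(i);
                }
            }
            Some(gt) => {
                if c == b'\n' || c == b'\r' {
                    return Some(gt); // '>' sits at the start of a line
                }
                candidate = None; // any other byte: back to Scanning
            }
        }
    }
    // A '>' at offset 0 has no preceding line ending and is not reported.
    None
}
```

A `>` in the middle of a line is rejected because the byte before it is not a line ending, so the scan keeps moving left until it reaches a real header.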
Boundary detection — FASTQ
FASTQ records have a rigid 4-line structure (@header, sequence, +, quality). The @ character (ASCII 64, i.e. quality 31 in the Phred+33 encoding) can legitimately appear in quality lines, which makes any forward heuristic unreliable. The backward scanner verifies the full structural context before accepting a candidate @.
7-state machine (port of Go's EndOfLastFastqEntry), scanning from right to left. Each time a + is found, its position is saved as restart; any state mismatch resets the scan to that position.
stateDiagram-v2
direction LR
[*] --> Scanning
Scanning --> FoundPlus : '+' (save restart)
FoundPlus --> AfterNlPlus : '\\n' / '\\r'
FoundPlus --> Scanning : other → backtrack
AfterNlPlus --> AfterNlPlus : separator
AfterNlPlus --> InSequence : letter / - / . / [ / ]
AfterNlPlus --> Scanning : other → backtrack
InSequence --> AfterSequence : '\\n' / '\\r'
InSequence --> InSequence : letter / - / . / [ / ]
InSequence --> Scanning : other → backtrack
AfterSequence --> AfterSequence : '\\n' / '\\r'
AfterSequence --> InHeader : other
InHeader --> FoundAt : '@' (save cut)
InHeader --> Scanning : '\\n' / '\\r' → backtrack
InHeader --> InHeader : other
FoundAt --> [*] : '\\n' / '\\r' ✓
FoundAt --> InHeader : other
Because restart always holds the position of the most recent + candidate, a failed match resumes the scan from there instead of starting over. An @ is accepted only in the InHeader state, which is reached only after a + line and a sequence line have been validated (going backward). An @ inside a quality line is therefore never taken as a record start.
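Returns the byte offset of the accepted @, the last record start found in the buffer. The chunk is cut at that offset, so the record beginning at @ is carried over to the next chunk.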
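The state machine can be sketched as follows. This is an illustration built from the diagram, not the crate's code: the function name `last_fastq_record_start` is hypothetical, "separator" is assumed to mean a space, a tab, or a line ending, and "letter" is assumed to mean an ASCII letter.

```rust
/// Scans `buf` from right to left and returns the offset of the last '@'
/// that is followed by a header, a sequence line and a '+' line, or None.
fn last_fastq_record_start(buf: &[u8]) -> Option<usize> {
    #[derive(Clone, Copy)]
    enum State { Scanning, FoundPlus, AfterNlPlus, InSequence, AfterSequence, InHeader, FoundAt }
    use State::*;

    let is_eol = |c: u8| c == b'\n' || c == b'\r';
    // Assumption: sequence symbols are ASCII letters plus - . [ ]
    let is_seq = |c: u8| c.is_ascii_alphabetic() || matches!(c, b'-' | b'.' | b'[' | b']');

    let mut state = Scanning;
    let mut restart = buf.len(); // position of the most recent '+' candidate
    let mut cut = 0;             // position of the candidate '@'
    let mut i = buf.len();
    while i > 0 {
        i -= 1;
        let c = buf[i];
        state = match state {
            Scanning => {
                if c == b'+' { restart = i; FoundPlus } else { Scanning }
            }
            // The '+' must be the first byte of its line.
            FoundPlus => {
                if is_eol(c) { AfterNlPlus } else { i = restart; Scanning }
            }
            AfterNlPlus => {
                if is_eol(c) || c == b' ' || c == b'\t' { AfterNlPlus }
                else if is_seq(c) { InSequence }
                else { i = restart; Scanning }
            }
            InSequence => {
                if is_eol(c) { AfterSequence }
                else if is_seq(c) { InSequence }
                else { i = restart; Scanning }
            }
            AfterSequence => {
                if is_eol(c) { AfterSequence } else { InHeader }
            }
            InHeader => {
                if c == b'@' { cut = i; FoundAt }
                else if is_eol(c) { i = restart; Scanning }
                else { InHeader }
            }
            // The '@' must be the first byte of its line to be accepted.
            FoundAt => {
                if is_eol(c) { return Some(cut); }
                InHeader
            }
        };
    }
    None
}
```

On backtrack the index is reset to `restart` and the loop decrements it before the next read, so scanning resumes just left of the rejected `+`. A `+` inside a quality line therefore costs one short detour, and an `@` seen while in Scanning is simply ignored.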