# Chunk reader: implementation

The `obiread` crate provides a streaming iterator that reads FASTA or FASTQ files in fixed-size blocks and yields self-contained chunks, each ending on a complete sequence record boundary. Chunks are consumed in parallel by downstream workers.

## Output type: rope

Each chunk is a `Vec<Bytes>`, a **rope**: a list of reference-counted byte slices that are not necessarily contiguous in memory. The consumer iterates over the slices in order.

Using `bytes::Bytes` means the split at the record boundary is O(1): `Bytes::split_to(n)` bumps a reference count and adjusts offsets; no bytes are copied. No `memcpy` in the common case.

## Allocation policy

| Case | Cost |
|------|------|
| Boundary found in the current block (common) | zero extra allocation (`split_to` only) |
| Boundary straddles multiple blocks (sequence > block size, rare) | one allocation to pack the rope into a flat buffer |
| EOF flush | zero extra allocation |

## SeqChunkIter

```rust
// `R` is the input source type; the `R: Read` bound shown here is assumed.
pub struct SeqChunkIter<R> { /* private */ }

impl<R: Read> Iterator for SeqChunkIter<R> {
    type Item = io::Result<Vec<Bytes>>;
}

pub fn fasta_chunks<R: Read>(source: R) -> SeqChunkIter<R>
pub fn fastq_chunks<R: Read>(source: R) -> SeqChunkIter<R>
```

`next()` loop:

```text
1. read one block of block_size bytes → push onto rope
2. probe check: if the boundary marker ("\n>" or "\n@") is absent from the last block,
   skip the splitter (avoids a full backward scan for nothing)
3. call splitter on last block
   if found at offset n:
     remainder = last_block.split_to(n)     ← O(1), zero copy
     return std::mem::take(&mut self.rope)  ← the chunk
4. if rope.len() > 1 (multi-block accumulation):
     pack rope into one flat buffer         ← one alloc
     retry splitter on flat buffer
5. if EOF: flush remaining rope as final chunk
```

## Boundary detection: FASTA

Backward scan with a 2-state machine.
Searches for `>` immediately preceded by `\n` or `\r`:

```mermaid
stateDiagram-v2
    direction LR
    [*] --> Scanning
    Scanning --> FoundGt : '>'
    FoundGt --> Scanning : other
    FoundGt --> [*] : '\\n' / '\\r' ✓
```

Returns the byte offset of the `>` that starts the last complete record.

## Boundary detection: FASTQ

FASTQ records have a rigid 4-line structure (`@header`, sequence, `+`, quality). The `@` character (ASCII 64, Phred score 31 in Phred+33 encoding) can appear legitimately in quality lines, so any forward heuristic is unreliable. The backward scanner verifies the full structural context before accepting a candidate `@`.

The scanner is a 7-state machine (a port of the Go function `EndOfLastFastqEntry`) that scans from **right to left**. Each time a `+` is found, its position is saved as `restart`; any state mismatch resets the scan to that position.

```mermaid
stateDiagram-v2
    direction LR
    [*] --> Scanning
    Scanning --> FoundPlus : '+' (save restart)
    FoundPlus --> AfterNlPlus : '\\n' / '\\r'
    FoundPlus --> Scanning : other → backtrack
    AfterNlPlus --> AfterNlPlus : separator
    AfterNlPlus --> InSequence : letter / - / . / [ / ]
    AfterNlPlus --> Scanning : other → backtrack
    InSequence --> AfterSequence : '\\n' / '\\r'
    InSequence --> InSequence : letter / - / . / [ / ]
    InSequence --> Scanning : other → backtrack
    AfterSequence --> AfterSequence : '\\n' / '\\r'
    AfterSequence --> InHeader : other
    InHeader --> FoundAt : '@' (save cut)
    InHeader --> Scanning : '\\n' / '\\r' → backtrack
    InHeader --> InHeader : other
    FoundAt --> [*] : '\\n' / '\\r' ✓
    FoundAt --> InHeader : other
```

When any state fails its expected input, the scan jumps back to `restart` and continues leftward from there. A `@` is therefore accepted only when the lines to its right are a header, a sequence line and a `+` line. A `@` that opens a quality line is followed instead by the next record's header, which starts with `@` and cannot pass as a sequence line, so it is never accepted as a record start.

Returns the byte offset of the `@` that starts the last complete record.
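
The FASTQ state machine above can be sketched in plain Rust as follows. This is a minimal illustration, not the crate's actual code: the function name `last_fastq_record_start`, the `Option<usize>` return type and the helper predicates are hypothetical, and treating space and tab as "separator" characters in `AfterNlPlus` is an assumption. State names follow the diagram.

```rust
/// States of the backward FASTQ scanner (names follow the diagram).
#[derive(Clone, Copy)]
enum State {
    Scanning,
    FoundPlus,
    AfterNlPlus,
    InSequence,
    AfterSequence,
    InHeader,
    FoundAt,
}

fn is_eol(c: u8) -> bool {
    c == b'\n' || c == b'\r'
}

/// Characters accepted on a sequence line: letters, '-', '.', '[' and ']'.
fn is_seq_char(c: u8) -> bool {
    c.is_ascii_alphabetic() || matches!(c, b'-' | b'.' | b'[' | b']')
}

/// Hypothetical helper: offset of the `@` accepted by the backward scan,
/// or `None` if no candidate passes all structural checks.
pub fn last_fastq_record_start(buf: &[u8]) -> Option<usize> {
    let mut state = State::Scanning;
    let mut restart = buf.len(); // position of the most recent '+' candidate
    let mut cut = 0; // position of the most recent '@' candidate
    let mut i = buf.len();
    while i > 0 {
        i -= 1; // scan from right to left
        let c = buf[i];
        state = match state {
            State::Scanning if c == b'+' => {
                restart = i;
                State::FoundPlus
            }
            State::Scanning => State::Scanning,
            State::FoundPlus if is_eol(c) => State::AfterNlPlus,
            // Assumption: "separator" means end-of-line, space or tab.
            State::AfterNlPlus if is_eol(c) || c == b' ' || c == b'\t' => State::AfterNlPlus,
            State::AfterNlPlus | State::InSequence if is_seq_char(c) => State::InSequence,
            State::InSequence if is_eol(c) => State::AfterSequence,
            State::AfterSequence if is_eol(c) => State::AfterSequence,
            State::AfterSequence => State::InHeader,
            State::InHeader if c == b'@' => {
                cut = i;
                State::FoundAt
            }
            State::InHeader if !is_eol(c) => State::InHeader,
            State::FoundAt if is_eol(c) => return Some(cut),
            State::FoundAt => State::InHeader,
            // Any other (state, byte) pair is a mismatch: jump back to the
            // last '+' and resume scanning just to its left.
            _ => {
                i = restart;
                State::Scanning
            }
        };
    }
    None
}
```

With such a helper, a caller would cut the buffer at the returned offset: the bytes before it form the chunk, and the bytes from it onward are carried over to the next read. A `None` result means no boundary was found in the buffer.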