# Chunk reader — implementation
The `obiread` crate provides a streaming iterator that reads FASTA or FASTQ files in fixed-size blocks and yields self-contained chunks, each ending on a complete sequence record boundary. Chunks are consumed in parallel by downstream workers.
## Output type: rope
Each chunk is a `Vec<Bytes>`, a **rope**: a list of reference-counted byte slices that are not necessarily contiguous in memory. The consumer iterates over the slices in order.

Using `bytes::Bytes` makes the split at the record boundary O(1): `Bytes::split_to(n)` returns a handle to the first `n` bytes and leaves the rest in the original handle, adjusting only offsets and the shared reference count. No bytes are copied, so there is no `memcpy` in the common case.
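
To make the O(1) claim concrete, here is a minimal standard-library sketch of a reference-counted view with a `split_to` method. `SharedSlice` is a hypothetical stand-in written for illustration; it is not the `bytes::Bytes` implementation, only the same idea reduced to an `Arc` and two offsets.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for `bytes::Bytes`: a window into a shared buffer.
#[derive(Clone, Debug)]
pub struct SharedSlice {
    buf: Arc<[u8]>, // shared, reference-counted storage
    start: usize,   // first byte of this view
    end: usize,     // one past the last byte of this view
}

impl SharedSlice {
    /// Moves the data into shared storage (the only copy in this sketch).
    pub fn new(data: Vec<u8>) -> Self {
        let end = data.len();
        Self { buf: Arc::from(data), start: 0, end }
    }

    pub fn as_slice(&self) -> &[u8] {
        &self.buf[self.start..self.end]
    }

    /// Returns a view of the first `n` bytes; `self` keeps the rest.
    /// Only the reference count and two offsets change, no bytes move.
    pub fn split_to(&mut self, n: usize) -> SharedSlice {
        assert!(n <= self.end - self.start, "split point out of bounds");
        let head = SharedSlice {
            buf: Arc::clone(&self.buf),
            start: self.start,
            end: self.start + n,
        };
        self.start += n;
        head
    }
}
```

Splitting a block at a record boundary then yields two views over the same allocation: the head joins the outgoing chunk and the tail stays behind for the next one.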
## Allocation policy
| Case | Cost |
|------|------|
| Boundary found in the current block (common) | zero extra allocation (`split_to` only) |
| Boundary straddles multiple blocks (sequence > block size, rare) | one allocation to pack the rope into a flat buffer |
| EOF flush | zero extra allocation |
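
The "one allocation" in the multi-block case can be pictured with the sketch below. The helper name `pack_rope` is an assumption made for this example; it accepts any slice type that implements `AsRef<[u8]>`, so it works with `Vec<u8>` here and would work with `Bytes` as well.

```rust
/// Hypothetical helper: concatenate the slices of a rope into one buffer.
pub fn pack_rope<T: AsRef<[u8]>>(rope: &[T]) -> Vec<u8> {
    // Compute the final size first so the buffer is allocated exactly once.
    let total: usize = rope.iter().map(|part| part.as_ref().len()).sum();
    let mut flat = Vec::with_capacity(total);
    for part in rope {
        // Copies bytes, but never reallocates: capacity is already sufficient.
        flat.extend_from_slice(part.as_ref());
    }
    flat
}
```

Sizing the buffer up front is what keeps the cost at a single allocation regardless of how many blocks the rope holds.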
## SeqChunkIter
```rust
use std::io::{self, Read};
use bytes::Bytes;

pub struct SeqChunkIter<R: Read> { /* private */ }

impl<R: Read> Iterator for SeqChunkIter<R> {
    type Item = io::Result<Vec<Bytes>>;
}

pub fn fasta_chunks<R: Read>(source: R) -> SeqChunkIter<R>
pub fn fastq_chunks<R: Read>(source: R) -> SeqChunkIter<R>
```
`next()` loop:
```text
1. read one block of block_size bytes → push onto rope
2. probe check: if the boundary marker ("\n>" or "\n@") is absent from the
   last block, skip the splitter (avoids a full backward scan for nothing)
3. call splitter on last block
   if found at offset n:
     head = last_block.split_to(n)          ← O(1), zero copy; last_block keeps the remainder
     return std::mem::take(&mut self.rope)  ← the chunk
4. if rope.len() > 1 (multi-block accumulation):
     pack rope into one flat buffer         ← one alloc
     retry splitter on flat buffer
5. if EOF: flush remaining rope as final chunk
```
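
The loop above can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not the crate's code: `SketchChunkIter` and `last_boundary` are hypothetical names, only FASTA is handled, the probe of step 2 is folded into the boundary search, and `Vec<u8>` replaces `Bytes`, so the split in step 3 copies the tail instead of being zero-copy.

```rust
use std::io::{self, Read};

/// Hypothetical, simplified stand-in for `SeqChunkIter` (FASTA only).
pub struct SketchChunkIter<R: Read> {
    source: R,
    block_size: usize,
    rope: Vec<Vec<u8>>,
    eof: bool,
}

impl<R: Read> SketchChunkIter<R> {
    pub fn new(source: R, block_size: usize) -> Self {
        Self { source, block_size, rope: Vec::new(), eof: false }
    }
}

/// Offset of the last `>` directly preceded by a line ending, if any.
fn last_boundary(buf: &[u8]) -> Option<usize> {
    (1..buf.len())
        .rev()
        .find(|&i| buf[i] == b'>' && matches!(buf[i - 1], b'\n' | b'\r'))
}

impl<R: Read> Iterator for SketchChunkIter<R> {
    type Item = io::Result<Vec<Vec<u8>>>;

    fn next(&mut self) -> Option<Self::Item> {
        while !self.eof {
            // 1. Read one block and push it onto the rope.
            let mut block = Vec::with_capacity(self.block_size);
            let limit = self.block_size as u64;
            if let Err(e) = self.source.by_ref().take(limit).read_to_end(&mut block) {
                return Some(Err(e));
            }
            self.eof = block.len() < self.block_size; // short read means EOF
            if block.is_empty() {
                break;
            }
            self.rope.push(block);

            // 2-3. Look for a boundary in the last block only.
            let last = self.rope.last_mut().unwrap();
            if let Some(n) = last_boundary(last.as_slice()) {
                let remainder = last.split_off(n); // copies the tail in this sketch
                let chunk = std::mem::replace(&mut self.rope, vec![remainder]);
                return Some(Ok(chunk));
            }

            // 4. Multi-block accumulation: pack and retry on the flat buffer.
            if self.rope.len() > 1 {
                let mut flat = self.rope.concat();
                if let Some(n) = last_boundary(&flat) {
                    let remainder = flat.split_off(n);
                    self.rope = vec![remainder];
                    return Some(Ok(vec![flat]));
                }
                self.rope = vec![flat];
            }
        }
        // 5. EOF: flush whatever is left as the final chunk.
        if self.rope.is_empty() {
            None
        } else {
            Some(Ok(std::mem::take(&mut self.rope)))
        }
    }
}
```

Every chunk produced this way starts at a record header and the concatenation of all chunks reproduces the input. Repacking on every iteration is quadratic for very long records, which is acceptable for an illustration.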
## Boundary detection — FASTA
Backward scan with a 2-state machine. Searches for `>` immediately preceded by `\n` or `\r`:
```mermaid
stateDiagram-v2
    direction LR
    [*] --> Scanning
    Scanning --> FoundGt : '>'
    FoundGt --> Scanning : other
    FoundGt --> [*] : '\\n' / '\\r' ✓
```
Returns the byte offset of the `>` that starts the last complete record.
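
A minimal sketch of this two-state backward scan follows. The function name `last_fasta_boundary` is an assumption for this example, and the code takes one small liberty over the diagram: a second `>` seen while in `FoundGt` becomes the new candidate instead of dropping back to `Scanning`.

```rust
/// The two states of the backward scan.
#[derive(Clone, Copy)]
enum State {
    Scanning,
    FoundGt(usize), // offset of the candidate '>'
}

/// Hypothetical name: offset of the last '>' that directly follows a
/// line ending, or `None` if the buffer contains no such position.
pub fn last_fasta_boundary(buf: &[u8]) -> Option<usize> {
    let mut state = State::Scanning;
    // Walk the buffer from its last byte to its first.
    for (i, &c) in buf.iter().enumerate().rev() {
        state = match (state, c) {
            // The byte before the candidate '>' is a line ending: accept.
            (State::FoundGt(pos), b'\n' | b'\r') => return Some(pos),
            // Any '>' is a (new) candidate.
            (_, b'>') => State::FoundGt(i),
            // Anything else rejects the candidate, if there was one.
            _ => State::Scanning,
        };
    }
    None
}
```

A `>` in the middle of a line is rejected because the byte before it is not a line ending, and a `>` at offset 0 is never accepted because nothing precedes it in the buffer.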
## Boundary detection — FASTQ
FASTQ records have a rigid 4-line structure (`@header`, sequence, `+`, quality). The `@` character (ASCII 64, Phred score 31 in the Phred+33 encoding) can legitimately appear in quality lines, which makes any forward heuristic unreliable. The backward scanner verifies the full structural context before accepting a candidate `@`.

The scanner is a 7-state machine (a port of the Go function `EndOfLastFastqEntry`) that reads the buffer from **right to left**. Each time a `+` is found, its position is saved as `restart`; any state mismatch resets the scan to that position.
```mermaid
stateDiagram-v2
    direction LR

    [*] --> Scanning

    Scanning --> FoundPlus : '+' (save restart)
    FoundPlus --> AfterNlPlus : '\\n' / '\\r'
    FoundPlus --> Scanning : other → backtrack

    AfterNlPlus --> AfterNlPlus : separator
    AfterNlPlus --> InSequence : letter / - / . / [ / ]
    AfterNlPlus --> Scanning : other → backtrack

    InSequence --> AfterSequence : '\\n' / '\\r'
    InSequence --> InSequence : letter / - / . / [ / ]
    InSequence --> Scanning : other → backtrack

    AfterSequence --> AfterSequence : '\\n' / '\\r'
    AfterSequence --> InHeader : other

    InHeader --> FoundAt : '@' (save cut)
    InHeader --> Scanning : '\\n' / '\\r' → backtrack
    InHeader --> InHeader : other

    FoundAt --> [*] : '\\n' / '\\r' ✓
    FoundAt --> InHeader : other
```
`restart` is updated each time a `+` is found. When a state does not receive its expected input, the scan jumps back to `restart` and resumes from there in the `Scanning` state. A `@` inside a quality line therefore cannot be accepted as a record start: a `@` is only considered once the scanner has crossed, going backward, a `+` line and a sequence line, and the first line reached after those is the header.

Returns the byte offset of the `@` that starts the last complete record.
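
The diagram can be transcribed into Rust roughly as follows. This is a sketch that follows the transitions shown above, not the actual port: the name `last_fastq_boundary` is hypothetical, and "separator" is assumed to mean a space, a tab, or a line ending.

```rust
#[derive(Clone, Copy)]
enum State { Scanning, FoundPlus, AfterNlPlus, InSequence, AfterSequence, InHeader, FoundAt }

fn is_eol(c: u8) -> bool { c == b'\n' || c == b'\r' }
fn is_sep(c: u8) -> bool { is_eol(c) || c == b' ' || c == b'\t' }
fn is_seq(c: u8) -> bool { c.is_ascii_alphabetic() || matches!(c, b'-' | b'.' | b'[' | b']') }

/// Hypothetical name: offset of the `@` accepted by the backward scan.
pub fn last_fastq_boundary(buf: &[u8]) -> Option<usize> {
    let mut state = State::Scanning;
    let mut restart = buf.len(); // position of the last '+' candidate
    let mut cut = 0;             // position of the last '@' candidate
    let mut i = buf.len();
    while i > 0 {
        i -= 1;
        let c = buf[i];
        state = match state {
            State::Scanning if c == b'+' => { restart = i; State::FoundPlus }
            State::Scanning => State::Scanning,
            // '+' must be alone at the start of its line.
            State::FoundPlus if is_eol(c) => State::AfterNlPlus,
            State::AfterNlPlus if is_sep(c) => State::AfterNlPlus,
            State::AfterNlPlus | State::InSequence if is_seq(c) => State::InSequence,
            State::InSequence if is_eol(c) => State::AfterSequence,
            State::AfterSequence if is_eol(c) => State::AfterSequence,
            State::AfterSequence => State::InHeader,
            State::InHeader if c == b'@' => { cut = i; State::FoundAt }
            State::InHeader if !is_eol(c) => State::InHeader,
            // '@' preceded by a line ending: the record start is confirmed.
            State::FoundAt if is_eol(c) => return Some(cut),
            State::FoundAt => State::InHeader,
            // Every remaining case is a mismatch: backtrack. The next
            // iteration resumes just left of the rejected '+'.
            _ => { i = restart; State::Scanning }
        };
    }
    None
}
```

Because each backtrack resumes strictly to the left of the rejected `+`, `restart` only ever decreases and the scan terminates. A record that begins at offset 0 is never reported, since no line ending precedes its `@` within the buffer.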