First implementation, but far from optimal
# Chunk reader — implementation
The `obiread` crate provides a streaming iterator that reads FASTA or FASTQ files in fixed-size blocks and yields self-contained chunks, each ending on a complete sequence record boundary. Chunks are consumed in parallel by downstream workers.
## Output type: rope
Each chunk is a `Vec<Bytes>` used as a **rope**: a list of reference-counted byte slices that are not necessarily contiguous in memory. The consumer iterates over the slices in order.

Using `bytes::Bytes` makes the split at the record boundary O(1): `Bytes::split_to(n)` only creates a second handle onto the same buffer and adjusts pointers and lengths, so no payload bytes are copied. In the common case there is no `memcpy` at all.
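As a rough illustration of how a consumer might walk a rope, the sketch below counts FASTA records across slice boundaries. It is generic over `AsRef<[u8]>`, so it accepts `Vec<u8>` slices (used here to stay within the standard library) as well as `Bytes`. The function names `rope_len` and `count_fasta_records` are hypothetical and not part of `obiread`.

```rust
/// Total number of bytes held by the rope (hypothetical helper).
pub fn rope_len<B: AsRef<[u8]>>(rope: &[B]) -> usize {
    rope.iter().map(|slice| slice.as_ref().len()).sum()
}

/// Count the lines that start with '>' in a rope (hypothetical helper).
/// Assumes the chunk itself starts at the beginning of a line.
pub fn count_fasta_records<B: AsRef<[u8]>>(rope: &[B]) -> usize {
    let mut at_line_start = true;
    let mut records = 0;
    for slice in rope {
        for &c in slice.as_ref() {
            if at_line_start && c == b'>' {
                records += 1;
            }
            // This flag survives the jump from one slice to the next.
            at_line_start = c == b'\n' || c == b'\r';
        }
    }
    records
}
```

The only extra care compared with a flat buffer is that parsing state, here `at_line_start`, has to be carried from one slice to the next.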
## Allocation policy
| Case | Cost |
|------|------|
| Boundary found in the current block (common) | zero extra allocation (`split_to` only) |
| Boundary straddles multiple blocks (sequence > block size, rare) | one allocation to pack the rope into a flat buffer |
| EOF flush | zero extra allocation |
## SeqChunkIter
```rust
use std::io::{self, Read};

use bytes::Bytes;

/// Streaming iterator over record-aligned chunks.
pub struct SeqChunkIter<R: Read> { /* private */ }

impl<R: Read> Iterator for SeqChunkIter<R> {
    // One chunk per item: a rope of `Bytes` slices, or the I/O error.
    type Item = io::Result<Vec<Bytes>>;
}

// Constructors, one per format (signatures only).
pub fn fasta_chunks<R: Read>(source: R) -> SeqChunkIter<R>
pub fn fastq_chunks<R: Read>(source: R) -> SeqChunkIter<R>
```
`next()` loop:
```text
1. read one block of block_size bytes → push onto rope
2. probe check: if the boundary marker ("\n>" or "\n@") is absent from the
   last block, skip the splitter (avoids a full backward scan for nothing)
3. call splitter on last block
   if found at offset n:
       head = last_block.split_to(n)          ← O(1), zero copy; head is [0, n),
                                                last_block keeps [n, len)
       return std::mem::take(&mut self.rope)  ← the chunk
4. if rope.len() > 1 (multi-block accumulation):
       pack rope into one flat buffer         ← one alloc
       retry splitter on flat buffer
5. if EOF: flush remaining rope as final chunk
```
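The loop above can be sketched with the standard library alone. This is a simplified, FASTA-only stand-in and not the crate's code: `ChunkSketch`, `try_split`, and `fasta_split_point` are hypothetical names, and `Vec<u8>` replaces `Bytes`, so `split_off` copies the remainder where `Bytes::split_to` would not.

```rust
use std::io::{self, Read};

/// Backward search for the last '>' preceded by a line break.
fn fasta_split_point(buf: &[u8]) -> Option<usize> {
    (1..buf.len())
        .rev()
        .find(|&i| buf[i] == b'>' && matches!(buf[i - 1], b'\n' | b'\r'))
}

/// Hypothetical std-only stand-in for the chunk iterator.
pub struct ChunkSketch<R: Read> {
    source: R,
    block_size: usize,
    rope: Vec<Vec<u8>>,
    eof: bool,
}

impl<R: Read> ChunkSketch<R> {
    pub fn new(source: R, block_size: usize) -> Self {
        ChunkSketch { source, block_size, rope: Vec::new(), eof: false }
    }

    /// Probe (step 2), then splitter (step 3), on the last slice of the rope.
    fn try_split(&mut self) -> Option<Vec<Vec<u8>>> {
        let last = self.rope.last_mut()?;
        if !last.windows(2).any(|w| w == b"\n>") {
            return None; // marker absent: skip the backward scan
        }
        let n = fasta_split_point(last)?;
        let remainder = last.split_off(n); // `last` keeps [0, n)
        // The rope becomes the chunk; the remainder seeds the next rope.
        Some(std::mem::replace(&mut self.rope, vec![remainder]))
    }
}

impl<R: Read> Iterator for ChunkSketch<R> {
    type Item = io::Result<Vec<Vec<u8>>>;

    fn next(&mut self) -> Option<Self::Item> {
        while !self.eof {
            // Step 1: read one block and push it onto the rope.
            let mut block = Vec::with_capacity(self.block_size);
            let limit = self.block_size as u64;
            let read = self.source.by_ref().take(limit).read_to_end(&mut block);
            match read {
                Ok(n) => self.eof = n < self.block_size,
                Err(e) => return Some(Err(e)),
            }
            if block.is_empty() {
                break;
            }
            self.rope.push(block);
            // Steps 2-3: look for a boundary in the last block only.
            if let Some(chunk) = self.try_split() {
                return Some(Ok(chunk));
            }
            // Step 4: pack the rope into one flat buffer and retry.
            if self.rope.len() > 1 {
                let flat = self.rope.concat();
                self.rope = vec![flat];
                if let Some(chunk) = self.try_split() {
                    return Some(Ok(chunk));
                }
            }
        }
        // Step 5: at EOF, flush what is left as the final chunk.
        let rest = std::mem::take(&mut self.rope);
        if rest.iter().all(|s| s.is_empty()) { None } else { Some(Ok(rest)) }
    }
}
```

The retry on the flat buffer matters when the marker itself straddles two blocks, for example a `\n` ending one block and a `>` starting the next: the last block alone does not contain `"\n>"`, but the packed buffer does.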
## Boundary detection — FASTA
Backward scan with a 2-state machine. Searches for `>` immediately preceded by `\n` or `\r`:
```mermaid
stateDiagram-v2
    direction LR
    [*] --> Scanning
    Scanning --> FoundGt : '>'
    FoundGt --> Scanning : other
    FoundGt --> [*] : '\\n' / '\\r' ✓
```
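Returns the byte offset of the last `>` that follows a line break, that is, the start of the last record in the buffer. Everything before that offset consists of whole records.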
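A minimal sketch of such a backward scan, assuming the buffer is a plain byte slice and that `None` signals that no boundary was found. The name `last_fasta_record_start` and the return type are assumptions, not the crate's API.

```rust
/// Scan `buf` from right to left and return the offset of the last '>'
/// that is immediately preceded by '\n' or '\r' (hypothetical helper).
pub fn last_fasta_record_start(buf: &[u8]) -> Option<usize> {
    // `found_gt` is true when the byte just read (to the right) was '>'.
    let mut found_gt = false;
    let mut i = buf.len();
    while i > 0 {
        i -= 1;
        let c = buf[i];
        if found_gt && (c == b'\n' || c == b'\r') {
            return Some(i + 1); // offset of the '>' itself
        }
        // Any other byte sends us back to Scanning; unlike the diagram,
        // a second '>' is treated here as a new candidate.
        found_gt = c == b'>';
    }
    None
}
```

A `>` at offset 0 is never reported, because nothing precedes it in the buffer; splitting there would produce an empty chunk anyway.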
## Boundary detection — FASTQ
FASTQ records have a rigid 4-line structure (`@header`, sequence, `+`, quality). The `@` character (ASCII 64, Phred score 31 in the Phred+33 encoding) can legitimately appear in quality lines, which makes any forward heuristic unreliable. The backward scanner verifies the full structural context before accepting a candidate `@`.

The scanner is a 7-state machine (a port of the Go function `EndOfLastFastqEntry`) that reads the buffer from **right to left**. Each time a `+` is found, its position is saved as `restart`; any state mismatch resets the scan to that position.
```mermaid
stateDiagram-v2
    direction LR

    [*] --> Scanning

    Scanning --> FoundPlus : '+' (save restart)
    FoundPlus --> AfterNlPlus : '\\n' / '\\r'
    FoundPlus --> Scanning : other → backtrack

    AfterNlPlus --> AfterNlPlus : separator
    AfterNlPlus --> InSequence : letter / - / . / [ / ]
    AfterNlPlus --> Scanning : other → backtrack

    InSequence --> AfterSequence : '\\n' / '\\r'
    InSequence --> InSequence : letter / - / . / [ / ]
    InSequence --> Scanning : other → backtrack

    AfterSequence --> AfterSequence : '\\n' / '\\r'
    AfterSequence --> InHeader : other

    InHeader --> FoundAt : '@' (save cut)
    InHeader --> Scanning : '\\n' / '\\r' → backtrack
    InHeader --> InHeader : other

    FoundAt --> [*] : '\\n' / '\\r' ✓
    FoundAt --> InHeader : other
```
`restart` is updated each time a `+` is found. When a state does not receive the input it expects, the scan jumps back to `restart` and resumes from there in the `Scanning` state. This guarantees that a `@` in a quality line cannot be accepted as a record start, because the sequence line and the `+` separator line that must follow a real header will not be found after it.

Returns the byte offset of the accepted `@`, that is, the start of the last record whose header, sequence, and `+` lines have all been verified.
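The state machine can be transcribed fairly directly. The sketch below follows the diagram, not the Go or Rust sources, and makes a few assumptions: "separator" is taken to mean a line break, a space, or a tab; `None` signals that no boundary was found; and the names `State`, `is_eol`, `is_sep`, `is_seq`, and `last_fastq_record_start` are hypothetical.

```rust
#[derive(Clone, Copy)]
enum State { Scanning, FoundPlus, AfterNlPlus, InSequence, AfterSequence, InHeader, FoundAt }

fn is_eol(c: u8) -> bool { c == b'\n' || c == b'\r' }
// Assumption: "separator" means a line break, a space, or a tab.
fn is_sep(c: u8) -> bool { is_eol(c) || c == b' ' || c == b'\t' }
fn is_seq(c: u8) -> bool { c.is_ascii_alphabetic() || matches!(c, b'-' | b'.' | b'[' | b']') }

/// Right-to-left scan for the '@' of the last record whose header,
/// sequence, and '+' lines are all present in `buf`.
pub fn last_fastq_record_start(buf: &[u8]) -> Option<usize> {
    use State::*;
    let (mut state, mut restart, mut cut) = (Scanning, 0usize, 0usize);
    let mut i = buf.len();
    while i > 0 {
        i -= 1;
        let c = buf[i];
        // `None` stands for a mismatch, which triggers a backtrack.
        let next = match state {
            Scanning if c == b'+' => { restart = i; Some(FoundPlus) }
            Scanning => Some(Scanning),
            FoundPlus if is_eol(c) => Some(AfterNlPlus),
            AfterNlPlus if is_sep(c) => Some(AfterNlPlus),
            AfterNlPlus | InSequence if is_seq(c) => Some(InSequence),
            InSequence if is_eol(c) => Some(AfterSequence),
            AfterSequence if is_eol(c) => Some(AfterSequence),
            AfterSequence => Some(InHeader),
            InHeader if c == b'@' => { cut = i; Some(FoundAt) }
            InHeader if is_eol(c) => None,
            InHeader => Some(InHeader),
            FoundAt if is_eol(c) => return Some(cut),
            FoundAt => Some(InHeader),
            FoundPlus | AfterNlPlus | InSequence => None,
        };
        state = match next {
            Some(s) => s,
            None => {
                // Resume just left of the last '+' that was tried.
                i = restart;
                Scanning
            }
        };
    }
    None
}
```

Each backtrack resumes strictly to the left of the previous `restart`, so the scan always terminates. In this sketch a `+` at the start of a quality line is tried first, rejected because the byte sequence to its left is another `+` line and not a sequence line, and the scan then continues with the real separator.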