Files
obikmer/docmd/implementation/chunkreader.md
T
Eric Coissac 036d044291 refactor: update core types and add approximate evidence support
Refactor `Kmer`, `SuperKmer`, and chunk reader into optimized, generic representations with compile-time length parameters and bitwise operations. Update the pipeline and scheduler to support batch processing, 1→N flat transformations, and multi-source merging. Introduce an approximate evidence mode using b-bit fingerprints and `.idx` files, alongside existing exact mode. Update CLI documentation, minimizer selection, and query output schema accordingly.
2026-05-26 10:04:25 +02:00

92 lines
3.9 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Chunk reader — implementation
The `obiread` crate provides a streaming iterator that reads FASTA or FASTQ files in fixed-size blocks and yields self-contained chunks, each ending on a complete sequence record boundary. Chunks are consumed in parallel by downstream workers.
## Output type: Rope
Each chunk is a `Rope` — a segmented byte sequence: a `Vec` of blocks, where each block is a `Vec<Cell<u8>>`. The consumer iterates over the blocks via a forward or backward cursor.
`Rope::split_off(pos)` splits at an absolute byte offset in O(log n) (binary search over block-start index). If `pos` falls inside a block, that block is split in two via `Vec::split_off` — no `memcpy` in the common case.
## SeqChunkIter
```rust
pub struct SeqChunkIter<R: Read> { /* private */ }
impl<R: Read> Iterator for SeqChunkIter<R> {
type Item = io::Result<Rope>;
}
pub fn fasta_chunks<R: Read>(source: R) -> SeqChunkIter<R>
pub fn fastq_chunks<R: Read>(source: R) -> SeqChunkIter<R>
```
`next()` loop:
```text
1. read one block of block_size bytes → push onto Rope
2. call splitter(rope) → Option<abs_offset>
if Some(pos):
tail = rope.split_off(pos) ← O(log n), may split one block
chunk = mem::replace(&mut rope, tail)
return Some(Ok(chunk))
3. if EOF and rope non-empty: return Some(Ok(rope)) as final chunk
4. if EOF and rope empty: return None
```
The `Splitter` function signature is `fn(&Rope) -> Option<usize>`. It returns the absolute byte offset of the start of the last complete record, or `None` if no boundary was found in the accumulated rope (need more data).
## Boundary detection — FASTA
Backward scan with a 2-state machine. Searches (right to left) for `>` followed by `\n` or `\r` (i.e., a `>` that is preceded by a newline in forward order):
```mermaid
stateDiagram-v2
direction LR
[*] --> Scanning
Scanning --> FoundGt : '>'
FoundGt --> Scanning : other
FoundGt --> [*] : '\\n' / '\\r' ✓
```
Returns the byte offset of the `>` that starts the last complete record. Returns `None` if only one `>` is found (cannot confirm there is a prior complete record).
## Boundary detection — FASTQ
FASTQ records have a rigid 4-line structure (`@header`, sequence, `+`, quality). The `@` character (ASCII 64, Phred score 31) can appear legitimately in quality lines, making any forward heuristic unreliable. The backward scanner verifies the full structural context before accepting a candidate `@`.
7-state machine (states 06), scanning from **right to left**. Each time a `+` is found, its position is saved as `restart`; any state mismatch resets the scan to that position.
```mermaid
stateDiagram-v2
direction LR
[*] --> Scanning
Scanning --> FoundPlus : '+' (save restart)
FoundPlus --> AfterNlPlus : '\\n' / '\\r'
FoundPlus --> Scanning : other → backtrack
AfterNlPlus --> AfterNlPlus : séparateur
AfterNlPlus --> InSequence : lettre / - / . / [ / ]
AfterNlPlus --> Scanning : other → backtrack
InSequence --> AfterSequence : '\\n' / '\\r'
InSequence --> InSequence : lettre / - / . / [ / ]
InSequence --> Scanning : other → backtrack
AfterSequence --> AfterSequence : '\\n' / '\\r'
AfterSequence --> InHeader : other
InHeader --> FoundAt : '@' (save cut)
InHeader --> Scanning : '\\n' / '\\r' → backtrack
InHeader --> InHeader : other
FoundAt --> [*] : '\\n' / '\\r' ✓
FoundAt --> InHeader : other
```
`restart` is updated each time a `+` is found. When any state fails its expected input, the scan jumps back to `restart` and continues from there — guaranteeing that a `@` in a quality line cannot be accepted as a record start, because the `\n+\n` structure immediately following it (going backward) will not be found.
Returns the byte offset of the `@` that starts the last complete record.