# Chunk reader — implementation

`obiread` exposes two distinct sequence reading paths, each optimised for a different use case.

## Two reading paths

| Path | API | Output unit | Per-record identity | Use case |
|------|-----|-------------|---------------------|----------|
| **Record path** | `read_sequence_chunks` → `parse_chunk` | `SeqRecord` (id + raw sequence + normalised rope) | yes | `query`: must read complete records |
| **Stream path** | `open_nuc_stream` | `NucPage` (flat normalised byte buffer) | no | `index`, `superkmer`: bulk throughput |

The record path uses `Rope`-backed chunks and is described in detail below.
The stream path (`NucStream` / `NucPage`) is described in the scatter section of [pipeline](pipeline.md).

---

## Record path: chunk reader

The chunk reader reads FASTA or FASTQ files in fixed-size blocks and yields self-contained chunks, each ending on a complete sequence record boundary. `parse_chunk` then converts each chunk into a `Vec<SeqRecord>`, where each record carries its identifier, raw sequence bytes, and a normalised rope ready for superkmer building.

This path is mandatory for `query`, where superkmers must be traced back to their originating sequence (id, kmer offset) for output annotation.

## Output type: Rope

Each chunk is a `Rope`, a segmented byte sequence stored as a `Vec` of blocks, where each block is a `Vec<Cell<u8>>`. The consumer iterates over the blocks via a forward or backward cursor.

`Rope::split_off(pos)` splits at an absolute byte offset in O(log n) (binary search over the block-start index). If `pos` falls inside a block, that block is split in two via `Vec::split_off`, so at most the tail of that single block is copied; every other block is moved as-is.
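To make the split concrete, here is a minimal sketch of a block-segmented buffer with an offset index. `MiniRope` and its methods are hypothetical stand-ins, not the real `Rope` API: blocks hold plain `u8` instead of `Cell<u8>`, and the tail is rebuilt block by block for clarity rather than speed.

```rust
/// Hypothetical, simplified stand-in for the document's `Rope`.
/// Blocks store plain `u8` here; the real type is described as `Vec<Cell<u8>>`.
#[derive(Default)]
pub struct MiniRope {
    blocks: Vec<Vec<u8>>,
    /// starts[i] is the absolute offset of the first byte of blocks[i].
    starts: Vec<usize>,
    len: usize,
}

impl MiniRope {
    pub fn new() -> Self {
        Self::default()
    }

    /// Append one block; empty blocks are ignored so every index entry is useful.
    pub fn push(&mut self, block: Vec<u8>) {
        if block.is_empty() {
            return;
        }
        self.starts.push(self.len);
        self.len += block.len();
        self.blocks.push(block);
    }

    pub fn len(&self) -> usize {
        self.len
    }

    /// Keep bytes [0, pos) in `self` and return bytes [pos, len) as a new rope.
    pub fn split_off(&mut self, pos: usize) -> MiniRope {
        assert!(pos <= self.len, "split position out of range");
        // Binary search: index of the first block that starts at or after `pos`.
        let idx = self.starts.partition_point(|&s| s < pos);
        let mut tail_blocks = self.blocks.split_off(idx);
        if idx > 0 {
            // Block idx-1 starts before `pos`; split it if `pos` falls inside it.
            let local = pos - self.starts[idx - 1];
            if local < self.blocks[idx - 1].len() {
                let rest = self.blocks[idx - 1].split_off(local);
                tail_blocks.insert(0, rest);
            }
        }
        self.starts.truncate(idx);
        self.len = pos;
        let mut tail = MiniRope::new();
        for block in tail_blocks {
            tail.push(block);
        }
        tail
    }

    /// Flatten the rope into one contiguous buffer (for inspection only).
    pub fn to_vec(&self) -> Vec<u8> {
        self.blocks.concat()
    }
}
```

With three 4-byte blocks `ACGT`, `TTGA`, `CCAA`, calling `split_off(6)` cuts only the middle block: the head keeps `ACGT` + `TT`, and the tail receives `GA` + `CCAA`.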

## SeqChunkIter

```rust
pub struct SeqChunkIter<R: Read> { /* private */ }

impl<R: Read> Iterator for SeqChunkIter<R> {
    type Item = io::Result<Rope>;
}

pub fn fasta_chunks<R: Read>(source: R) -> SeqChunkIter<R>
pub fn fastq_chunks<R: Read>(source: R) -> SeqChunkIter<R>
```

`next()` loop:

```text
1. read one block of block_size bytes → push onto Rope
2. call splitter(rope) → Option<abs_offset>
   if Some(pos):
       tail  = rope.split_off(pos)   ← O(log n), may split one block
       chunk = mem::replace(&mut rope, tail)
       return Some(Ok(chunk))
   if None: go back to step 1 (more data is needed)
3. if EOF and rope non-empty: return Some(Ok(rope)) as final chunk
4. if EOF and rope empty: return None
```

The `Splitter` function signature is `fn(&Rope) -> Option<usize>`. It returns the absolute byte offset of the start of the last complete record, or `None` if no boundary was found in the accumulated rope (need more data).
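The loop and splitter contract described above can be sketched as follows. This is an illustration under stated assumptions, not the actual `SeqChunkIter`: it accumulates into a flat `Vec<u8>` instead of a `Rope`, and `ChunkIter`, the flat-buffer `Splitter` alias, and `last_record_start` are hypothetical names introduced for the example.

```rust
use std::io::{self, Read};
use std::mem;

/// Hypothetical flat-buffer variant of the splitter
/// (the document's signature is `fn(&Rope) -> Option<usize>`).
pub type Splitter = fn(&[u8]) -> Option<usize>;

/// Hypothetical chunk iterator mirroring the four steps of the `next()` loop.
pub struct ChunkIter<R: Read> {
    source: R,
    block_size: usize,
    splitter: Splitter,
    pending: Vec<u8>,
    eof: bool,
}

impl<R: Read> ChunkIter<R> {
    pub fn new(source: R, block_size: usize, splitter: Splitter) -> Self {
        assert!(block_size > 0, "block_size must be non-zero");
        ChunkIter { source, block_size, splitter, pending: Vec::new(), eof: false }
    }
}

impl<R: Read> Iterator for ChunkIter<R> {
    type Item = io::Result<Vec<u8>>;

    fn next(&mut self) -> Option<Self::Item> {
        let mut block = vec![0u8; self.block_size];
        while !self.eof {
            // Step 1: read up to one block and append it to the pending data.
            match self.source.read(&mut block) {
                Ok(0) => self.eof = true,
                Ok(n) => {
                    self.pending.extend_from_slice(&block[..n]);
                    // Step 2: ask the splitter for a cut point; an offset of 0
                    // is ignored here so that no empty chunk is emitted.
                    if let Some(pos) = (self.splitter)(&self.pending).filter(|&p| p > 0) {
                        let tail = self.pending.split_off(pos);
                        return Some(Ok(mem::replace(&mut self.pending, tail)));
                    }
                }
                Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
                Err(e) => return Some(Err(e)),
            }
        }
        // Steps 3 and 4: flush whatever is left at EOF, then stop.
        if self.pending.is_empty() {
            None
        } else {
            Some(Ok(mem::take(&mut self.pending)))
        }
    }
}

/// Toy splitter for the example: offset of the last `>` that follows a `\n`.
pub fn last_record_start(buf: &[u8]) -> Option<usize> {
    buf.windows(2).rposition(|w| w == b"\n>").map(|i| i + 1)
}
```

Because `&[u8]` implements `Read`, the iterator can be tried on an in-memory buffer, for example `ChunkIter::new(&b">a\nACGT\n>b\nTTGA\n"[..], 8, last_record_start)`. Concatenating the yielded chunks reproduces the input.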

## Boundary detection — FASTA

Backward scan with a 2-state machine. Scanning right to left, it searches for a `>` followed by `\n` or `\r` (i.e., a `>` that is preceded by a line break in forward order):

```mermaid
stateDiagram-v2
    direction LR
    [*] --> Scanning
    Scanning --> FoundGt : '>'
    FoundGt --> Scanning : other
    FoundGt --> [*] : '\\n' / '\\r' ✓
```

Returns the byte offset of the `>` that starts the last complete record. Returns `None` if only one `>` is found (cannot confirm there is a prior complete record).
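A minimal sketch of this backward scan over a flat byte slice is given below. The function name and the `&[u8]` input are assumptions made for the example (the real splitter takes a `&Rope`). One small liberty is taken with the diagram: a `>` seen while already in `FoundGt` is kept as the new candidate instead of falling back to `Scanning`.

```rust
#[derive(Clone, Copy)]
enum State {
    Scanning,
    FoundGt,
}

/// Hypothetical flat-buffer version of the FASTA boundary detector.
/// Returns the offset of the last `>` that sits right after a line break.
pub fn last_fasta_record_start(buf: &[u8]) -> Option<usize> {
    let mut state = State::Scanning;
    let mut candidate = 0usize;
    // Walk the buffer from its last byte down to its first.
    for (i, &byte) in buf.iter().enumerate().rev() {
        state = match (state, byte) {
            // Any '>' becomes the current candidate.
            (_, b'>') => {
                candidate = i;
                State::FoundGt
            }
            // The byte just before the candidate is a line break: accept.
            (State::FoundGt, b'\n') | (State::FoundGt, b'\r') => return Some(candidate),
            // Anything else (for example a '>' inside a header line): keep scanning.
            _ => State::Scanning,
        };
    }
    // A '>' at offset 0 has no preceding line break, so it is never accepted.
    None
}
```

For the buffer `>a\nACGT\n>b\nTT` the function returns `Some(8)`, the offset of `>b`. For a buffer holding a single record such as `>a\nACGT` it returns `None`, which matches the "need more data" case.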

## Boundary detection — FASTQ

FASTQ records have a rigid 4-line structure (`@header`, sequence, `+`, quality). The `@` character (ASCII 64, Phred score 31 in the Phred+33 encoding) can appear legitimately in quality lines, making any forward heuristic unreliable. The backward scanner verifies the full structural context before accepting a candidate `@`.
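The ambiguity of `@` is easy to reproduce. The sketch below assumes Phred+33 quality encoding; `phred33` and `naive_at_line_starts` are hypothetical helpers written for this illustration, and the naive function stands for the kind of forward heuristic that the paragraph above warns against.

```rust
/// Phred+33 score encoded by one quality byte, or `None` outside '!'..='~'.
pub fn phred33(byte: u8) -> Option<u8> {
    if (b'!'..=b'~').contains(&byte) {
        Some(byte - 33)
    } else {
        None
    }
}

/// Naive forward heuristic: every '@' found at the start of a line.
pub fn naive_at_line_starts(buf: &[u8]) -> Vec<usize> {
    buf.iter()
        .enumerate()
        .filter(|&(i, &b)| b == b'@' && (i == 0 || buf[i - 1] == b'\n'))
        .map(|(i, _)| i)
        .collect()
}
```

On the two-record input `@r1\nACGT\n+\n@III\n@r2\nTTGA\n+\nIIII\n`, the naive heuristic reports offsets 0, 11 and 16. Offsets 0 and 16 are real record starts; offset 11 is the first byte of the quality line of `r1`, whose first base simply has quality 31.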

7-state machine (states 0–6), scanning from **right to left**. Each time a `+` is found, its position is saved as `restart`; any state mismatch resets the scan to that position.

```mermaid
stateDiagram-v2
    direction LR

    [*] --> Scanning

    Scanning --> FoundPlus : '+' (save restart)
    FoundPlus --> AfterNlPlus : '\\n' / '\\r'
    FoundPlus --> Scanning : other → backtrack

    AfterNlPlus --> AfterNlPlus : separator
    AfterNlPlus --> InSequence : letter / - / . / [ / ]
    AfterNlPlus --> Scanning : other → backtrack

    InSequence --> AfterSequence : '\\n' / '\\r'
    InSequence --> InSequence : letter / - / . / [ / ]
    InSequence --> Scanning : other → backtrack

    AfterSequence --> AfterSequence : '\\n' / '\\r'
    AfterSequence --> InHeader : other

    InHeader --> FoundAt : '@' (save cut)
    InHeader --> Scanning : '\\n' / '\\r' → backtrack
    InHeader --> InHeader : other

    FoundAt --> [*] : '\\n' / '\\r' ✓
    FoundAt --> InHeader : other
```

`restart` is updated each time a `+` is found. When any state fails its expected input, the scan jumps back to `restart` and continues from there. This guarantees that a `@` in a quality line cannot be accepted as a record start, because the `\n+\n` structure immediately following it (going backward) will not be found.

Returns the byte offset of the `@` that starts the last complete record.
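The state diagram can be transcribed into code roughly as follows. This is a sketch under assumptions, not the actual implementation: it works on a flat `&[u8]` rather than a `Rope`, the function and helper names are hypothetical, "separator" is read as `\n` or `\r`, and "continues from there" is read as resuming in `Scanning` at the byte just left of `restart`.

```rust
#[derive(Clone, Copy)]
enum State {
    Scanning,
    FoundPlus,
    AfterNlPlus,
    InSequence,
    AfterSequence,
    InHeader,
    /// Carries the offset of the candidate '@' (the "cut").
    FoundAt(usize),
}

fn is_newline(b: u8) -> bool {
    b == b'\n' || b == b'\r'
}

fn is_seq_char(b: u8) -> bool {
    b.is_ascii_alphabetic() || matches!(b, b'-' | b'.' | b'[' | b']')
}

/// Hypothetical flat-buffer version of the backward FASTQ boundary detector.
pub fn last_fastq_record_start(buf: &[u8]) -> Option<usize> {
    let mut state = State::Scanning;
    let mut restart = 0usize;
    let mut i = buf.len();
    while i > 0 {
        i -= 1;
        let b = buf[i];
        // `None` encodes "mismatch, backtrack to `restart`".
        let next = match state {
            State::Scanning => {
                if b == b'+' {
                    restart = i;
                    Some(State::FoundPlus)
                } else {
                    Some(State::Scanning)
                }
            }
            State::FoundPlus => if is_newline(b) { Some(State::AfterNlPlus) } else { None },
            State::AfterNlPlus => {
                if is_newline(b) { Some(State::AfterNlPlus) }
                else if is_seq_char(b) { Some(State::InSequence) }
                else { None }
            }
            State::InSequence => {
                if is_newline(b) { Some(State::AfterSequence) }
                else if is_seq_char(b) { Some(State::InSequence) }
                else { None }
            }
            State::AfterSequence => {
                if is_newline(b) { Some(State::AfterSequence) } else { Some(State::InHeader) }
            }
            State::InHeader => {
                if b == b'@' { Some(State::FoundAt(i)) }
                else if is_newline(b) { None }
                else { Some(State::InHeader) }
            }
            State::FoundAt(cut) => {
                if is_newline(b) {
                    return Some(cut);
                }
                Some(State::InHeader)
            }
        };
        match next {
            Some(s) => state = s,
            None => {
                // Resume scanning just left of the '+' that opened this attempt.
                state = State::Scanning;
                i = restart;
            }
        }
    }
    None
}
```

On `@r1\nACGT\n+\n@III\n@r2\nTTGA\n+\nII` the scan accepts the `@` of `r2` at offset 16 and never considers the `@` that opens the quality line of `r1`. Each backtrack moves `restart` strictly to the left, so the loop terminates.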