Refactor `Kmer`, `SuperKmer`, and chunk reader into optimized, generic representations with compile-time length parameters and bitwise operations. Update the pipeline and scheduler to support batch processing, 1→N flat transformations, and multi-source merging. Introduce an approximate evidence mode using b-bit fingerprints and `.idx` files, alongside existing exact mode. Update CLI documentation, minimizer selection, and query output schema accordingly.
Chunk reader — implementation
The obiread crate provides a streaming iterator that reads FASTA or FASTQ files in fixed-size blocks and yields self-contained chunks, each ending on a complete sequence record boundary. Chunks are consumed in parallel by downstream workers.
Output type: Rope
Each chunk is a Rope — a segmented byte sequence: a Vec of blocks, where each block is a Vec<Cell<u8>>. The consumer iterates over the blocks via a forward or backward cursor.
Rope::split_off(pos) splits at an absolute byte offset in O(log n) (binary search over the block-start index). If pos falls inside a block, that block is split in two via Vec::split_off, which copies only the tail of that one block; every other block is moved to the new rope without copying its bytes.
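As an illustration of the split described above, here is a minimal sketch of a rope with a block-start index. It is not the obiread implementation: it assumes plain `Vec<u8>` blocks, and the field names `blocks`, `starts`, and `len` as well as the helpers `push` and `to_vec` are hypothetical.

```rust
/// Illustrative rope: byte blocks plus the absolute start offset of each block.
/// Assumption: blocks are plain Vec<u8>; all names here are hypothetical.
#[derive(Debug, Default)]
pub struct Rope {
    blocks: Vec<Vec<u8>>,
    starts: Vec<usize>, // starts[i] = absolute offset of blocks[i][0]
    len: usize,
}

impl Rope {
    pub fn new() -> Self {
        Self::default()
    }

    /// Appends a block; empty blocks are ignored so that starts stays strictly increasing.
    pub fn push(&mut self, block: Vec<u8>) {
        if block.is_empty() {
            return;
        }
        self.starts.push(self.len);
        self.len += block.len();
        self.blocks.push(block);
    }

    pub fn len(&self) -> usize {
        self.len
    }

    /// Keeps [0, pos) in `self` and returns [pos, len) as a new rope.
    pub fn split_off(&mut self, pos: usize) -> Rope {
        assert!(pos <= self.len, "split position out of bounds");
        // Binary search: index of the first block whose start offset is >= pos.
        let idx = self.starts.partition_point(|&s| s < pos);
        let mut tail = Rope::new();
        if idx > 0 {
            // pos may fall strictly inside the previous block: split that block only.
            let prev = idx - 1;
            let local = pos - self.starts[prev];
            if local < self.blocks[prev].len() {
                tail.push(self.blocks[prev].split_off(local));
            }
        }
        // Whole blocks are moved (Vec headers only), their bytes are not copied.
        for block in self.blocks.drain(idx..) {
            tail.push(block);
        }
        self.starts.truncate(idx);
        self.len = pos;
        tail
    }

    /// Flattens the rope, for inspection only.
    pub fn to_vec(&self) -> Vec<u8> {
        self.blocks.concat()
    }
}
```

Splitting a two-block rope holding `>a\nAC` and `GT\n>b\nTT` at offset 8 leaves `>a\nACGT\n` in the original rope and returns `>b\nTT` as the tail; only the second block is cut.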
SeqChunkIter
pub struct SeqChunkIter<R: Read> { /* private */ }
impl<R: Read> Iterator for SeqChunkIter<R> {
type Item = io::Result<Rope>;
}
pub fn fasta_chunks<R: Read>(source: R) -> SeqChunkIter<R>
pub fn fastq_chunks<R: Read>(source: R) -> SeqChunkIter<R>
next() loop:
  1. read one block of block_size bytes → push onto Rope
  2. call splitter(rope) → Option<abs_offset>
     if Some(pos):
       tail = rope.split_off(pos)   ← O(log n), may split one block
       chunk = mem::replace(&mut rope, tail)
       return Some(Ok(chunk))
  3. if EOF and rope non-empty: return Some(Ok(rope)) as final chunk
  4. if EOF and rope empty: return None
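The loop above can be sketched in Rust as follows. This is a simplified stand-in, not the crate's `SeqChunkIter`: a flat `Vec<u8>` replaces the Rope, and the names `ChunkIter`, `Splitter`, `block_size`, and `toy_fasta_splitter` are hypothetical.

```rust
use std::io::{self, Read};
use std::mem;

/// Hypothetical splitter over a flat buffer: offset at which the tail starts.
pub type Splitter = fn(&[u8]) -> Option<usize>;

pub struct ChunkIter<R: Read> {
    source: R,
    buf: Vec<u8>, // stands in for the Rope
    block_size: usize,
    splitter: Splitter,
    eof: bool,
}

impl<R: Read> ChunkIter<R> {
    pub fn new(source: R, block_size: usize, splitter: Splitter) -> Self {
        assert!(block_size > 0, "block_size must be positive");
        Self { source, buf: Vec::new(), block_size, splitter, eof: false }
    }
}

impl<R: Read> Iterator for ChunkIter<R> {
    type Item = io::Result<Vec<u8>>;

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            if self.eof {
                // Steps 3 and 4: emit what is left once, then stop.
                if self.buf.is_empty() {
                    return None;
                }
                return Some(Ok(mem::take(&mut self.buf)));
            }
            // Step 1: read one block and append it to the accumulated data.
            let mut block = vec![0u8; self.block_size];
            match self.source.read(&mut block) {
                Ok(0) => {
                    self.eof = true;
                    continue;
                }
                Ok(n) => self.buf.extend_from_slice(&block[..n]),
                Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
                Err(e) => return Some(Err(e)),
            }
            // Step 2: if a boundary exists, emit everything before it and keep the tail.
            match (self.splitter)(&self.buf) {
                Some(pos) if pos > 0 => {
                    let tail = self.buf.split_off(pos);
                    return Some(Ok(mem::replace(&mut self.buf, tail)));
                }
                _ => {} // no boundary yet: read another block
            }
        }
    }
}

/// Toy splitter for demonstration: last '>' that directly follows a '\n'.
pub fn toy_fasta_splitter(buf: &[u8]) -> Option<usize> {
    (1..buf.len()).rev().find(|&i| buf[i] == b'>' && buf[i - 1] == b'\n')
}
```

With a 4-byte block size, the input `>a\nACGT\n>b\nTT\n>c\nGG\n` is yielded as three chunks, each starting at a `>` and each holding whole records; the last one is flushed at EOF.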
The Splitter function signature is fn(&Rope) -> Option<usize>. It returns the absolute byte offset of the start of the last complete record, or None if no boundary was found in the accumulated rope (need more data).
Boundary detection — FASTA
Backward scan with a 2-state machine. Searches (right to left) for > followed by \n or \r (i.e., a > that is preceded by a newline in forward order):
stateDiagram-v2
direction LR
[*] --> Scanning
Scanning --> FoundGt : '>'
FoundGt --> Scanning : other
FoundGt --> [*] : '\\n' / '\\r' ✓
Returns the byte offset of the > that starts the last complete record. Returns None if only one > is found (cannot confirm there is a prior complete record).
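A minimal sketch of this backward scan over a flat byte slice, assuming the same two states as the diagram. The function name `fasta_boundary` is hypothetical, and the real splitter operates on a `Rope` rather than a slice.

```rust
#[derive(Clone, Copy)]
enum State {
    Scanning,
    FoundGt,
}

/// Scans right to left and returns the offset of the last '>' that
/// directly follows a line break, or None if there is no such '>'.
pub fn fasta_boundary(buf: &[u8]) -> Option<usize> {
    let mut state = State::Scanning;
    let mut candidate = 0usize;
    for (i, &b) in buf.iter().enumerate().rev() {
        state = match (state, b) {
            // The byte before the candidate '>' is a line break: accept it.
            (State::FoundGt, b'\n' | b'\r') => return Some(candidate),
            // Any '>' becomes the new candidate (also when already in FoundGt).
            (_, b'>') => {
                candidate = i;
                State::FoundGt
            }
            // Anything else: the previous candidate, if any, was inside a line.
            _ => State::Scanning,
        };
    }
    // A '>' at offset 0 has no preceding line break, so it is never accepted.
    None
}
```

For `>a\nACGT\n>b\nTT` the function returns `Some(8)`; for a buffer holding only `>a\nACGT` it returns `None`, and a `>` embedded in a header line is skipped.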
Boundary detection — FASTQ
FASTQ records have a rigid 4-line structure (@header, sequence, +, quality). The @ character (ASCII 64, Phred score 31 in the Phred+33 encoding) can legitimately appear in quality lines, including as their first character, which makes any forward heuristic unreliable. The backward scanner verifies the full structural context before accepting a candidate @.
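To make the ambiguity concrete, the sketch below decodes a Phred+33 quality byte and applies a naive forward heuristic ("every @ at the start of a line begins a record"). Both function names are hypothetical and exist only for this illustration.

```rust
/// Phred+33 score encoded by one ASCII quality byte (None below '!').
pub fn phred33(q: u8) -> Option<u8> {
    q.checked_sub(33)
}

/// Naive forward heuristic: offsets of every '@' found at the start of a line.
pub fn line_start_at_offsets(buf: &[u8]) -> Vec<usize> {
    let mut out = Vec::new();
    let mut at_line_start = true;
    for (i, &b) in buf.iter().enumerate() {
        if at_line_start && b == b'@' {
            out.push(i);
        }
        at_line_start = b == b'\n';
    }
    out
}
```

For the single record `@r1\nACGT\n+\n@III\n`, the heuristic reports offsets 0 and 11: the second hit is the first quality character, not a record start.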
7-state machine (states 0–6), scanning from right to left. Each time a + is found, its position is saved as restart; any state mismatch resets the scan to that position.
stateDiagram-v2
direction LR
[*] --> Scanning
Scanning --> FoundPlus : '+' (save restart)
FoundPlus --> AfterNlPlus : '\\n' / '\\r'
FoundPlus --> Scanning : other → backtrack
AfterNlPlus --> AfterNlPlus : separator
AfterNlPlus --> InSequence : letter / - / . / [ / ]
AfterNlPlus --> Scanning : other → backtrack
InSequence --> AfterSequence : '\\n' / '\\r'
InSequence --> InSequence : letter / - / . / [ / ]
InSequence --> Scanning : other → backtrack
AfterSequence --> AfterSequence : '\\n' / '\\r'
AfterSequence --> InHeader : other
InHeader --> FoundAt : '@' (save cut)
InHeader --> Scanning : '\\n' / '\\r' → backtrack
InHeader --> InHeader : other
FoundAt --> [*] : '\\n' / '\\r' ✓
FoundAt --> InHeader : other
restart is updated each time a + is found. When any state fails its expected input, the scan jumps back to restart and continues from there. A @ is accepted only after the scanner has matched, in backward order, a + line, a sequence line, and a header line that begins with that @ right after a line break. A @ in a quality line is never reached in that context, so it cannot be accepted as a record start.
Returns the byte offset of the @ that starts the last complete record.
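The state machine can be sketched over a flat byte slice as shown below. This is one reading of the diagram, not the crate's code: the names `fastq_boundary`, `S`, and `is_seq` are hypothetical, a failed attempt resumes in the Scanning state at the byte just left of the saved +, and a second @ seen in FoundAt is treated as a new candidate (a small addition to the diagram).

```rust
#[derive(Clone, Copy)]
enum S {
    Scanning,
    FoundPlus,
    AfterNlPlus,
    InSequence,
    AfterSequence,
    InHeader,
    FoundAt,
}

/// Characters accepted on a sequence line: letter / - / . / [ / ]
fn is_seq(b: u8) -> bool {
    b.is_ascii_alphabetic() || matches!(b, b'-' | b'.' | b'[' | b']')
}

/// Scans right to left; returns the offset of the last '@' that is confirmed
/// as a record start by the "+ line, sequence line, header line" context.
pub fn fastq_boundary(buf: &[u8]) -> Option<usize> {
    let mut state = S::Scanning;
    let mut restart = 0usize; // position of the '+' that opened the current attempt
    let mut cut = 0usize; // position of the candidate '@'
    let mut i = buf.len();
    while i > 0 {
        i -= 1;
        let b = buf[i];
        let nl = b == b'\n' || b == b'\r';
        // `None` means "mismatch: backtrack to restart".
        let next = match state {
            S::Scanning if b == b'+' => {
                restart = i;
                Some(S::FoundPlus)
            }
            S::Scanning => Some(S::Scanning),
            S::FoundPlus if nl => Some(S::AfterNlPlus),
            S::FoundPlus => None,
            S::AfterNlPlus if nl => Some(S::AfterNlPlus),
            S::AfterNlPlus | S::InSequence if is_seq(b) => Some(S::InSequence),
            S::AfterNlPlus => None,
            S::InSequence if nl => Some(S::AfterSequence),
            S::InSequence => None,
            S::AfterSequence if nl => Some(S::AfterSequence),
            S::AfterSequence => Some(S::InHeader),
            S::FoundAt if nl => return Some(cut), // '@' sits at the start of a line
            S::InHeader | S::FoundAt if b == b'@' => {
                cut = i;
                Some(S::FoundAt)
            }
            S::InHeader if nl => None,
            S::InHeader | S::FoundAt => Some(S::InHeader),
        };
        match next {
            Some(s) => state = s,
            None => {
                // Resume scanning just left of the '+' that started this attempt.
                i = restart;
                state = S::Scanning;
            }
        }
    }
    None
}
```

For a buffer with a complete record `@r1` followed by a second record `@r2` at offset 16, the function returns `Some(16)` even when the quality line of `@r2` starts with `@` or `+`; with a single record and no preceding line break it returns `None`.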