
# Chunk reader — implementation

`obiread` exposes two distinct sequence reading paths, each optimised for a different use case.

## Two reading paths

| Path | API | Output unit | Per-record identity | Use case |
|------|-----|-------------|---------------------|----------|
| **Record path** | `read_sequence_chunks` → `parse_chunk` | `SeqRecord` (id + raw sequence + normalised rope) | yes | `query` — must read complete records |
| **Stream path** | `open_nuc_stream` | `NucPage` (flat normalised byte buffer) | no | `index`, `superkmer` — bulk throughput |

The record path uses `Rope`-backed chunks and is described in detail below.
The stream path (`NucStream` / `NucPage`) is described in the scatter section of [pipeline](pipeline.md).

---

## Record path: chunk reader

The chunk reader reads FASTA or FASTQ files in fixed-size blocks and yields self-contained chunks, each ending on a complete sequence record boundary. `parse_chunk` then converts each chunk into a `Vec<SeqRecord>`, where each record carries its identifier, raw sequence bytes, and a normalised rope ready for superkmer building.

This path is mandatory for `query`, where each superkmer must be traced back to its originating sequence (id, kmer offset) for output annotation.

## Output type: Rope

Each chunk is a `Rope` — a segmented byte sequence: a `Vec` of blocks, where each block is a `Vec<Cell<u8>>`. The consumer iterates over the blocks via a forward or backward cursor.

`Rope::split_off(pos)` splits at an absolute byte offset in O(log n) (binary search over the block-start index). If `pos` falls inside a block, only that block is split in two via `Vec::split_off`, so at most the tail of one block is copied; every other block is moved to the new rope without copying its bytes.
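The split described above can be illustrated with a minimal sketch. `SketchRope` is a hypothetical, simplified stand-in and not the crate's `Rope`: it assumes plain `Vec<u8>` blocks (rather than `Vec<Cell<u8>>`) and a `starts` vector playing the role of the block-start index.

```rust
/// Hypothetical simplified rope: byte blocks plus the absolute start offset
/// of every block (the "block-start index"). Not the crate's actual `Rope`.
#[derive(Debug, Default)]
pub struct SketchRope {
    blocks: Vec<Vec<u8>>,
    starts: Vec<usize>, // starts[i] = absolute offset of blocks[i][0]
    len: usize,
}

impl SketchRope {
    /// Append one block; empty blocks are ignored.
    pub fn push(&mut self, block: Vec<u8>) {
        if block.is_empty() {
            return;
        }
        self.starts.push(self.len);
        self.len += block.len();
        self.blocks.push(block);
    }

    pub fn len(&self) -> usize {
        self.len
    }

    /// Flatten the rope (used only to inspect the result).
    pub fn to_vec(&self) -> Vec<u8> {
        self.blocks.concat()
    }

    /// Keep bytes [0, pos) in `self`; return bytes [pos, len) as a new rope.
    pub fn split_off(&mut self, pos: usize) -> SketchRope {
        assert!(pos <= self.len, "split position out of bounds");
        let mut tail = SketchRope::default();
        // Binary search: number of blocks that start at or before `pos`.
        let n = self.starts.partition_point(|&start| start <= pos);
        if n == 0 {
            return tail; // empty rope, nothing to split
        }
        let b = n - 1; // block that contains `pos`
        let within = pos - self.starts[b];
        // Blocks after `b` are moved as whole vectors; their bytes are untouched.
        let moved = self.blocks.split_off(b + 1);
        // Only block `b` is cut; its bytes from `within` onward go to the tail.
        let cut = self.blocks[b].split_off(within);
        if within == 0 {
            self.blocks.pop(); // block `b` moved entirely to the tail
        }
        self.starts.truncate(self.blocks.len());
        self.len = pos;
        tail.push(cut);
        for block in moved {
            tail.push(block);
        }
        tail
    }
}
```

In this sketch the head keeps its existing block-start entries, and the tail rebuilds its own index as blocks are pushed; a real implementation may organise this differently.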

## SeqChunkIter

```rust
pub struct SeqChunkIter<R: Read> { /* private */ }

impl<R: Read> Iterator for SeqChunkIter<R> {
    type Item = io::Result<Rope>;
}

pub fn fasta_chunks<R: Read>(source: R) -> SeqChunkIter<R>
pub fn fastq_chunks<R: Read>(source: R) -> SeqChunkIter<R>
```

`next()` loop:

```text
1. read one block of block_size bytes → push onto Rope
2. call splitter(rope) → Option<abs_offset>
   if Some(pos):
       tail = rope.split_off(pos)    ← O(log n), may split one block
       chunk = mem::replace(&mut rope, tail)
       return Some(Ok(chunk))
   if None: go back to step 1
3. if EOF and rope non-empty: return Some(Ok(rope)) as final chunk
4. if EOF and rope empty: return None
```

The `Splitter` function signature is `fn(&Rope) -> Option<usize>`. It returns the absolute byte offset of the start of the last complete record, or `None` if no boundary was found in the accumulated rope (need more data).
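The loop can be sketched as follows, under simplifying assumptions: the accumulator is a flat `Vec<u8>` instead of a `Rope`, so the split is a plain `Vec::split_off`, and the names `SketchChunkIter`, `SketchSplitter`, and `toy_splitter` are hypothetical and not part of the `obiread` API.

```rust
use std::io::{self, Read};
use std::mem;

/// Hypothetical stand-in for `Splitter`: works on a flat buffer, not a `Rope`.
pub type SketchSplitter = fn(&[u8]) -> Option<usize>;

/// Toy splitter for the example: offset of the last `>` that follows `\n`.
pub fn toy_splitter(buf: &[u8]) -> Option<usize> {
    buf.windows(2).rposition(|w| w == b"\n>").map(|i| i + 1)
}

pub struct SketchChunkIter<R: Read> {
    source: R,
    buffer: Vec<u8>,
    block_size: usize,
    splitter: SketchSplitter,
    eof: bool,
}

impl<R: Read> SketchChunkIter<R> {
    pub fn new(source: R, block_size: usize, splitter: SketchSplitter) -> Self {
        assert!(block_size > 0, "block_size must be non-zero");
        Self { source, buffer: Vec::new(), block_size, splitter, eof: false }
    }
}

impl<R: Read> Iterator for SketchChunkIter<R> {
    type Item = io::Result<Vec<u8>>;

    fn next(&mut self) -> Option<Self::Item> {
        while !self.eof {
            // Step 1: read up to one block and append it to the accumulator.
            let mut block = vec![0u8; self.block_size];
            match self.source.read(&mut block) {
                Ok(0) => self.eof = true,
                Ok(n) => {
                    self.buffer.extend_from_slice(&block[..n]);
                    // Step 2: ask the splitter for a record boundary.
                    match (self.splitter)(&self.buffer) {
                        Some(pos) if pos > 0 => {
                            let tail = self.buffer.split_off(pos);
                            return Some(Ok(mem::replace(&mut self.buffer, tail)));
                        }
                        _ => {} // no usable boundary yet: read another block
                    }
                }
                Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
                Err(e) => return Some(Err(e)),
            }
        }
        // Steps 3 and 4: flush whatever is left, then stop.
        if self.buffer.is_empty() {
            None
        } else {
            Some(Ok(mem::take(&mut self.buffer)))
        }
    }
}
```

Concatenating the yielded chunks reproduces the input byte for byte, and every chunk starts at a record header as long as the splitter only reports header offsets.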

## Boundary detection — FASTA

Backward scan with a 2-state machine. Searches (right to left) for `>` followed by `\n` or `\r` (i.e., a `>` that is preceded by a newline in forward order):

```mermaid
stateDiagram-v2
    direction LR
    [*]      --> Scanning
    Scanning --> FoundGt  : '>'
    FoundGt  --> Scanning : other
    FoundGt  --> [*]      : '\\n' / '\\r' ✓
```

Returns the byte offset of the `>` that starts the last complete record. Returns `None` if only one `>` is found (cannot confirm there is a prior complete record).
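A minimal sketch of this backward scan, assuming a flat byte slice instead of a `Rope` cursor; the function name `fasta_boundary` is hypothetical. Unlike the diagram, a `>` seen while already in `FoundGt` re-arms the candidate rather than dropping back to `Scanning`.

```rust
/// Hypothetical sketch: offset of the last `>` that directly follows a line
/// break, found by scanning the buffer from right to left.
pub fn fasta_boundary(buf: &[u8]) -> Option<usize> {
    #[derive(Clone, Copy)]
    enum State {
        Scanning,
        FoundGt(usize), // offset of the candidate `>`
    }

    let mut state = State::Scanning;
    for (i, &byte) in buf.iter().enumerate().rev() {
        state = match (state, byte) {
            // The byte to the left of the candidate is a line break: accept.
            (State::FoundGt(pos), b'\n' | b'\r') => return Some(pos),
            // Any `>` becomes the new candidate.
            (_, b'>') => State::FoundGt(i),
            // Anything else: the candidate (if any) was inside a line.
            _ => State::Scanning,
        };
    }
    // A `>` at offset 0 has no preceding line break, so a buffer holding a
    // single record yields no boundary.
    None
}
```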

## Boundary detection — FASTQ

FASTQ records have a rigid 4-line structure (`@header`, sequence, `+`, quality). The `@` character (ASCII 64, Phred score 31) can appear legitimately in quality lines, making any forward heuristic unreliable. The backward scanner verifies the full structural context before accepting a candidate `@`.

7-state machine (states 0–6), scanning from **right to left**. Each time a `+` is found, its position is saved as `restart`; any state mismatch resets the scan to that position.

```mermaid
stateDiagram-v2
    direction LR

    [*]          --> Scanning

    Scanning     --> FoundPlus    : '+' (save restart)
    FoundPlus    --> AfterNlPlus  : '\\n' / '\\r'
    FoundPlus    --> Scanning     : other → backtrack

    AfterNlPlus  --> AfterNlPlus  : separator
    AfterNlPlus  --> InSequence   : letter / - / . / [ / ]
    AfterNlPlus  --> Scanning     : other → backtrack

    InSequence   --> AfterSequence : '\\n' / '\\r'
    InSequence   --> InSequence    : letter / - / . / [ / ]
    InSequence   --> Scanning      : other → backtrack

    AfterSequence --> AfterSequence : '\\n' / '\\r'
    AfterSequence --> InHeader      : other

    InHeader     --> FoundAt    : '@' (save cut)
    InHeader     --> Scanning   : '\\n' / '\\r' → backtrack
    InHeader     --> InHeader   : other

    FoundAt      --> [*]       : '\\n' / '\\r' ✓
    FoundAt      --> InHeader  : other
```

`restart` is updated each time a `+` is found. When any state fails its expected input, the scan jumps back to `restart` and continues from there. An `@` inside a quality line therefore cannot be accepted as a record start: a candidate `@` is only considered once the scanner has already matched, to its right, a header line, a sequence line, and a `+` at the start of the following line.

Returns the byte offset of the `@` that starts the last complete record.
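The state machine can be sketched on a flat byte slice as below. This is an illustration under assumptions, not the crate's code: the name `fastq_boundary` is hypothetical, `restart` is recorded only when a `+` is met in the `Scanning` state, a backtrack resumes one byte to the left of the rejected `+`, and an `@` at offset 0 is not reported because there is no earlier record to cut off.

```rust
/// Hypothetical sketch of the backward FASTQ scanner over a flat buffer.
pub fn fastq_boundary(buf: &[u8]) -> Option<usize> {
    #[derive(Clone, Copy)]
    enum State {
        Scanning,
        FoundPlus,
        AfterNlPlus,
        InSequence,
        AfterSequence,
        InHeader,
        FoundAt,
    }

    let is_nl = |b: u8| b == b'\n' || b == b'\r';
    let is_seq = |b: u8| b.is_ascii_alphabetic() || matches!(b, b'-' | b'.' | b'[' | b']');

    let mut state = State::Scanning;
    let mut restart = 0usize; // offset of the `+` that opened the current attempt
    let mut cut = 0usize; // offset of the candidate `@`
    let mut i = buf.len();

    while i > 0 {
        i -= 1;
        let b = buf[i];
        // `None` means "mismatch: backtrack to `restart`".
        let next = match state {
            State::Scanning if b == b'+' => {
                restart = i;
                Some(State::FoundPlus)
            }
            State::Scanning => Some(State::Scanning),
            State::FoundPlus if is_nl(b) => Some(State::AfterNlPlus),
            State::AfterNlPlus if is_nl(b) => Some(State::AfterNlPlus),
            State::AfterNlPlus | State::InSequence if is_seq(b) => Some(State::InSequence),
            State::InSequence if is_nl(b) => Some(State::AfterSequence),
            State::AfterSequence if is_nl(b) => Some(State::AfterSequence),
            State::AfterSequence => Some(State::InHeader),
            State::InHeader if b == b'@' => {
                cut = i;
                Some(State::FoundAt)
            }
            State::InHeader if is_nl(b) => None,
            State::InHeader => Some(State::InHeader),
            // `@` preceded by a line break: the full structure is confirmed.
            State::FoundAt if is_nl(b) => return Some(cut),
            State::FoundAt => Some(State::InHeader),
            _ => None, // FoundPlus / AfterNlPlus / InSequence mismatch
        };
        match next {
            Some(s) => state = s,
            None => {
                // Resume scanning just left of the rejected `+`.
                i = restart;
                state = State::Scanning;
            }
        }
    }
    None
}
```

Because each backtrack resumes strictly to the left of the `+` that was just rejected, the candidate positions decrease monotonically and the scan always terminates.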