refactor: update core types and add approximate evidence support

Refactor `Kmer`, `SuperKmer`, and the chunk reader into optimized, generic representations with compile-time length parameters and bitwise operations. Update the pipeline and scheduler to support batch processing, 1→N flat transformations, and multi-source merging. Introduce an approximate evidence mode using b-bit fingerprints and `.idx` files, alongside the existing exact mode. Update CLI documentation, minimizer selection, and query output schema accordingly.
2026-05-26 09:12:41 +02:00
parent 88365e444c
commit 036d044291
13 changed files with 488 additions and 216 deletions
@@ -2,19 +2,11 @@

 The `obiread` crate provides a streaming iterator that reads FASTA or FASTQ files in fixed-size blocks and yields self-contained chunks, each ending on a complete sequence record boundary. Chunks are consumed in parallel by downstream workers.

-## Output type: rope
+## Output type: Rope

-Each chunk is a `Vec<Bytes>` — a **rope**: a list of reference-counted byte slices that are not necessarily contiguous in memory. The consumer iterates over the slices in order.
+Each chunk is a `Rope` — a segmented byte sequence: a `Vec` of blocks, where each block is a `Vec<Cell<u8>>`. The consumer iterates over the blocks via a forward or backward cursor.

-Using `bytes::Bytes` means the split at the record boundary is O(1): `Bytes::split_to(n)` adjusts a reference counter, not memory. No `memcpy` in the common case.
-
-## Allocation policy
-
-| Case | Cost |
-|------|------|
-| Boundary found in the current block (common) | zero extra allocation — `split_to` only |
-| Boundary straddles multiple blocks (sequence > block size, rare) | one allocation to pack the rope into a flat buffer |
-| EOF flush | zero extra allocation |
+`Rope::split_off(pos)` splits at an absolute byte offset in O(log n) (binary search over block-start index). If `pos` falls inside a block, that block is split in two via `Vec::split_off` — no `memcpy` in the common case.

 ## SeqChunkIter

@@ -22,7 +14,7 @@ Using `bytes::Bytes` means the split at the record boundary is O(1): `Bytes::spl
 pub struct SeqChunkIter<R: Read> { /* private */ }

 impl<R: Read> Iterator for SeqChunkIter<R> {
-    type Item = io::Result<Vec<Bytes>>;
+    type Item = io::Result<Rope>;
 }

 pub fn fasta_chunks<R: Read>(source: R) -> SeqChunkIter<R>
@@ -32,22 +24,21 @@ pub fn fastq_chunks<R: Read>(source: R) -> SeqChunkIter<R>
 `next()` loop:

 ```text
-1. read one block of block_size bytes → push onto rope
-2. probe check: if the boundary marker ("\n>" or "\n@") is absent from the
-   last block, skip the splitter (avoids a full backward scan for nothing)
-3. call splitter on last block
-   if found at offset n:
-       remainder = last_block.split_to(n)    ← O(1), zero copy
-       return std::mem::take(&mut self.rope)  ← the chunk
-4. if rope.len() > 1 (multi-block accumulation):
-       pack rope into one flat buffer         ← one alloc
-       retry splitter on flat buffer
-5. if EOF: flush remaining rope as final chunk
+1. read one block of block_size bytes → push onto Rope
+2. call splitter(rope) → Option<abs_offset>
+   if Some(pos):
+       tail = rope.split_off(pos)    ← O(log n), may split one block
+       chunk = mem::replace(&mut rope, tail)
+       return Some(Ok(chunk))
+3. if EOF and rope non-empty: return Some(Ok(rope)) as final chunk
+4. if EOF and rope empty: return None
 ```

+The `Splitter` function signature is `fn(&Rope) -> Option<usize>`. It returns the absolute byte offset of the start of the last complete record, or `None` if no boundary was found in the accumulated rope (need more data).
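
The loop above can be sketched in a few lines of Rust. This is an illustration only, not the crate's code: it swaps `Rope` for a flat `Vec<u8>` (so the split is a plain `Vec::split_off`, not the O(log n) rope split), and the names `FlatChunkIter`, `buf`, `block_size`, `splitter`, and `eof` are hypothetical.

```rust
use std::io::{self, Read};

/// Hypothetical stand-in for `SeqChunkIter`: a flat `Vec<u8>` replaces `Rope`.
struct FlatChunkIter<R: Read> {
    source: R,
    buf: Vec<u8>,                         // bytes accumulated so far
    block_size: usize,                    // bytes requested per read
    splitter: fn(&[u8]) -> Option<usize>, // same role as `Splitter`, on a slice
    eof: bool,
}

impl<R: Read> Iterator for FlatChunkIter<R> {
    type Item = io::Result<Vec<u8>>;

    fn next(&mut self) -> Option<Self::Item> {
        while !self.eof {
            // Step 1: read one block and append it to the accumulated bytes.
            let old_len = self.buf.len();
            self.buf.resize(old_len + self.block_size, 0);
            match self.source.read(&mut self.buf[old_len..]) {
                Ok(0) => {
                    self.buf.truncate(old_len);
                    self.eof = true;
                }
                Ok(n) => {
                    self.buf.truncate(old_len + n);
                    // Step 2: ask the splitter for a boundary offset.
                    if let Some(pos) = (self.splitter)(&self.buf) {
                        let tail = self.buf.split_off(pos);
                        let chunk = std::mem::replace(&mut self.buf, tail);
                        return Some(Ok(chunk));
                    }
                    // No boundary yet: loop and read another block.
                }
                Err(e) => {
                    self.buf.truncate(old_len);
                    return Some(Err(e));
                }
            }
        }
        // Steps 3 and 4: flush whatever is left, then stop.
        if self.buf.is_empty() {
            None
        } else {
            Some(Ok(std::mem::take(&mut self.buf)))
        }
    }
}
```

Concatenating the yielded chunks reproduces the input byte for byte, whatever splitter is supplied; the splitter only decides where the cuts fall.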
+
 ## Boundary detection — FASTA

-Backward scan with a 2-state machine. Searches for `>` immediately preceded by `\n` or `\r`:
+Backward scan with a 2-state machine. Searches (right to left) for `>` followed by `\n` or `\r` (i.e., a `>` that is preceded by a newline in forward order):

 ```mermaid
 stateDiagram-v2
@@ -58,13 +49,13 @@ stateDiagram-v2
    FoundGt  --> [*]      : '\\n' / '\\r' ✓
 ```

-Returns the byte offset of the `>` that starts the last complete record.
+Returns the byte offset of the `>` that starts the last complete record. Returns `None` if only one `>` is found (cannot confirm there is a prior complete record).
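
A minimal sketch of this backward scan, assuming a flat byte slice in place of a `Rope` cursor. The function name `last_fasta_boundary` is hypothetical, and the two states are folded into a single boolean.

```rust
/// Hypothetical helper: scans `buf` from right to left for a `>` that is
/// preceded (in forward order) by `\n` or `\r`, and returns the offset of that `>`.
fn last_fasta_boundary(buf: &[u8]) -> Option<usize> {
    // `found_gt` plays the role of the FoundGt state: the byte just visited
    // (one position to the right) was `>`.
    let mut found_gt = false;
    for i in (0..buf.len()).rev() {
        let b = buf[i];
        if found_gt && (b == b'\n' || b == b'\r') {
            // The `>` sits at i + 1 and starts a record on a fresh line.
            return Some(i + 1);
        }
        found_gt = b == b'>';
    }
    // Either no `>` at all, or the only `>` is at offset 0 with no newline
    // before it, so no earlier complete record can be confirmed.
    None
}
```

With this sketch, a buffer holding a single record that starts at offset 0 yields `None`, while a buffer holding two records yields the offset of the second `>`.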

 ## Boundary detection — FASTQ

 FASTQ records have a rigid 4-line structure (`@header`, sequence, `+`, quality). The `@` character (ASCII 64, Phred score 31) can appear legitimately in quality lines, making any forward heuristic unreliable. The backward scanner verifies the full structural context before accepting a candidate `@`.

-7-state machine (port of Go's `EndOfLastFastqEntry`), scanning from **right to left**. Each time a `+` is found, its position is saved as `restart`; any state mismatch resets the scan to that position.
+7-state machine (states 0–6), scanning from **right to left**. Each time a `+` is found, its position is saved as `restart`; any state mismatch resets the scan to that position.

 ```mermaid
 stateDiagram-v2