diff --git a/docmd/implementation/evidence_elimination.md b/docmd/implementation/evidence_elimination.md new file mode 100644 index 0000000..ecc2c6f --- /dev/null +++ b/docmd/implementation/evidence_elimination.md @@ -0,0 +1,321 @@ +# Evidence elimination — design discussion + +## Problem statement + +`evidence.bin` maps each MPHF slot to a position in the unitig store so that +query verification is possible: given a slot `s` returned by `mphf.index(kmer)`, +retrieve the k-mer stored at that position and compare with the query. + +On the bacterial BCT dataset (2048 partitions, k=31, ~33 M k-mers/partition): + +| file | size/partition | total (2048 parts) | fraction of lookup layer | +|---|---|---|---| +| evidence.bin | 132 MB | ~270 GB | **66 %** | +| unitigs.bin | 58 MB | ~118 GB | 29 % | +| mphf.bin | 10 MB | ~20 GB | 5 % | + +Evidence dominates. Eliminating or drastically shrinking it is the highest-leverage +optimisation available for index size. + +--- + +## Why evidence exists + +PtrHash (like all standard MPHFs) maps **any** input to a valid slot in `[0, n)`. +For a query k-mer not in the indexed set, the returned slot is meaningless but +indistinguishable from a real hit without external information. Evidence provides +that information: `evidence[s]` encodes the location of the k-mer that legitimately +occupies slot `s`, allowing the verification: + +``` +slot = mphf.index(query) +(chunk_id, rank) = evidence.decode(slot) +stored_kmer = unitigs.kmer_at(chunk_id, rank) +return canonical(stored_kmer) == canonical(query) +``` + +Evidence is a **permutation** from MPHF-space to unitig-position-space. +Storing it costs at minimum log₂(n_kmers) bits per slot — irrespective of encoding. + +--- + +## Information-theoretic lower bound + +For a partition with P k-mers and U unitigs of average length m_u k-mers: + +- global k-mer index range: [0, P) → ⌈log₂ P⌉ bits +- (chunk_id, rank) pair: ⌈log₂ U⌉ + ⌈log₂ L_max⌉ bits + +Current implementation: 25 + 7 = 32 bits (aligned u32). +Theoretical minimum: ⌈log₂ P⌉ ≈ 25 bits for P ≈ 33 M. + +**Packing headroom: ~22 %.** Not a path to elimination. + +--- + +## Two-index architecture + +The exact index is mandatory for set operations (union, intersection, diff) and +exact k-mer retrieval. A separate approximate index, built for query operations, +can tolerate a controlled false positive rate in exchange for a much smaller +footprint. + +| component | exact index | approximate index | +|---|---|---| +| `mphf.bin` | ✓ | ✓ (same structure) | +| `evidence.bin` | ✓ (32 bits/k-mer) | ✗ | +| `fingerprint.bin` | ✗ | ✓ (B bits/k-mer) | +| `unitigs.bin` | ✓ | ✓ (K-mer enumeration) | +| `unitigs.bin.idx` | ✓ | ✗ (random access not needed) | + +The approximate index drops `evidence.bin` and `unitigs.bin.idx`; it keeps +`unitigs.bin` for sequential enumeration of K-mers. + +--- + +## MPHF as a perfect Bloom filter + +A standard Bloom filter with a single hash function and N bits storing M keys has +occupancy M/N. For a foreign query k-mer, P(FP) = M/N — the probability of +landing on a set bit. The empty space (fraction 1 − M/N of bits at 0) is what +rejects foreign k-mers. + +An MPHF is a Bloom filter with **zero internal collisions**: every indexed k-mer +occupies its own unique slot. But unlike a Bloom filter, the MPHF maps **any** +input to a slot in [0, M) — there is no empty space. Every query lands on an +occupied slot. The MPHF alone cannot reject foreign k-mers at all. + +Adding a B-bit fingerprint restores the discrimination: + +``` +slot = mphf.index(query) +fingerprint = hash(query) & mask_B +present = fingerprint_table[slot] == fingerprint +``` + +The fingerprint plays the role of the sparse space in the Bloom filter: it provides +the B bits of information needed to reject foreign k-mers. + +Both structures reach the same fundamental cost for a given FP rate. For 1% FP: + +- Bloom filter (optimal, k hash functions): ~9.6 bits/key +- MPHF (~3 bits/key) + fingerprint (7 bits/key): ~10 bits/key + +This is a fundamental bound, not an implementation detail. + +--- + +## Approach A — MPHF + fingerprint (approximate index) + +### Size + +| B (bits) | fingerprint.bin/partition | vs evidence.bin (32 bits) | +|---|---|---| +| 8 | 33 MB | 4× smaller | +| 12 | 49 MB | 2.7× smaller | +| 16 | 66 MB | 2× smaller | + +Total approximate index per partition at B=8: ~43 MB (vs ~142 MB for exact lookup layer). + +### False positive rate — per k-mer query + +For a specific non-indexed query k-mer q: + +1. MPHF(q) → slot s, some value in [0, M) +2. fingerprint_table[s] holds the B-bit fingerprint of the legitimate k-mer at s +3. FP event: hash(q) & mask_B == fingerprint_table[s] + +Since q is not the legitimate k-mer at s, its fingerprint is independent of +fingerprint_table[s], giving: + +``` +P(FP per k-mer) = 1 / 2^B +``` + +This is the probability of error **for one specific query k-mer**. It is not the +fraction of the k-mer universe that would be misclassified: querying all 4^k +possible k-mers would yield (4^k − M)/2^B false positives in absolute terms, but +that is not the relevant quantity for practical use. + +### Equivalence classes + +The MPHF + fingerprint partitions the universe of 4^k k-mers into M·2^B +equivalence classes of average size 4^k/(M·2^B). Each class contains 1 true +indexed k-mer and 4^k/(M·2^B) − 1 false positives. A larger M (fewer partitions) +produces smaller classes — finer discrimination in k-mer space — while P(FP) = 1/2^B +remains constant. + +### Read-level use case + +The relevant decision unit is the **read**, not the individual k-mer. For a read +of ~100 nucleotides and k=31, there are ~70 k-mers. + +- A bacterial read queried against a bacterial index: nearly all ~70 k-mers are + true positives → high coverage fraction. +- A plant read queried against a bacterial index: k-mers are foreign; each has + P(FP) = 1/2^B independently → expected coverage fraction ≈ 1/2^B. + +A coverage threshold separates the two cases decisively. This is the same +principle as Findere: local coverage continuity distinguishes true hits from noise. + +--- + +## Approach B — z-consecutive k-mer matching + +A query for a K-mer of size K = k + z − 1 decomposes into z overlapping k-mers. +Declaring a match only when **all z are present** reduces the per-window FP rate: + +``` +P(FP per window of z) = (1/2^B)^z = 1/2^(B·z) +``` + +For a read with ~70 k-mers, there are ~70 − z + 1 independent windows of size z. +The probability that at least one window is a false positive: + +``` +P(FP_read) = 1 - (1 - 1/2^(B·z))^(70-z+1) ≈ (70-z+1) / 2^(B·z) +``` + +For B=8, z=4: P(FP_read) ≈ 67 / 2^32 ≈ 1.6×10⁻⁸. + +A plant read is misclassified as bacterial roughly once in 60 million reads — +negligible for any practical dataset. + +### Choosing B from (z, L, P_target) + +z is a query-time parameter and does not affect the index structure. However, +knowing z at build time allows computing the minimum B required to reach a target +FP rate P_target for reads of length L (giving W = L − k − z + 2 independent +windows): + +``` +P_target ≈ W / 2^(B·z) → B = ceil( (log2(W) - log2(P_target)) / z ) +``` + +Example: L=100, k=31, z=4, P_target=10⁻⁸ → W=67, B = ceil((6.07 + 26.6) / 4) = ceil(8.17) = **9 bits**. + +(B, z) are co-determined at build time to minimise fingerprint size while +guaranteeing the target read-level FP rate. + +### Combined sizing + +| B | z | K = k+z−1 | P(FP_read) | fingerprint.bin/partition | +|---|---|---|---|---| +| 8 | 2 | 32 | ~67/2^16 ≈ 10⁻³ | 33 MB | +| 8 | 4 | 34 | ~67/2^32 ≈ 10⁻⁸ | 33 MB | +| 4 | 4 | 34 | ~67/2^16 ≈ 10⁻³ | 16 MB | +| 4 | 8 | 38 | ~63/2^32 ≈ 10⁻⁸ | 16 MB | + +Smaller B → smaller fingerprint table; larger z → longer minimum match length K +and fewer independent windows per read. + +--- + +## Approach 1 — value-based MPHF (eliminates evidence.bin from exact index) + +Build the MPHF to output the global k-mer position directly: + +``` +mphf: kmer → global_pos ∈ [0, P) +``` + +Verification becomes: + +``` +global_pos = mphf.index(query) +stored_kmer = unitigs.kmer_at_global_pos(global_pos) +return canonical(stored_kmer) == canonical(query) +``` + +No evidence array. The unitig block index (see below) provides +`kmer_at_global_pos` in O(log(n_blocks) + BLOCK_SIZE) time. + +### What is required + +A **retrieval data structure** (also called a value-based or function-based MPHF): +given a set of (key, value) pairs with distinct keys and bijective values in `[0, n)`, +build a compact structure that maps each key to its assigned value. + +Known constructions: + +- **GOV / GBF (Generalized Bloomier Filter)**: random 3-uniform hypergraph + + XOR-based assignment. ~2.3 bits/key overhead over the information-theoretic + minimum. Construction: O(n). Query: O(1). +- **SSHash approach**: builds the MPHF to map k-mers to their positions in a + concatenated unitig string. Achieves elimination of external evidence using a + "skew" construction that aligns the MPHF output with the sequential unitig layout. + +### Rust availability + +No Rust crate implements a retrieval data structure suitable for this use case as +of 2025. The `ph`, `boomphf`, `fmphf`, and `ptr_hash` crates all build plain +MPHFs. **This is the key blocking factor.** + +### SSHash construction (reference) + +SSHash (Pibiri 2022, doi:10.1186/s13015-022-00216-6) constructs the MPHF over +(minimizer, position-within-minimizer-bucket) pairs, aligning slots with sequential +positions in the concatenated unitig string. A port to obikmer would require: + +1. Concatenating all unitig sequences into a single flat buffer per partition. +2. Assigning each k-mer a global position (its offset in that buffer). +3. Building the MPHF to output that position directly (retrieval step). +4. Replacing `evidence.bin` with a small prefix-sum index for `kmer_at_global_pos`. + +--- + +## Approach 2 — block index prefix sums (reduces evidence to negligible) + +A prerequisite already implemented: `unitigs.bin.idx` now uses a **block-sampled +offset index** (one `u32` per `BLOCK_SIZE=64` chunks) instead of a per-chunk offset +table. + +### Extension: k-mer prefix sums per block + +Add a second array to `unitigs.bin.idx`: `kmer_prefix[b]` = total k-mers before +block `b`. For 33 M k-mers: ~73 600 blocks × 4 bytes = **295 KB/partition**. + +This enables `kmer_at_global_pos(p)`: + +1. Binary search in `kmer_prefix[]` to find block `b`. +2. Sequential scan from `block_offsets[b]` until cumulative k-mer count reaches `p`. +3. Extract the k-mer at the remaining rank within the found chunk. + +Cost: O(log(n_blocks) + BLOCK_SIZE) ≈ O(17 + 64) memory accesses. + +### Combined with Approach 1 + +- evidence.bin: **eliminated** (~270 GB saved across 2048 partitions) +- kmer_prefix array: ~295 KB/partition × 2048 = ~600 MB total (negligible) + +--- + +## Recommended path + +1. **Short term (approximate index)**: implement MPHF + fingerprint.bin. Choose + (B, z) as index parameters. Drop `evidence.bin` and `unitigs.bin.idx`; keep + `unitigs.bin` for K-mer enumeration. Expected size: ~43 MB/partition at B=8 + vs ~142 MB for the exact lookup layer. + +2. **Short term (exact index)**: add `kmer_prefix[]` to `unitigs.bin.idx`. + Zero cost if evidence.bin is kept; enables Approach 1 when ready. + +3. **Medium term**: implement GOV retrieval data structure in Rust, or port + SSHash construction. + +4. **Long term**: replace `evidence.bin` with the value-based MPHF. Expected + index size reduction: ~50 % of the lookup layer, ~270 GB on the BCT dataset. + +--- + +## Open questions + +- Is a GOV construction compatible with the parallel MPHF build currently used + (PtrHash's `new_from_par_iter`)? GOV construction is inherently sequential + (hypergraph peeling); parallelisation is non-trivial. +- Can the SSHash "skew" insight be reused without the minimizer-bucket structure? + The obikmer partitioning already uses minimizers — there may be natural alignment. +- What is the query latency impact of replacing O(1) evidence lookup with + O(log n_blocks + BLOCK_SIZE) scan? Needs benchmarking at realistic BCT scale. +- What is the optimal (B, z) trade-off for the approximate index given the target + read length and acceptable P(FP_read)? diff --git a/docmd/implementation/unitig_evidence.md b/docmd/implementation/unitig_evidence.md index 61ae3df..b0474ea 100644 --- a/docmd/implementation/unitig_evidence.md +++ b/docmd/implementation/unitig_evidence.md @@ -268,3 +268,5 @@ The MPHF is built from the **k-mer set**, not from the unitig sequences themselv ## Open questions - **Cross-partition evidence**: for set operations spanning multiple partitions, strategy B allows unitig-level operations (e.g. mark entire unitigs as present/absent) rather than kmer-level, potentially reducing the operation cost by a factor of m_u. + +- **Eliminating evidence.bin**: at ~66 % of the per-layer lookup footprint (132 MB vs 200 MB total per partition on the bacterial BCT dataset), evidence.bin dominates index size. A dedicated design investigation is open — see [Evidence elimination design discussion](evidence_elimination.md). diff --git a/mkdocs.yml b/mkdocs.yml index 9f2ea83..6d9bc3a 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -44,6 +44,7 @@ nav: - On-disk storage: implementation/storage.md - MPHF selection: implementation/mphf.md - Unitig evidence encoding: implementation/unitig_evidence.md + - Evidence elimination (discussion): implementation/evidence_elimination.md - obilayeredmap crate: implementation/obilayeredmap.md - PersistentCompactIntVec: implementation/persistent_compact_int_vec.md - PersistentBitVec: implementation/persistent_bit_vec.md diff --git a/src/obikmer/src/steps/scatter.rs b/src/obikmer/src/steps/scatter.rs index f451ea0..777e92b 100644 --- a/src/obikmer/src/steps/scatter.rs +++ b/src/obikmer/src/steps/scatter.rs @@ -1,4 +1,6 @@ use std::path::PathBuf; +use std::sync::atomic::{AtomicU64, Ordering}; +use std::sync::Arc; use std::time::{Duration, Instant}; use indicatif::{ProgressBar, ProgressStyle}; @@ -42,15 +44,21 @@ pub fn scatter( // Throttle in the source thread — never in a worker — to prevent deadlock. let throttled = throttle_paths(path_source, max_open); + let file_count = Arc::new(AtomicU64::new(0)); + let t = Stage::start("scatter"); let pipe = obipipeline::make_pipe! { PipelineData : PathWithSlot => Vec, - ||? { move |pw: PathWithSlot| { - let PathWithSlot { path, _guard } = pw; - info!("indexing: {}", path.display()); - // _guard travels into GuardedIter; released when all chunks are read - open_chunks(path).map(|iter| GuardedIter { inner: iter, _guard }) - }} : Path => RawChunk, + ||? { + let file_count = Arc::clone(&file_count); + move |pw: PathWithSlot| { + let PathWithSlot { path, _guard } = pw; + let n = file_count.fetch_add(1, Ordering::Relaxed) + 1; + info!("indexing [{}]: {}", n, path.display()); + // _guard travels into GuardedIter; released when all chunks are read + open_chunks(path).map(|iter| GuardedIter { inner: iter, _guard }) + } + } : Path => RawChunk, |? { move |rope| obiread::normalize_sequence_chunk(rope, k) } : RawChunk => NormChunk, | { move |rope| obiskbuilder::build_superkmers(rope, k, level_max, theta) } : NormChunk => Batch, }; @@ -84,7 +92,8 @@ pub fn scatter( } else { (format!("{:.0} Mbp", bp / 1e6), format!("{:.0} Mbp/s", ema_rate / 1e6)) }; - pb.set_message(format!("{count_str} {rate_str}")); + let n_files = file_count.load(Ordering::Relaxed); + pb.set_message(format!("{count_str} {rate_str} {n_files} files")); } kp.write_batch(batch).unwrap_or_else(|e| { eprintln!("error: {e}"); diff --git a/src/obiskio/src/unitig_index.rs b/src/obiskio/src/unitig_index.rs index 479740b..0fe4444 100644 --- a/src/obiskio/src/unitig_index.rs +++ b/src/obiskio/src/unitig_index.rs @@ -9,21 +9,20 @@ pub use obikseq::MAX_KMERS_PER_CHUNK; use crate::error::{SKError, SKResult}; -// ── Index file format ───────────────────────────────────────────────────────── +// ── Block index parameters ──────────────────────────────────────────────────── // -// magic: [u8; 4] = b"UIDX" -// n_unitigs: u32 LE -// n_kmers: u64 LE total kmer count across all chunks -// seqls: [u8; n_unitigs] max kmer index per chunk (= n_kmers − 1) -// packed_offsets: [u32; n_unitigs + 1] byte offsets to packed bytes in the -// sequence file; last entry is sentinel +// One offset entry per BLOCK_SIZE chunks. BLOCK_SIZE must be a power of two +// so that block = i >> LOG2_BLOCK_SIZE and rem = i & (BLOCK_SIZE − 1) are +// branchless shifts/masks rather than divisions. // -// Each sequence record in the binary file: [u8: n_kmers−1][packed bytes]. -// Offsets point to the first packed byte of each record, past the leading u8. -// Unitigs with more than MAX_KMERS_PER_CHUNK kmers are transparently split by the -// writer into overlapping chunks (k-1 nucleotide overlap) so no kmer is lost. +// With BLOCK_SIZE = 64 and an average chunk size of ~10 bytes, a random lookup +// scans at most 63 × 10 = 630 bytes sequentially — negligible next to the MPHF +// lookup that precedes it. The index file shrinks from ~5 bytes/chunk to +// ~1/64 bytes/chunk (≈ 300× for typical workloads). -const MAGIC: [u8; 4] = *b"UIDX"; +const MAGIC: [u8; 4] = *b"UIX2"; +const BLOCK_SIZE: usize = 64; +const LOG2_BLOCK_SIZE: u32 = 6; // 2^6 = BLOCK_SIZE fn idx_path(path: &Path) -> PathBuf { crate::append_path_suffix(path, ".idx") @@ -32,21 +31,21 @@ fn idx_path(path: &Path) -> PathBuf { // ── Writer ──────────────────────────────────────────────────────────────────── /// Writes a sequence of [`Unitig`] to an uncompressed binary file and builds -/// an offset index at close time. +/// a block-sampled offset index at close time. /// -/// Unitigs with more than [`MAX_KMERS_PER_CHUNK`] kmers are transparently split -/// into overlapping chunks (k-1 nucleotide overlap) so no kmer is lost. +/// One offset is stored every [`BLOCK_SIZE`] chunks; random access to chunk `i` +/// costs at most `BLOCK_SIZE − 1` sequential chunk scans after the block lookup. /// -/// The companion index file (`path.idx`) is written on [`close`]. -/// The binary format per record is `[u8: n_kmers−1][packed 2-bit bytes]`. +/// Unitigs with more than [`MAX_KMERS_PER_CHUNK`] k-mers are transparently split +/// into overlapping chunks (k−1 nucleotide overlap) so no k-mer is lost. pub struct UnitigFileWriter { - path: PathBuf, - file: BufWriter, - seqls: Vec, - packed_offsets: Vec, - next_offset: u32, - n_kmers: usize, - k: usize, + path: PathBuf, + file: BufWriter, + block_offsets: Vec, // byte offset of first record in each block + chunk_count: usize, + next_offset: u32, // byte offset of the START of the next record + n_kmers: usize, + k: usize, } impl UnitigFileWriter { @@ -55,15 +54,16 @@ impl UnitigFileWriter { Ok(Self { path: path.to_owned(), file: BufWriter::new(file), - seqls: Vec::new(), - packed_offsets: Vec::new(), + block_offsets: Vec::new(), + chunk_count: 0, next_offset: 0, n_kmers: 0, k: obikseq::params::k(), }) } - /// Write a unitig, splitting it into chunks if it exceeds [`MAX_KMERS_PER_CHUNK`]. + /// Write a unitig, splitting into overlapping chunks if it exceeds + /// [`MAX_KMERS_PER_CHUNK`]. pub fn write(&mut self, unitig: &Unitig) -> SKResult<()> { let seql = unitig.seql(); let k = self.k; @@ -77,17 +77,13 @@ impl UnitigFileWriter { return self.write_chunk(unitig); } - // Split into overlapping chunks of MAX_KMERS_PER_CHUNK kmers. - // Overlap of k-1 nucleotides ensures no kmer is lost at boundaries. let chunk_nucl = MAX_KMERS_PER_CHUNK + k - 1; let stride = MAX_KMERS_PER_CHUNK; let mut start = 0; while start < seql { let end = (start + chunk_nucl).min(seql); self.write_chunk(&unitig.sub(start, end))?; - if end == seql { - break; - } + if end == seql { break; } start += stride; } Ok(()) @@ -97,54 +93,48 @@ impl UnitigFileWriter { let seql = unitig.seql(); let byte_len = (seql + 3) / 4; - // Header is 1 byte (u8: n_kmers − 1 = seql − k); packed bytes follow. debug_assert!(seql - self.k <= u8::MAX as usize, "chunk exceeds MAX_KMERS_PER_CHUNK"); - self.packed_offsets.push(self.next_offset + 1); - self.seqls.push((seql - self.k) as u8); - self.n_kmers += seql - self.k + 1; - unitig - .write_to_binary(&mut self.file) - .map_err(SKError::Io)?; + // Record a block offset at the start of every BLOCK_SIZE-th chunk. + if self.chunk_count & (BLOCK_SIZE - 1) == 0 { + self.block_offsets.push(self.next_offset); + } + + self.n_kmers += seql - self.k + 1; + self.chunk_count += 1; + + unitig.write_to_binary(&mut self.file).map_err(SKError::Io)?; self.next_offset += 1 + byte_len as u32; Ok(()) } - /// Flush the sequence file and write the companion `.idx`. pub fn close(mut self) -> SKResult<()> { self.file.flush().map_err(SKError::Io)?; drop(self.file); - // Sentinel: byte offset past the last record's packed bytes. - let sentinel = match (self.packed_offsets.last(), self.seqls.last()) { - (Some(&last_off), Some(&last_seql)) => { - let seql = last_seql as u32 + self.k as u32; - last_off + (seql + 3) / 4 - } - _ => 0, - }; - self.packed_offsets.push(sentinel); + // Sentinel: byte offset past the last record (needed for end-of-file detection). + self.block_offsets.push(self.next_offset); - write_idx(&idx_path(&self.path), &self.seqls, &self.packed_offsets, self.n_kmers) + write_idx( + &idx_path(&self.path), + self.chunk_count as u32, + self.n_kmers as u64, + &self.block_offsets, + ) } - pub fn len(&self) -> usize { - self.seqls.len() - } - - pub fn is_empty(&self) -> bool { - self.seqls.is_empty() - } + pub fn len(&self) -> usize { self.chunk_count } + pub fn is_empty(&self) -> bool { self.chunk_count == 0 } } -fn write_idx(path: &Path, seqls: &[u8], packed_offsets: &[u32], n_kmers: usize) -> SKResult<()> { +fn write_idx(path: &Path, n_unitigs: u32, n_kmers: u64, block_offsets: &[u32]) -> SKResult<()> { let mut w = BufWriter::new(File::create(path).map_err(SKError::Io)?); w.write_all(&MAGIC).map_err(SKError::Io)?; - w.write_all(&(seqls.len() as u32).to_le_bytes()).map_err(SKError::Io)?; - w.write_all(&(n_kmers as u64).to_le_bytes()).map_err(SKError::Io)?; - w.write_all(seqls).map_err(SKError::Io)?; - for &off in packed_offsets { + w.write_all(&(BLOCK_SIZE as u32).to_le_bytes()).map_err(SKError::Io)?; + w.write_all(&n_unitigs.to_le_bytes()).map_err(SKError::Io)?; + w.write_all(&n_kmers.to_le_bytes()).map_err(SKError::Io)?; + for &off in block_offsets { w.write_all(&off.to_le_bytes()).map_err(SKError::Io)?; } w.flush().map_err(SKError::Io) @@ -154,105 +144,116 @@ fn write_idx(path: &Path, seqls: &[u8], packed_offsets: &[u32], n_kmers: usize) /// Read-only random-access view of a unitig file. /// -/// The sequence file is memory-mapped; the index is loaded into RAM on open. -/// All per-kmer operations are O(1) and allocation-free. +/// The sequence file is memory-mapped; the block offset table is loaded into RAM +/// on open (≈ n_chunks / BLOCK_SIZE entries, negligible memory). +/// +/// Random access to chunk `i`: O(BLOCK_SIZE) sequential mmap reads — branchless +/// shift/mask arithmetic, cache-friendly, negligible versus the MPHF lookup. +/// +/// Sequential iteration: O(n) via a running-offset cursor (no per-chunk overhead). pub struct UnitigFileReader { - mmap: Mmap, - seqls: Vec, - packed_offsets: Vec, - n_kmers: usize, - k: usize, + mmap: Mmap, + block_offsets: Vec, + n_unitigs: usize, + n_kmers: usize, + k: usize, } impl UnitigFileReader { pub fn open(path: &Path) -> SKResult { let file = File::open(path).map_err(SKError::Io)?; let mmap = unsafe { Mmap::map(&file).map_err(SKError::Io)? }; - let (seqls, packed_offsets, n_kmers) = read_idx(&idx_path(path))?; + let (n_unitigs, n_kmers, block_offsets) = read_idx(&idx_path(path))?; let k = obikseq::params::k(); - Ok(Self { mmap, seqls, packed_offsets, n_kmers, k }) + Ok(Self { mmap, block_offsets, n_unitigs, n_kmers, k }) } - pub fn len(&self) -> usize { - self.seqls.len() + pub fn len(&self) -> usize { self.n_unitigs } + pub fn is_empty(&self) -> bool { self.n_unitigs == 0 } + pub fn n_kmers(&self) -> usize { self.n_kmers } + + /// Byte offset of the START of record `i` (the seql byte) in the mmap. + /// O(BLOCK_SIZE) sequential scan within the block. + #[inline] + fn chunk_start(&self, i: usize) -> usize { + let block = i >> LOG2_BLOCK_SIZE; + let rem = i & (BLOCK_SIZE - 1); + let mut offset = self.block_offsets[block] as usize; + for _ in 0..rem { + let seql_minus_k = self.mmap[offset] as usize; + offset += 1 + (seql_minus_k + self.k + 3) / 4; + } + offset } - pub fn is_empty(&self) -> bool { - self.seqls.is_empty() - } - - /// Total number of kmers across all chunks. - pub fn n_kmers(&self) -> usize { - self.n_kmers - } - - /// Return the nucleotide length of chunk `i`. + /// Nucleotide length of chunk `i`. #[inline] pub fn seql(&self, i: usize) -> usize { - self.seqls[i] as usize + self.k + self.mmap[self.chunk_start(i)] as usize + self.k } - /// Reconstruct chunk `i` as a [`Unitig`]. Allocates a copy of the packed bytes. + /// Reconstruct chunk `i` as a [`Unitig`]. pub fn unitig(&self, i: usize) -> Unitig { - let seql = self.seqls[i] as usize + self.k; - let start = self.packed_offsets[i] as usize; + let offset = self.chunk_start(i); + let seql = self.mmap[offset] as usize + self.k; let byte_len = (seql + 3) / 4; - let tail = (seql % 4) as u8; - let bytes = self.mmap[start..start + byte_len].to_vec().into_boxed_slice(); - Unitig::new(tail, bytes) + let bytes = self.mmap[offset + 1..offset + 1 + byte_len].to_vec().into_boxed_slice(); + Unitig::new((seql % 4) as u8, bytes) } - /// Extract the raw left-aligned u64 of the kmer at position `j` within chunk `i`. + /// Raw left-aligned u64 of the k-mer at position `j` within chunk `i`. #[inline] pub fn raw_kmer(&self, i: usize, j: usize) -> u64 { - let start = self.packed_offsets[i] as usize; - extract_kmer_raw(&self.mmap[start..], j, self.k) + let offset = self.chunk_start(i); + extract_kmer_raw(&self.mmap[offset + 1..], j, self.k) } - /// Return `true` iff the kmer at position `j` of chunk `i` equals `query`. - /// - /// O(1), zero allocation. The chunk may store either orientation of the kmer; - /// canonicalization is applied before comparison. + /// `true` iff the k-mer at position `j` of chunk `i` equals `query` (canonical). #[inline] pub fn verify_canonical_kmer(&self, i: usize, j: usize, query: CanonicalKmer) -> bool { canonical_raw(self.raw_kmer(i, j), self.k) == query.raw() } - /// Iterate over all kmers in file order (all positions of chunk 0, then chunk 1, …). - /// - /// Each chunk is copied from the mmap once; iteration within the chunk is - /// zero-allocation (sliding-window via [`OwnedPackedSeqKmerIter`]). + // ── Sequential iterators (O(n) running-offset cursor) ───────────────────── + + /// Iterate all chunks in file order with a running byte offset — O(n) total. + fn iter_chunks_sequential(&self) -> impl Iterator + '_ { + let k = self.k; + let mmap = &*self.mmap; + let n = self.n_unitigs; + let mut offset = 0usize; + (0..n).map(move |chunk_id| { + let seql = mmap[offset] as usize + k; + let byte_len = (seql + 3) / 4; + let bytes = mmap[offset + 1..offset + 1 + byte_len].to_vec().into_boxed_slice(); + offset += 1 + byte_len; + (chunk_id, Unitig::new((seql % 4) as u8, bytes)) + }) + } + pub fn iter_kmers(&self) -> impl Iterator + '_ { - (0..self.len()).flat_map(move |i| self.unitig(i).into_kmers()) + self.iter_chunks_sequential() + .flat_map(|(_, u)| u.into_kmers()) } - /// Iterate over all canonical kmers in file order. - /// - /// Equivalent to `iter_kmers().map(|km| km.canonical())` but uses the - /// built-in canonical iterator on each chunk, which avoids a separate - /// canonicalization pass. pub fn iter_canonical_kmers(&self) -> impl Iterator + '_ { - (0..self.len()).flat_map(move |i| self.unitig(i).into_canonical_kmers()) + self.iter_chunks_sequential() + .flat_map(|(_, u)| u.into_canonical_kmers()) } - /// Iterate over `(kmer, chunk_id, rank)` for every canonical kmer in the file. - /// - /// `chunk_id` is the index of the chunk within this file; `rank` is the - /// 0-based position of the kmer within that chunk. Used to build the - /// evidence table in `obilayeredmap`. pub fn iter_indexed_canonical_kmers( &self, ) -> impl Iterator + '_ { - (0..self.len()).flat_map(move |chunk_id| { - self.unitig(chunk_id) - .into_canonical_kmers() - .enumerate() - .map(move |(rank, kmer)| (kmer, chunk_id, rank)) - }) + self.iter_chunks_sequential() + .flat_map(|(chunk_id, u)| { + u.into_canonical_kmers() + .enumerate() + .map(move |(rank, kmer)| (kmer, chunk_id, rank)) + }) } } -fn read_idx(path: &Path) -> SKResult<(Vec, Vec, usize)> { +fn read_idx(path: &Path) -> SKResult<(usize, usize, Vec)> { let data = std::fs::read(path).map_err(SKError::Io)?; let mut pos = 0; @@ -260,15 +261,27 @@ fn read_idx(path: &Path) -> SKResult<(Vec, Vec, usize)> { .ok_or(SKError::Truncated { context: "unitig index: magic" })?; if magic_bytes != &MAGIC { return Err(SKError::BadMagic { - expected: "UIDX", + expected: "UIX2", got: magic_bytes.try_into().unwrap(), }); } pos += 4; + // block_size stored for forward-compatibility verification + let bs_bytes = data.get(pos..pos + 4) + .ok_or(SKError::Truncated { context: "unitig index: block_size" })?; + let stored_bs = u32::from_le_bytes(bs_bytes.try_into().unwrap()) as usize; + if stored_bs != BLOCK_SIZE { + return Err(SKError::InvalidData { + context: "unitig index", + detail: format!("block_size mismatch: file={stored_bs} code={BLOCK_SIZE}"), + }); + } + pos += 4; + let n_bytes = data.get(pos..pos + 4) .ok_or(SKError::Truncated { context: "unitig index: n_unitigs" })?; - let n = u32::from_le_bytes(n_bytes.try_into().unwrap()) as usize; + let n_unitigs = u32::from_le_bytes(n_bytes.try_into().unwrap()) as usize; pos += 4; let nk_bytes = data.get(pos..pos + 8) @@ -276,25 +289,21 @@ fn read_idx(path: &Path) -> SKResult<(Vec, Vec, usize)> { let n_kmers = u64::from_le_bytes(nk_bytes.try_into().unwrap()) as usize; pos += 8; - let seqls = data.get(pos..pos + n) - .ok_or(SKError::Truncated { context: "unitig index: seqls" })? - .to_vec(); - pos += n; - - let mut packed_offsets = Vec::with_capacity(n + 1); - for _ in 0..=n { + let n_blocks = (n_unitigs + BLOCK_SIZE - 1) >> LOG2_BLOCK_SIZE; + let n_offsets = n_blocks + 1; // +1 for sentinel + let mut block_offsets = Vec::with_capacity(n_offsets); + for _ in 0..n_offsets { let off_bytes = data.get(pos..pos + 4) - .ok_or(SKError::Truncated { context: "unitig index: packed_offsets" })?; - packed_offsets.push(u32::from_le_bytes(off_bytes.try_into().unwrap())); + .ok_or(SKError::Truncated { context: "unitig index: block_offsets" })?; + block_offsets.push(u32::from_le_bytes(off_bytes.try_into().unwrap())); pos += 4; } - Ok((seqls, packed_offsets, n_kmers)) + Ok((n_unitigs, n_kmers, block_offsets)) } // ── Kmer utilities ──────────────────────────────────────────────────────────── -/// Reverse complement of a left-aligned 2-bit kmer (same algorithm as [`KmerOf::revcomp`]). #[inline] fn revcomp_raw(raw: u64, k: usize) -> u64 { let x = !raw; @@ -304,22 +313,17 @@ fn revcomp_raw(raw: u64, k: usize) -> u64 { x << (64 - 2 * k) } -/// Canonical form of a left-aligned 2-bit kmer: `min(kmer, revcomp(kmer))`. #[inline] fn canonical_raw(raw: u64, k: usize) -> u64 { raw.min(revcomp_raw(raw, k)) } -// ── Bit extraction ──────────────────────────────────────────────────────────── - -/// Extract the kmer at nucleotide position `j` from MSB-first 2-bit packed `bytes`. -/// Returns a left-aligned u64 matching [`KmerOf`]'s internal representation. #[inline] fn extract_kmer_raw(bytes: &[u8], j: usize, k: usize) -> u64 { - let bit_start = j * 2; - let byte_start = bit_start / 8; - let bit_offset = bit_start % 8; // always 0, 2, 4, or 6 - let bytes_needed = (bit_offset + 2 * k + 7) / 8; // ≤ 9 for k ≤ 32 + let bit_start = j * 2; + let byte_start = bit_start / 8; + let bit_offset = bit_start % 8; + let bytes_needed = (bit_offset + 2 * k + 7) / 8; let mut acc = 0u128; for idx in 0..bytes_needed { @@ -327,8 +331,8 @@ fn extract_kmer_raw(bytes: &[u8], j: usize, k: usize) -> u64 { } let shift = bytes_needed * 8 - bit_offset - 2 * k; - let mask = !0u64 >> (64 - 2 * k); - let raw = (acc >> shift) as u64 & mask; + let mask = !0u64 >> (64 - 2 * k); + let raw = (acc >> shift) as u64 & mask; raw << (64 - 2 * k) }