From 4a5ab0b8c24457c40fc2be720fd6e722881e21c5 Mon Sep 17 00:00:00 2001 From: Eric Coissac Date: Sat, 23 May 2026 07:51:59 +0200 Subject: [PATCH 01/10] feat: optimize unitig index and document evidence elimination MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Replace the dense per-chunk offset index with a sparse block-sampled structure (64 chunks per block), reducing the index file size by approximately 300× while preserving O(1) k-mer extraction. Introduce a design document for eliminating the `evidence.bin` file, which accounts for ~66% of the lookup layer, by transitioning to fingerprint-based approximate indexing and value-based MPHF lookups. Update MkDocs navigation to include the new documentation and add a file count tracker to the scatter step progress bar for improved observability. --- docmd/implementation/evidence_elimination.md | 321 +++++++++++++++++++ docmd/implementation/unitig_evidence.md | 2 + mkdocs.yml | 1 + src/obikmer/src/steps/scatter.rs | 23 +- src/obiskio/src/unitig_index.rs | 292 ++++++++--------- 5 files changed, 488 insertions(+), 151 deletions(-) create mode 100644 docmd/implementation/evidence_elimination.md diff --git a/docmd/implementation/evidence_elimination.md b/docmd/implementation/evidence_elimination.md new file mode 100644 index 0000000..ecc2c6f --- /dev/null +++ b/docmd/implementation/evidence_elimination.md @@ -0,0 +1,321 @@ +# Evidence elimination — design discussion + +## Problem statement + +`evidence.bin` maps each MPHF slot to a position in the unitig store so that +query verification is possible: given a slot `s` returned by `mphf.index(kmer)`, +retrieve the k-mer stored at that position and compare with the query. + +On the bacterial BCT dataset (2048 partitions, k=31, ~33 M k-mers/partition): + +| file | size/partition | total (2048 parts) | fraction of lookup layer | +|---|---|---|---| +| evidence.bin | 132 MB | ~270 GB | **66 %** | +| unitigs.bin | 58 MB | ~118 GB | 29 % | +| mphf.bin | 10 MB | ~20 GB | 5 % | + +Evidence dominates. Eliminating or drastically shrinking it is the highest-leverage +optimisation available for index size. + +--- + +## Why evidence exists + +PtrHash (like all standard MPHFs) maps **any** input to a valid slot in `[0, n)`. +For a query k-mer not in the indexed set, the returned slot is meaningless but +indistinguishable from a real hit without external information. Evidence provides +that information: `evidence[s]` encodes the location of the k-mer that legitimately +occupies slot `s`, allowing the verification: + +``` +slot = mphf.index(query) +(chunk_id, rank) = evidence.decode(slot) +stored_kmer = unitigs.kmer_at(chunk_id, rank) +return canonical(stored_kmer) == canonical(query) +``` + +Evidence is a **permutation** from MPHF-space to unitig-position-space. +Storing it costs at minimum log₂(n_kmers) bits per slot — irrespective of encoding. + +--- + +## Information-theoretic lower bound + +For a partition with P k-mers and U unitigs of average length m_u k-mers: + +- global k-mer index range: [0, P) → ⌈log₂ P⌉ bits +- (chunk_id, rank) pair: ⌈log₂ U⌉ + ⌈log₂ L_max⌉ bits + +Current implementation: 25 + 7 = 32 bits (aligned u32). +Theoretical minimum: ⌈log₂ P⌉ ≈ 25 bits for P ≈ 33 M. + +**Packing headroom: ~22 %.** Not a path to elimination. + +--- + +## Two-index architecture + +The exact index is mandatory for set operations (union, intersection, diff) and +exact k-mer retrieval. A separate approximate index, built for query operations, +can tolerate a controlled false positive rate in exchange for a much smaller +footprint. + +| component | exact index | approximate index | +|---|---|---| +| `mphf.bin` | ✓ | ✓ (same structure) | +| `evidence.bin` | ✓ (32 bits/k-mer) | ✗ | +| `fingerprint.bin` | ✗ | ✓ (B bits/k-mer) | +| `unitigs.bin` | ✓ | ✓ (K-mer enumeration) | +| `unitigs.bin.idx` | ✓ | ✗ (random access not needed) | + +The approximate index drops `evidence.bin` and `unitigs.bin.idx`; it keeps +`unitigs.bin` for sequential enumeration of K-mers. + +--- + +## MPHF as a perfect Bloom filter + +A standard Bloom filter with a single hash function and N bits storing M keys has +occupancy M/N. For a foreign query k-mer, P(FP) = M/N — the probability of +landing on a set bit. The empty space (fraction 1 − M/N of bits at 0) is what +rejects foreign k-mers. + +An MPHF is a Bloom filter with **zero internal collisions**: every indexed k-mer +occupies its own unique slot. But unlike a Bloom filter, the MPHF maps **any** +input to a slot in [0, M) — there is no empty space. Every query lands on an +occupied slot. The MPHF alone cannot reject foreign k-mers at all. + +Adding a B-bit fingerprint restores the discrimination: + +``` +slot = mphf.index(query) +fingerprint = hash(query) & mask_B +present = fingerprint_table[slot] == fingerprint +``` + +The fingerprint plays the role of the sparse space in the Bloom filter: it provides +the B bits of information needed to reject foreign k-mers. + +Both structures reach the same fundamental cost for a given FP rate. For 1% FP: + +- Bloom filter (optimal, k hash functions): ~9.6 bits/key +- MPHF (~3 bits/key) + fingerprint (7 bits/key): ~10 bits/key + +This is a fundamental bound, not an implementation detail. + +--- + +## Approach A — MPHF + fingerprint (approximate index) + +### Size + +| B (bits) | fingerprint.bin/partition | vs evidence.bin (32 bits) | +|---|---|---| +| 8 | 33 MB | 4× smaller | +| 12 | 49 MB | 2.7× smaller | +| 16 | 66 MB | 2× smaller | + +Total approximate index per partition at B=8: ~43 MB (vs ~142 MB for exact lookup layer). + +### False positive rate — per k-mer query + +For a specific non-indexed query k-mer q: + +1. MPHF(q) → slot s, some value in [0, M) +2. fingerprint_table[s] holds the B-bit fingerprint of the legitimate k-mer at s +3. FP event: hash(q) & mask_B == fingerprint_table[s] + +Since q is not the legitimate k-mer at s, its fingerprint is independent of +fingerprint_table[s], giving: + +``` +P(FP per k-mer) = 1 / 2^B +``` + +This is the probability of error **for one specific query k-mer**. It is not the +fraction of the k-mer universe that would be misclassified: querying all 4^k +possible k-mers would yield (4^k − M)/2^B false positives in absolute terms, but +that is not the relevant quantity for practical use. + +### Equivalence classes + +The MPHF + fingerprint partitions the universe of 4^k k-mers into M·2^B +equivalence classes of average size 4^k/(M·2^B). Each class contains 1 true +indexed k-mer and 4^k/(M·2^B) − 1 false positives. A larger M (fewer partitions) +produces smaller classes — finer discrimination in k-mer space — while P(FP) = 1/2^B +remains constant. + +### Read-level use case + +The relevant decision unit is the **read**, not the individual k-mer. For a read +of ~100 nucleotides and k=31, there are ~70 k-mers. + +- A bacterial read queried against a bacterial index: nearly all ~70 k-mers are + true positives → high coverage fraction. +- A plant read queried against a bacterial index: k-mers are foreign; each has + P(FP) = 1/2^B independently → expected coverage fraction ≈ 1/2^B. + +A coverage threshold separates the two cases decisively. This is the same +principle as Findere: local coverage continuity distinguishes true hits from noise. + +--- + +## Approach B — z-consecutive k-mer matching + +A query for a K-mer of size K = k + z − 1 decomposes into z overlapping k-mers. +Declaring a match only when **all z are present** reduces the per-window FP rate: + +``` +P(FP per window of z) = (1/2^B)^z = 1/2^(B·z) +``` + +For a read with ~70 k-mers, there are ~70 − z + 1 independent windows of size z. +The probability that at least one window is a false positive: + +``` +P(FP_read) = 1 - (1 - 1/2^(B·z))^(70-z+1) ≈ (70-z+1) / 2^(B·z) +``` + +For B=8, z=4: P(FP_read) ≈ 67 / 2^32 ≈ 1.6×10⁻⁸. + +A plant read is misclassified as bacterial roughly once in 60 million reads — +negligible for any practical dataset. + +### Choosing B from (z, L, P_target) + +z is a query-time parameter and does not affect the index structure. However, +knowing z at build time allows computing the minimum B required to reach a target +FP rate P_target for reads of length L (giving W = L − k − z + 2 independent +windows): + +``` +P_target ≈ W / 2^(B·z) → B = ceil( (log2(W) - log2(P_target)) / z ) +``` + +Example: L=100, k=31, z=4, P_target=10⁻⁸ → W=67, B = ceil((6.07 + 26.6) / 4) = ceil(8.17) = **9 bits**. + +(B, z) are co-determined at build time to minimise fingerprint size while +guaranteeing the target read-level FP rate. + +### Combined sizing + +| B | z | K = k+z−1 | P(FP_read) | fingerprint.bin/partition | +|---|---|---|---|---| +| 8 | 2 | 32 | ~67/2^16 ≈ 10⁻³ | 33 MB | +| 8 | 4 | 34 | ~67/2^32 ≈ 10⁻⁸ | 33 MB | +| 4 | 4 | 34 | ~67/2^16 ≈ 10⁻³ | 16 MB | +| 4 | 8 | 38 | ~63/2^32 ≈ 10⁻⁸ | 16 MB | + +Smaller B → smaller fingerprint table; larger z → longer minimum match length K +and fewer independent windows per read. + +--- + +## Approach 1 — value-based MPHF (eliminates evidence.bin from exact index) + +Build the MPHF to output the global k-mer position directly: + +``` +mphf: kmer → global_pos ∈ [0, P) +``` + +Verification becomes: + +``` +global_pos = mphf.index(query) +stored_kmer = unitigs.kmer_at_global_pos(global_pos) +return canonical(stored_kmer) == canonical(query) +``` + +No evidence array. The unitig block index (see below) provides +`kmer_at_global_pos` in O(log(n_blocks) + BLOCK_SIZE) time. + +### What is required + +A **retrieval data structure** (also called a value-based or function-based MPHF): +given a set of (key, value) pairs with distinct keys and bijective values in `[0, n)`, +build a compact structure that maps each key to its assigned value. + +Known constructions: + +- **GOV / GBF (Generalized Bloomier Filter)**: random 3-uniform hypergraph + + XOR-based assignment. ~2.3 bits/key overhead over the information-theoretic + minimum. Construction: O(n). Query: O(1). +- **SSHash approach**: builds the MPHF to map k-mers to their positions in a + concatenated unitig string. Achieves elimination of external evidence using a + "skew" construction that aligns the MPHF output with the sequential unitig layout. + +### Rust availability + +No Rust crate implements a retrieval data structure suitable for this use case as +of 2025. The `ph`, `boomphf`, `fmphf`, and `ptr_hash` crates all build plain +MPHFs. **This is the key blocking factor.** + +### SSHash construction (reference) + +SSHash (Pibiri 2022, doi:10.1186/s13015-022-00216-6) constructs the MPHF over +(minimizer, position-within-minimizer-bucket) pairs, aligning slots with sequential +positions in the concatenated unitig string. A port to obikmer would require: + +1. Concatenating all unitig sequences into a single flat buffer per partition. +2. Assigning each k-mer a global position (its offset in that buffer). +3. Building the MPHF to output that position directly (retrieval step). +4. Replacing `evidence.bin` with a small prefix-sum index for `kmer_at_global_pos`. + +--- + +## Approach 2 — block index prefix sums (reduces evidence to negligible) + +A prerequisite already implemented: `unitigs.bin.idx` now uses a **block-sampled +offset index** (one `u32` per `BLOCK_SIZE=64` chunks) instead of a per-chunk offset +table. + +### Extension: k-mer prefix sums per block + +Add a second array to `unitigs.bin.idx`: `kmer_prefix[b]` = total k-mers before +block `b`. For 33 M k-mers: ~73 600 blocks × 4 bytes = **295 KB/partition**. + +This enables `kmer_at_global_pos(p)`: + +1. Binary search in `kmer_prefix[]` to find block `b`. +2. Sequential scan from `block_offsets[b]` until cumulative k-mer count reaches `p`. +3. Extract the k-mer at the remaining rank within the found chunk. + +Cost: O(log(n_blocks) + BLOCK_SIZE) ≈ O(17 + 64) memory accesses. + +### Combined with Approach 1 + +- evidence.bin: **eliminated** (~270 GB saved across 2048 partitions) +- kmer_prefix array: ~295 KB/partition × 2048 = ~600 MB total (negligible) + +--- + +## Recommended path + +1. **Short term (approximate index)**: implement MPHF + fingerprint.bin. Choose + (B, z) as index parameters. Drop `evidence.bin` and `unitigs.bin.idx`; keep + `unitigs.bin` for K-mer enumeration. Expected size: ~43 MB/partition at B=8 + vs ~142 MB for the exact lookup layer. + +2. **Short term (exact index)**: add `kmer_prefix[]` to `unitigs.bin.idx`. + Zero cost if evidence.bin is kept; enables Approach 1 when ready. + +3. **Medium term**: implement GOV retrieval data structure in Rust, or port + SSHash construction. + +4. **Long term**: replace `evidence.bin` with the value-based MPHF. Expected + index size reduction: ~50 % of the lookup layer, ~270 GB on the BCT dataset. + +--- + +## Open questions + +- Is a GOV construction compatible with the parallel MPHF build currently used + (PtrHash's `new_from_par_iter`)? GOV construction is inherently sequential + (hypergraph peeling); parallelisation is non-trivial. +- Can the SSHash "skew" insight be reused without the minimizer-bucket structure? + The obikmer partitioning already uses minimizers — there may be natural alignment. +- What is the query latency impact of replacing O(1) evidence lookup with + O(log n_blocks + BLOCK_SIZE) scan? Needs benchmarking at realistic BCT scale. +- What is the optimal (B, z) trade-off for the approximate index given the target + read length and acceptable P(FP_read)? diff --git a/docmd/implementation/unitig_evidence.md b/docmd/implementation/unitig_evidence.md index 61ae3df..b0474ea 100644 --- a/docmd/implementation/unitig_evidence.md +++ b/docmd/implementation/unitig_evidence.md @@ -268,3 +268,5 @@ The MPHF is built from the **k-mer set**, not from the unitig sequences themselv ## Open questions - **Cross-partition evidence**: for set operations spanning multiple partitions, strategy B allows unitig-level operations (e.g. mark entire unitigs as present/absent) rather than kmer-level, potentially reducing the operation cost by a factor of m_u. + +- **Eliminating evidence.bin**: at ~66 % of the per-layer lookup footprint (132 MB vs 200 MB total per partition on the bacterial BCT dataset), evidence.bin dominates index size. A dedicated design investigation is open — see [Evidence elimination design discussion](evidence_elimination.md). diff --git a/mkdocs.yml b/mkdocs.yml index 9f2ea83..6d9bc3a 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -44,6 +44,7 @@ nav: - On-disk storage: implementation/storage.md - MPHF selection: implementation/mphf.md - Unitig evidence encoding: implementation/unitig_evidence.md + - Evidence elimination (discussion): implementation/evidence_elimination.md - obilayeredmap crate: implementation/obilayeredmap.md - PersistentCompactIntVec: implementation/persistent_compact_int_vec.md - PersistentBitVec: implementation/persistent_bit_vec.md diff --git a/src/obikmer/src/steps/scatter.rs b/src/obikmer/src/steps/scatter.rs index f451ea0..777e92b 100644 --- a/src/obikmer/src/steps/scatter.rs +++ b/src/obikmer/src/steps/scatter.rs @@ -1,4 +1,6 @@ use std::path::PathBuf; +use std::sync::atomic::{AtomicU64, Ordering}; +use std::sync::Arc; use std::time::{Duration, Instant}; use indicatif::{ProgressBar, ProgressStyle}; @@ -42,15 +44,21 @@ pub fn scatter( // Throttle in the source thread — never in a worker — to prevent deadlock. let throttled = throttle_paths(path_source, max_open); + let file_count = Arc::new(AtomicU64::new(0)); + let t = Stage::start("scatter"); let pipe = obipipeline::make_pipe! { PipelineData : PathWithSlot => Vec, - ||? { move |pw: PathWithSlot| { - let PathWithSlot { path, _guard } = pw; - info!("indexing: {}", path.display()); - // _guard travels into GuardedIter; released when all chunks are read - open_chunks(path).map(|iter| GuardedIter { inner: iter, _guard }) - }} : Path => RawChunk, + ||? { + let file_count = Arc::clone(&file_count); + move |pw: PathWithSlot| { + let PathWithSlot { path, _guard } = pw; + let n = file_count.fetch_add(1, Ordering::Relaxed) + 1; + info!("indexing [{}]: {}", n, path.display()); + // _guard travels into GuardedIter; released when all chunks are read + open_chunks(path).map(|iter| GuardedIter { inner: iter, _guard }) + } + } : Path => RawChunk, |? { move |rope| obiread::normalize_sequence_chunk(rope, k) } : RawChunk => NormChunk, | { move |rope| obiskbuilder::build_superkmers(rope, k, level_max, theta) } : NormChunk => Batch, }; @@ -84,7 +92,8 @@ pub fn scatter( } else { (format!("{:.0} Mbp", bp / 1e6), format!("{:.0} Mbp/s", ema_rate / 1e6)) }; - pb.set_message(format!("{count_str} {rate_str}")); + let n_files = file_count.load(Ordering::Relaxed); + pb.set_message(format!("{count_str} {rate_str} {n_files} files")); } kp.write_batch(batch).unwrap_or_else(|e| { eprintln!("error: {e}"); diff --git a/src/obiskio/src/unitig_index.rs b/src/obiskio/src/unitig_index.rs index 479740b..0fe4444 100644 --- a/src/obiskio/src/unitig_index.rs +++ b/src/obiskio/src/unitig_index.rs @@ -9,21 +9,20 @@ pub use obikseq::MAX_KMERS_PER_CHUNK; use crate::error::{SKError, SKResult}; -// ── Index file format ───────────────────────────────────────────────────────── +// ── Block index parameters ──────────────────────────────────────────────────── // -// magic: [u8; 4] = b"UIDX" -// n_unitigs: u32 LE -// n_kmers: u64 LE total kmer count across all chunks -// seqls: [u8; n_unitigs] max kmer index per chunk (= n_kmers − 1) -// packed_offsets: [u32; n_unitigs + 1] byte offsets to packed bytes in the -// sequence file; last entry is sentinel +// One offset entry per BLOCK_SIZE chunks. BLOCK_SIZE must be a power of two +// so that block = i >> LOG2_BLOCK_SIZE and rem = i & (BLOCK_SIZE − 1) are +// branchless shifts/masks rather than divisions. // -// Each sequence record in the binary file: [u8: n_kmers−1][packed bytes]. -// Offsets point to the first packed byte of each record, past the leading u8. -// Unitigs with more than MAX_KMERS_PER_CHUNK kmers are transparently split by the -// writer into overlapping chunks (k-1 nucleotide overlap) so no kmer is lost. +// With BLOCK_SIZE = 64 and an average chunk size of ~10 bytes, a random lookup +// scans at most 63 × 10 = 630 bytes sequentially — negligible next to the MPHF +// lookup that precedes it. The index file shrinks from ~5 bytes/chunk to +// ~1/64 bytes/chunk (≈ 300× for typical workloads). -const MAGIC: [u8; 4] = *b"UIDX"; +const MAGIC: [u8; 4] = *b"UIX2"; +const BLOCK_SIZE: usize = 64; +const LOG2_BLOCK_SIZE: u32 = 6; // 2^6 = BLOCK_SIZE fn idx_path(path: &Path) -> PathBuf { crate::append_path_suffix(path, ".idx") @@ -32,21 +31,21 @@ fn idx_path(path: &Path) -> PathBuf { // ── Writer ──────────────────────────────────────────────────────────────────── /// Writes a sequence of [`Unitig`] to an uncompressed binary file and builds -/// an offset index at close time. +/// a block-sampled offset index at close time. /// -/// Unitigs with more than [`MAX_KMERS_PER_CHUNK`] kmers are transparently split -/// into overlapping chunks (k-1 nucleotide overlap) so no kmer is lost. +/// One offset is stored every [`BLOCK_SIZE`] chunks; random access to chunk `i` +/// costs at most `BLOCK_SIZE − 1` sequential chunk scans after the block lookup. /// -/// The companion index file (`path.idx`) is written on [`close`]. -/// The binary format per record is `[u8: n_kmers−1][packed 2-bit bytes]`. +/// Unitigs with more than [`MAX_KMERS_PER_CHUNK`] k-mers are transparently split +/// into overlapping chunks (k−1 nucleotide overlap) so no k-mer is lost. pub struct UnitigFileWriter { - path: PathBuf, - file: BufWriter, - seqls: Vec, - packed_offsets: Vec, - next_offset: u32, - n_kmers: usize, - k: usize, + path: PathBuf, + file: BufWriter, + block_offsets: Vec, // byte offset of first record in each block + chunk_count: usize, + next_offset: u32, // byte offset of the START of the next record + n_kmers: usize, + k: usize, } impl UnitigFileWriter { @@ -55,15 +54,16 @@ impl UnitigFileWriter { Ok(Self { path: path.to_owned(), file: BufWriter::new(file), - seqls: Vec::new(), - packed_offsets: Vec::new(), + block_offsets: Vec::new(), + chunk_count: 0, next_offset: 0, n_kmers: 0, k: obikseq::params::k(), }) } - /// Write a unitig, splitting it into chunks if it exceeds [`MAX_KMERS_PER_CHUNK`]. + /// Write a unitig, splitting into overlapping chunks if it exceeds + /// [`MAX_KMERS_PER_CHUNK`]. pub fn write(&mut self, unitig: &Unitig) -> SKResult<()> { let seql = unitig.seql(); let k = self.k; @@ -77,17 +77,13 @@ impl UnitigFileWriter { return self.write_chunk(unitig); } - // Split into overlapping chunks of MAX_KMERS_PER_CHUNK kmers. - // Overlap of k-1 nucleotides ensures no kmer is lost at boundaries. let chunk_nucl = MAX_KMERS_PER_CHUNK + k - 1; let stride = MAX_KMERS_PER_CHUNK; let mut start = 0; while start < seql { let end = (start + chunk_nucl).min(seql); self.write_chunk(&unitig.sub(start, end))?; - if end == seql { - break; - } + if end == seql { break; } start += stride; } Ok(()) @@ -97,54 +93,48 @@ impl UnitigFileWriter { let seql = unitig.seql(); let byte_len = (seql + 3) / 4; - // Header is 1 byte (u8: n_kmers − 1 = seql − k); packed bytes follow. debug_assert!(seql - self.k <= u8::MAX as usize, "chunk exceeds MAX_KMERS_PER_CHUNK"); - self.packed_offsets.push(self.next_offset + 1); - self.seqls.push((seql - self.k) as u8); - self.n_kmers += seql - self.k + 1; - unitig - .write_to_binary(&mut self.file) - .map_err(SKError::Io)?; + // Record a block offset at the start of every BLOCK_SIZE-th chunk. + if self.chunk_count & (BLOCK_SIZE - 1) == 0 { + self.block_offsets.push(self.next_offset); + } + + self.n_kmers += seql - self.k + 1; + self.chunk_count += 1; + + unitig.write_to_binary(&mut self.file).map_err(SKError::Io)?; self.next_offset += 1 + byte_len as u32; Ok(()) } - /// Flush the sequence file and write the companion `.idx`. pub fn close(mut self) -> SKResult<()> { self.file.flush().map_err(SKError::Io)?; drop(self.file); - // Sentinel: byte offset past the last record's packed bytes. - let sentinel = match (self.packed_offsets.last(), self.seqls.last()) { - (Some(&last_off), Some(&last_seql)) => { - let seql = last_seql as u32 + self.k as u32; - last_off + (seql + 3) / 4 - } - _ => 0, - }; - self.packed_offsets.push(sentinel); + // Sentinel: byte offset past the last record (needed for end-of-file detection). + self.block_offsets.push(self.next_offset); - write_idx(&idx_path(&self.path), &self.seqls, &self.packed_offsets, self.n_kmers) + write_idx( + &idx_path(&self.path), + self.chunk_count as u32, + self.n_kmers as u64, + &self.block_offsets, + ) } - pub fn len(&self) -> usize { - self.seqls.len() - } - - pub fn is_empty(&self) -> bool { - self.seqls.is_empty() - } + pub fn len(&self) -> usize { self.chunk_count } + pub fn is_empty(&self) -> bool { self.chunk_count == 0 } } -fn write_idx(path: &Path, seqls: &[u8], packed_offsets: &[u32], n_kmers: usize) -> SKResult<()> { +fn write_idx(path: &Path, n_unitigs: u32, n_kmers: u64, block_offsets: &[u32]) -> SKResult<()> { let mut w = BufWriter::new(File::create(path).map_err(SKError::Io)?); w.write_all(&MAGIC).map_err(SKError::Io)?; - w.write_all(&(seqls.len() as u32).to_le_bytes()).map_err(SKError::Io)?; - w.write_all(&(n_kmers as u64).to_le_bytes()).map_err(SKError::Io)?; - w.write_all(seqls).map_err(SKError::Io)?; - for &off in packed_offsets { + w.write_all(&(BLOCK_SIZE as u32).to_le_bytes()).map_err(SKError::Io)?; + w.write_all(&n_unitigs.to_le_bytes()).map_err(SKError::Io)?; + w.write_all(&n_kmers.to_le_bytes()).map_err(SKError::Io)?; + for &off in block_offsets { w.write_all(&off.to_le_bytes()).map_err(SKError::Io)?; } w.flush().map_err(SKError::Io) @@ -154,105 +144,116 @@ fn write_idx(path: &Path, seqls: &[u8], packed_offsets: &[u32], n_kmers: usize) /// Read-only random-access view of a unitig file. /// -/// The sequence file is memory-mapped; the index is loaded into RAM on open. -/// All per-kmer operations are O(1) and allocation-free. +/// The sequence file is memory-mapped; the block offset table is loaded into RAM +/// on open (≈ n_chunks / BLOCK_SIZE entries, negligible memory). +/// +/// Random access to chunk `i`: O(BLOCK_SIZE) sequential mmap reads — branchless +/// shift/mask arithmetic, cache-friendly, negligible versus the MPHF lookup. +/// +/// Sequential iteration: O(n) via a running-offset cursor (no per-chunk overhead). pub struct UnitigFileReader { - mmap: Mmap, - seqls: Vec, - packed_offsets: Vec, - n_kmers: usize, - k: usize, + mmap: Mmap, + block_offsets: Vec, + n_unitigs: usize, + n_kmers: usize, + k: usize, } impl UnitigFileReader { pub fn open(path: &Path) -> SKResult { let file = File::open(path).map_err(SKError::Io)?; let mmap = unsafe { Mmap::map(&file).map_err(SKError::Io)? }; - let (seqls, packed_offsets, n_kmers) = read_idx(&idx_path(path))?; + let (n_unitigs, n_kmers, block_offsets) = read_idx(&idx_path(path))?; let k = obikseq::params::k(); - Ok(Self { mmap, seqls, packed_offsets, n_kmers, k }) + Ok(Self { mmap, block_offsets, n_unitigs, n_kmers, k }) } - pub fn len(&self) -> usize { - self.seqls.len() + pub fn len(&self) -> usize { self.n_unitigs } + pub fn is_empty(&self) -> bool { self.n_unitigs == 0 } + pub fn n_kmers(&self) -> usize { self.n_kmers } + + /// Byte offset of the START of record `i` (the seql byte) in the mmap. + /// O(BLOCK_SIZE) sequential scan within the block. + #[inline] + fn chunk_start(&self, i: usize) -> usize { + let block = i >> LOG2_BLOCK_SIZE; + let rem = i & (BLOCK_SIZE - 1); + let mut offset = self.block_offsets[block] as usize; + for _ in 0..rem { + let seql_minus_k = self.mmap[offset] as usize; + offset += 1 + (seql_minus_k + self.k + 3) / 4; + } + offset } - pub fn is_empty(&self) -> bool { - self.seqls.is_empty() - } - - /// Total number of kmers across all chunks. - pub fn n_kmers(&self) -> usize { - self.n_kmers - } - - /// Return the nucleotide length of chunk `i`. + /// Nucleotide length of chunk `i`. #[inline] pub fn seql(&self, i: usize) -> usize { - self.seqls[i] as usize + self.k + self.mmap[self.chunk_start(i)] as usize + self.k } - /// Reconstruct chunk `i` as a [`Unitig`]. Allocates a copy of the packed bytes. + /// Reconstruct chunk `i` as a [`Unitig`]. pub fn unitig(&self, i: usize) -> Unitig { - let seql = self.seqls[i] as usize + self.k; - let start = self.packed_offsets[i] as usize; + let offset = self.chunk_start(i); + let seql = self.mmap[offset] as usize + self.k; let byte_len = (seql + 3) / 4; - let tail = (seql % 4) as u8; - let bytes = self.mmap[start..start + byte_len].to_vec().into_boxed_slice(); - Unitig::new(tail, bytes) + let bytes = self.mmap[offset + 1..offset + 1 + byte_len].to_vec().into_boxed_slice(); + Unitig::new((seql % 4) as u8, bytes) } - /// Extract the raw left-aligned u64 of the kmer at position `j` within chunk `i`. + /// Raw left-aligned u64 of the k-mer at position `j` within chunk `i`. #[inline] pub fn raw_kmer(&self, i: usize, j: usize) -> u64 { - let start = self.packed_offsets[i] as usize; - extract_kmer_raw(&self.mmap[start..], j, self.k) + let offset = self.chunk_start(i); + extract_kmer_raw(&self.mmap[offset + 1..], j, self.k) } - /// Return `true` iff the kmer at position `j` of chunk `i` equals `query`. - /// - /// O(1), zero allocation. The chunk may store either orientation of the kmer; - /// canonicalization is applied before comparison. + /// `true` iff the k-mer at position `j` of chunk `i` equals `query` (canonical). #[inline] pub fn verify_canonical_kmer(&self, i: usize, j: usize, query: CanonicalKmer) -> bool { canonical_raw(self.raw_kmer(i, j), self.k) == query.raw() } - /// Iterate over all kmers in file order (all positions of chunk 0, then chunk 1, …). - /// - /// Each chunk is copied from the mmap once; iteration within the chunk is - /// zero-allocation (sliding-window via [`OwnedPackedSeqKmerIter`]). + // ── Sequential iterators (O(n) running-offset cursor) ───────────────────── + + /// Iterate all chunks in file order with a running byte offset — O(n) total. + fn iter_chunks_sequential(&self) -> impl Iterator + '_ { + let k = self.k; + let mmap = &*self.mmap; + let n = self.n_unitigs; + let mut offset = 0usize; + (0..n).map(move |chunk_id| { + let seql = mmap[offset] as usize + k; + let byte_len = (seql + 3) / 4; + let bytes = mmap[offset + 1..offset + 1 + byte_len].to_vec().into_boxed_slice(); + offset += 1 + byte_len; + (chunk_id, Unitig::new((seql % 4) as u8, bytes)) + }) + } + pub fn iter_kmers(&self) -> impl Iterator + '_ { - (0..self.len()).flat_map(move |i| self.unitig(i).into_kmers()) + self.iter_chunks_sequential() + .flat_map(|(_, u)| u.into_kmers()) } - /// Iterate over all canonical kmers in file order. - /// - /// Equivalent to `iter_kmers().map(|km| km.canonical())` but uses the - /// built-in canonical iterator on each chunk, which avoids a separate - /// canonicalization pass. pub fn iter_canonical_kmers(&self) -> impl Iterator + '_ { - (0..self.len()).flat_map(move |i| self.unitig(i).into_canonical_kmers()) + self.iter_chunks_sequential() + .flat_map(|(_, u)| u.into_canonical_kmers()) } - /// Iterate over `(kmer, chunk_id, rank)` for every canonical kmer in the file. - /// - /// `chunk_id` is the index of the chunk within this file; `rank` is the - /// 0-based position of the kmer within that chunk. Used to build the - /// evidence table in `obilayeredmap`. pub fn iter_indexed_canonical_kmers( &self, ) -> impl Iterator + '_ { - (0..self.len()).flat_map(move |chunk_id| { - self.unitig(chunk_id) - .into_canonical_kmers() - .enumerate() - .map(move |(rank, kmer)| (kmer, chunk_id, rank)) - }) + self.iter_chunks_sequential() + .flat_map(|(chunk_id, u)| { + u.into_canonical_kmers() + .enumerate() + .map(move |(rank, kmer)| (kmer, chunk_id, rank)) + }) } } -fn read_idx(path: &Path) -> SKResult<(Vec, Vec, usize)> { +fn read_idx(path: &Path) -> SKResult<(usize, usize, Vec)> { let data = std::fs::read(path).map_err(SKError::Io)?; let mut pos = 0; @@ -260,15 +261,27 @@ fn read_idx(path: &Path) -> SKResult<(Vec, Vec, usize)> { .ok_or(SKError::Truncated { context: "unitig index: magic" })?; if magic_bytes != &MAGIC { return Err(SKError::BadMagic { - expected: "UIDX", + expected: "UIX2", got: magic_bytes.try_into().unwrap(), }); } pos += 4; + // block_size stored for forward-compatibility verification + let bs_bytes = data.get(pos..pos + 4) + .ok_or(SKError::Truncated { context: "unitig index: block_size" })?; + let stored_bs = u32::from_le_bytes(bs_bytes.try_into().unwrap()) as usize; + if stored_bs != BLOCK_SIZE { + return Err(SKError::InvalidData { + context: "unitig index", + detail: format!("block_size mismatch: file={stored_bs} code={BLOCK_SIZE}"), + }); + } + pos += 4; + let n_bytes = data.get(pos..pos + 4) .ok_or(SKError::Truncated { context: "unitig index: n_unitigs" })?; - let n = u32::from_le_bytes(n_bytes.try_into().unwrap()) as usize; + let n_unitigs = u32::from_le_bytes(n_bytes.try_into().unwrap()) as usize; pos += 4; let nk_bytes = data.get(pos..pos + 8) @@ -276,25 +289,21 @@ fn read_idx(path: &Path) -> SKResult<(Vec, Vec, usize)> { let n_kmers = u64::from_le_bytes(nk_bytes.try_into().unwrap()) as usize; pos += 8; - let seqls = data.get(pos..pos + n) - .ok_or(SKError::Truncated { context: "unitig index: seqls" })? - .to_vec(); - pos += n; - - let mut packed_offsets = Vec::with_capacity(n + 1); - for _ in 0..=n { + let n_blocks = (n_unitigs + BLOCK_SIZE - 1) >> LOG2_BLOCK_SIZE; + let n_offsets = n_blocks + 1; // +1 for sentinel + let mut block_offsets = Vec::with_capacity(n_offsets); + for _ in 0..n_offsets { let off_bytes = data.get(pos..pos + 4) - .ok_or(SKError::Truncated { context: "unitig index: packed_offsets" })?; - packed_offsets.push(u32::from_le_bytes(off_bytes.try_into().unwrap())); + .ok_or(SKError::Truncated { context: "unitig index: block_offsets" })?; + block_offsets.push(u32::from_le_bytes(off_bytes.try_into().unwrap())); pos += 4; } - Ok((seqls, packed_offsets, n_kmers)) + Ok((n_unitigs, n_kmers, block_offsets)) } // ── Kmer utilities ──────────────────────────────────────────────────────────── -/// Reverse complement of a left-aligned 2-bit kmer (same algorithm as [`KmerOf::revcomp`]). #[inline] fn revcomp_raw(raw: u64, k: usize) -> u64 { let x = !raw; @@ -304,22 +313,17 @@ fn revcomp_raw(raw: u64, k: usize) -> u64 { x << (64 - 2 * k) } -/// Canonical form of a left-aligned 2-bit kmer: `min(kmer, revcomp(kmer))`. #[inline] fn canonical_raw(raw: u64, k: usize) -> u64 { raw.min(revcomp_raw(raw, k)) } -// ── Bit extraction ──────────────────────────────────────────────────────────── - -/// Extract the kmer at nucleotide position `j` from MSB-first 2-bit packed `bytes`. -/// Returns a left-aligned u64 matching [`KmerOf`]'s internal representation. #[inline] fn extract_kmer_raw(bytes: &[u8], j: usize, k: usize) -> u64 { - let bit_start = j * 2; - let byte_start = bit_start / 8; - let bit_offset = bit_start % 8; // always 0, 2, 4, or 6 - let bytes_needed = (bit_offset + 2 * k + 7) / 8; // ≤ 9 for k ≤ 32 + let bit_start = j * 2; + let byte_start = bit_start / 8; + let bit_offset = bit_start % 8; + let bytes_needed = (bit_offset + 2 * k + 7) / 8; let mut acc = 0u128; for idx in 0..bytes_needed { @@ -327,8 +331,8 @@ fn extract_kmer_raw(bytes: &[u8], j: usize, k: usize) -> u64 { } let shift = bytes_needed * 8 - bit_offset - 2 * k; - let mask = !0u64 >> (64 - 2 * k); - let raw = (acc >> shift) as u64 & mask; + let mask = !0u64 >> (64 - 2 * k); + let raw = (acc >> shift) as u64 & mask; raw << (64 - 2 * k) } -- 2.52.0 From 8478072b7868b54515fcd66c8147ba3bf1b00984 Mon Sep 17 00:00:00 2001 From: Eric Coissac Date: Sat, 23 May 2026 08:06:32 +0200 Subject: [PATCH 02/10] feat: make index granularity configurable via block_bits Replaces the hardcoded BLOCK_SIZE constant with a configurable block_bits parameter, enabling variable index granularity to balance index size and sequential scan cost. Both the reader and writer now store block_bits and a precomputed mask for branchless offset arithmetic, while the index file format is upgraded to UIX3 to persist the configuration. Comprehensive unit tests verify serialization, chunk offset indexing, random access consistency, and kmer count accuracy across various block sizes. --- src/obiskio/src/tests/unitig_index.rs | 91 +++++++++++++++++++++ src/obiskio/src/unitig_index.rs | 110 ++++++++++++++++---------- 2 files changed, 159 insertions(+), 42 deletions(-) diff --git a/src/obiskio/src/tests/unitig_index.rs b/src/obiskio/src/tests/unitig_index.rs index 4223249..89e9a4f 100644 --- a/src/obiskio/src/tests/unitig_index.rs +++ b/src/obiskio/src/tests/unitig_index.rs @@ -269,3 +269,94 @@ fn long_unitig_split_no_kmer_lost() { // First kmer of chunk 1 = original nucl 256..260 — a different, adjacent kmer. assert!(r.verify_canonical_kmer(1, 0, canonical_of(&seq[256..260]))); } + +// ── block_bits parametrisation ──────────────────────────────────────────────── + +fn write_read_bb(seqs: &[&[u8]], block_bits: u8) -> (tempfile::TempDir, UnitigFileReader) { + let dir = tempdir().unwrap(); + let path = dir.path().join("unitigs.bin"); + let mut w = UnitigFileWriter::create_with_block_bits(&path, block_bits).unwrap(); + for s in seqs { + w.write(&make_unitig(s)).unwrap(); + } + w.close().unwrap(); + let r = UnitigFileReader::open(&path).unwrap(); + (dir, r) +} + +#[test] +fn block_bits_stored_and_read_back() { + set_k(4); + for bb in [0u8, 1, 2, 3, 6] { + let dir = tempdir().unwrap(); + let path = dir.path().join("unitigs.bin"); + let w = UnitigFileWriter::create_with_block_bits(&path, bb).unwrap(); + assert_eq!(w.block_bits(), bb); + w.close().unwrap(); + let r = UnitigFileReader::open(&path).unwrap(); + assert_eq!(r.block_bits(), bb, "block_bits={bb} not preserved"); + } +} + +#[test] +fn block_bits_zero_exact_offsets() { + // block_bits=0 → one offset per chunk, no sequential scan in chunk_start + set_k(4); + let seqs: &[&[u8]] = &[b"ACGTACGT", b"TTTTCCCC", b"GGGAAA", b"AAAACG"]; + let (_dir, r) = write_read_bb(seqs, 0); + assert_eq!(r.len(), seqs.len()); + for (i, s) in seqs.iter().enumerate() { + assert_eq!(r.unitig(i), make_unitig(s), "block_bits=0: unitig {i} mismatch"); + } +} + +#[test] +fn block_bits_one_block_size_two() { + // block_bits=1 → BLOCK_SIZE=2: every other chunk gets an offset + set_k(4); + let seqs: &[&[u8]] = &[b"ACGTACGT", b"TTTTCCCC", b"GGGAAA", b"AAAACG", b"CCCCGT"]; + let (_dir, r) = write_read_bb(seqs, 1); + assert_eq!(r.len(), seqs.len()); + for (i, s) in seqs.iter().enumerate() { + assert_eq!(r.unitig(i), make_unitig(s), "block_bits=1: unitig {i} mismatch"); + } +} + +#[test] +fn block_bits_larger_than_chunk_count() { + // block_bits=6 (BLOCK_SIZE=64) with only 3 chunks: all in block 0 + set_k(4); + let seqs: &[&[u8]] = &[b"ACGTACGT", b"TTTTCCCC", b"GGGAAA"]; + let (_dir, r) = write_read_bb(seqs, 6); + assert_eq!(r.len(), seqs.len()); + for (i, s) in seqs.iter().enumerate() { + assert_eq!(r.unitig(i), make_unitig(s), "block_bits=6: unitig {i} mismatch"); + } +} + +#[test] +fn block_bits_random_access_matches_sequential() { + // Write many chunks with block_bits=2 (BLOCK_SIZE=4), verify random access + // against sequential iteration for every chunk. + set_k(4); + let seqs: Vec> = (0..20_usize) + .map(|i| (0..8_usize).map(|j| b"ACGT"[(i + j) % 4]).collect()) + .collect(); + let seq_refs: Vec<&[u8]> = seqs.iter().map(|s| s.as_slice()).collect(); + let (_dir, r) = write_read_bb(&seq_refs, 2); + assert_eq!(r.len(), seqs.len()); + let sequential: Vec = r.iter_chunks_sequential().map(|(_, u)| u).collect(); + for i in 0..seqs.len() { + assert_eq!(r.unitig(i), sequential[i], "random vs sequential mismatch at chunk {i}"); + } +} + +#[test] +fn block_bits_kmer_count_preserved() { + set_k(4); + // "AAAACG" → 3 kmers, "CCCCAG" → 3 kmers; total = 6 + for bb in [0u8, 1, 2, 6] { + let (_dir, r) = write_read_bb(&[b"AAAACG", b"CCCCAG"], bb); + assert_eq!(r.n_kmers(), 6, "block_bits={bb}: n_kmers mismatch"); + } +} diff --git a/src/obiskio/src/unitig_index.rs b/src/obiskio/src/unitig_index.rs index 0fe4444..074dc32 100644 --- a/src/obiskio/src/unitig_index.rs +++ b/src/obiskio/src/unitig_index.rs @@ -11,18 +11,17 @@ use crate::error::{SKError, SKResult}; // ── Block index parameters ──────────────────────────────────────────────────── // -// One offset entry per BLOCK_SIZE chunks. BLOCK_SIZE must be a power of two -// so that block = i >> LOG2_BLOCK_SIZE and rem = i & (BLOCK_SIZE − 1) are -// branchless shifts/masks rather than divisions. +// BLOCK_SIZE = 1 << block_bits chunks share one offset entry in the index. +// block_bits=0 → one entry per chunk (exact offsets, no scan). +// block_bits=6 → one entry per 64 chunks (default; O(64) scan per lookup). // -// With BLOCK_SIZE = 64 and an average chunk size of ~10 bytes, a random lookup -// scans at most 63 × 10 = 630 bytes sequentially — negligible next to the MPHF -// lookup that precedes it. The index file shrinks from ~5 bytes/chunk to -// ~1/64 bytes/chunk (≈ 300× for typical workloads). +// block_bits is stored in the index file so the reader derives all parameters +// at runtime — no compile-time constant constrains the format. -const MAGIC: [u8; 4] = *b"UIX2"; -const BLOCK_SIZE: usize = 64; -const LOG2_BLOCK_SIZE: u32 = 6; // 2^6 = BLOCK_SIZE +const MAGIC: [u8; 4] = *b"UIX3"; + +/// Default block granularity used by [`UnitigFileWriter::create`]. +pub const DEFAULT_BLOCK_BITS: u8 = 6; fn idx_path(path: &Path) -> PathBuf { crate::append_path_suffix(path, ".idx") @@ -33,23 +32,36 @@ fn idx_path(path: &Path) -> PathBuf { /// Writes a sequence of [`Unitig`] to an uncompressed binary file and builds /// a block-sampled offset index at close time. /// -/// One offset is stored every [`BLOCK_SIZE`] chunks; random access to chunk `i` -/// costs at most `BLOCK_SIZE − 1` sequential chunk scans after the block lookup. +/// One offset is stored every `1 << block_bits` chunks; random access to chunk +/// `i` costs at most `(1 << block_bits) − 1` sequential chunk scans after the +/// block lookup. /// /// Unitigs with more than [`MAX_KMERS_PER_CHUNK`] k-mers are transparently split /// into overlapping chunks (k−1 nucleotide overlap) so no k-mer is lost. pub struct UnitigFileWriter { path: PathBuf, file: BufWriter, - block_offsets: Vec, // byte offset of first record in each block + block_offsets: Vec, chunk_count: usize, - next_offset: u32, // byte offset of the START of the next record + next_offset: u32, n_kmers: usize, k: usize, + block_bits: u8, + mask: usize, // (1 << block_bits) - 1 } impl UnitigFileWriter { + /// Create a writer with the default block size (`DEFAULT_BLOCK_BITS = 6`). pub fn create(path: &Path) -> SKResult { + Self::create_with_block_bits(path, DEFAULT_BLOCK_BITS) + } + + /// Create a writer with a custom block size. + /// + /// `block_bits` must be in 0..=31. `block_bits=0` stores one offset per + /// chunk (exact, no scan); larger values trade index size for scan length. + pub fn create_with_block_bits(path: &Path, block_bits: u8) -> SKResult { + assert!(block_bits <= 31, "block_bits must be ≤ 31"); let file = File::create(path).map_err(SKError::Io)?; Ok(Self { path: path.to_owned(), @@ -59,6 +71,8 @@ impl UnitigFileWriter { next_offset: 0, n_kmers: 0, k: obikseq::params::k(), + block_bits, + mask: (1usize << block_bits) - 1, }) } @@ -95,8 +109,7 @@ impl UnitigFileWriter { debug_assert!(seql - self.k <= u8::MAX as usize, "chunk exceeds MAX_KMERS_PER_CHUNK"); - // Record a block offset at the start of every BLOCK_SIZE-th chunk. - if self.chunk_count & (BLOCK_SIZE - 1) == 0 { + if self.chunk_count & self.mask == 0 { self.block_offsets.push(self.next_offset); } @@ -113,25 +126,32 @@ impl UnitigFileWriter { self.file.flush().map_err(SKError::Io)?; drop(self.file); - // Sentinel: byte offset past the last record (needed for end-of-file detection). self.block_offsets.push(self.next_offset); write_idx( &idx_path(&self.path), self.chunk_count as u32, self.n_kmers as u64, + self.block_bits, &self.block_offsets, ) } pub fn len(&self) -> usize { self.chunk_count } pub fn is_empty(&self) -> bool { self.chunk_count == 0 } + pub fn block_bits(&self) -> u8 { self.block_bits } } -fn write_idx(path: &Path, n_unitigs: u32, n_kmers: u64, block_offsets: &[u32]) -> SKResult<()> { +fn write_idx( + path: &Path, + n_unitigs: u32, + n_kmers: u64, + block_bits: u8, + block_offsets: &[u32], +) -> SKResult<()> { let mut w = BufWriter::new(File::create(path).map_err(SKError::Io)?); w.write_all(&MAGIC).map_err(SKError::Io)?; - w.write_all(&(BLOCK_SIZE as u32).to_le_bytes()).map_err(SKError::Io)?; + w.write_all(&(block_bits as u32).to_le_bytes()).map_err(SKError::Io)?; w.write_all(&n_unitigs.to_le_bytes()).map_err(SKError::Io)?; w.write_all(&n_kmers.to_le_bytes()).map_err(SKError::Io)?; for &off in block_offsets { @@ -145,39 +165,45 @@ fn write_idx(path: &Path, n_unitigs: u32, n_kmers: u64, block_offsets: &[u32]) - /// Read-only random-access view of a unitig file. /// /// The sequence file is memory-mapped; the block offset table is loaded into RAM -/// on open (≈ n_chunks / BLOCK_SIZE entries, negligible memory). -/// -/// Random access to chunk `i`: O(BLOCK_SIZE) sequential mmap reads — branchless -/// shift/mask arithmetic, cache-friendly, negligible versus the MPHF lookup. -/// -/// Sequential iteration: O(n) via a running-offset cursor (no per-chunk overhead). +/// on open. Random access to chunk `i`: O(1 << block_bits) sequential mmap +/// reads. Sequential iteration: O(n) via a running-offset cursor. pub struct UnitigFileReader { mmap: Mmap, block_offsets: Vec, n_unitigs: usize, n_kmers: usize, k: usize, + block_bits: u8, + mask: usize, // (1 << block_bits) - 1 } impl UnitigFileReader { pub fn open(path: &Path) -> SKResult { let file = File::open(path).map_err(SKError::Io)?; let mmap = unsafe { Mmap::map(&file).map_err(SKError::Io)? }; - let (n_unitigs, n_kmers, block_offsets) = read_idx(&idx_path(path))?; + let (n_unitigs, n_kmers, block_bits, block_offsets) = read_idx(&idx_path(path))?; let k = obikseq::params::k(); - Ok(Self { mmap, block_offsets, n_unitigs, n_kmers, k }) + Ok(Self { + mmap, + block_offsets, + n_unitigs, + n_kmers, + k, + block_bits, + mask: (1usize << block_bits) - 1, + }) } pub fn len(&self) -> usize { self.n_unitigs } pub fn is_empty(&self) -> bool { self.n_unitigs == 0 } pub fn n_kmers(&self) -> usize { self.n_kmers } + pub fn block_bits(&self) -> u8 { self.block_bits } /// Byte offset of the START of record `i` (the seql byte) in the mmap. - /// O(BLOCK_SIZE) sequential scan within the block. #[inline] fn chunk_start(&self, i: usize) -> usize { - let block = i >> LOG2_BLOCK_SIZE; - let rem = i & (BLOCK_SIZE - 1); + let block = i >> self.block_bits; + let rem = i & self.mask; let mut offset = self.block_offsets[block] as usize; for _ in 0..rem { let seql_minus_k = self.mmap[offset] as usize; @@ -216,7 +242,6 @@ impl UnitigFileReader { // ── Sequential iterators (O(n) running-offset cursor) ───────────────────── - /// Iterate all chunks in file order with a running byte offset — O(n) total. fn iter_chunks_sequential(&self) -> impl Iterator + '_ { let k = self.k; let mmap = &*self.mmap; @@ -253,7 +278,7 @@ impl UnitigFileReader { } } -fn read_idx(path: &Path) -> SKResult<(usize, usize, Vec)> { +fn read_idx(path: &Path) -> SKResult<(usize, usize, u8, Vec)> { let data = std::fs::read(path).map_err(SKError::Io)?; let mut pos = 0; @@ -261,22 +286,22 @@ fn read_idx(path: &Path) -> SKResult<(usize, usize, Vec)> { .ok_or(SKError::Truncated { context: "unitig index: magic" })?; if magic_bytes != &MAGIC { return Err(SKError::BadMagic { - expected: "UIX2", + expected: "UIX3", got: magic_bytes.try_into().unwrap(), }); } pos += 4; - // block_size stored for forward-compatibility verification - let bs_bytes = data.get(pos..pos + 4) - .ok_or(SKError::Truncated { context: "unitig index: block_size" })?; - let stored_bs = u32::from_le_bytes(bs_bytes.try_into().unwrap()) as usize; - if stored_bs != BLOCK_SIZE { + let bb_bytes = data.get(pos..pos + 4) + .ok_or(SKError::Truncated { context: "unitig index: block_bits" })?; + let block_bits_u32 = u32::from_le_bytes(bb_bytes.try_into().unwrap()); + if block_bits_u32 > 31 { return Err(SKError::InvalidData { context: "unitig index", - detail: format!("block_size mismatch: file={stored_bs} code={BLOCK_SIZE}"), + detail: format!("block_bits out of range: {block_bits_u32}"), }); } + let block_bits = block_bits_u32 as u8; pos += 4; let n_bytes = data.get(pos..pos + 4) @@ -289,8 +314,9 @@ fn read_idx(path: &Path) -> SKResult<(usize, usize, Vec)> { let n_kmers = u64::from_le_bytes(nk_bytes.try_into().unwrap()) as usize; pos += 8; - let n_blocks = (n_unitigs + BLOCK_SIZE - 1) >> LOG2_BLOCK_SIZE; - let n_offsets = n_blocks + 1; // +1 for sentinel + let block_size = 1usize << block_bits; + let n_blocks = (n_unitigs + block_size - 1) >> block_bits; + let n_offsets = n_blocks + 1; let mut block_offsets = Vec::with_capacity(n_offsets); for _ in 0..n_offsets { let off_bytes = data.get(pos..pos + 4) @@ -299,7 +325,7 @@ fn read_idx(path: &Path) -> SKResult<(usize, usize, Vec)> { pos += 4; } - Ok((n_unitigs, n_kmers, block_offsets)) + Ok((n_unitigs, n_kmers, block_bits, block_offsets)) } // ── Kmer utilities ──────────────────────────────────────────────────────────── -- 2.52.0 From 24afd74e2fef0cc6ba7ceee3e6b239c3cf198b38 Mon Sep 17 00:00:00 2001 From: Eric Coissac Date: Sat, 23 May 2026 08:26:52 +0200 Subject: [PATCH 03/10] refactor: decouple unitig index generation and add exact evidence Decouple index generation by introducing `build_unitig_idx()` for retroactive `.idx` creation and optional immediate writing on close. Add `open_sequential()` for index-less iteration while enforcing index requirements for random access. Refactor the MPHF layer to pre-generate the unitig index for parallel random access, integrate `rayon` for k-mer processing, and enforce mapping integrity via duplicate slot validation. Additionally, implement `build_exact_evidence()` to reconstruct evidence from existing artifacts, and update tests to leverage the new index generation and simplified k-mer iteration helpers. --- src/obilayeredmap/src/layer.rs | 12 +++- src/obilayeredmap/src/mphf_layer.rs | 72 ++++++++++++++++++--- src/obilayeredmap/src/tests/layer.rs | 18 ++---- src/obiskio/src/lib.rs | 2 +- src/obiskio/src/tests/unitig_index.rs | 5 ++ src/obiskio/src/unitig_index.rs | 93 +++++++++++++++++++++++---- 6 files changed, 167 insertions(+), 35 deletions(-) diff --git a/src/obilayeredmap/src/layer.rs b/src/obilayeredmap/src/layer.rs index 9f41675..a25ee62 100644 --- a/src/obilayeredmap/src/layer.rs +++ b/src/obilayeredmap/src/layer.rs @@ -76,6 +76,14 @@ impl Layer { pub fn unitig_writer(out_dir: &Path) -> OLMResult { MphfLayer::unitig_writer(out_dir) } + + /// Build `unitigs.bin.idx` and `evidence.bin` from `unitigs.bin` and + /// `mphf.bin` already present in `layer_dir`. + /// + /// See [`MphfLayer::build_exact_evidence`] for the full contract. + pub fn build_exact_evidence(layer_dir: &Path) -> OLMResult { + MphfLayer::build_exact_evidence(layer_dir) + } } // ── Mode 1 — set membership ─────────────────────────────────────────────────── @@ -106,7 +114,7 @@ impl Layer<()> { impl Layer { pub fn build(out_dir: &Path, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult { - let n = UnitigFileReader::open(&out_dir.join(UNITIGS_FILE))?.n_kmers(); + let n = UnitigFileReader::open_sequential(&out_dir.join(UNITIGS_FILE))?.n_kmers(); let counts_dir = out_dir.join(COUNTS_DIR); let mut mb = PersistentCompactIntMatrixBuilder::new(n, &counts_dir) .map_err(OLMError::Io)?; @@ -158,7 +166,7 @@ impl Layer { n_genomes: usize, present_in: impl Fn(CanonicalKmer, usize) -> bool, ) -> OLMResult { - let n = UnitigFileReader::open(&out_dir.join(UNITIGS_FILE))?.n_kmers(); + let n = UnitigFileReader::open_sequential(&out_dir.join(UNITIGS_FILE))?.n_kmers(); let presence_dir = out_dir.join(PRESENCE_DIR); let mut mb = PersistentBitMatrixBuilder::new(n, &presence_dir).map_err(OLMError::Io)?; let mut cols: Vec<_> = (0..n_genomes) diff --git a/src/obilayeredmap/src/mphf_layer.rs b/src/obilayeredmap/src/mphf_layer.rs index 0844237..1713bce 100644 --- a/src/obilayeredmap/src/mphf_layer.rs +++ b/src/obilayeredmap/src/mphf_layer.rs @@ -4,7 +4,7 @@ use std::path::Path; use cacheline_ef::{CachelineEf, CachelineEfVec}; use epserde::prelude::*; use obikseq::CanonicalKmer; -use obiskio::{UnitigFileReader, UnitigFileWriter}; +use obiskio::{UnitigFileReader, UnitigFileWriter, build_unitig_idx, DEFAULT_BLOCK_BITS}; use ptr_hash::{PtrHash, PtrHashParams, bucket_fn::CubicEps, hash::Xx64}; use crate::error::{OLMError, OLMResult}; @@ -41,7 +41,6 @@ impl MphfLayer { #[inline] pub fn find(&self, kmer: CanonicalKmer) -> Option { let slot = self.mphf.index(&kmer.raw()); - // PtrHash guarantees slot < n only for its key set; arbitrary queries may exceed bounds. if slot >= self.n { return None; } let (chunk_id, rank) = self.evidence.decode(slot); if self.unitigs.verify_canonical_kmer(chunk_id as usize, rank as usize, kmer) { @@ -58,16 +57,73 @@ impl MphfLayer { Ok(UnitigFileWriter::create(&dir.join(UNITIGS_FILE))?) } - /// Builds the MPHF and evidence from the unitigs file already present in `dir`. - /// Calls `fill_slot(slot, kmer)` once per kmer (second pass) for DataStore population. - /// Returns the number of kmers indexed. + /// Build `unitigs.bin.idx` and `evidence.bin` from `unitigs.bin` and + /// `mphf.bin` already present in `dir`. + /// + /// This is the exact-evidence construction route. It can be called: + /// - after the initial build (via [`Self::build`] which calls it internally) + /// - standalone to promote an existing (unitigs + mphf) into an exact index + /// - standalone to rebuild evidence after a format change + /// Build `unitigs.bin.idx` and `evidence.bin` from `unitigs.bin` and + /// `mphf.bin` already present in `dir`. + /// + /// Uses sequential iteration — no `.idx` is required on entry. + /// Writes both `evidence.bin` and `unitigs.bin.idx` on success. + pub fn build_exact_evidence(dir: &Path) -> OLMResult { + let unitig_path = dir.join(UNITIGS_FILE); + + // Sequential scan — no .idx required for iteration + let unitigs = UnitigFileReader::open_sequential(&unitig_path)?; + let n = unitigs.n_kmers(); + + if n == 0 { + fs::File::create(dir.join(EVIDENCE_FILE))?; + build_unitig_idx(&unitig_path, DEFAULT_BLOCK_BITS)?; + return Ok(0); + } + + let mphf: Mphf = Mphf::load_full(&dir.join(MPHF_FILE)) + .map_err(|e| OLMError::InvalidLayer(e.to_string()))?; + + let mut ev = EvidenceWriter::new(n); + let mut seen = vec![0u8; (n + 7) / 8]; + + for (kmer, chunk_id, rank) in unitigs.iter_indexed_canonical_kmers() { + let slot = mphf.index(&kmer.raw()); + if slot >= n { + return Err(OLMError::Mphf("slot out of bounds".into())); + } + let byte = slot / 8; + let bit = 1u8 << (slot % 8); + if seen[byte] & bit != 0 { + return Err(OLMError::Mphf("duplicate slot".into())); + } + seen[byte] |= bit; + ev.set(slot, chunk_id as u32, rank as u8); + } + + ev.write(&dir.join(EVIDENCE_FILE))?; + // Write .idx last: it is only needed for random access (queries). + build_unitig_idx(&unitig_path, DEFAULT_BLOCK_BITS)?; + Ok(n) + } + + /// Build MPHF and exact evidence from the unitigs file already present in + /// `dir`. Calls `fill_slot(slot, kmer)` once per kmer for DataStore + /// population. Returns the number of kmers indexed. pub(crate) fn build( dir: &Path, fill_slot: &mut impl FnMut(usize, CanonicalKmer) -> OLMResult<()>, ) -> OLMResult { use rayon::prelude::*; - let unitigs = UnitigFileReader::open(&dir.join(UNITIGS_FILE))?; + let unitig_path = dir.join(UNITIGS_FILE); + + // Write .idx so that UnitigFileReader::open succeeds and parallel + // random access is available for MPHF construction. + build_unitig_idx(&unitig_path, DEFAULT_BLOCK_BITS)?; + + let unitigs = UnitigFileReader::open(&unitig_path)?; let n = unitigs.n_kmers(); if n == 0 { @@ -80,7 +136,7 @@ impl MphfLayer { return Ok(0); } - // Pass 1 — build MPHF + // Pass 1 — build MPHF (parallel, uses random access via .idx) let keys = (0..unitigs.len()) .into_par_iter() .flat_map_iter(|ci| unitigs.unitig(ci).into_canonical_kmers().map(|km| km.raw())); @@ -90,7 +146,7 @@ impl MphfLayer { .map_err(|e| OLMError::InvalidLayer(e.to_string()))?; // Pass 2 — fill evidence + mode-specific data via callback - let unitigs2 = UnitigFileReader::open(&dir.join(UNITIGS_FILE))?; + let unitigs2 = UnitigFileReader::open(&unitig_path)?; let mut ev = EvidenceWriter::new(n); let mut seen = vec![0u8; (n + 7) / 8]; diff --git a/src/obilayeredmap/src/tests/layer.rs b/src/obilayeredmap/src/tests/layer.rs index 7b0ec32..47f6e82 100644 --- a/src/obilayeredmap/src/tests/layer.rs +++ b/src/obilayeredmap/src/tests/layer.rs @@ -11,16 +11,10 @@ fn write_unitigs(dir: &Path, seqs: &[&[u8]]) { w.close().unwrap(); } -fn all_canonical_kmers(dir: &Path, k: usize) -> Vec { - let r = UnitigFileReader::open(&dir.join(UNITIGS_FILE)).unwrap(); - let mut out = Vec::new(); - for ci in 0..r.len() { - let n = r.seql(ci) - k + 1; - for rank in 0..n { - out.push(Kmer::from_raw(r.raw_kmer(ci, rank)).canonical()); - } - } - out +fn all_canonical_kmers(dir: &Path) -> Vec { + UnitigFileReader::open_sequential(&dir.join(UNITIGS_FILE)).unwrap() + .iter_canonical_kmers() + .collect() } #[test] @@ -28,7 +22,7 @@ fn build_and_query_all_kmers_found() { set_k(4); let dir = tempdir().unwrap(); write_unitigs(dir.path(), &[b"AAAACGT"]); - let kmers = all_canonical_kmers(dir.path(), 4); + let kmers = all_canonical_kmers(dir.path()); Layer::<()>::build(dir.path()).unwrap(); let layer = Layer::<()>::open(dir.path()).unwrap(); for kmer in kmers { @@ -41,7 +35,7 @@ fn counts_are_stored_and_retrieved() { set_k(4); let dir = tempdir().unwrap(); write_unitigs(dir.path(), &[b"AAAACGT"]); - let kmers = all_canonical_kmers(dir.path(), 4); + let kmers = all_canonical_kmers(dir.path()); let count_map: HashMap = kmers.iter().enumerate().map(|(i, &k)| (k, i as u32 + 1)).collect(); Layer::::build( diff --git a/src/obiskio/src/lib.rs b/src/obiskio/src/lib.rs index 8f32ca2..1d960cc 100644 --- a/src/obiskio/src/lib.rs +++ b/src/obiskio/src/lib.rs @@ -8,7 +8,7 @@ pub use error::{SKError, SKResult}; pub use meta::SKFileMeta; pub use pool::{SKFilePool, SKFileWriter, SharedPool, create_token, create_token_with}; pub use reader::{SKFileIter, SKFileReader}; -pub use unitig_index::{UnitigFileReader, UnitigFileWriter}; +pub use unitig_index::{UnitigFileReader, UnitigFileWriter, build_unitig_idx, DEFAULT_BLOCK_BITS}; use std::path::{Path, PathBuf}; diff --git a/src/obiskio/src/tests/unitig_index.rs b/src/obiskio/src/tests/unitig_index.rs index 89e9a4f..5553357 100644 --- a/src/obiskio/src/tests/unitig_index.rs +++ b/src/obiskio/src/tests/unitig_index.rs @@ -18,6 +18,7 @@ fn write_read(seqs: &[&[u8]]) -> (tempfile::TempDir, UnitigFileReader) { w.write(&make_unitig(s)).unwrap(); } w.close().unwrap(); + super::build_unitig_idx(&path, super::DEFAULT_BLOCK_BITS).unwrap(); let r = UnitigFileReader::open(&path).unwrap(); (dir, r) } @@ -31,6 +32,7 @@ fn roundtrip_empty_index() { let path = dir.path().join("unitigs.bin"); let w = UnitigFileWriter::create(&path).unwrap(); w.close().unwrap(); + super::build_unitig_idx(&path, super::DEFAULT_BLOCK_BITS).unwrap(); let r = UnitigFileReader::open(&path).unwrap(); assert_eq!(r.len(), 0); } @@ -145,6 +147,7 @@ fn iter_kmers_empty_file() { let dir = tempdir().unwrap(); let path = dir.path().join("unitigs.bin"); UnitigFileWriter::create(&path).unwrap().close().unwrap(); + super::build_unitig_idx(&path, super::DEFAULT_BLOCK_BITS).unwrap(); let r = UnitigFileReader::open(&path).unwrap(); assert_eq!(r.iter_kmers().count(), 0); } @@ -280,6 +283,7 @@ fn write_read_bb(seqs: &[&[u8]], block_bits: u8) -> (tempfile::TempDir, UnitigFi w.write(&make_unitig(s)).unwrap(); } w.close().unwrap(); + super::build_unitig_idx(&path, block_bits).unwrap(); let r = UnitigFileReader::open(&path).unwrap(); (dir, r) } @@ -293,6 +297,7 @@ fn block_bits_stored_and_read_back() { let w = UnitigFileWriter::create_with_block_bits(&path, bb).unwrap(); assert_eq!(w.block_bits(), bb); w.close().unwrap(); + super::build_unitig_idx(&path, bb).unwrap(); let r = UnitigFileReader::open(&path).unwrap(); assert_eq!(r.block_bits(), bb, "block_bits={bb} not preserved"); } diff --git a/src/obiskio/src/unitig_index.rs b/src/obiskio/src/unitig_index.rs index 074dc32..75535ae 100644 --- a/src/obiskio/src/unitig_index.rs +++ b/src/obiskio/src/unitig_index.rs @@ -39,7 +39,6 @@ fn idx_path(path: &Path) -> PathBuf { /// Unitigs with more than [`MAX_KMERS_PER_CHUNK`] k-mers are transparently split /// into overlapping chunks (k−1 nucleotide overlap) so no k-mer is lost. pub struct UnitigFileWriter { - path: PathBuf, file: BufWriter, block_offsets: Vec, chunk_count: usize, @@ -64,7 +63,6 @@ impl UnitigFileWriter { assert!(block_bits <= 31, "block_bits must be ≤ 31"); let file = File::create(path).map_err(SKError::Io)?; Ok(Self { - path: path.to_owned(), file: BufWriter::new(file), block_offsets: Vec::new(), chunk_count: 0, @@ -122,19 +120,14 @@ impl UnitigFileWriter { Ok(()) } + /// Flush and close the binary sequence file. + /// + /// The companion `.idx` file is **not** written here; call + /// [`build_unitig_idx`] separately when exact evidence is needed. pub fn close(mut self) -> SKResult<()> { self.file.flush().map_err(SKError::Io)?; drop(self.file); - - self.block_offsets.push(self.next_offset); - - write_idx( - &idx_path(&self.path), - self.chunk_count as u32, - self.n_kmers as u64, - self.block_bits, - &self.block_offsets, - ) + Ok(()) } pub fn len(&self) -> usize { self.chunk_count } @@ -160,6 +153,48 @@ fn write_idx( w.flush().map_err(SKError::Io) } +/// Scan an existing `unitigs.bin` file and write its companion `.idx`. +/// +/// Called by the exact-evidence construction route after the sequence file is +/// closed. `block_bits` controls index granularity (1 << block_bits chunks per +/// offset entry); use [`DEFAULT_BLOCK_BITS`] for the default. +pub fn build_unitig_idx(unitigs_path: &Path, block_bits: u8) -> SKResult<()> { + assert!(block_bits <= 31, "block_bits must be ≤ 31"); + + let file = File::open(unitigs_path).map_err(SKError::Io)?; + let mmap = unsafe { Mmap::map(&file).map_err(SKError::Io)? }; + + let k = obikseq::params::k(); + let block_size = 1usize << block_bits; + let mask = block_size - 1; + + let mut block_offsets: Vec = Vec::new(); + let mut offset = 0usize; + let mut chunk_count = 0usize; + let mut n_kmers = 0usize; + + while offset < mmap.len() { + if chunk_count & mask == 0 { + block_offsets.push(offset as u32); + } + let seql_minus_k = mmap[offset] as usize; + let byte_len = (seql_minus_k + k + 3) / 4; + n_kmers += seql_minus_k + 1; + offset += 1 + byte_len; + chunk_count += 1; + } + + block_offsets.push(offset as u32); // sentinel + + write_idx( + &idx_path(unitigs_path), + chunk_count as u32, + n_kmers as u64, + block_bits, + &block_offsets, + ) +} + // ── Reader ──────────────────────────────────────────────────────────────────── /// Read-only random-access view of a unitig file. @@ -178,6 +213,7 @@ pub struct UnitigFileReader { } impl UnitigFileReader { + /// Open with `.idx` — enables both sequential iteration and random access. pub fn open(path: &Path) -> SKResult { let file = File::open(path).map_err(SKError::Io)?; let mmap = unsafe { Mmap::map(&file).map_err(SKError::Io)? }; @@ -194,6 +230,37 @@ impl UnitigFileReader { }) } + /// Open without `.idx` — sequential iteration only, no random access. + /// + /// Scans the binary file once to count chunks and k-mers. Use when only + /// [`Self::iter_kmers`], [`Self::iter_canonical_kmers`], or + /// [`Self::iter_indexed_canonical_kmers`] are needed. + pub fn open_sequential(path: &Path) -> SKResult { + let file = File::open(path).map_err(SKError::Io)?; + let mmap = unsafe { Mmap::map(&file).map_err(SKError::Io)? }; + let k = obikseq::params::k(); + + let mut offset = 0usize; + let mut n_unitigs = 0usize; + let mut n_kmers = 0usize; + while offset < mmap.len() { + let seql_minus_k = mmap[offset] as usize; + n_kmers += seql_minus_k + 1; + offset += 1 + (seql_minus_k + k + 3) / 4; + n_unitigs += 1; + } + + Ok(Self { + mmap, + block_offsets: Vec::new(), // empty → random access disabled + n_unitigs, + n_kmers, + k, + block_bits: DEFAULT_BLOCK_BITS, + mask: (1usize << DEFAULT_BLOCK_BITS) - 1, + }) + } + pub fn len(&self) -> usize { self.n_unitigs } pub fn is_empty(&self) -> bool { self.n_unitigs == 0 } pub fn n_kmers(&self) -> usize { self.n_kmers } @@ -202,6 +269,8 @@ impl UnitigFileReader { /// Byte offset of the START of record `i` (the seql byte) in the mmap. #[inline] fn chunk_start(&self, i: usize) -> usize { + assert!(!self.block_offsets.is_empty(), + "random access requires UnitigFileReader::open(); use open_sequential() for iteration only"); let block = i >> self.block_bits; let rem = i & self.mask; let mut offset = self.block_offsets[block] as usize; -- 2.52.0 From e1dab86daf06b87cc4ba0100236f29d114ac42ec Mon Sep 17 00:00:00 2001 From: Eric Coissac Date: Sat, 23 May 2026 08:48:59 +0200 Subject: [PATCH 04/10] feat: add approximate kmer fingerprinting with memory-mapped storage Introduce a new `fingerprint` module that stores packed B-bit vectors via memory-mapped files. Expose the module publicly and add `build_approx_evidence` to `Layer` and `MphfLayer` for generating compact `fingerprint.bin` files. Implement `find_approx` for fast, probabilistic kmer lookups using configurable bit-widths. Update dependencies to `bitvec` v1 and add `cacheline-ef`, `epserde`, and `memmap2` to support the new storage and serialization logic. --- src/Cargo.lock | 1 + src/obilayeredmap/Cargo.toml | 1 + src/obilayeredmap/src/fingerprint.rs | 151 +++++++++++++++++++++++++++ src/obilayeredmap/src/layer.rs | 8 ++ src/obilayeredmap/src/lib.rs | 1 + src/obilayeredmap/src/mphf_layer.rs | 54 +++++++++- 6 files changed, 213 insertions(+), 3 deletions(-) create mode 100644 src/obilayeredmap/src/fingerprint.rs diff --git a/src/Cargo.lock b/src/Cargo.lock index 93e5039..bc710eb 100644 --- a/src/Cargo.lock +++ b/src/Cargo.lock @@ -1592,6 +1592,7 @@ dependencies = [ name = "obilayeredmap" version = "0.1.0" dependencies = [ + "bitvec", "cacheline-ef", "epserde", "memmap2", diff --git a/src/obilayeredmap/Cargo.toml b/src/obilayeredmap/Cargo.toml index 1bb3d94..2f94494 100644 --- a/src/obilayeredmap/Cargo.toml +++ b/src/obilayeredmap/Cargo.toml @@ -12,6 +12,7 @@ cacheline-ef = "1.1" epserde = "0.8" rayon = "1" ndarray = "0.16" +bitvec = "1" memmap2 = "0.9" serde = { version = "1", features = ["derive"] } serde_json = "1" diff --git a/src/obilayeredmap/src/fingerprint.rs b/src/obilayeredmap/src/fingerprint.rs new file mode 100644 index 0000000..ff0701e --- /dev/null +++ b/src/obilayeredmap/src/fingerprint.rs @@ -0,0 +1,151 @@ +// Packed B-bit fingerprint vector, one entry per MPHF slot. +// +// File format (fingerprint.bin): +// magic: b"FPVF" (4 bytes) +// b: u8 (bits per fingerprint, 1..=64) +// padding: [0u8; 3] +// n: u64 LE (number of slots) +// data: packed bits, ceil(n*b/8) bytes, Lsb0 order + +use std::fs::File; +use std::io::{BufWriter, Write}; +use std::path::Path; + +use bitvec::prelude::*; +use memmap2::Mmap; + +use crate::error::{OLMError, OLMResult}; + +const MAGIC: &[u8; 4] = b"FPVF"; +const HEADER: usize = 16; + +// ── Reader ──────────────────────────────────────────────────────────────────── + +pub struct FingerprintVec { + _mmap: Mmap, + bits: &'static BitSlice, + n: usize, + b: u8, + mask: u64, +} + +impl FingerprintVec { + pub fn open(path: &Path) -> OLMResult { + let f = File::open(path)?; + let mmap = unsafe { Mmap::map(&f)? }; + if mmap.len() < HEADER || &mmap[..4] != MAGIC { + return Err(OLMError::InvalidLayer("bad fingerprint magic".into())); + } + let b = mmap[4]; + if b == 0 || b > 64 { + return Err(OLMError::InvalidLayer("invalid fingerprint width".into())); + } + let n = u64::from_le_bytes(mmap[8..16].try_into().unwrap()) as usize; + let mask: u64 = if b == 64 { u64::MAX } else { (1u64 << b) - 1 }; + // SAFETY: the mmap lives as long as Self (kept via _mmap); data is read-only. + let data: &'static [u8] = unsafe { + std::slice::from_raw_parts(mmap[HEADER..].as_ptr(), (n * b as usize + 7) / 8) + }; + let bits = BitSlice::::from_slice(data); + Ok(Self { _mmap: mmap, bits, n, b, mask }) + } + + #[inline] + pub fn get(&self, slot: usize) -> u64 { + debug_assert!(slot < self.n); + let lo = slot * self.b as usize; + self.bits[lo .. lo + self.b as usize].load_le::() + } + + #[inline] + pub fn matches(&self, slot: usize, fingerprint: u64) -> bool { + self.get(slot) == (fingerprint & self.mask) + } + + pub fn n(&self) -> usize { self.n } + pub fn b(&self) -> u8 { self.b } +} + +// ── Writer ──────────────────────────────────────────────────────────────────── + +pub struct FingerprintVecWriter { + buf: Vec, + n: usize, + b: u8, +} + +impl FingerprintVecWriter { + pub fn new(n: usize, b: u8) -> Self { + assert!(b > 0 && b <= 64, "fingerprint width must be 1..=64"); + let data_bytes = (n * b as usize + 7) / 8; + Self { buf: vec![0u8; data_bytes], n, b } + } + + #[inline] + pub fn set(&mut self, slot: usize, fingerprint: u64) { + debug_assert!(slot < self.n); + let lo = slot * self.b as usize; + let bits = BitSlice::::from_slice_mut(&mut self.buf); + bits[lo .. lo + self.b as usize].store_le(fingerprint); + } + + pub fn write(self, path: &Path) -> OLMResult<()> { + let mut f = BufWriter::new(File::create(path).map_err(OLMError::Io)?); + f.write_all(MAGIC).map_err(OLMError::Io)?; + f.write_all(&[self.b, 0, 0, 0]).map_err(OLMError::Io)?; + f.write_all(&(self.n as u64).to_le_bytes()).map_err(OLMError::Io)?; + f.write_all(&self.buf).map_err(OLMError::Io)?; + Ok(()) + } +} + +// ── Tests ───────────────────────────────────────────────────────────────────── + +#[cfg(test)] +mod tests { + use super::*; + use tempfile::tempdir; + + fn roundtrip(n: usize, b: u8, values: &[u64]) { + let dir = tempdir().unwrap(); + let path = dir.path().join("fp.bin"); + let mask: u64 = if b == 64 { u64::MAX } else { (1u64 << b) - 1 }; + let mut w = FingerprintVecWriter::new(n, b); + for (i, &v) in values.iter().enumerate() { + w.set(i, v); + } + w.write(&path).unwrap(); + let r = FingerprintVec::open(&path).unwrap(); + assert_eq!(r.n(), n); + assert_eq!(r.b(), b); + for (i, &v) in values.iter().enumerate() { + assert_eq!(r.get(i), v & mask, "slot {i} b={b}"); + } + } + + #[test] + fn roundtrip_b1() { roundtrip(8, 1, &[1, 0, 1, 1, 0, 0, 1, 0]); } + #[test] + fn roundtrip_b8() { roundtrip(4, 8, &[0, 127, 200, 255]); } + #[test] + fn roundtrip_b16() { roundtrip(3, 16, &[0, 0xABCD, 0xFFFF]); } + #[test] + fn roundtrip_b7() { roundtrip(5, 7, &[0, 63, 127, 1, 42]); } + #[test] + fn roundtrip_b13_unaligned() { + let vals: Vec = (0..20).map(|i| (i * 317) % (1 << 13)).collect(); + roundtrip(20, 13, &vals); + } + #[test] + fn matches_returns_true_for_exact() { + let dir = tempdir().unwrap(); + let path = dir.path().join("fp.bin"); + let mut w = FingerprintVecWriter::new(4, 8); + w.set(0, 42); w.set(1, 0); w.set(2, 255); w.set(3, 17); + w.write(&path).unwrap(); + let r = FingerprintVec::open(&path).unwrap(); + assert!( r.matches(0, 42)); + assert!(!r.matches(0, 43)); + assert!( r.matches(2, 255)); + } +} diff --git a/src/obilayeredmap/src/layer.rs b/src/obilayeredmap/src/layer.rs index a25ee62..b67709f 100644 --- a/src/obilayeredmap/src/layer.rs +++ b/src/obilayeredmap/src/layer.rs @@ -84,6 +84,14 @@ impl Layer { pub fn build_exact_evidence(layer_dir: &Path) -> OLMResult { MphfLayer::build_exact_evidence(layer_dir) } + + /// Build `fingerprint.bin` from `unitigs.bin` and `mphf.bin` already + /// present in `layer_dir`. `b` is the fingerprint width in bits (1..=64). + /// + /// See [`MphfLayer::build_approx_evidence`] for the full contract. + pub fn build_approx_evidence(layer_dir: &Path, b: u8) -> OLMResult { + MphfLayer::build_approx_evidence(layer_dir, b) + } } // ── Mode 1 — set membership ─────────────────────────────────────────────────── diff --git a/src/obilayeredmap/src/lib.rs b/src/obilayeredmap/src/lib.rs index a883041..dacbe7a 100644 --- a/src/obilayeredmap/src/lib.rs +++ b/src/obilayeredmap/src/lib.rs @@ -1,5 +1,6 @@ pub mod error; pub mod evidence; +pub mod fingerprint; pub mod layer; pub mod layered_store; pub mod map; diff --git a/src/obilayeredmap/src/mphf_layer.rs b/src/obilayeredmap/src/mphf_layer.rs index 1713bce..a025784 100644 --- a/src/obilayeredmap/src/mphf_layer.rs +++ b/src/obilayeredmap/src/mphf_layer.rs @@ -9,10 +9,12 @@ use ptr_hash::{PtrHash, PtrHashParams, bucket_fn::CubicEps, hash::Xx64}; use crate::error::{OLMError, OLMResult}; use crate::evidence::{Evidence, EvidenceWriter}; +use crate::fingerprint::{FingerprintVec, FingerprintVecWriter}; -pub(crate) const MPHF_FILE: &str = "mphf.bin"; -pub(crate) const UNITIGS_FILE: &str = "unitigs.bin"; -pub(crate) const EVIDENCE_FILE: &str = "evidence.bin"; +pub(crate) const MPHF_FILE: &str = "mphf.bin"; +pub(crate) const UNITIGS_FILE: &str = "unitigs.bin"; +pub(crate) const EVIDENCE_FILE: &str = "evidence.bin"; +pub(crate) const FINGERPRINT_FILE: &str = "fingerprint.bin"; pub(crate) type Mphf = PtrHash>, Xx64, Vec>; @@ -50,6 +52,17 @@ impl MphfLayer { } } + /// Returns `Some(slot)` if `kmer` passes the fingerprint check, `None` otherwise. + /// + /// False positive rate per query: 1/2^b (where b is the fingerprint width + /// used at build time). No `.idx` or `evidence.bin` needed at query time. + #[inline] + pub fn find_approx(&self, kmer: CanonicalKmer, fp: &FingerprintVec) -> Option { + let slot = self.mphf.index(&kmer.raw()); + if slot >= self.n { return None; } + if fp.matches(slot, kmer.seq_hash()) { Some(slot) } else { None } + } + pub fn n(&self) -> usize { self.n } pub fn unitig_writer(dir: &Path) -> OLMResult { @@ -108,6 +121,41 @@ impl MphfLayer { Ok(n) } + /// Build `fingerprint.bin` from `unitigs.bin` and `mphf.bin` already present + /// in `dir`. `b` is the number of fingerprint bits (1..=64). + /// + /// The fingerprint for each slot is the low `b` bits of `kmer.seq_hash()`. + /// No `.idx` file is written — approximate evidence needs no random access. + pub fn build_approx_evidence(dir: &Path, b: u8) -> OLMResult { + if b == 0 || b > 64 { + return Err(OLMError::InvalidLayer("fingerprint width must be 1..=64".into())); + } + let unitig_path = dir.join(UNITIGS_FILE); + let unitigs = UnitigFileReader::open_sequential(&unitig_path)?; + let n = unitigs.n_kmers(); + + if n == 0 { + FingerprintVecWriter::new(0, b).write(&dir.join(FINGERPRINT_FILE))?; + return Ok(0); + } + + let mphf: Mphf = Mphf::load_full(&dir.join(MPHF_FILE)) + .map_err(|e| OLMError::InvalidLayer(e.to_string()))?; + + let mut fw = FingerprintVecWriter::new(n, b); + + for kmer in unitigs.iter_canonical_kmers() { + let slot = mphf.index(&kmer.raw()); + if slot >= n { + return Err(OLMError::Mphf("slot out of bounds".into())); + } + fw.set(slot, kmer.seq_hash()); + } + + fw.write(&dir.join(FINGERPRINT_FILE))?; + Ok(n) + } + /// Build MPHF and exact evidence from the unitigs file already present in /// `dir`. Calls `fill_slot(slot, kmer)` once per kmer for DataStore /// population. Returns the number of kmers indexed. -- 2.52.0 From 16a6b0d033c88f3f58f2dc0de3234fe3968e229c Mon Sep 17 00:00:00 2001 From: Eric Coissac Date: Sat, 23 May 2026 08:59:11 +0200 Subject: [PATCH 05/10] feat: add evidence metadata and configurable k-mer parameters Introduces `EvidenceKind` and `LayerMeta` structs to manage per-layer evidence configuration and false-positive parameters. Adds JSON serialization for layer metadata persistence and updates `build_approx_evidence` to accept a `z` parameter for consecutive k-mer thresholds. Exposes these types publicly and documents a future `aggregate` command for merging index matrix columns. --- TODO.md | 18 +++++++++++ src/obikindex/src/meta.rs | 3 ++ src/obilayeredmap/src/layer.rs | 7 ++-- src/obilayeredmap/src/lib.rs | 1 + src/obilayeredmap/src/meta.rs | 50 ++++++++++++++++++++++++++++- src/obilayeredmap/src/mphf_layer.rs | 13 ++++++-- 6 files changed, 86 insertions(+), 6 deletions(-) diff --git a/TODO.md b/TODO.md index a874150..ceca1fe 100644 --- a/TODO.md +++ b/TODO.md @@ -1,3 +1,21 @@ +## A finir dans le cadre de l'extension des index à une forme approximative + +- Il faut avoir un chemin explicite pour construire en mode exact avec des méthodes qui ont ce mot exact à l'intérieur. + - pub fn find_exact (src/obilayeredmap/src/mphf_layer.rs) + - pub fn build_exact_evidence (src/obilayeredmap/src/layer.rs) + +Comme elles existent actuellement pour le mode approx. + +Ensuite, il faudra définir des méthodes génériques +- find() +- build_evidence() + +qui utilise la bonne version suivant le mode de l'index de manière complètement transparente. +Avec ce système, tout le reste du code devrait être insensible au fait que l'on utilise un index exact ou approximatif. + +Sauf qu'avec un index approximatif, les résultats seront approximatifs. + + ## commandes à ajouter - aggregate : aggrege toutes les colonnes d'une matrice d'index en une seule colonne. diff --git a/src/obikindex/src/meta.rs b/src/obikindex/src/meta.rs index 6832cee..6d691f8 100644 --- a/src/obikindex/src/meta.rs +++ b/src/obikindex/src/meta.rs @@ -3,6 +3,7 @@ use std::fs; use std::io; use std::path::Path; +use obilayeredmap::EvidenceKind; use serde::{Deserialize, Serialize}; pub const META_FILENAME: &str = "index.meta"; @@ -28,6 +29,8 @@ pub struct IndexConfig { pub minimizer_size: usize, pub n_bits: usize, pub with_counts: bool, + #[serde(default)] + pub evidence: EvidenceKind, } #[derive(Debug, Clone, Serialize, Deserialize)] diff --git a/src/obilayeredmap/src/layer.rs b/src/obilayeredmap/src/layer.rs index b67709f..04f95ec 100644 --- a/src/obilayeredmap/src/layer.rs +++ b/src/obilayeredmap/src/layer.rs @@ -86,11 +86,12 @@ impl Layer { } /// Build `fingerprint.bin` from `unitigs.bin` and `mphf.bin` already - /// present in `layer_dir`. `b` is the fingerprint width in bits (1..=64). + /// present in `layer_dir`. `b` — fingerprint bits (1..=64); `z` — Findere + /// consecutive k-mer parameter (≥1). /// /// See [`MphfLayer::build_approx_evidence`] for the full contract. - pub fn build_approx_evidence(layer_dir: &Path, b: u8) -> OLMResult { - MphfLayer::build_approx_evidence(layer_dir, b) + pub fn build_approx_evidence(layer_dir: &Path, b: u8, z: u8) -> OLMResult { + MphfLayer::build_approx_evidence(layer_dir, b, z) } } diff --git a/src/obilayeredmap/src/lib.rs b/src/obilayeredmap/src/lib.rs index dacbe7a..98ca3c8 100644 --- a/src/obilayeredmap/src/lib.rs +++ b/src/obilayeredmap/src/lib.rs @@ -11,4 +11,5 @@ pub use error::{OLMError, OLMResult}; pub use layer::{Hit, Layer, LayerData}; pub use layered_store::LayeredStore; pub use map::LayeredMap; +pub use meta::{EvidenceKind, LayerMeta}; pub use mphf_layer::MphfLayer; diff --git a/src/obilayeredmap/src/meta.rs b/src/obilayeredmap/src/meta.rs index 12e17f7..bc8d93e 100644 --- a/src/obilayeredmap/src/meta.rs +++ b/src/obilayeredmap/src/meta.rs @@ -5,7 +5,55 @@ use serde::{Deserialize, Serialize}; use crate::error::OLMResult; -const META_FILE: &str = "meta.json"; +const META_FILE: &str = "meta.json"; +const LAYER_META_FILE: &str = "layer_meta.json"; + +// ── Layer-level metadata ────────────────────────────────────────────────────── + +/// Describes the evidence bundle stored alongside the MPHF for one layer. +#[derive(Debug, Clone, Serialize, Deserialize)] +#[serde(tag = "type", rename_all = "snake_case")] +pub enum EvidenceKind { + /// Exact evidence: `evidence.bin` + `unitigs.bin.idx`. Zero false positives. + Exact, + /// Approximate evidence: `fingerprint.bin` only. + /// `b` — fingerprint bits; false-positive rate per k-mer = 1/2^b. + /// `z` — consecutive k-mers that must all match (Findere trick); + /// effective FP rate per read ≈ (W / 2^(b*z)) where W = read windows. + Approx { b: u8, z: u8 }, +} + +#[derive(Debug, Clone, Serialize, Deserialize)] +pub struct LayerMeta { + pub evidence: EvidenceKind, +} + +impl Default for EvidenceKind { + fn default() -> Self { Self::Exact } +} + +impl LayerMeta { + pub fn exact() -> Self { + Self { evidence: EvidenceKind::Exact } + } + + pub fn approx(b: u8, z: u8) -> Self { + Self { evidence: EvidenceKind::Approx { b, z } } + } + + pub fn load(layer_dir: &Path) -> OLMResult { + let f = File::open(layer_dir.join(LAYER_META_FILE))?; + Ok(serde_json::from_reader(f)?) + } + + pub fn save(&self, layer_dir: &Path) -> OLMResult<()> { + let f = File::create(layer_dir.join(LAYER_META_FILE))?; + serde_json::to_writer_pretty(f, self)?; + Ok(()) + } +} + +// ── Partition-level metadata ────────────────────────────────────────────────── #[derive(Debug, Clone, Serialize, Deserialize)] pub struct PartitionMeta { diff --git a/src/obilayeredmap/src/mphf_layer.rs b/src/obilayeredmap/src/mphf_layer.rs index a025784..e5d4cf2 100644 --- a/src/obilayeredmap/src/mphf_layer.rs +++ b/src/obilayeredmap/src/mphf_layer.rs @@ -10,6 +10,7 @@ use ptr_hash::{PtrHash, PtrHashParams, bucket_fn::CubicEps, hash::Xx64}; use crate::error::{OLMError, OLMResult}; use crate::evidence::{Evidence, EvidenceWriter}; use crate::fingerprint::{FingerprintVec, FingerprintVecWriter}; +use crate::meta::LayerMeta; pub(crate) const MPHF_FILE: &str = "mphf.bin"; pub(crate) const UNITIGS_FILE: &str = "unitigs.bin"; @@ -118,24 +119,30 @@ impl MphfLayer { ev.write(&dir.join(EVIDENCE_FILE))?; // Write .idx last: it is only needed for random access (queries). build_unitig_idx(&unitig_path, DEFAULT_BLOCK_BITS)?; + LayerMeta::exact().save(dir)?; Ok(n) } /// Build `fingerprint.bin` from `unitigs.bin` and `mphf.bin` already present - /// in `dir`. `b` is the number of fingerprint bits (1..=64). + /// in `dir`. `b` — fingerprint bits (1..=64); `z` — Findere consecutive + /// k-mer parameter (≥1). /// /// The fingerprint for each slot is the low `b` bits of `kmer.seq_hash()`. /// No `.idx` file is written — approximate evidence needs no random access. - pub fn build_approx_evidence(dir: &Path, b: u8) -> OLMResult { + pub fn build_approx_evidence(dir: &Path, b: u8, z: u8) -> OLMResult { if b == 0 || b > 64 { return Err(OLMError::InvalidLayer("fingerprint width must be 1..=64".into())); } + if z == 0 { + return Err(OLMError::InvalidLayer("z must be ≥ 1".into())); + } let unitig_path = dir.join(UNITIGS_FILE); let unitigs = UnitigFileReader::open_sequential(&unitig_path)?; let n = unitigs.n_kmers(); if n == 0 { FingerprintVecWriter::new(0, b).write(&dir.join(FINGERPRINT_FILE))?; + LayerMeta::approx(b, z).save(dir)?; return Ok(0); } @@ -153,6 +160,7 @@ impl MphfLayer { } fw.write(&dir.join(FINGERPRINT_FILE))?; + LayerMeta::approx(b, z).save(dir)?; Ok(n) } @@ -214,6 +222,7 @@ impl MphfLayer { } ev.write(&dir.join(EVIDENCE_FILE))?; + LayerMeta::exact().save(dir)?; Ok(n) } } -- 2.52.0 From 876bc0127f6a24b3d931ef6ae95056fd372222b7 Mon Sep 17 00:00:00 2001 From: Eric Coissac Date: Sat, 23 May 2026 12:43:32 +0200 Subject: [PATCH 06/10] feat: add approximate evidence matching and index estimation CLI Introduces a new `estimate` CLI subcommand to calculate bloom filter size, evidence bits, and false-positive rates for approximate indexing. Updates the index building and querying pipelines to support both exact and approximate evidence types via a unified `EvidenceKind` abstraction. Refactors `MphfLayer` and partition index builders to route operations based on the selected evidence mode, and adds the required `obilayeredmap` dependency. --- src/Cargo.lock | 1 + src/obikindex/src/index.rs | 3 +- src/obikmer/Cargo.toml | 1 + src/obikmer/src/cmd/estimate.rs | 38 ++++++++ src/obikmer/src/cmd/index.rs | 104 +++++++++++++++++++++ src/obikmer/src/cmd/mod.rs | 1 + src/obikmer/src/main.rs | 3 + src/obikpartitionner/src/index_layer.rs | 9 +- src/obilayeredmap/src/layer.rs | 7 ++ src/obilayeredmap/src/meta.rs | 5 +- src/obilayeredmap/src/mphf_layer.rs | 118 +++++++++++++++--------- 11 files changed, 243 insertions(+), 47 deletions(-) create mode 100644 src/obikmer/src/cmd/estimate.rs diff --git a/src/Cargo.lock b/src/Cargo.lock index bc710eb..dd85960 100644 --- a/src/Cargo.lock +++ b/src/Cargo.lock @@ -1529,6 +1529,7 @@ dependencies = [ "obikpartitionner", "obikrope", "obikseq", + "obilayeredmap", "obipipeline", "obiread", "obiskbuilder", diff --git a/src/obikindex/src/index.rs b/src/obikindex/src/index.rs index 502822e..321d905 100644 --- a/src/obikindex/src/index.rs +++ b/src/obikindex/src/index.rs @@ -144,6 +144,7 @@ impl KmerIndex { let n = self.partition.n_partitions(); let t = Stage::start("index"); let with_counts = self.meta.config.with_counts; + let evidence = self.meta.config.evidence.clone(); let total_kmers = AtomicUsize::new(0); let pb = Arc::new(Mutex::new( @@ -153,7 +154,7 @@ impl KmerIndex { )); (0..n).into_par_iter().for_each(|i| { - match self.partition.build_index_layer(i, min_ab, max_ab, with_counts) { + match self.partition.build_index_layer(i, min_ab, max_ab, with_counts, &evidence) { Ok(0) => {} Ok(n_kmers) => { total_kmers.fetch_add(n_kmers, Ordering::Relaxed); diff --git a/src/obikmer/Cargo.toml b/src/obikmer/Cargo.toml index 8d44035..cb7a7a2 100644 --- a/src/obikmer/Cargo.toml +++ b/src/obikmer/Cargo.toml @@ -18,6 +18,7 @@ obikpartitionner = { path = "../obikpartitionner" } obisys = { path = "../obisys" } obiskio = { path = "../obiskio" } obikindex = { path = "../obikindex" } +obilayeredmap = { path = "../obilayeredmap" } clap = { version = "4", features = ["derive"] } serde_json = "1" csv = "1" diff --git a/src/obikmer/src/cmd/estimate.rs b/src/obikmer/src/cmd/estimate.rs new file mode 100644 index 0000000..0d8fe98 --- /dev/null +++ b/src/obikmer/src/cmd/estimate.rs @@ -0,0 +1,38 @@ +use clap::Args; + +use super::index::resolve_approx_params; + +#[derive(Args)] +pub struct EstimateArgs { + /// k-mer size used for querying (same as --kmer-size in index) + #[arg(short = 'k', long, default_value_t = 31)] + pub kmer_size: usize, + + /// Findere z parameter: number of consecutive k-mers that must all match. + /// Effective indexed k-mer size is kmer_size - z + 1. + #[arg(short = 'z', long, default_value = None)] + pub findere_z: Option, + + /// Fingerprint bits per slot (b). FP per z-window = 1/2^(b·z). + #[arg(long, default_value = None)] + pub evidence_bits: Option, + + /// Target false-positive rate per z-window (e.g. 0.01). + #[arg(long, default_value = None)] + pub fp: Option, +} + +pub fn run(args: EstimateArgs) { + let (z, b, fp_window) = resolve_approx_params(args.findere_z, args.evidence_bits, args.fp); + + let k_query = args.kmer_size; + let k_index = k_query.saturating_sub(z as usize - 1); + let fp_kmer = 1.0_f64 / 2_f64.powi(b as i32); + + println!("{:<22} {}", "k (query):", k_query); + println!("{:<22} {}", "k (indexed):", k_index); + println!("{:<22} {}", "z:", z); + println!("{:<22} {}", "evidence bits (b):", b); + println!("{:<22} {:.3e} (1/2^{})", "FP per k-mer:", fp_kmer, b); + println!("{:<22} {:.3e} (1/2^{})", "FP per z-window:", fp_window, b as u32 * z as u32); +} diff --git a/src/obikmer/src/cmd/index.rs b/src/obikmer/src/cmd/index.rs index 5559f95..d5c9999 100644 --- a/src/obikmer/src/cmd/index.rs +++ b/src/obikmer/src/cmd/index.rs @@ -2,6 +2,7 @@ use std::path::PathBuf; use clap::Args; use obikindex::{GenomeInfo, IndexConfig, IndexState, KmerIndex}; +use obilayeredmap::EvidenceKind; fn parse_key_value(s: &str) -> Result<(String, String), String> { let pos = s.find('=').ok_or_else(|| format!("invalid key=value: no '=' in '{s}'"))?; @@ -48,14 +49,116 @@ pub struct IndexArgs { #[arg(long, default_value_t = false)] pub keep_intermediate: bool, + /// Use approximate (fingerprint-based) evidence instead of exact evidence. + /// False-positive rate per z-window: 1/2^(b·z). + #[arg(long, default_value_t = false)] + pub approx: bool, + + /// Findere z parameter: number of consecutive k-mers that must all match. + /// Effective indexed k-mer size is kmer_size - z + 1. + #[arg(short = 'z', long, default_value = None)] + pub findere_z: Option, + + /// Fingerprint bits per slot (b). FP per z-window = 1/2^(b·z). + #[arg(long, default_value = None)] + pub evidence_bits: Option, + + /// Target false-positive rate per z-window (e.g. 0.01). + /// Used to derive missing b or z. + #[arg(long, default_value = None)] + pub fp: Option, + #[command(flatten)] pub common: CommonArgs, } +/// Resolve the (z, b, fp) triplet from the user-supplied subset. +/// +/// Model: FP = 1/2^(b·z) ⟹ b·z = ⌈-log₂(fp)⌉ +/// +/// Rules when one value is missing (conservative = ceiling): +/// given z, b → fp = 1/2^(b·z) +/// given z, fp → b = ⌈-log₂(fp) / z⌉ +/// given b, fp → z = ⌈-log₂(fp) / b⌉ +/// given z only → b = 8 (default), fp derived +/// given b only → z = 1 (default), fp derived +/// given fp only → b = 8 (default), z derived +/// none given → z = 1, b = 8, fp = 1/256 +pub(crate) fn resolve_approx_params( + z_opt: Option, + b_opt: Option, + fp_opt: Option, +) -> (u8, u8, f64) { + const DEFAULT_B: u8 = 8; + const DEFAULT_Z: u8 = 1; + + let bits_needed = |fp: f64| -> u8 { + (-fp.log2()).ceil() as u8 + }; + + match (z_opt, b_opt, fp_opt) { + // All three given: use b and z, recompute fp conservatively. + (Some(z), Some(b), Some(_fp)) => { + let fp = 1.0_f64 / (1u64 << (b as u32 * z as u32)) as f64; + (z, b, fp) + } + // Two given, derive third. + (Some(z), Some(b), None) => { + let fp = 1.0_f64 / (1u64 << (b as u32 * z as u32)) as f64; + (z, b, fp) + } + (Some(z), None, Some(fp)) => { + let bz = (-fp.log2()).ceil() as u32; + let b = ((bz + z as u32 - 1) / z as u32).max(1) as u8; + let actual_fp = 1.0_f64 / (1u64 << (b as u32 * z as u32)) as f64; + (z, b, actual_fp) + } + (None, Some(b), Some(fp)) => { + let bz = (-fp.log2()).ceil() as u32; + let z = ((bz + b as u32 - 1) / b as u32).max(1) as u8; + let actual_fp = 1.0_f64 / (1u64 << (b as u32 * z as u32)) as f64; + (z, b, actual_fp) + } + // One given, apply defaults for the other. + (Some(z), None, None) => { + let b = DEFAULT_B; + let fp = 1.0_f64 / (1u64 << (b as u32 * z as u32)) as f64; + (z, b, fp) + } + (None, Some(b), None) => { + let z = DEFAULT_Z; + let fp = 1.0_f64 / (1u64 << (b as u32 * z as u32)) as f64; + (z, b, fp) + } + (None, None, Some(fp)) => { + let b = DEFAULT_B; + let z = ((bits_needed(fp) as u32 + b as u32 - 1) / b as u32).max(1) as u8; + let actual_fp = 1.0_f64 / (1u64 << (b as u32 * z as u32)) as f64; + (z, b, actual_fp) + } + // None given: defaults. + (None, None, None) => { + let b = DEFAULT_B; + let z = DEFAULT_Z; + let fp = 1.0_f64 / (1u64 << (b as u32 * z as u32)) as f64; + (z, b, fp) + } + } +} + pub fn run(args: IndexArgs) { let output = args.output.clone(); let mut rep = Reporter::new(); + // ── Resolve evidence kind ──────────────────────────────────────────────── + let evidence = if args.approx { + let (z, b, fp) = resolve_approx_params(args.findere_z, args.evidence_bits, args.fp); + info!("approximate evidence: b={b}, z={z}, fp={fp:.2e}"); + EvidenceKind::Approx { b, z } + } else { + EvidenceKind::Exact + }; + // ── Open or create the index ───────────────────────────────────────────── let mut idx = if KmerIndex::exists(&output) && !args.force { info!("resuming from existing index at {}", output.display()); @@ -81,6 +184,7 @@ pub fn run(args: IndexArgs) { minimizer_size: args.common.minimizer_size, n_bits, with_counts: args.with_counts, + evidence: evidence.clone(), }; let genome_info = args.label.as_ref().map(|label| { let mut info = GenomeInfo::new(label.clone()); diff --git a/src/obikmer/src/cmd/mod.rs b/src/obikmer/src/cmd/mod.rs index 43fc24a..1645c23 100644 --- a/src/obikmer/src/cmd/mod.rs +++ b/src/obikmer/src/cmd/mod.rs @@ -1,6 +1,7 @@ pub mod annotate; pub mod distance; pub mod dump; +pub mod estimate; pub mod index; pub mod merge; pub mod query; diff --git a/src/obikmer/src/main.rs b/src/obikmer/src/main.rs index c9f4a12..66665f8 100644 --- a/src/obikmer/src/main.rs +++ b/src/obikmer/src/main.rs @@ -32,6 +32,8 @@ enum Commands { Distance(cmd::distance::DistanceArgs), /// Dump unitigs from a built index to stdout (debug) Unitig(cmd::unitig::UnitigArgs), + /// Estimate approximate-index parameters (z, evidence bits, FP rates) before indexing + Estimate(cmd::estimate::EstimateArgs), } fn main() { @@ -62,6 +64,7 @@ fn main() { Commands::Annotate(args) => cmd::annotate::run(args), Commands::Distance(args) => cmd::distance::run(args), Commands::Unitig(args) => cmd::unitig::run(args), + Commands::Estimate(args) => cmd::estimate::run(args), } #[cfg(feature = "profiling")] diff --git a/src/obikpartitionner/src/index_layer.rs b/src/obikpartitionner/src/index_layer.rs index a6401b5..e51f7cc 100644 --- a/src/obikpartitionner/src/index_layer.rs +++ b/src/obikpartitionner/src/index_layer.rs @@ -5,7 +5,7 @@ use cacheline_ef::{CachelineEf, CachelineEfVec}; use epserde::prelude::*; use obicompactvec::{PersistentCompactIntMatrix, PersistentCompactIntVec}; use obidebruinj::GraphDeBruijn; -use obilayeredmap::{OLMError, layer::Layer}; +use obilayeredmap::{EvidenceKind, OLMError, layer::Layer}; use obilayeredmap::meta::PartitionMeta; use obiskio::{SKError, SKFileMeta, SKFileReader}; use ptr_hash::{PtrHash, bucket_fn::CubicEps, hash::Xx64}; @@ -44,6 +44,7 @@ impl KmerPartition { min_ab: u32, max_ab: Option, with_counts: bool, + evidence: &EvidenceKind, ) -> Result { let part_dir = self.part_dir(i); let dedup_path = part_dir.join("dereplicated.skmer.zst"); @@ -119,6 +120,12 @@ impl KmerPartition { Layer::<()>::build(&layer_dir).map_err(olm_to_sk)?; } + // For approximate evidence: replace the exact evidence bundle with a + // fingerprint. For exact evidence, build() already wrote it. + if let EvidenceKind::Approx { b, z } = evidence { + Layer::<()>::build_approx_evidence(&layer_dir, *b, *z).map_err(olm_to_sk)?; + } + // Write meta.json in the index/ directory so LayeredMap::open works // (e.g. for subsequent merge operations). let index_dir = layer_dir.parent().expect("layer_dir has a parent"); diff --git a/src/obilayeredmap/src/layer.rs b/src/obilayeredmap/src/layer.rs index 04f95ec..e5f531b 100644 --- a/src/obilayeredmap/src/layer.rs +++ b/src/obilayeredmap/src/layer.rs @@ -10,6 +10,7 @@ use obikseq::CanonicalKmer; use obiskio::{UnitigFileReader, UnitigFileWriter}; use crate::error::{OLMError, OLMResult}; +use crate::meta::EvidenceKind; use crate::mphf_layer::MphfLayer; pub(crate) use crate::mphf_layer::UNITIGS_FILE; @@ -93,6 +94,12 @@ impl Layer { pub fn build_approx_evidence(layer_dir: &Path, b: u8, z: u8) -> OLMResult { MphfLayer::build_approx_evidence(layer_dir, b, z) } + + /// Dispatch to `build_exact_evidence` or `build_approx_evidence` based on + /// `kind`. + pub fn build_evidence(layer_dir: &Path, kind: &EvidenceKind) -> OLMResult { + MphfLayer::build_evidence(layer_dir, kind) + } } // ── Mode 1 — set membership ─────────────────────────────────────────────────── diff --git a/src/obilayeredmap/src/meta.rs b/src/obilayeredmap/src/meta.rs index bc8d93e..352cbdd 100644 --- a/src/obilayeredmap/src/meta.rs +++ b/src/obilayeredmap/src/meta.rs @@ -17,9 +17,10 @@ pub enum EvidenceKind { /// Exact evidence: `evidence.bin` + `unitigs.bin.idx`. Zero false positives. Exact, /// Approximate evidence: `fingerprint.bin` only. - /// `b` — fingerprint bits; false-positive rate per k-mer = 1/2^b. + /// `b` — fingerprint bits; false-positive rate per k-mer query = 1/2^b. /// `z` — consecutive k-mers that must all match (Findere trick); - /// effective FP rate per read ≈ (W / 2^(b*z)) where W = read windows. + /// effective FP rate per sequencing read ≈ W / 2^(b·z) + /// where W = L - k - z + 2 is the number of windows in a read of length L. Approx { b: u8, z: u8 }, } diff --git a/src/obilayeredmap/src/mphf_layer.rs b/src/obilayeredmap/src/mphf_layer.rs index e5d4cf2..d6b9510 100644 --- a/src/obilayeredmap/src/mphf_layer.rs +++ b/src/obilayeredmap/src/mphf_layer.rs @@ -10,7 +10,7 @@ use ptr_hash::{PtrHash, PtrHashParams, bucket_fn::CubicEps, hash::Xx64}; use crate::error::{OLMError, OLMResult}; use crate::evidence::{Evidence, EvidenceWriter}; use crate::fingerprint::{FingerprintVec, FingerprintVecWriter}; -use crate::meta::LayerMeta; +use crate::meta::{EvidenceKind, LayerMeta}; pub(crate) const MPHF_FILE: &str = "mphf.bin"; pub(crate) const UNITIGS_FILE: &str = "unitigs.bin"; @@ -19,80 +19,117 @@ pub(crate) const FINGERPRINT_FILE: &str = "fingerprint.bin"; pub(crate) type Mphf = PtrHash>, Xx64, Vec>; +// ── Evidence store ──────────────────────────────────────────────────────────── + +enum LayerEvidence { + Exact { evidence: Evidence, unitigs: UnitigFileReader }, + Approx { fingerprint: FingerprintVec }, +} + +// ── MphfLayer ───────────────────────────────────────────────────────────────── + /// Autonomous kmer → slot mapping for one layer. /// -/// Answers presence/absence queries without any attached DataStore. -/// Build once, never rebuilt — data stores are attached and derived externally. +/// Dispatches queries to exact or approximate evidence transparently based on +/// the `layer_meta.json` written at build time. pub struct MphfLayer { - mphf: Mphf, - evidence: Evidence, - unitigs: UnitigFileReader, - n: usize, + mphf: Mphf, + ev: LayerEvidence, + n: usize, } impl MphfLayer { pub fn open(dir: &Path) -> OLMResult { + let meta = LayerMeta::load(dir)?; let mphf: Mphf = Mphf::load_full(&dir.join(MPHF_FILE)) .map_err(|e| OLMError::InvalidLayer(e.to_string()))?; - let unitigs = UnitigFileReader::open(&dir.join(UNITIGS_FILE))?; - let evidence = Evidence::open(&dir.join(EVIDENCE_FILE))?; - let n = evidence.len(); - Ok(Self { mphf, evidence, unitigs, n }) + let (ev, n) = match meta.evidence { + EvidenceKind::Exact => { + let evidence = Evidence::open(&dir.join(EVIDENCE_FILE))?; + let n = evidence.len(); + let unitigs = UnitigFileReader::open(&dir.join(UNITIGS_FILE))?; + (LayerEvidence::Exact { evidence, unitigs }, n) + } + EvidenceKind::Approx { .. } => { + let fingerprint = FingerprintVec::open(&dir.join(FINGERPRINT_FILE))?; + let n = fingerprint.n(); + (LayerEvidence::Approx { fingerprint }, n) + } + }; + Ok(Self { mphf, ev, n }) } - /// Returns `Some(slot)` if `kmer` belongs to this layer, `None` otherwise. + // ── Query API ───────────────────────────────────────────────────────────── + + /// Transparent dispatch: routes to `find_exact` or `find_approx` based on + /// the evidence loaded at `open` time. #[inline] pub fn find(&self, kmer: CanonicalKmer) -> Option { + match &self.ev { + LayerEvidence::Exact { .. } => self.find_exact(kmer), + LayerEvidence::Approx { .. } => self.find_approx(kmer), + } + } + + /// Exact lookup: zero false positives. Panics if the layer was opened with + /// approximate evidence. + #[inline] + pub fn find_exact(&self, kmer: CanonicalKmer) -> Option { + let LayerEvidence::Exact { evidence, unitigs } = &self.ev else { + panic!("find_exact called on an approximate layer"); + }; let slot = self.mphf.index(&kmer.raw()); if slot >= self.n { return None; } - let (chunk_id, rank) = self.evidence.decode(slot); - if self.unitigs.verify_canonical_kmer(chunk_id as usize, rank as usize, kmer) { + let (chunk_id, rank) = evidence.decode(slot); + if unitigs.verify_canonical_kmer(chunk_id as usize, rank as usize, kmer) { Some(slot) } else { None } } - /// Returns `Some(slot)` if `kmer` passes the fingerprint check, `None` otherwise. - /// - /// False positive rate per query: 1/2^b (where b is the fingerprint width - /// used at build time). No `.idx` or `evidence.bin` needed at query time. + /// Approximate lookup: false-positive rate 1/2^b per k-mer query. Panics + /// if the layer was opened with exact evidence. #[inline] - pub fn find_approx(&self, kmer: CanonicalKmer, fp: &FingerprintVec) -> Option { + pub fn find_approx(&self, kmer: CanonicalKmer) -> Option { + let LayerEvidence::Approx { fingerprint } = &self.ev else { + panic!("find_approx called on an exact layer"); + }; let slot = self.mphf.index(&kmer.raw()); if slot >= self.n { return None; } - if fp.matches(slot, kmer.seq_hash()) { Some(slot) } else { None } + if fingerprint.matches(slot, kmer.seq_hash()) { Some(slot) } else { None } } pub fn n(&self) -> usize { self.n } + // ── Build helpers ───────────────────────────────────────────────────────── + pub fn unitig_writer(dir: &Path) -> OLMResult { fs::create_dir_all(dir)?; Ok(UnitigFileWriter::create(&dir.join(UNITIGS_FILE))?) } - /// Build `unitigs.bin.idx` and `evidence.bin` from `unitigs.bin` and - /// `mphf.bin` already present in `dir`. + /// Dispatch to `build_exact_evidence` or `build_approx_evidence` based on + /// `kind`. + pub fn build_evidence(dir: &Path, kind: &EvidenceKind) -> OLMResult { + match kind { + EvidenceKind::Exact => Self::build_exact_evidence(dir), + EvidenceKind::Approx { b, z } => Self::build_approx_evidence(dir, *b, *z), + } + } + + /// Build `evidence.bin` + `unitigs.bin.idx` from `unitigs.bin` + `mphf.bin`. /// - /// This is the exact-evidence construction route. It can be called: - /// - after the initial build (via [`Self::build`] which calls it internally) - /// - standalone to promote an existing (unitigs + mphf) into an exact index - /// - standalone to rebuild evidence after a format change - /// Build `unitigs.bin.idx` and `evidence.bin` from `unitigs.bin` and - /// `mphf.bin` already present in `dir`. - /// - /// Uses sequential iteration — no `.idx` is required on entry. - /// Writes both `evidence.bin` and `unitigs.bin.idx` on success. + /// Uses sequential iteration — no `.idx` required on entry. pub fn build_exact_evidence(dir: &Path) -> OLMResult { let unitig_path = dir.join(UNITIGS_FILE); - - // Sequential scan — no .idx required for iteration let unitigs = UnitigFileReader::open_sequential(&unitig_path)?; let n = unitigs.n_kmers(); if n == 0 { fs::File::create(dir.join(EVIDENCE_FILE))?; build_unitig_idx(&unitig_path, DEFAULT_BLOCK_BITS)?; + LayerMeta::exact().save(dir)?; return Ok(0); } @@ -117,18 +154,14 @@ impl MphfLayer { } ev.write(&dir.join(EVIDENCE_FILE))?; - // Write .idx last: it is only needed for random access (queries). build_unitig_idx(&unitig_path, DEFAULT_BLOCK_BITS)?; LayerMeta::exact().save(dir)?; Ok(n) } - /// Build `fingerprint.bin` from `unitigs.bin` and `mphf.bin` already present - /// in `dir`. `b` — fingerprint bits (1..=64); `z` — Findere consecutive - /// k-mer parameter (≥1). - /// - /// The fingerprint for each slot is the low `b` bits of `kmer.seq_hash()`. - /// No `.idx` file is written — approximate evidence needs no random access. + /// Build `fingerprint.bin` from `unitigs.bin` + `mphf.bin`. + /// `b` — fingerprint bits (1..=64); `z` — Findere consecutive k-mer + /// parameter (≥1). No `.idx` is written. pub fn build_approx_evidence(dir: &Path, b: u8, z: u8) -> OLMResult { if b == 0 || b > 64 { return Err(OLMError::InvalidLayer("fingerprint width must be 1..=64".into())); @@ -175,8 +208,6 @@ impl MphfLayer { let unitig_path = dir.join(UNITIGS_FILE); - // Write .idx so that UnitigFileReader::open succeeds and parallel - // random access is available for MPHF construction. build_unitig_idx(&unitig_path, DEFAULT_BLOCK_BITS)?; let unitigs = UnitigFileReader::open(&unitig_path)?; @@ -189,10 +220,11 @@ impl MphfLayer { .ok_or_else(|| OLMError::Mphf("construction failed".into()))?; mphf.store(&dir.join(MPHF_FILE)) .map_err(|e| OLMError::InvalidLayer(e.to_string()))?; + LayerMeta::exact().save(dir)?; return Ok(0); } - // Pass 1 — build MPHF (parallel, uses random access via .idx) + // Pass 1 — build MPHF (parallel, random access via .idx) let keys = (0..unitigs.len()) .into_par_iter() .flat_map_iter(|ci| unitigs.unitig(ci).into_canonical_kmers().map(|km| km.raw())); -- 2.52.0 From bc51cd9861d889911fdf505f7d98ec0a8c7fea96 Mon Sep 17 00:00:00 2001 From: Eric Coissac Date: Sat, 23 May 2026 12:50:03 +0200 Subject: [PATCH 07/10] feat: add configurable block sizes and in-place reindex command Propagate configurable block size (`block_bits`) through index and layer construction to control unitig chunking and optimize memory/performance trade-offs. Introduce an in-place `reindex` command and library method to convert indices between exact and approximate evidence formats. Add validation to reject merging indexes with mismatched evidence types, and update parallel kmer counting to use `AtomicUsize` for thread-safe aggregation. Includes CLI argument parsing, metadata persistence, and updated tests. --- src/obikindex/src/error.rs | 5 +- src/obikindex/src/index.rs | 3 +- src/obikindex/src/lib.rs | 1 + src/obikindex/src/merge.rs | 37 ++++++- src/obikindex/src/meta.rs | 5 + src/obikindex/src/rebuild.rs | 3 +- src/obikindex/src/reindex.rs | 126 ++++++++++++++++++++++ src/obikmer/src/cli.rs | 6 ++ src/obikmer/src/cmd/index.rs | 9 +- src/obikmer/src/cmd/mod.rs | 1 + src/obikmer/src/cmd/reindex.rs | 68 ++++++++++++ src/obikmer/src/cmd/unitig.rs | 6 +- src/obikmer/src/main.rs | 3 + src/obikpartitionner/src/index_layer.rs | 5 +- src/obikpartitionner/src/merge_layer.rs | 3 +- src/obikpartitionner/src/rebuild_layer.rs | 3 +- src/obilayeredmap/src/layer.rs | 44 ++++---- src/obilayeredmap/src/map.rs | 6 +- src/obilayeredmap/src/mphf_layer.rs | 20 ++-- src/obilayeredmap/src/tests/layer.rs | 8 +- src/obiskio/src/unitig_index.rs | 7 +- 21 files changed, 318 insertions(+), 51 deletions(-) create mode 100644 src/obikindex/src/reindex.rs create mode 100644 src/obikmer/src/cmd/reindex.rs diff --git a/src/obikindex/src/error.rs b/src/obikindex/src/error.rs index 96834cb..f9ddf7e 100644 --- a/src/obikindex/src/error.rs +++ b/src/obikindex/src/error.rs @@ -18,6 +18,8 @@ pub enum OKIError { DuplicateGenomeLabel(String), /// Operation not valid for this index configuration. InvalidInput(String), + /// Sources mix exact and approximate evidence, or use incompatible approx parameters. + IncompatibleEvidence(String), } pub type OKIResult = Result; @@ -32,7 +34,8 @@ impl fmt::Display for OKIError { OKIError::IncompatibleConfig => write!(f, "incompatible index configurations"), OKIError::MismatchedMode => write!(f, "count mode requires all sources to have with_counts=true"), OKIError::DuplicateGenomeLabel(l) => write!(f, "duplicate genome label across sources: {l}"), - OKIError::InvalidInput(m) => write!(f, "invalid input: {m}"), + OKIError::InvalidInput(m) => write!(f, "invalid input: {m}"), + OKIError::IncompatibleEvidence(m) => write!(f, "incompatible evidence: {m}"), } } } diff --git a/src/obikindex/src/index.rs b/src/obikindex/src/index.rs index 321d905..8e207a8 100644 --- a/src/obikindex/src/index.rs +++ b/src/obikindex/src/index.rs @@ -145,6 +145,7 @@ impl KmerIndex { let t = Stage::start("index"); let with_counts = self.meta.config.with_counts; let evidence = self.meta.config.evidence.clone(); + let block_bits = self.meta.config.block_bits; let total_kmers = AtomicUsize::new(0); let pb = Arc::new(Mutex::new( @@ -154,7 +155,7 @@ impl KmerIndex { )); (0..n).into_par_iter().for_each(|i| { - match self.partition.build_index_layer(i, min_ab, max_ab, with_counts, &evidence) { + match self.partition.build_index_layer(i, min_ab, max_ab, with_counts, &evidence, block_bits) { Ok(0) => {} Ok(n_kmers) => { total_kmers.fetch_add(n_kmers, Ordering::Relaxed); diff --git a/src/obikindex/src/lib.rs b/src/obikindex/src/lib.rs index 5c9bc85..9f84178 100644 --- a/src/obikindex/src/lib.rs +++ b/src/obikindex/src/lib.rs @@ -6,6 +6,7 @@ mod dump; mod index; mod merge; mod rebuild; +mod reindex; pub use error::{OKIError, OKIResult}; pub use distance::{DistanceMetric, DistanceOutput}; diff --git a/src/obikindex/src/merge.rs b/src/obikindex/src/merge.rs index bb51629..88dead2 100644 --- a/src/obikindex/src/merge.rs +++ b/src/obikindex/src/merge.rs @@ -9,6 +9,8 @@ use obisys::{Reporter, Stage}; use rayon::prelude::*; use tracing::info; +use obilayeredmap::EvidenceKind; + use crate::error::{OKIError, OKIResult}; use crate::index::KmerIndex; use crate::meta::{GenomeInfo, IndexMeta}; @@ -61,6 +63,9 @@ impl KmerIndex { } } + // ── Validate evidence compatibility ─────────────────────────────────── + let evidence = validate_evidence_compat(sources)?; + // ── Compute final genome labels (rename duplicates if requested) ─────── let (source_labels, all_genomes) = compute_labels(sources, rename_duplicates)?; @@ -91,6 +96,7 @@ impl KmerIndex { let mut meta = IndexMeta::read(output).map_err(OKIError::Io)?; meta.genomes = all_genomes; meta.config.with_counts = mode == MergeMode::Count; + meta.config.evidence = evidence; meta.write(output)?; // In presence/absence mode, purge counts/ directories inherited from @@ -134,13 +140,14 @@ impl KmerIndex { let pb = partition_bar(n_partitions as u64); let dst_partition = &dst.partition; + let block_bits = dst.meta.config.block_bits; let errors: Vec = (0..n_partitions) .into_par_iter() .filter_map(|i| { let srcs: Vec<(&obikpartitionner::KmerPartition, usize)> = remaining_sources.iter().map(|s| (&s.partition, s.meta.genomes.len())).collect(); - let result = dst_partition.merge_partition(i, &srcs, mode, n_dst_genomes).err(); + let result = dst_partition.merge_partition(i, &srcs, mode, n_dst_genomes, block_bits).err(); pb.inc(1); result }) @@ -258,6 +265,34 @@ fn partition_bar(n: u64) -> ProgressBar { pb } +/// Check that all sources share the same evidence kind. +/// +/// Rules: +/// - all `Exact` → OK, returns `Exact` +/// - all `Approx { b, z }` same params → OK, returns `Approx { b, z }` +/// - mixed exact/approx or different approx params → `IncompatibleEvidence` +fn validate_evidence_compat(sources: &[&KmerIndex]) -> OKIResult { + let ref_ev = &sources[0].meta.config.evidence; + for src in &sources[1..] { + let ev = &src.meta.config.evidence; + let compat = match (ref_ev, ev) { + (EvidenceKind::Exact, EvidenceKind::Exact) => true, + (EvidenceKind::Approx { b: b1, z: z1 }, + EvidenceKind::Approx { b: b2, z: z2 }) => b1 == b2 && z1 == z2, + _ => false, + }; + if !compat { + return Err(OKIError::IncompatibleEvidence(format!( + "source {:?} has evidence {:?}, expected {:?} — \ + convert all sources to the same evidence kind first \ + (use the `reindex` command)", + src.root_path.display(), ev, ref_ev, + ))); + } + } + Ok(ref_ev.clone()) +} + fn copy_dir_all(src: &Path, dst: &Path) -> io::Result<()> { fs::create_dir_all(dst)?; for entry in fs::read_dir(src)? { diff --git a/src/obikindex/src/meta.rs b/src/obikindex/src/meta.rs index 6d691f8..686c420 100644 --- a/src/obikindex/src/meta.rs +++ b/src/obikindex/src/meta.rs @@ -31,6 +31,11 @@ pub struct IndexConfig { pub with_counts: bool, #[serde(default)] pub evidence: EvidenceKind, + /// Block size for the unitig index as a power-of-two exponent. + /// The `.idx` block covers 2^block_bits consecutive unitigs. + /// 0 = one entry per unitig (O(1) access, largest `.idx`). + #[serde(default)] + pub block_bits: u8, } #[derive(Debug, Clone, Serialize, Deserialize)] diff --git a/src/obikindex/src/rebuild.rs b/src/obikindex/src/rebuild.rs index b360030..95dc4b8 100644 --- a/src/obikindex/src/rebuild.rs +++ b/src/obikindex/src/rebuild.rs @@ -88,12 +88,13 @@ impl KmerIndex { pb.enable_steady_tick(Duration::from_millis(100)); let src_partition = &src.partition; + let block_bits = meta.config.block_bits; let errors: Vec = (0..n_partitions) .into_par_iter() .filter_map(|i| { let result = dst_partition - .rebuild_partition(src_partition, i, filters, mode, n_genomes) + .rebuild_partition(src_partition, i, filters, mode, n_genomes, block_bits) .err(); pb.inc(1); result diff --git a/src/obikindex/src/reindex.rs b/src/obikindex/src/reindex.rs new file mode 100644 index 0000000..674bf4a --- /dev/null +++ b/src/obikindex/src/reindex.rs @@ -0,0 +1,126 @@ +use std::fs; +use std::path::Path; +use std::time::Duration; + +use indicatif::{ProgressBar, ProgressStyle}; +use obilayeredmap::{EvidenceKind, layer::Layer}; +use obilayeredmap::meta::PartitionMeta; +use obisys::{Reporter, Stage}; +use rayon::prelude::*; +use tracing::info; + +use crate::error::{OKIError, OKIResult}; +use crate::index::KmerIndex; +use crate::state::IndexState; + +const EVIDENCE_FILE: &str = "evidence.bin"; +const FINGERPRINT_FILE: &str = "fingerprint.bin"; +const UNITIG_IDX_FILE: &str = "unitigs.bin.idx"; + +fn olm_to_oki(e: obilayeredmap::OLMError) -> OKIError { + OKIError::InvalidInput(e.to_string()) +} + +impl KmerIndex { + /// Convert every layer's evidence bundle to `target` in-place. + /// + /// - `Exact` → builds `evidence.bin` + `unitigs.bin.idx`, removes `fingerprint.bin` + /// - `Approx` → builds `fingerprint.bin`, removes `evidence.bin` + `unitigs.bin.idx` + /// + /// The MPHF (`mphf.bin`) and unitigs (`unitigs.bin`) are never touched. + /// `index.meta` is updated with the new evidence kind on success. + pub fn reindex( + &mut self, + target: EvidenceKind, + block_bits: u8, + rep: &mut Reporter, + ) -> OKIResult<()> { + if self.state() != IndexState::Indexed { + return Err(OKIError::NotIndexed(self.root_path.clone())); + } + + let n = self.partition.n_partitions(); + info!( + "reindex {} partition(s): {:?} → {:?}", + n, self.meta.config.evidence, target, + ); + + let t = Stage::start("reindex"); + let pb = ProgressBar::new(n as u64).with_style( + ProgressStyle::with_template( + "reindex — [{bar:20}] {pos}/{len} | {msg}", + ) + .unwrap() + .tick_strings(&["⠋","⠙","⠹","⠸","⠼","⠴","⠦","⠧","⠇","⠏"]), + ); + pb.enable_steady_tick(Duration::from_millis(80)); + + let errors: Vec = (0..n) + .into_par_iter() + .filter_map(|i| { + let res = reindex_partition( + &self.partition.part_dir(i).join("index"), + &target, + block_bits, + ); + pb.inc(1); + res.err().map(|e| format!("partition {i}: {e}")) + }) + .collect(); + + pb.finish_and_clear(); + + if let Some(e) = errors.into_iter().next() { + return Err(OKIError::InvalidInput(e)); + } + + self.meta.config.evidence = target; + if matches!(self.meta.config.evidence, EvidenceKind::Exact) { + self.meta.config.block_bits = block_bits; + } + self.meta.write(&self.root_path)?; + rep.push(t.stop()); + Ok(()) + } +} + +/// Process all layers of one partition's index directory. +fn reindex_partition(index_dir: &Path, target: &EvidenceKind, block_bits: u8) -> OKIResult<()> { + if !index_dir.exists() { + return Ok(()); + } + let pm = PartitionMeta::load(index_dir).map_err(olm_to_oki)?; + for layer_idx in 0..pm.n_layers { + let layer_dir = index_dir.join(format!("layer_{layer_idx}")); + reindex_layer(&layer_dir, target, block_bits)?; + } + Ok(()) +} + +fn reindex_layer(layer_dir: &Path, target: &EvidenceKind, block_bits: u8) -> OKIResult<()> { + Layer::<()>::build_evidence(layer_dir, target, block_bits).map_err(olm_to_oki)?; + remove_stale_evidence(layer_dir, target) +} + +fn remove_stale_evidence(layer_dir: &Path, target: &EvidenceKind) -> OKIResult<()> { + match target { + EvidenceKind::Exact => { + // fingerprint.bin is no longer valid + remove_if_exists(&layer_dir.join(FINGERPRINT_FILE)); + } + EvidenceKind::Approx { .. } => { + // exact bundle is no longer valid + remove_if_exists(&layer_dir.join(EVIDENCE_FILE)); + remove_if_exists(&layer_dir.join(UNITIG_IDX_FILE)); + } + } + Ok(()) +} + +fn remove_if_exists(path: &Path) { + if let Err(e) = fs::remove_file(path) { + if e.kind() != std::io::ErrorKind::NotFound { + eprintln!("warning: could not remove {}: {e}", path.display()); + } + } +} diff --git a/src/obikmer/src/cli.rs b/src/obikmer/src/cli.rs index b20ae6b..83308d1 100644 --- a/src/obikmer/src/cli.rs +++ b/src/obikmer/src/cli.rs @@ -55,6 +55,12 @@ pub fn partitions_to_bits(n: usize) -> usize { n.max(1).next_power_of_two().trailing_zeros() as usize } +/// Convert a block size (number of unitigs per block) to its `block_bits` exponent. +/// `block_size=1` → `block_bits=0` (one entry per unitig, O(1) random access). +pub fn block_size_to_bits(n: usize) -> u8 { + n.max(1).next_power_of_two().trailing_zeros() as u8 +} + impl CommonArgs { pub fn seqfile_paths(&self) -> obiread::PathIter { let paths = self.inputs.iter().map(PathBuf::from).collect(); diff --git a/src/obikmer/src/cmd/index.rs b/src/obikmer/src/cmd/index.rs index d5c9999..4b53ea3 100644 --- a/src/obikmer/src/cmd/index.rs +++ b/src/obikmer/src/cmd/index.rs @@ -12,7 +12,7 @@ use obikseq::{set_k, set_m}; use obisys::Reporter; use tracing::info; -use crate::cli::{CommonArgs, partitions_to_bits}; +use crate::cli::{CommonArgs, block_size_to_bits, partitions_to_bits}; use crate::steps::scatter; #[derive(Args)] @@ -68,6 +68,11 @@ pub struct IndexArgs { #[arg(long, default_value = None)] pub fp: Option, + /// Block size for exact evidence `.idx` (number of unitigs per block). + /// Must be a power of two; rounded up if not. Default 1 = O(1) random access. + #[arg(long, default_value_t = 1)] + pub block_size: usize, + #[command(flatten)] pub common: CommonArgs, } @@ -179,12 +184,14 @@ pub fn run(args: IndexArgs) { if effective != args.common.partitions { info!("partitions: {} → {} (next power of 2)", args.common.partitions, effective); } + let block_bits = block_size_to_bits(args.block_size); let config = IndexConfig { kmer_size: args.common.kmer_size, minimizer_size: args.common.minimizer_size, n_bits, with_counts: args.with_counts, evidence: evidence.clone(), + block_bits, }; let genome_info = args.label.as_ref().map(|label| { let mut info = GenomeInfo::new(label.clone()); diff --git a/src/obikmer/src/cmd/mod.rs b/src/obikmer/src/cmd/mod.rs index 1645c23..39e9a89 100644 --- a/src/obikmer/src/cmd/mod.rs +++ b/src/obikmer/src/cmd/mod.rs @@ -6,5 +6,6 @@ pub mod index; pub mod merge; pub mod query; pub mod rebuild; +pub mod reindex; pub mod superkmer; pub mod unitig; diff --git a/src/obikmer/src/cmd/reindex.rs b/src/obikmer/src/cmd/reindex.rs new file mode 100644 index 0000000..5d0152b --- /dev/null +++ b/src/obikmer/src/cmd/reindex.rs @@ -0,0 +1,68 @@ +use std::path::PathBuf; + +use clap::Args; +use obikindex::KmerIndex; +use obilayeredmap::EvidenceKind; +use obisys::Reporter; +use tracing::info; + +use crate::cli::block_size_to_bits; +use super::index::resolve_approx_params; + +#[derive(Args)] +pub struct ReindexArgs { + /// Index directory to convert (modified in-place) + pub index: PathBuf, + + /// Convert to approximate evidence (default: convert to exact). + /// Requires --evidence-bits and/or -z and/or --fp. + #[arg(long, default_value_t = false)] + pub approx: bool, + + /// Findere z parameter (≥1). + #[arg(short = 'z', long, default_value = None)] + pub findere_z: Option, + + /// Fingerprint bits per slot (b). + #[arg(long, default_value = None)] + pub evidence_bits: Option, + + /// Target false-positive rate per z-window. + #[arg(long, default_value = None)] + pub fp: Option, + + /// Block size for exact evidence `.idx` (number of unitigs per block). + /// Ignored when converting to approximate evidence. + #[arg(long, default_value_t = 1)] + pub block_size: usize, +} + +pub fn run(args: ReindexArgs) { + let target = if args.approx { + let (z, b, fp) = resolve_approx_params(args.findere_z, args.evidence_bits, args.fp); + info!("target: approximate evidence — b={b}, z={z}, fp={fp:.2e}"); + EvidenceKind::Approx { b, z } + } else { + info!("target: exact evidence"); + EvidenceKind::Exact + }; + + let mut idx = KmerIndex::open(&args.index).unwrap_or_else(|e| { + eprintln!("error opening index: {e}"); + std::process::exit(1); + }); + + info!( + "current evidence: {:?}", + idx.meta().config.evidence, + ); + + let block_bits = block_size_to_bits(args.block_size); + let mut rep = Reporter::new(); + idx.reindex(target, block_bits, &mut rep).unwrap_or_else(|e| { + eprintln!("reindex error: {e}"); + std::process::exit(1); + }); + + rep.print(); +} diff --git a/src/obikmer/src/cmd/unitig.rs b/src/obikmer/src/cmd/unitig.rs index ceb71f8..82fe6bc 100644 --- a/src/obikmer/src/cmd/unitig.rs +++ b/src/obikmer/src/cmd/unitig.rs @@ -35,13 +35,13 @@ pub fn run(args: UnitigArgs) { return; } - let reader = UnitigFileReader::open(&path).unwrap_or_else(|e| { + // open_sequential: works with and without .idx (approx or exact index) + let reader = UnitigFileReader::open_sequential(&path).unwrap_or_else(|e| { eprintln!("error opening unitigs (partition {i}): {e}"); std::process::exit(1) }); - for j in 0..reader.len() { - let unitig = reader.unitig(j); + for (j, unitig) in reader.iter_unitigs() { let mut out = stdout.lock().unwrap(); write_unitig(&unitig, k, i, j, &mut *out).unwrap_or_else(|e| { eprintln!("write error: {e}"); diff --git a/src/obikmer/src/main.rs b/src/obikmer/src/main.rs index 66665f8..e6700e1 100644 --- a/src/obikmer/src/main.rs +++ b/src/obikmer/src/main.rs @@ -34,6 +34,8 @@ enum Commands { Unitig(cmd::unitig::UnitigArgs), /// Estimate approximate-index parameters (z, evidence bits, FP rates) before indexing Estimate(cmd::estimate::EstimateArgs), + /// Convert an index's evidence in-place: exact ↔ approx + Reindex(cmd::reindex::ReindexArgs), } fn main() { @@ -65,6 +67,7 @@ fn main() { Commands::Distance(args) => cmd::distance::run(args), Commands::Unitig(args) => cmd::unitig::run(args), Commands::Estimate(args) => cmd::estimate::run(args), + Commands::Reindex(args) => cmd::reindex::run(args), } #[cfg(feature = "profiling")] diff --git a/src/obikpartitionner/src/index_layer.rs b/src/obikpartitionner/src/index_layer.rs index e51f7cc..4665197 100644 --- a/src/obikpartitionner/src/index_layer.rs +++ b/src/obikpartitionner/src/index_layer.rs @@ -45,6 +45,7 @@ impl KmerPartition { max_ab: Option, with_counts: bool, evidence: &EvidenceKind, + block_bits: u8, ) -> Result { let part_dir = self.part_dir(i); let dedup_path = part_dir.join("dereplicated.skmer.zst"); @@ -109,7 +110,7 @@ impl KmerPartition { uw.close()?; if with_counts { - Layer::::build(&layer_dir, |kmer| { + Layer::::build(&layer_dir, block_bits, |kmer| { match (&mphf1_opt, &counts1_opt) { (Some(mphf), Some(counts)) => counts.get(mphf.index(&kmer.raw())), _ => 1, @@ -117,7 +118,7 @@ impl KmerPartition { }) .map_err(olm_to_sk)?; } else { - Layer::<()>::build(&layer_dir).map_err(olm_to_sk)?; + Layer::<()>::build(&layer_dir, block_bits).map_err(olm_to_sk)?; } // For approximate evidence: replace the exact evidence bundle with a diff --git a/src/obikpartitionner/src/merge_layer.rs b/src/obikpartitionner/src/merge_layer.rs index 21785d0..c07de97 100644 --- a/src/obikpartitionner/src/merge_layer.rs +++ b/src/obikpartitionner/src/merge_layer.rs @@ -161,6 +161,7 @@ impl KmerPartition { sources: &[(&KmerPartition, usize)], mode: MergeMode, n_dst_genomes: usize, + block_bits: u8, ) -> SKResult<()> { let dst_index_dir = self.part_dir(i).join(INDEX_SUBDIR); if !dst_index_dir.exists() { @@ -216,7 +217,7 @@ impl KmerPartition { uw.write(&unitig)?; } uw.close()?; - Layer::<()>::build(&new_layer_dir).map_err(olm_to_sk)?; + Layer::<()>::build(&new_layer_dir, block_bits).map_err(olm_to_sk)?; } drop(g); diff --git a/src/obikpartitionner/src/rebuild_layer.rs b/src/obikpartitionner/src/rebuild_layer.rs index fa581f9..6ba32e4 100644 --- a/src/obikpartitionner/src/rebuild_layer.rs +++ b/src/obikpartitionner/src/rebuild_layer.rs @@ -96,6 +96,7 @@ impl KmerPartition { filters: &[Box], mode: MergeMode, n_genomes: usize, + block_bits: u8, ) -> SKResult<()> { let src_index_dir = src.part_dir(i).join(INDEX_SUBDIR); if !src_index_dir.exists() { @@ -145,7 +146,7 @@ impl KmerPartition { uw.close()?; drop(g); - Layer::<()>::build(&dst_layer_dir).map_err(olm_to_sk)?; + Layer::<()>::build(&dst_layer_dir, block_bits).map_err(olm_to_sk)?; let dst_mphf = MphfLayer::open(&dst_layer_dir).map_err(olm_to_sk)?; // ── Prepare matrix builders (one column per genome) ─────────────────── diff --git a/src/obilayeredmap/src/layer.rs b/src/obilayeredmap/src/layer.rs index e5f531b..6a4f06f 100644 --- a/src/obilayeredmap/src/layer.rs +++ b/src/obilayeredmap/src/layer.rs @@ -80,39 +80,33 @@ impl Layer { /// Build `unitigs.bin.idx` and `evidence.bin` from `unitigs.bin` and /// `mphf.bin` already present in `layer_dir`. - /// - /// See [`MphfLayer::build_exact_evidence`] for the full contract. - pub fn build_exact_evidence(layer_dir: &Path) -> OLMResult { - MphfLayer::build_exact_evidence(layer_dir) + /// `block_bits` controls the `.idx` block size (2^block_bits chunks/block). + pub fn build_exact_evidence(layer_dir: &Path, block_bits: u8) -> OLMResult { + MphfLayer::build_exact_evidence(layer_dir, block_bits) } /// Build `fingerprint.bin` from `unitigs.bin` and `mphf.bin` already /// present in `layer_dir`. `b` — fingerprint bits (1..=64); `z` — Findere /// consecutive k-mer parameter (≥1). - /// - /// See [`MphfLayer::build_approx_evidence`] for the full contract. pub fn build_approx_evidence(layer_dir: &Path, b: u8, z: u8) -> OLMResult { MphfLayer::build_approx_evidence(layer_dir, b, z) } - /// Dispatch to `build_exact_evidence` or `build_approx_evidence` based on - /// `kind`. - pub fn build_evidence(layer_dir: &Path, kind: &EvidenceKind) -> OLMResult { - MphfLayer::build_evidence(layer_dir, kind) + /// Dispatch to `build_exact_evidence` or `build_approx_evidence`. + /// `block_bits` is forwarded to exact evidence only. + pub fn build_evidence(layer_dir: &Path, kind: &EvidenceKind, block_bits: u8) -> OLMResult { + MphfLayer::build_evidence(layer_dir, kind, block_bits) } } // ── Mode 1 — set membership ─────────────────────────────────────────────────── impl Layer<()> { - pub fn build(out_dir: &Path) -> OLMResult { - MphfLayer::build(out_dir, &mut |_, _| Ok(())) + pub fn build(out_dir: &Path, block_bits: u8) -> OLMResult { + MphfLayer::build(out_dir, block_bits, &mut |_, _| Ok(())) } /// Create a presence matrix for a set-membership layer (first merge). - /// - /// All `n_kmers` slots are set to `true`: every kmer in this layer belongs - /// to genome_0, so genome_0 is present at every slot. pub fn init_presence_matrix(layer_dir: &Path, n_kmers: usize) -> OLMResult<()> { let presence_dir = layer_dir.join(PRESENCE_DIR); fs::create_dir_all(&presence_dir).map_err(OLMError::Io)?; @@ -126,16 +120,20 @@ impl Layer<()> { } } -// ── Mode 2 — count matrix (1 column per layer) ──────────────────────────────── +// ── Mode 2 — count matrix ───────────────────────────────────────────────────── impl Layer { - pub fn build(out_dir: &Path, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult { + pub fn build( + out_dir: &Path, + block_bits: u8, + count_of: impl Fn(CanonicalKmer) -> u32, + ) -> OLMResult { let n = UnitigFileReader::open_sequential(&out_dir.join(UNITIGS_FILE))?.n_kmers(); let counts_dir = out_dir.join(COUNTS_DIR); let mut mb = PersistentCompactIntMatrixBuilder::new(n, &counts_dir) .map_err(OLMError::Io)?; let mut col = mb.add_col().map_err(OLMError::Io)?; - let n_built = MphfLayer::build(out_dir, &mut |slot, kmer| { + let n_built = MphfLayer::build(out_dir, block_bits, &mut |slot, kmer| { col.set(slot, count_of(kmer)); Ok(()) })?; @@ -146,16 +144,16 @@ impl Layer { pub fn build_from_map( out_dir: &Path, + block_bits: u8, counts: &HashMap, ) -> OLMResult { - Self::build(out_dir, |kmer| counts.get(&kmer).copied().unwrap_or(0)) + Self::build(out_dir, block_bits, |kmer| counts.get(&kmer).copied().unwrap_or(0)) } } // ── Mode 2 — count matrix column append ────────────────────────────────────── impl Layer { - /// Append a genome column to an existing count matrix. pub fn append_genome_column( layer_dir: &Path, value_of: impl Fn(usize) -> u32, @@ -165,10 +163,9 @@ impl Layer { } } -// ── Mode 3 — presence/absence matrix (1 column per genome) ─────────────────── +// ── Mode 3 — presence/absence matrix ───────────────────────────────────────── impl Layer { - /// Append a genome column to an existing presence matrix. pub fn append_genome_column( layer_dir: &Path, value_of: impl Fn(usize) -> bool, @@ -179,6 +176,7 @@ impl Layer { pub fn build_presence( out_dir: &Path, + block_bits: u8, n_genomes: usize, present_in: impl Fn(CanonicalKmer, usize) -> bool, ) -> OLMResult { @@ -188,7 +186,7 @@ impl Layer { let mut cols: Vec<_> = (0..n_genomes) .map(|_| mb.add_col().map_err(OLMError::Io)) .collect::>()?; - let n_built = MphfLayer::build(out_dir, &mut |slot, kmer| { + let n_built = MphfLayer::build(out_dir, block_bits, &mut |slot, kmer| { for (g, col) in cols.iter_mut().enumerate() { col.set(slot, present_in(kmer, g)); } diff --git a/src/obilayeredmap/src/map.rs b/src/obilayeredmap/src/map.rs index e9cb591..da01aee 100644 --- a/src/obilayeredmap/src/map.rs +++ b/src/obilayeredmap/src/map.rs @@ -4,7 +4,7 @@ use std::path::{Path, PathBuf}; use obicompactvec::PersistentCompactIntMatrix; use obikseq::CanonicalKmer; -use obiskio::UnitigFileWriter; +use obiskio::{UnitigFileWriter, DEFAULT_BLOCK_BITS}; use crate::error::OLMResult; use crate::layer::{Hit, Layer, LayerData}; @@ -90,7 +90,7 @@ impl LayeredMap<()> { pub fn push_layer(&mut self) -> OLMResult { let i = self.layers.len(); let dir = layer_dir(&self.root, i); - Layer::<()>::build(&dir)?; + Layer::<()>::build(&dir, DEFAULT_BLOCK_BITS)?; self.append_layer()?; Ok(i) } @@ -102,7 +102,7 @@ impl LayeredMap { pub fn push_layer(&mut self, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult { let i = self.layers.len(); let dir = layer_dir(&self.root, i); - Layer::::build(&dir, count_of)?; + Layer::::build(&dir, DEFAULT_BLOCK_BITS, count_of)?; self.append_layer()?; Ok(i) } diff --git a/src/obilayeredmap/src/mphf_layer.rs b/src/obilayeredmap/src/mphf_layer.rs index d6b9510..3bac003 100644 --- a/src/obilayeredmap/src/mphf_layer.rs +++ b/src/obilayeredmap/src/mphf_layer.rs @@ -4,7 +4,7 @@ use std::path::Path; use cacheline_ef::{CachelineEf, CachelineEfVec}; use epserde::prelude::*; use obikseq::CanonicalKmer; -use obiskio::{UnitigFileReader, UnitigFileWriter, build_unitig_idx, DEFAULT_BLOCK_BITS}; +use obiskio::{UnitigFileReader, UnitigFileWriter, build_unitig_idx}; use ptr_hash::{PtrHash, PtrHashParams, bucket_fn::CubicEps, hash::Xx64}; use crate::error::{OLMError, OLMResult}; @@ -110,25 +110,26 @@ impl MphfLayer { } /// Dispatch to `build_exact_evidence` or `build_approx_evidence` based on - /// `kind`. - pub fn build_evidence(dir: &Path, kind: &EvidenceKind) -> OLMResult { + /// `kind`. `block_bits` is forwarded to exact evidence only. + pub fn build_evidence(dir: &Path, kind: &EvidenceKind, block_bits: u8) -> OLMResult { match kind { - EvidenceKind::Exact => Self::build_exact_evidence(dir), - EvidenceKind::Approx { b, z } => Self::build_approx_evidence(dir, *b, *z), + EvidenceKind::Exact => Self::build_exact_evidence(dir, block_bits), + EvidenceKind::Approx { b, z } => Self::build_approx_evidence(dir, *b, *z), } } /// Build `evidence.bin` + `unitigs.bin.idx` from `unitigs.bin` + `mphf.bin`. /// + /// `block_bits` controls the `.idx` block size (2^block_bits chunks per block). /// Uses sequential iteration — no `.idx` required on entry. - pub fn build_exact_evidence(dir: &Path) -> OLMResult { + pub fn build_exact_evidence(dir: &Path, block_bits: u8) -> OLMResult { let unitig_path = dir.join(UNITIGS_FILE); let unitigs = UnitigFileReader::open_sequential(&unitig_path)?; let n = unitigs.n_kmers(); if n == 0 { fs::File::create(dir.join(EVIDENCE_FILE))?; - build_unitig_idx(&unitig_path, DEFAULT_BLOCK_BITS)?; + build_unitig_idx(&unitig_path, block_bits)?; LayerMeta::exact().save(dir)?; return Ok(0); } @@ -154,7 +155,7 @@ impl MphfLayer { } ev.write(&dir.join(EVIDENCE_FILE))?; - build_unitig_idx(&unitig_path, DEFAULT_BLOCK_BITS)?; + build_unitig_idx(&unitig_path, block_bits)?; LayerMeta::exact().save(dir)?; Ok(n) } @@ -202,13 +203,14 @@ impl MphfLayer { /// population. Returns the number of kmers indexed. pub(crate) fn build( dir: &Path, + block_bits: u8, fill_slot: &mut impl FnMut(usize, CanonicalKmer) -> OLMResult<()>, ) -> OLMResult { use rayon::prelude::*; let unitig_path = dir.join(UNITIGS_FILE); - build_unitig_idx(&unitig_path, DEFAULT_BLOCK_BITS)?; + build_unitig_idx(&unitig_path, block_bits)?; let unitigs = UnitigFileReader::open(&unitig_path)?; let n = unitigs.n_kmers(); diff --git a/src/obilayeredmap/src/tests/layer.rs b/src/obilayeredmap/src/tests/layer.rs index 47f6e82..2a9c3da 100644 --- a/src/obilayeredmap/src/tests/layer.rs +++ b/src/obilayeredmap/src/tests/layer.rs @@ -1,6 +1,7 @@ use super::*; use obicompactvec::PersistentCompactIntMatrix; use obikseq::{set_k, Kmer, Sequence as _, Unitig}; +use obiskio::DEFAULT_BLOCK_BITS; use tempfile::tempdir; fn write_unitigs(dir: &Path, seqs: &[&[u8]]) { @@ -23,7 +24,7 @@ fn build_and_query_all_kmers_found() { let dir = tempdir().unwrap(); write_unitigs(dir.path(), &[b"AAAACGT"]); let kmers = all_canonical_kmers(dir.path()); - Layer::<()>::build(dir.path()).unwrap(); + Layer::<()>::build(dir.path(), DEFAULT_BLOCK_BITS).unwrap(); let layer = Layer::<()>::open(dir.path()).unwrap(); for kmer in kmers { assert!(layer.query(kmer).is_some(), "kmer should be present"); @@ -40,6 +41,7 @@ fn counts_are_stored_and_retrieved() { kmers.iter().enumerate().map(|(i, &k)| (k, i as u32 + 1)).collect(); Layer::::build( dir.path(), + DEFAULT_BLOCK_BITS, |kmer| count_map.get(&kmer).copied().unwrap_or(0), ).unwrap(); let layer = Layer::::open(dir.path()).unwrap(); @@ -54,7 +56,7 @@ fn query_absent_returns_none() { set_k(4); let dir = tempdir().unwrap(); write_unitigs(dir.path(), &[b"AAAACGT"]); - Layer::<()>::build(dir.path()).unwrap(); + Layer::<()>::build(dir.path(), DEFAULT_BLOCK_BITS).unwrap(); let layer = Layer::<()>::open(dir.path()).unwrap(); let absent = Kmer::from_ascii(b"CCCC").unwrap().canonical(); assert!(layer.query(absent).is_none()); @@ -65,7 +67,7 @@ fn open_after_build_is_consistent() { set_k(4); let dir = tempdir().unwrap(); write_unitigs(dir.path(), &[b"AAAACGT"]); - let n = Layer::::build(dir.path(), |_| 7).unwrap(); + let n = Layer::::build(dir.path(), DEFAULT_BLOCK_BITS, |_| 7).unwrap(); assert_eq!(n, 4); let layer = Layer::::open(dir.path()).unwrap(); let kmer = Kmer::from_ascii(b"AAAA").unwrap().canonical(); diff --git a/src/obiskio/src/unitig_index.rs b/src/obiskio/src/unitig_index.rs index 75535ae..eb3fa40 100644 --- a/src/obiskio/src/unitig_index.rs +++ b/src/obiskio/src/unitig_index.rs @@ -21,7 +21,7 @@ use crate::error::{SKError, SKResult}; const MAGIC: [u8; 4] = *b"UIX3"; /// Default block granularity used by [`UnitigFileWriter::create`]. -pub const DEFAULT_BLOCK_BITS: u8 = 6; +pub const DEFAULT_BLOCK_BITS: u8 = 0; fn idx_path(path: &Path) -> PathBuf { crate::append_path_suffix(path, ".idx") @@ -325,6 +325,11 @@ impl UnitigFileReader { }) } + /// Iterate all unitigs sequentially. Works without `.idx` (sequential open). + pub fn iter_unitigs(&self) -> impl Iterator + '_ { + self.iter_chunks_sequential() + } + pub fn iter_kmers(&self) -> impl Iterator + '_ { self.iter_chunks_sequential() .flat_map(|(_, u)| u.into_kmers()) -- 2.52.0 From b2a52bfb3763fc3d82f2d8c707b89e14ee4ecabb Mon Sep 17 00:00:00 2001 From: Eric Coissac Date: Sat, 23 May 2026 13:06:07 +0200 Subject: [PATCH 08/10] perf: optimize chunk_start for single-block indexing Bypasses bitwise shift and mask operations when `block_bits == 0`, directly indexing `self.block_offsets[i]` instead. This eliminates unnecessary arithmetic overhead for single-block cases while preserving the original block-based offset calculation for larger block sizes. --- src/obiskio/src/unitig_index.rs | 3 +++ 1 file changed, 3 insertions(+) diff --git a/src/obiskio/src/unitig_index.rs b/src/obiskio/src/unitig_index.rs index eb3fa40..53bdaa5 100644 --- a/src/obiskio/src/unitig_index.rs +++ b/src/obiskio/src/unitig_index.rs @@ -271,6 +271,9 @@ impl UnitigFileReader { fn chunk_start(&self, i: usize) -> usize { assert!(!self.block_offsets.is_empty(), "random access requires UnitigFileReader::open(); use open_sequential() for iteration only"); + if self.block_bits == 0 { + return self.block_offsets[i] as usize; + } let block = i >> self.block_bits; let rem = i & self.mask; let mut offset = self.block_offsets[block] as usize; -- 2.52.0 From b7db3a33ed9bbfeec9e392a4078fdb56b3c9cc73 Mon Sep 17 00:00:00 2001 From: Eric Coissac Date: Sat, 23 May 2026 13:19:31 +0200 Subject: [PATCH 09/10] docs: add coverage reference files and flag architectural drift These files catalog test coverage for Rust modules across architecture, implementation, and theory sections. They track recent structural changes, flag areas prone to documentation drift, and mandate verification of key parameters and routing logic to maintain alignment with the active codebase. --- docmd/architecture/index_architecture.refs.md | 22 +++++++++++++++++++ docmd/architecture/query.refs.md | 16 ++++++++++++++ .../architecture/sequences/invariant.refs.md | 12 ++++++++++ docmd/implementation/chunkreader.refs.md | 12 ++++++++++ .../evidence_elimination.refs.md | 22 +++++++++++++++++++ docmd/implementation/kmer.refs.md | 13 +++++++++++ docmd/implementation/merge.refs.md | 19 ++++++++++++++++ docmd/implementation/mphf.refs.md | 16 ++++++++++++++ docmd/implementation/obilayeredmap.refs.md | 22 +++++++++++++++++++ docmd/implementation/obipipeline.refs.md | 13 +++++++++++ .../implementation/persistent_bit_vec.refs.md | 13 +++++++++++ .../persistent_compact_int_vec.refs.md | 14 ++++++++++++ docmd/implementation/pipeline.refs.md | 19 ++++++++++++++++ docmd/implementation/storage.refs.md | 18 +++++++++++++++ docmd/implementation/superkmer.refs.md | 13 +++++++++++ docmd/implementation/unitig_evidence.refs.md | 18 +++++++++++++++ docmd/index.refs.md | 13 +++++++++++ docmd/kmers.refs.md | 13 +++++++++++ docmd/theory/encoding.refs.md | 11 ++++++++++ docmd/theory/entropy.refs.md | 12 ++++++++++ docmd/theory/indexing.refs.md | 12 ++++++++++ docmd/theory/minimizer.refs.md | 12 ++++++++++ 22 files changed, 335 insertions(+) create mode 100644 docmd/architecture/index_architecture.refs.md create mode 100644 docmd/architecture/query.refs.md create mode 100644 docmd/architecture/sequences/invariant.refs.md create mode 100644 docmd/implementation/chunkreader.refs.md create mode 100644 docmd/implementation/evidence_elimination.refs.md create mode 100644 docmd/implementation/kmer.refs.md create mode 100644 docmd/implementation/merge.refs.md create mode 100644 docmd/implementation/mphf.refs.md create mode 100644 docmd/implementation/obilayeredmap.refs.md create mode 100644 docmd/implementation/obipipeline.refs.md create mode 100644 docmd/implementation/persistent_bit_vec.refs.md create mode 100644 docmd/implementation/persistent_compact_int_vec.refs.md create mode 100644 docmd/implementation/pipeline.refs.md create mode 100644 docmd/implementation/storage.refs.md create mode 100644 docmd/implementation/superkmer.refs.md create mode 100644 docmd/implementation/unitig_evidence.refs.md create mode 100644 docmd/index.refs.md create mode 100644 docmd/kmers.refs.md create mode 100644 docmd/theory/encoding.refs.md create mode 100644 docmd/theory/entropy.refs.md create mode 100644 docmd/theory/indexing.refs.md create mode 100644 docmd/theory/minimizer.refs.md diff --git a/docmd/architecture/index_architecture.refs.md b/docmd/architecture/index_architecture.refs.md new file mode 100644 index 0000000..ff9e3c0 --- /dev/null +++ b/docmd/architecture/index_architecture.refs.md @@ -0,0 +1,22 @@ + +# Coverage: architecture/index_architecture.md + +## Code couvert + +- `obilayeredmap/src/layer.rs` — Layer, trait LayerData, modes () / PersistentCompactIntMatrix / PersistentBitMatrix +- `obilayeredmap/src/mphf_layer.rs` — MphfLayer, EvidenceKind (Exact / Approx), LayerEvidence enum +- `obilayeredmap/src/map.rs` — LayeredMap +- `obilayeredmap/src/meta.rs` — LayerMeta, PartitionMeta +- `obikindex/src/meta.rs` — IndexConfig (kmer_size, n_bits, with_counts, evidence, block_bits), IndexMeta +- `obikindex/src/index.rs` — KmerIndex, build_layers +- `obicompactvec/src/` — PersistentCompactIntMatrix, PersistentBitMatrix (DataStore implementations) + +## Notes + +FORT RISQUE DE DÉRIVE. Nombreux changements récents : +- Ajout de `EvidenceKind` (Exact / Approx { b, z }) dans `IndexConfig` et `LayerMeta` +- Ajout de `block_bits` dans `IndexConfig` +- `LayerEvidence` enum dans `mphf_layer.rs` remplace l'ancienne approche monolithique +- Distinction `open()` vs `open_sequential()` dans `UnitigFileReader` +- Commandes `reindex` et `estimate` ajoutées +Vérifier que la hiérarchie à 3 niveaux décrite est toujours exacte et que les nouveaux paramètres sont documentés. diff --git a/docmd/architecture/query.refs.md b/docmd/architecture/query.refs.md new file mode 100644 index 0000000..2dcbc50 --- /dev/null +++ b/docmd/architecture/query.refs.md @@ -0,0 +1,16 @@ + +# Coverage: architecture/query.md + +## Code couvert + +- `obikmer/src/cmd/query.rs` — commande query, format de sortie +- `obikpartitionner/src/query_layer.rs` — routage de la requête à travers les partitions +- `obiread/src/lib.rs` — lecture des séquences d'entrée pour la requête + +## Notes + +RISQUE DE DÉRIVE. Vérifier : +- La commande `unitig` a été modifiée pour utiliser `open_sequential()` — vérifier si query est concerné +- `find_exact` / `find_approx` / `find` générique ont été ajoutés dans `MphfLayer` — le chemin de requête a changé +- Si l'index est approximatif (Approx), la requête peut produire des faux positifs : la doc le mentionne-t-elle ? +- Format de sortie CSV (`obikindex/src/csv.rs` ou équivalent) à vérifier diff --git a/docmd/architecture/sequences/invariant.refs.md b/docmd/architecture/sequences/invariant.refs.md new file mode 100644 index 0000000..52130f3 --- /dev/null +++ b/docmd/architecture/sequences/invariant.refs.md @@ -0,0 +1,12 @@ + +# Coverage: architecture/sequences/invariant.md + +## Code couvert + +- `obikseq/src/sequence.rs` — invariants de représentation des séquences (ACGT, longueur max) +- `obikseq/src/unitig.rs` — type Unitig, contrainte MAX_KMERS_PER_CHUNK (255 kmers par chunk) + +## Notes + +Document court et stable. Vérifier que la limite de 256 nucléotides (ou 255 kmers) par chunk +est toujours la même dans `obiskio::MAX_KMERS_PER_CHUNK`. diff --git a/docmd/implementation/chunkreader.refs.md b/docmd/implementation/chunkreader.refs.md new file mode 100644 index 0000000..99be9cf --- /dev/null +++ b/docmd/implementation/chunkreader.refs.md @@ -0,0 +1,12 @@ + +# Coverage: implementation/chunkreader.md + +## Code couvert + +- `obiread/src/chunk.rs` — SeqChunkIter, détection de frontières FASTA/FASTQ, state machines +- `obikrope/src/lib.rs` — type Rope (Vec), opérations zero-copy + +## Notes + +Document stable (la stratégie de chunking rope ne devrait pas avoir changé). +Vérifier que le split FASTA/FASTQ reste correct si de nouveaux formats ont été ajoutés. diff --git a/docmd/implementation/evidence_elimination.refs.md b/docmd/implementation/evidence_elimination.refs.md new file mode 100644 index 0000000..0fedcb3 --- /dev/null +++ b/docmd/implementation/evidence_elimination.refs.md @@ -0,0 +1,22 @@ + +# Coverage: implementation/evidence_elimination.md + +## Code couvert + +- `obilayeredmap/src/fingerprint.rs` — FingerprintVec, FingerprintVecWriter, stockage b bits/slot, matches() +- `obilayeredmap/src/mphf_layer.rs` — build_approx_evidence(dir, b, z), find_approx() +- `obilayeredmap/src/meta.rs` — EvidenceKind::Approx { b, z }, LayerMeta +- `obikindex/src/reindex.rs` — KmerIndex::reindex(), conversion exact↔approx en place +- `obikmer/src/cmd/reindex.rs` — CLI reindex, options --approx, -z, --evidence-bits, --fp, --block-size +- `obikmer/src/cmd/index.rs` — resolve_approx_params(), options --approx, -z, --evidence-bits, --fp +- `obikmer/src/cmd/estimate.rs` — commande estimate (dry-run des paramètres) + +## Notes + +Ce document était à l'origine une discussion de design (4 approches). L'implémentation +a maintenant convergé vers l'approche fingerprint (Findere-style). +FORT RISQUE DE DÉRIVE — le contenu est probablement un mélange de design et d'implémentation : +- Le modèle FP = 1/2^(b·z) et les règles de résolution (2-of-3 parmi b, z, fp) sont implémentés +- La commande `reindex` permet la conversion a posteriori exact↔approx +- La commande `estimate` fait le dry-run des paramètres +Cette page doit être réécrite pour documenter l'implémentation Findere réelle plutôt que les alternatives abandonnées. diff --git a/docmd/implementation/kmer.refs.md b/docmd/implementation/kmer.refs.md new file mode 100644 index 0000000..747d5e3 --- /dev/null +++ b/docmd/implementation/kmer.refs.md @@ -0,0 +1,13 @@ + +# Coverage: implementation/kmer.md + +## Code couvert + +- `obikseq/src/kmer.rs` — layout mémoire (repr(transparent) u64), encodage/décodage, revcomp, forme canonique +- `obikseq/src/params.rs` — k global (set_k / k()) + +## Notes + +Document d'implémentation stable. L'algorithme de revcomp bit-à-bit est décrit — +vérifier qu'il correspond à `revcomp_raw` dans `obiskio/src/unitig_index.rs` (copie locale) +et à l'implémentation dans `obikseq/src/kmer.rs`. diff --git a/docmd/implementation/merge.refs.md b/docmd/implementation/merge.refs.md new file mode 100644 index 0000000..a9ffa07 --- /dev/null +++ b/docmd/implementation/merge.refs.md @@ -0,0 +1,19 @@ + +# Coverage: implementation/merge.md + +## Code couvert + +- `obikindex/src/merge.rs` — `KmerIndex::merge()`, validation de compatibilité d'évidence, `validate_evidence_compat()` +- `obikpartitionner/src/merge_layer.rs` — `merge_partition()`, construction de la nouvelle layer, paramètre `block_bits` +- `obikpartitionner/src/rebuild_layer.rs` — `rebuild_partition()`, paramètre `block_bits` +- `obilayeredmap/src/layer.rs` — `Layer::append_genome_column()` (PersistentCompactIntMatrix et PersistentBitMatrix) +- `obicompactvec/src/intmatrix.rs` — `append_column` pour PersistentCompactIntMatrix +- `obicompactvec/src/bitmatrix.rs` — `append_column` pour PersistentBitMatrix + +## Notes + +FORT RISQUE DE DÉRIVE. Changements récents : +- Ajout de la validation de compatibilité d'évidence : merge exact+approx → erreur (OKIError::IncompatibleEvidence) +- `merge_partition` reçoit maintenant `block_bits: u8` +- La commande `reindex` a été ajoutée comme outil de conversion exact↔approx avant merge +Vérifier que la doc décrit la politique de merge mixed-evidence et le recours à `reindex`. diff --git a/docmd/implementation/mphf.refs.md b/docmd/implementation/mphf.refs.md new file mode 100644 index 0000000..fd56c80 --- /dev/null +++ b/docmd/implementation/mphf.refs.md @@ -0,0 +1,16 @@ + +# Coverage: implementation/mphf.md + +## Code couvert + +- `obilayeredmap/src/mphf_layer.rs` — type Mphf (PtrHash + CubicEps + CachelineEfVec + Xx64), construction en 2 passes, `build()`, `build_exact_evidence()`, `build_approx_evidence()`, `build_evidence()` +- `obikpartitionner/src/index_layer.rs` — `build_index_layer()` avec passage de `block_bits` + +## Notes + +FORT RISQUE DE DÉRIVE. Changements récents : +- `build_exact_evidence(dir, block_bits)` — `block_bits` maintenant paramétrisé (défaut 0) +- `build_approx_evidence(dir, b, z)` — nouvelle fonction pour l'évidence fingerprint +- `build_evidence(dir, kind, block_bits)` — dispatch selon EvidenceKind +- Construction en 2 phases : pass 1 (Rayon parallèle) + pass 2 (callback `fill_slot`) +Vérifier que la doc décrit correctement les deux nouvelles routes d'évidence et le paramètre `block_bits`. diff --git a/docmd/implementation/obilayeredmap.refs.md b/docmd/implementation/obilayeredmap.refs.md new file mode 100644 index 0000000..5340ab3 --- /dev/null +++ b/docmd/implementation/obilayeredmap.refs.md @@ -0,0 +1,22 @@ + +# Coverage: implementation/obilayeredmap.md + +## Code couvert + +- `obilayeredmap/src/mphf_layer.rs` — MphfLayer, LayerEvidence enum (Exact/Approx), find(), find_exact(), find_approx() +- `obilayeredmap/src/layer.rs` — Layer, trait LayerData, modes () / PersistentCompactIntMatrix / PersistentBitMatrix, build(), build_evidence(), append_genome_column() +- `obilayeredmap/src/map.rs` — LayeredMap, push_layer(), query() +- `obilayeredmap/src/evidence.rs` — Evidence, EvidenceWriter, encodage chunk_id:rank +- `obilayeredmap/src/fingerprint.rs` — FingerprintVec, FingerprintVecWriter, matches() +- `obilayeredmap/src/meta.rs` — LayerMeta, EvidenceKind (Exact / Approx { b, z }) + +## Notes + +FORT RISQUE DE DÉRIVE. C'est le fichier le plus affecté par les changements récents : +- EvidenceKind (Exact / Approx) est désormais un concept de premier plan — toute la sémantique de query en dépend +- `LayerEvidence` enum interne à `MphfLayer` : dispatch transparent find() → find_exact() ou find_approx() +- `fingerprint.rs` : module entièrement nouveau (FingerprintVec + FingerprintVecWriter) +- `build_evidence()` / `build_exact_evidence()` / `build_approx_evidence()` sont nouveaux +- `block_bits` dans les fonctions build : O(1) garanti avec le chemin chaud explicit pour block_bits=0 +- Séparation open() (accès aléatoire, requiert .idx) vs open_sequential() (itération seule) +Pratiquement toute cette page est à réécrire. diff --git a/docmd/implementation/obipipeline.refs.md b/docmd/implementation/obipipeline.refs.md new file mode 100644 index 0000000..82fb2e8 --- /dev/null +++ b/docmd/implementation/obipipeline.refs.md @@ -0,0 +1,13 @@ + +# Coverage: implementation/obipipeline.md + +## Code couvert + +- `obipipeline/src/lib.rs` — WorkerPool, Pipeline, macro make_pipeline! +- `obipipeline/src/scheduler.rs` — Scheduler avec Select biaisé sur les entrées de canaux + +## Notes + +Document stable (librairie générique, peu de risque de dérive). +Vérifier si `obipipeline` est toujours utilisé dans la phase scatter de `obikpartitionner` +ou s'il a été remplacé par Rayon dans certains chemins. diff --git a/docmd/implementation/persistent_bit_vec.refs.md b/docmd/implementation/persistent_bit_vec.refs.md new file mode 100644 index 0000000..67fed55 --- /dev/null +++ b/docmd/implementation/persistent_bit_vec.refs.md @@ -0,0 +1,13 @@ + +# Coverage: implementation/persistent_bit_vec.md + +## Code couvert + +- `obicompactvec/src/bitvec.rs` — PersistentBitVec, opérations mot u64, invariant de padding +- `obicompactvec/src/bitmatrix.rs` — PersistentBitMatrix, wrapper colonne-major, append_column +- `obicompactvec/src/bitmatrix.rs` — PersistentBitMatrixBuilder + +## Notes + +Document d'implémentation stable. Vérifier que `PersistentBitMatrixBuilder` et `append_column` +sont couverts (utilisés dans `Layer::::build_presence` et `append_genome_column`). diff --git a/docmd/implementation/persistent_compact_int_vec.refs.md b/docmd/implementation/persistent_compact_int_vec.refs.md new file mode 100644 index 0000000..c5da1b0 --- /dev/null +++ b/docmd/implementation/persistent_compact_int_vec.refs.md @@ -0,0 +1,14 @@ + +# Coverage: implementation/persistent_compact_int_vec.md + +## Code couvert + +- `obicompactvec/src/builder.rs` — PersistentCompactIntVecBuilder, cycle de vie +- `obicompactvec/src/reader.rs` — PersistentCompactIntVec, accès aléatoire et séquentiel +- `obicompactvec/src/intmatrix.rs` — PersistentCompactIntMatrix, wrapper colonne-major, append_column +- `obicompactvec/src/format.rs` — format de fichier (magic PCIV, header, primary u8, overflow, index) + +## Notes + +Document d'implémentation stable. Vérifier que `append_column` (utilisé dans merge et reindex) +est décrit. Vérifier que `PersistentCompactIntMatrixBuilder` est couvert (utilisé dans `layer.rs`). diff --git a/docmd/implementation/pipeline.refs.md b/docmd/implementation/pipeline.refs.md new file mode 100644 index 0000000..a5656a0 --- /dev/null +++ b/docmd/implementation/pipeline.refs.md @@ -0,0 +1,19 @@ + +# Coverage: implementation/pipeline.md + +## Code couvert + +- `obikpartitionner/src/partition.rs` — estimation des paramètres (phase 0) +- `obiskbuilder/src/iter.rs` — scatter : filtre entropie, extraction superkmers, routage partition (phase 1) +- `obikpartitionner/src/filter.rs` — déduplication bucket-sort (phase 2) +- `obikpartitionner/src/kmer_sort.rs` — tri externe + agrégation de comptages (phase 3) +- `obidebruinj/src/debruijn.rs` — graphe De Bruijn, extraction des unitigs (phase 5) +- `obikpartitionner/src/index_layer.rs` — construction MPHF + évidence (phase 6), paramètre `block_bits` +- `obikindex/src/index.rs` — `build_layers()`, `dereplicate_and_count()` + +## Notes + +RISQUE DE DÉRIVE modéré. Vérifier : +- Phase 6 : la doc mentionne-t-elle le filtre d'abondance (`min_ab`, `max_ab`) ? +- Phase 6 : `block_bits` passé à `build_index_layer` depuis `IndexConfig` +- Phase 6 : dispatch exact/approx selon `EvidenceKind` dans `build_index_layer` diff --git a/docmd/implementation/storage.refs.md b/docmd/implementation/storage.refs.md new file mode 100644 index 0000000..6987401 --- /dev/null +++ b/docmd/implementation/storage.refs.md @@ -0,0 +1,18 @@ + +# Coverage: implementation/storage.md + +## Code couvert + +- `obikindex/src/meta.rs` — IndexMeta, IndexConfig (version, config, genomes) +- `obikindex/src/index.rs` — layout sur disque : partitions/, index.meta +- `obilayeredmap/src/meta.rs` — LayerMeta (evidence kind), PartitionMeta (n_layers) +- `obiskio/src/unitig_index.rs` — fichiers unitigs.bin + unitigs.bin.idx + +## Notes + +FORT RISQUE DE DÉRIVE. Nombreux champs ajoutés : +- `IndexConfig` : champs `evidence` (EvidenceKind) et `block_bits` ajoutés +- Nouveau fichier `fingerprint.bin` pour l'évidence approximative +- `LayerMeta` / `layer_meta.json` introduit pour stocker EvidenceKind par layer +- Structure du répertoire layer : `evidence.bin` vs `fingerprint.bin` selon le mode +Mettre à jour le schéma de layout sur disque en conséquence. diff --git a/docmd/implementation/superkmer.refs.md b/docmd/implementation/superkmer.refs.md new file mode 100644 index 0000000..8aedb67 --- /dev/null +++ b/docmd/implementation/superkmer.refs.md @@ -0,0 +1,13 @@ + +# Coverage: implementation/superkmer.md + +## Code couvert + +- `obikseq/src/superkmer.rs` — layout mémoire SuperKmer (header 32 bits + séquence byte-alignée), encodage ASCII, revcomp, deque minimiseur +- `obiskbuilder/src/lib.rs` — fenêtre glissante monotone pour le maintien du minimiseur + +## Notes + +Document d'implémentation détaillé. Vérifier que le layout header (longueur, orientation, +position minimiseur) n'a pas changé. La doc mentionne un revcomp SIMD — vérifier si c'est +toujours le cas ou si l'implémentation est scalaire. diff --git a/docmd/implementation/unitig_evidence.refs.md b/docmd/implementation/unitig_evidence.refs.md new file mode 100644 index 0000000..d25b200 --- /dev/null +++ b/docmd/implementation/unitig_evidence.refs.md @@ -0,0 +1,18 @@ + +# Coverage: implementation/unitig_evidence.md + +## Code couvert + +- `obiskio/src/unitig_index.rs` — format unitigs.bin + unitigs.bin.idx, UnitigFileWriter, UnitigFileReader, build_unitig_idx(), DEFAULT_BLOCK_BITS=0, chemin chaud block_bits=0 dans chunk_start() +- `obilayeredmap/src/evidence.rs` — encodage Evidence (chunk_id 25 bits | rank 7 bits), EvidenceWriter +- `obidebruinj/src/debruijn.rs` — extraction unitigs, chunking à MAX_KMERS_PER_CHUNK + +## Notes + +FORT RISQUE DE DÉRIVE. Changements récents : +- `DEFAULT_BLOCK_BITS` est passé de 6 à 0 (accès O(1) par défaut) +- `block_bits` est maintenant un paramètre runtime de `build_unitig_idx()` et `UnitigFileWriter` +- `chunk_start()` a un chemin chaud explicite pour block_bits=0 (accès tableau direct, 0 scan) +- `open()` vs `open_sequential()` : distinction nouvelle, importante pour la compréhension du coût +- `iter_unitigs()` ajouté comme alias public de `iter_chunks_sequential()` +Mettre à jour la description du format .idx et le modèle de coût d'accès aléatoire. diff --git a/docmd/index.refs.md b/docmd/index.refs.md new file mode 100644 index 0000000..60e0163 --- /dev/null +++ b/docmd/index.refs.md @@ -0,0 +1,13 @@ + +# Coverage: index.md + +## Code couvert + +- `obikmer/src/main.rs` — point d'entrée CLI, contraintes globales +- `obikmer/src/cli.rs` — structure des arguments communs + +## Notes + +Document de niveau projet (vue d'ensemble, motivations, contraintes fondamentales). +Pas de code Rust spécifique à vérifier au-delà des contraintes générales (k impair, formats d'entrée). +À mettre à jour si de nouvelles sous-commandes ou formats sont ajoutés. diff --git a/docmd/kmers.refs.md b/docmd/kmers.refs.md new file mode 100644 index 0000000..4dbe0a3 --- /dev/null +++ b/docmd/kmers.refs.md @@ -0,0 +1,13 @@ + +# Coverage: kmers.md + +## Code couvert + +- `obikseq/src/kmer.rs` — type Kmer, propriétés, forme canonique +- `obikseq/src/superkmer.rs` — type SuperKmer, longueur attendue +- `obiskbuilder/src/lib.rs` — extraction de superkmers par minimiseur + +## Notes + +Chevauche `theory/encoding.md` (encodage 2 bits) et `theory/minimizer.md` (choix du minimiseur). +Vérifier que la définition de SuperKmer est cohérente avec les invariants actuels de `obikseq`. diff --git a/docmd/theory/encoding.refs.md b/docmd/theory/encoding.refs.md new file mode 100644 index 0000000..125b588 --- /dev/null +++ b/docmd/theory/encoding.refs.md @@ -0,0 +1,11 @@ + +# Coverage: theory/encoding.md + +## Code couvert + +- `obikseq/src/kmer.rs` — encodage 2 bits/base, revcomp, forme canonique + +## Notes + +Document purement théorique. Peu de risque de dérive sauf si l'encodage interne de Kmer change. +Vérifier que la table d'encodage A=00, C=01, G=10, T=11 est toujours celle du code. diff --git a/docmd/theory/entropy.refs.md b/docmd/theory/entropy.refs.md new file mode 100644 index 0000000..79ca55c --- /dev/null +++ b/docmd/theory/entropy.refs.md @@ -0,0 +1,12 @@ + +# Coverage: theory/entropy.md + +## Code couvert + +- `obiskbuilder/src/entropy_table.rs` — filtre Shannon sur les kmers à basse complexité +- `obiskbuilder/src/lib.rs` — application du filtre lors du scatter (phase 1) + +## Notes + +Document théorique stable. Vérifier que les paramètres `theta` et `level_max` dans le CLI +(`obikmer/src/cli.rs` → `CommonArgs`) correspondent bien à ce qui est décrit. diff --git a/docmd/theory/indexing.refs.md b/docmd/theory/indexing.refs.md new file mode 100644 index 0000000..444abff --- /dev/null +++ b/docmd/theory/indexing.refs.md @@ -0,0 +1,12 @@ + +# Coverage: theory/indexing.md + +## Code couvert + +- `obikpartitionner/src/partition.rs` — routage par hash de minimiseur, choix des paramètres +- `obikpartitionner/src/lib.rs` — structure KmerPartition, nombre de partitions + +## Notes + +Vérifier que la doc mentionne bien que le nombre de partitions est une puissance de 2 +(converti par `partitions_to_bits` dans `obikmer/src/cli.rs`). diff --git a/docmd/theory/minimizer.refs.md b/docmd/theory/minimizer.refs.md new file mode 100644 index 0000000..277a89f --- /dev/null +++ b/docmd/theory/minimizer.refs.md @@ -0,0 +1,12 @@ + +# Coverage: theory/minimizer.md + +## Code couvert + +- `obiskbuilder/src/lib.rs` — sélection du minimiseur par hash seedé (splitmix64 finalizer) +- `obikseq/src/superkmer.rs` — forme canonique du minimiseur, fenêtre glissante + +## Notes + +Vérifier que la fonction de hash décrite (splitmix64 finalizer avec graine) correspond +au code actuel. Vérifier aussi que la définition de « minimiseur canonique » est toujours cohérente. -- 2.52.0 From da56c3e2907f6df1686c391b6e60b440338e10fd Mon Sep 17 00:00:00 2001 From: Eric Coissac Date: Sat, 23 May 2026 13:24:25 +0200 Subject: [PATCH 10/10] docs: update architecture and storage specs for approximate index Restructure architecture documentation to reflect the decoupled `MphfLayer` design wrapped by `LayeredStore` and enforce strict multi-genome column invariants. Introduce the approximate index architecture, replacing exact `evidence.bin` with compact `fingerprint.bin` using B-bit fingerprints and z-consecutive k-mer matching. Update CLI flags, add `reindex`/`estimate` workflows, and refactor APIs to support separate exact/approximate evidence handling. Finally, provide a comprehensive on-disk layout specification, including the pipeline state machine, JSON schemas, binary formats, and refined Strategy B unitig evidence details. --- docmd/architecture/index_architecture.md | 491 +++++++++---------- docmd/implementation/evidence_elimination.md | 417 ++++++---------- docmd/implementation/obilayeredmap.md | 345 ++++++++----- docmd/implementation/storage.md | 137 +++++- docmd/implementation/unitig_evidence.md | 306 ++++-------- 5 files changed, 814 insertions(+), 882 deletions(-) diff --git a/docmd/architecture/index_architecture.md b/docmd/architecture/index_architecture.md index 0915a9c..752334a 100644 --- a/docmd/architecture/index_architecture.md +++ b/docmd/architecture/index_architecture.md @@ -2,169 +2,155 @@ ## Fundamental invariant -A given canonical kmer belongs to **exactly one partition** and **exactly one layer** within that partition. This is the property that makes all aggregation operations decomposable and parallelisable without coordination. +A given canonical kmer belongs to **exactly one partition** and **exactly one layer** within that partition. This property makes all aggregation operations decomposable and parallelisable without coordination. --- ## Three-level hierarchy ``` -PartitionedIndex -├── LayeredPartition (one per minimiser bucket) -│ ├── MphfLayer 0 kmer → slot (immutable bijection) -│ │ ├── DataStore A slot → T (e.g. counts) -│ │ └── DataStore B slot → T (e.g. presence/absence, derived) -│ ├── MphfLayer 1 -│ │ └── DataStore A -│ └── ... -├── LayeredPartition -│ └── ... +KmerIndex (index.meta + KmerPartition) +├── partition_0/index/ one directory per minimiser bucket +│ ├── meta.json PartitionMeta { n_layers } +│ ├── layer_0/ +│ │ ├── layer_meta.json LayerMeta { evidence: EvidenceKind } +│ │ ├── mphf.bin PtrHash MPHF +│ │ ├── unitigs.bin unitig spine (never overwritten) +│ │ ├── evidence.bin exact evidence (Exact only) +│ │ ├── unitigs.bin.idx block index (Exact only) +│ │ ├── fingerprint.bin fingerprints (Approx only) +│ │ ├── counts/ PersistentCompactIntMatrix (with_counts = true) +│ │ └── presence/ PersistentBitMatrix +│ └── layer_1/ +│ └── ... +└── partition_1/index/ + └── ... ``` -**PartitionedIndex**: routes queries to partitions via canonical minimiser hash. Owns the partition count and routing scheme (fixed at creation). Dispatches aggregations across partitions in parallel. +**KmerIndex**: root entry point. Owns `IndexMeta` (written to `index.meta`) and a `KmerPartition` that routes canonical kmers to partition directories. All partition-level operations are dispatched in parallel via rayon. -**LayeredPartition**: one directory per minimiser bucket. Holds a `Vec`. Each layer covers a disjoint kmer set — layer 0 is built from dataset A; layer 1 covers kmers in B absent from layer 0; and so on. Layers within a partition are always disjoint. +**Partition directory**: one directory per minimiser bucket. `PartitionMeta` (stored as `meta.json`) records `n_layers`. Layers within a partition cover disjoint kmer sets. -**MphfLayer**: the MPHF + evidence + unitig spine. Maps `kmer → slot` for its disjoint kmer set. Immutable once built. Independent of any data attached to it. - -**DataStore**: a slot-indexed data array (e.g. `PersistentCompactIntMatrix`, `PersistentBitMatrix`). Attached to a `MphfLayer` externally. Multiple stores of different types can coexist on the same `MphfLayer`. +**Layer directory**: one `MphfLayer` plus optional data stores. `LayerMeta` (stored as `layer_meta.json`) records which `EvidenceKind` was used. The MPHF and `unitigs.bin` are immutable once built; evidence files are the only part replaced by `reindex`. --- -## MphfLayer — autonomous mapping layer +## IndexConfig and IndexMeta ```rust -MphfLayer::find(kmer: CanonicalKmer) -> Option // slot, or None if absent -MphfLayer::n() -> usize // number of slots -MphfLayer::build(dir: &Path) -> OLMResult<(Self, usize)> // from unitigs.bin -MphfLayer::open(dir: &Path) -> OLMResult +pub struct IndexConfig { + pub kmer_size: usize, + pub minimizer_size: usize, + pub n_bits: usize, // log2(n_partitions) + pub with_counts: bool, + pub evidence: EvidenceKind, + pub block_bits: u8, // .idx granularity: 2^block_bits unitigs/block; 0 = one entry per unitig +} + +pub struct IndexMeta { + pub version: u32, + pub config: IndexConfig, + pub genomes: Vec, // ordered; index = genome column number +} + +pub struct GenomeInfo { + pub label: String, + pub meta: HashMap, // arbitrary categorical metadata +} ``` -`find` returns `Some(slot)` only if the kmer is actually in this layer (evidence check included). Returns `None` for kmers present in other layers or absent from the index. +`IndexMeta` is serialised as `index.meta` (JSON). It is the authority for the ordered list of genomes and for the parameters that govern all subsequent operations on the index. -The MPHF (`mphf.bin`, `evidence.bin`, `unitigs.bin`) is built once and never rebuilt. All data derivation operations (count → presence, thresholding, merging) reuse the same `MphfLayer`. +--- + +## EvidenceKind + +```rust +pub enum EvidenceKind { + Exact, + Approx { b: u8, z: u8 }, +} +``` + +Controls which files are written per layer and which query path is taken: + +| Variant | Files written | False-positive rate | +|---|---|---| +| `Exact` | `evidence.bin`, `unitigs.bin.idx` | 0 | +| `Approx { b, z }` | `fingerprint.bin` | ≈ W / 2^(b·z) per read (Findere) | + +`EvidenceKind` is stored both in `IndexConfig` (index-wide default, updated by `reindex`) and in each `LayerMeta` (per-layer record of what was actually built). + +--- + +## MphfLayer — autonomous kmer → slot mapping + +```rust +pub struct MphfLayer { + mphf: PtrHash<…>, + ev: LayerEvidence, // Exact { evidence, unitigs } | Approx { fingerprint } + n: usize, +} +``` + +`MphfLayer::find(kmer)` dispatches transparently to `find_exact` or `find_approx` based on the evidence loaded at `open` time (read from `layer_meta.json`). Returns `Some(slot)` only if the kmer is confirmed present; `None` for absent or out-of-range. + +``` +find_exact: slot = mphf(kmer); decode evidence → (chunk_id, rank); verify kmer in unitigs +find_approx: slot = mphf(kmer); check fingerprint[slot] == seq_hash(kmer) +``` + +`block_bits` controls the `.idx` file written alongside `evidence.bin`. At `block_bits = 0`, every unitig chunk has an index entry, giving O(1) random access; larger values trade access time for a smaller `.idx`. + +The MPHF and `unitigs.bin` are never rebuilt by any post-build operation. + +--- + +## Layer\ — MPHF + data payload + +```rust +pub struct Layer { + mphf: MphfLayer, + data: D, +} +``` + +`D` selects the attached data payload: + +| `D` | Data directory | `Item` returned by `query` | +|---|---|---| +| `()` | — | `()` (set membership only) | +| `PersistentCompactIntMatrix` | `counts/` | `Box<[u32]>` (counts per genome) | +| `PersistentBitMatrix` | `presence/` | `Box<[bool]>` (presence per genome) | + +`Layer::query(kmer)` delegates to `MphfLayer::find`, then calls `data.read(slot)` if a slot is returned. Both exact and approximate evidence are handled transparently; the caller sees only `Option>`. + +Build-time entry points: + +```rust +Layer<()>::build(out_dir, block_bits) // set membership +Layer::build(out_dir, block_bits, count_of) +Layer::build_presence(out_dir, block_bits, n_genomes, present_in) + +Layer::<()>::build_evidence(layer_dir, kind, block_bits) // evidence only (reindex path) +``` --- ## DataStore — slot-indexed data -```rust -trait DataStore { - type Item; - fn get(&self, slot: usize) -> Self::Item; - fn n(&self) -> usize; -} -``` +`PersistentCompactIntMatrix` and `PersistentBitMatrix` are slot-indexed stores. They know nothing about kmers or MPHFs. -Concrete types from `obicompactvec`: - -| Type | `Item` | Column stats | Use | +| Type | `Item` | Aggregation method | Use | |---|---|---|---| -| `PersistentCompactIntMatrix` | `Box<[u32]>` | `sum() -> Array1` | count per sample per slot | -| `PersistentBitMatrix` | `Box<[bool]>` | `count_ones() -> Array1` | presence per sample per slot | - -`sum()` and `count_ones()` are the bridge between the per-matrix level and cross-layer aggregation: they give the total weight of each column within one (partition, layer) pair, which can be summed to get global column weights. - -A `DataStore` knows nothing about kmers or MPHFs. It is indexed by `usize` slot only. +| `PersistentCompactIntMatrix` | `Box<[u32]>` | `sum() → Array1` | counts per genome per slot | +| `PersistentBitMatrix` | `Box<[bool]>` | `count_ones() → Array1` | presence per genome per slot | --- -## Distance matrix API on DataStore types +## Aggregation traits — `obicompactvec::traits` -Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` expose two families of distance matrix methods. - -### Full distance matrices - -Compute the final `n_cols × n_cols` distance matrix from data within a single matrix. Internally parallelised over the upper triangle via rayon. - -```rust -// PersistentCompactIntMatrix -fn bray_dist_matrix(&self) -> Array2 -fn relfreq_bray_dist_matrix(&self) -> Array2 -fn euclidean_dist_matrix(&self) -> Array2 -fn relfreq_euclidean_dist_matrix(&self) -> Array2 -fn hellinger_dist_matrix(&self) -> Array2 -fn jaccard_dist_matrix(&self) -> Array2 -fn threshold_jaccard_dist_matrix(&self, threshold: u32) -> Array2 - -// PersistentBitMatrix -fn jaccard_dist_matrix(&self) -> Array2 -fn hamming_dist_matrix(&self) -> Array2 -``` - -These are convenience methods. For a `LayeredDataStore` or `PartitionedDataStore` they cannot be used directly — the partial API is required. - -### Partial distance matrices - -Return additive components that can be summed element-wise across (partition, layer) pairs before computing the final distance. This is what makes cross-layer and cross-partition aggregation possible. - -**Category 1 — self-contained partials**: additive without any external parameter. - -```rust -// PersistentCompactIntMatrix -fn partial_bray_dist_matrix(&self) - -> (Array2, // sum_min[i,j] - Array1) // col_sums[k] - -fn partial_euclidean_dist_matrix(&self) -> Array2 // sum of squared diffs -fn partial_threshold_jaccard_dist_matrix(&self, threshold: u32) - -> (Array2, // inter[i,j] - Array2) // union[i,j] - -// PersistentBitMatrix -fn partial_jaccard_dist_matrix(&self) - -> (Array2, // inter[i,j] - Array2) // union[i,j] -fn partial_hamming_dist_matrix(&self) -> Array2 // differing bits -``` - -**Category 2 — normalised partials**: require global column sums as input, computed beforehand across all (partition, layer) pairs. - -```rust -// PersistentCompactIntMatrix only -fn partial_relfreq_bray_dist_matrix(&self, col_sums: &Array1) - -> Array2 // Σ_slot min(a_slot/sum_i, b_slot/sum_j) - -fn partial_relfreq_euclidean_dist_matrix(&self, col_sums: &Array1) - -> Array2 // Σ_slot (a_slot/sum_i - b_slot/sum_j)² - -fn partial_hellinger_euclidean_dist_matrix(&self, col_sums: &Array1) - -> Array2 // Σ_slot (√(a/sum_i) - √(b/sum_j))² -``` - -The `col_sums` parameter must reflect the GLOBAL count across all layers and all partitions — passing a per-layer sum would give a wrong result. This constraint drives the two-pass algorithm described below. - ---- - -## Progressive aggregation principle - -Aggregation is **hierarchical**: each level computes its contribution by aggregating from the level immediately below it. No level skips a level or collects raw data from two levels down. - -``` -PersistentCompactIntMatrix::col_weights() — column sums for one (partition, layer) matrix - ↓ Σ across layers -LayeredStore::col_weights() — column sums for one partition - ↓ Σ across partitions -LayeredStore>::col_weights() — global column sums -``` - -The same cascade applies to every partial: - -``` -PersistentCompactIntMatrix::partial_bray() — one (partition, layer) - ↓ element-wise Σ across layers -LayeredStore::partial_bray() — one partition - ↓ element-wise Σ across partitions -LayeredStore>::partial_bray() — global partial → final dist -``` - -Each level presents a stable trait surface to the level above; no level reaches two levels down. - ---- - -## Traits — `obicompactvec::traits` - -Three traits unify the aggregation API across all levels of the hierarchy. +Three traits unify the aggregation API across all hierarchy levels. ```rust trait ColumnWeights: Send + Sync { @@ -172,81 +158,130 @@ trait ColumnWeights: Send + Sync { } trait CountPartials: ColumnWeights { - // self-contained partials (additive, no parameter) - fn partial_bray(&self) -> Array2; - fn partial_euclidean(&self) -> Array2; - fn partial_threshold_jaccard(&self, threshold: u32) -> (Array2, Array2); - // normalised partials (global col_weights passed in cascade) - fn partial_relfreq_bray(&self, global: &Array1) -> Array2; - fn partial_relfreq_euclidean(&self, global: &Array1) -> Array2; - fn partial_hellinger(&self, global: &Array1) -> Array2; - // provided finalisation methods (default implementations) - fn bray_dist_matrix(&self) -> Array2 { … } - fn euclidean_dist_matrix(&self) -> Array2 { … } - fn threshold_jaccard_dist_matrix(&self, threshold: u32) -> Array2 { … } - fn relfreq_bray_dist_matrix(&self) -> Array2 { … } - fn relfreq_euclidean_dist_matrix(&self) -> Array2 { … } - fn hellinger_dist_matrix(&self) -> Array2 { … } + fn partial_bray(&self) -> Array2; + fn partial_euclidean(&self) -> Array2; + fn partial_threshold_jaccard(&self, threshold: u32) -> (Array2, Array2); + fn partial_relfreq_bray(&self, global: &Array1) -> Array2; + fn partial_relfreq_euclidean(&self, global: &Array1) -> Array2; + fn partial_hellinger(&self, global: &Array1) -> Array2; + // provided finalisation methods with default impls + fn bray_dist_matrix(&self) -> Array2 { … } + fn relfreq_bray_dist_matrix(&self) -> Array2 { … } + // … } trait BitPartials: ColumnWeights { - fn partial_jaccard(&self) -> (Array2, Array2); - fn partial_hamming(&self) -> Array2; + fn partial_jaccard(&self) -> (Array2, Array2); + fn partial_hamming(&self) -> Array2; // provided fn jaccard_dist_matrix(&self) -> Array2 { … } fn hamming_dist_matrix(&self) -> Array2 { … } } ``` -**Leaf implementors** (in `obicompactvec`): +Leaf implementors: | Type | Traits | |---|---| -| `PersistentCompactIntMatrix` | `ColumnWeights` (via `sum()`), `CountPartials` | -| `PersistentBitMatrix` | `ColumnWeights` (via `count_ones()`), `BitPartials` | - -`PersistentCompactIntVec` and `PersistentBitVec` do **not** implement these traits — they are single-column primitives, not matrix-level aggregators. +| `PersistentCompactIntMatrix` | `ColumnWeights`, `CountPartials` | +| `PersistentBitMatrix` | `ColumnWeights`, `BitPartials` | --- -## `LayeredStore` — `obilayeredmap` - -A single generic wrapper replaces the need for named `LayeredDataStore` and `PartitionedDataStore` types: +## LayeredStore\ — recursive aggregation wrapper ```rust pub struct LayeredStore(Vec); ``` -Three blanket impls propagate the traits up the hierarchy: +Three blanket impls propagate all traits up the hierarchy: ```rust -impl ColumnWeights for LayeredStore { … } // Σ across inner stores -impl CountPartials for LayeredStore { … } // same pattern -impl BitPartials for LayeredStore { … } // same pattern +impl ColumnWeights for LayeredStore { … } +impl CountPartials for LayeredStore { … } +impl BitPartials for LayeredStore { … } ``` -Because the blanket impl is recursive, **`LayeredStore>`** automatically inherits all three traits when `S` does — no separate `PartitionedStore` type is needed: +This makes `LayeredStore>` automatically implement `CountPartials` — no separate `PartitionedStore` type is needed: ``` -PersistentCompactIntMatrix implements CountPartials -LayeredStore via blanket impl (= one partition) -LayeredStore> via blanket impl (= partitioned index) +PersistentCompactIntMatrix leaf (one layer) +LayeredStore one partition (layers are disjoint) +LayeredStore> whole index (partitions are independent) ``` -### Normalised metrics — two-pass cascade - -The normalised finalisation methods call `col_weights()` first (pass 1), then the normalised partial (pass 2). Both calls go through the same blanket impl, so the cascade is automatic: +Normalised metrics require global column sums — computed in a two-pass cascade: ```rust -// called on LayeredStore> +// on LayeredStore> fn relfreq_bray_dist_matrix(&self) -> Array2 { - let global = self.col_weights(); // pass 1 — progressive sum at every level - let p = self.partial_relfreq_bray(&global); // pass 2 — global passed in cascade - p.mapv(|v| 1.0 - v) // finalise (diagonal zeroed separately) + let global = self.col_weights(); // pass 1 — sums up hierarchy + let p = self.partial_relfreq_bray(&global); // pass 2 — global broadcast read-only + p.mapv(|v| 1.0 - v) } ``` -`global` is exact: each kmer belongs to exactly one `(partition, layer)` pair, so there is no double-counting across the hierarchy. +Because each kmer belongs to exactly one `(partition, layer)` pair, `col_weights()` has no double-counting across the hierarchy. + +--- + +## Progressive aggregation principle + +No level reaches two levels down. Each level sums contributions from the level immediately below: + +``` +PersistentCompactIntMatrix::col_weights() — one (partition, layer) + ↓ Σ across layers +LayeredStore::col_weights() — one partition + ↓ Σ across partitions +LayeredStore>::col_weights() — global +``` + +The same cascade applies to every partial method. + +--- + +## Multi-genome column invariant + +After any merge, every layer in every partition has exactly `n_genomes` columns, where `n_genomes` is the current total in `index.meta`. This holds for both `PersistentCompactIntMatrix` and `PersistentBitMatrix`. + +Maintained by three coordinated operations: + +**Existing layers — column append.** `Layer::append_genome_column` appends one column to each existing layer. Slots matching the incoming genome receive its count or `true`; all other slots receive 0 or `false`. + +**New layers — absent columns prepended.** When a new layer is created for kmers unique to the incoming genome, `n_existing_genomes` absent columns are prepended before the incoming genome's column, so the new layer immediately has the same column count as all other layers. + +**First merge, Presence mode — `init_presence_matrix`.** The initial single-genome index has no `presence/` directory (presence is implicit). On the first merge, `Layer<()>::init_presence_matrix` materialises genome 0's presence column (all `true`) retroactively, raising the column count from 0 to 1 before appending column 1. + +This invariant is the precondition for correct progressive aggregation: every level can blindly sum matrices from below because all matrices have the same shape. + +--- + +## Query model + +### Point query + +``` +minimiser(kmer) → partition p +for each layer l in p: + if let Some(slot) = MphfLayer_l.find(kmer): + return data_l.read(slot) +return None +``` + +O(n_layers) MPHF probes worst case; O(1) expected. The result comes from exactly one `(partition, layer)`. + +### Aggregation + +``` +result = reduce( + for p in partitions: // parallel + for l in layers(p): // parallel + partial(data_p_l) +) +``` + +For normalised metrics, replace with the two-pass cascade. --- @@ -254,103 +289,25 @@ fn relfreq_bray_dist_matrix(&self) -> Array2 { | Level | Unit | Coordination | |---|---|---| -| Across partitions | `LayeredStore>` inner stores | none — fully independent | -| Across layers within a partition | `LayeredStore` inner stores | none — disjoint kmer sets | +| Across partitions | inner stores of `LayeredStore>` | none | +| Across layers within a partition | inner stores of `LayeredStore` | none — disjoint kmer sets | | Normalised pass 1 (`col_weights`) | per inner store | none — additive | | Normalised pass 2 (partial) | per inner store | `global` broadcast read-only | | Within a matrix (distance) | upper-triangle pair `(i,j)` | none — rayon `par_iter` | -All levels use rayon `par_iter` internally; `reduce_with` performs a parallel tree reduction. +--- + +## reindex — evidence conversion in place + +`KmerIndex::reindex(target, block_bits)` converts every layer's evidence bundle to `target` without touching the MPHF or `unitigs.bin`: + +- `→ Exact`: builds `evidence.bin` + `unitigs.bin.idx`; removes `fingerprint.bin` +- `→ Approx { b, z }`: builds `fingerprint.bin`; removes `evidence.bin` + `unitigs.bin.idx` + +On success, `IndexConfig::evidence` and `IndexConfig::block_bits` are updated in `index.meta`. Each layer's `layer_meta.json` is also rewritten with the new `EvidenceKind`. --- -## Query model +## estimate — parameter dry-run -### Point query — `kmer → Option` - -``` -minimiser(kmer) → partition p -for each layer l in p: - slot = MphfLayer_l.find(kmer) - if slot is Some: - return DataStore_l.get(slot) -return None -``` - -O(n_layers) MPHF probes worst case; O(1) expected. No cross-layer fusion — the result comes from exactly one (partition, layer). - -### Aggregation — `→ Result` - -``` -result = reduce( - for p in partitions: // parallel - for l in layers(p): // parallel - partial(DataStore_p_l) -) -``` - -For normalised metrics replace with the two-pass scheme above. - ---- - -## DataStore derivation - -Because the `MphfLayer` is independent of its data stores, new stores can be derived from existing ones without rebuilding the MPHF: - -``` -// count → presence/absence, parallel across (partition, layer) -for (p, l) in all_partition_layer_pairs().par_iter(): - count_store = open PersistentCompactIntMatrix at (p, l) - presence_store = PersistentBitMatrix::from_count_matrix(count_store, threshold, dir) -``` - -Other derivations: threshold a count matrix → binary presence matrix; union two presence matrices; merge two count matrices (saturating add, column-wise). All are local to one `(partition, layer)` pair. - ---- - -## Relationship to current implementation - -### What is implemented - -- **`obicompactvec::traits`**: `ColumnWeights`, `CountPartials`, `BitPartials` are defined and implemented on `PersistentCompactIntMatrix` and `PersistentBitMatrix`. -- **`obilayeredmap::LayeredStore`**: generic wrapper with blanket impls for all three traits. `LayeredStore>` is the partitioned level — no separate type needed. Tests confirm that splitting data across layers and across partitions gives the same distance matrices as computing on flat combined data. - -### What is not yet implemented - -- `Layer` still fuses `MphfLayer` and one `DataStore`. Multiple data stores on the same MPHF are not supported. -- `LayeredMap` is a single-partition structure without distance matrix API; it does not yet use `LayeredStore`. -- No `PartitionedIndex` type for point queries with parallel partition dispatch. - -### Planned refactoring - -1. Extract `MphfLayer` from `Layer` as an autonomous type. -2. Replace `LayerData` trait with the `DataStore` / `ColumnWeights` / `CountPartials` / `BitPartials` system. -3. Rewire `LayeredMap` to hold `LayeredStore` (or bit variant) alongside the MPHF layers. -4. Implement `PartitionedIndex` using `LayeredStore>` for data and parallel dispatch for queries. - ---- - -## Multi-genome column invariant - -### The invariant - -After any merge operation, **every layer in every partition has exactly `n_genomes` columns**, where `n_genomes` is the current total genome count recorded in `index.meta`. This applies to both `PersistentCompactIntMatrix` (Count mode) and `PersistentBitMatrix` (Presence mode). - -### How it is maintained - -The invariant is established and preserved by three coordinated operations: - -**1. Existing layers — column append.** -When merging source genome G into an existing index with `n_existing_genomes` genomes, one column is appended to every existing layer via `append_genome_column`. Slots that contain a kmer present in source G receive its count or `true`; all other slots receive 0 or `false`. After this step, every pre-existing layer has `n_existing_genomes + 1` columns. - -**2. New layers — absent columns prepended.** -If source G introduces kmers not found in any existing layer, a new layer is created for those kmers. Before appending source G's own column, `n_existing_genomes` absent columns (all-zero or all-false) are prepended — one per genome already in the index. This ensures the new layer starts at the same column count as every other layer in the partition immediately after creation. - -**3. First merge, Presence mode — `init_presence_matrix`.** -The initial single-genome index has no `presence/` directory (presence is implicit: every kmer in the index is present in genome 0). On the first merge, before appending any column for source 1, `Layer<()>::init_presence_matrix` creates `presence/col_000000.pbiv` set entirely to `true` for each existing layer. This retroactively materialises genome 0's presence column, bringing the column count from 0 to 1 so that the subsequent append for source 1 raises it to 2. - -### Why the invariant is required - -The `LayeredStore` aggregation traits (`col_weights`, `partial_bray`, `partial_jaccard`, etc.) sum contributions across all `(partition, layer)` pairs without any shape check. If one layer had fewer columns than others, its contribution would silently produce a malformed result — wrong column sums, wrong distance matrices, and incorrect genome-level statistics. - -The invariant is the precondition that makes the progressive aggregation principle correct: each level can blindly sum matrices from the level below because all matrices have the same shape. +`estimate` resolves approximate-evidence parameters (`z`, `b`, target FP rate) and prints the resulting effective kmer size and per-kmer / per-z-window false-positive rates without touching any index. Used to calibrate `Approx { b, z }` before building or reindexing. diff --git a/docmd/implementation/evidence_elimination.md b/docmd/implementation/evidence_elimination.md index ecc2c6f..a7f3a18 100644 --- a/docmd/implementation/evidence_elimination.md +++ b/docmd/implementation/evidence_elimination.md @@ -1,321 +1,178 @@ -# Evidence elimination — design discussion +# Approximate evidence: fingerprint-based index -## Problem statement +## Motivation -`evidence.bin` maps each MPHF slot to a position in the unitig store so that -query verification is possible: given a slot `s` returned by `mphf.index(kmer)`, -retrieve the k-mer stored at that position and compare with the query. +`evidence.bin` maps each MPHF slot to the position of the k-mer that owns it, +enabling zero-FP verification. On the bacterial BCT dataset (2048 partitions, +k=31, ~33 M k-mers/partition) it accounts for 66 % of the lookup-layer footprint: -On the bacterial BCT dataset (2048 partitions, k=31, ~33 M k-mers/partition): - -| file | size/partition | total (2048 parts) | fraction of lookup layer | -|---|---|---|---| -| evidence.bin | 132 MB | ~270 GB | **66 %** | -| unitigs.bin | 58 MB | ~118 GB | 29 % | -| mphf.bin | 10 MB | ~20 GB | 5 % | - -Evidence dominates. Eliminating or drastically shrinking it is the highest-leverage -optimisation available for index size. - ---- - -## Why evidence exists - -PtrHash (like all standard MPHFs) maps **any** input to a valid slot in `[0, n)`. -For a query k-mer not in the indexed set, the returned slot is meaningless but -indistinguishable from a real hit without external information. Evidence provides -that information: `evidence[s]` encodes the location of the k-mer that legitimately -occupies slot `s`, allowing the verification: - -``` -slot = mphf.index(query) -(chunk_id, rank) = evidence.decode(slot) -stored_kmer = unitigs.kmer_at(chunk_id, rank) -return canonical(stored_kmer) == canonical(query) -``` - -Evidence is a **permutation** from MPHF-space to unitig-position-space. -Storing it costs at minimum log₂(n_kmers) bits per slot — irrespective of encoding. - ---- - -## Information-theoretic lower bound - -For a partition with P k-mers and U unitigs of average length m_u k-mers: - -- global k-mer index range: [0, P) → ⌈log₂ P⌉ bits -- (chunk_id, rank) pair: ⌈log₂ U⌉ + ⌈log₂ L_max⌉ bits - -Current implementation: 25 + 7 = 32 bits (aligned u32). -Theoretical minimum: ⌈log₂ P⌉ ≈ 25 bits for P ≈ 33 M. - -**Packing headroom: ~22 %.** Not a path to elimination. - ---- - -## Two-index architecture - -The exact index is mandatory for set operations (union, intersection, diff) and -exact k-mer retrieval. A separate approximate index, built for query operations, -can tolerate a controlled false positive rate in exchange for a much smaller -footprint. - -| component | exact index | approximate index | +| file | size/partition | fraction | |---|---|---| -| `mphf.bin` | ✓ | ✓ (same structure) | -| `evidence.bin` | ✓ (32 bits/k-mer) | ✗ | -| `fingerprint.bin` | ✗ | ✓ (B bits/k-mer) | -| `unitigs.bin` | ✓ | ✓ (K-mer enumeration) | -| `unitigs.bin.idx` | ✓ | ✗ (random access not needed) | +| evidence.bin | 132 MB | 66 % | +| unitigs.bin | 58 MB | 29 % | +| mphf.bin | 10 MB | 5 % | -The approximate index drops `evidence.bin` and `unitigs.bin.idx`; it keeps -`unitigs.bin` for sequential enumeration of K-mers. +`evidence.bin` is a bijection from MPHF-space to unitig-position-space and +costs at minimum ⌈log₂ N⌉ bits per slot — an information-theoretic floor with +only ~22 % packing headroom. Compression is not a path to elimination. + +The approximate index replaces `evidence.bin` + `unitigs.bin.idx` with a +`fingerprint.bin` file. The MPHF and `unitigs.bin` are kept unchanged. Set +operations still require an exact index; the approximate index targets query +workloads that can tolerate a bounded false-positive rate. --- -## MPHF as a perfect Bloom filter +## The Findere model -A standard Bloom filter with a single hash function and N bits storing M keys has -occupancy M/N. For a foreign query k-mer, P(FP) = M/N — the probability of -landing on a set bit. The empty space (fraction 1 − M/N of bits at 0) is what -rejects foreign k-mers. +A B-bit fingerprint stored per MPHF slot provides the discrimination that +`evidence.bin` would otherwise provide through full k-mer reconstruction. -An MPHF is a Bloom filter with **zero internal collisions**: every indexed k-mer -occupies its own unique slot. But unlike a Bloom filter, the MPHF maps **any** -input to a slot in [0, M) — there is no empty space. Every query lands on an -occupied slot. The MPHF alone cannot reject foreign k-mers at all. - -Adding a B-bit fingerprint restores the discrimination: +For a foreign k-mer query, the MPHF maps it to some slot `s`. The fingerprint +stored at `s` belongs to the legitimate k-mer at that slot. The FP event is: ``` -slot = mphf.index(query) -fingerprint = hash(query) & mask_B -present = fingerprint_table[slot] == fingerprint +P(FP per k-mer) = 1 / 2^b ``` -The fingerprint plays the role of the sparse space in the Bloom filter: it provides -the B bits of information needed to reject foreign k-mers. +The Findere trick raises the effective window to z consecutive k-mers. A query +succeeds only when all z fingerprint checks pass, reducing the per-window FP rate: -Both structures reach the same fundamental cost for a given FP rate. For 1% FP: +``` +P(FP per z-window) = 1 / 2^(b·z) +``` -- Bloom filter (optimal, k hash functions): ~9.6 bits/key -- MPHF (~3 bits/key) + fingerprint (7 bits/key): ~10 bits/key +The effective indexed k-mer length is `k − z + 1`: a query for a (k+z−1)-mer +decomposes into z overlapping k-mers, all of which must match. -This is a fundamental bound, not an implementation detail. +Parameters b and z are stored in `layer_meta.json` (`EvidenceKind::Approx { b, z }`). --- -## Approach A — MPHF + fingerprint (approximate index) +## `FingerprintVec` on disk -### Size +`fingerprint.bin` layout: -| B (bits) | fingerprint.bin/partition | vs evidence.bin (32 bits) | +``` +magic: b"FPVF" (4 bytes) +b: u8 (bits per slot, 1..=64) +padding: [0u8; 3] +n: u64 LE (number of slots) +data: packed bits, ceil(n·b/8) bytes, Lsb0 order +``` + +`FingerprintVec` is memory-mapped. The match check against a query k-mer: + +```rust +fn matches(&self, slot: usize, fingerprint: u64) -> bool { + self.get(slot) == (fingerprint & self.mask) +} +``` + +`build_approx_evidence` iterates `unitigs.bin` sequentially, writes +`kmer.seq_hash()` into the slot assigned by the MPHF, then saves `fingerprint.bin` +and `layer_meta.json`. No `.idx` file is produced; random access into +`unitigs.bin` is not needed. + +At build time, `find_approx` in `MphfLayer`: + +```rust +let slot = self.mphf.index(&kmer.raw()); +if fingerprint.matches(slot, kmer.seq_hash()) { Some(slot) } else { None } +``` + +--- + +## `EvidenceKind` and metadata + +`layer_meta.json` records which evidence bundle is present: + +```rust +pub enum EvidenceKind { + Exact, + Approx { b: u8, z: u8 }, +} +``` + +`MphfLayer::open` reads this tag and dispatches `find` to `find_exact` or +`find_approx` transparently. `find_exact` panics on an approximate layer; +`find_approx` panics on an exact layer — mode mixing is a programming error. + +--- + +## Parameter resolution (`resolve_approx_params`) + +The identity `b·z = ⌈−log₂(fp)⌉` lets any two of (b, z, fp) derive the third. +`resolve_approx_params` implements a 2-of-3 rule with conservative ceiling +rounding: + +| given | derived | +|---|---| +| b, z | fp = 1/2^(b·z) | +| z, fp | b = ⌈−log₂(fp) / z⌉ | +| b, fp | z = ⌈−log₂(fp) / b⌉ | +| z only | b = 8 (default), fp derived | +| b only | z = 1 (default), fp derived | +| fp only | b = 8 (default), z derived | +| none | b = 8, z = 1, fp = 1/256 | + +When all three are given, b and z are authoritative and fp is recomputed. + +--- + +## CLI flags + +Both `index` and `reindex` accept the same flags: + +| flag | type | meaning | |---|---|---| -| 8 | 33 MB | 4× smaller | -| 12 | 49 MB | 2.7× smaller | -| 16 | 66 MB | 2× smaller | +| `--approx` | bool | enable fingerprint evidence | +| `--evidence-bits` (`b`) | u8 | fingerprint bits per slot | +| `-z` | u8 | Findere z parameter | +| `--fp` | f64 | target FP rate per z-window | +| `--block-size` | usize | unitig block size for exact `.idx`; ignored in approx mode | -Total approximate index per partition at B=8: ~43 MB (vs ~142 MB for exact lookup layer). - -### False positive rate — per k-mer query - -For a specific non-indexed query k-mer q: - -1. MPHF(q) → slot s, some value in [0, M) -2. fingerprint_table[s] holds the B-bit fingerprint of the legitimate k-mer at s -3. FP event: hash(q) & mask_B == fingerprint_table[s] - -Since q is not the legitimate k-mer at s, its fingerprint is independent of -fingerprint_table[s], giving: - -``` -P(FP per k-mer) = 1 / 2^B -``` - -This is the probability of error **for one specific query k-mer**. It is not the -fraction of the k-mer universe that would be misclassified: querying all 4^k -possible k-mers would yield (4^k − M)/2^B false positives in absolute terms, but -that is not the relevant quantity for practical use. - -### Equivalence classes - -The MPHF + fingerprint partitions the universe of 4^k k-mers into M·2^B -equivalence classes of average size 4^k/(M·2^B). Each class contains 1 true -indexed k-mer and 4^k/(M·2^B) − 1 false positives. A larger M (fewer partitions) -produces smaller classes — finer discrimination in k-mer space — while P(FP) = 1/2^B -remains constant. - -### Read-level use case - -The relevant decision unit is the **read**, not the individual k-mer. For a read -of ~100 nucleotides and k=31, there are ~70 k-mers. - -- A bacterial read queried against a bacterial index: nearly all ~70 k-mers are - true positives → high coverage fraction. -- A plant read queried against a bacterial index: k-mers are foreign; each has - P(FP) = 1/2^B independently → expected coverage fraction ≈ 1/2^B. - -A coverage threshold separates the two cases decisively. This is the same -principle as Findere: local coverage continuity distinguishes true hits from noise. +`--approx` must be set explicitly; the other three flags are optional and +resolved by the 2-of-3 rule. Omitting all three produces b=8, z=1. --- -## Approach B — z-consecutive k-mer matching +## `reindex` command -A query for a K-mer of size K = k + z − 1 decomposes into z overlapping k-mers. -Declaring a match only when **all z are present** reduces the per-window FP rate: +`reindex` converts an existing index between exact and approximate evidence +in-place across all partitions and layers, running partitions in parallel via +Rayon. -``` -P(FP per window of z) = (1/2^B)^z = 1/2^(B·z) -``` +Conversion to approximate (`--approx`): -For a read with ~70 k-mers, there are ~70 − z + 1 independent windows of size z. -The probability that at least one window is a false positive: +- Builds `fingerprint.bin` from `unitigs.bin` + `mphf.bin`. +- Removes `evidence.bin` and `unitigs.bin.idx`. +- Updates `layer_meta.json` with `EvidenceKind::Approx { b, z }`. -``` -P(FP_read) = 1 - (1 - 1/2^(B·z))^(70-z+1) ≈ (70-z+1) / 2^(B·z) -``` +Conversion to exact (default, no `--approx`): -For B=8, z=4: P(FP_read) ≈ 67 / 2^32 ≈ 1.6×10⁻⁸. +- Builds `evidence.bin` + `unitigs.bin.idx` from `unitigs.bin` + `mphf.bin`. +- Removes `fingerprint.bin`. +- Updates `layer_meta.json` with `EvidenceKind::Exact`. -A plant read is misclassified as bacterial roughly once in 60 million reads — -negligible for any practical dataset. - -### Choosing B from (z, L, P_target) - -z is a query-time parameter and does not affect the index structure. However, -knowing z at build time allows computing the minimum B required to reach a target -FP rate P_target for reads of length L (giving W = L − k − z + 2 independent -windows): - -``` -P_target ≈ W / 2^(B·z) → B = ceil( (log2(W) - log2(P_target)) / z ) -``` - -Example: L=100, k=31, z=4, P_target=10⁻⁸ → W=67, B = ceil((6.07 + 26.6) / 4) = ceil(8.17) = **9 bits**. - -(B, z) are co-determined at build time to minimise fingerprint size while -guaranteeing the target read-level FP rate. - -### Combined sizing - -| B | z | K = k+z−1 | P(FP_read) | fingerprint.bin/partition | -|---|---|---|---|---| -| 8 | 2 | 32 | ~67/2^16 ≈ 10⁻³ | 33 MB | -| 8 | 4 | 34 | ~67/2^32 ≈ 10⁻⁸ | 33 MB | -| 4 | 4 | 34 | ~67/2^16 ≈ 10⁻³ | 16 MB | -| 4 | 8 | 38 | ~63/2^32 ≈ 10⁻⁸ | 16 MB | - -Smaller B → smaller fingerprint table; larger z → longer minimum match length K -and fewer independent windows per read. +The root `index.meta` is updated with the new evidence kind on success. +`mphf.bin` and `unitigs.bin` are never modified. --- -## Approach 1 — value-based MPHF (eliminates evidence.bin from exact index) +## `estimate` command -Build the MPHF to output the global k-mer position directly: +`estimate` is a dry-run that resolves and prints (b, z, fp) without touching +any index. It accepts the same `--evidence-bits`, `-z`, and `--fp` flags and +additionally accepts `-k` to display the effective indexed k-mer length: ``` -mphf: kmer → global_pos ∈ [0, P) +k (query): 31 +k (indexed): 31 +z: 1 +evidence bits (b): 8 +FP per k-mer: 3.906e-3 (1/2^8) +FP per z-window: 3.906e-3 (1/2^8) ``` -Verification becomes: - -``` -global_pos = mphf.index(query) -stored_kmer = unitigs.kmer_at_global_pos(global_pos) -return canonical(stored_kmer) == canonical(query) -``` - -No evidence array. The unitig block index (see below) provides -`kmer_at_global_pos` in O(log(n_blocks) + BLOCK_SIZE) time. - -### What is required - -A **retrieval data structure** (also called a value-based or function-based MPHF): -given a set of (key, value) pairs with distinct keys and bijective values in `[0, n)`, -build a compact structure that maps each key to its assigned value. - -Known constructions: - -- **GOV / GBF (Generalized Bloomier Filter)**: random 3-uniform hypergraph + - XOR-based assignment. ~2.3 bits/key overhead over the information-theoretic - minimum. Construction: O(n). Query: O(1). -- **SSHash approach**: builds the MPHF to map k-mers to their positions in a - concatenated unitig string. Achieves elimination of external evidence using a - "skew" construction that aligns the MPHF output with the sequential unitig layout. - -### Rust availability - -No Rust crate implements a retrieval data structure suitable for this use case as -of 2025. The `ph`, `boomphf`, `fmphf`, and `ptr_hash` crates all build plain -MPHFs. **This is the key blocking factor.** - -### SSHash construction (reference) - -SSHash (Pibiri 2022, doi:10.1186/s13015-022-00216-6) constructs the MPHF over -(minimizer, position-within-minimizer-bucket) pairs, aligning slots with sequential -positions in the concatenated unitig string. A port to obikmer would require: - -1. Concatenating all unitig sequences into a single flat buffer per partition. -2. Assigning each k-mer a global position (its offset in that buffer). -3. Building the MPHF to output that position directly (retrieval step). -4. Replacing `evidence.bin` with a small prefix-sum index for `kmer_at_global_pos`. - ---- - -## Approach 2 — block index prefix sums (reduces evidence to negligible) - -A prerequisite already implemented: `unitigs.bin.idx` now uses a **block-sampled -offset index** (one `u32` per `BLOCK_SIZE=64` chunks) instead of a per-chunk offset -table. - -### Extension: k-mer prefix sums per block - -Add a second array to `unitigs.bin.idx`: `kmer_prefix[b]` = total k-mers before -block `b`. For 33 M k-mers: ~73 600 blocks × 4 bytes = **295 KB/partition**. - -This enables `kmer_at_global_pos(p)`: - -1. Binary search in `kmer_prefix[]` to find block `b`. -2. Sequential scan from `block_offsets[b]` until cumulative k-mer count reaches `p`. -3. Extract the k-mer at the remaining rank within the found chunk. - -Cost: O(log(n_blocks) + BLOCK_SIZE) ≈ O(17 + 64) memory accesses. - -### Combined with Approach 1 - -- evidence.bin: **eliminated** (~270 GB saved across 2048 partitions) -- kmer_prefix array: ~295 KB/partition × 2048 = ~600 MB total (negligible) - ---- - -## Recommended path - -1. **Short term (approximate index)**: implement MPHF + fingerprint.bin. Choose - (B, z) as index parameters. Drop `evidence.bin` and `unitigs.bin.idx`; keep - `unitigs.bin` for K-mer enumeration. Expected size: ~43 MB/partition at B=8 - vs ~142 MB for the exact lookup layer. - -2. **Short term (exact index)**: add `kmer_prefix[]` to `unitigs.bin.idx`. - Zero cost if evidence.bin is kept; enables Approach 1 when ready. - -3. **Medium term**: implement GOV retrieval data structure in Rust, or port - SSHash construction. - -4. **Long term**: replace `evidence.bin` with the value-based MPHF. Expected - index size reduction: ~50 % of the lookup layer, ~270 GB on the BCT dataset. - ---- - -## Open questions - -- Is a GOV construction compatible with the parallel MPHF build currently used - (PtrHash's `new_from_par_iter`)? GOV construction is inherently sequential - (hypergraph peeling); parallelisation is non-trivial. -- Can the SSHash "skew" insight be reused without the minimizer-bucket structure? - The obikmer partitioning already uses minimizers — there may be natural alignment. -- What is the query latency impact of replacing O(1) evidence lookup with - O(log n_blocks + BLOCK_SIZE) scan? Needs benchmarking at realistic BCT scale. -- What is the optimal (B, z) trade-off for the approximate index given the target - read length and acceptable P(FP_read)? +Useful for choosing parameters before committing to an index build. diff --git a/docmd/implementation/obilayeredmap.md b/docmd/implementation/obilayeredmap.md index d5ad7d2..da2f42e 100644 --- a/docmd/implementation/obilayeredmap.md +++ b/docmd/implementation/obilayeredmap.md @@ -2,7 +2,7 @@ ## Purpose -`obilayeredmap` implements a persistent, incrementally extensible kmer index. The index is organised in three levels: **index root → partition → layer**. Each layer covers a disjoint kmer set and wraps a `ptr_hash` MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers. +`obilayeredmap` implements a persistent, incrementally extensible kmer index. Each layer covers a disjoint kmer set and wraps a `ptr_hash` MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers. --- @@ -20,42 +20,81 @@ Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` come from the `obico --- +## Evidence kinds + +Each layer carries one of two evidence bundles, recorded in `layer_meta.json` at build time: + +```rust +pub enum EvidenceKind { + Exact, + Approx { b: u8, z: u8 }, +} +``` + +`EvidenceKind` is stored in `LayerMeta` (one per layer directory). `open()` reads it to decide which evidence files to load. + +- **Exact**: writes `evidence.bin` + `unitigs.bin.idx`. Zero false positives. Requires random-access `.idx` at query time. +- **Approx**: writes `fingerprint.bin` only. False-positive rate per kmer query = 1/2^b. `z` is the Findere consecutive-kmer parameter: `z` consecutive kmers must all match, reducing the effective FP rate per read to approximately W / 2^(b·z) where W = L − k − z + 2 is the number of windows in a read of length L. No `.idx` written or required. + +--- + ## MphfLayer — autonomous kmer → slot mapping -`MphfLayer` encapsulates the MPHF + evidence + unitig spine for one layer. It is independent of any payload data. +`MphfLayer` encapsulates the MPHF and evidence store for one layer. It is independent of any payload. ```rust pub struct MphfLayer { - mphf: Mphf, - evidence: Evidence, - unitigs: UnitigFileReader, - n: usize, // number of indexed kmers = number of MPHF slots + mphf: Mphf, + ev: LayerEvidence, // loaded at open() time + n: usize, } ``` -Public API: +`LayerEvidence` is an internal enum, not public: ```rust -impl MphfLayer { - pub fn open(dir: &Path) -> OLMResult - pub fn find(&self, kmer: CanonicalKmer) -> Option // Some(slot) or None - pub fn n(&self) -> usize - pub fn unitig_writer(dir: &Path) -> OLMResult - pub(crate) fn build( - dir: &Path, - fill_slot: &mut impl FnMut(usize, CanonicalKmer) -> OLMResult<()>, - ) -> OLMResult +enum LayerEvidence { + Exact { evidence: Evidence, unitigs: UnitigFileReader }, + Approx { fingerprint: FingerprintVec }, } ``` -`find` returns `Some(slot)` only after verifying via evidence that the kmer is actually indexed. It returns `None` for absent keys (ptr_hash maps any input to a valid slot; evidence verification is the only correct-membership test). +### Query API -`build` runs two sequential passes over `unitigs.bin`: +Three public query methods, all returning `Option` (slot index): -1. **Pass 1**: iterate all canonical kmers in parallel via rayon, construct and store `mphf.bin`. `new_from_par_iter` avoids materialising a full key `Vec`. -2. **Pass 2**: iterate again sequentially, fill `evidence.bin`, call `fill_slot(slot, kmer)` once per kmer for payload population. A compact `n/8`-byte seen-bitset verifies MPHF injectivity inline. +```rust +pub fn find(&self, kmer: CanonicalKmer) -> Option +pub fn find_exact(&self, kmer: CanonicalKmer) -> Option +pub fn find_approx(&self, kmer: CanonicalKmer) -> Option +``` -For empty layers (n = 0), `build` returns `Ok(0)` immediately after creating empty `mphf.bin` and `evidence.bin`. +- `find` dispatches transparently to `find_exact` or `find_approx` based on the evidence variant loaded at `open()`. +- `find_exact` panics if the layer holds approximate evidence; zero false positives. +- `find_approx` panics if the layer holds exact evidence; FP rate 1/2^b per kmer. + +`open()` requires `unitigs.bin.idx` (random access into unitigs). `open_sequential()` on `UnitigFileReader` does not require the `.idx` and is used during build passes. + +### Build surface + +```rust +// Full MPHF + exact evidence build (two-pass, parallel) +pub(crate) fn build(dir, block_bits, fill_slot) -> OLMResult + +// Evidence-only builds (MPHF already present in dir) +pub fn build_exact_evidence(dir, block_bits) -> OLMResult +pub fn build_approx_evidence(dir, b, z) -> OLMResult +pub fn build_evidence(dir, kind, block_bits) -> OLMResult // dispatch +``` + +`MphfLayer::build` runs two sequential passes over `unitigs.bin`: + +1. **Pass 1** (parallel via rayon): iterate all canonical kmers, construct and store `mphf.bin`. `new_from_par_iter` avoids materialising a full key `Vec`. +2. **Pass 2** (sequential): iterate again, fill `evidence.bin`, call `fill_slot(slot, kmer)` once per kmer for payload population. A compact `n/8`-byte seen-bitset verifies MPHF injectivity inline. + +`build` always produces exact evidence. For approximate evidence, use `build_approx_evidence` after MPHF construction. + +For empty layers (n = 0), all build variants return `Ok(0)` immediately after creating empty output files. --- @@ -81,7 +120,7 @@ pub struct Hit { } ``` -`LayerData` covers the **read path only** (`open` + `read`). Build signatures differ between modes and are not in the trait. +`LayerData` covers the **read path only** (`open` + `read`). Build signatures differ between modes and are not part of the trait. | Type | `Item` | Description | |---|---|---| @@ -89,31 +128,118 @@ pub struct Hit { | `PersistentCompactIntMatrix` | `Box<[u32]>` | mode 2 — count matrix (one u32 per column per slot) | | `PersistentBitMatrix` | `Box<[bool]>` | mode 3 — presence matrix (one bit per genome per slot) | -**Build signatures:** +### Build signatures ```rust // mode 1 impl Layer<()> { - pub fn build(out_dir: &Path) -> OLMResult + pub fn build(out_dir: &Path, block_bits: u8) -> OLMResult } // mode 2 impl Layer { - pub fn build(out_dir: &Path, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult - pub fn build_from_map(out_dir: &Path, counts: &HashMap) -> OLMResult + pub fn build(out_dir: &Path, block_bits: u8, + count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult + pub fn build_from_map(out_dir: &Path, block_bits: u8, + counts: &HashMap) -> OLMResult } // mode 3 impl Layer { - pub fn build_presence( - out_dir: &Path, - n_genomes: usize, - present_in: impl Fn(CanonicalKmer, usize) -> bool, - ) -> OLMResult + pub fn build_presence(out_dir: &Path, block_bits: u8, + n_genomes: usize, + present_in: impl Fn(CanonicalKmer, usize) -> bool) -> OLMResult } ``` -All build impls delegate MPHF + evidence construction to `MphfLayer::build` via a mode-specific `fill_slot` callback. Mode 2 pre-reads `n_kmers` from `unitigs.bin` to size the `PersistentCompactIntMatrixBuilder` before calling `MphfLayer::build`. Mode 3 does the same for `PersistentBitMatrixBuilder`. +All build impls delegate MPHF + evidence construction to `MphfLayer::build` via a mode-specific `fill_slot` callback. Modes 2 and 3 pre-read `n_kmers` from `unitigs.bin` via `UnitigFileReader::open_sequential` to size the matrix builder before calling `MphfLayer::build`. + +### Evidence build helpers on Layer + +```rust +impl Layer { + pub fn build_exact_evidence(layer_dir: &Path, block_bits: u8) -> OLMResult + pub fn build_approx_evidence(layer_dir: &Path, b: u8, z: u8) -> OLMResult + pub fn build_evidence(layer_dir: &Path, kind: &EvidenceKind, block_bits: u8) -> OLMResult +} +``` + +These delegate directly to the corresponding `MphfLayer` methods and are provided so call sites can remain typed at the `Layer` level. + +--- + +## FingerprintVec and FingerprintVecWriter + +Approximate evidence is stored as a packed b-bit array, one fingerprint per MPHF slot. + +``` +fingerprint.bin format: + magic: b"FPVF" (4 bytes) + b: u8 (bits per fingerprint, 1..=64) + padding: [0u8; 3] + n: u64 LE (number of slots) + data: packed bits, ceil(n*b/8) bytes, Lsb0 order +``` + +```rust +impl FingerprintVec { + pub fn open(path: &Path) -> OLMResult + pub fn get(&self, slot: usize) -> u64 + pub fn matches(&self, slot: usize, fingerprint: u64) -> bool + pub fn n(&self) -> usize + pub fn b(&self) -> u8 +} +``` + +`matches(slot, hash)` extracts the b-bit fingerprint stored at `slot` and compares it to the low b bits of `hash`. It is the core operation of `find_approx`. + +--- + +## LayeredMap\ — collection of layers + +`LayeredMap` wraps `Vec>` for a single partition directory. + +```rust +pub struct LayeredMap { + root: PathBuf, + meta: PartitionMeta, + layers: Vec>, +} +``` + +`PartitionMeta` (`meta.json` at the partition root) stores `n_layers`. + +### Common methods + +```rust +pub fn open(root: &Path) -> OLMResult +pub fn create(root: &Path) -> OLMResult +pub fn n_layers(&self) -> usize +pub fn layer(&self, i: usize) -> &Layer +pub fn query(&self, kmer: CanonicalKmer) -> Option<(usize, Hit)> +pub fn next_layer_writer(&self) -> OLMResult +``` + +`query` probes layers in order and returns `(layer_index, Hit)` on the first match. Expected probe depth: 1 for kmers in layer 0. + +### push_layer + +`push_layer` builds the next layer from a `unitigs.bin` already written via `next_layer_writer`, using `DEFAULT_BLOCK_BITS`: + +```rust +// mode 1 +impl LayeredMap<()> { + pub fn push_layer(&mut self) -> OLMResult +} + +// mode 2 +impl LayeredMap { + pub fn push_layer(&mut self, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult + pub fn push_layer_from_map(&mut self, counts: &HashMap) -> OLMResult +} +``` + +Mode 3 (`PersistentBitMatrix`) has no `push_layer` on `LayeredMap`; callers build directly via `Layer::build_presence`. --- @@ -131,14 +257,6 @@ impl BitPartials for LayeredStore { … } // element-wi Because blanket impls compose, `LayeredStore>` automatically inherits all three traits when `S` does — providing the partitioned level without a separate type. -**Aggregation hierarchy:** - -``` -PersistentCompactIntMatrix implements CountPartials -LayeredStore via blanket impl (one partition) -LayeredStore> via blanket impl (partitioned index) -``` - **Leaf implementors** (in `obicompactvec`): | Type | Traits | @@ -146,8 +264,6 @@ LayeredStore> via blanket impl (partitione | `PersistentCompactIntMatrix` | `ColumnWeights` (via `sum()`) + `CountPartials` | | `PersistentBitMatrix` | `ColumnWeights` (via `count_ones()`) + `BitPartials` | -`PersistentCompactIntVec` and `PersistentBitVec` do not implement these traits — they are single-column primitives, not matrix-level aggregators. - See [Kmer index architecture](../architecture/index_architecture.md) for the full trait API and the two-pass normalised-metric pattern. --- @@ -155,34 +271,30 @@ See [Kmer index architecture](../architecture/index_architecture.md) for the ful ## On-disk structure ``` -index_root/ ← LayeredMap (collection) - meta.json - part_00000/ ← Partition - layer_0/ ← Layer - mphf.bin — ptr_hash MPHF (epserde format) - unitigs.bin — packed 2-bit nucleotide sequences - unitigs.bin.idx — UIDX index: n_unitigs, n_kmers, seqls[], packed_offsets[] - evidence.bin — n × u32, each = (chunk_id: 25 bits | rank: 7 bits), LE - counts/ [mode 2] PersistentCompactIntMatrix - meta.json {"n": N, "n_cols": 1} - col_000000.pciv - presence/ [mode 3] PersistentBitMatrix - meta.json {"n": N, "n_cols": G} - col_000000.pbiv - … - layer_1/ - … - part_00001/ +partition_root/ ← LayeredMap (one partition) + meta.json — {"n_layers": N} + layer_0/ ← Layer + layer_meta.json — {"type": "exact"} or {"type": "approx", "b": B, "z": Z} + mphf.bin — ptr_hash MPHF (epserde format) + unitigs.bin — packed 2-bit nucleotide sequences + unitigs.bin.idx — UIDX index (exact evidence only) + evidence.bin — [u32; n], LE (exact evidence only) + fingerprint.bin — packed b-bit array (approx evidence only) + counts/ [mode 2] PersistentCompactIntMatrix + meta.json + col_000000.pciv + presence/ [mode 3] PersistentBitMatrix + meta.json + col_000000.pbiv … + layer_1/ … ``` -**Partition** (`part_XXXXX/`): all kmers whose canonical minimiser hashes to this bucket. Partitions are independent and can be processed in parallel. - -**Layer** (`layer_N/`): one `MphfLayer` plus optional payload. Layer 0 covers dataset A; layer 1 covers kmers in B absent from A; etc. Layers within a partition are always disjoint. +`unitigs.bin.idx` is required by `open()` (random access). `open_sequential()` on `UnitigFileReader` omits it and is used during build passes and approx-evidence construction. --- -## Evidence encoding +## Evidence encoding (exact) `evidence.bin` is a flat `[u32; n]` array with no header. Each u32 encodes one slot: @@ -191,9 +303,9 @@ bits [31:7] = chunk_id (25 bits) — index of the unitig chunk bits [6:0] = rank (7 bits) — kmer index within the chunk (0-based) ``` -Decoding: `chunk_id = raw >> 7`, `rank = raw & 0x7F`. Reconstructing the kmer: read k nucleotides at position `rank` within unitig `chunk_id`. +`chunk_id = raw >> 7`, `rank = raw & 0x7F`. Reconstructing the kmer: read k nucleotides at position `rank` within unitig `chunk_id` (requires `unitigs.bin.idx` for random access). -For k=31, m=11, the observed maximum is ~46 kmers per chunk — well within the 127-kmer u7 capacity. The structural maximum from superkmer construction is k − m + 1 = 21 kmers/unitig; longer unitigs arise from paths spanning more than one superkmer. +For k=31, m=11, the observed maximum is ~46 kmers per chunk — well within the 127-kmer u7 capacity. --- @@ -203,7 +315,7 @@ For k=31, m=11, the observed maximum is ~46 kmers per chunk — well within the type Mphf = PtrHash< u64, // key type: canonical kmer raw encoding CubicEps, // bucket fn: 2.4 bits/key, λ=3.5, α=0.99 - CachelineEfVec>, // remap: 11.6 bits/entry (Elias-Fano) + CachelineEfVec>, // remap: Elias-Fano Xx64, // hasher: XXH3-64 with seed Vec, // pilots >; @@ -211,21 +323,41 @@ type Mphf = PtrHash< `Xx64` is chosen over `FxHash` because canonical kmer raw values are left-aligned u64 with structural zeros in the low bits (42 zeros for k=11, 2 zeros for k=31), which single-multiply hashes distribute poorly. -`CubicEps` with `PtrHashParams::::default()` (λ=3.5) is a balanced tradeoff: 2× slower construction than `Linear/λ=3.0`, 20% less space. +`CubicEps` with `PtrHashParams::::default()` (λ=3.5): 2× slower construction than `Linear/λ=3.0`, ~20% less space. --- -## Query path +## Column append and merge support + +These methods extend existing layers with new genome columns without touching the MPHF. + +### Layer-level genome column append ```rust -pub fn query(&self, kmer: CanonicalKmer) -> Option> { - self.mphf.find(kmer).map(|slot| Hit { slot, data: self.data.read(slot) }) +impl Layer { + pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> bool) -> OLMResult<()> +} + +impl Layer { + pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> u32) -> OLMResult<()> } ``` -`MphfLayer::find` probes the MPHF, decodes evidence, and verifies the kmer — returning `Some(slot)` on match, `None` otherwise. `data.read(slot)` is called only on a confirmed hit. +Both delegate to the corresponding `PersistentBitMatrix::append_column` / `PersistentCompactIntMatrix::append_column`. They write a new column file (`col_NNNNNN.pbiv` / `col_NNNNNN.pciv`) and update `meta.json` to increment `n_cols`. `value_of` is called once per slot (0..n). -In `LayeredMap`, layers are probed in order; the first match wins. Expected probe depth: 1 for kmers in layer 0. +### Presence matrix initialisation + +```rust +impl Layer<()> { + pub fn init_presence_matrix(layer_dir: &Path, n_kmers: usize) -> OLMResult<()> +} +``` + +Called on the first merge of a Presence-mode index. Creates `presence/` with `meta.json {"n": n_kmers, "n_cols": 1}` and `col_000000.pbiv` set entirely to `true`. This retroactively records genome 0 (the original source) as present in every slot, satisfying the column-count invariant before any new-source column is appended. + +### Why the MPHF is never rebuilt + +The MPHF, evidence, and unitigs are built once from the kmer set of a layer and are immutable for the lifetime of that layer. Adding a genome column does not change the kmer set — it only appends a new data column indexed by the same slot numbers. The only disk writes are one new `.pciv`/`.pbiv` file and a single `meta.json` update. --- @@ -235,9 +367,9 @@ When adding dataset B to an existing index: 1. For each partition, probe existing layers for kmers of B routed to that partition. 2. Collect kmers absent from all layers → `B \ index`. -3. Write `B \ index` to a new `unitigs.bin` via `MphfLayer::unitig_writer`. -4. Call `Layer::build` on the new directory. -5. Update `meta.json`. +3. Write `B \ index` to a new `unitigs.bin` via `next_layer_writer()`. +4. Call `Layer::build` (or `build_presence`) on the new layer directory. +5. Call `push_layer` (or `append_layer`) to register the new layer in `meta.json`. Each partition's new layer is built independently; the operation is fully parallel across partitions. @@ -250,62 +382,9 @@ Each partition's new layer is built independently; the operation is fully parall | `ptr_hash 1.1` | MPHF per layer | | `cacheline-ef 1.1` | compact remap inside ptr_hash | | `epserde 0.8` | zero-copy MPHF serialisation | -| `memmap2 0.9` | mmap of evidence and payload files | -| `obiskio` | unitig file writer/reader | +| `memmap2 0.9` | mmap of evidence and fingerprint files | +| `bitvec` | packed b-bit fingerprint storage | +| `obiskio` | unitig file writer/reader + `.idx` build | | `obicompactvec` | payload types + aggregation traits | | `rayon 1` | parallel MPHF construction pass | -| `ndarray 0.16` | aggregation output arrays | - ---- - -## Column append and merge support - -These methods extend existing layers with new genome columns without touching the MPHF. They are the building blocks of the `merge` command. - -### Matrix column append - -```rust -impl PersistentCompactIntMatrix { - pub fn append_column(dir: &Path, value_of: impl Fn(usize) -> u32) -> OLMResult<()> -} - -impl PersistentBitMatrix { - pub fn append_column(dir: &Path, value_of: impl Fn(usize) -> bool) -> OLMResult<()> -} -``` - -Both methods write a new column file (`col_NNNNNN.pciv` / `col_NNNNNN.pbiv`) and update `meta.json` to increment `n_cols`. The `value_of` closure is called once per slot (indexed 0..n) to populate the column. The matrix `n` (row count) is read from the existing `meta.json` and must not change. - -### Presence matrix initialisation - -```rust -impl Layer<()> { - pub fn init_presence_matrix(layer_dir: &Path, n_kmers: usize) -> OLMResult<()> -} -``` - -Called on the first merge of a Presence-mode index. Creates the `presence/` subdirectory with `meta.json {"n": n_kmers, "n_cols": 1}` and `col_000000.pbiv` set entirely to `true`. This retroactively records that genome 0 (the original source) is present in every slot of this layer, satisfying the column count invariant before any new-source column is appended. - -### Layer-level genome column append - -```rust -impl Layer { - pub fn append_genome_column( - layer_dir: &Path, - value_of: impl Fn(usize) -> bool, - ) -> OLMResult<()> -} - -impl Layer { - pub fn append_genome_column( - layer_dir: &Path, - value_of: impl Fn(usize) -> u32, - ) -> OLMResult<()> -} -``` - -These delegate directly to the corresponding `PersistentBitMatrix::append_column` / `PersistentCompactIntMatrix::append_column`. They are typed at the `Layer` level to make call sites mode-aware without exposing the inner matrix path construction. - -### Why the MPHF is never rebuilt - -The MPHF (`mphf.bin`, `evidence.bin`, `unitigs.bin`) is built once from the kmer set of a layer and is immutable for the lifetime of that layer. Adding a genome column does not change the kmer set — it only adds a new data column indexed by the same slot numbers. Rebuilding the MPHF would require re-running the full construction pipeline (two sequential passes over unitigs, parallel ptr_hash construction) and would invalidate any open memory maps. Column append avoids all of this: the only disk writes are one new `.pciv`/`.pbiv` file and a single `meta.json` update. Kmers absent from a given layer are represented as zero (count) or false (presence) values in the new column — no structural change to the layer is required. +| `serde / serde_json` | `LayerMeta` + `PartitionMeta` serialisation | diff --git a/docmd/implementation/storage.md b/docmd/implementation/storage.md index defabf4..baf3e9e 100644 --- a/docmd/implementation/storage.md +++ b/docmd/implementation/storage.md @@ -1,5 +1,136 @@ -# On-disk collection structure +# On-disk index layout -See [obilayeredmap crate](obilayeredmap.md) for the current on-disk layout. +## Directory tree -The index root contains one `part_XXXXX/` directory per partition, each holding one or more `layer_N/` directories. Each layer directory contains `mphf.bin`, `unitigs.bin`, `unitigs.bin.idx`, `evidence.bin`, and optionally a `counts/` or `presence/` payload directory. +``` +/ + index.meta ← JSON: IndexMeta + scatter.done ← sentinel: scatter phase complete + count.done ← sentinel: dereplicate + count complete + index.done ← sentinel: MPHF index fully built + spectrums/ +