diff --git a/docmd/architecture/query.md b/docmd/architecture/query.md index 6a11a11..cb83859 100644 --- a/docmd/architecture/query.md +++ b/docmd/architecture/query.md @@ -19,34 +19,80 @@ Given a set of query sequences, determine for each sequence how many of its k-me The query follows the same superkmer-based partitioning strategy used at indexing time. ``` -for each query sequence: - decompose into superkmers (non-ACGT breaks, same minimiser scheme as indexing) - for each superkmer: - route to partition p via minimiser hash - for each kmer in the superkmer: - lookup kmer in partition p (MPHF → evidence check → matrix row) - accumulate result into per-sequence accumulators - emit annotated sequence +for each batch of sequences: + build QueryBatch: decompose all sequences into superkmers, deduplicate + split superkmers by partition via minimiser hash + for each partition p: + query_partition(p, superkmers_routed_to_p) + → load QueryLayer(s) for p + → for each kmer in each superkmer: MphfLayer::find(kmer) + broadcast results back to each (seq_idx, kmer_offset) that referenced the superkmer + emit annotated sequences ``` -Parallelism is **per sequence**: each worker thread handles all partitions of one sequence independently. This avoids cross-thread coordination when merging partial results and keeps memory usage proportional to the number of concurrent sequences rather than to the number of partitions. +Superkmers that appear more than once in the batch (same sequence or across sequences) are deduplicated: each unique `RoutableSuperKmer` is queried once per partition, and the result is broadcast to every `SKDesc` entry that references it. + +Parallelism is **not yet active** in the current implementation: batches are processed sequentially on a single thread despite the `--threads` flag being parsed. The `QueryBatch` / `split_by_partition` design is structured to support per-partition parallelism in a future iteration. --- -## Exact vs. approximate matching +## Layer lookup: `MphfLayer::find` -### Exact (default) +`MphfLayer::open` reads `layer_meta.json` and loads either exact or approximate evidence. The caller (`QueryLayer::find`) never chooses the dispatch path — it is fixed at open time by `LayerEvidence`: -Standard MPHF lookup followed by evidence check. O(1) per k-mer. +```rust +pub fn find(&self, kmer: CanonicalKmer) -> Option { + match &self.ev { + LayerEvidence::Exact { .. } => self.find_exact(kmer), + LayerEvidence::Approx { .. } => self.find_approx(kmer), + } +} +``` -### 1-mismatch (`--mismatch` flag) +### Exact layers -For each k-mer of the query, generate all `3·k` single-substitution variants. Each variant is canonicalised and looked up independently in the index. If one or more variants are found, their per-genome rows are **summed** into the result for that k-mer position. +`find_exact` maps the k-mer through the MPHF to a slot, then calls `UnitigFileReader::verify_canonical_kmer(chunk_id, rank, kmer)` to confirm the stored k-mer matches. Zero false positives. Requires `UnitigFileReader::open()` (random-access via `.idx`); `open_sequential()` cannot serve random-access verification. -- If a k-mer matches exactly AND one of its variants also matches (distinct k-mers in the index), both contributions are accumulated. -- Exact and approximate matches are tracked separately in the output (see annotation schema below). -- The superkmer routing optimisation is **not** used in 1-mismatch mode: each variant is looked up directly via its own minimiser. -- Cost: up to `3·k` MPHF probes per k-mer position vs. 1 in exact mode. +### Approximate layers + +`find_approx` maps the k-mer through the MPHF, then checks a stored `b`-bit fingerprint of the canonical hash. False-positive rate: **1/2^b per k-mer query**. No `.idx` file is needed; the layer carries only `fingerprint.bin`. + +For a query window of z consecutive k-mers (Findere scheme), the false-positive rate per window is **1/2^(b·z)**. The `z` parameter is recorded in `layer_meta.json` at build time but is not enforced during querying — the caller is responsible for interpreting window-level results accordingly. + +### `QueryLayer` variant selection + +`QueryLayer::open` in `query_layer.rs` selects the data matrix to pair with `MphfLayer`: + +| Condition | Variant | Data returned per k-mer | +|---|---|---| +| `with_counts=true` and `counts/` exists | `Count` | raw count per genome | +| `presence/` exists | `Presence` | 0/1 per genome (bit matrix) | +| only `counts/` exists | `Count` | counts used as-is | +| neither exists | `SetOnly` | 1 for every genome | + +--- + +## `open()` vs `open_sequential()` + +`UnitigFileReader::open()` loads the `.idx` block-offset table, enabling random access to individual unitig chunks. It is required whenever `verify_canonical_kmer` is called (exact layers at query time). + +`UnitigFileReader::open_sequential()` skips the `.idx` and supports only forward iteration. It is sufficient for: +- build passes that scan all unitigs sequentially (`build_exact_evidence`, `build_approx_evidence`); +- the `unitig` subcommand, which iterates and prints unitigs without random access. + +`KmerIndex::open()` (called by `query::run`) triggers `MphfLayer::open` for each layer, which calls `UnitigFileReader::open()` for exact layers. Approximate layers do not open a unitig reader at all. + +--- + +## Presence / count mode at query time + +The `--force-presence` flag and `--presence-threshold` control how per-genome values are accumulated, independently of what the index stores: + +``` +genome_totals[g] += if presence { u32::from(v >= threshold) } else { v } +``` + +`presence` is true when `--force-presence` is set or when the index has no counts (`!with_counts`). The default `presence_threshold` is 1, so any nonzero count counts as a match. --- @@ -55,57 +101,44 @@ For each k-mer of the query, generate all `3·k` single-substitution variants. E Output sequences are written in **OBITools4 format**: the original sequence with a JSON annotation map in the title line. ``` ->read_id {"kmer_total":150,"kmer_found":59,...} +>read_id {"kmer_count":59,"kmer_strict_matches":{"genome_a":42,"genome_b":7,...}} ATCGATCG... ``` -Genome order in all list-valued annotations follows the genome order recorded in `index.meta`. +Genome keys in `kmer_strict_matches` are genome labels from `index.meta`. Key order follows iteration order of `meta.genomes`. --- -## Annotation schema - -### Summary mode (default) +## Annotation schema (current implementation) | Key | Type | Condition | Semantics | |---|---|---|---| -| `kmer_total` | int | always | total k-mers in the (masked) sequence | -| `kmer_found` | int | always | k-mers with at least one match (exact or approx) | -| `kmer_missing` | int | `--count-missing` | k-mers absent from the index | -| `kmer_match` | list[int] | always | per-genome matched k-mer count (exact + approx) | -| `kmer_match_exact` | list[int] | `--mismatch` | per-genome exact match count | -| `kmer_match_approx` | list[int] | `--mismatch` | per-genome approx match count | -| `count_match` | list[int] | count index | per-genome sum of index counts for matched k-mers | +| `kmer_count` | int | always | k-mers with at least one match | +| `kmer_missing` | int | `--count-missing` | k-mers absent from every layer | +| `kmer_strict_matches` | object | always | per-genome accumulated value (label → count or 0/1) | -`kmer_match[i]` is the number of k-mer positions in the query that contribute at least one match to genome i. In 1-mismatch mode, a single k-mer position can contribute to multiple genomes if several of its variants are present in the index. +`kmer_count` counts matched k-mer positions (incremented once per `Some(row)` hit regardless of how many genomes are covered). `kmer_missing` counts `None` hits. -`count_match[i]` sums raw index counts across all matched k-mer positions for genome i. Only meaningful for count indexes. - -### Detail mode (`--detail`) - -All summary keys, plus per-position coverage vectors — one list per genome, length `len(sequence) − k + 1`: - -| Key | Type | Condition | Semantics | -|---|---|---|---| -| `cov_` | list[int] | `--detail` | coverage at each k-mer position for genome i; raw count (count index) or 0/1 (presence index); 0 if absent | -| `cov_exact_` | list[int] | `--detail` + `--mismatch` | exact-match contribution per position | -| `cov_approx_` | list[int] | `--detail` + `--mismatch` | approx-match contribution per position | - -Genome indices in key names are 0-based integers matching the `index.meta` genome order. Genome labels are not used as key names to avoid issues with special characters in long or complex genome identifiers. +**Note on doc/impl divergence:** the doc previously used keys `kmer_total`, `kmer_found`, and `kmer_match` (list). The implementation uses `kmer_count` (int, matched only) and `kmer_strict_matches` (object keyed by genome label). `--mismatch` and `--detail` are parsed but not yet implemented and emit a warning. --- ## CLI ``` -obikmer query -i [--summary | --detail] [--mismatch] [--count-missing] +obikmer query -i [--detail] [--mismatch] [--count-missing] + [--force-presence] [--presence-threshold ] + [-T ] [ ...] ``` -`--summary` is the default; `--detail` implies `--summary` (all summary keys are always present). +`--mismatch` and `--detail` are accepted but currently ignored with a warning on stderr. --- ## Future work -- **Read classification** (`--classify`): assign each read to the genome with the highest `kmer_match` score; emit as a single annotation key. -- **Whitelist / blacklist filtering**: accept or reject sequences based on whether their k-mer match score for a designated set of genomes exceeds a threshold. +- **`--mismatch`**: 1-mismatch approximate matching — generate `3·k` single-substitution variants per k-mer, look each up independently. +- **`--detail`**: per-position coverage vectors (`cov_`) per genome. +- **Read classification** (`--classify`): assign each read to the genome with the highest match score. +- **Parallelism**: activate per-partition or per-sequence worker threads using the already-parsed `--threads` value. +- **Whitelist / blacklist filtering**: threshold-based accept/reject on per-genome match scores. diff --git a/docmd/implementation/chunkreader.md b/docmd/implementation/chunkreader.md index f26ba03..4021c57 100644 --- a/docmd/implementation/chunkreader.md +++ b/docmd/implementation/chunkreader.md @@ -2,19 +2,11 @@ The `obiread` crate provides a streaming iterator that reads FASTA or FASTQ files in fixed-size blocks and yields self-contained chunks, each ending on a complete sequence record boundary. Chunks are consumed in parallel by downstream workers. -## Output type: rope +## Output type: Rope -Each chunk is a `Vec` — a **rope**: a list of reference-counted byte slices that are not necessarily contiguous in memory. The consumer iterates over the slices in order. +Each chunk is a `Rope` — a segmented byte sequence: a `Vec` of blocks, where each block is a `Vec>`. The consumer iterates over the blocks via a forward or backward cursor. -Using `bytes::Bytes` means the split at the record boundary is O(1): `Bytes::split_to(n)` adjusts a reference counter, not memory. No `memcpy` in the common case. - -## Allocation policy - -| Case | Cost | -|------|------| -| Boundary found in the current block (common) | zero extra allocation — `split_to` only | -| Boundary straddles multiple blocks (sequence > block size, rare) | one allocation to pack the rope into a flat buffer | -| EOF flush | zero extra allocation | +`Rope::split_off(pos)` splits at an absolute byte offset in O(log n) (binary search over block-start index). If `pos` falls inside a block, that block is split in two via `Vec::split_off` — no `memcpy` in the common case. ## SeqChunkIter @@ -22,7 +14,7 @@ Using `bytes::Bytes` means the split at the record boundary is O(1): `Bytes::spl pub struct SeqChunkIter { /* private */ } impl Iterator for SeqChunkIter { - type Item = io::Result>; + type Item = io::Result; } pub fn fasta_chunks(source: R) -> SeqChunkIter @@ -32,22 +24,21 @@ pub fn fastq_chunks(source: R) -> SeqChunkIter `next()` loop: ```text -1. read one block of block_size bytes → push onto rope -2. probe check: if the boundary marker ("\n>" or "\n@") is absent from the - last block, skip the splitter (avoids a full backward scan for nothing) -3. call splitter on last block - if found at offset n: - remainder = last_block.split_to(n) ← O(1), zero copy - return std::mem::take(&mut self.rope) ← the chunk -4. if rope.len() > 1 (multi-block accumulation): - pack rope into one flat buffer ← one alloc - retry splitter on flat buffer -5. if EOF: flush remaining rope as final chunk +1. read one block of block_size bytes → push onto Rope +2. call splitter(rope) → Option + if Some(pos): + tail = rope.split_off(pos) ← O(log n), may split one block + chunk = mem::replace(&mut rope, tail) + return Some(Ok(chunk)) +3. if EOF and rope non-empty: return Some(Ok(rope)) as final chunk +4. if EOF and rope empty: return None ``` +The `Splitter` function signature is `fn(&Rope) -> Option`. It returns the absolute byte offset of the start of the last complete record, or `None` if no boundary was found in the accumulated rope (need more data). + ## Boundary detection — FASTA -Backward scan with a 2-state machine. Searches for `>` immediately preceded by `\n` or `\r`: +Backward scan with a 2-state machine. Searches (right to left) for `>` followed by `\n` or `\r` (i.e., a `>` that is preceded by a newline in forward order): ```mermaid stateDiagram-v2 @@ -58,13 +49,13 @@ stateDiagram-v2 FoundGt --> [*] : '\\n' / '\\r' ✓ ``` -Returns the byte offset of the `>` that starts the last complete record. +Returns the byte offset of the `>` that starts the last complete record. Returns `None` if only one `>` is found (cannot confirm there is a prior complete record). ## Boundary detection — FASTQ FASTQ records have a rigid 4-line structure (`@header`, sequence, `+`, quality). The `@` character (ASCII 64, Phred score 31) can appear legitimately in quality lines, making any forward heuristic unreliable. The backward scanner verifies the full structural context before accepting a candidate `@`. -7-state machine (port of Go's `EndOfLastFastqEntry`), scanning from **right to left**. Each time a `+` is found, its position is saved as `restart`; any state mismatch resets the scan to that position. +7-state machine (states 0–6), scanning from **right to left**. Each time a `+` is found, its position is saved as `restart`; any state mismatch resets the scan to that position. ```mermaid stateDiagram-v2 diff --git a/docmd/implementation/kmer.md b/docmd/implementation/kmer.md index 49042bd..62ae607 100644 --- a/docmd/implementation/kmer.md +++ b/docmd/implementation/kmer.md @@ -1,38 +1,57 @@ # Kmer — implementation -## Memory layout +## Types and layout -`Kmer` is a `#[repr(transparent)]` newtype over `u64`: +`KmerOf` is a `#[repr(transparent)]` newtype over `u64` parameterized by a `KmerLength` marker: ```rust #[repr(transparent)] -pub struct Kmer(u64); +pub struct KmerOf(u64, PhantomData); ``` -Nucleotides are packed 2 bits each, **left-aligned**, MSB-first. Nucleotide 0 occupies bits 63–62; nucleotide i occupies bits 63−2i and 62−2i. The low 64−2k bits are always zero. k is **not stored** — it is a parameter of every operation that needs it, and will be owned by the future collection-level indexer. +Three marker types implement `KmerLength`: + +| Marker | `len()` source | Used for | +|--------|---------------|---------| +| `KLen` | `params::k()` | k-mers | +| `MLen` | `params::m()` | minimizers | +| `ConstLen` | const generic `N` | tests | + +Public aliases: + +```rust +pub type Kmer = KmerOf; // k-mer, global k +pub type Minimizer = CanonicalKmerOf; // canonical m-mer, global m +``` + +Nucleotides are packed 2 bits each, **left-aligned**, MSB-first. Nucleotide 0 occupies bits 63–62; nucleotide i occupies bits 63−2i and 62−2i. The low 64−2·len bits are always zero. The length is **not stored** — every operation reads it from `L::len()`. | 63–62 | 61–60 | … | 63−2(k−1)−1 to 63−2(k−1) | 63−2k down to 0 | |-------|-------|---|--------------------------|-----------------| | nt 0 | nt 1 | … | nt k−1 | zero padding | +## Global parameters + +`params::set_k(k)` / `params::k()` and `params::set_m(m)` / `params::m()` are backed by `OnceLock` in production (write-once, panic on conflict) and by `thread_local! { Cell }` in test builds (per-thread, freely writable). `params::init(k, m)` sets both in one call. + ## Encoding -`Kmer::from_ascii(ascii, k)` encodes the first k bytes of an ASCII slice using the shared `ENC` table (see [SuperKmer — ASCII encoding](superkmer.md#ascii-encoding-and-decoding)): +`KmerOf::::from_ascii(ascii)` encodes the first `L::len()` bytes using the shared `ENC` table (see [SuperKmer — ASCII encoding](superkmer.md#ascii-encoding-and-decoding)): ```rust for i in 0..k { val = (val << 2) | encode_base(ascii[i]) as u64; } -Kmer(val << (64 - 2 * k)) +KmerOf(val << (64 - 2 * k), PhantomData) ``` Zero allocation — result lives on the stack. ## Decoding -`write_ascii(k, buf)` appends k ASCII characters to a caller-supplied `Vec` using the shared `DEC4` table: one lookup per 4 nucleotides, two partial-byte lookups for the remainder. No allocation in the hot path. +`write_ascii(writer)` writes k ASCII characters to any `W: Write` using the shared `DEC4` table: one lookup per 4 nucleotides, one partial lookup for the remainder. No allocation in the hot path. -`to_ascii(k)` is a convenience wrapper that allocates and returns a `Vec`; intended for tests and display only. +`to_ascii()` is a convenience wrapper that allocates and returns a `Vec`; intended for tests and display only. ## Reverse complement @@ -43,18 +62,30 @@ let x = !self.0; / let x = x.swap_bytes(); // reverse bytes let x = ((x >> 4) & 0x0F0F0F0F0F0F0F0F) | ((x & 0x0F0F0F0F0F0F0F0F) << 4); // swap nibbles let x = ((x >> 2) & 0x3333333333333333) | ((x & 0x3333333333333333) << 2); // swap 2-bit groups -Kmer(x << (64 - 2 * k)) +KmerOf(x << (64 - 2 * k), PhantomData) ``` After complementing, bytes are reversed (`swap_bytes`), then nibbles, then 2-bit groups — restoring 2-bit nucleotides to their correct positions in reverse order. A final left-shift realigns to MSB. Zero allocation — result lives on the stack. -## Canonical form +## Canonical form and `CanonicalKmerOf` + +`canonical()` returns a `CanonicalKmerOf` — a distinct newtype that carries the same `u64` layout but enforces the invariant that the stored value equals `min(kmer, revcomp)`: ```rust -pub fn canonical(&self, k: usize) -> Self { - let rc = self.revcomp(k); - if self.0 <= rc.0 { *self } else { rc } +pub fn canonical(&self) -> CanonicalKmerOf { + let rc = self.revcomp(); + CanonicalKmerOf(if self.0 <= rc.0 { self.0 } else { rc.0 }, PhantomData) } ``` Lexicographic minimum of forward and reverse-complement, comparing the raw `u64` values directly (left-aligned encoding makes this equivalent to nucleotide-wise comparison). Zero allocation — result lives on the stack. + +`CanonicalKmerOf::from_raw_unchecked(raw)` is the only other public constructor, for trusted paths such as deserialisation. + +## Sliding window helpers + +`push_right(nuc)` / `push_left(nuc)` shift the window by one base in O(1). `is_overlapping(other)` checks whether the last k−1 nucleotides of `self` equal the first k−1 of `other`. + +## Hashing + +`hash_kmer(raw: u64) -> u64` computes `mix64(raw ^ 0x9e3779b97f4a7c15)`, the seeded splitmix64 finalizer. `CanonicalKmerOf::seq_hash()` delegates to `hash_kmer`. diff --git a/docmd/implementation/merge.md b/docmd/implementation/merge.md index 8a51609..cd27349 100644 --- a/docmd/implementation/merge.md +++ b/docmd/implementation/merge.md @@ -26,13 +26,36 @@ Default mode is `Presence`. `Count` mode requires **all** source indexes to have All source indexes must satisfy: - `IndexState::Indexed` (fully built — `index.done` sentinel present) -- Same `kmer_size`, `minimizer_size`, `n_bits` +- Same `kmer_size`, `minimizer_size`, `n_partitions` +- Same evidence kind: all `Exact`, or all `Approx` with identical `(b, z)` parameters - If `Count` mode: all sources must have `with_counts=true` `--force`: if the output directory already exists, it is deleted before the merge begins. --- +## Evidence compatibility + +`validate_evidence_compat(sources)` is called before any I/O. It compares each source's `EvidenceKind` against `sources[0]`: + +- All `Exact` → accepted, output uses `Exact` +- All `Approx { b, z }` with same `(b, z)` → accepted, output uses those parameters +- Any other combination → `OKIError::IncompatibleEvidence`, with a message directing the user to run `reindex` first + +Mixed exact/approx is a hard error, not a silent conversion. + +```rust +fn validate_evidence_compat(sources: &[&KmerIndex]) -> OKIResult +``` + +--- + +## Genome label deduplication + +`compute_labels(sources, rename_duplicates)` assigns final genome labels across all sources before any file is written. The first occurrence of a label keeps the original name. Subsequent occurrences receive `.1`, `.2`, … suffixes when `rename_duplicates` is true, or trigger `OKIError::DuplicateGenomeLabel` otherwise. + +--- + ## Algorithm ### 1. Validation @@ -41,36 +64,72 @@ Check all sources against the constraints above. Abort on any mismatch. ### 2. Bootstrap output from first source -Recursive file copy of `sources[0]` → `output`. This establishes the partition layout, all existing MPHFs, unitigs, and evidence files. The first source's genome is genome 0 in the output. +Recursive file copy of `sources[0]` → `output`. Immediately after the copy: + +- `index.meta` is rewritten with the final genome list (all sources, possibly renamed) and the effective evidence kind. +- In `Presence` mode, any `counts/` directories inherited from source_0 are removed. +- `spectrums/` from source_0 is removed and rebuilt from scratch across all sources, applying the (possibly renamed) labels. + +This establishes the partition layout, all existing MPHFs, unitigs, and evidence files. The first source's genomes occupy columns 0 … `n_dst_genomes - 1` in the destination. ### 3. For each subsequent source (parallel across partitions) -For each partition, process one source at a time sequentially: +`KmerPartition::merge_partition(i, sources, mode, n_dst_genomes, block_bits)` is called for each partition index `i`. `block_bits` is taken from `dst.meta.config.block_bits`. -**a. Classify kmers** +Each entry in `sources` is `(&KmerPartition, n_genomes)` where `n_genomes` is the column count that source contributes (> 1 when the source is itself a merged index). -Iterate all kmers in the source partition (via `UnitigFileReader` + canonical kmer iteration). For each kmer, probe the destination `LayeredMap<()>`: +**First merge, Presence mode**: when `n_dst_genomes == 1`, `Layer::<()>::init_presence_matrix` is called on every existing destination layer before any source column is appended. This creates `presence/col_000000.pbiv` set all-true (genome 0 is present in every slot). -- **Hit** `(dst_layer, slot)`: record `(dst_layer, slot, value)` where `value` is the source count (Count mode) or `1` (Presence mode). -- **Miss**: add kmer to a `GraphDeBruijn` accumulator; record its value in a `HashMap>`. +**Pass 1 — classify kmers** -**b. Extend existing layers** +Iterate all kmers from all source partitions (via `UnitigFileReader` + canonical kmer iteration). For each kmer, probe the destination `LayeredMap<()>`: -For each existing layer in the destination partition, call `append_genome_column` once per source genome being merged. Slots that received a hit are populated with their recorded value; all other slots receive 0 (absent). +- **Hit**: kmer already in the destination; record for Pass 2. +- **Miss**: push kmer into a `GraphDeBruijn` accumulator. -If this is the **first merge** and operating in Presence mode, call `Layer<()>::init_presence_matrix` before appending any source column. This creates `presence/` with `col_000000.pbiv` set all-true (genome 0 is present in every slot of this layer). +**New layer construction** -**c. Build new layer for new kmers** +If the accumulator is non-empty, compute de Bruijn unitigs and call `Layer::<()>::build(&new_layer_dir, block_bits)`. All kmers absent from the destination — across **all** sources — accumulate into a **single** graph, producing one new layer per merge operation (not one per source). -If the `GraphDeBruijn` accumulator is non-empty (misses exist), write compact unitigs from the de Bruijn graph, then call the appropriate `Layer::build` variant. Before appending the source's own column, prepend `n_existing_genomes` absent columns (value 0) — one per genome already in the index — so the column count invariant holds immediately after layer creation. +**Pass 2 — fill column builders** -**d. Update partition metadata** +For each source and each of its layers, re-iterate unitigs and look up stored values via `SrcLayerData::lookup(kmer, src_n)`: -Write updated `meta.json` with the incremented `n_layers` if a new layer was added. +- `SrcLayerData::SetMembership` — no data directory exists; every kmer returns `vec![1; n_genomes]` +- `SrcLayerData::Presence` — reads `PersistentBitMatrix` from `presence/` +- `SrcLayerData::Count` — reads `PersistentCompactIntMatrix` from `counts/` + +Hits are routed to `exist_builders[dst_layer][src_col]`; misses are routed to `new_src_builders[src_col]`. + +**Column prepending for new layers** + +Before source columns are written to the new layer, `n_dst_genomes` absent columns (all-zero / all-false) are prepended — one per genome already in the index — so the column count invariant holds immediately after layer creation. + +**Close and update metadata** + +Close all builders; update `presence/meta.json` or `counts/meta.json` with `{"n": N, "n_cols": n_dst_genomes + n_src_total}`; increment `PartitionMeta::n_layers` if a new layer was added. ### 4. Update index metadata -Append the merged source's genome label(s) to `index.meta.genomes`. Write updated `index.meta`. +`index.meta` was already written during bootstrap with the complete genome list and evidence kind. No further update is needed after the partition loop. + +--- + +## `append_genome_column` + +Defined on two concrete specialisations of `Layer`: + +```rust +impl Layer { + pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> u32) -> OLMResult<()> +} + +impl Layer { + pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> bool) -> OLMResult<()> +} +``` + +Each appends one column file to the matrix subdirectory (`counts/` or `presence/`). In `merge_partition`, columns are written directly via `PersistentBitVecBuilder` / `PersistentCompactIntVecBuilder` rather than through these helpers, but the invariant they enforce is the same. --- @@ -78,27 +137,31 @@ Append the merged source's genome label(s) to `index.meta.genomes`. Write update After any merge, **every layer in every partition has exactly `n_genomes` columns**, where `n_genomes` is the total genome count in the index at that point. -This is maintained by three mechanisms: +Maintained by three mechanisms: -1. **Existing layers**: one column appended per source genome (`append_genome_column`). -2. **New layers created during merge**: `n_existing_genomes` absent columns prepended before the source's own column. -3. **First merge, Presence mode**: `init_presence_matrix` retroactively creates `presence/col_0` all-true for genome 0 before any source column is appended. +1. **Existing layers**: `n_src_total` columns appended (one per source genome). +2. **New layers created during merge**: `n_dst_genomes` absent columns prepended before source columns. +3. **First merge, Presence mode**: `init_presence_matrix` retroactively creates `presence/col_0` all-true for genome 0. -The invariant is a precondition of the `LayeredStore` aggregation traits: `col_weights()` and all partial distance methods assume every inner store has the same column count. +The invariant is a precondition of `LayeredStore` aggregation traits: `col_weights()` and all partial distance methods assume every inner store has the same column count. --- -## New layer construction +## Error variants relevant to merge -All kmers absent from the destination index — across **all** sources being merged — accumulate into a **single** `GraphDeBruijn` per partition. One new layer is built from this graph, not one layer per source. This keeps the layer count bounded and preserves compact unitig representation. - -De Bruijn reconstruction ensures that overlapping k-1 suffixes/prefixes from different source kmers are merged into minimal unitigs before MPHF construction. +| Variant | Condition | +|---|---| +| `OKIError::NotIndexed(path)` | Source not in `Indexed` state | +| `OKIError::IncompatibleConfig` | Mismatched `kmer_size`, `minimizer_size`, or `n_partitions` | +| `OKIError::MismatchedMode` | Count mode but a source has `with_counts=false` | +| `OKIError::IncompatibleEvidence(msg)` | Mixed exact/approx or different approx `(b, z)` | +| `OKIError::DuplicateGenomeLabel(label)` | Duplicate label and `rename_duplicates=false` | --- ## On-disk impact -After merging `G` genomes (1 existing + G-1 new sources): +After merging `G` genomes (sources_0 contributes `G0`, subsequent sources the rest): ``` partitions/ @@ -106,28 +169,28 @@ partitions/ index/ meta.json ← n_layers updated if new layer added layer_0/ - mphf.bin ← unchanged (MPHF immutable) + mphf.bin ← unchanged unitigs.bin ← unchanged evidence.bin ← unchanged presence/ ← created on first merge (Presence mode) meta.json {"n": N, "n_cols": G} - col_000000.pbiv ← all-true (genome 0) - col_000001.pbiv ← source 1 presence + col_000000.pbiv ← all-true (genome 0 … G0-1) + col_000001.pbiv ← next source ... counts/ ← extended (Count mode) meta.json {"n": N, "n_cols": G} col_000000.pciv ← genome 0 counts (from original build) - col_000001.pciv ← source 1 counts + col_000001.pciv ← next source ... - layer_1/ ← new layer (if new kmers found) + layer_N/ ← new layer (if new kmers found) mphf.bin unitigs.bin evidence.bin presence/ or counts/ meta.json {"n": N1, "n_cols": G} - col_000000.pbiv ← all-false (genome 0 absent from this layer) + col_000000.pbiv ← all-false (absent for existing genomes) ... - col_000001.pbiv ← source 1 presence in this new layer +spectrums/ +