diff --git a/TODO.md b/TODO.md index a874150..ceca1fe 100644 --- a/TODO.md +++ b/TODO.md @@ -1,3 +1,21 @@ +## A finir dans le cadre de l'extension des index à une forme approximative + +- Il faut avoir un chemin explicite pour construire en mode exact avec des méthodes qui ont ce mot exact à l'intérieur. + - pub fn find_exact (src/obilayeredmap/src/mphf_layer.rs) + - pub fn build_exact_evidence (src/obilayeredmap/src/layer.rs) + +Comme elles existent actuellement pour le mode approx. + +Ensuite, il faudra définir des méthodes génériques +- find() +- build_evidence() + +qui utilise la bonne version suivant le mode de l'index de manière complètement transparente. +Avec ce système, tout le reste du code devrait être insensible au fait que l'on utilise un index exact ou approximatif. + +Sauf qu'avec un index approximatif, les résultats seront approximatifs. + + ## commandes à ajouter - aggregate : aggrege toutes les colonnes d'une matrice d'index en une seule colonne. diff --git a/docmd/architecture/index_architecture.md b/docmd/architecture/index_architecture.md index 0915a9c..752334a 100644 --- a/docmd/architecture/index_architecture.md +++ b/docmd/architecture/index_architecture.md @@ -2,169 +2,155 @@ ## Fundamental invariant -A given canonical kmer belongs to **exactly one partition** and **exactly one layer** within that partition. This is the property that makes all aggregation operations decomposable and parallelisable without coordination. +A given canonical kmer belongs to **exactly one partition** and **exactly one layer** within that partition. This property makes all aggregation operations decomposable and parallelisable without coordination. --- ## Three-level hierarchy ``` -PartitionedIndex -├── LayeredPartition (one per minimiser bucket) -│ ├── MphfLayer 0 kmer → slot (immutable bijection) -│ │ ├── DataStore A slot → T (e.g. counts) -│ │ └── DataStore B slot → T (e.g. presence/absence, derived) -│ ├── MphfLayer 1 -│ │ └── DataStore A -│ └── ... -├── LayeredPartition -│ └── ... +KmerIndex (index.meta + KmerPartition) +├── partition_0/index/ one directory per minimiser bucket +│ ├── meta.json PartitionMeta { n_layers } +│ ├── layer_0/ +│ │ ├── layer_meta.json LayerMeta { evidence: EvidenceKind } +│ │ ├── mphf.bin PtrHash MPHF +│ │ ├── unitigs.bin unitig spine (never overwritten) +│ │ ├── evidence.bin exact evidence (Exact only) +│ │ ├── unitigs.bin.idx block index (Exact only) +│ │ ├── fingerprint.bin fingerprints (Approx only) +│ │ ├── counts/ PersistentCompactIntMatrix (with_counts = true) +│ │ └── presence/ PersistentBitMatrix +│ └── layer_1/ +│ └── ... +└── partition_1/index/ + └── ... ``` -**PartitionedIndex**: routes queries to partitions via canonical minimiser hash. Owns the partition count and routing scheme (fixed at creation). Dispatches aggregations across partitions in parallel. +**KmerIndex**: root entry point. Owns `IndexMeta` (written to `index.meta`) and a `KmerPartition` that routes canonical kmers to partition directories. All partition-level operations are dispatched in parallel via rayon. -**LayeredPartition**: one directory per minimiser bucket. Holds a `Vec`. Each layer covers a disjoint kmer set — layer 0 is built from dataset A; layer 1 covers kmers in B absent from layer 0; and so on. Layers within a partition are always disjoint. +**Partition directory**: one directory per minimiser bucket. `PartitionMeta` (stored as `meta.json`) records `n_layers`. Layers within a partition cover disjoint kmer sets. -**MphfLayer**: the MPHF + evidence + unitig spine. Maps `kmer → slot` for its disjoint kmer set. Immutable once built. Independent of any data attached to it. - -**DataStore**: a slot-indexed data array (e.g. `PersistentCompactIntMatrix`, `PersistentBitMatrix`). Attached to a `MphfLayer` externally. Multiple stores of different types can coexist on the same `MphfLayer`. +**Layer directory**: one `MphfLayer` plus optional data stores. `LayerMeta` (stored as `layer_meta.json`) records which `EvidenceKind` was used. The MPHF and `unitigs.bin` are immutable once built; evidence files are the only part replaced by `reindex`. --- -## MphfLayer — autonomous mapping layer +## IndexConfig and IndexMeta ```rust -MphfLayer::find(kmer: CanonicalKmer) -> Option // slot, or None if absent -MphfLayer::n() -> usize // number of slots -MphfLayer::build(dir: &Path) -> OLMResult<(Self, usize)> // from unitigs.bin -MphfLayer::open(dir: &Path) -> OLMResult +pub struct IndexConfig { + pub kmer_size: usize, + pub minimizer_size: usize, + pub n_bits: usize, // log2(n_partitions) + pub with_counts: bool, + pub evidence: EvidenceKind, + pub block_bits: u8, // .idx granularity: 2^block_bits unitigs/block; 0 = one entry per unitig +} + +pub struct IndexMeta { + pub version: u32, + pub config: IndexConfig, + pub genomes: Vec, // ordered; index = genome column number +} + +pub struct GenomeInfo { + pub label: String, + pub meta: HashMap, // arbitrary categorical metadata +} ``` -`find` returns `Some(slot)` only if the kmer is actually in this layer (evidence check included). Returns `None` for kmers present in other layers or absent from the index. +`IndexMeta` is serialised as `index.meta` (JSON). It is the authority for the ordered list of genomes and for the parameters that govern all subsequent operations on the index. -The MPHF (`mphf.bin`, `evidence.bin`, `unitigs.bin`) is built once and never rebuilt. All data derivation operations (count → presence, thresholding, merging) reuse the same `MphfLayer`. +--- + +## EvidenceKind + +```rust +pub enum EvidenceKind { + Exact, + Approx { b: u8, z: u8 }, +} +``` + +Controls which files are written per layer and which query path is taken: + +| Variant | Files written | False-positive rate | +|---|---|---| +| `Exact` | `evidence.bin`, `unitigs.bin.idx` | 0 | +| `Approx { b, z }` | `fingerprint.bin` | ≈ W / 2^(b·z) per read (Findere) | + +`EvidenceKind` is stored both in `IndexConfig` (index-wide default, updated by `reindex`) and in each `LayerMeta` (per-layer record of what was actually built). + +--- + +## MphfLayer — autonomous kmer → slot mapping + +```rust +pub struct MphfLayer { + mphf: PtrHash<…>, + ev: LayerEvidence, // Exact { evidence, unitigs } | Approx { fingerprint } + n: usize, +} +``` + +`MphfLayer::find(kmer)` dispatches transparently to `find_exact` or `find_approx` based on the evidence loaded at `open` time (read from `layer_meta.json`). Returns `Some(slot)` only if the kmer is confirmed present; `None` for absent or out-of-range. + +``` +find_exact: slot = mphf(kmer); decode evidence → (chunk_id, rank); verify kmer in unitigs +find_approx: slot = mphf(kmer); check fingerprint[slot] == seq_hash(kmer) +``` + +`block_bits` controls the `.idx` file written alongside `evidence.bin`. At `block_bits = 0`, every unitig chunk has an index entry, giving O(1) random access; larger values trade access time for a smaller `.idx`. + +The MPHF and `unitigs.bin` are never rebuilt by any post-build operation. + +--- + +## Layer\ — MPHF + data payload + +```rust +pub struct Layer { + mphf: MphfLayer, + data: D, +} +``` + +`D` selects the attached data payload: + +| `D` | Data directory | `Item` returned by `query` | +|---|---|---| +| `()` | — | `()` (set membership only) | +| `PersistentCompactIntMatrix` | `counts/` | `Box<[u32]>` (counts per genome) | +| `PersistentBitMatrix` | `presence/` | `Box<[bool]>` (presence per genome) | + +`Layer::query(kmer)` delegates to `MphfLayer::find`, then calls `data.read(slot)` if a slot is returned. Both exact and approximate evidence are handled transparently; the caller sees only `Option>`. + +Build-time entry points: + +```rust +Layer<()>::build(out_dir, block_bits) // set membership +Layer::build(out_dir, block_bits, count_of) +Layer::build_presence(out_dir, block_bits, n_genomes, present_in) + +Layer::<()>::build_evidence(layer_dir, kind, block_bits) // evidence only (reindex path) +``` --- ## DataStore — slot-indexed data -```rust -trait DataStore { - type Item; - fn get(&self, slot: usize) -> Self::Item; - fn n(&self) -> usize; -} -``` +`PersistentCompactIntMatrix` and `PersistentBitMatrix` are slot-indexed stores. They know nothing about kmers or MPHFs. -Concrete types from `obicompactvec`: - -| Type | `Item` | Column stats | Use | +| Type | `Item` | Aggregation method | Use | |---|---|---|---| -| `PersistentCompactIntMatrix` | `Box<[u32]>` | `sum() -> Array1` | count per sample per slot | -| `PersistentBitMatrix` | `Box<[bool]>` | `count_ones() -> Array1` | presence per sample per slot | - -`sum()` and `count_ones()` are the bridge between the per-matrix level and cross-layer aggregation: they give the total weight of each column within one (partition, layer) pair, which can be summed to get global column weights. - -A `DataStore` knows nothing about kmers or MPHFs. It is indexed by `usize` slot only. +| `PersistentCompactIntMatrix` | `Box<[u32]>` | `sum() → Array1` | counts per genome per slot | +| `PersistentBitMatrix` | `Box<[bool]>` | `count_ones() → Array1` | presence per genome per slot | --- -## Distance matrix API on DataStore types +## Aggregation traits — `obicompactvec::traits` -Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` expose two families of distance matrix methods. - -### Full distance matrices - -Compute the final `n_cols × n_cols` distance matrix from data within a single matrix. Internally parallelised over the upper triangle via rayon. - -```rust -// PersistentCompactIntMatrix -fn bray_dist_matrix(&self) -> Array2 -fn relfreq_bray_dist_matrix(&self) -> Array2 -fn euclidean_dist_matrix(&self) -> Array2 -fn relfreq_euclidean_dist_matrix(&self) -> Array2 -fn hellinger_dist_matrix(&self) -> Array2 -fn jaccard_dist_matrix(&self) -> Array2 -fn threshold_jaccard_dist_matrix(&self, threshold: u32) -> Array2 - -// PersistentBitMatrix -fn jaccard_dist_matrix(&self) -> Array2 -fn hamming_dist_matrix(&self) -> Array2 -``` - -These are convenience methods. For a `LayeredDataStore` or `PartitionedDataStore` they cannot be used directly — the partial API is required. - -### Partial distance matrices - -Return additive components that can be summed element-wise across (partition, layer) pairs before computing the final distance. This is what makes cross-layer and cross-partition aggregation possible. - -**Category 1 — self-contained partials**: additive without any external parameter. - -```rust -// PersistentCompactIntMatrix -fn partial_bray_dist_matrix(&self) - -> (Array2, // sum_min[i,j] - Array1) // col_sums[k] - -fn partial_euclidean_dist_matrix(&self) -> Array2 // sum of squared diffs -fn partial_threshold_jaccard_dist_matrix(&self, threshold: u32) - -> (Array2, // inter[i,j] - Array2) // union[i,j] - -// PersistentBitMatrix -fn partial_jaccard_dist_matrix(&self) - -> (Array2, // inter[i,j] - Array2) // union[i,j] -fn partial_hamming_dist_matrix(&self) -> Array2 // differing bits -``` - -**Category 2 — normalised partials**: require global column sums as input, computed beforehand across all (partition, layer) pairs. - -```rust -// PersistentCompactIntMatrix only -fn partial_relfreq_bray_dist_matrix(&self, col_sums: &Array1) - -> Array2 // Σ_slot min(a_slot/sum_i, b_slot/sum_j) - -fn partial_relfreq_euclidean_dist_matrix(&self, col_sums: &Array1) - -> Array2 // Σ_slot (a_slot/sum_i - b_slot/sum_j)² - -fn partial_hellinger_euclidean_dist_matrix(&self, col_sums: &Array1) - -> Array2 // Σ_slot (√(a/sum_i) - √(b/sum_j))² -``` - -The `col_sums` parameter must reflect the GLOBAL count across all layers and all partitions — passing a per-layer sum would give a wrong result. This constraint drives the two-pass algorithm described below. - ---- - -## Progressive aggregation principle - -Aggregation is **hierarchical**: each level computes its contribution by aggregating from the level immediately below it. No level skips a level or collects raw data from two levels down. - -``` -PersistentCompactIntMatrix::col_weights() — column sums for one (partition, layer) matrix - ↓ Σ across layers -LayeredStore::col_weights() — column sums for one partition - ↓ Σ across partitions -LayeredStore>::col_weights() — global column sums -``` - -The same cascade applies to every partial: - -``` -PersistentCompactIntMatrix::partial_bray() — one (partition, layer) - ↓ element-wise Σ across layers -LayeredStore::partial_bray() — one partition - ↓ element-wise Σ across partitions -LayeredStore>::partial_bray() — global partial → final dist -``` - -Each level presents a stable trait surface to the level above; no level reaches two levels down. - ---- - -## Traits — `obicompactvec::traits` - -Three traits unify the aggregation API across all levels of the hierarchy. +Three traits unify the aggregation API across all hierarchy levels. ```rust trait ColumnWeights: Send + Sync { @@ -172,81 +158,130 @@ trait ColumnWeights: Send + Sync { } trait CountPartials: ColumnWeights { - // self-contained partials (additive, no parameter) - fn partial_bray(&self) -> Array2; - fn partial_euclidean(&self) -> Array2; - fn partial_threshold_jaccard(&self, threshold: u32) -> (Array2, Array2); - // normalised partials (global col_weights passed in cascade) - fn partial_relfreq_bray(&self, global: &Array1) -> Array2; - fn partial_relfreq_euclidean(&self, global: &Array1) -> Array2; - fn partial_hellinger(&self, global: &Array1) -> Array2; - // provided finalisation methods (default implementations) - fn bray_dist_matrix(&self) -> Array2 { … } - fn euclidean_dist_matrix(&self) -> Array2 { … } - fn threshold_jaccard_dist_matrix(&self, threshold: u32) -> Array2 { … } - fn relfreq_bray_dist_matrix(&self) -> Array2 { … } - fn relfreq_euclidean_dist_matrix(&self) -> Array2 { … } - fn hellinger_dist_matrix(&self) -> Array2 { … } + fn partial_bray(&self) -> Array2; + fn partial_euclidean(&self) -> Array2; + fn partial_threshold_jaccard(&self, threshold: u32) -> (Array2, Array2); + fn partial_relfreq_bray(&self, global: &Array1) -> Array2; + fn partial_relfreq_euclidean(&self, global: &Array1) -> Array2; + fn partial_hellinger(&self, global: &Array1) -> Array2; + // provided finalisation methods with default impls + fn bray_dist_matrix(&self) -> Array2 { … } + fn relfreq_bray_dist_matrix(&self) -> Array2 { … } + // … } trait BitPartials: ColumnWeights { - fn partial_jaccard(&self) -> (Array2, Array2); - fn partial_hamming(&self) -> Array2; + fn partial_jaccard(&self) -> (Array2, Array2); + fn partial_hamming(&self) -> Array2; // provided fn jaccard_dist_matrix(&self) -> Array2 { … } fn hamming_dist_matrix(&self) -> Array2 { … } } ``` -**Leaf implementors** (in `obicompactvec`): +Leaf implementors: | Type | Traits | |---|---| -| `PersistentCompactIntMatrix` | `ColumnWeights` (via `sum()`), `CountPartials` | -| `PersistentBitMatrix` | `ColumnWeights` (via `count_ones()`), `BitPartials` | - -`PersistentCompactIntVec` and `PersistentBitVec` do **not** implement these traits — they are single-column primitives, not matrix-level aggregators. +| `PersistentCompactIntMatrix` | `ColumnWeights`, `CountPartials` | +| `PersistentBitMatrix` | `ColumnWeights`, `BitPartials` | --- -## `LayeredStore` — `obilayeredmap` - -A single generic wrapper replaces the need for named `LayeredDataStore` and `PartitionedDataStore` types: +## LayeredStore\ — recursive aggregation wrapper ```rust pub struct LayeredStore(Vec); ``` -Three blanket impls propagate the traits up the hierarchy: +Three blanket impls propagate all traits up the hierarchy: ```rust -impl ColumnWeights for LayeredStore { … } // Σ across inner stores -impl CountPartials for LayeredStore { … } // same pattern -impl BitPartials for LayeredStore { … } // same pattern +impl ColumnWeights for LayeredStore { … } +impl CountPartials for LayeredStore { … } +impl BitPartials for LayeredStore { … } ``` -Because the blanket impl is recursive, **`LayeredStore>`** automatically inherits all three traits when `S` does — no separate `PartitionedStore` type is needed: +This makes `LayeredStore>` automatically implement `CountPartials` — no separate `PartitionedStore` type is needed: ``` -PersistentCompactIntMatrix implements CountPartials -LayeredStore via blanket impl (= one partition) -LayeredStore> via blanket impl (= partitioned index) +PersistentCompactIntMatrix leaf (one layer) +LayeredStore one partition (layers are disjoint) +LayeredStore> whole index (partitions are independent) ``` -### Normalised metrics — two-pass cascade - -The normalised finalisation methods call `col_weights()` first (pass 1), then the normalised partial (pass 2). Both calls go through the same blanket impl, so the cascade is automatic: +Normalised metrics require global column sums — computed in a two-pass cascade: ```rust -// called on LayeredStore> +// on LayeredStore> fn relfreq_bray_dist_matrix(&self) -> Array2 { - let global = self.col_weights(); // pass 1 — progressive sum at every level - let p = self.partial_relfreq_bray(&global); // pass 2 — global passed in cascade - p.mapv(|v| 1.0 - v) // finalise (diagonal zeroed separately) + let global = self.col_weights(); // pass 1 — sums up hierarchy + let p = self.partial_relfreq_bray(&global); // pass 2 — global broadcast read-only + p.mapv(|v| 1.0 - v) } ``` -`global` is exact: each kmer belongs to exactly one `(partition, layer)` pair, so there is no double-counting across the hierarchy. +Because each kmer belongs to exactly one `(partition, layer)` pair, `col_weights()` has no double-counting across the hierarchy. + +--- + +## Progressive aggregation principle + +No level reaches two levels down. Each level sums contributions from the level immediately below: + +``` +PersistentCompactIntMatrix::col_weights() — one (partition, layer) + ↓ Σ across layers +LayeredStore::col_weights() — one partition + ↓ Σ across partitions +LayeredStore>::col_weights() — global +``` + +The same cascade applies to every partial method. + +--- + +## Multi-genome column invariant + +After any merge, every layer in every partition has exactly `n_genomes` columns, where `n_genomes` is the current total in `index.meta`. This holds for both `PersistentCompactIntMatrix` and `PersistentBitMatrix`. + +Maintained by three coordinated operations: + +**Existing layers — column append.** `Layer::append_genome_column` appends one column to each existing layer. Slots matching the incoming genome receive its count or `true`; all other slots receive 0 or `false`. + +**New layers — absent columns prepended.** When a new layer is created for kmers unique to the incoming genome, `n_existing_genomes` absent columns are prepended before the incoming genome's column, so the new layer immediately has the same column count as all other layers. + +**First merge, Presence mode — `init_presence_matrix`.** The initial single-genome index has no `presence/` directory (presence is implicit). On the first merge, `Layer<()>::init_presence_matrix` materialises genome 0's presence column (all `true`) retroactively, raising the column count from 0 to 1 before appending column 1. + +This invariant is the precondition for correct progressive aggregation: every level can blindly sum matrices from below because all matrices have the same shape. + +--- + +## Query model + +### Point query + +``` +minimiser(kmer) → partition p +for each layer l in p: + if let Some(slot) = MphfLayer_l.find(kmer): + return data_l.read(slot) +return None +``` + +O(n_layers) MPHF probes worst case; O(1) expected. The result comes from exactly one `(partition, layer)`. + +### Aggregation + +``` +result = reduce( + for p in partitions: // parallel + for l in layers(p): // parallel + partial(data_p_l) +) +``` + +For normalised metrics, replace with the two-pass cascade. --- @@ -254,103 +289,25 @@ fn relfreq_bray_dist_matrix(&self) -> Array2 { | Level | Unit | Coordination | |---|---|---| -| Across partitions | `LayeredStore>` inner stores | none — fully independent | -| Across layers within a partition | `LayeredStore` inner stores | none — disjoint kmer sets | +| Across partitions | inner stores of `LayeredStore>` | none | +| Across layers within a partition | inner stores of `LayeredStore` | none — disjoint kmer sets | | Normalised pass 1 (`col_weights`) | per inner store | none — additive | | Normalised pass 2 (partial) | per inner store | `global` broadcast read-only | | Within a matrix (distance) | upper-triangle pair `(i,j)` | none — rayon `par_iter` | -All levels use rayon `par_iter` internally; `reduce_with` performs a parallel tree reduction. +--- + +## reindex — evidence conversion in place + +`KmerIndex::reindex(target, block_bits)` converts every layer's evidence bundle to `target` without touching the MPHF or `unitigs.bin`: + +- `→ Exact`: builds `evidence.bin` + `unitigs.bin.idx`; removes `fingerprint.bin` +- `→ Approx { b, z }`: builds `fingerprint.bin`; removes `evidence.bin` + `unitigs.bin.idx` + +On success, `IndexConfig::evidence` and `IndexConfig::block_bits` are updated in `index.meta`. Each layer's `layer_meta.json` is also rewritten with the new `EvidenceKind`. --- -## Query model +## estimate — parameter dry-run -### Point query — `kmer → Option` - -``` -minimiser(kmer) → partition p -for each layer l in p: - slot = MphfLayer_l.find(kmer) - if slot is Some: - return DataStore_l.get(slot) -return None -``` - -O(n_layers) MPHF probes worst case; O(1) expected. No cross-layer fusion — the result comes from exactly one (partition, layer). - -### Aggregation — `→ Result` - -``` -result = reduce( - for p in partitions: // parallel - for l in layers(p): // parallel - partial(DataStore_p_l) -) -``` - -For normalised metrics replace with the two-pass scheme above. - ---- - -## DataStore derivation - -Because the `MphfLayer` is independent of its data stores, new stores can be derived from existing ones without rebuilding the MPHF: - -``` -// count → presence/absence, parallel across (partition, layer) -for (p, l) in all_partition_layer_pairs().par_iter(): - count_store = open PersistentCompactIntMatrix at (p, l) - presence_store = PersistentBitMatrix::from_count_matrix(count_store, threshold, dir) -``` - -Other derivations: threshold a count matrix → binary presence matrix; union two presence matrices; merge two count matrices (saturating add, column-wise). All are local to one `(partition, layer)` pair. - ---- - -## Relationship to current implementation - -### What is implemented - -- **`obicompactvec::traits`**: `ColumnWeights`, `CountPartials`, `BitPartials` are defined and implemented on `PersistentCompactIntMatrix` and `PersistentBitMatrix`. -- **`obilayeredmap::LayeredStore`**: generic wrapper with blanket impls for all three traits. `LayeredStore>` is the partitioned level — no separate type needed. Tests confirm that splitting data across layers and across partitions gives the same distance matrices as computing on flat combined data. - -### What is not yet implemented - -- `Layer` still fuses `MphfLayer` and one `DataStore`. Multiple data stores on the same MPHF are not supported. -- `LayeredMap` is a single-partition structure without distance matrix API; it does not yet use `LayeredStore`. -- No `PartitionedIndex` type for point queries with parallel partition dispatch. - -### Planned refactoring - -1. Extract `MphfLayer` from `Layer` as an autonomous type. -2. Replace `LayerData` trait with the `DataStore` / `ColumnWeights` / `CountPartials` / `BitPartials` system. -3. Rewire `LayeredMap` to hold `LayeredStore` (or bit variant) alongside the MPHF layers. -4. Implement `PartitionedIndex` using `LayeredStore>` for data and parallel dispatch for queries. - ---- - -## Multi-genome column invariant - -### The invariant - -After any merge operation, **every layer in every partition has exactly `n_genomes` columns**, where `n_genomes` is the current total genome count recorded in `index.meta`. This applies to both `PersistentCompactIntMatrix` (Count mode) and `PersistentBitMatrix` (Presence mode). - -### How it is maintained - -The invariant is established and preserved by three coordinated operations: - -**1. Existing layers — column append.** -When merging source genome G into an existing index with `n_existing_genomes` genomes, one column is appended to every existing layer via `append_genome_column`. Slots that contain a kmer present in source G receive its count or `true`; all other slots receive 0 or `false`. After this step, every pre-existing layer has `n_existing_genomes + 1` columns. - -**2. New layers — absent columns prepended.** -If source G introduces kmers not found in any existing layer, a new layer is created for those kmers. Before appending source G's own column, `n_existing_genomes` absent columns (all-zero or all-false) are prepended — one per genome already in the index. This ensures the new layer starts at the same column count as every other layer in the partition immediately after creation. - -**3. First merge, Presence mode — `init_presence_matrix`.** -The initial single-genome index has no `presence/` directory (presence is implicit: every kmer in the index is present in genome 0). On the first merge, before appending any column for source 1, `Layer<()>::init_presence_matrix` creates `presence/col_000000.pbiv` set entirely to `true` for each existing layer. This retroactively materialises genome 0's presence column, bringing the column count from 0 to 1 so that the subsequent append for source 1 raises it to 2. - -### Why the invariant is required - -The `LayeredStore` aggregation traits (`col_weights`, `partial_bray`, `partial_jaccard`, etc.) sum contributions across all `(partition, layer)` pairs without any shape check. If one layer had fewer columns than others, its contribution would silently produce a malformed result — wrong column sums, wrong distance matrices, and incorrect genome-level statistics. - -The invariant is the precondition that makes the progressive aggregation principle correct: each level can blindly sum matrices from the level below because all matrices have the same shape. +`estimate` resolves approximate-evidence parameters (`z`, `b`, target FP rate) and prints the resulting effective kmer size and per-kmer / per-z-window false-positive rates without touching any index. Used to calibrate `Approx { b, z }` before building or reindexing. diff --git a/docmd/architecture/index_architecture.refs.md b/docmd/architecture/index_architecture.refs.md new file mode 100644 index 0000000..ff9e3c0 --- /dev/null +++ b/docmd/architecture/index_architecture.refs.md @@ -0,0 +1,22 @@ + +# Coverage: architecture/index_architecture.md + +## Code couvert + +- `obilayeredmap/src/layer.rs` — Layer, trait LayerData, modes () / PersistentCompactIntMatrix / PersistentBitMatrix +- `obilayeredmap/src/mphf_layer.rs` — MphfLayer, EvidenceKind (Exact / Approx), LayerEvidence enum +- `obilayeredmap/src/map.rs` — LayeredMap +- `obilayeredmap/src/meta.rs` — LayerMeta, PartitionMeta +- `obikindex/src/meta.rs` — IndexConfig (kmer_size, n_bits, with_counts, evidence, block_bits), IndexMeta +- `obikindex/src/index.rs` — KmerIndex, build_layers +- `obicompactvec/src/` — PersistentCompactIntMatrix, PersistentBitMatrix (DataStore implementations) + +## Notes + +FORT RISQUE DE DÉRIVE. Nombreux changements récents : +- Ajout de `EvidenceKind` (Exact / Approx { b, z }) dans `IndexConfig` et `LayerMeta` +- Ajout de `block_bits` dans `IndexConfig` +- `LayerEvidence` enum dans `mphf_layer.rs` remplace l'ancienne approche monolithique +- Distinction `open()` vs `open_sequential()` dans `UnitigFileReader` +- Commandes `reindex` et `estimate` ajoutées +Vérifier que la hiérarchie à 3 niveaux décrite est toujours exacte et que les nouveaux paramètres sont documentés. diff --git a/docmd/architecture/query.refs.md b/docmd/architecture/query.refs.md new file mode 100644 index 0000000..2dcbc50 --- /dev/null +++ b/docmd/architecture/query.refs.md @@ -0,0 +1,16 @@ + +# Coverage: architecture/query.md + +## Code couvert + +- `obikmer/src/cmd/query.rs` — commande query, format de sortie +- `obikpartitionner/src/query_layer.rs` — routage de la requête à travers les partitions +- `obiread/src/lib.rs` — lecture des séquences d'entrée pour la requête + +## Notes + +RISQUE DE DÉRIVE. Vérifier : +- La commande `unitig` a été modifiée pour utiliser `open_sequential()` — vérifier si query est concerné +- `find_exact` / `find_approx` / `find` générique ont été ajoutés dans `MphfLayer` — le chemin de requête a changé +- Si l'index est approximatif (Approx), la requête peut produire des faux positifs : la doc le mentionne-t-elle ? +- Format de sortie CSV (`obikindex/src/csv.rs` ou équivalent) à vérifier diff --git a/docmd/architecture/sequences/invariant.refs.md b/docmd/architecture/sequences/invariant.refs.md new file mode 100644 index 0000000..52130f3 --- /dev/null +++ b/docmd/architecture/sequences/invariant.refs.md @@ -0,0 +1,12 @@ + +# Coverage: architecture/sequences/invariant.md + +## Code couvert + +- `obikseq/src/sequence.rs` — invariants de représentation des séquences (ACGT, longueur max) +- `obikseq/src/unitig.rs` — type Unitig, contrainte MAX_KMERS_PER_CHUNK (255 kmers par chunk) + +## Notes + +Document court et stable. Vérifier que la limite de 256 nucléotides (ou 255 kmers) par chunk +est toujours la même dans `obiskio::MAX_KMERS_PER_CHUNK`. diff --git a/docmd/implementation/chunkreader.refs.md b/docmd/implementation/chunkreader.refs.md new file mode 100644 index 0000000..99be9cf --- /dev/null +++ b/docmd/implementation/chunkreader.refs.md @@ -0,0 +1,12 @@ + +# Coverage: implementation/chunkreader.md + +## Code couvert + +- `obiread/src/chunk.rs` — SeqChunkIter, détection de frontières FASTA/FASTQ, state machines +- `obikrope/src/lib.rs` — type Rope (Vec), opérations zero-copy + +## Notes + +Document stable (la stratégie de chunking rope ne devrait pas avoir changé). +Vérifier que le split FASTA/FASTQ reste correct si de nouveaux formats ont été ajoutés. diff --git a/docmd/implementation/evidence_elimination.md b/docmd/implementation/evidence_elimination.md new file mode 100644 index 0000000..a7f3a18 --- /dev/null +++ b/docmd/implementation/evidence_elimination.md @@ -0,0 +1,178 @@ +# Approximate evidence: fingerprint-based index + +## Motivation + +`evidence.bin` maps each MPHF slot to the position of the k-mer that owns it, +enabling zero-FP verification. On the bacterial BCT dataset (2048 partitions, +k=31, ~33 M k-mers/partition) it accounts for 66 % of the lookup-layer footprint: + +| file | size/partition | fraction | +|---|---|---| +| evidence.bin | 132 MB | 66 % | +| unitigs.bin | 58 MB | 29 % | +| mphf.bin | 10 MB | 5 % | + +`evidence.bin` is a bijection from MPHF-space to unitig-position-space and +costs at minimum ⌈log₂ N⌉ bits per slot — an information-theoretic floor with +only ~22 % packing headroom. Compression is not a path to elimination. + +The approximate index replaces `evidence.bin` + `unitigs.bin.idx` with a +`fingerprint.bin` file. The MPHF and `unitigs.bin` are kept unchanged. Set +operations still require an exact index; the approximate index targets query +workloads that can tolerate a bounded false-positive rate. + +--- + +## The Findere model + +A B-bit fingerprint stored per MPHF slot provides the discrimination that +`evidence.bin` would otherwise provide through full k-mer reconstruction. + +For a foreign k-mer query, the MPHF maps it to some slot `s`. The fingerprint +stored at `s` belongs to the legitimate k-mer at that slot. The FP event is: + +``` +P(FP per k-mer) = 1 / 2^b +``` + +The Findere trick raises the effective window to z consecutive k-mers. A query +succeeds only when all z fingerprint checks pass, reducing the per-window FP rate: + +``` +P(FP per z-window) = 1 / 2^(b·z) +``` + +The effective indexed k-mer length is `k − z + 1`: a query for a (k+z−1)-mer +decomposes into z overlapping k-mers, all of which must match. + +Parameters b and z are stored in `layer_meta.json` (`EvidenceKind::Approx { b, z }`). + +--- + +## `FingerprintVec` on disk + +`fingerprint.bin` layout: + +``` +magic: b"FPVF" (4 bytes) +b: u8 (bits per slot, 1..=64) +padding: [0u8; 3] +n: u64 LE (number of slots) +data: packed bits, ceil(n·b/8) bytes, Lsb0 order +``` + +`FingerprintVec` is memory-mapped. The match check against a query k-mer: + +```rust +fn matches(&self, slot: usize, fingerprint: u64) -> bool { + self.get(slot) == (fingerprint & self.mask) +} +``` + +`build_approx_evidence` iterates `unitigs.bin` sequentially, writes +`kmer.seq_hash()` into the slot assigned by the MPHF, then saves `fingerprint.bin` +and `layer_meta.json`. No `.idx` file is produced; random access into +`unitigs.bin` is not needed. + +At build time, `find_approx` in `MphfLayer`: + +```rust +let slot = self.mphf.index(&kmer.raw()); +if fingerprint.matches(slot, kmer.seq_hash()) { Some(slot) } else { None } +``` + +--- + +## `EvidenceKind` and metadata + +`layer_meta.json` records which evidence bundle is present: + +```rust +pub enum EvidenceKind { + Exact, + Approx { b: u8, z: u8 }, +} +``` + +`MphfLayer::open` reads this tag and dispatches `find` to `find_exact` or +`find_approx` transparently. `find_exact` panics on an approximate layer; +`find_approx` panics on an exact layer — mode mixing is a programming error. + +--- + +## Parameter resolution (`resolve_approx_params`) + +The identity `b·z = ⌈−log₂(fp)⌉` lets any two of (b, z, fp) derive the third. +`resolve_approx_params` implements a 2-of-3 rule with conservative ceiling +rounding: + +| given | derived | +|---|---| +| b, z | fp = 1/2^(b·z) | +| z, fp | b = ⌈−log₂(fp) / z⌉ | +| b, fp | z = ⌈−log₂(fp) / b⌉ | +| z only | b = 8 (default), fp derived | +| b only | z = 1 (default), fp derived | +| fp only | b = 8 (default), z derived | +| none | b = 8, z = 1, fp = 1/256 | + +When all three are given, b and z are authoritative and fp is recomputed. + +--- + +## CLI flags + +Both `index` and `reindex` accept the same flags: + +| flag | type | meaning | +|---|---|---| +| `--approx` | bool | enable fingerprint evidence | +| `--evidence-bits` (`b`) | u8 | fingerprint bits per slot | +| `-z` | u8 | Findere z parameter | +| `--fp` | f64 | target FP rate per z-window | +| `--block-size` | usize | unitig block size for exact `.idx`; ignored in approx mode | + +`--approx` must be set explicitly; the other three flags are optional and +resolved by the 2-of-3 rule. Omitting all three produces b=8, z=1. + +--- + +## `reindex` command + +`reindex` converts an existing index between exact and approximate evidence +in-place across all partitions and layers, running partitions in parallel via +Rayon. + +Conversion to approximate (`--approx`): + +- Builds `fingerprint.bin` from `unitigs.bin` + `mphf.bin`. +- Removes `evidence.bin` and `unitigs.bin.idx`. +- Updates `layer_meta.json` with `EvidenceKind::Approx { b, z }`. + +Conversion to exact (default, no `--approx`): + +- Builds `evidence.bin` + `unitigs.bin.idx` from `unitigs.bin` + `mphf.bin`. +- Removes `fingerprint.bin`. +- Updates `layer_meta.json` with `EvidenceKind::Exact`. + +The root `index.meta` is updated with the new evidence kind on success. +`mphf.bin` and `unitigs.bin` are never modified. + +--- + +## `estimate` command + +`estimate` is a dry-run that resolves and prints (b, z, fp) without touching +any index. It accepts the same `--evidence-bits`, `-z`, and `--fp` flags and +additionally accepts `-k` to display the effective indexed k-mer length: + +``` +k (query): 31 +k (indexed): 31 +z: 1 +evidence bits (b): 8 +FP per k-mer: 3.906e-3 (1/2^8) +FP per z-window: 3.906e-3 (1/2^8) +``` + +Useful for choosing parameters before committing to an index build. diff --git a/docmd/implementation/evidence_elimination.refs.md b/docmd/implementation/evidence_elimination.refs.md new file mode 100644 index 0000000..0fedcb3 --- /dev/null +++ b/docmd/implementation/evidence_elimination.refs.md @@ -0,0 +1,22 @@ + +# Coverage: implementation/evidence_elimination.md + +## Code couvert + +- `obilayeredmap/src/fingerprint.rs` — FingerprintVec, FingerprintVecWriter, stockage b bits/slot, matches() +- `obilayeredmap/src/mphf_layer.rs` — build_approx_evidence(dir, b, z), find_approx() +- `obilayeredmap/src/meta.rs` — EvidenceKind::Approx { b, z }, LayerMeta +- `obikindex/src/reindex.rs` — KmerIndex::reindex(), conversion exact↔approx en place +- `obikmer/src/cmd/reindex.rs` — CLI reindex, options --approx, -z, --evidence-bits, --fp, --block-size +- `obikmer/src/cmd/index.rs` — resolve_approx_params(), options --approx, -z, --evidence-bits, --fp +- `obikmer/src/cmd/estimate.rs` — commande estimate (dry-run des paramètres) + +## Notes + +Ce document était à l'origine une discussion de design (4 approches). L'implémentation +a maintenant convergé vers l'approche fingerprint (Findere-style). +FORT RISQUE DE DÉRIVE — le contenu est probablement un mélange de design et d'implémentation : +- Le modèle FP = 1/2^(b·z) et les règles de résolution (2-of-3 parmi b, z, fp) sont implémentés +- La commande `reindex` permet la conversion a posteriori exact↔approx +- La commande `estimate` fait le dry-run des paramètres +Cette page doit être réécrite pour documenter l'implémentation Findere réelle plutôt que les alternatives abandonnées. diff --git a/docmd/implementation/kmer.refs.md b/docmd/implementation/kmer.refs.md new file mode 100644 index 0000000..747d5e3 --- /dev/null +++ b/docmd/implementation/kmer.refs.md @@ -0,0 +1,13 @@ + +# Coverage: implementation/kmer.md + +## Code couvert + +- `obikseq/src/kmer.rs` — layout mémoire (repr(transparent) u64), encodage/décodage, revcomp, forme canonique +- `obikseq/src/params.rs` — k global (set_k / k()) + +## Notes + +Document d'implémentation stable. L'algorithme de revcomp bit-à-bit est décrit — +vérifier qu'il correspond à `revcomp_raw` dans `obiskio/src/unitig_index.rs` (copie locale) +et à l'implémentation dans `obikseq/src/kmer.rs`. diff --git a/docmd/implementation/merge.refs.md b/docmd/implementation/merge.refs.md new file mode 100644 index 0000000..a9ffa07 --- /dev/null +++ b/docmd/implementation/merge.refs.md @@ -0,0 +1,19 @@ + +# Coverage: implementation/merge.md + +## Code couvert + +- `obikindex/src/merge.rs` — `KmerIndex::merge()`, validation de compatibilité d'évidence, `validate_evidence_compat()` +- `obikpartitionner/src/merge_layer.rs` — `merge_partition()`, construction de la nouvelle layer, paramètre `block_bits` +- `obikpartitionner/src/rebuild_layer.rs` — `rebuild_partition()`, paramètre `block_bits` +- `obilayeredmap/src/layer.rs` — `Layer::append_genome_column()` (PersistentCompactIntMatrix et PersistentBitMatrix) +- `obicompactvec/src/intmatrix.rs` — `append_column` pour PersistentCompactIntMatrix +- `obicompactvec/src/bitmatrix.rs` — `append_column` pour PersistentBitMatrix + +## Notes + +FORT RISQUE DE DÉRIVE. Changements récents : +- Ajout de la validation de compatibilité d'évidence : merge exact+approx → erreur (OKIError::IncompatibleEvidence) +- `merge_partition` reçoit maintenant `block_bits: u8` +- La commande `reindex` a été ajoutée comme outil de conversion exact↔approx avant merge +Vérifier que la doc décrit la politique de merge mixed-evidence et le recours à `reindex`. diff --git a/docmd/implementation/mphf.refs.md b/docmd/implementation/mphf.refs.md new file mode 100644 index 0000000..fd56c80 --- /dev/null +++ b/docmd/implementation/mphf.refs.md @@ -0,0 +1,16 @@ + +# Coverage: implementation/mphf.md + +## Code couvert + +- `obilayeredmap/src/mphf_layer.rs` — type Mphf (PtrHash + CubicEps + CachelineEfVec + Xx64), construction en 2 passes, `build()`, `build_exact_evidence()`, `build_approx_evidence()`, `build_evidence()` +- `obikpartitionner/src/index_layer.rs` — `build_index_layer()` avec passage de `block_bits` + +## Notes + +FORT RISQUE DE DÉRIVE. Changements récents : +- `build_exact_evidence(dir, block_bits)` — `block_bits` maintenant paramétrisé (défaut 0) +- `build_approx_evidence(dir, b, z)` — nouvelle fonction pour l'évidence fingerprint +- `build_evidence(dir, kind, block_bits)` — dispatch selon EvidenceKind +- Construction en 2 phases : pass 1 (Rayon parallèle) + pass 2 (callback `fill_slot`) +Vérifier que la doc décrit correctement les deux nouvelles routes d'évidence et le paramètre `block_bits`. diff --git a/docmd/implementation/obilayeredmap.md b/docmd/implementation/obilayeredmap.md index d5ad7d2..da2f42e 100644 --- a/docmd/implementation/obilayeredmap.md +++ b/docmd/implementation/obilayeredmap.md @@ -2,7 +2,7 @@ ## Purpose -`obilayeredmap` implements a persistent, incrementally extensible kmer index. The index is organised in three levels: **index root → partition → layer**. Each layer covers a disjoint kmer set and wraps a `ptr_hash` MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers. +`obilayeredmap` implements a persistent, incrementally extensible kmer index. Each layer covers a disjoint kmer set and wraps a `ptr_hash` MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers. --- @@ -20,42 +20,81 @@ Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` come from the `obico --- +## Evidence kinds + +Each layer carries one of two evidence bundles, recorded in `layer_meta.json` at build time: + +```rust +pub enum EvidenceKind { + Exact, + Approx { b: u8, z: u8 }, +} +``` + +`EvidenceKind` is stored in `LayerMeta` (one per layer directory). `open()` reads it to decide which evidence files to load. + +- **Exact**: writes `evidence.bin` + `unitigs.bin.idx`. Zero false positives. Requires random-access `.idx` at query time. +- **Approx**: writes `fingerprint.bin` only. False-positive rate per kmer query = 1/2^b. `z` is the Findere consecutive-kmer parameter: `z` consecutive kmers must all match, reducing the effective FP rate per read to approximately W / 2^(b·z) where W = L − k − z + 2 is the number of windows in a read of length L. No `.idx` written or required. + +--- + ## MphfLayer — autonomous kmer → slot mapping -`MphfLayer` encapsulates the MPHF + evidence + unitig spine for one layer. It is independent of any payload data. +`MphfLayer` encapsulates the MPHF and evidence store for one layer. It is independent of any payload. ```rust pub struct MphfLayer { - mphf: Mphf, - evidence: Evidence, - unitigs: UnitigFileReader, - n: usize, // number of indexed kmers = number of MPHF slots + mphf: Mphf, + ev: LayerEvidence, // loaded at open() time + n: usize, } ``` -Public API: +`LayerEvidence` is an internal enum, not public: ```rust -impl MphfLayer { - pub fn open(dir: &Path) -> OLMResult - pub fn find(&self, kmer: CanonicalKmer) -> Option // Some(slot) or None - pub fn n(&self) -> usize - pub fn unitig_writer(dir: &Path) -> OLMResult - pub(crate) fn build( - dir: &Path, - fill_slot: &mut impl FnMut(usize, CanonicalKmer) -> OLMResult<()>, - ) -> OLMResult +enum LayerEvidence { + Exact { evidence: Evidence, unitigs: UnitigFileReader }, + Approx { fingerprint: FingerprintVec }, } ``` -`find` returns `Some(slot)` only after verifying via evidence that the kmer is actually indexed. It returns `None` for absent keys (ptr_hash maps any input to a valid slot; evidence verification is the only correct-membership test). +### Query API -`build` runs two sequential passes over `unitigs.bin`: +Three public query methods, all returning `Option` (slot index): -1. **Pass 1**: iterate all canonical kmers in parallel via rayon, construct and store `mphf.bin`. `new_from_par_iter` avoids materialising a full key `Vec`. -2. **Pass 2**: iterate again sequentially, fill `evidence.bin`, call `fill_slot(slot, kmer)` once per kmer for payload population. A compact `n/8`-byte seen-bitset verifies MPHF injectivity inline. +```rust +pub fn find(&self, kmer: CanonicalKmer) -> Option +pub fn find_exact(&self, kmer: CanonicalKmer) -> Option +pub fn find_approx(&self, kmer: CanonicalKmer) -> Option +``` -For empty layers (n = 0), `build` returns `Ok(0)` immediately after creating empty `mphf.bin` and `evidence.bin`. +- `find` dispatches transparently to `find_exact` or `find_approx` based on the evidence variant loaded at `open()`. +- `find_exact` panics if the layer holds approximate evidence; zero false positives. +- `find_approx` panics if the layer holds exact evidence; FP rate 1/2^b per kmer. + +`open()` requires `unitigs.bin.idx` (random access into unitigs). `open_sequential()` on `UnitigFileReader` does not require the `.idx` and is used during build passes. + +### Build surface + +```rust +// Full MPHF + exact evidence build (two-pass, parallel) +pub(crate) fn build(dir, block_bits, fill_slot) -> OLMResult + +// Evidence-only builds (MPHF already present in dir) +pub fn build_exact_evidence(dir, block_bits) -> OLMResult +pub fn build_approx_evidence(dir, b, z) -> OLMResult +pub fn build_evidence(dir, kind, block_bits) -> OLMResult // dispatch +``` + +`MphfLayer::build` runs two sequential passes over `unitigs.bin`: + +1. **Pass 1** (parallel via rayon): iterate all canonical kmers, construct and store `mphf.bin`. `new_from_par_iter` avoids materialising a full key `Vec`. +2. **Pass 2** (sequential): iterate again, fill `evidence.bin`, call `fill_slot(slot, kmer)` once per kmer for payload population. A compact `n/8`-byte seen-bitset verifies MPHF injectivity inline. + +`build` always produces exact evidence. For approximate evidence, use `build_approx_evidence` after MPHF construction. + +For empty layers (n = 0), all build variants return `Ok(0)` immediately after creating empty output files. --- @@ -81,7 +120,7 @@ pub struct Hit { } ``` -`LayerData` covers the **read path only** (`open` + `read`). Build signatures differ between modes and are not in the trait. +`LayerData` covers the **read path only** (`open` + `read`). Build signatures differ between modes and are not part of the trait. | Type | `Item` | Description | |---|---|---| @@ -89,31 +128,118 @@ pub struct Hit { | `PersistentCompactIntMatrix` | `Box<[u32]>` | mode 2 — count matrix (one u32 per column per slot) | | `PersistentBitMatrix` | `Box<[bool]>` | mode 3 — presence matrix (one bit per genome per slot) | -**Build signatures:** +### Build signatures ```rust // mode 1 impl Layer<()> { - pub fn build(out_dir: &Path) -> OLMResult + pub fn build(out_dir: &Path, block_bits: u8) -> OLMResult } // mode 2 impl Layer { - pub fn build(out_dir: &Path, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult - pub fn build_from_map(out_dir: &Path, counts: &HashMap) -> OLMResult + pub fn build(out_dir: &Path, block_bits: u8, + count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult + pub fn build_from_map(out_dir: &Path, block_bits: u8, + counts: &HashMap) -> OLMResult } // mode 3 impl Layer { - pub fn build_presence( - out_dir: &Path, - n_genomes: usize, - present_in: impl Fn(CanonicalKmer, usize) -> bool, - ) -> OLMResult + pub fn build_presence(out_dir: &Path, block_bits: u8, + n_genomes: usize, + present_in: impl Fn(CanonicalKmer, usize) -> bool) -> OLMResult } ``` -All build impls delegate MPHF + evidence construction to `MphfLayer::build` via a mode-specific `fill_slot` callback. Mode 2 pre-reads `n_kmers` from `unitigs.bin` to size the `PersistentCompactIntMatrixBuilder` before calling `MphfLayer::build`. Mode 3 does the same for `PersistentBitMatrixBuilder`. +All build impls delegate MPHF + evidence construction to `MphfLayer::build` via a mode-specific `fill_slot` callback. Modes 2 and 3 pre-read `n_kmers` from `unitigs.bin` via `UnitigFileReader::open_sequential` to size the matrix builder before calling `MphfLayer::build`. + +### Evidence build helpers on Layer + +```rust +impl Layer { + pub fn build_exact_evidence(layer_dir: &Path, block_bits: u8) -> OLMResult + pub fn build_approx_evidence(layer_dir: &Path, b: u8, z: u8) -> OLMResult + pub fn build_evidence(layer_dir: &Path, kind: &EvidenceKind, block_bits: u8) -> OLMResult +} +``` + +These delegate directly to the corresponding `MphfLayer` methods and are provided so call sites can remain typed at the `Layer` level. + +--- + +## FingerprintVec and FingerprintVecWriter + +Approximate evidence is stored as a packed b-bit array, one fingerprint per MPHF slot. + +``` +fingerprint.bin format: + magic: b"FPVF" (4 bytes) + b: u8 (bits per fingerprint, 1..=64) + padding: [0u8; 3] + n: u64 LE (number of slots) + data: packed bits, ceil(n*b/8) bytes, Lsb0 order +``` + +```rust +impl FingerprintVec { + pub fn open(path: &Path) -> OLMResult + pub fn get(&self, slot: usize) -> u64 + pub fn matches(&self, slot: usize, fingerprint: u64) -> bool + pub fn n(&self) -> usize + pub fn b(&self) -> u8 +} +``` + +`matches(slot, hash)` extracts the b-bit fingerprint stored at `slot` and compares it to the low b bits of `hash`. It is the core operation of `find_approx`. + +--- + +## LayeredMap\ — collection of layers + +`LayeredMap` wraps `Vec>` for a single partition directory. + +```rust +pub struct LayeredMap { + root: PathBuf, + meta: PartitionMeta, + layers: Vec>, +} +``` + +`PartitionMeta` (`meta.json` at the partition root) stores `n_layers`. + +### Common methods + +```rust +pub fn open(root: &Path) -> OLMResult +pub fn create(root: &Path) -> OLMResult +pub fn n_layers(&self) -> usize +pub fn layer(&self, i: usize) -> &Layer +pub fn query(&self, kmer: CanonicalKmer) -> Option<(usize, Hit)> +pub fn next_layer_writer(&self) -> OLMResult +``` + +`query` probes layers in order and returns `(layer_index, Hit)` on the first match. Expected probe depth: 1 for kmers in layer 0. + +### push_layer + +`push_layer` builds the next layer from a `unitigs.bin` already written via `next_layer_writer`, using `DEFAULT_BLOCK_BITS`: + +```rust +// mode 1 +impl LayeredMap<()> { + pub fn push_layer(&mut self) -> OLMResult +} + +// mode 2 +impl LayeredMap { + pub fn push_layer(&mut self, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult + pub fn push_layer_from_map(&mut self, counts: &HashMap) -> OLMResult +} +``` + +Mode 3 (`PersistentBitMatrix`) has no `push_layer` on `LayeredMap`; callers build directly via `Layer::build_presence`. --- @@ -131,14 +257,6 @@ impl BitPartials for LayeredStore { … } // element-wi Because blanket impls compose, `LayeredStore>` automatically inherits all three traits when `S` does — providing the partitioned level without a separate type. -**Aggregation hierarchy:** - -``` -PersistentCompactIntMatrix implements CountPartials -LayeredStore via blanket impl (one partition) -LayeredStore> via blanket impl (partitioned index) -``` - **Leaf implementors** (in `obicompactvec`): | Type | Traits | @@ -146,8 +264,6 @@ LayeredStore> via blanket impl (partitione | `PersistentCompactIntMatrix` | `ColumnWeights` (via `sum()`) + `CountPartials` | | `PersistentBitMatrix` | `ColumnWeights` (via `count_ones()`) + `BitPartials` | -`PersistentCompactIntVec` and `PersistentBitVec` do not implement these traits — they are single-column primitives, not matrix-level aggregators. - See [Kmer index architecture](../architecture/index_architecture.md) for the full trait API and the two-pass normalised-metric pattern. --- @@ -155,34 +271,30 @@ See [Kmer index architecture](../architecture/index_architecture.md) for the ful ## On-disk structure ``` -index_root/ ← LayeredMap (collection) - meta.json - part_00000/ ← Partition - layer_0/ ← Layer - mphf.bin — ptr_hash MPHF (epserde format) - unitigs.bin — packed 2-bit nucleotide sequences - unitigs.bin.idx — UIDX index: n_unitigs, n_kmers, seqls[], packed_offsets[] - evidence.bin — n × u32, each = (chunk_id: 25 bits | rank: 7 bits), LE - counts/ [mode 2] PersistentCompactIntMatrix - meta.json {"n": N, "n_cols": 1} - col_000000.pciv - presence/ [mode 3] PersistentBitMatrix - meta.json {"n": N, "n_cols": G} - col_000000.pbiv - … - layer_1/ - … - part_00001/ +partition_root/ ← LayeredMap (one partition) + meta.json — {"n_layers": N} + layer_0/ ← Layer + layer_meta.json — {"type": "exact"} or {"type": "approx", "b": B, "z": Z} + mphf.bin — ptr_hash MPHF (epserde format) + unitigs.bin — packed 2-bit nucleotide sequences + unitigs.bin.idx — UIDX index (exact evidence only) + evidence.bin — [u32; n], LE (exact evidence only) + fingerprint.bin — packed b-bit array (approx evidence only) + counts/ [mode 2] PersistentCompactIntMatrix + meta.json + col_000000.pciv + presence/ [mode 3] PersistentBitMatrix + meta.json + col_000000.pbiv … + layer_1/ … ``` -**Partition** (`part_XXXXX/`): all kmers whose canonical minimiser hashes to this bucket. Partitions are independent and can be processed in parallel. - -**Layer** (`layer_N/`): one `MphfLayer` plus optional payload. Layer 0 covers dataset A; layer 1 covers kmers in B absent from A; etc. Layers within a partition are always disjoint. +`unitigs.bin.idx` is required by `open()` (random access). `open_sequential()` on `UnitigFileReader` omits it and is used during build passes and approx-evidence construction. --- -## Evidence encoding +## Evidence encoding (exact) `evidence.bin` is a flat `[u32; n]` array with no header. Each u32 encodes one slot: @@ -191,9 +303,9 @@ bits [31:7] = chunk_id (25 bits) — index of the unitig chunk bits [6:0] = rank (7 bits) — kmer index within the chunk (0-based) ``` -Decoding: `chunk_id = raw >> 7`, `rank = raw & 0x7F`. Reconstructing the kmer: read k nucleotides at position `rank` within unitig `chunk_id`. +`chunk_id = raw >> 7`, `rank = raw & 0x7F`. Reconstructing the kmer: read k nucleotides at position `rank` within unitig `chunk_id` (requires `unitigs.bin.idx` for random access). -For k=31, m=11, the observed maximum is ~46 kmers per chunk — well within the 127-kmer u7 capacity. The structural maximum from superkmer construction is k − m + 1 = 21 kmers/unitig; longer unitigs arise from paths spanning more than one superkmer. +For k=31, m=11, the observed maximum is ~46 kmers per chunk — well within the 127-kmer u7 capacity. --- @@ -203,7 +315,7 @@ For k=31, m=11, the observed maximum is ~46 kmers per chunk — well within the type Mphf = PtrHash< u64, // key type: canonical kmer raw encoding CubicEps, // bucket fn: 2.4 bits/key, λ=3.5, α=0.99 - CachelineEfVec>, // remap: 11.6 bits/entry (Elias-Fano) + CachelineEfVec>, // remap: Elias-Fano Xx64, // hasher: XXH3-64 with seed Vec, // pilots >; @@ -211,21 +323,41 @@ type Mphf = PtrHash< `Xx64` is chosen over `FxHash` because canonical kmer raw values are left-aligned u64 with structural zeros in the low bits (42 zeros for k=11, 2 zeros for k=31), which single-multiply hashes distribute poorly. -`CubicEps` with `PtrHashParams::::default()` (λ=3.5) is a balanced tradeoff: 2× slower construction than `Linear/λ=3.0`, 20% less space. +`CubicEps` with `PtrHashParams::::default()` (λ=3.5): 2× slower construction than `Linear/λ=3.0`, ~20% less space. --- -## Query path +## Column append and merge support + +These methods extend existing layers with new genome columns without touching the MPHF. + +### Layer-level genome column append ```rust -pub fn query(&self, kmer: CanonicalKmer) -> Option> { - self.mphf.find(kmer).map(|slot| Hit { slot, data: self.data.read(slot) }) +impl Layer { + pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> bool) -> OLMResult<()> +} + +impl Layer { + pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> u32) -> OLMResult<()> } ``` -`MphfLayer::find` probes the MPHF, decodes evidence, and verifies the kmer — returning `Some(slot)` on match, `None` otherwise. `data.read(slot)` is called only on a confirmed hit. +Both delegate to the corresponding `PersistentBitMatrix::append_column` / `PersistentCompactIntMatrix::append_column`. They write a new column file (`col_NNNNNN.pbiv` / `col_NNNNNN.pciv`) and update `meta.json` to increment `n_cols`. `value_of` is called once per slot (0..n). -In `LayeredMap`, layers are probed in order; the first match wins. Expected probe depth: 1 for kmers in layer 0. +### Presence matrix initialisation + +```rust +impl Layer<()> { + pub fn init_presence_matrix(layer_dir: &Path, n_kmers: usize) -> OLMResult<()> +} +``` + +Called on the first merge of a Presence-mode index. Creates `presence/` with `meta.json {"n": n_kmers, "n_cols": 1}` and `col_000000.pbiv` set entirely to `true`. This retroactively records genome 0 (the original source) as present in every slot, satisfying the column-count invariant before any new-source column is appended. + +### Why the MPHF is never rebuilt + +The MPHF, evidence, and unitigs are built once from the kmer set of a layer and are immutable for the lifetime of that layer. Adding a genome column does not change the kmer set — it only appends a new data column indexed by the same slot numbers. The only disk writes are one new `.pciv`/`.pbiv` file and a single `meta.json` update. --- @@ -235,9 +367,9 @@ When adding dataset B to an existing index: 1. For each partition, probe existing layers for kmers of B routed to that partition. 2. Collect kmers absent from all layers → `B \ index`. -3. Write `B \ index` to a new `unitigs.bin` via `MphfLayer::unitig_writer`. -4. Call `Layer::build` on the new directory. -5. Update `meta.json`. +3. Write `B \ index` to a new `unitigs.bin` via `next_layer_writer()`. +4. Call `Layer::build` (or `build_presence`) on the new layer directory. +5. Call `push_layer` (or `append_layer`) to register the new layer in `meta.json`. Each partition's new layer is built independently; the operation is fully parallel across partitions. @@ -250,62 +382,9 @@ Each partition's new layer is built independently; the operation is fully parall | `ptr_hash 1.1` | MPHF per layer | | `cacheline-ef 1.1` | compact remap inside ptr_hash | | `epserde 0.8` | zero-copy MPHF serialisation | -| `memmap2 0.9` | mmap of evidence and payload files | -| `obiskio` | unitig file writer/reader | +| `memmap2 0.9` | mmap of evidence and fingerprint files | +| `bitvec` | packed b-bit fingerprint storage | +| `obiskio` | unitig file writer/reader + `.idx` build | | `obicompactvec` | payload types + aggregation traits | | `rayon 1` | parallel MPHF construction pass | -| `ndarray 0.16` | aggregation output arrays | - ---- - -## Column append and merge support - -These methods extend existing layers with new genome columns without touching the MPHF. They are the building blocks of the `merge` command. - -### Matrix column append - -```rust -impl PersistentCompactIntMatrix { - pub fn append_column(dir: &Path, value_of: impl Fn(usize) -> u32) -> OLMResult<()> -} - -impl PersistentBitMatrix { - pub fn append_column(dir: &Path, value_of: impl Fn(usize) -> bool) -> OLMResult<()> -} -``` - -Both methods write a new column file (`col_NNNNNN.pciv` / `col_NNNNNN.pbiv`) and update `meta.json` to increment `n_cols`. The `value_of` closure is called once per slot (indexed 0..n) to populate the column. The matrix `n` (row count) is read from the existing `meta.json` and must not change. - -### Presence matrix initialisation - -```rust -impl Layer<()> { - pub fn init_presence_matrix(layer_dir: &Path, n_kmers: usize) -> OLMResult<()> -} -``` - -Called on the first merge of a Presence-mode index. Creates the `presence/` subdirectory with `meta.json {"n": n_kmers, "n_cols": 1}` and `col_000000.pbiv` set entirely to `true`. This retroactively records that genome 0 (the original source) is present in every slot of this layer, satisfying the column count invariant before any new-source column is appended. - -### Layer-level genome column append - -```rust -impl Layer { - pub fn append_genome_column( - layer_dir: &Path, - value_of: impl Fn(usize) -> bool, - ) -> OLMResult<()> -} - -impl Layer { - pub fn append_genome_column( - layer_dir: &Path, - value_of: impl Fn(usize) -> u32, - ) -> OLMResult<()> -} -``` - -These delegate directly to the corresponding `PersistentBitMatrix::append_column` / `PersistentCompactIntMatrix::append_column`. They are typed at the `Layer` level to make call sites mode-aware without exposing the inner matrix path construction. - -### Why the MPHF is never rebuilt - -The MPHF (`mphf.bin`, `evidence.bin`, `unitigs.bin`) is built once from the kmer set of a layer and is immutable for the lifetime of that layer. Adding a genome column does not change the kmer set — it only adds a new data column indexed by the same slot numbers. Rebuilding the MPHF would require re-running the full construction pipeline (two sequential passes over unitigs, parallel ptr_hash construction) and would invalidate any open memory maps. Column append avoids all of this: the only disk writes are one new `.pciv`/`.pbiv` file and a single `meta.json` update. Kmers absent from a given layer are represented as zero (count) or false (presence) values in the new column — no structural change to the layer is required. +| `serde / serde_json` | `LayerMeta` + `PartitionMeta` serialisation | diff --git a/docmd/implementation/obilayeredmap.refs.md b/docmd/implementation/obilayeredmap.refs.md new file mode 100644 index 0000000..5340ab3 --- /dev/null +++ b/docmd/implementation/obilayeredmap.refs.md @@ -0,0 +1,22 @@ + +# Coverage: implementation/obilayeredmap.md + +## Code couvert + +- `obilayeredmap/src/mphf_layer.rs` — MphfLayer, LayerEvidence enum (Exact/Approx), find(), find_exact(), find_approx() +- `obilayeredmap/src/layer.rs` — Layer, trait LayerData, modes () / PersistentCompactIntMatrix / PersistentBitMatrix, build(), build_evidence(), append_genome_column() +- `obilayeredmap/src/map.rs` — LayeredMap, push_layer(), query() +- `obilayeredmap/src/evidence.rs` — Evidence, EvidenceWriter, encodage chunk_id:rank +- `obilayeredmap/src/fingerprint.rs` — FingerprintVec, FingerprintVecWriter, matches() +- `obilayeredmap/src/meta.rs` — LayerMeta, EvidenceKind (Exact / Approx { b, z }) + +## Notes + +FORT RISQUE DE DÉRIVE. C'est le fichier le plus affecté par les changements récents : +- EvidenceKind (Exact / Approx) est désormais un concept de premier plan — toute la sémantique de query en dépend +- `LayerEvidence` enum interne à `MphfLayer` : dispatch transparent find() → find_exact() ou find_approx() +- `fingerprint.rs` : module entièrement nouveau (FingerprintVec + FingerprintVecWriter) +- `build_evidence()` / `build_exact_evidence()` / `build_approx_evidence()` sont nouveaux +- `block_bits` dans les fonctions build : O(1) garanti avec le chemin chaud explicit pour block_bits=0 +- Séparation open() (accès aléatoire, requiert .idx) vs open_sequential() (itération seule) +Pratiquement toute cette page est à réécrire. diff --git a/docmd/implementation/obipipeline.refs.md b/docmd/implementation/obipipeline.refs.md new file mode 100644 index 0000000..82fb2e8 --- /dev/null +++ b/docmd/implementation/obipipeline.refs.md @@ -0,0 +1,13 @@ + +# Coverage: implementation/obipipeline.md + +## Code couvert + +- `obipipeline/src/lib.rs` — WorkerPool, Pipeline, macro make_pipeline! +- `obipipeline/src/scheduler.rs` — Scheduler avec Select biaisé sur les entrées de canaux + +## Notes + +Document stable (librairie générique, peu de risque de dérive). +Vérifier si `obipipeline` est toujours utilisé dans la phase scatter de `obikpartitionner` +ou s'il a été remplacé par Rayon dans certains chemins. diff --git a/docmd/implementation/persistent_bit_vec.refs.md b/docmd/implementation/persistent_bit_vec.refs.md new file mode 100644 index 0000000..67fed55 --- /dev/null +++ b/docmd/implementation/persistent_bit_vec.refs.md @@ -0,0 +1,13 @@ + +# Coverage: implementation/persistent_bit_vec.md + +## Code couvert + +- `obicompactvec/src/bitvec.rs` — PersistentBitVec, opérations mot u64, invariant de padding +- `obicompactvec/src/bitmatrix.rs` — PersistentBitMatrix, wrapper colonne-major, append_column +- `obicompactvec/src/bitmatrix.rs` — PersistentBitMatrixBuilder + +## Notes + +Document d'implémentation stable. Vérifier que `PersistentBitMatrixBuilder` et `append_column` +sont couverts (utilisés dans `Layer::::build_presence` et `append_genome_column`). diff --git a/docmd/implementation/persistent_compact_int_vec.refs.md b/docmd/implementation/persistent_compact_int_vec.refs.md new file mode 100644 index 0000000..c5da1b0 --- /dev/null +++ b/docmd/implementation/persistent_compact_int_vec.refs.md @@ -0,0 +1,14 @@ + +# Coverage: implementation/persistent_compact_int_vec.md + +## Code couvert + +- `obicompactvec/src/builder.rs` — PersistentCompactIntVecBuilder, cycle de vie +- `obicompactvec/src/reader.rs` — PersistentCompactIntVec, accès aléatoire et séquentiel +- `obicompactvec/src/intmatrix.rs` — PersistentCompactIntMatrix, wrapper colonne-major, append_column +- `obicompactvec/src/format.rs` — format de fichier (magic PCIV, header, primary u8, overflow, index) + +## Notes + +Document d'implémentation stable. Vérifier que `append_column` (utilisé dans merge et reindex) +est décrit. Vérifier que `PersistentCompactIntMatrixBuilder` est couvert (utilisé dans `layer.rs`). diff --git a/docmd/implementation/pipeline.refs.md b/docmd/implementation/pipeline.refs.md new file mode 100644 index 0000000..a5656a0 --- /dev/null +++ b/docmd/implementation/pipeline.refs.md @@ -0,0 +1,19 @@ + +# Coverage: implementation/pipeline.md + +## Code couvert + +- `obikpartitionner/src/partition.rs` — estimation des paramètres (phase 0) +- `obiskbuilder/src/iter.rs` — scatter : filtre entropie, extraction superkmers, routage partition (phase 1) +- `obikpartitionner/src/filter.rs` — déduplication bucket-sort (phase 2) +- `obikpartitionner/src/kmer_sort.rs` — tri externe + agrégation de comptages (phase 3) +- `obidebruinj/src/debruijn.rs` — graphe De Bruijn, extraction des unitigs (phase 5) +- `obikpartitionner/src/index_layer.rs` — construction MPHF + évidence (phase 6), paramètre `block_bits` +- `obikindex/src/index.rs` — `build_layers()`, `dereplicate_and_count()` + +## Notes + +RISQUE DE DÉRIVE modéré. Vérifier : +- Phase 6 : la doc mentionne-t-elle le filtre d'abondance (`min_ab`, `max_ab`) ? +- Phase 6 : `block_bits` passé à `build_index_layer` depuis `IndexConfig` +- Phase 6 : dispatch exact/approx selon `EvidenceKind` dans `build_index_layer` diff --git a/docmd/implementation/storage.md b/docmd/implementation/storage.md index defabf4..baf3e9e 100644 --- a/docmd/implementation/storage.md +++ b/docmd/implementation/storage.md @@ -1,5 +1,136 @@ -# On-disk collection structure +# On-disk index layout -See [obilayeredmap crate](obilayeredmap.md) for the current on-disk layout. +## Directory tree -The index root contains one `part_XXXXX/` directory per partition, each holding one or more `layer_N/` directories. Each layer directory contains `mphf.bin`, `unitigs.bin`, `unitigs.bin.idx`, `evidence.bin`, and optionally a `counts/` or `presence/` payload directory. +``` +/ + index.meta ← JSON: IndexMeta + scatter.done ← sentinel: scatter phase complete + count.done ← sentinel: dereplicate + count complete + index.done ← sentinel: MPHF index fully built + spectrums/ +