docs: update architecture and storage specs for approximate index
Restructure architecture documentation to reflect the decoupled `MphfLayer` design wrapped by `LayeredStore<S>` and enforce strict multi-genome column invariants. Introduce the approximate index architecture, replacing exact `evidence.bin` with compact `fingerprint.bin` using B-bit fingerprints and z-consecutive k-mer matching. Update CLI flags, add `reindex`/`estimate` workflows, and refactor APIs to support separate exact/approximate evidence handling. Finally, provide a comprehensive on-disk layout specification, including the pipeline state machine, JSON schemas, binary formats, and refined Strategy B unitig evidence details.
@@ -2,169 +2,155 @@
|
||||
|
||||
## Fundamental invariant
|
||||
|
||||
A given canonical kmer belongs to **exactly one partition** and **exactly one layer** within that partition. This is the property that makes all aggregation operations decomposable and parallelisable without coordination.
|
||||
A given canonical kmer belongs to **exactly one partition** and **exactly one layer** within that partition. This property makes all aggregation operations decomposable and parallelisable without coordination.
|
||||
|
||||
---
|
||||
|
||||
## Three-level hierarchy
|
||||
|
||||
```
|
||||
PartitionedIndex
|
||||
├── LayeredPartition (one per minimiser bucket)
|
||||
│ ├── MphfLayer 0 kmer → slot (immutable bijection)
|
||||
│ │ ├── DataStore A slot → T (e.g. counts)
|
||||
│ │ └── DataStore B slot → T (e.g. presence/absence, derived)
|
||||
│ ├── MphfLayer 1
|
||||
│ │ └── DataStore A
|
||||
│ └── ...
|
||||
├── LayeredPartition
|
||||
│ └── ...
|
||||
KmerIndex (index.meta + KmerPartition)
|
||||
├── partition_0/index/ one directory per minimiser bucket
|
||||
│ ├── meta.json PartitionMeta { n_layers }
|
||||
│ ├── layer_0/
|
||||
│ │ ├── layer_meta.json LayerMeta { evidence: EvidenceKind }
|
||||
│ │ ├── mphf.bin PtrHash MPHF
|
||||
│ │ ├── unitigs.bin unitig spine (never overwritten)
|
||||
│ │ ├── evidence.bin exact evidence (Exact only)
|
||||
│ │ ├── unitigs.bin.idx block index (Exact only)
|
||||
│ │ ├── fingerprint.bin fingerprints (Approx only)
|
||||
│ │ ├── counts/ PersistentCompactIntMatrix (with_counts = true)
|
||||
│ │ └── presence/ PersistentBitMatrix
|
||||
│ └── layer_1/
|
||||
│ └── ...
|
||||
└── partition_1/index/
|
||||
└── ...
|
||||
```
|
||||
|
||||
**PartitionedIndex**: routes queries to partitions via canonical minimiser hash. Owns the partition count and routing scheme (fixed at creation). Dispatches aggregations across partitions in parallel.
|
||||
**KmerIndex**: root entry point. Owns `IndexMeta` (written to `index.meta`) and a `KmerPartition` that routes canonical kmers to partition directories. All partition-level operations are dispatched in parallel via rayon.
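The routing step can be sketched as follows, assuming the partition index is taken from the low `n_bits` bits of a canonical-minimiser hash (`n_bits` is `log2(n_partitions)` in `IndexConfig`). The helper names `partition_of` and `partition_dir` are hypothetical, and the real hash function and bit selection may differ.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical routing helper: keeps the low `n_bits` bits of a
/// canonical-minimiser hash, so that n_partitions = 2^n_bits.
fn partition_of(minimizer_hash: u64, n_bits: usize) -> usize {
    assert!(n_bits < 64, "n_bits must be smaller than the hash width");
    let mask = (1u64 << n_bits) - 1;
    (minimizer_hash & mask) as usize
}

/// Directory of one partition, following the `partition_<i>/index/`
/// layout shown in the tree above.
fn partition_dir(root: &Path, partition: usize) -> PathBuf {
    root.join(format!("partition_{partition}")).join("index")
}
```

Because the mapping depends only on the kmer's own minimiser, every kmer lands in exactly one partition, which is what the fundamental invariant requires.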
|
||||
|
||||
**LayeredPartition**: one directory per minimiser bucket. Holds a `Vec<MphfLayer>`. Each layer covers a disjoint kmer set — layer 0 is built from dataset A; layer 1 covers kmers in B absent from layer 0; and so on. Layers within a partition are always disjoint.
|
||||
**Partition directory**: one directory per minimiser bucket. `PartitionMeta` (stored as `meta.json`) records `n_layers`. Layers within a partition cover disjoint kmer sets.
|
||||
|
||||
**MphfLayer**: the MPHF + evidence + unitig spine. Maps `kmer → slot` for its disjoint kmer set. Immutable once built. Independent of any data attached to it.
|
||||
|
||||
**DataStore**: a slot-indexed data array (e.g. `PersistentCompactIntMatrix`, `PersistentBitMatrix`). Attached to a `MphfLayer` externally. Multiple stores of different types can coexist on the same `MphfLayer`.
|
||||
**Layer directory**: one `MphfLayer` plus optional data stores. `LayerMeta` (stored as `layer_meta.json`) records which `EvidenceKind` was used. The MPHF and `unitigs.bin` are immutable once built; evidence files are the only part replaced by `reindex`.
|
||||
|
||||
---
|
||||
|
||||
## MphfLayer — autonomous mapping layer
|
||||
## IndexConfig and IndexMeta
|
||||
|
||||
```rust
|
||||
MphfLayer::find(kmer: CanonicalKmer) -> Option<usize> // slot, or None if absent
|
||||
MphfLayer::n() -> usize // number of slots
|
||||
MphfLayer::build(dir: &Path) -> OLMResult<(Self, usize)> // from unitigs.bin
|
||||
MphfLayer::open(dir: &Path) -> OLMResult<Self>
|
||||
pub struct IndexConfig {
|
||||
pub kmer_size: usize,
|
||||
pub minimizer_size: usize,
|
||||
pub n_bits: usize, // log2(n_partitions)
|
||||
pub with_counts: bool,
|
||||
pub evidence: EvidenceKind,
|
||||
pub block_bits: u8, // .idx granularity: 2^block_bits unitigs/block; 0 = one entry per unitig
|
||||
}
|
||||
|
||||
pub struct IndexMeta {
|
||||
pub version: u32,
|
||||
pub config: IndexConfig,
|
||||
pub genomes: Vec<GenomeInfo>, // ordered; index = genome column number
|
||||
}
|
||||
|
||||
pub struct GenomeInfo {
|
||||
pub label: String,
|
||||
pub meta: HashMap<String, String>, // arbitrary categorical metadata
|
||||
}
|
||||
```
|
||||
|
||||
`find` returns `Some(slot)` only if the kmer is actually in this layer (evidence check included). Returns `None` for kmers present in other layers or absent from the index.
|
||||
`IndexMeta` is serialised as `index.meta` (JSON). It is the authority for the ordered list of genomes and for the parameters that govern all subsequent operations on the index.
|
||||
|
||||
The MPHF (`mphf.bin`, `evidence.bin`, `unitigs.bin`) is built once and never rebuilt. All data derivation operations (count → presence, thresholding, merging) reuse the same `MphfLayer`.
|
||||
---
|
||||
|
||||
## EvidenceKind
|
||||
|
||||
```rust
|
||||
pub enum EvidenceKind {
|
||||
Exact,
|
||||
Approx { b: u8, z: u8 },
|
||||
}
|
||||
```
|
||||
|
||||
Controls which files are written per layer and which query path is taken:
|
||||
|
||||
| Variant | Files written | False-positive rate |
|
||||
|---|---|---|
|
||||
| `Exact` | `evidence.bin`, `unitigs.bin.idx` | 0 |
|
||||
| `Approx { b, z }` | `fingerprint.bin` | ≈ W / 2^(b·z) per read (Findere) |
|
||||
|
||||
`EvidenceKind` is stored both in `IndexConfig` (index-wide default, updated by `reindex`) and in each `LayerMeta` (per-layer record of what was actually built).
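The role of `z` can be illustrated with a small sketch of z-consecutive matching: a window is accepted only when `z` consecutive kmers all pass the fingerprint check. This is an assumed reading of the Findere-style scheme named above; `z_window_hits` is a hypothetical helper, not part of the crate.

```rust
/// For each window of `z` consecutive kmers along a read, report whether
/// all `z` per-kmer fingerprint checks were positive.
/// `kmer_hits[i]` is the (possibly false-positive) result for kmer `i`.
fn z_window_hits(kmer_hits: &[bool], z: usize) -> Vec<bool> {
    assert!(z >= 1, "z must be at least 1");
    kmer_hits
        .windows(z) // yields nothing when the read has fewer than z kmers
        .map(|w| w.iter().all(|&hit| hit))
        .collect()
}
```

A single spurious fingerprint match is filtered out unless its `z - 1` neighbours also match, which is why the false-positive rate in the table shrinks with `b·z` rather than with `b` alone.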
|
||||
|
||||
---
|
||||
|
||||
## MphfLayer — autonomous kmer → slot mapping
|
||||
|
||||
```rust
|
||||
pub struct MphfLayer {
|
||||
mphf: PtrHash<…>,
|
||||
ev: LayerEvidence, // Exact { evidence, unitigs } | Approx { fingerprint }
|
||||
n: usize,
|
||||
}
|
||||
```
|
||||
|
||||
`MphfLayer::find(kmer)` dispatches transparently to `find_exact` or `find_approx` based on the evidence loaded at `open` time (read from `layer_meta.json`). It returns `Some(slot)` only if the evidence check confirms the kmer (up to the false-positive rate under `Approx`), and `None` if the kmer is absent or out of range.
|
||||
|
||||
```
|
||||
find_exact: slot = mphf(kmer); decode evidence → (chunk_id, rank); verify kmer in unitigs
|
||||
find_approx: slot = mphf(kmer); check fingerprint[slot] == seq_hash(kmer)
|
||||
```
|
||||
|
||||
`block_bits` controls the `.idx` file written alongside `evidence.bin`. At `block_bits = 0`, every unitig chunk has an index entry, giving O(1) random access; larger values trade access time for a smaller `.idx`.
|
||||
|
||||
The MPHF and `unitigs.bin` are never rebuilt by any post-build operation.
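The `find_approx` path can be sketched with a toy layer. Everything here is a stand-in: a sorted key list replaces the MPHF (a real MPHF stores no keys), and `seq_hash` simply keeps the low `b` bits of the kmer code instead of hashing the sequence. The point is only to show how a fingerprint check rejects most absent kmers and occasionally lets one through.

```rust
/// Toy approximate layer. The sorted key list is only a stand-in that
/// yields a slot for every input, as an MPHF would.
struct ToyApproxLayer {
    keys: Vec<u64>,        // sorted, distinct; stand-in for the MPHF
    fingerprint: Vec<u64>, // one b-bit fingerprint per slot
    b: u8,
}

/// Hypothetical fingerprint: low `b` bits of the kmer code (b < 64).
fn seq_hash(kmer: u64, b: u8) -> u64 {
    kmer & ((1u64 << b) - 1)
}

impl ToyApproxLayer {
    fn build(mut keys: Vec<u64>, b: u8) -> Self {
        keys.sort_unstable();
        keys.dedup();
        let fingerprint = keys.iter().map(|&k| seq_hash(k, b)).collect();
        Self { keys, fingerprint, b }
    }

    /// Like an MPHF, returns some slot in `0..n` even for absent keys.
    fn slot(&self, kmer: u64) -> usize {
        match self.keys.binary_search(&kmer) {
            Ok(i) => i,
            Err(i) => i % self.keys.len(),
        }
    }

    /// Accept the slot only if the stored fingerprint matches.
    fn find_approx(&self, kmer: u64) -> Option<usize> {
        if self.keys.is_empty() {
            return None;
        }
        let slot = self.slot(kmer);
        (self.fingerprint[slot] == seq_hash(kmer, self.b)).then_some(slot)
    }
}
```

With keys `{10, 20, 30}` and `b = 8`, the absent kmer `25` is rejected, while the absent kmer `266` collides with the fingerprint of `10` and is reported at slot 0: a false positive of the kind the `z` parameter is meant to suppress.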
|
||||
|
||||
---
|
||||
|
||||
## Layer\<D\> — MPHF + data payload
|
||||
|
||||
```rust
|
||||
pub struct Layer<D: LayerData = ()> {
|
||||
mphf: MphfLayer,
|
||||
data: D,
|
||||
}
|
||||
```
|
||||
|
||||
`D` selects the attached data payload:
|
||||
|
||||
| `D` | Data directory | `Item` returned by `query` |
|
||||
|---|---|---|
|
||||
| `()` | — | `()` (set membership only) |
|
||||
| `PersistentCompactIntMatrix` | `counts/` | `Box<[u32]>` (counts per genome) |
|
||||
| `PersistentBitMatrix` | `presence/` | `Box<[bool]>` (presence per genome) |
|
||||
|
||||
`Layer::query(kmer)` delegates to `MphfLayer::find`, then calls `data.read(slot)` if a slot is returned. Both exact and approximate evidence are handled transparently; the caller sees only `Option<Hit<D::Item>>`.
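A minimal sketch of this delegation, under stated assumptions: the `LayerData` trait shape, the fields of `Hit`, and the `HashMap` used in place of `MphfLayer::find` are all hypothetical, chosen only to make the example self-contained.

```rust
use std::collections::HashMap;

/// Hypothetical payload trait: read the item stored at a slot.
trait LayerData {
    type Item;
    fn read(&self, slot: usize) -> Self::Item;
}

/// `()` carries no payload: a hit only proves set membership.
impl LayerData for () {
    type Item = ();
    fn read(&self, _slot: usize) -> Self::Item {}
}

/// Toy count matrix laid out as `rows[slot][genome]`.
struct ToyCounts {
    rows: Vec<Vec<u32>>,
}

impl LayerData for ToyCounts {
    type Item = Box<[u32]>;
    fn read(&self, slot: usize) -> Self::Item {
        self.rows[slot].clone().into_boxed_slice()
    }
}

/// Hypothetical hit record; the real `Hit` may carry other fields.
#[derive(Debug, PartialEq)]
struct Hit<T> {
    slot: usize,
    item: T,
}

struct Layer<D: LayerData = ()> {
    slots: HashMap<u64, usize>, // stand-in for MphfLayer::find
    data: D,
}

impl<D: LayerData> Layer<D> {
    /// Look up the slot first, then read the payload only on a hit.
    fn query(&self, kmer: u64) -> Option<Hit<D::Item>> {
        let slot = *self.slots.get(&kmer)?;
        Some(Hit { slot, item: self.data.read(slot) })
    }
}
```

The caller never branches on the evidence kind or on the payload type; swapping `ToyCounts` for `()` changes only the `Item` carried by the hit.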
|
||||
|
||||
Build-time entry points:
|
||||
|
||||
```rust
|
||||
Layer<()>::build(out_dir, block_bits) // set membership
|
||||
Layer<PersistentCompactIntMatrix>::build(out_dir, block_bits, count_of)
|
||||
Layer<PersistentBitMatrix>::build_presence(out_dir, block_bits, n_genomes, present_in)
|
||||
|
||||
Layer::<()>::build_evidence(layer_dir, kind, block_bits) // evidence only (reindex path)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## DataStore — slot-indexed data
|
||||
|
||||
```rust
|
||||
trait DataStore {
|
||||
type Item;
|
||||
fn get(&self, slot: usize) -> Self::Item;
|
||||
fn n(&self) -> usize;
|
||||
}
|
||||
```
|
||||
`PersistentCompactIntMatrix` and `PersistentBitMatrix` are slot-indexed stores. They know nothing about kmers or MPHFs.
|
||||
|
||||
Concrete types from `obicompactvec`:
|
||||
|
||||
| Type | `Item` | Column stats | Use |
|
||||
| Type | `Item` | Aggregation method | Use |
|
||||
|---|---|---|---|
|
||||
| `PersistentCompactIntMatrix` | `Box<[u32]>` | `sum() -> Array1<u64>` | count per sample per slot |
|
||||
| `PersistentBitMatrix` | `Box<[bool]>` | `count_ones() -> Array1<u64>` | presence per sample per slot |
|
||||
|
||||
`sum()` and `count_ones()` are the bridge between the per-matrix level and cross-layer aggregation: they give the total weight of each column within one (partition, layer) pair, which can be summed to get global column weights.
|
||||
|
||||
A `DataStore` knows nothing about kmers or MPHFs. It is indexed by `usize` slot only.
|
||||
| `PersistentCompactIntMatrix` | `Box<[u32]>` | `sum() → Array1<u64>` | counts per genome per slot |
|
||||
| `PersistentBitMatrix` | `Box<[bool]>` | `count_ones() → Array1<u64>` | presence per genome per slot |
|
||||
|
||||
---
|
||||
|
||||
## Distance matrix API on DataStore types
|
||||
## Aggregation traits — `obicompactvec::traits`
|
||||
|
||||
Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` expose two families of distance matrix methods.
|
||||
|
||||
### Full distance matrices
|
||||
|
||||
Compute the final `n_cols × n_cols` distance matrix from data within a single matrix. Internally parallelised over the upper triangle via rayon.
|
||||
|
||||
```rust
|
||||
// PersistentCompactIntMatrix
|
||||
fn bray_dist_matrix(&self) -> Array2<f64>
|
||||
fn relfreq_bray_dist_matrix(&self) -> Array2<f64>
|
||||
fn euclidean_dist_matrix(&self) -> Array2<f64>
|
||||
fn relfreq_euclidean_dist_matrix(&self) -> Array2<f64>
|
||||
fn hellinger_dist_matrix(&self) -> Array2<f64>
|
||||
fn jaccard_dist_matrix(&self) -> Array2<f64>
|
||||
fn threshold_jaccard_dist_matrix(&self, threshold: u32) -> Array2<f64>
|
||||
|
||||
// PersistentBitMatrix
|
||||
fn jaccard_dist_matrix(&self) -> Array2<f64>
|
||||
fn hamming_dist_matrix(&self) -> Array2<u64>
|
||||
```
|
||||
|
||||
These are convenience methods. For a `LayeredDataStore` or `PartitionedDataStore` they cannot be used directly — the partial API is required.
|
||||
|
||||
### Partial distance matrices
|
||||
|
||||
Return additive components that can be summed element-wise across (partition, layer) pairs before computing the final distance. This is what makes cross-layer and cross-partition aggregation possible.
|
||||
|
||||
**Category 1 — self-contained partials**: additive without any external parameter.
|
||||
|
||||
```rust
|
||||
// PersistentCompactIntMatrix
|
||||
fn partial_bray_dist_matrix(&self)
|
||||
-> (Array2<u64>, // sum_min[i,j]
|
||||
Array1<u64>) // col_sums[k]
|
||||
|
||||
fn partial_euclidean_dist_matrix(&self) -> Array2<f64> // sum of squared diffs
|
||||
fn partial_threshold_jaccard_dist_matrix(&self, threshold: u32)
|
||||
-> (Array2<u64>, // inter[i,j]
|
||||
Array2<u64>) // union[i,j]
|
||||
|
||||
// PersistentBitMatrix
|
||||
fn partial_jaccard_dist_matrix(&self)
|
||||
-> (Array2<u64>, // inter[i,j]
|
||||
Array2<u64>) // union[i,j]
|
||||
fn partial_hamming_dist_matrix(&self) -> Array2<u64> // differing bits
|
||||
```
|
||||
|
||||
**Category 2 — normalised partials**: require global column sums as input, computed beforehand across all (partition, layer) pairs.
|
||||
|
||||
```rust
|
||||
// PersistentCompactIntMatrix only
|
||||
fn partial_relfreq_bray_dist_matrix(&self, col_sums: &Array1<u64>)
|
||||
-> Array2<f64> // Σ_slot min(a_slot/sum_i, b_slot/sum_j)
|
||||
|
||||
fn partial_relfreq_euclidean_dist_matrix(&self, col_sums: &Array1<u64>)
|
||||
-> Array2<f64> // Σ_slot (a_slot/sum_i - b_slot/sum_j)²
|
||||
|
||||
fn partial_hellinger_euclidean_dist_matrix(&self, col_sums: &Array1<u64>)
|
||||
-> Array2<f64> // Σ_slot (√(a/sum_i) - √(b/sum_j))²
|
||||
```
|
||||
|
||||
The `col_sums` parameter must reflect the GLOBAL count across all layers and all partitions — passing a per-layer sum would give a wrong result. This constraint drives the two-pass algorithm described below.
|
||||
|
||||
---
|
||||
|
||||
## Progressive aggregation principle
|
||||
|
||||
Aggregation is **hierarchical**: each level computes its contribution by aggregating from the level immediately below it. No level skips a level or collects raw data from two levels down.
|
||||
|
||||
```
|
||||
PersistentCompactIntMatrix::col_weights() — column sums for one (partition, layer) matrix
|
||||
↓ Σ across layers
|
||||
LayeredStore<PersistentCompactIntMatrix>::col_weights() — column sums for one partition
|
||||
↓ Σ across partitions
|
||||
LayeredStore<LayeredStore<…>>::col_weights() — global column sums
|
||||
```
|
||||
|
||||
The same cascade applies to every partial:
|
||||
|
||||
```
|
||||
PersistentCompactIntMatrix::partial_bray() — one (partition, layer)
|
||||
↓ element-wise Σ across layers
|
||||
LayeredStore<PersistentCompactIntMatrix>::partial_bray() — one partition
|
||||
↓ element-wise Σ across partitions
|
||||
LayeredStore<LayeredStore<…>>::partial_bray() — global partial → final dist
|
||||
```
|
||||
|
||||
Each level presents a stable trait surface to the level above; no level reaches two levels down.
|
||||
|
||||
---
|
||||
|
||||
## Traits — `obicompactvec::traits`
|
||||
|
||||
Three traits unify the aggregation API across all levels of the hierarchy.
|
||||
Three traits unify the aggregation API across all hierarchy levels.
|
||||
|
||||
```rust
|
||||
trait ColumnWeights: Send + Sync {
|
||||
@@ -172,81 +158,130 @@ trait ColumnWeights: Send + Sync {
|
||||
}
|
||||
|
||||
trait CountPartials: ColumnWeights {
|
||||
// self-contained partials (additive, no parameter)
|
||||
fn partial_bray(&self) -> Array2<u64>;
|
||||
fn partial_euclidean(&self) -> Array2<f64>;
|
||||
fn partial_threshold_jaccard(&self, threshold: u32) -> (Array2<u64>, Array2<u64>);
|
||||
// normalised partials (global col_weights passed in cascade)
|
||||
fn partial_relfreq_bray(&self, global: &Array1<u64>) -> Array2<f64>;
|
||||
fn partial_relfreq_euclidean(&self, global: &Array1<u64>) -> Array2<f64>;
|
||||
fn partial_hellinger(&self, global: &Array1<u64>) -> Array2<f64>;
|
||||
// provided finalisation methods (default implementations)
|
||||
fn bray_dist_matrix(&self) -> Array2<f64> { … }
|
||||
fn euclidean_dist_matrix(&self) -> Array2<f64> { … }
|
||||
fn threshold_jaccard_dist_matrix(&self, threshold: u32) -> Array2<f64> { … }
|
||||
fn relfreq_bray_dist_matrix(&self) -> Array2<f64> { … }
|
||||
fn relfreq_euclidean_dist_matrix(&self) -> Array2<f64> { … }
|
||||
fn hellinger_dist_matrix(&self) -> Array2<f64> { … }
|
||||
fn partial_bray(&self) -> Array2<u64>;
|
||||
fn partial_euclidean(&self) -> Array2<f64>;
|
||||
fn partial_threshold_jaccard(&self, threshold: u32) -> (Array2<u64>, Array2<u64>);
|
||||
fn partial_relfreq_bray(&self, global: &Array1<u64>) -> Array2<f64>;
|
||||
fn partial_relfreq_euclidean(&self, global: &Array1<u64>) -> Array2<f64>;
|
||||
fn partial_hellinger(&self, global: &Array1<u64>) -> Array2<f64>;
|
||||
// provided finalisation methods with default impls
|
||||
fn bray_dist_matrix(&self) -> Array2<f64> { … }
|
||||
fn relfreq_bray_dist_matrix(&self) -> Array2<f64> { … }
|
||||
// …
|
||||
}
|
||||
|
||||
trait BitPartials: ColumnWeights {
|
||||
fn partial_jaccard(&self) -> (Array2<u64>, Array2<u64>);
|
||||
fn partial_hamming(&self) -> Array2<u64>;
|
||||
fn partial_jaccard(&self) -> (Array2<u64>, Array2<u64>);
|
||||
fn partial_hamming(&self) -> Array2<u64>;
|
||||
// provided
|
||||
fn jaccard_dist_matrix(&self) -> Array2<f64> { … }
|
||||
fn hamming_dist_matrix(&self) -> Array2<u64> { … }
|
||||
}
|
||||
```
|
||||
|
||||
**Leaf implementors** (in `obicompactvec`):
|
||||
Leaf implementors:
|
||||
|
||||
| Type | Traits |
|
||||
|---|---|
|
||||
| `PersistentCompactIntMatrix` | `ColumnWeights` (via `sum()`), `CountPartials` |
|
||||
| `PersistentBitMatrix` | `ColumnWeights` (via `count_ones()`), `BitPartials` |
|
||||
|
||||
`PersistentCompactIntVec` and `PersistentBitVec` do **not** implement these traits — they are single-column primitives, not matrix-level aggregators.
|
||||
| `PersistentCompactIntMatrix` | `ColumnWeights`, `CountPartials` |
|
||||
| `PersistentBitMatrix` | `ColumnWeights`, `BitPartials` |
|
||||
|
||||
---
|
||||
|
||||
## `LayeredStore<S>` — `obilayeredmap`
|
||||
|
||||
A single generic wrapper replaces the need for named `LayeredDataStore` and `PartitionedDataStore` types:
|
||||
## LayeredStore\<S\> — recursive aggregation wrapper
|
||||
|
||||
```rust
|
||||
pub struct LayeredStore<S>(Vec<S>);
|
||||
```
|
||||
|
||||
Three blanket impls propagate the traits up the hierarchy:
|
||||
Three blanket impls propagate all traits up the hierarchy:
|
||||
|
||||
```rust
|
||||
impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> { … } // Σ across inner stores
|
||||
impl<S: CountPartials> CountPartials for LayeredStore<S> { … } // same pattern
|
||||
impl<S: BitPartials> BitPartials for LayeredStore<S> { … } // same pattern
|
||||
impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> { … }
|
||||
impl<S: CountPartials> CountPartials for LayeredStore<S> { … }
|
||||
impl<S: BitPartials> BitPartials for LayeredStore<S> { … }
|
||||
```
|
||||
|
||||
Because the blanket impl is recursive, **`LayeredStore<LayeredStore<S>>`** automatically inherits all three traits when `S` does — no separate `PartitionedStore` type is needed:
|
||||
This makes `LayeredStore<LayeredStore<PersistentCompactIntMatrix>>` automatically implement `CountPartials` — no separate `PartitionedStore` type is needed:
|
||||
|
||||
```
|
||||
PersistentCompactIntMatrix implements CountPartials
|
||||
LayeredStore<PersistentCompactIntMatrix> via blanket impl (= one partition)
|
||||
LayeredStore<LayeredStore<…>> via blanket impl (= partitioned index)
|
||||
PersistentCompactIntMatrix leaf (one layer)
|
||||
LayeredStore<PersistentCompactIntMatrix> one partition (layers are disjoint)
|
||||
LayeredStore<LayeredStore<…>> whole index (partitions are independent)
|
||||
```
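The recursion can be sketched for `ColumnWeights` alone. This is a simplified illustration, not the crate's code: `Vec<u64>` replaces `Array1<u64>` to avoid external dependencies, the `Send + Sync` bounds are dropped, and `ToyMatrix` is a hypothetical leaf.

```rust
/// Simplified trait: the real one returns `Array1<u64>`.
trait ColumnWeights {
    fn col_weights(&self) -> Vec<u64>;
}

/// Toy leaf matrix laid out as `rows[slot][column]`.
struct ToyMatrix {
    rows: Vec<Vec<u64>>,
    n_cols: usize,
}

impl ColumnWeights for ToyMatrix {
    fn col_weights(&self) -> Vec<u64> {
        let mut sums = vec![0u64; self.n_cols];
        for row in &self.rows {
            for (s, v) in sums.iter_mut().zip(row) {
                *s += v;
            }
        }
        sums
    }
}

struct LayeredStore<S>(Vec<S>);

/// One blanket impl covers both nesting levels: it only asks that the
/// inner stores implement the same trait.
impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> {
    fn col_weights(&self) -> Vec<u64> {
        let mut parts = self.0.iter().map(|s| s.col_weights());
        let mut total = parts.next().unwrap_or_default();
        for w in parts {
            // Assumes every inner store has the same column count.
            for (t, v) in total.iter_mut().zip(w) {
                *t += v;
            }
        }
        total
    }
}
```

`LayeredStore<ToyMatrix>` then plays the role of one partition and `LayeredStore<LayeredStore<ToyMatrix>>` the role of the whole index, with no extra type.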
|
||||
|
||||
### Normalised metrics — two-pass cascade
|
||||
|
||||
The normalised finalisation methods call `col_weights()` first (pass 1), then the normalised partial (pass 2). Both calls go through the same blanket impl, so the cascade is automatic:
|
||||
Normalised metrics require global column sums — computed in a two-pass cascade:
|
||||
|
||||
```rust
|
||||
// called on LayeredStore<LayeredStore<PersistentCompactIntMatrix>>
|
||||
// on LayeredStore<LayeredStore<PersistentCompactIntMatrix>>
|
||||
fn relfreq_bray_dist_matrix(&self) -> Array2<f64> {
|
||||
let global = self.col_weights(); // pass 1 — progressive sum at every level
|
||||
let p = self.partial_relfreq_bray(&global); // pass 2 — global passed in cascade
|
||||
p.mapv(|v| 1.0 - v) // finalise (diagonal zeroed separately)
|
||||
let global = self.col_weights(); // pass 1 — sums up hierarchy
|
||||
let p = self.partial_relfreq_bray(&global); // pass 2 — global broadcast read-only
|
||||
p.mapv(|v| 1.0 - v)
|
||||
}
|
||||
```
|
||||
|
||||
`global` is exact: each kmer belongs to exactly one `(partition, layer)` pair, so there is no double-counting across the hierarchy.
|
||||
Because each kmer belongs to exactly one `(partition, layer)` pair, `col_weights()` has no double-counting across the hierarchy.
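The two passes can be made concrete on plain vectors. This is a dependency-free sketch, assuming a toy layout `rows[slot][genome]` and non-empty columns (a zero global sum would divide by zero); the function names mirror the trait methods but the bodies are illustrative only.

```rust
type ToyLayer = Vec<Vec<u64>>; // rows[slot][genome], hypothetical layout

/// Pass 1: global column sums over every layer.
fn col_weights(layers: &[ToyLayer], n_cols: usize) -> Vec<u64> {
    let mut global = vec![0u64; n_cols];
    for layer in layers {
        for row in layer {
            for (g, v) in global.iter_mut().zip(row) {
                *g += v;
            }
        }
    }
    global
}

/// Pass 2, one layer: sum over slots of min(a/sum_i, b/sum_j),
/// normalised by the GLOBAL sums.
fn partial_relfreq_bray(layer: &ToyLayer, global: &[u64]) -> Vec<Vec<f64>> {
    let n = global.len();
    let mut p = vec![vec![0.0f64; n]; n];
    for row in layer {
        for i in 0..n {
            for j in 0..n {
                let a = row[i] as f64 / global[i] as f64;
                let b = row[j] as f64 / global[j] as f64;
                p[i][j] += a.min(b);
            }
        }
    }
    p
}

/// Sum the per-layer partials, then finalise as 1 - partial.
fn relfreq_bray_dist_matrix(layers: &[ToyLayer], n_cols: usize) -> Vec<Vec<f64>> {
    let global = col_weights(layers, n_cols);
    let mut total = vec![vec![0.0f64; n_cols]; n_cols];
    for layer in layers {
        let p = partial_relfreq_bray(layer, &global);
        for i in 0..n_cols {
            for j in 0..n_cols {
                total[i][j] += p[i][j];
            }
        }
    }
    total
        .iter()
        .map(|r| r.iter().map(|v| 1.0 - v).collect::<Vec<f64>>())
        .collect()
}
```

Splitting the same rows across two layers or keeping them in one flat layer gives the same matrix, since the normalisation uses the same global sums in both cases.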
|
||||
|
||||
---
|
||||
|
||||
## Progressive aggregation principle
|
||||
|
||||
No level reaches two levels down. Each level sums contributions from the level immediately below:
|
||||
|
||||
```
|
||||
PersistentCompactIntMatrix::col_weights() — one (partition, layer)
|
||||
↓ Σ across layers
|
||||
LayeredStore<PersistentCompactIntMatrix>::col_weights() — one partition
|
||||
↓ Σ across partitions
|
||||
LayeredStore<LayeredStore<…>>::col_weights() — global
|
||||
```
|
||||
|
||||
The same cascade applies to every partial method.
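As an example of a self-contained partial, the Jaccard case can be sketched as below. Each layer yields an `(inter, union)` pair of matrices, the pairs are summed element-wise, and the distance is computed once at the end. The toy layout `rows[slot][genome]` and the convention of returning 0.0 for an empty union are assumptions of this sketch.

```rust
type ToyBitLayer = Vec<Vec<bool>>; // rows[slot][genome], hypothetical layout

/// Per-layer partial: intersection and union counts for each column pair.
fn partial_jaccard(layer: &ToyBitLayer, n: usize) -> (Vec<Vec<u64>>, Vec<Vec<u64>>) {
    let mut inter = vec![vec![0u64; n]; n];
    let mut uni = vec![vec![0u64; n]; n];
    for row in layer {
        for i in 0..n {
            for j in 0..n {
                inter[i][j] += (row[i] && row[j]) as u64;
                uni[i][j] += (row[i] || row[j]) as u64;
            }
        }
    }
    (inter, uni)
}

/// Sum the partials across layers, then finalise as 1 - inter/union.
fn jaccard_dist_matrix(layers: &[ToyBitLayer], n: usize) -> Vec<Vec<f64>> {
    let mut inter = vec![vec![0u64; n]; n];
    let mut uni = vec![vec![0u64; n]; n];
    for layer in layers {
        let (pi, pu) = partial_jaccard(layer, n);
        for i in 0..n {
            for j in 0..n {
                inter[i][j] += pi[i][j];
                uni[i][j] += pu[i][j];
            }
        }
    }
    (0..n)
        .map(|i| {
            (0..n)
                .map(|j| match uni[i][j] {
                    0 => 0.0, // assumed convention for two empty columns
                    u => 1.0 - inter[i][j] as f64 / u as f64,
                })
                .collect::<Vec<f64>>()
        })
        .collect()
}
```

The division happens only after all partials are summed; dividing per layer and averaging would not give the same result.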
|
||||
|
||||
---
|
||||
|
||||
## Multi-genome column invariant
|
||||
|
||||
After any merge, every layer in every partition has exactly `n_genomes` columns, where `n_genomes` is the current total in `index.meta`. This holds for both `PersistentCompactIntMatrix` and `PersistentBitMatrix`.
|
||||
|
||||
Maintained by three coordinated operations:
|
||||
|
||||
**Existing layers — column append.** `Layer::append_genome_column` appends one column to each existing layer. Slots matching the incoming genome receive its count or `true`; all other slots receive 0 or `false`.
|
||||
|
||||
**New layers — absent columns prepended.** When a new layer is created for kmers unique to the incoming genome, `n_existing_genomes` absent columns are prepended before the incoming genome's column, so the new layer immediately has the same column count as all other layers.
|
||||
|
||||
**First merge, Presence mode — `init_presence_matrix`.** The initial single-genome index has no `presence/` directory (presence is implicit). On the first merge, `Layer<()>::init_presence_matrix` materialises genome 0's presence column (all `true`) retroactively, raising the column count from 0 to 1 before appending column 1.
|
||||
|
||||
This invariant is the precondition for correct progressive aggregation: every level can blindly sum matrices from below because all matrices have the same shape.
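The first two operations can be sketched on a toy column-major layer. `ToyLayer` and `new_with_absent_prefix` are hypothetical names; only `append_genome_column` appears in the text, and its real signature is not shown there.

```rust
/// Toy count layer stored column-major: `columns[genome][slot]`.
struct ToyLayer {
    n_slots: usize,
    columns: Vec<Vec<u32>>,
}

impl ToyLayer {
    /// Existing layer: append one column for the incoming genome.
    /// `count_of(slot)` returns the incoming count, or 0 when the kmer
    /// at that slot is absent from the incoming genome.
    fn append_genome_column(&mut self, count_of: impl Fn(usize) -> u32) {
        let col: Vec<u32> = (0..self.n_slots).map(count_of).collect();
        self.columns.push(col);
    }

    /// New layer for kmers unique to the incoming genome: one all-zero
    /// column per existing genome, then the incoming genome's column.
    fn new_with_absent_prefix(n_existing_genomes: usize, incoming: Vec<u32>) -> Self {
        let n_slots = incoming.len();
        let mut columns = vec![vec![0u32; n_slots]; n_existing_genomes];
        columns.push(incoming);
        Self { n_slots, columns }
    }
}
```

After a merge, an old layer and a freshly created one report the same number of columns even though their slot counts differ, which is the shape guarantee the aggregation code relies on.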
|
||||
|
||||
---
|
||||
|
||||
## Query model
|
||||
|
||||
### Point query
|
||||
|
||||
```
|
||||
minimiser(kmer) → partition p
|
||||
for each layer l in p:
|
||||
if let Some(slot) = MphfLayer_l.find(kmer):
|
||||
return data_l.read(slot)
|
||||
return None
|
||||
```
|
||||
|
||||
O(n_layers) MPHF probes worst case; O(1) expected. The result comes from exactly one `(partition, layer)`.
|
||||
|
||||
### Aggregation
|
||||
|
||||
```
|
||||
result = reduce(
|
||||
for p in partitions: // parallel
|
||||
for l in layers(p): // parallel
|
||||
partial(data_p_l)
|
||||
)
|
||||
```
|
||||
|
||||
For normalised metrics, replace with the two-pass cascade.
|
||||
|
||||
---
|
||||
|
||||
@@ -254,103 +289,25 @@ fn relfreq_bray_dist_matrix(&self) -> Array2<f64> {
|
||||
|
||||
| Level | Unit | Coordination |
|
||||
|---|---|---|
|
||||
| Across partitions | `LayeredStore<LayeredStore<S>>` inner stores | none — fully independent |
|
||||
| Across layers within a partition | `LayeredStore<S>` inner stores | none — disjoint kmer sets |
|
||||
| Across partitions | inner stores of `LayeredStore<LayeredStore<S>>` | none |
|
||||
| Across layers within a partition | inner stores of `LayeredStore<S>` | none — disjoint kmer sets |
|
||||
| Normalised pass 1 (`col_weights`) | per inner store | none — additive |
|
||||
| Normalised pass 2 (partial) | per inner store | `global` broadcast read-only |
|
||||
| Within a matrix (distance) | upper-triangle pair `(i,j)` | none — rayon `par_iter` |
|
||||
|
||||
All levels use rayon `par_iter` internally; `reduce_with` performs a parallel tree reduction.
|
||||
---
|
||||
|
||||
## reindex — evidence conversion in place
|
||||
|
||||
`KmerIndex::reindex(target, block_bits)` converts every layer's evidence bundle to `target` without touching the MPHF or `unitigs.bin`:
|
||||
|
||||
- `→ Exact`: builds `evidence.bin` + `unitigs.bin.idx`; removes `fingerprint.bin`
|
||||
- `→ Approx { b, z }`: builds `fingerprint.bin`; removes `evidence.bin` + `unitigs.bin.idx`
|
||||
|
||||
On success, `IndexConfig::evidence` and `IndexConfig::block_bits` are updated in `index.meta`. Each layer's `layer_meta.json` is also rewritten with the new `EvidenceKind`.
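The per-layer file changes described above can be summarised as a small planning function. `reindex_plan` is a hypothetical helper that only names the files to build and to remove; it does not touch the filesystem, and the real `reindex` may order or stage these steps differently.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum EvidenceKind {
    Exact,
    Approx { b: u8, z: u8 },
}

/// Returns `(files_to_build, files_to_remove)` for one layer directory.
/// `mphf.bin` and `unitigs.bin` never appear in either list.
fn reindex_plan(target: EvidenceKind) -> (&'static [&'static str], &'static [&'static str]) {
    const EXACT: &[&str] = &["evidence.bin", "unitigs.bin.idx"];
    const APPROX: &[&str] = &["fingerprint.bin"];
    match target {
        EvidenceKind::Exact => (EXACT, APPROX),
        EvidenceKind::Approx { .. } => (APPROX, EXACT),
    }
}
```

The two evidence bundles are mutually exclusive, so the plan for one target is the mirror image of the plan for the other.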
|
||||
|
||||
---
|
||||
|
||||
## Query model
|
||||
## estimate — parameter dry-run
|
||||
|
||||
### Point query — `kmer → Option<Item>`
|
||||
|
||||
```
|
||||
minimiser(kmer) → partition p
|
||||
for each layer l in p:
|
||||
slot = MphfLayer_l.find(kmer)
|
||||
if slot is Some:
|
||||
return DataStore_l.get(slot)
|
||||
return None
|
||||
```
|
||||
|
||||
O(n_layers) MPHF probes worst case; O(1) expected. No cross-layer fusion — the result comes from exactly one (partition, layer).
|
||||
|
||||
### Aggregation — `→ Result`
|
||||
|
||||
```
|
||||
result = reduce(
|
||||
for p in partitions: // parallel
|
||||
for l in layers(p): // parallel
|
||||
partial(DataStore_p_l)
|
||||
)
|
||||
```
|
||||
|
||||
For normalised metrics replace with the two-pass scheme above.
|
||||
|
||||
---
|
||||
|
||||
## DataStore derivation
|
||||
|
||||
Because the `MphfLayer` is independent of its data stores, new stores can be derived from existing ones without rebuilding the MPHF:
|
||||
|
||||
```
|
||||
// count → presence/absence, parallel across (partition, layer)
|
||||
for (p, l) in all_partition_layer_pairs().par_iter():
|
||||
count_store = open PersistentCompactIntMatrix at (p, l)
|
||||
presence_store = PersistentBitMatrix::from_count_matrix(count_store, threshold, dir)
|
||||
```
|
||||
|
||||
Other derivations: threshold a count matrix → binary presence matrix; union two presence matrices; merge two count matrices (saturating add, column-wise). All are local to one `(partition, layer)` pair.
|
||||
|
||||
---
|
||||
|
||||
## Relationship to current implementation
|
||||
|
||||
### What is implemented
|
||||
|
||||
- **`obicompactvec::traits`**: `ColumnWeights`, `CountPartials`, `BitPartials` are defined and implemented on `PersistentCompactIntMatrix` and `PersistentBitMatrix`.
|
||||
- **`obilayeredmap::LayeredStore<S>`**: generic wrapper with blanket impls for all three traits. `LayeredStore<LayeredStore<S>>` is the partitioned level — no separate type needed. Tests confirm that splitting data across layers and across partitions gives the same distance matrices as computing on flat combined data.
|
||||
|
||||
### What is not yet implemented
|
||||
|
||||
- `Layer<D: LayerData>` still fuses `MphfLayer` and one `DataStore`. Multiple data stores on the same MPHF are not supported.
|
||||
- `LayeredMap` is a single-partition structure without distance matrix API; it does not yet use `LayeredStore`.
|
||||
- No `PartitionedIndex` type for point queries with parallel partition dispatch.
|
||||
|
||||
### Planned refactoring
|
||||
|
||||
1. Extract `MphfLayer` from `Layer<D>` as an autonomous type.
|
||||
2. Replace `LayerData` trait with the `DataStore` / `ColumnWeights` / `CountPartials` / `BitPartials` system.
|
||||
3. Rewire `LayeredMap` to hold `LayeredStore<PersistentCompactIntMatrix>` (or bit variant) alongside the MPHF layers.
|
||||
4. Implement `PartitionedIndex` using `LayeredStore<LayeredStore<S>>` for data and parallel dispatch for queries.
|
||||
|
||||
---
|
||||
|
||||
## Multi-genome column invariant
|
||||
|
||||
### The invariant
|
||||
|
||||
After any merge operation, **every layer in every partition has exactly `n_genomes` columns**, where `n_genomes` is the current total genome count recorded in `index.meta`. This applies to both `PersistentCompactIntMatrix` (Count mode) and `PersistentBitMatrix` (Presence mode).
|
||||
|
||||
### How it is maintained
|
||||
|
||||
The invariant is established and preserved by three coordinated operations:
|
||||
|
||||
**1. Existing layers — column append.**
|
||||
When merging source genome G into an existing index with `n_existing_genomes` genomes, one column is appended to every existing layer via `append_genome_column`. Slots that contain a kmer present in source G receive its count or `true`; all other slots receive 0 or `false`. After this step, every pre-existing layer has `n_existing_genomes + 1` columns.
|
||||
|
||||
**2. New layers — absent columns prepended.**
|
||||
If source G introduces kmers not found in any existing layer, a new layer is created for those kmers. Before appending source G's own column, `n_existing_genomes` absent columns (all-zero or all-false) are prepended — one per genome already in the index. This ensures the new layer starts at the same column count as every other layer in the partition immediately after creation.
|
||||
|
||||
**3. First merge, Presence mode — `init_presence_matrix`.**
|
||||
The initial single-genome index has no `presence/` directory (presence is implicit: every kmer in the index is present in genome 0). On the first merge, before appending any column for source 1, `Layer<()>::init_presence_matrix` creates `presence/col_000000.pbiv` set entirely to `true` for each existing layer. This retroactively materialises genome 0's presence column, bringing the column count from 0 to 1 so that the subsequent append for source 1 raises it to 2.
|
||||
|
||||
### Why the invariant is required
|
||||
|
||||
The `LayeredStore` aggregation traits (`col_weights`, `partial_bray`, `partial_jaccard`, etc.) sum contributions across all `(partition, layer)` pairs without any shape check. If one layer had fewer columns than others, its contribution would silently produce a malformed result — wrong column sums, wrong distance matrices, and incorrect genome-level statistics.
|
||||
|
||||
The invariant is the precondition that makes the progressive aggregation principle correct: each level can blindly sum matrices from the level below because all matrices have the same shape.
|
||||
`estimate` resolves approximate-evidence parameters (`z`, `b`, target FP rate) and prints the resulting effective kmer size and the per-kmer and per-z-window false-positive rates, without touching any index. It is used to calibrate `Approx { b, z }` before building or reindexing.
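The kind of arithmetic such a dry run involves can be sketched as follows. The formulas are assumptions, not taken from the tool: fingerprints are treated as uniform and independent (per-kmer rate 2^-b, per-window rate 2^-(b·z)), the effective kmer size is taken as `kmer_size + z - 1` following the usual Findere convention, and `W` in the table above is read as the number of z-windows in a read. The `Estimate` struct and its fields are hypothetical.

```rust
/// Hypothetical summary of an `estimate`-style dry run.
struct Estimate {
    effective_kmer_size: usize,
    fp_per_kmer: f64,
    fp_per_window: f64,
}

/// Assumes uniform, independent b-bit fingerprints.
fn estimate(kmer_size: usize, b: u8, z: u8) -> Estimate {
    assert!(z >= 1, "z must be at least 1");
    let fp_per_kmer = 0.5f64.powi(b as i32); // 2^-b
    let fp_per_window = fp_per_kmer.powi(z as i32); // 2^-(b*z)
    Estimate {
        effective_kmer_size: kmer_size + z as usize - 1,
        fp_per_kmer,
        fp_per_window,
    }
}

impl Estimate {
    /// Approximate per-read rate for `n_windows` windows (union bound),
    /// capped at 1.
    fn fp_per_read(&self, n_windows: usize) -> f64 {
        (n_windows as f64 * self.fp_per_window).min(1.0)
    }
}
```

For example, `b = 8` and `z = 2` on 31-mers would give an effective size of 32, a per-kmer rate of 1/256, and a per-window rate of 1/65536 under these assumptions.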
|
||||
|
||||