Restructure architecture documentation to reflect the decoupled `MphfLayer` design wrapped by `LayeredStore<S>` and enforce strict multi-genome column invariants. Introduce the approximate index architecture, replacing exact `evidence.bin` with compact `fingerprint.bin` using B-bit fingerprints and z-consecutive k-mer matching. Update CLI flags, add `reindex`/`estimate` workflows, and refactor APIs to support separate exact/approximate evidence handling. Finally, provide a comprehensive on-disk layout specification, including the pipeline state machine, JSON schemas, binary formats, and refined Strategy B unitig evidence details.
12 KiB
Kmer index architecture
Fundamental invariant
A given canonical kmer belongs to exactly one partition and exactly one layer within that partition. This property makes all aggregation operations decomposable and parallelisable without coordination.
Three-level hierarchy
KmerIndex (index.meta + KmerPartition)
├── partition_0/index/ one directory per minimiser bucket
│ ├── meta.json PartitionMeta { n_layers }
│ ├── layer_0/
│ │ ├── layer_meta.json LayerMeta { evidence: EvidenceKind }
│ │ ├── mphf.bin PtrHash MPHF
│ │ ├── unitigs.bin unitig spine (never overwritten)
│ │ ├── evidence.bin exact evidence (Exact only)
│ │ ├── unitigs.bin.idx block index (Exact only)
│ │ ├── fingerprint.bin fingerprints (Approx only)
│ │ ├── counts/ PersistentCompactIntMatrix (with_counts = true)
│ │ └── presence/ PersistentBitMatrix
│ └── layer_1/
│ └── ...
└── partition_1/index/
└── ...
KmerIndex: root entry point. Owns IndexMeta (written to index.meta) and a KmerPartition that routes canonical kmers to partition directories. All partition-level operations are dispatched in parallel via rayon.
Partition directory: one directory per minimiser bucket. PartitionMeta (stored as meta.json) records n_layers. Layers within a partition cover disjoint kmer sets.
Layer directory: one MphfLayer plus optional data stores. LayerMeta (stored as layer_meta.json) records which EvidenceKind was used. The MPHF and unitigs.bin are immutable once built; evidence files are the only part replaced by reindex.
IndexConfig and IndexMeta
pub struct IndexConfig {
pub kmer_size: usize,
pub minimizer_size: usize,
pub n_bits: usize, // log2(n_partitions)
pub with_counts: bool,
pub evidence: EvidenceKind,
pub block_bits: u8, // .idx granularity: 2^block_bits unitigs/block; 0 = one entry per unitig
}
pub struct IndexMeta {
pub version: u32,
pub config: IndexConfig,
pub genomes: Vec<GenomeInfo>, // ordered; index = genome column number
}
pub struct GenomeInfo {
pub label: String,
pub meta: HashMap<String, String>, // arbitrary categorical metadata
}
IndexMeta is serialised as index.meta (JSON). It is the authority for the ordered list of genomes and for the parameters that govern all subsequent operations on the index.
EvidenceKind
pub enum EvidenceKind {
Exact,
Approx { b: u8, z: u8 },
}
Controls which files are written per layer and which query path is taken:
| Variant | Files written | False-positive rate |
|---|---|---|
Exact |
evidence.bin, unitigs.bin.idx |
0 |
Approx { b, z } |
fingerprint.bin |
≈ W / 2^(b·z) per read (Findere) |
EvidenceKind is stored both in IndexConfig (index-wide default, updated by reindex) and in each LayerMeta (per-layer record of what was actually built).
MphfLayer — autonomous kmer → slot mapping
pub struct MphfLayer {
mphf: PtrHash<…>,
ev: LayerEvidence, // Exact { evidence, unitigs } | Approx { fingerprint }
n: usize,
}
MphfLayer::find(kmer) dispatches transparently to find_exact or find_approx based on the evidence loaded at open time (read from layer_meta.json). Returns Some(slot) only if the kmer is confirmed present; None for absent or out-of-range.
find_exact: slot = mphf(kmer); decode evidence → (chunk_id, rank); verify kmer in unitigs
find_approx: slot = mphf(kmer); check fingerprint[slot] == seq_hash(kmer)
block_bits controls the .idx file written alongside evidence.bin. At block_bits = 0, every unitig chunk has an index entry, giving O(1) random access; larger values trade access time for a smaller .idx.
The MPHF and unitigs.bin are never rebuilt by any post-build operation.
Layer<D> — MPHF + data payload
pub struct Layer<D: LayerData = ()> {
mphf: MphfLayer,
data: D,
}
D selects the attached data payload:
D |
Data directory | Item returned by query |
|---|---|---|
() |
— | () (set membership only) |
PersistentCompactIntMatrix |
counts/ |
Box<[u32]> (counts per genome) |
PersistentBitMatrix |
presence/ |
Box<[bool]> (presence per genome) |
Layer::query(kmer) delegates to MphfLayer::find, then calls data.read(slot) if a slot is returned. Both exact and approximate evidence are handled transparently; the caller sees only Option<Hit<D::Item>>.
Build-time entry points:
Layer<()>::build(out_dir, block_bits) // set membership
Layer<PersistentCompactIntMatrix>::build(out_dir, block_bits, count_of)
Layer<PersistentBitMatrix>::build_presence(out_dir, block_bits, n_genomes, present_in)
Layer::<()>::build_evidence(layer_dir, kind, block_bits) // evidence only (reindex path)
DataStore — slot-indexed data
PersistentCompactIntMatrix and PersistentBitMatrix are slot-indexed stores. They know nothing about kmers or MPHFs.
| Type | Item |
Aggregation method | Use |
|---|---|---|---|
PersistentCompactIntMatrix |
Box<[u32]> |
sum() → Array1<u64> |
counts per genome per slot |
PersistentBitMatrix |
Box<[bool]> |
count_ones() → Array1<u64> |
presence per genome per slot |
Aggregation traits — obicompactvec::traits
Three traits unify the aggregation API across all hierarchy levels.
trait ColumnWeights: Send + Sync {
fn col_weights(&self) -> Array1<u64>;
}
trait CountPartials: ColumnWeights {
fn partial_bray(&self) -> Array2<u64>;
fn partial_euclidean(&self) -> Array2<f64>;
fn partial_threshold_jaccard(&self, threshold: u32) -> (Array2<u64>, Array2<u64>);
fn partial_relfreq_bray(&self, global: &Array1<u64>) -> Array2<f64>;
fn partial_relfreq_euclidean(&self, global: &Array1<u64>) -> Array2<f64>;
fn partial_hellinger(&self, global: &Array1<u64>) -> Array2<f64>;
// provided finalisation methods with default impls
fn bray_dist_matrix(&self) -> Array2<f64> { … }
fn relfreq_bray_dist_matrix(&self) -> Array2<f64> { … }
// …
}
trait BitPartials: ColumnWeights {
fn partial_jaccard(&self) -> (Array2<u64>, Array2<u64>);
fn partial_hamming(&self) -> Array2<u64>;
// provided
fn jaccard_dist_matrix(&self) -> Array2<f64> { … }
fn hamming_dist_matrix(&self) -> Array2<u64> { … }
}
Leaf implementors:
| Type | Traits |
|---|---|
PersistentCompactIntMatrix |
ColumnWeights, CountPartials |
PersistentBitMatrix |
ColumnWeights, BitPartials |
LayeredStore<S> — recursive aggregation wrapper
pub struct LayeredStore<S>(Vec<S>);
Three blanket impls propagate all traits up the hierarchy:
impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> { … }
impl<S: CountPartials> CountPartials for LayeredStore<S> { … }
impl<S: BitPartials> BitPartials for LayeredStore<S> { … }
This makes LayeredStore<LayeredStore<PersistentCompactIntMatrix>> automatically implement CountPartials — no separate PartitionedStore type is needed:
PersistentCompactIntMatrix leaf (one layer)
LayeredStore<PersistentCompactIntMatrix> one partition (layers are disjoint)
LayeredStore<LayeredStore<…>> whole index (partitions are independent)
Normalised metrics require global column sums — computed in a two-pass cascade:
// on LayeredStore<LayeredStore<PersistentCompactIntMatrix>>
fn relfreq_bray_dist_matrix(&self) -> Array2<f64> {
let global = self.col_weights(); // pass 1 — sums up hierarchy
let p = self.partial_relfreq_bray(&global); // pass 2 — global broadcast read-only
p.mapv(|v| 1.0 - v)
}
Because each kmer belongs to exactly one (partition, layer) pair, col_weights() has no double-counting across the hierarchy.
Progressive aggregation principle
No level reaches two levels down. Each level sums contributions from the level immediately below:
PersistentCompactIntMatrix::col_weights() — one (partition, layer)
↓ Σ across layers
LayeredStore<PersistentCompactIntMatrix>::col_weights() — one partition
↓ Σ across partitions
LayeredStore<LayeredStore<…>>::col_weights() — global
The same cascade applies to every partial method.
Multi-genome column invariant
After any merge, every layer in every partition has exactly n_genomes columns, where n_genomes is the current total in index.meta. This holds for both PersistentCompactIntMatrix and PersistentBitMatrix.
Maintained by three coordinated operations:
Existing layers — column append. Layer::append_genome_column appends one column to each existing layer. Slots matching the incoming genome receive its count or true; all other slots receive 0 or false.
New layers — absent columns prepended. When a new layer is created for kmers unique to the incoming genome, n_existing_genomes absent columns are prepended before the incoming genome's column, so the new layer immediately has the same column count as all other layers.
First merge, Presence mode — init_presence_matrix. The initial single-genome index has no presence/ directory (presence is implicit). On the first merge, Layer<()>::init_presence_matrix materialises genome 0's presence column (all true) retroactively, raising the column count from 0 to 1 before appending column 1.
This invariant is the precondition for correct progressive aggregation: every level can blindly sum matrices from below because all matrices have the same shape.
Query model
Point query
minimiser(kmer) → partition p
for each layer l in p:
if let Some(slot) = MphfLayer_l.find(kmer):
return data_l.read(slot)
return None
O(n_layers) MPHF probes worst case; O(1) expected. The result comes from exactly one (partition, layer).
Aggregation
result = reduce(
for p in partitions: // parallel
for l in layers(p): // parallel
partial(data_p_l)
)
For normalised metrics, replace with the two-pass cascade.
Parallelism model
| Level | Unit | Coordination |
|---|---|---|
| Across partitions | inner stores of LayeredStore<LayeredStore<S>> |
none |
| Across layers within a partition | inner stores of LayeredStore<S> |
none — disjoint kmer sets |
Normalised pass 1 (col_weights) |
per inner store | none — additive |
| Normalised pass 2 (partial) | per inner store | global broadcast read-only |
| Within a matrix (distance) | upper-triangle pair (i,j) |
none — rayon par_iter |
reindex — evidence conversion in place
KmerIndex::reindex(target, block_bits) converts every layer's evidence bundle to target without touching the MPHF or unitigs.bin:
→ Exact: buildsevidence.bin+unitigs.bin.idx; removesfingerprint.bin→ Approx { b, z }: buildsfingerprint.bin; removesevidence.bin+unitigs.bin.idx
On success, IndexConfig::evidence and IndexConfig::block_bits are updated in index.meta. Each layer's layer_meta.json is also rewritten with the new EvidenceKind.
estimate — parameter dry-run
estimate resolves approximate-evidence parameters (z, b, target FP rate) and prints the resulting effective kmer size and per-kmer / per-z-window false-positive rates without touching any index. Used to calibrate Approx { b, z } before building or reindexing.