Files

T

Eric Coissac da56c3e290 docs: update architecture and storage specs for approximate index

Restructure architecture documentation to reflect the decoupled `MphfLayer` design wrapped by `LayeredStore<S>` and enforce strict multi-genome column invariants. Introduce the approximate index architecture, replacing exact `evidence.bin` with compact `fingerprint.bin` using B-bit fingerprints and z-consecutive k-mer matching. Update CLI flags, add `reindex`/`estimate` workflows, and refactor APIs to support separate exact/approximate evidence handling. Finally, provide a comprehensive on-disk layout specification, including the pipeline state machine, JSON schemas, binary formats, and refined Strategy B unitig evidence details.

2026-05-23 13:54:31 +02:00

12 KiB

Raw Blame History

Kmer index architecture

Fundamental invariant

A given canonical kmer belongs to exactly one partition and exactly one layer within that partition. This property makes all aggregation operations decomposable and parallelisable without coordination.

Three-level hierarchy

KmerIndex  (index.meta + KmerPartition)
├── partition_0/index/          one directory per minimiser bucket
│   ├── meta.json               PartitionMeta { n_layers }
│   ├── layer_0/
│   │   ├── layer_meta.json     LayerMeta { evidence: EvidenceKind }
│   │   ├── mphf.bin            PtrHash MPHF
│   │   ├── unitigs.bin         unitig spine (never overwritten)
│   │   ├── evidence.bin        exact evidence  (Exact only)
│   │   ├── unitigs.bin.idx     block index     (Exact only)
│   │   ├── fingerprint.bin     fingerprints    (Approx only)
│   │   ├── counts/             PersistentCompactIntMatrix  (with_counts = true)
│   │   └── presence/           PersistentBitMatrix
│   └── layer_1/
│       └── ...
└── partition_1/index/
    └── ...

KmerIndex: root entry point. Owns IndexMeta (written to index.meta) and a KmerPartition that routes canonical kmers to partition directories. All partition-level operations are dispatched in parallel via rayon.

Partition directory: one directory per minimiser bucket. PartitionMeta (stored as meta.json) records n_layers. Layers within a partition cover disjoint kmer sets.

Layer directory: one MphfLayer plus optional data stores. LayerMeta (stored as layer_meta.json) records which EvidenceKind was used. The MPHF and unitigs.bin are immutable once built; evidence files are the only part replaced by reindex.

IndexConfig and IndexMeta

pub struct IndexConfig {
    pub kmer_size:      usize,
    pub minimizer_size: usize,
    pub n_bits:         usize,       // log2(n_partitions)
    pub with_counts:    bool,
    pub evidence:       EvidenceKind,
    pub block_bits:     u8,          // .idx granularity: 2^block_bits unitigs/block; 0 = one entry per unitig
}

pub struct IndexMeta {
    pub version: u32,
    pub config:  IndexConfig,
    pub genomes: Vec<GenomeInfo>,    // ordered; index = genome column number
}

pub struct GenomeInfo {
    pub label: String,
    pub meta:  HashMap<String, String>,  // arbitrary categorical metadata
}

IndexMeta is serialised as index.meta (JSON). It is the authority for the ordered list of genomes and for the parameters that govern all subsequent operations on the index.

EvidenceKind

pub enum EvidenceKind {
    Exact,
    Approx { b: u8, z: u8 },
}

Controls which files are written per layer and which query path is taken:

Variant	Files written	False-positive rate
`Exact`	`evidence.bin`, `unitigs.bin.idx`	0
`Approx { b, z }`	`fingerprint.bin`	≈ W / 2^(b·z) per read (Findere)

EvidenceKind is stored both in IndexConfig (index-wide default, updated by reindex) and in each LayerMeta (per-layer record of what was actually built).

MphfLayer — autonomous kmer → slot mapping

pub struct MphfLayer {
    mphf: PtrHash<…>,
    ev:   LayerEvidence,   // Exact { evidence, unitigs } | Approx { fingerprint }
    n:    usize,
}

MphfLayer::find(kmer) dispatches transparently to find_exact or find_approx based on the evidence loaded at open time (read from layer_meta.json). Returns Some(slot) only if the kmer is confirmed present; None for absent or out-of-range.

find_exact:  slot = mphf(kmer); decode evidence → (chunk_id, rank); verify kmer in unitigs
find_approx: slot = mphf(kmer); check fingerprint[slot] == seq_hash(kmer)

block_bits controls the .idx file written alongside evidence.bin. At block_bits = 0, every unitig chunk has an index entry, giving O(1) random access; larger values trade access time for a smaller .idx.

The MPHF and unitigs.bin are never rebuilt by any post-build operation.

Layer<D> — MPHF + data payload

pub struct Layer<D: LayerData = ()> {
    mphf: MphfLayer,
    data: D,
}

D selects the attached data payload:

`D`	Data directory	`Item` returned by `query`
`()`	—	`()` (set membership only)
`PersistentCompactIntMatrix`	`counts/`	`Box<[u32]>` (counts per genome)
`PersistentBitMatrix`	`presence/`	`Box<[bool]>` (presence per genome)

Layer::query(kmer) delegates to MphfLayer::find, then calls data.read(slot) if a slot is returned. Both exact and approximate evidence are handled transparently; the caller sees only Option<Hit<D::Item>>.

Build-time entry points:

Layer<()>::build(out_dir, block_bits)                         // set membership
Layer<PersistentCompactIntMatrix>::build(out_dir, block_bits, count_of)
Layer<PersistentBitMatrix>::build_presence(out_dir, block_bits, n_genomes, present_in)

Layer::<()>::build_evidence(layer_dir, kind, block_bits)      // evidence only (reindex path)

DataStore — slot-indexed data

PersistentCompactIntMatrix and PersistentBitMatrix are slot-indexed stores. They know nothing about kmers or MPHFs.

Type	`Item`	Aggregation method	Use
`PersistentCompactIntMatrix`	`Box<[u32]>`	`sum() → Array1<u64>`	counts per genome per slot
`PersistentBitMatrix`	`Box<[bool]>`	`count_ones() → Array1<u64>`	presence per genome per slot

Aggregation traits — `obicompactvec::traits`

Three traits unify the aggregation API across all hierarchy levels.

trait ColumnWeights: Send + Sync {
    fn col_weights(&self) -> Array1<u64>;
}

trait CountPartials: ColumnWeights {
    fn partial_bray(&self)                                       -> Array2<u64>;
    fn partial_euclidean(&self)                                  -> Array2<f64>;
    fn partial_threshold_jaccard(&self, threshold: u32)          -> (Array2<u64>, Array2<u64>);
    fn partial_relfreq_bray(&self, global: &Array1<u64>)         -> Array2<f64>;
    fn partial_relfreq_euclidean(&self, global: &Array1<u64>)    -> Array2<f64>;
    fn partial_hellinger(&self, global: &Array1<u64>)            -> Array2<f64>;
    // provided finalisation methods with default impls
    fn bray_dist_matrix(&self)                                   -> Array2<f64> { … }
    fn relfreq_bray_dist_matrix(&self)                           -> Array2<f64> { … }
    // …
}

trait BitPartials: ColumnWeights {
    fn partial_jaccard(&self)  -> (Array2<u64>, Array2<u64>);
    fn partial_hamming(&self)  -> Array2<u64>;
    // provided
    fn jaccard_dist_matrix(&self) -> Array2<f64> { … }
    fn hamming_dist_matrix(&self) -> Array2<u64> { … }
}

Leaf implementors:

Type	Traits
`PersistentCompactIntMatrix`	`ColumnWeights`, `CountPartials`
`PersistentBitMatrix`	`ColumnWeights`, `BitPartials`

LayeredStore<S> — recursive aggregation wrapper

pub struct LayeredStore<S>(Vec<S>);

Three blanket impls propagate all traits up the hierarchy:

impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> { … }
impl<S: CountPartials> CountPartials  for LayeredStore<S> { … }
impl<S: BitPartials>   BitPartials    for LayeredStore<S> { … }

This makes LayeredStore<LayeredStore<PersistentCompactIntMatrix>> automatically implement CountPartials — no separate PartitionedStore type is needed:

PersistentCompactIntMatrix                         leaf (one layer)
LayeredStore<PersistentCompactIntMatrix>           one partition (layers are disjoint)
LayeredStore<LayeredStore<…>>                      whole index   (partitions are independent)

Normalised metrics require global column sums — computed in a two-pass cascade:

// on LayeredStore<LayeredStore<PersistentCompactIntMatrix>>
fn relfreq_bray_dist_matrix(&self) -> Array2<f64> {
    let global = self.col_weights();             // pass 1 — sums up hierarchy
    let p      = self.partial_relfreq_bray(&global); // pass 2 — global broadcast read-only
    p.mapv(|v| 1.0 - v)
}

Because each kmer belongs to exactly one (partition, layer) pair, col_weights() has no double-counting across the hierarchy.

Progressive aggregation principle

No level reaches two levels down. Each level sums contributions from the level immediately below:

PersistentCompactIntMatrix::col_weights()              — one (partition, layer)
        ↓ Σ across layers
LayeredStore<PersistentCompactIntMatrix>::col_weights() — one partition
        ↓ Σ across partitions
LayeredStore<LayeredStore<…>>::col_weights()            — global

The same cascade applies to every partial method.

Multi-genome column invariant

After any merge, every layer in every partition has exactly n_genomes columns, where n_genomes is the current total in index.meta. This holds for both PersistentCompactIntMatrix and PersistentBitMatrix.

Maintained by three coordinated operations:

Existing layers — column append. Layer::append_genome_column appends one column to each existing layer. Slots matching the incoming genome receive its count or true; all other slots receive 0 or false.

New layers — absent columns prepended. When a new layer is created for kmers unique to the incoming genome, n_existing_genomes absent columns are prepended before the incoming genome's column, so the new layer immediately has the same column count as all other layers.

First merge, Presence mode — init_presence_matrix. The initial single-genome index has no presence/ directory (presence is implicit). On the first merge, Layer<()>::init_presence_matrix materialises genome 0's presence column (all true) retroactively, raising the column count from 0 to 1 before appending column 1.

This invariant is the precondition for correct progressive aggregation: every level can blindly sum matrices from below because all matrices have the same shape.

Query model

Point query

minimiser(kmer) → partition p
for each layer l in p:
    if let Some(slot) = MphfLayer_l.find(kmer):
        return data_l.read(slot)
return None

O(n_layers) MPHF probes worst case; O(1) expected. The result comes from exactly one (partition, layer).

Aggregation

result = reduce(
    for p in partitions:       // parallel
        for l in layers(p):    // parallel
            partial(data_p_l)
)

For normalised metrics, replace with the two-pass cascade.

Parallelism model

Level	Unit	Coordination
Across partitions	inner stores of `LayeredStore<LayeredStore<S>>`	none
Across layers within a partition	inner stores of `LayeredStore<S>`	none — disjoint kmer sets
Normalised pass 1 (`col_weights`)	per inner store	none — additive
Normalised pass 2 (partial)	per inner store	`global` broadcast read-only
Within a matrix (distance)	upper-triangle pair `(i,j)`	none — rayon `par_iter`

reindex — evidence conversion in place

KmerIndex::reindex(target, block_bits) converts every layer's evidence bundle to target without touching the MPHF or unitigs.bin:

→ Exact: builds evidence.bin + unitigs.bin.idx; removes fingerprint.bin
→ Approx { b, z }: builds fingerprint.bin; removes evidence.bin + unitigs.bin.idx

On success, IndexConfig::evidence and IndexConfig::block_bits are updated in index.meta. Each layer's layer_meta.json is also rewritten with the new EvidenceKind.

estimate — parameter dry-run

estimate resolves approximate-evidence parameters (z, b, target FP rate) and prints the resulting effective kmer size and per-kmer / per-z-window false-positive rates without touching any index. Used to calibrate Approx { b, z } before building or reindexing.

12 KiB Raw Blame History