docmd/architecture/index_architecture.md

# Kmer index architecture

## Fundamental invariant

A given canonical kmer belongs to **exactly one partition** and **exactly one layer** within that partition. This property makes all aggregation operations decomposable and parallelisable without coordination.

---

## Three-level hierarchy

```
KmerIndex  (index.meta + KmerPartition)
├── partition_0/index/          one directory per minimiser bucket
│   ├── meta.json               PartitionMeta { n_layers }
│   ├── layer_0/
│   │   ├── layer_meta.json     LayerMeta { evidence: EvidenceKind }
│   │   ├── mphf.bin            PtrHash MPHF
│   │   ├── unitigs.bin         unitig spine (never overwritten)
│   │   ├── evidence.bin        exact evidence  (Exact only)
│   │   ├── unitigs.bin.idx     block index     (Exact only)
│   │   ├── fingerprint.bin     fingerprints    (Approx only)
│   │   ├── counts/             PersistentCompactIntMatrix  (with_counts = true)
│   │   └── presence/           PersistentBitMatrix
│   └── layer_1/
│       └── ...
└── partition_1/index/
    └── ...
```

**KmerIndex**: root entry point. Owns `IndexMeta` (written to `index.meta`) and a `KmerPartition` that routes canonical kmers to partition directories. All partition-level operations are dispatched in parallel via rayon.

**Partition directory**: one directory per minimiser bucket. `PartitionMeta` (stored as `meta.json`) records `n_layers`. Layers within a partition cover disjoint kmer sets.

**Layer directory**: one `MphfLayer` plus optional data stores. `LayerMeta` (stored as `layer_meta.json`) records which `EvidenceKind` was used. The MPHF and `unitigs.bin` are immutable once built; evidence files are the only part replaced by `reindex`.

---

## IndexConfig and IndexMeta

```rust
pub struct IndexConfig {
    pub kmer_size:      usize,
    pub minimizer_size: usize,
    pub n_bits:         usize,       // log2(n_partitions)
    pub with_counts:    bool,
    pub evidence:       EvidenceKind,
    pub block_bits:     u8,          // .idx granularity: 2^block_bits unitigs/block; 0 = one entry per unitig
}

pub struct IndexMeta {
    pub version: u32,
    pub config:  IndexConfig,
    pub genomes: Vec<GenomeInfo>,    // ordered; index = genome column number
}

pub struct GenomeInfo {
    pub label: String,
    pub meta:  HashMap<String, String>,  // arbitrary categorical metadata
}
```

`IndexMeta` is serialised as `index.meta` (JSON). It is the authority for the ordered list of genomes and for the parameters that govern all subsequent operations on the index.

---

## EvidenceKind

```rust
pub enum EvidenceKind {
    Exact,
    Approx { b: u8, z: u8 },
}
```

Controls which files are written per layer and which query path is taken:

| Variant | Files written | False-positive rate |
|---|---|---|
| `Exact` | `evidence.bin`, `unitigs.bin.idx` | 0 |
| `Approx { b, z }` | `fingerprint.bin` | ≈ W / 2^(b·z) per read (Findere) |

`EvidenceKind` is stored both in `IndexConfig` (index-wide default, updated by `reindex`) and in each `LayerMeta` (per-layer record of what was actually built).

---

## MphfLayer — autonomous kmer → slot mapping

```rust
pub struct MphfLayer {
    mphf: PtrHash<…>,
    ev:   LayerEvidence,   // Exact { evidence, unitigs } | Approx { fingerprint }
    n:    usize,
}
```

`MphfLayer::find(kmer)` dispatches transparently to `find_exact` or `find_approx` based on the evidence loaded at `open` time (read from `layer_meta.json`). Returns `Some(slot)` only if the kmer is confirmed present; `None` for absent or out-of-range.

```
find_exact:  slot = mphf(kmer); decode evidence → (chunk_id, rank); verify kmer in unitigs
find_approx: slot = mphf(kmer); check fingerprint[slot] == seq_hash(kmer)
```

`block_bits` controls the `.idx` file written alongside `evidence.bin`. At `block_bits = 0`, every unitig chunk has an index entry, giving O(1) random access; larger values trade access time for a smaller `.idx`.

The MPHF and `unitigs.bin` are never rebuilt by any post-build operation.

---

## Layer\<D\> — MPHF + data payload

```rust
pub struct Layer<D: LayerData = ()> {
    mphf: MphfLayer,
    data: D,
}
```

`D` selects the attached data payload:

| `D` | Data directory | `Item` returned by `query` |
|---|---|---|
| `()` | — | `()` (set membership only) |
| `PersistentCompactIntMatrix` | `counts/` | `Box<[u32]>` (counts per genome) |
| `PersistentBitMatrix` | `presence/` | `Box<[bool]>` (presence per genome) |

`Layer::query(kmer)` delegates to `MphfLayer::find`, then calls `data.read(slot)` if a slot is returned. Both exact and approximate evidence are handled transparently; the caller sees only `Option<Hit<D::Item>>`.

Build-time entry points:

```rust
Layer<()>::build(out_dir, block_bits)                         // set membership
Layer<PersistentCompactIntMatrix>::build(out_dir, block_bits, count_of)
Layer<PersistentBitMatrix>::build_presence(out_dir, block_bits, n_genomes, present_in)

Layer::<()>::build_evidence(layer_dir, kind, block_bits)      // evidence only (reindex path)
```

---

## DataStore — slot-indexed data

`PersistentCompactIntMatrix` and `PersistentBitMatrix` are slot-indexed stores. They know nothing about kmers or MPHFs.

| Type | `Item` | Aggregation method | Use |
|---|---|---|---|
| `PersistentCompactIntMatrix` | `Box<[u32]>` | `sum() → Array1<u64>` | counts per genome per slot |
| `PersistentBitMatrix` | `Box<[bool]>` | `count_ones() → Array1<u64>` | presence per genome per slot |

---

## Aggregation traits — `obicompactvec::traits`

Three traits unify the aggregation API across all hierarchy levels.

```rust
trait ColumnWeights: Send + Sync {
    fn col_weights(&self) -> Array1<u64>;
}

trait CountPartials: ColumnWeights {
    fn partial_bray(&self)                                       -> Array2<u64>;
    fn partial_euclidean(&self)                                  -> Array2<f64>;
    fn partial_threshold_jaccard(&self, threshold: u32)          -> (Array2<u64>, Array2<u64>);
    fn partial_relfreq_bray(&self, global: &Array1<u64>)         -> Array2<f64>;
    fn partial_relfreq_euclidean(&self, global: &Array1<u64>)    -> Array2<f64>;
    fn partial_hellinger(&self, global: &Array1<u64>)            -> Array2<f64>;
    // provided finalisation methods with default impls
    fn bray_dist_matrix(&self)                                   -> Array2<f64> { … }
    fn relfreq_bray_dist_matrix(&self)                           -> Array2<f64> { … }
    // …
}

trait BitPartials: ColumnWeights {
    fn partial_jaccard(&self)  -> (Array2<u64>, Array2<u64>);
    fn partial_hamming(&self)  -> Array2<u64>;
    // provided
    fn jaccard_dist_matrix(&self) -> Array2<f64> { … }
    fn hamming_dist_matrix(&self) -> Array2<u64> { … }
}
```

Leaf implementors:

| Type | Traits |
|---|---|
| `PersistentCompactIntMatrix` | `ColumnWeights`, `CountPartials` |
| `PersistentBitMatrix` | `ColumnWeights`, `BitPartials` |

---

## LayeredStore\<S\> — recursive aggregation wrapper

```rust
pub struct LayeredStore<S>(Vec<S>);
```

Three blanket impls propagate all traits up the hierarchy:

```rust
impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> { … }
impl<S: CountPartials> CountPartials  for LayeredStore<S> { … }
impl<S: BitPartials>   BitPartials    for LayeredStore<S> { … }
```

This makes `LayeredStore<LayeredStore<PersistentCompactIntMatrix>>` automatically implement `CountPartials` — no separate `PartitionedStore` type is needed:

```
PersistentCompactIntMatrix                         leaf (one layer)
LayeredStore<PersistentCompactIntMatrix>           one partition (layers are disjoint)
LayeredStore<LayeredStore<…>>                      whole index   (partitions are independent)
```

Normalised metrics require global column sums — computed in a two-pass cascade:

```rust
// on LayeredStore<LayeredStore<PersistentCompactIntMatrix>>
fn relfreq_bray_dist_matrix(&self) -> Array2<f64> {
    let global = self.col_weights();             // pass 1 — sums up hierarchy
    let p      = self.partial_relfreq_bray(&global); // pass 2 — global broadcast read-only
    p.mapv(|v| 1.0 - v)
}
```

Because each kmer belongs to exactly one `(partition, layer)` pair, `col_weights()` has no double-counting across the hierarchy.

---

## Progressive aggregation principle

No level reaches two levels down. Each level sums contributions from the level immediately below:

```
PersistentCompactIntMatrix::col_weights()              — one (partition, layer)
        ↓ Σ across layers
LayeredStore<PersistentCompactIntMatrix>::col_weights() — one partition
        ↓ Σ across partitions
LayeredStore<LayeredStore<…>>::col_weights()            — global
```

The same cascade applies to every partial method.

---

## Multi-genome column invariant

After any merge, every layer in every partition has exactly `n_genomes` columns, where `n_genomes` is the current total in `index.meta`. This holds for both `PersistentCompactIntMatrix` and `PersistentBitMatrix`.

Maintained by three coordinated operations:

**Existing layers — column append.** `Layer::append_genome_column` appends one column to each existing layer. Slots matching the incoming genome receive its count or `true`; all other slots receive 0 or `false`.

**New layers — absent columns prepended.** When a new layer is created for kmers unique to the incoming genome, `n_existing_genomes` absent columns are prepended before the incoming genome's column, so the new layer immediately has the same column count as all other layers.

**First merge, Presence mode — `init_presence_matrix`.** The initial single-genome index has no `presence/` directory (presence is implicit). On the first merge, `Layer<()>::init_presence_matrix` materialises genome 0's presence column (all `true`) retroactively, raising the column count from 0 to 1 before appending column 1.

This invariant is the precondition for correct progressive aggregation: every level can blindly sum matrices from below because all matrices have the same shape.

---

## Query model

### Point query

```
minimiser(kmer) → partition p
for each layer l in p:
    if let Some(slot) = MphfLayer_l.find(kmer):
        return data_l.read(slot)
return None
```

O(n_layers) MPHF probes worst case; O(1) expected. The result comes from exactly one `(partition, layer)`.

### Aggregation

```
result = reduce(
    for p in partitions:       // parallel
        for l in layers(p):    // parallel
            partial(data_p_l)
)
```

For normalised metrics, replace with the two-pass cascade.

---

## Parallelism model

| Level | Unit | Coordination |
|---|---|---|
| Across partitions | inner stores of `LayeredStore<LayeredStore<S>>` | none |
| Across layers within a partition | inner stores of `LayeredStore<S>` | none — disjoint kmer sets |
| Normalised pass 1 (`col_weights`) | per inner store | none — additive |
| Normalised pass 2 (partial) | per inner store | `global` broadcast read-only |
| Within a matrix (distance) | upper-triangle pair `(i,j)` | none — rayon `par_iter` |

---

## reindex — evidence conversion in place

`KmerIndex::reindex(target, block_bits)` converts every layer's evidence bundle to `target` without touching the MPHF or `unitigs.bin`:

- `→ Exact`: builds `evidence.bin` + `unitigs.bin.idx`; removes `fingerprint.bin`
- `→ Approx { b, z }`: builds `fingerprint.bin`; removes `evidence.bin` + `unitigs.bin.idx`

On success, `IndexConfig::evidence` and `IndexConfig::block_bits` are updated in `index.meta`. Each layer's `layer_meta.json` is also rewritten with the new `EvidenceKind`.

---

## estimate — parameter dry-run

`estimate` resolves approximate-evidence parameters (`z`, `b`, target FP rate) and prints the resulting effective kmer size and per-kmer / per-z-window false-positive rates without touching any index. Used to calibrate `Approx { b, z }` before building or reindexing.