docs: clarify MPHF indexing, storage layout, and distance traits

Formalize the two-phase MPHF indexing architecture and update Phase 6 to use `evidence.bin` for direct kmer extraction. Simplify the evidence and unitig storage layouts to flat packed formats enabling O(1) random access. Introduce aggregation traits (`ColumnWeights`, `CountPartials`, `BitPartials`) to support additive distance metric decomposition across partitions. Narrow the documented scope from metagenomic to individual genome datasets, and replace speculative open questions with concrete implementation specifications.
Eric Coissac
2026-05-17 10:20:22 +08:00
parent cf693f17f2
commit f36b095ce2
17 changed files with 916 additions and 1031 deletions
@@ -236,3 +236,35 @@ impl LayerData for PersistentBitMatrix {
fn read(&self, slot: usize) -> Box<[bool]> { self.row(slot) }
}
```
---
## Aggregation traits — `obicompactvec::traits`
`PersistentBitMatrix` implements two aggregation traits used by `LayeredStore<S>` for cross-layer and cross-partition distance computations.
### ColumnWeights
```rust
impl ColumnWeights for PersistentBitMatrix {
fn col_weights(&self) -> Array1<u64> // = self.count_ones()
}
```
`col_weights()[c]` = number of set bits in column `c` across all slots.
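As an illustration of why column weights aggregate cleanly, here is a minimal sketch using plain `Vec`s instead of `ndarray`. The `Rows` alias and the `sum_weights` helper are hypothetical names introduced for this example and are not part of `obicompactvec`. It assumes that layers hold disjoint sets of slots, so the per-layer counts can simply be added.

```rust
/// Hypothetical stand-in for a bit matrix: one `Vec<bool>` per slot,
/// one entry per column. This is not the on-disk layout of `PersistentBitMatrix`.
type Rows = Vec<Vec<bool>>;

/// Count the set bits of each column across all slots of one layer.
fn col_weights(rows: &Rows, n_cols: usize) -> Vec<u64> {
    let mut weights = vec![0u64; n_cols];
    for row in rows {
        // `zip` stops at the shorter side, so a short row cannot index out of bounds.
        for (w, &bit) in weights.iter_mut().zip(row) {
            *w += bit as u64;
        }
    }
    weights
}

/// Element-wise sum of per-layer weight vectors (assumed to share one length).
fn sum_weights(parts: &[Vec<u64>]) -> Vec<u64> {
    let n = parts.first().map_or(0, |p| p.len());
    let mut total = vec![0u64; n];
    for part in parts {
        for (t, &w) in total.iter_mut().zip(part) {
            *t += w;
        }
    }
    total
}
```

Under these assumptions, summing the weights of two layers gives the same vector as counting over the concatenated slots.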
### BitPartials
```rust
impl BitPartials for PersistentBitMatrix {
// Self-contained partials (additive across layers)
fn partial_jaccard(&self) -> (Array2<u64>, Array2<u64>) // (inter, union)
fn partial_hamming(&self) -> Array2<u64> // differing bits
// Provided finalisations
fn jaccard_dist_matrix(&self) -> Array2<f64>
fn hamming_dist_matrix(&self) -> Array2<u64>
}
```
`partial_jaccard` returns `(inter, union)` as a pair because `union` is not reconstructible from per-column `count_ones()` alone: it depends on both columns simultaneously. Both components are additively decomposable across `(partition, layer)` pairs; the final `jaccard_dist_matrix()` is computed from their element-wise sums.
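The decomposition described above can be sketched as follows, again with plain `Vec`s rather than `ndarray`. The names `Rows`, `Counts`, `add_matrices`, `jaccard_dist` and `hamming_from` are hypothetical, and the free function `partial_jaccard` only mirrors the trait method's name. The sketch assumes the usual definition of Jaccard distance, `1 - inter / union`, and assumes a distance of `0.0` when the union is empty. The real implementation may handle that case differently.

```rust
/// Hypothetical stand-ins: one `Vec<bool>` per slot, and a square count matrix.
type Rows = Vec<Vec<bool>>;
type Counts = Vec<Vec<u64>>;

/// Per-layer partials: for every column pair (i, j), count the slots where
/// both bits are set (`inter`) and where at least one is set (`uni`).
/// Assumes every row has at least `n` entries.
fn partial_jaccard(rows: &Rows, n: usize) -> (Counts, Counts) {
    let mut inter = vec![vec![0u64; n]; n];
    let mut uni = vec![vec![0u64; n]; n];
    for row in rows {
        for i in 0..n {
            for j in 0..n {
                inter[i][j] += (row[i] && row[j]) as u64;
                uni[i][j] += (row[i] || row[j]) as u64;
            }
        }
    }
    (inter, uni)
}

/// Element-wise sum of two partial matrices of identical shape.
fn add_matrices(a: &Counts, b: &Counts) -> Counts {
    a.iter()
        .zip(b)
        .map(|(ra, rb)| ra.iter().zip(rb).map(|(&x, &y)| x + y).collect())
        .collect()
}

/// Finalisation: 1 - inter / union, with 0.0 for an empty union (an assumption).
fn jaccard_dist(inter: &Counts, uni: &Counts) -> Vec<Vec<f64>> {
    inter
        .iter()
        .zip(uni)
        .map(|(ri, ru)| {
            ri.iter()
                .zip(ru)
                .map(|(&i, &u)| if u == 0 { 0.0 } else { 1.0 - i as f64 / u as f64 })
                .collect()
        })
        .collect()
}

/// Differing bits per column pair: slots in the union but not in the intersection.
fn hamming_from(inter: &Counts, uni: &Counts) -> Counts {
    inter
        .iter()
        .zip(uni)
        .map(|(ri, ru)| ri.iter().zip(ru).map(|(&i, &u)| u - i).collect())
        .collect()
}
```

Only the integer partials are summed across layers. The division happens once, on the totals, because a ratio of sums is not the sum of per-layer ratios.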