docs: document k-mer index architecture and refactor distance metrics

Add comprehensive documentation for the `obilayeredmap` crate, `PersistentCompactIntVec`, `PersistentBitVec`, and the hierarchical k-mer index architecture, including sidebar navigation updates across all documentation pages. Refactor the Bray-Curtis distance computation in `obicompactvec` to decouple numerator and denominator calculations, replacing direct pairwise calls with explicit loops over precomputed sums. Update tests to verify column sum accuracy and align with the simplified API.
Eric Coissac
2026-05-15 21:07:23 +08:00
parent 8409c852ef
commit 45d49ed501
25 changed files with 8842 additions and 117 deletions
+316 -3
@@ -745,6 +745,89 @@
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#multilayer-index-architecture" class="md-nav__link">
<span class="md-ellipsis">
Multilayer index architecture
</span>
</a>
<nav class="md-nav" aria-label="Multilayer index architecture">
<ul class="md-nav__list">
<li class="md-nav__item">
<a href="#motivation" class="md-nav__link">
<span class="md-ellipsis">
Motivation
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#layer-structure" class="md-nav__link">
<span class="md-ellipsis">
Layer structure
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#membership-verification" class="md-nav__link">
<span class="md-ellipsis">
Membership verification
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#query-algorithm" class="md-nav__link">
<span class="md-ellipsis">
Query algorithm
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#layer-count-and-probe-cost" class="md-nav__link">
<span class="md-ellipsis">
Layer count and probe cost
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#merging-layers" class="md-nav__link">
<span class="md-ellipsis">
Merging layers
</span>
</a>
</li>
</ul>
</nav>
</li>
<li class="md-nav__item">
@@ -795,6 +878,90 @@
<li class="md-nav__item">
<a href="../obilayeredmap/" class="md-nav__link">
<span class="md-ellipsis">
obilayeredmap crate
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../persistent_compact_int_vec/" class="md-nav__link">
<span class="md-ellipsis">
PersistentCompactIntVec
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../persistent_bit_vec/" class="md-nav__link">
<span class="md-ellipsis">
PersistentBitVec
</span>
</a>
</li>
</ul>
</nav>
@@ -877,6 +1044,34 @@
<li class="md-nav__item">
<a href="../../architecture/index_architecture/" class="md-nav__link">
<span class="md-ellipsis">
Kmer index
</span>
</a>
</li>
</ul>
</nav>
@@ -1002,6 +1197,89 @@
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#multilayer-index-architecture" class="md-nav__link">
<span class="md-ellipsis">
Multilayer index architecture
</span>
</a>
<nav class="md-nav" aria-label="Multilayer index architecture">
<ul class="md-nav__list">
<li class="md-nav__item">
<a href="#motivation" class="md-nav__link">
<span class="md-ellipsis">
Motivation
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#layer-structure" class="md-nav__link">
<span class="md-ellipsis">
Layer structure
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#membership-verification" class="md-nav__link">
<span class="md-ellipsis">
Membership verification
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#query-algorithm" class="md-nav__link">
<span class="md-ellipsis">
Query algorithm
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#layer-count-and-probe-cost" class="md-nav__link">
<span class="md-ellipsis">
Layer count and probe cost
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#merging-layers" class="md-nav__link">
<span class="md-ellipsis">
Merging layers
</span>
</a>
</li>
</ul>
</nav>
</li>
<li class="md-nav__item">
@@ -1071,7 +1349,7 @@
</ul>
<h2 id="mphf-choice-per-phase">MPHF choice per phase</h2>
<p><strong>Phase 1</strong> (provisional, discarded after spectrum computation): FMPHGO. Tolerates overestimated capacity, compact, no need to optimise for query speed on a temporary structure.</p>
<p><strong>Phase 2</strong> (persistent, queried repeatedly): open between FMPHGO and ptr_hash. Exact key count is available, so both operate optimally. ptr_hash's query speed advantage (2.1–3.3×) is meaningful for the persistent index but carries the risk of a very young crate. FMPHGO is the conservative default; ptr_hash is worth revisiting once it has broader production use.</p>
<p><strong>Phase 2</strong> (persistent, queried repeatedly): <strong>ptr_hash</strong>. Exact key count is available at phase 2, so ptr_hash operates optimally. Its query speed (≥2.1× over FMPHGO) and construction speed (≥3.1×) are meaningful for the persistent index; the space overhead at 2.4 bits/key is acceptable. The crate's youth (Feb 2025) was previously a concern; it is now accepted given the performance profile and the fact that each layer MPHF is independently rebuildable from its unitig file if needed.</p>
<p>boomphf is effectively eliminated: its space overhead is the largest and its streaming-construction advantage does not apply here.</p>
<hr />
<h2 id="space-at-scale">Space at scale</h2>
@@ -1106,12 +1384,47 @@
<h2 id="on-disk-and-mmap-considerations">On-disk and mmap considerations</h2>
<p>All three are in-memory structures. Their internal representation is flat bit arrays (no heap pointers), making them serialisable as contiguous byte blobs and mmappable per partition. True zero-copy access would require rkyv integration; the <code>ph</code> crate currently uses serde, so loading involves a copy. Given per-partition MPHF sizes of 1–8 MB, the OS page cache handles this transparently; strict zero-copy is a refinement, not a blocker.</p>
<p>No established Rust crate provides a natively on-disk MPHF. <strong>SSHash</strong> (Sparse and Skew Hash) is a complete kmer dictionary designed for disk access and is order-preserving (overlapping kmers receive consecutive indices → cache-friendly count access), but it is C++-only and covers more than just the MPHF layer.</p>
<hr />
<h2 id="multilayer-index-architecture">Multilayer index architecture</h2>
<h3 id="motivation">Motivation</h3>
<p>An index built from a single dataset A can be extended with a new dataset B without rebuilding. This supports incremental construction (adding species, samples, or sequencing runs) and enables set operations across heterogeneous sources.</p>
<h3 id="layer-structure">Layer structure</h3>
<p>Each layer is a self-contained unit:</p>
<div class="highlight"><pre><span></span><code>layer_i/
  unitigs.bin   # packed 2-bit nucleotide sequences
  mphf.bin      # ptr_hash index (phase-2, exact key count)
  evidence.bin  # [(unitig_id, rank)] per MPHF slot (see unitig_evidence.md)
  counts.bin    # [u32] per MPHF slot
</code></pre></div>
<p>Layers are <strong>disjoint</strong>: a canonical kmer belongs to exactly one layer. Layer 0 is built from dataset A. Adding dataset B proceeds as follows:</p>
<ol>
<li>For each kmer in B: query layer 0 — if found, accumulate count into <code>counts_0[MPHF_0(kmer)]</code>.</li>
<li>Collect all kmers of B not present in any existing layer → set <code>B \ A</code>.</li>
<li>Build layer 1 from <code>B \ A</code> using the standard two-phase pipeline (spectrum, filter, ptr_hash).</li>
</ol>
<p>Adding a third dataset C repeats the process: probe layer 0, then layer 1, then build layer 2 from <code>C \ A \ B</code>.</p>
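<p>The steps above can be sketched as follows. This is a minimal illustration, not the crate's implementation: each layer is modelled as a plain <code>HashMap</code> from a canonical kmer (assumed to be 2-bit packed into a <code>u64</code>) to its count, standing in for the MPHF, evidence and counts files. The names <code>Layer</code> and <code>add_dataset</code> are hypothetical.</p>

```rust
use std::collections::HashMap;

/// Toy stand-in for one layer: canonical kmer -> count.
/// A real layer would hold an MPHF plus evidence and counts arrays instead.
type Layer = HashMap<u64, u32>;

/// Add a dataset to an existing layer chain.
/// Kmers already present in some layer have their count accumulated in place;
/// the remaining kmers form a new layer appended to the chain.
fn add_dataset(layers: &mut Vec<Layer>, dataset: &[u64]) {
    let mut new_layer: Layer = HashMap::new();
    for &kmer in dataset {
        // Probe existing layers in order; layers are disjoint, so at most one matches.
        match layers.iter_mut().find_map(|layer| layer.get_mut(&kmer)) {
            Some(count) => *count += 1,
            None => *new_layer.entry(kmer).or_insert(0) += 1,
        }
    }
    // Only create a layer when the dataset contributed unseen kmers.
    if !new_layer.is_empty() {
        layers.push(new_layer);
    }
}
```

<p>Calling <code>add_dataset</code> first with dataset A and then with dataset B leaves the kmers of A in layer 0 (with counts updated by B) and places only <code>B \ A</code> in layer 1, so the layers stay disjoint.</p>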
<h3 id="membership-verification">Membership verification</h3>
<p>ptr_hash maps any input to a valid slot — it does not natively detect absent keys. Membership is verified using the evidence entry: decode the kmer from <code>(unitig_id, rank)</code> and compare to the query. A mismatch means the kmer is absent from this layer; probe the next layer.</p>
<p>This makes the evidence layer load-bearing for correctness, not only for locality.</p>
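<p>As an illustration of the check described above, the sketch below decodes a kmer from a 2-bit packed unitig and compares it with the query. The packing order (four nucleotides per byte, first nucleotide in the low bits), the <code>Evidence</code> struct and the function names are assumptions made for this example; the actual on-disk layout may differ. Canonicalisation (comparing against the reverse complement as well) is left out for brevity.</p>

```rust
/// One evidence entry: which unitig holds the kmer and at which nucleotide offset.
#[derive(Clone, Copy)]
struct Evidence {
    unitig_id: u32,
    rank: u32,
}

/// Read nucleotide `i` from a 2-bit packed sequence.
/// Assumed layout: 4 nucleotides per byte, first nucleotide in the low bits.
fn nucleotide(packed: &[u8], i: usize) -> u64 {
    ((packed[i / 4] >> ((i % 4) * 2)) & 0b11) as u64
}

/// Decode the kmer of length `k` (k <= 32) starting at `rank` into a u64,
/// first nucleotide in the most significant position.
fn decode_kmer(unitig: &[u8], rank: usize, k: usize) -> u64 {
    (0..k).fold(0u64, |acc, j| (acc << 2) | nucleotide(unitig, rank + j))
}

/// True if the kmer stored at the evidence position equals the query.
/// A `false` result means the query is absent from this layer.
fn is_member(unitigs: &[Vec<u8>], ev: Evidence, k: usize, query: u64) -> bool {
    let unitig = &unitigs[ev.unitig_id as usize];
    decode_kmer(unitig, ev.rank as usize, k) == query
}
```

<p>With the encoding A=0, C=1, G=2, T=3, the unitig <code>ACGTA</code> packs to the bytes <code>[0b11100100, 0]</code>; for k=3 the kmer at rank 1 is <code>CGT</code>, so a query for <code>CGT</code> passes the check and any other query routed to the same slot is rejected.</p>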
<h3 id="query-algorithm">Query algorithm</h3>
<div class="highlight"><pre><span></span><code>fn query(kmer) → Option&lt;count&gt;:
    for layer in layers:
        slot = layer.mphf.query(kmer)
        if layer.evidence.decode(slot) == kmer:
            return Some(layer.counts[slot])
    return None
</code></pre></div>
<p>Expected probe depth: 1 for kmers present in layer 0, increasing for rare kmers added in later layers. In practice, the dominant (largest) dataset should be used as layer 0 to minimise average probe depth.</p>
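<p>A runnable version of the query loop, under simplifying assumptions: the MPHF is imitated by a <code>HashMap</code> that falls back to an arbitrary valid slot for unknown kmers (mimicking the fact that an MPHF maps any input to some slot), and the evidence array stores the kmer itself rather than a <code>(unitig_id, rank)</code> pair. All type and method names are hypothetical, and layers are assumed to be non-empty.</p>

```rust
use std::collections::HashMap;

/// Toy layer: `slots` plays the role of the MPHF for indexed kmers,
/// `evidence` stores the kmer per slot, `counts` the count per slot.
struct Layer {
    slots: HashMap<u64, usize>,
    evidence: Vec<u64>,
    counts: Vec<u32>,
}

impl Layer {
    /// Build a layer from distinct (kmer, count) pairs; must not be empty.
    fn build(kmers: &[(u64, u32)]) -> Self {
        Layer {
            slots: kmers.iter().enumerate().map(|(i, &(k, _))| (k, i)).collect(),
            evidence: kmers.iter().map(|&(k, _)| k).collect(),
            counts: kmers.iter().map(|&(_, c)| c).collect(),
        }
    }

    /// Like an MPHF, always returns a valid slot, even for absent kmers.
    fn mphf(&self, kmer: u64) -> usize {
        self.slots
            .get(&kmer)
            .copied()
            .unwrap_or_else(|| (kmer as usize) % self.evidence.len())
    }
}

/// Probe layers in order; the evidence comparison rejects false slots.
fn query(layers: &[Layer], kmer: u64) -> Option<u32> {
    for layer in layers {
        let slot = layer.mphf(kmer);
        if layer.evidence[slot] == kmer {
            return Some(layer.counts[slot]);
        }
    }
    None
}
```

<p>A kmer stored in the second layer is first mapped to some slot of layer 0, fails the evidence comparison there, and is then found in layer 1; a kmer absent from every layer fails all comparisons and yields <code>None</code>.</p>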
<h3 id="layer-count-and-probe-cost">Layer count and probe cost</h3>
<p>Each probe is a ptr_hash lookup (~10 ns) plus one evidence decode (two array accesses). For L layers the worst case is L probes followed by a <code>None</code> result, which is what an absent kmer costs. In practice L is small (2–5 for typical multi-species databases). No global data structure is needed to route queries; the layer chain is traversed in order.</p>
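<p>To make the effect of layer ordering concrete, the expected number of probes per query can be computed from the fraction of queries answered by each layer. The function below is a small sketch; the hit fractions used with it are hypothetical inputs, not measurements.</p>

```rust
/// Expected number of layer probes per query.
/// `hit_fraction[i]` is the fraction of queries answered by layer i (costing
/// i + 1 probes); the remaining queries are absent kmers and traverse all layers.
fn expected_probe_depth(hit_fraction: &[f64]) -> f64 {
    let layers = hit_fraction.len() as f64;
    let hits: f64 = hit_fraction
        .iter()
        .enumerate()
        .map(|(i, p)| (i as f64 + 1.0) * p)
        .sum();
    let miss = 1.0 - hit_fraction.iter().sum::<f64>();
    hits + layers * miss
}
```

<p>For three layers answering 80 %, 15 % and 5 % of queries in that order, the expected depth is 1.25 probes; with the order reversed it rises to 2.75, which is why the dataset that answers most queries belongs in layer 0.</p>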
<h3 id="merging-layers">Merging layers</h3>
<p>Two layer chains can be merged by re-indexing their union through the standard pipeline. This is expensive (full rebuild) but produces an optimal single-layer index. Merge is a maintenance operation, not a query-path requirement.</p>
<h2 id="open-questions">Open questions</h2>
<ul>
<li>Confirm actual partition sizes and overestimation factor on representative metagenomic datasets.</li>
<li>Revisit ptr_hash for phase 2 once the crate has broader production track record.</li>
<li>Assess rkyv integration cost for FMPHGO if true zero-copy mmap becomes necessary for the persistent index.</li>
<li><strong>rkyv integration</strong>: all flat arrays in a layer (evidence, counts, presence/absence matrix) map trivially to <code>rkyv::Archive</code> — fixed-size element types, no heap indirection. The presence/absence matrix is the strongest case: at 10 M kmers × 1 000 samples ≈ 1.25 GB per partition, zero-copy mmap via rkyv avoids loading the entire matrix at open time, letting the OS page cache serve only accessed pages. ptr_hash itself is internally a flat bit array and is structurally compatible with rkyv, but requires either native crate support or a wrapper. Assess the wrapper cost and whether ptr_hash is willing to adopt rkyv upstream.</li>
<li>Keep SSHash in mind if the indexing architecture is reconsidered at a higher level.</li>
<li>Determine optimal layer ordering heuristic (by kmer count? by query frequency?) for multi-species databases.</li>
</ul>
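<p>The 1.25 GB figure quoted for the presence/absence matrix follows from storing one bit per (kmer, sample) pair, which is the assumption made in this short check; the function name is hypothetical.</p>

```rust
/// Size in bytes of a bit-packed presence/absence matrix,
/// assuming one bit per (kmer, sample) pair and no padding between rows.
fn presence_matrix_bytes(kmers: u64, samples: u64) -> u64 {
    (kmers * samples + 7) / 8
}
```

<p>For 10 M kmers and 1 000 samples this gives 10<sup>10</sup> bits, i.e. 1 250 000 000 bytes per partition.</p>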