docs: clarify MPHF indexing, storage layout, and distance traits

Formalize the two-phase MPHF indexing architecture and update Phase 6 to use `evidence.bin` for direct kmer extraction. Simplify the evidence and unitig storage layouts to flat packed formats enabling O(1) random access. Introduce aggregation traits (`ColumnWeights`, `CountPartials`, `BitPartials`) to support additive distance metric decomposition across partitions. Narrow the documented scope from metagenomic to individual genome datasets, and replace speculative open questions with concrete implementation specifications.
Eric Coissac
2026-05-17 10:20:22 +08:00
parent cf693f17f2
commit f36b095ce2
17 changed files with 916 additions and 1031 deletions
+88 -171
@@ -654,22 +654,22 @@
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
<li class="md-nav__item">
<a href="#indexing-architecture" class="md-nav__link">
<a href="#why-two-phases-are-needed" class="md-nav__link">
<span class="md-ellipsis">
Indexing architecture
Why two phases are needed
</span>
</a>
<nav class="md-nav" aria-label="Indexing architecture">
<nav class="md-nav" aria-label="Why two phases are needed">
<ul class="md-nav__list">
<li class="md-nav__item">
<a href="#superkmer-vs-kmer-counts" class="md-nav__link">
<a href="#phase-1-provisional-mphf-kmer-spectrum" class="md-nav__link">
<span class="md-ellipsis">
Superkmer vs kmer counts
Phase 1 — provisional MPHF + kmer spectrum
</span>
</a>
@@ -677,21 +677,10 @@
</li>
<li class="md-nav__item">
<a href="#phase-1-provisional-index-and-spectrum" class="md-nav__link">
<a href="#phase-2-definitive-mphf" class="md-nav__link">
<span class="md-ellipsis">
Phase 1 — provisional index and spectrum
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#phase-2-definitive-index" class="md-nav__link">
<span class="md-ellipsis">
Phase 2 — definitive index
Phase 2 — definitive MPHF
</span>
</a>
@@ -704,10 +693,10 @@
</li>
<li class="md-nav__item">
<a href="#candidates" class="md-nav__link">
<a href="#mphf-candidates" class="md-nav__link">
<span class="md-ellipsis">
Candidates
MPHF candidates
</span>
</a>
@@ -737,10 +726,10 @@
</li>
<li class="md-nav__item">
<a href="#on-disk-and-mmap-considerations" class="md-nav__link">
<a href="#ptr_hash-configuration-phase-2" class="md-nav__link">
<span class="md-ellipsis">
On-disk and mmap considerations
ptr_hash configuration (phase 2)
</span>
</a>
@@ -759,17 +748,6 @@
<nav class="md-nav" aria-label="Multilayer index architecture">
<ul class="md-nav__list">
<li class="md-nav__item">
<a href="#motivation" class="md-nav__link">
<span class="md-ellipsis">
Motivation
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#layer-structure" class="md-nav__link">
<span class="md-ellipsis">
@@ -801,17 +779,6 @@
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#layer-count-and-probe-cost" class="md-nav__link">
<span class="md-ellipsis">
Layer count and probe cost
</span>
</a>
</li>
<li class="md-nav__item">
@@ -828,17 +795,6 @@
</ul>
</nav>
</li>
<li class="md-nav__item">
<a href="#open-questions" class="md-nav__link">
<span class="md-ellipsis">
Open questions
</span>
</a>
</li>
</ul>
@@ -1106,22 +1062,22 @@
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
<li class="md-nav__item">
<a href="#indexing-architecture" class="md-nav__link">
<a href="#why-two-phases-are-needed" class="md-nav__link">
<span class="md-ellipsis">
Indexing architecture
Why two phases are needed
</span>
</a>
<nav class="md-nav" aria-label="Indexing architecture">
<nav class="md-nav" aria-label="Why two phases are needed">
<ul class="md-nav__list">
<li class="md-nav__item">
<a href="#superkmer-vs-kmer-counts" class="md-nav__link">
<a href="#phase-1-provisional-mphf-kmer-spectrum" class="md-nav__link">
<span class="md-ellipsis">
Superkmer vs kmer counts
Phase 1 — provisional MPHF + kmer spectrum
</span>
</a>
@@ -1129,21 +1085,10 @@
</li>
<li class="md-nav__item">
<a href="#phase-1-provisional-index-and-spectrum" class="md-nav__link">
<a href="#phase-2-definitive-mphf" class="md-nav__link">
<span class="md-ellipsis">
Phase 1 — provisional index and spectrum
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#phase-2-definitive-index" class="md-nav__link">
<span class="md-ellipsis">
Phase 2 — definitive index
Phase 2 — definitive MPHF
</span>
</a>
@@ -1156,10 +1101,10 @@
</li>
<li class="md-nav__item">
<a href="#candidates" class="md-nav__link">
<a href="#mphf-candidates" class="md-nav__link">
<span class="md-ellipsis">
Candidates
MPHF candidates
</span>
</a>
@@ -1189,10 +1134,10 @@
</li>
<li class="md-nav__item">
<a href="#on-disk-and-mmap-considerations" class="md-nav__link">
<a href="#ptr_hash-configuration-phase-2" class="md-nav__link">
<span class="md-ellipsis">
On-disk and mmap considerations
ptr_hash configuration (phase 2)
</span>
</a>
@@ -1211,17 +1156,6 @@
<nav class="md-nav" aria-label="Multilayer index architecture">
<ul class="md-nav__list">
<li class="md-nav__item">
<a href="#motivation" class="md-nav__link">
<span class="md-ellipsis">
Motivation
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#layer-structure" class="md-nav__link">
<span class="md-ellipsis">
@@ -1253,17 +1187,6 @@
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#layer-count-and-probe-cost" class="md-nav__link">
<span class="md-ellipsis">
Layer count and probe cost
</span>
</a>
</li>
<li class="md-nav__item">
@@ -1280,17 +1203,6 @@
</ul>
</nav>
</li>
<li class="md-nav__item">
<a href="#open-questions" class="md-nav__link">
<span class="md-ellipsis">
Open questions
</span>
</a>
</li>
</ul>
@@ -1311,46 +1223,55 @@
<h1 id="mphf-selection-two-phase-indexing-architecture">MPHF selection — two-phase indexing architecture</h1>
<h2 id="indexing-architecture">Indexing architecture</h2>
<p>Kmer indexing per partition proceeds in two phases. The separation is necessary because the exact number of unique kmers in a partition is not known until after counting and filtering.</p>
<h3 id="superkmer-vs-kmer-counts">Superkmer vs kmer counts</h3>
<p>The <code>SKFileMeta</code> sidecar written by <code>SKFileWriter</code> records <code>instances</code> (unique superkmers) and <code>length_sum</code> (total nucleotides). A superkmer of length L contains L − k + 1 kmers, so the kmer count per partition can be estimated as <code>length_sum − instances × (k − 1)</code>. This is an <strong>overestimate</strong> of unique kmers: two distinct superkmers (different flanking contexts, same minimizer) can share kmers. The exact count of unique kmers is only known after enumerating and deduplicating them.</p>
<p>Note: two superkmers sharing a kmer necessarily share the same minimizer and therefore always land in the same partition — no kmer can appear in two different partitions.</p>
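<p>As a minimal sketch of the estimate above: the function name <code>estimate_kmer_count</code> is hypothetical and is not claimed to be part of <code>SKFileMeta</code>; only the formula comes from the text.</p>

```rust
/// Upper bound on the number of kmers in a partition, from sidecar totals.
///
/// `instances`  = number of unique superkmers
/// `length_sum` = total nucleotides over those superkmers
///
/// A superkmer of length L holds L - k + 1 kmers, so summing over all
/// superkmers gives length_sum - instances * (k - 1).
/// Hypothetical helper, for illustration only.
fn estimate_kmer_count(instances: u64, length_sum: u64, k: u64) -> u64 {
    // saturating arithmetic keeps malformed inputs from underflowing
    length_sum.saturating_sub(instances.saturating_mul(k.saturating_sub(1)))
}
```

<p>For three superkmers of 40 nucleotides each and k = 31, every superkmer holds 10 kmers, so the estimate is 30.</p>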
<h3 id="phase-1-provisional-index-and-spectrum">Phase 1 — provisional index and spectrum</h3>
<h2 id="why-two-phases-are-needed">Why two phases are needed</h2>
<p>Kmer indexing per partition proceeds in two phases. The separation is necessary because the exact number of surviving unique kmers is not known until after counting and filtering low-abundance kmers.</p>
<h3 id="phase-1-provisional-mphf-kmer-spectrum">Phase 1 — provisional MPHF + kmer spectrum</h3>
<p>Implemented in <code>obikpartitionner::KmerPartition::count_kmer()</code>.</p>
<ol>
<li>Enumerate all kmers from the dereplicated superkmers of the partition.</li>
<li>Build a provisional MPHF over this key set; capacity is pre-allocated from the sidecar estimate (slight overestimate, harmless).</li>
<li>Accumulate counts: for each kmer in each superkmer, <code>count[MPHF(kmer)] += sk.count()</code>.</li>
<li>Compute the kmer frequency spectrum (histogram: occurrences → number of kmers).</li>
<li>Apply count filter (e.g. discard singletons). After filtering, the exact number of surviving kmers is known.</li>
<li>Discard the provisional MPHF.</li>
<li><strong>Pass 1</strong>: read the dereplicated superkmer file; enumerate all unique canonical kmers into a <code>HashSet</code>. Exact count known after this pass.</li>
<li><strong>Build a provisional MPHF</strong> (<code>GOFunction</code> from the <code>ph</code> crate) over the exact kmer set. Produces <code>mphf1.bin</code>.</li>
<li><strong>Create <code>counts1.bin</code></strong>: one zero-initialised <code>u32</code> per MPHF slot (mmap'd).</li>
<li><strong>Pass 2</strong>: re-read the dereplicated file; for each kmer, query <code>mphf1.get(kmer)</code> and atomically accumulate the superkmer count into <code>counts1[slot]</code>.</li>
<li><strong>Build kmer frequency spectrum</strong> from <code>counts1</code>: histogram <code>{count → n_kmers}</code>, totals f0 (distinct kmers) and f1 (total abundance). Written to <code>kmer_spectrum_raw.json</code> per partition, then merged globally.</li>
</ol>
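<p>The two passes can be sketched in plain Rust as follows. This is an illustration under stated assumptions, not the code of <code>count_kmer()</code>: a <code>HashMap</code> stands in for the provisional MPHF, a <code>Vec&lt;u32&gt;</code> stands in for the mmap'd <code>counts1.bin</code>, and all function names are hypothetical.</p>

```rust
use std::collections::{BTreeMap, HashMap, HashSet};

/// 2-bit encode a kmer (A=0, C=1, G=2, T=3); None on any other byte.
fn encode(kmer: &[u8]) -> Option<u64> {
    kmer.iter().try_fold(0u64, |acc, &b| {
        let code = match b {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None,
        };
        Some((acc << 2) | code)
    })
}

/// Reverse complement of a 2-bit packed kmer of length k.
fn revcomp(mut x: u64, k: usize) -> u64 {
    let mut rc = 0u64;
    for _ in 0..k {
        rc = (rc << 2) | (3 - (x & 3)); // complement: A<->T, C<->G
        x >>= 2;
    }
    rc
}

/// Canonical form: the smaller of a kmer and its reverse complement.
fn canonical(x: u64, k: usize) -> u64 {
    x.min(revcomp(x, k))
}

/// All canonical kmers of one superkmer sequence (assumes 1 <= k <= 32).
fn kmers_of(seq: &str, k: usize) -> Vec<u64> {
    seq.as_bytes()
        .windows(k)
        .filter_map(encode)
        .map(|x| canonical(x, k))
        .collect()
}

/// Spectrum {count -> n_kmers} from dereplicated superkmers (sequence, count).
fn kmer_spectrum(superkmers: &[(&str, u32)], k: usize) -> BTreeMap<u32, u64> {
    // Pass 1: exact set of unique canonical kmers.
    let set: HashSet<u64> = superkmers.iter().flat_map(|(s, _)| kmers_of(s, k)).collect();
    // Stand-in for the provisional MPHF: one slot per unique kmer.
    let slot_of: HashMap<u64, usize> = set.iter().enumerate().map(|(i, &x)| (x, i)).collect();
    // Stand-in for counts1.bin: one zero-initialised u32 per slot.
    let mut counts = vec![0u32; slot_of.len()];
    // Pass 2: accumulate the superkmer count into each kmer's slot.
    for (s, c) in superkmers {
        for x in kmers_of(s, k) {
            counts[slot_of[&x]] += *c;
        }
    }
    // Histogram of the counts.
    let mut spectrum = BTreeMap::new();
    for c in counts {
        *spectrum.entry(c).or_insert(0u64) += 1;
    }
    spectrum
}
```

<p>In this sketch f0 is the sum of the histogram values and f1 is the sum of <code>count × n_kmers</code> over its entries.</p>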
<h3 id="phase-2-definitive-index">Phase 2 — definitive index</h3>
<p>Build a new MPHF over the filtered kmer set only, with the exact key count available. This is the persistent per-partition index used for all downstream operations (queries, set operations).</p>
<p>Files produced per partition:</p>
<div class="highlight"><pre><span></span><code>part_XXXXX/
mphf1.bin — GOFunction (provisional MPHF, discarded after phase 2)
counts1.bin — [u32; n_kmers] kmer counts, mmap&#39;d
kmer_spectrum_raw.json — local frequency spectrum
</code></pre></div>
<h3 id="phase-2-definitive-mphf">Phase 2 — definitive MPHF</h3>
<p>After filtering (applying a min-count threshold derived from the spectrum) and building the local De Bruijn graph + unitigs (see <a href="../pipeline/">Construction pipeline</a>), the exact filtered kmer set is available via <code>unitigs.bin</code>.</p>
<p><code>MphfLayer::build</code> is called on the unitig file:</p>
<ol>
<li><strong>Pass 1</strong>: iterate all canonical kmers from <code>unitigs.bin</code> in parallel, build and store <code>mphf.bin</code> (ptr_hash).</li>
<li><strong>Pass 2</strong>: iterate sequentially, fill <code>evidence.bin</code>, call the mode-specific <code>fill_slot</code> callback.</li>
</ol>
<p><code>mphf1.bin</code> and <code>counts1.bin</code> are no longer needed after phase 2 and can be deleted.</p>
<hr />
<h2 id="candidates">Candidates</h2>
<h2 id="mphf-candidates">MPHF candidates</h2>
<p><strong>boomphf</strong> (BBHash algorithm, maintained by 10X Genomics):</p>
<ul>
<li>~3.7 bits/key; mature crate, used in production bioinformatics (Pufferfish, Piscem)</li>
<li>Parallel construction; well-tested with DNA kmer data at scale</li>
<li>Drawback: largest space footprint; streaming construction (no exact count needed) was its main differentiator — irrelevant here since exact count is available at phase 2</li>
<li>Supports streaming construction (no exact count needed)</li>
<li>Drawback: largest space footprint; streaming advantage is irrelevant at phase 2 since the exact count is available</li>
</ul>
<p><strong>ptr_hash</strong> (PtrHash algorithm, Groot Koerkamp, SEA 2025):</p>
<ul>
<li>~2.4 bits/key; fastest queries (≥2.1× over alternatives, 8–12 ns/key for u64 in tight loops) and fastest construction (≥3.1×)</li>
<li>Requires exact key count at construction — available at phase 2</li>
<li>Drawback: published February 2025 — very young, no production track record</li>
<li>~2.4 bits/key; fastest queries (≥2.1× over alternatives, 8–12 ns/key for u64) and fastest construction (≥3.1×)</li>
<li>Requires exact key count at construction — available at both phases after pass 1</li>
<li>Published February 2025, so it has no production track record yet; accepted given its performance profile and the fact that each MPHF is independently rebuildable from its unitig file</li>
</ul>
<p><strong>FMPHGO</strong> (<code>ph</code> crate, Beling, ACM JEA 2023):</p>
<p><strong>FMPH/FMPHGO</strong> (<code>ph</code> crate, Beling, ACM JEA 2023):</p>
<ul>
<li>~2.1 bits/key — most compact of the three; good query speed; parallelisable construction</li>
<li>More established than ptr_hash; actively maintained</li>
<li>Works well with overestimated capacity → natural fit for phase 1</li>
<li>~2.1 bits/key — most compact; good query speed; deterministic construction</li>
<li>Works well from an exact or slightly overestimated count</li>
<li><code>GOFunction</code> (group-oriented variant) is the specific type used</li>
</ul>
<h2 id="mphf-choice-per-phase">MPHF choice per phase</h2>
<p><strong>Phase 1</strong> (provisional, discarded after spectrum computation): FMPHGO. Tolerates overestimated capacity, compact, no need to optimise for query speed on a temporary structure.</p>
<p><strong>Phase 2</strong> (persistent, queried repeatedly): <strong>ptr_hash</strong>. Exact key count is available at phase 2, so ptr_hash operates optimally. Its query speed (≥2.1× over FMPHGO) and construction speed (≥3.1×) are meaningful for the persistent index; the space overhead at 2.4 bits/key is acceptable. The crate's youth (Feb 2025) was previously a concern; it is now accepted given the performance profile and the fact that each layer MPHF is independently rebuildable from its unitig file if needed.</p>
<p>boomphf is effectively eliminated: its space overhead is the largest and its streaming-construction advantage does not apply here.</p>
<p><strong>Phase 1</strong> (provisional, discarded after spectrum computation): <code>ph::fmph::GOFunction</code>. Compact, fast to build from the exact post-dedup kmer set. Query speed is secondary — the structure is only used during pass 2 of <code>count_kmer</code>.</p>
<p><strong>Phase 2</strong> (persistent, queried repeatedly): <strong>ptr_hash</strong>. Exact key count is available from the unitig index; ptr_hash's query speed (≥2.1×) and construction speed (≥3.1×), both relative to FMPH, are the decisive factors. The 2.4 bits/key overhead is acceptable.</p>
<p>boomphf is eliminated: largest space overhead, streaming advantage does not apply.</p>
<hr />
<h2 id="space-at-scale">Space at scale</h2>
<p>For 1 024 partitions × 100 M kmers/partition (phase 2 index, after filtering):</p>
@@ -1374,58 +1295,54 @@
<td>~31 GB</td>
</tr>
<tr>
<td>FMPHGO</td>
<td>FMPH</td>
<td>2.1</td>
<td>~27 GB</td>
</tr>
</tbody>
</table>
<p>For a human genome at 30× coverage with 1 024 partitions, realistic partition sizes are 3–30 M unique kmers → 1–8 MB per phase-2 MPHF, well within RAM.</p>
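<p>The totals in the table follow directly from keys × bits/key. A small sketch of that arithmetic, with hypothetical helper names and decimal gigabytes assumed:</p>

```rust
/// Size in bytes of an MPHF holding `n_keys` keys at `bits_per_key` bits each.
fn mphf_bytes(n_keys: u64, bits_per_key: f64) -> f64 {
    n_keys as f64 * bits_per_key / 8.0
}

/// Total size in decimal gigabytes for `partitions` equally sized partitions.
fn total_gb(partitions: u64, keys_per_partition: u64, bits_per_key: f64) -> f64 {
    partitions as f64 * mphf_bytes(keys_per_partition, bits_per_key) / 1e9
}
```

<p>With 1 024 partitions of 100 M kmers, 2.4 bits/key gives about 30.7 GB and 2.1 bits/key about 26.9 GB, which round to the ~31 GB and ~27 GB shown in the table.</p>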
<h2 id="on-disk-and-mmap-considerations">On-disk and mmap considerations</h2>
<p>All three are in-memory structures. Their internal representation is flat bit arrays (no heap pointers), making them serialisable as contiguous byte blobs and mmappable per partition. True zero-copy access would require rkyv integration; the <code>ph</code> crate currently uses serde, so loading involves a copy. Given per-partition MPHF sizes of 1–8 MB, the OS page cache handles this transparently; strict zero-copy is a refinement, not a blocker.</p>
<p>No established Rust crate provides a natively on-disk MPHF. <strong>SSHash</strong> (Sparse and Skew Hash) is a complete kmer dictionary designed for disk access and is order-preserving (overlapping kmers receive consecutive indices → cache-friendly count access), but it is C++-only and covers more than just the MPHF layer.</p>
<hr />
<h2 id="ptr_hash-configuration-phase-2">ptr_hash configuration (phase 2)</h2>
<div class="highlight"><pre><span></span><code><span class="k">type</span><span class="w"> </span><span class="nc">Mphf</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">PtrHash</span><span class="o">&lt;</span>
<span class="w"> </span><span class="kt">u64</span><span class="p">,</span><span class="w"> </span><span class="c1">// key: canonical kmer raw encoding</span>
<span class="w"> </span><span class="n">CubicEps</span><span class="p">,</span><span class="w"> </span><span class="c1">// bucket fn: 2.4 bits/key, λ=3.5, α=0.99</span>
<span class="w"> </span><span class="n">CachelineEfVec</span><span class="o">&lt;</span><span class="nb">Vec</span><span class="o">&lt;</span><span class="n">CachelineEf</span><span class="o">&gt;&gt;</span><span class="p">,</span><span class="w"> </span><span class="c1">// remap: 11.6 bits/entry (Elias-Fano)</span>
<span class="w"> </span><span class="n">Xx64</span><span class="p">,</span><span class="w"> </span><span class="c1">// hasher: XXH3-64 with seed</span>
<span class="w"> </span><span class="nb">Vec</span><span class="o">&lt;</span><span class="kt">u8</span><span class="o">&gt;</span><span class="p">,</span><span class="w"> </span><span class="c1">// pilots</span>
<span class="o">&gt;</span><span class="p">;</span>
</code></pre></div>
<p><strong>Hasher — <code>Xx64</code></strong>: canonical kmer raw values are left-aligned u64 with structural zeros in low bits (42 zeros for k=11, 2 zeros for k=31). <code>FxHash</code> (single multiply) distributes these poorly; <code>Xx64</code> (XXH3-64, seeded) handles structured input correctly.</p>
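<p>The structural zeros come from left-aligning 2k bits in a 64-bit word. A short sketch of that layout, assuming the first base occupies the top two bits; the helper names are hypothetical:</p>

```rust
/// Number of always-zero low bits in a left-aligned 2-bit kmer of length k.
/// Assumes 1 <= k <= 32.
fn structural_zero_bits(k: u32) -> u32 {
    assert!((1..=32).contains(&k), "k must be in 1..=32");
    64 - 2 * k
}

/// Left-align a right-aligned 2-bit packed kmer inside a u64.
/// The first base ends up in the two most significant bits.
fn left_align(packed: u64, k: u32) -> u64 {
    packed << structural_zero_bits(k)
}
```

<p>For k = 11 the low 42 bits are always zero, and for k = 31 the low 2 bits are, matching the figures above. Every key fed to the hasher therefore shares those constant low bits.</p>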
<p><strong>Bucket function — <code>CubicEps</code></strong>: λ=3.5, α=0.99. Balanced tradeoff: 2× slower construction than <code>Linear/λ=3.0</code>, 20% less space. <code>default_compact</code> (λ=4.0) saves a further 12.5% at 2× more construction time — not chosen.</p>
<p><strong>Remap — <code>CachelineEfVec</code></strong>: Elias-Fano variant packing 44 sorted 40-bit values per 64-byte cacheline (11.6 bits/value vs 32 for <code>Vec&lt;u32&gt;</code>). One cacheline per query; space win dominates at billion-scale key counts.</p>
<hr />
<h2 id="multilayer-index-architecture">Multilayer index architecture</h2>
<h3 id="motivation">Motivation</h3>
<p>An index built from a single dataset A can be extended with a new dataset B without rebuilding. This supports incremental construction (adding species, samples, or sequencing runs) and enables set operations across heterogeneous sources.</p>
<h3 id="layer-structure">Layer structure</h3>
<p>Each layer is a self-contained unit:</p>
<p>Each layer is a self-contained unit. See <a href="../obilayeredmap/">obilayeredmap</a> for the full on-disk layout. The MPHF-relevant files are:</p>
<div class="highlight"><pre><span></span><code>layer_i/
unitigs.bin — packed 2-bit nucleotide sequences
mphf.bin — ptr_hash index (phase-2, exact key count)
evidence.bin — [(unitig_id, rank)] per MPHF slot (see unitig_evidence.md)
counts.bin — [u32] per MPHF slot
unitigs.bin — packed 2-bit nucleotide sequences (kmer evidence)
mphf.bin — ptr_hash phase-2 MPHF
evidence.bin — n × u32: (chunk_id: 25 bits | rank: 7 bits) per slot
</code></pre></div>
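<p>One way to pack an evidence entry into a <code>u32</code> is sketched below. The bit order is an assumption (<code>chunk_id</code> in the high 25 bits, <code>rank</code> in the low 7 bits); the text above only gives the field widths, and the function names are hypothetical.</p>

```rust
const RANK_BITS: u32 = 7;
const RANK_MASK: u32 = (1 << RANK_BITS) - 1; // 0x7F
const MAX_CHUNK_ID: u32 = (1 << 25) - 1;

/// Pack (chunk_id, rank) into one u32; None if either field is out of range.
/// Assumed layout: chunk_id in the high 25 bits, rank in the low 7 bits.
fn pack_evidence(chunk_id: u32, rank: u32) -> Option<u32> {
    if chunk_id > MAX_CHUNK_ID || rank > RANK_MASK {
        return None;
    }
    Some((chunk_id << RANK_BITS) | rank)
}

/// Inverse of `pack_evidence`: recover (chunk_id, rank) from a packed entry.
fn unpack_evidence(entry: u32) -> (u32, u32) {
    (entry >> RANK_BITS, entry & RANK_MASK)
}
```

<p>With fixed-width entries, the evidence for slot <code>i</code> sits at byte offset <code>4 × i</code>, which is what makes the flat file randomly accessible.</p>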
<p>Layers are <strong>disjoint</strong>: a canonical kmer belongs to exactly one layer. Layer 0 is built from dataset A. Adding dataset B proceeds as follows:</p>
<p>Layers are <strong>disjoint</strong>: a canonical kmer belongs to exactly one layer. Layer 0 is built from dataset A. Adding dataset B:</p>
<ol>
<li>For each kmer in B: query layer 0 — if found, accumulate count into <code>counts_0[MPHF_0(kmer)]</code>.</li>
<li>Collect all kmers of B not present in any existing layer → set <code>B \ A</code>.</li>
<li>Build layer 1 from <code>B \ A</code> using the standard two-phase pipeline (spectrum, filter, ptr_hash).</li>
<li>For each kmer in B: probe existing layers. If found, the kmer is already indexed.</li>
<li>Collect kmers of B not present in any layer → set <code>B \ A</code>.</li>
<li>Build layer 1 from <code>B \ A</code> (dereplicate → count → De Bruijn → unitigs → <code>MphfLayer::build</code>).</li>
</ol>
<p>Adding a third dataset C repeats the process: probe layer 0, then layer 1, then build layer 2 from <code>C \ A \ B</code>.</p>
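<p>The key set of a new layer is a set difference against all existing layers. A minimal sketch, using exact in-memory sets in place of the MPHF plus evidence probe; the function name is hypothetical:</p>

```rust
use std::collections::BTreeSet;

/// Kmers of `dataset` that are absent from every existing layer.
/// Each existing layer is modelled as an exact set of canonical kmers;
/// a real probe would go through the layer's MPHF and evidence instead.
fn next_layer_keys(existing: &[BTreeSet<u64>], dataset: &[u64]) -> BTreeSet<u64> {
    dataset
        .iter()
        .copied()
        .filter(|kmer| !existing.iter().any(|layer| layer.contains(kmer)))
        .collect()
}
```

<p>Because each new layer only receives kmers absent from all earlier layers, the layers stay disjoint by construction.</p>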
<h3 id="membership-verification">Membership verification</h3>
<p>ptr_hash maps any input to a valid slot — it does not natively detect absent keys. Membership is verified using the evidence entry: decode the kmer from <code>(unitig_id, rank)</code> and compare to the query. A mismatch means the kmer is absent from this layer; probe the next layer.</p>
<p>This makes the evidence layer load-bearing for correctness, not only for locality.</p>
<p>ptr_hash maps any input to a valid slot — it does not natively detect absent keys. Membership is verified using the evidence entry: decode the kmer from <code>(chunk_id, rank)</code> and compare to the query. A mismatch means the kmer is absent from this layer; probe the next layer.</p>
<h3 id="query-algorithm">Query algorithm</h3>
<div class="highlight"><pre><span></span><code>fn query(kmer) → Option&lt;count&gt;:
for layer in layers:
slot = layer.mphf.query(kmer)
if layer.evidence.decode(slot) == kmer:
return Some(layer.counts[slot])
<div class="highlight"><pre><span></span><code>fn query(kmer) → Option&lt;(layer_index, slot)&gt;:
for (i, layer) in layers.iter().enumerate():
slot = layer.mphf.index(kmer)
if layer.evidence.decode(slot) matches kmer:
return Some((i, slot))
return None
</code></pre></div>
<p>Expected probe depth: 1 for kmers present in layer 0, increasing for rare kmers added in later layers. In practice, the dominant dataset (largest A) should be layer 0 to minimise average probe depth.</p>
<h3 id="layer-count-and-probe-cost">Layer count and probe cost</h3>
<p>Each probe is a ptr_hash lookup (~10 ns) plus one evidence decode (two array accesses). For L layers the worst case is L probes + 1 None. In practice L is small (2–5 for typical multi-species databases). No global data structure is needed to route queries; the layer chain is traversed in order.</p>
<p>Expected probe depth: 1 for kmers in layer 0. Each probe is a ptr_hash lookup (~10 ns) plus one evidence decode.</p>
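<p>The probe loop and the membership check can be sketched in Rust as below. The <code>Layer</code> type is a hypothetical stand-in, not the <code>obilayeredmap</code> API: a <code>HashMap</code> plays the role of the MPHF (with a fallback so that absent keys still receive a valid slot, as ptr_hash does), and a <code>Vec&lt;u64&gt;</code> plays the role of decoding the kmer from <code>evidence.bin</code> and <code>unitigs.bin</code>.</p>

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for one on-disk layer (assumed non-empty).
struct Layer {
    /// Plays the role of the MPHF for keys that are in the layer.
    slot_of: HashMap<u64, usize>,
    /// Plays the role of evidence decoding: the kmer stored at each slot.
    kmer_at: Vec<u64>,
}

impl Layer {
    fn build(kmers: &[u64]) -> Self {
        let slot_of: HashMap<u64, usize> =
            kmers.iter().enumerate().map(|(i, &x)| (x, i)).collect();
        Layer { slot_of, kmer_at: kmers.to_vec() }
    }

    /// Like an MPHF, returns a valid slot for any input, present or not.
    fn index(&self, kmer: u64) -> usize {
        self.slot_of
            .get(&kmer)
            .copied()
            .unwrap_or_else(|| (kmer % self.kmer_at.len() as u64) as usize)
    }
}

/// Probe layers in order; the evidence comparison rejects absent keys.
fn query(layers: &[Layer], kmer: u64) -> Option<(usize, usize)> {
    for (i, layer) in layers.iter().enumerate() {
        let slot = layer.index(kmer);
        if layer.kmer_at[slot] == kmer {
            return Some((i, slot));
        }
    }
    None
}
```

<p>A kmer absent from every layer still lands on some slot in each one; only the comparison against the stored kmer turns that into <code>None</code>.</p>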
<h3 id="merging-layers">Merging layers</h3>
<p>Two layer chains can be merged by re-indexing their union through the standard pipeline. This is expensive (full rebuild) but produces an optimal single-layer index. Merge is a maintenance operation, not a query-path requirement.</p>
<h2 id="open-questions">Open questions</h2>
<ul>
<li>Confirm actual partition sizes and overestimation factor on representative metagenomic datasets.</li>
<li><strong>rkyv integration</strong>: all flat arrays in a layer (evidence, counts, presence/absence matrix) map trivially to <code>rkyv::Archive</code> — fixed-size element types, no heap indirection. The presence/absence matrix is the strongest case: at 10 M kmers × 1 000 samples ≈ 1.25 GB per partition, zero-copy mmap via rkyv avoids loading the entire matrix at open time, letting the OS page cache serve only accessed pages. ptr_hash itself is internally a flat bit array and is structurally compatible with rkyv, but requires either native crate support or a wrapper. Assess the wrapper cost and whether ptr_hash is willing to adopt rkyv upstream.</li>
<li>Keep SSHash in mind if the indexing architecture is reconsidered at a higher level.</li>
<li>Determine optimal layer ordering heuristic (by kmer count? by query frequency?) for multi-species databases.</li>
</ul>
<p>Two layer chains can be merged by re-indexing their union through the full pipeline. This is expensive (full rebuild) but produces an optimal single-layer index. Merge is a maintenance operation, not a query-path requirement.</p>