docs: clarify MPHF indexing, storage layout, and distance traits

Formalize the two-phase MPHF indexing architecture and update Phase 6 to use `evidence.bin` for direct kmer extraction. Simplify the evidence and unitig storage layouts to flat packed formats enabling O(1) random access. Introduce aggregation traits (`ColumnWeights`, `CountPartials`, `BitPartials`) to support additive distance metric decomposition across partitions. Narrow the documented scope from metagenomic to individual genome datasets, and replace speculative open questions with concrete implementation specifications.
Eric Coissac
2026-05-17 10:20:22 +08:00
parent cf693f17f2
commit f36b095ce2
17 changed files with 916 additions and 1031 deletions
@@ -907,6 +907,45 @@
</ul>
</nav>
</li>
<li class="md-nav__item">
<a href="#aggregation-traits-obicompactvectraits" class="md-nav__link">
<span class="md-ellipsis">
Aggregation traits — obicompactvec::traits
</span>
</a>
<nav class="md-nav" aria-label="Aggregation traits — obicompactvec::traits">
<ul class="md-nav__list">
<li class="md-nav__item">
<a href="#columnweights" class="md-nav__link">
<span class="md-ellipsis">
ColumnWeights
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#countpartials" class="md-nav__link">
<span class="md-ellipsis">
CountPartials
</span>
</a>
</li>
</ul>
</nav>
</li>
</ul>
@@ -1259,6 +1298,45 @@
</ul>
</nav>
</li>
<li class="md-nav__item">
<a href="#aggregation-traits-obicompactvectraits" class="md-nav__link">
<span class="md-ellipsis">
Aggregation traits — obicompactvec::traits
</span>
</a>
<nav class="md-nav" aria-label="Aggregation traits — obicompactvec::traits">
<ul class="md-nav__list">
<li class="md-nav__item">
<a href="#columnweights" class="md-nav__link">
<span class="md-ellipsis">
ColumnWeights
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#countpartials" class="md-nav__link">
<span class="md-ellipsis">
CountPartials
</span>
</a>
</li>
</ul>
</nav>
</li>
</ul>
@@ -1535,6 +1613,40 @@ step = ⌈n_overflow / 2048⌉ otherwise
<span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">read</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">slot</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nb">Box</span><span class="o">&lt;</span><span class="p">[</span><span class="kt">u32</span><span class="p">]</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">row</span><span class="p">(</span><span class="n">slot</span><span class="p">)</span><span class="w"> </span><span class="p">}</span>
<span class="p">}</span>
</code></pre></div>
<hr />
<h2 id="aggregation-traits-obicompactvectraits">Aggregation traits — <code>obicompactvec::traits</code></h2>
<p><code>PersistentCompactIntMatrix</code> implements two aggregation traits used by <code>LayeredStore&lt;S&gt;</code> for cross-layer and cross-partition distance computations.</p>
<h3 id="columnweights">ColumnWeights</h3>
<div class="highlight"><pre><span></span><code><span class="k">impl</span><span class="w"> </span><span class="n">ColumnWeights</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">PersistentCompactIntMatrix</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">col_weights</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">Array1</span><span class="o">&lt;</span><span class="kt">u64</span><span class="o">&gt;</span><span class="w"> </span><span class="c1">// = self.sum()</span>
<span class="p">}</span>
</code></pre></div>
<p><code>col_weights()[c]</code> = sum of all values in column <code>c</code> across all slots.</p>
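<p>As an illustration only, the column-sum definition above can be sketched on a plain <code>Vec&lt;Vec&lt;u32&gt;&gt;</code> (rows are slots, columns are samples). This dense stand-in and the helper <code>add_weights</code> are hypothetical and are not part of <code>obicompactvec</code>; the sketch only shows that per-layer weights can be summed element-wise to obtain the weights of the combined data.</p>

```rust
/// Column sums of a dense count matrix (rows = slots, columns = samples).
/// Assumes a rectangular matrix; totals are widened to u64 so that large
/// columns cannot overflow u32.
fn col_weights(m: &[Vec<u32>]) -> Vec<u64> {
    let n_cols = m.first().map_or(0, |row| row.len());
    let mut weights = vec![0u64; n_cols];
    for row in m {
        for (c, &v) in row.iter().enumerate() {
            weights[c] += u64::from(v);
        }
    }
    weights
}

/// Element-wise sum of two weight vectors (hypothetical helper),
/// e.g. to combine the weights of two layers.
fn add_weights(a: &[u64], b: &[u64]) -> Vec<u64> {
    a.iter().zip(b).map(|(x, y)| x + y).collect()
}
```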
<h3 id="countpartials">CountPartials</h3>
<div class="highlight"><pre><span></span><code><span class="k">impl</span><span class="w"> </span><span class="n">CountPartials</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">PersistentCompactIntMatrix</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="c1">// Self-contained partials (additive across layers, no external parameter)</span>
<span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">partial_bray</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">Array2</span><span class="o">&lt;</span><span class="kt">u64</span><span class="o">&gt;</span>
<span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">partial_euclidean</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">Array2</span><span class="o">&lt;</span><span class="kt">f64</span><span class="o">&gt;</span>
<span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">partial_threshold_jaccard</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">threshold</span><span class="p">:</span><span class="w"> </span><span class="kt">u32</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="p">(</span><span class="n">Array2</span><span class="o">&lt;</span><span class="kt">u64</span><span class="o">&gt;</span><span class="p">,</span><span class="w"> </span><span class="n">Array2</span><span class="o">&lt;</span><span class="kt">u64</span><span class="o">&gt;</span><span class="p">)</span>
<span class="w"> </span><span class="c1">// Normalised partials (require global col_weights across all layers/partitions)</span>
<span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">partial_relfreq_bray</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">global</span><span class="p">:</span><span class="w"> </span><span class="kp">&amp;</span><span class="nc">Array1</span><span class="o">&lt;</span><span class="kt">u64</span><span class="o">&gt;</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">Array2</span><span class="o">&lt;</span><span class="kt">f64</span><span class="o">&gt;</span>
<span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">partial_relfreq_euclidean</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">global</span><span class="p">:</span><span class="w"> </span><span class="kp">&amp;</span><span class="nc">Array1</span><span class="o">&lt;</span><span class="kt">u64</span><span class="o">&gt;</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">Array2</span><span class="o">&lt;</span><span class="kt">f64</span><span class="o">&gt;</span>
<span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">partial_hellinger</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">global</span><span class="p">:</span><span class="w"> </span><span class="kp">&amp;</span><span class="nc">Array1</span><span class="o">&lt;</span><span class="kt">u64</span><span class="o">&gt;</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">Array2</span><span class="o">&lt;</span><span class="kt">f64</span><span class="o">&gt;</span>
<span class="w"> </span><span class="c1">// Provided finalisations (default implementations on the trait)</span>
<span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">bray_dist_matrix</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">Array2</span><span class="o">&lt;</span><span class="kt">f64</span><span class="o">&gt;</span>
<span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">euclidean_dist_matrix</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">Array2</span><span class="o">&lt;</span><span class="kt">f64</span><span class="o">&gt;</span>
<span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">threshold_jaccard_dist_matrix</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">threshold</span><span class="p">:</span><span class="w"> </span><span class="kt">u32</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">Array2</span><span class="o">&lt;</span><span class="kt">f64</span><span class="o">&gt;</span>
<span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">relfreq_bray_dist_matrix</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">Array2</span><span class="o">&lt;</span><span class="kt">f64</span><span class="o">&gt;</span>
<span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">relfreq_euclidean_dist_matrix</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">Array2</span><span class="o">&lt;</span><span class="kt">f64</span><span class="o">&gt;</span>
<span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">hellinger_dist_matrix</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">Array2</span><span class="o">&lt;</span><span class="kt">f64</span><span class="o">&gt;</span>
<span class="p">}</span>
</code></pre></div>
<p><strong>Self-contained partials</strong> are additively decomposable: summing <code>partial_bray()</code> across all <code>(partition, layer)</code> pairs and finalising gives the same result as computing on the combined data.</p>
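<p>The additivity property can be checked on a small dense stand-in. The sketch below assumes that the Bray partial is the per-pair sum of minima over slots (the <code>sum_min</code> quantity), and uses a plain <code>Vec&lt;Vec&lt;u32&gt;&gt;</code> in place of the real matrix type; <code>add_partials</code> is a hypothetical helper, not a library function.</p>

```rust
/// Assumed definition of the Bray partial on a dense matrix:
/// sum_min[i][j] = sum over slots of min(count[slot][i], count[slot][j]).
fn partial_bray(m: &[Vec<u32>]) -> Vec<Vec<u64>> {
    let n = m.first().map_or(0, |row| row.len());
    let mut out = vec![vec![0u64; n]; n];
    for row in m {
        for i in 0..n {
            for j in 0..n {
                out[i][j] += u64::from(row[i].min(row[j]));
            }
        }
    }
    out
}

/// Element-wise sum of two partial matrices of the same shape
/// (hypothetical helper standing in for cross-layer aggregation).
fn add_partials(a: &[Vec<u64>], b: &[Vec<u64>]) -> Vec<Vec<u64>> {
    a.iter()
        .zip(b)
        .map(|(ra, rb)| ra.iter().zip(rb).map(|(x, y)| x + y).collect())
        .collect()
}
```

<p>Splitting the slots of a matrix into two layers and adding the two partials yields the same matrix as computing the partial once over all slots, because each slot contributes an independent term to the sum.</p>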
<p><strong>Normalised partials</strong> require the global column weights, i.e. the sum over all layers and all partitions. The <code>global</code> parameter must reflect the complete index, not a per-layer sum. The provided finalisations <code>relfreq_bray_dist_matrix()</code>, <code>relfreq_euclidean_dist_matrix()</code> and <code>hellinger_dist_matrix()</code> work in two passes: they first call <code>col_weights()</code> (pass 1), then the matching normalised partial (pass 2). When called on a <code>LayeredStore&lt;LayeredStore&lt;&gt;&gt;</code>, these two-pass calls cascade automatically through the blanket impls.</p>
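<p>A minimal sketch of the two-pass scheme, under stated assumptions: layers are dense <code>Vec&lt;Vec&lt;u32&gt;&gt;</code> matrices, the relative-frequency Euclidean partial is taken to be the per-pair sum of squared frequency differences, and the finalisation is a square root. None of these details is confirmed by the text above, and the driver <code>relfreq_euclidean_dist</code> is a hypothetical name.</p>

```rust
/// Pass 1 building block: column sums of one layer.
fn col_weights(m: &[Vec<u32>]) -> Vec<u64> {
    let n = m.first().map_or(0, |row| row.len());
    let mut w = vec![0u64; n];
    for row in m {
        for (c, &v) in row.iter().enumerate() {
            w[c] += u64::from(v);
        }
    }
    w
}

/// Pass 2 building block (assumed form): sum over slots of
/// (x_i / W_i - x_j / W_j)^2, where W are the *global* weights.
fn partial_relfreq_euclidean(m: &[Vec<u32>], global: &[u64]) -> Vec<Vec<f64>> {
    let n = global.len();
    let mut out = vec![vec![0.0f64; n]; n];
    for row in m {
        // Relative frequencies for this slot; a zero-weight column maps to 0.0.
        let f: Vec<f64> = row
            .iter()
            .zip(global)
            .map(|(&v, &w)| if w == 0 { 0.0 } else { f64::from(v) / w as f64 })
            .collect();
        for i in 0..n {
            for j in 0..n {
                let d = f[i] - f[j];
                out[i][j] += d * d;
            }
        }
    }
    out
}

/// Hypothetical two-pass driver over several layers.
fn relfreq_euclidean_dist(layers: &[Vec<Vec<u32>>]) -> Vec<Vec<f64>> {
    let n = layers.iter().flat_map(|l| l.iter()).next().map_or(0, |r| r.len());
    // Pass 1: accumulate the global column weights over every layer.
    let mut global = vec![0u64; n];
    for layer in layers {
        for (g, w) in global.iter_mut().zip(col_weights(layer)) {
            *g += w;
        }
    }
    // Pass 2: sum the normalised partials, all computed with the same global weights.
    let mut total = vec![vec![0.0f64; n]; n];
    for layer in layers {
        let p = partial_relfreq_euclidean(layer, &global);
        for i in 0..n {
            for j in 0..n {
                total[i][j] += p[i][j];
            }
        }
    }
    // Finalise (assumed): square root of the summed squared differences.
    total.iter().map(|r| r.iter().map(|v| v.sqrt()).collect()).collect()
}
```

<p>Passing a per-layer weight vector to pass 2 instead of the global one would normalise each layer by a different total, so the summed partials would no longer match the result on the combined data.</p>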
<p><strong><code>partial_bray</code> returns <code>Array2&lt;u64&gt;</code></strong> (sum_min only, not a tuple). The denominator is always reconstructible as <code>col_weights()[i] + col_weights()[j]</code>.</p>
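<p>To make the role of the denominator concrete, here is a sketch of a finalisation step using the standard Bray-Curtis formula, <code>1 - 2 * sum_min[i][j] / (w[i] + w[j])</code>. The function name <code>bray_finalise</code>, the dense <code>Vec</code> types, and the handling of two empty columns are assumptions made for illustration; the actual <code>bray_dist_matrix()</code> may differ.</p>

```rust
/// Bray-Curtis distance from an aggregated sum_min matrix and column weights:
/// d[i][j] = 1 - 2 * sum_min[i][j] / (weights[i] + weights[j]).
fn bray_finalise(sum_min: &[Vec<u64>], weights: &[u64]) -> Vec<Vec<f64>> {
    let n = weights.len();
    let mut d = vec![vec![0.0f64; n]; n];
    for i in 0..n {
        for j in 0..n {
            // The denominator is rebuilt from per-column weights only.
            let denom = weights[i] + weights[j];
            // Assumption: two empty columns are treated as identical (distance 0).
            if denom > 0 {
                d[i][j] = 1.0 - 2.0 * sum_min[i][j] as f64 / denom as f64;
            }
        }
    }
    d
}
```

<p>Because the denominator depends on one column at a time, only the numerator needs to be carried as a pairwise matrix across layers and partitions.</p>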
<p><strong><code>partial_threshold_jaccard</code> returns <code>(inter, union)</code></strong> as a pair because <code>union[i,j]</code> depends on both columns simultaneously and cannot be reconstructed from per-column statistics.</p>
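<p>The following sketch illustrates why both counts are carried as pairwise matrices. It assumes that a column is "present" in a slot when its count is at least <code>threshold</code> (the exact comparison used by the library is not stated above) and works on a dense <code>Vec&lt;Vec&lt;u32&gt;&gt;</code> stand-in.</p>

```rust
/// Returns (inter, union) for every column pair:
/// inter[i][j] = slots where both columns reach `threshold`,
/// union[i][j] = slots where at least one of them does.
/// Assumption: presence means count >= threshold.
fn partial_threshold_jaccard(
    m: &[Vec<u32>],
    threshold: u32,
) -> (Vec<Vec<u64>>, Vec<Vec<u64>>) {
    let n = m.first().map_or(0, |row| row.len());
    let mut inter_counts = vec![vec![0u64; n]; n];
    let mut union_counts = vec![vec![0u64; n]; n];
    for row in m {
        for i in 0..n {
            for j in 0..n {
                let (pi, pj) = (row[i] >= threshold, row[j] >= threshold);
                inter_counts[i][j] += u64::from(pi && pj);
                union_counts[i][j] += u64::from(pi || pj);
            }
        }
    }
    (inter_counts, union_counts)
}
```

<p>For example, the matrices <code>[[1, 1], [0, 0]]</code> and <code>[[1, 0], [0, 1]]</code> have the same per-column presence count (one slot each at threshold 1), yet their union counts for the pair are 1 and 2 respectively.</p>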