docs: clarify MPHF indexing, storage layout, and distance traits

Formalize the two-phase MPHF indexing architecture and update Phase 6 to use `evidence.bin` for direct kmer extraction. Simplify the evidence and unitig storage layouts to flat packed formats enabling O(1) random access. Introduce aggregation traits (`ColumnWeights`, `CountPartials`, `BitPartials`) to support additive distance metric decomposition across partitions. Narrow the documented scope from metagenomic to individual genome datasets, and replace speculative open questions with concrete implementation specifications.
2026-05-17 10:20:22 +08:00
parent cf693f17f2
commit f36b095ce2
17 changed files with 916 additions and 1031 deletions
@@ -654,22 +654,22 @@
    <ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
      
        <li class="md-nav__item">
-  <a href="#indexing-architecture" class="md-nav__link">
+  <a href="#why-two-phases-are-needed" class="md-nav__link">
    <span class="md-ellipsis">
      
-        Indexing architecture
+        Why two phases are needed
      
    </span>
  </a>
  
-    <nav class="md-nav" aria-label="Indexing architecture">
+    <nav class="md-nav" aria-label="Why two phases are needed">
      <ul class="md-nav__list">
        
          <li class="md-nav__item">
-  <a href="#superkmer-vs-kmer-counts" class="md-nav__link">
+  <a href="#phase-1-provisional-mphf-kmer-spectrum" class="md-nav__link">
    <span class="md-ellipsis">
      
-        Superkmer vs kmer counts
+        Phase 1 — provisional MPHF + kmer spectrum
      
    </span>
  </a>
@@ -677,21 +677,10 @@
 </li>
        
          <li class="md-nav__item">
-  <a href="#phase-1-provisional-index-and-spectrum" class="md-nav__link">
+  <a href="#phase-2-definitive-mphf" class="md-nav__link">
    <span class="md-ellipsis">
      
-        Phase 1 — provisional index and spectrum
-      
-    </span>
-  </a>
-  
-</li>
-        
-          <li class="md-nav__item">
-  <a href="#phase-2-definitive-index" class="md-nav__link">
-    <span class="md-ellipsis">
-      
-        Phase 2 — definitive index
+        Phase 2 — definitive MPHF
      
    </span>
  </a>
@@ -704,10 +693,10 @@
 </li>
      
        <li class="md-nav__item">
-  <a href="#candidates" class="md-nav__link">
+  <a href="#mphf-candidates" class="md-nav__link">
    <span class="md-ellipsis">
      
-        Candidates
+        MPHF candidates
      
    </span>
  </a>
@@ -737,10 +726,10 @@
 </li>
      
        <li class="md-nav__item">
-  <a href="#on-disk-and-mmap-considerations" class="md-nav__link">
+  <a href="#ptr_hash-configuration-phase-2" class="md-nav__link">
    <span class="md-ellipsis">
      
-        On-disk and mmap considerations
+        ptr_hash configuration (phase 2)
      
    </span>
  </a>
@@ -759,17 +748,6 @@
    <nav class="md-nav" aria-label="Multilayer index architecture">
      <ul class="md-nav__list">
        
-          <li class="md-nav__item">
-  <a href="#motivation" class="md-nav__link">
-    <span class="md-ellipsis">
-      
-        Motivation
-      
-    </span>
-  </a>
-  
-</li>
-        
          <li class="md-nav__item">
  <a href="#layer-structure" class="md-nav__link">
    <span class="md-ellipsis">
@@ -801,17 +779,6 @@
    </span>
  </a>
  
-</li>
-        
-          <li class="md-nav__item">
-  <a href="#layer-count-and-probe-cost" class="md-nav__link">
-    <span class="md-ellipsis">
-      
-        Layer count and probe cost
-      
-    </span>
-  </a>
-  
 </li>
        
          <li class="md-nav__item">
@@ -828,17 +795,6 @@
      </ul>
    </nav>
  
-</li>
-      
-        <li class="md-nav__item">
-  <a href="#open-questions" class="md-nav__link">
-    <span class="md-ellipsis">
-      
-        Open questions
-      
-    </span>
-  </a>
-  
 </li>
      
    </ul>
@@ -1106,22 +1062,22 @@
    <ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
      
        <li class="md-nav__item">
-  <a href="#indexing-architecture" class="md-nav__link">
+  <a href="#why-two-phases-are-needed" class="md-nav__link">
    <span class="md-ellipsis">
      
-        Indexing architecture
+        Why two phases are needed
      
    </span>
  </a>
  
-    <nav class="md-nav" aria-label="Indexing architecture">
+    <nav class="md-nav" aria-label="Why two phases are needed">
      <ul class="md-nav__list">
        
          <li class="md-nav__item">
-  <a href="#superkmer-vs-kmer-counts" class="md-nav__link">
+  <a href="#phase-1-provisional-mphf-kmer-spectrum" class="md-nav__link">
    <span class="md-ellipsis">
      
-        Superkmer vs kmer counts
+        Phase 1 — provisional MPHF + kmer spectrum
      
    </span>
  </a>
@@ -1129,21 +1085,10 @@
 </li>
        
          <li class="md-nav__item">
-  <a href="#phase-1-provisional-index-and-spectrum" class="md-nav__link">
+  <a href="#phase-2-definitive-mphf" class="md-nav__link">
    <span class="md-ellipsis">
      
-        Phase 1 — provisional index and spectrum
-      
-    </span>
-  </a>
-  
-</li>
-        
-          <li class="md-nav__item">
-  <a href="#phase-2-definitive-index" class="md-nav__link">
-    <span class="md-ellipsis">
-      
-        Phase 2 — definitive index
+        Phase 2 — definitive MPHF
      
    </span>
  </a>
@@ -1156,10 +1101,10 @@
 </li>
      
        <li class="md-nav__item">
-  <a href="#candidates" class="md-nav__link">
+  <a href="#mphf-candidates" class="md-nav__link">
    <span class="md-ellipsis">
      
-        Candidates
+        MPHF candidates
      
    </span>
  </a>
@@ -1189,10 +1134,10 @@
 </li>
      
        <li class="md-nav__item">
-  <a href="#on-disk-and-mmap-considerations" class="md-nav__link">
+  <a href="#ptr_hash-configuration-phase-2" class="md-nav__link">
    <span class="md-ellipsis">
      
-        On-disk and mmap considerations
+        ptr_hash configuration (phase 2)
      
    </span>
  </a>
@@ -1211,17 +1156,6 @@
    <nav class="md-nav" aria-label="Multilayer index architecture">
      <ul class="md-nav__list">
        
-          <li class="md-nav__item">
-  <a href="#motivation" class="md-nav__link">
-    <span class="md-ellipsis">
-      
-        Motivation
-      
-    </span>
-  </a>
-  
-</li>
-        
          <li class="md-nav__item">
  <a href="#layer-structure" class="md-nav__link">
    <span class="md-ellipsis">
@@ -1253,17 +1187,6 @@
    </span>
  </a>
  
-</li>
-        
-          <li class="md-nav__item">
-  <a href="#layer-count-and-probe-cost" class="md-nav__link">
-    <span class="md-ellipsis">
-      
-        Layer count and probe cost
-      
-    </span>
-  </a>
-  
 </li>
        
          <li class="md-nav__item">
@@ -1280,17 +1203,6 @@
      </ul>
    </nav>
  
-</li>
-      
-        <li class="md-nav__item">
-  <a href="#open-questions" class="md-nav__link">
-    <span class="md-ellipsis">
-      
-        Open questions
-      
-    </span>
-  </a>
-  
 </li>
      
    </ul>
@@ -1311,46 +1223,55 @@


 <h1 id="mphf-selection-two-phase-indexing-architecture">MPHF selection — two-phase indexing architecture</h1>
-<h2 id="indexing-architecture">Indexing architecture</h2>
-<p>Kmer indexing per partition proceeds in two phases. The separation is necessary because the exact number of unique kmers in a partition is not known until after counting and filtering.</p>
-<h3 id="superkmer-vs-kmer-counts">Superkmer vs kmer counts</h3>
-<p>The <code>SKFileMeta</code> sidecar written by <code>SKFileWriter</code> records <code>instances</code> (unique superkmers) and <code>length_sum</code> (total nucleotides). A superkmer of length L contains L − k + 1 kmers, so the kmer count per partition can be estimated as <code>length_sum − instances × (k − 1)</code>. This is an <strong>overestimate</strong> of unique kmers: two distinct superkmers (different flanking contexts, same minimizer) can share kmers. The exact count of unique kmers is only known after enumerating and deduplicating them.</p>
-<p>Note: two superkmers sharing a kmer necessarily share the same minimizer and therefore always land in the same partition — no kmer can appear in two different partitions.</p>
-<h3 id="phase-1-provisional-index-and-spectrum">Phase 1 — provisional index and spectrum</h3>
+<h2 id="why-two-phases-are-needed">Why two phases are needed</h2>
+<p>Kmer indexing per partition proceeds in two phases. The separation is necessary because the exact number of surviving unique kmers is not known until after counting and filtering low-abundance kmers.</p>
+<h3 id="phase-1-provisional-mphf-kmer-spectrum">Phase 1 — provisional MPHF + kmer spectrum</h3>
+<p>Implemented in <code>obikpartitionner::KmerPartition::count_kmer()</code>.</p>
 <ol>
-<li>Enumerate all kmers from the dereplicated superkmers of the partition.</li>
-<li>Build a provisional MPHF over this key set; capacity is pre-allocated from the sidecar estimate (slight overestimate, harmless).</li>
-<li>Accumulate counts: for each kmer in each superkmer, <code>count[MPHF(kmer)] += sk.count()</code>.</li>
-<li>Compute the kmer frequency spectrum (histogram: occurrences → number of kmers).</li>
-<li>Apply count filter (e.g. discard singletons). After filtering, the exact number of surviving kmers is known.</li>
-<li>Discard the provisional MPHF.</li>
+<li><strong>Pass 1</strong>: read the dereplicated superkmer file; enumerate all unique canonical kmers into a <code>HashSet</code>. The exact kmer count is known after this pass.</li>
+<li><strong>Build a provisional MPHF</strong> (<code>GOFunction</code> from the <code>ph</code> crate) over the exact kmer set. Produces <code>mphf1.bin</code>.</li>
+<li><strong>Create <code>counts1.bin</code></strong>: one zero-initialised <code>u32</code> per MPHF slot (mmap'd).</li>
+<li><strong>Pass 2</strong>: re-read the dereplicated file; for each kmer, query <code>mphf1.get(kmer)</code> and atomically accumulate the superkmer count into <code>counts1[slot]</code>.</li>
+<li><strong>Build kmer frequency spectrum</strong> from <code>counts1</code>: histogram <code>{count → n_kmers}</code>, totals f0 (distinct kmers) and f1 (total abundance). Written to <code>kmer_spectrum_raw.json</code> per partition, then merged globally.</li>
 </ol>
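
The two passes above can be sketched with the standard library alone. This is a minimal illustration, not the `count_kmer` implementation: the `SuperKmer` struct is hypothetical, a `HashMap` from kmer to slot stands in for the `GOFunction` MPHF, and an in-memory `Vec<AtomicU32>` stands in for the mmap'd `counts1.bin`.

```rust
use std::collections::{BTreeMap, HashMap, HashSet};
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical stand-in for a dereplicated superkmer: its canonical kmers
/// (already encoded as u64) and the number of times the superkmer was seen.
pub struct SuperKmer {
    pub kmers: Vec<u64>,
    pub count: u32,
}

/// Returns (histogram {count -> n_kmers}, f0 = distinct kmers, f1 = total abundance).
pub fn count_partition(superkmers: &[SuperKmer]) -> (BTreeMap<u32, u64>, u64, u64) {
    // Pass 1: enumerate the exact set of distinct canonical kmers.
    let distinct: HashSet<u64> = superkmers
        .iter()
        .flat_map(|sk| sk.kmers.iter().copied())
        .collect();

    // Stand-in for the provisional MPHF: each distinct kmer gets a slot in 0..n.
    let slot_of: HashMap<u64, usize> = distinct
        .iter()
        .enumerate()
        .map(|(slot, &kmer)| (kmer, slot))
        .collect();

    // Stand-in for counts1.bin: one zero-initialised counter per slot.
    let counts: Vec<AtomicU32> = (0..slot_of.len()).map(|_| AtomicU32::new(0)).collect();

    // Pass 2: add each superkmer's count to the slot of every kmer it contains.
    for sk in superkmers {
        for kmer in &sk.kmers {
            counts[slot_of[kmer]].fetch_add(sk.count, Ordering::Relaxed);
        }
    }

    // Frequency spectrum: how many kmers were seen exactly c times.
    let mut histogram = BTreeMap::new();
    let mut f1 = 0u64;
    for counter in &counts {
        let c = counter.load(Ordering::Relaxed);
        *histogram.entry(c).or_insert(0u64) += 1;
        f1 += u64::from(c);
    }
    (histogram, counts.len() as u64, f1)
}
```

The atomic counters matter only if pass 2 is parallelised; the loop here is sequential to keep the sketch short.
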
-<h3 id="phase-2-definitive-index">Phase 2 — definitive index</h3>
-<p>Build a new MPHF over the filtered kmer set only, with the exact key count available. This is the persistent per-partition index used for all downstream operations (queries, set operations).</p>
+<p>Files produced per partition:</p>
+<div class="highlight"><pre><span></span><code>part_XXXXX/
+  mphf1.bin               — GOFunction (provisional MPHF, discarded after phase 2)
+  counts1.bin             — [u32; n_kmers] kmer counts, mmap&#39;d
+  kmer_spectrum_raw.json  — local frequency spectrum
+</code></pre></div>
+<h3 id="phase-2-definitive-mphf">Phase 2 — definitive MPHF</h3>
+<p>After filtering (applying a min-count threshold derived from the spectrum) and building the local De Bruijn graph + unitigs (see <a href="../pipeline/">Construction pipeline</a>), the exact filtered kmer set is available via <code>unitigs.bin</code>.</p>
+<p><code>MphfLayer::build</code> is called on the unitig file:</p>
+<ol>
+<li><strong>Pass 1</strong>: iterate all canonical kmers from <code>unitigs.bin</code> in parallel, build and store <code>mphf.bin</code> (ptr_hash).</li>
+<li><strong>Pass 2</strong>: iterate sequentially, fill <code>evidence.bin</code>, call the mode-specific <code>fill_slot</code> callback.</li>
+</ol>
+<p><code>mphf1.bin</code> and <code>counts1.bin</code> are no longer needed after phase 2 and can be deleted.</p>
 <hr />
-<h2 id="candidates">Candidates</h2>
+<h2 id="mphf-candidates">MPHF candidates</h2>
 <p><strong>boomphf</strong> (BBHash algorithm, maintained by 10X Genomics):</p>
 <ul>
 <li>~3.7 bits/key; mature crate, used in production bioinformatics (Pufferfish, Piscem)</li>
-<li>Parallel construction; well-tested with DNA kmer data at scale</li>
-<li>Drawback: largest space footprint; streaming construction (no exact count needed) was its main differentiator — irrelevant here since exact count is available at phase 2</li>
+<li>Supports streaming construction (no exact count needed)</li>
+<li>Drawback: largest space footprint; streaming advantage is irrelevant at phase 2 since the exact count is available</li>
 </ul>
 <p><strong>ptr_hash</strong> (PtrHash algorithm, Groot Koerkamp, SEA 2025):</p>
 <ul>
-<li>~2.4 bits/key; fastest queries (≥2.1× over alternatives, 8–12 ns/key for u64 in tight loops) and fastest construction (≥3.1×)</li>
-<li>Requires exact key count at construction — available at phase 2</li>
-<li>Drawback: published February 2025 — very young, no production track record</li>
+<li>~2.4 bits/key; fastest queries (≥2.1× over alternatives, 8–12 ns/key for u64) and fastest construction (≥3.1×)</li>
+<li>Requires exact key count at construction — available at both phases after pass 1</li>
+<li>Published February 2025, so it has little production track record; accepted given its performance profile and because each MPHF is independently rebuildable from its unitig file</li>
 </ul>
-<p><strong>FMPHGO</strong> (<code>ph</code> crate, Beling, ACM JEA 2023):</p>
+<p><strong>FMPH/FMPHGO</strong> (<code>ph</code> crate, Beling, ACM JEA 2023):</p>
 <ul>
-<li>~2.1 bits/key — most compact of the three; good query speed; parallelisable construction</li>
-<li>More established than ptr_hash; actively maintained</li>
-<li>Works well with overestimated capacity → natural fit for phase 1</li>
+<li>~2.1 bits/key — most compact; good query speed; deterministic construction</li>
+<li>Works well from an exact or slightly overestimated count</li>
+<li><code>GOFunction</code> (group-oriented variant) is the specific type used</li>
 </ul>
 <h2 id="mphf-choice-per-phase">MPHF choice per phase</h2>
-<p><strong>Phase 1</strong> (provisional, discarded after spectrum computation): FMPHGO. Tolerates overestimated capacity, compact, no need to optimise for query speed on a temporary structure.</p>
-<p><strong>Phase 2</strong> (persistent, queried repeatedly): <strong>ptr_hash</strong>. Exact key count is available at phase 2, so ptr_hash operates optimally. Its query speed (≥2.1× over FMPHGO) and construction speed (≥3.1×) are meaningful for the persistent index; the space overhead at 2.4 bits/key is acceptable. The crate's youth (Feb 2025) was previously a concern; it is now accepted given the performance profile and the fact that each layer MPHF is independently rebuildable from its unitig file if needed.</p>
-<p>boomphf is effectively eliminated: its space overhead is the largest and its streaming-construction advantage does not apply here.</p>
+<p><strong>Phase 1</strong> (provisional, discarded after spectrum computation): <code>ph::fmph::GOFunction</code>. Compact, fast to build from the exact post-dedup kmer set. Query speed is secondary — the structure is only used during pass 2 of <code>count_kmer</code>.</p>
+<p><strong>Phase 2</strong> (persistent, queried repeatedly): <strong>ptr_hash</strong>. The exact key count is available from the unitig index; ptr_hash's query speed (≥2.1×) and construction speed (≥3.1×), both relative to FMPH, are the decisive factors. The 2.4 bits/key overhead is acceptable.</p>
+<p>boomphf is eliminated: it has the largest space overhead, and its streaming-construction advantage does not apply here.</p>
 <hr />
 <h2 id="space-at-scale">Space at scale</h2>
 <p>For 1 024 partitions × 100 M kmers/partition (phase 2 index, after filtering):</p>
@@ -1374,58 +1295,54 @@
 <td>~31 GB</td>
 </tr>
 <tr>
-<td>FMPHGO</td>
+<td>FMPH</td>
 <td>2.1</td>
 <td>~27 GB</td>
 </tr>
 </tbody>
 </table>
 <p>For a human genome at 30× coverage with 1 024 partitions, realistic partition sizes are 3–30 M unique kmers → 1–8 MB per phase-2 MPHF, well within RAM.</p>
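
The totals in the table follow from keys × bits per key ÷ 8. A small helper, with a hypothetical name, makes the arithmetic reproducible:

```rust
/// Size in bytes of an MPHF over `n_keys` keys at `bits_per_key` bits each.
/// Illustrative helper only; real structures add small fixed overheads.
pub fn mphf_bytes(n_keys: u64, bits_per_key: f64) -> f64 {
    n_keys as f64 * bits_per_key / 8.0
}

/// Total size in decimal gigabytes for `partitions` partitions of equal size.
pub fn total_gb(partitions: u64, kmers_per_partition: u64, bits_per_key: f64) -> f64 {
    mphf_bytes(partitions * kmers_per_partition, bits_per_key) / 1e9
}
```

With 1 024 partitions of 100 M kmers, `total_gb` gives about 30.7 GB at 2.4 bits/key and about 26.9 GB at 2.1 bits/key, matching the rounded table entries.
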
-<h2 id="on-disk-and-mmap-considerations">On-disk and mmap considerations</h2>
-<p>All three are in-memory structures. Their internal representation is flat bit arrays (no heap pointers), making them serialisable as contiguous byte blobs and mmappable per partition. True zero-copy access would require rkyv integration; the <code>ph</code> crate currently uses serde, so loading involves a copy. Given per-partition MPHF sizes of 1–8 MB, the OS page cache handles this transparently — strict zero-copy is a refinement, not a blocker.</p>
-<p>No established Rust crate provides a natively on-disk MPHF. <strong>SSHash</strong> (Sparse and Skew Hash) is a complete kmer dictionary designed for disk access and is order-preserving (overlapping kmers receive consecutive indices → cache-friendly count access), but it is C++-only and covers more than just the MPHF layer.</p>
+<hr />
+<h2 id="ptr_hash-configuration-phase-2">ptr_hash configuration (phase 2)</h2>
+<div class="highlight"><pre><span></span><code><span class="k">type</span><span class="w"> </span><span class="nc">Mphf</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">PtrHash</span><span class="o">&lt;</span>
+<span class="w">    </span><span class="kt">u64</span><span class="p">,</span><span class="w">                              </span><span class="c1">// key: canonical kmer raw encoding</span>
+<span class="w">    </span><span class="n">CubicEps</span><span class="p">,</span><span class="w">                         </span><span class="c1">// bucket fn: 2.4 bits/key, λ=3.5, α=0.99</span>
+<span class="w">    </span><span class="n">CachelineEfVec</span><span class="o">&lt;</span><span class="nb">Vec</span><span class="o">&lt;</span><span class="n">CachelineEf</span><span class="o">&gt;&gt;</span><span class="p">,</span><span class="w"> </span><span class="c1">// remap: 11.6 bits/entry (Elias-Fano)</span>
+<span class="w">    </span><span class="n">Xx64</span><span class="p">,</span><span class="w">                             </span><span class="c1">// hasher: XXH3-64 with seed</span>
+<span class="w">    </span><span class="nb">Vec</span><span class="o">&lt;</span><span class="kt">u8</span><span class="o">&gt;</span><span class="p">,</span><span class="w">                          </span><span class="c1">// pilots</span>
+<span class="o">&gt;</span><span class="p">;</span>
+</code></pre></div>
+<p><strong>Hasher — <code>Xx64</code></strong>: canonical kmer raw values are left-aligned u64 with structural zeros in low bits (42 zeros for k=11, 2 zeros for k=31). <code>FxHash</code> (single multiply) distributes these poorly; <code>Xx64</code> (XXH3-64, seeded) handles structured input correctly.</p>
+<p><strong>Bucket function — <code>CubicEps</code></strong>: λ=3.5, α=0.99. A balanced tradeoff: construction is 2× slower than with <code>Linear/λ=3.0</code>, but the structure takes 20% less space. <code>default_compact</code> (λ=4.0) saves a further 12.5% of space at the cost of another 2× in construction time, so it was not chosen.</p>
+<p><strong>Remap — <code>CachelineEfVec</code></strong>: Elias-Fano variant packing 44 sorted 40-bit values per 64-byte cacheline (11.6 bits/value vs 32 for <code>Vec&lt;u32&gt;</code>). One cacheline per query; space win dominates at billion-scale key counts.</p>
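
The figures quoted for the hasher and the remap can be checked with two one-line formulas. The function names below are illustrative, and the first assumes a 2-bit-per-base encoding left-aligned in a `u64`, as the hasher paragraph describes.

```rust
/// Number of always-zero low bits in a kmer of length `k` stored
/// left-aligned in a u64 at 2 bits per base.
pub fn structural_zero_bits(k: u32) -> u32 {
    assert!((1..=32).contains(&k), "k must fit in a u64 at 2 bits per base");
    64 - 2 * k
}

/// Average bits per value when `values_per_line` values share one 64-byte cacheline.
pub fn bits_per_value(values_per_line: u32) -> f64 {
    (64.0 * 8.0) / f64::from(values_per_line)
}
```

For k=11 this gives 42 zero bits and for k=31 it gives 2; 44 values per cacheline works out to roughly 11.6 bits per value.
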
 <hr />
 <h2 id="multilayer-index-architecture">Multilayer index architecture</h2>
-<h3 id="motivation">Motivation</h3>
-<p>An index built from a single dataset A can be extended with a new dataset B without rebuilding. This supports incremental construction (adding species, samples, or sequencing runs) and enables set operations across heterogeneous sources.</p>
 <h3 id="layer-structure">Layer structure</h3>
-<p>Each layer is a self-contained unit:</p>
+<p>Each layer is a self-contained unit. See <a href="../obilayeredmap/">obilayeredmap</a> for the full on-disk layout. The MPHF-relevant files are:</p>
 <div class="highlight"><pre><span></span><code>layer_i/
-  unitigs.bin     — packed 2-bit nucleotide sequences
-  mphf.bin        — ptr_hash index (phase-2, exact key count)
-  evidence.bin    — [(unitig_id, rank)] per MPHF slot  (see unitig_evidence.md)
-  counts.bin      — [u32] per MPHF slot
+  unitigs.bin      — packed 2-bit nucleotide sequences (kmer evidence)
+  mphf.bin         — ptr_hash phase-2 MPHF
+  evidence.bin     — n × u32: (chunk_id: 25 bits | rank: 7 bits) per slot
 </code></pre></div>
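
A sketch of how one `evidence.bin` entry could be packed and read back. The field widths come from the layout above; the bit order (`chunk_id` in the high 25 bits), the little-endian byte order, and all function names are assumptions made for illustration.

```rust
const RANK_BITS: u32 = 7;
const CHUNK_BITS: u32 = 25;
const RANK_MASK: u32 = (1 << RANK_BITS) - 1;

/// Pack (chunk_id, rank) into one u32; returns None if either field overflows.
/// Assumed layout: chunk_id in the high 25 bits, rank in the low 7 bits.
pub fn pack_evidence(chunk_id: u32, rank: u32) -> Option<u32> {
    if chunk_id >= (1 << CHUNK_BITS) || rank > RANK_MASK {
        return None;
    }
    Some((chunk_id << RANK_BITS) | rank)
}

/// Inverse of `pack_evidence`: returns (chunk_id, rank).
pub fn unpack_evidence(entry: u32) -> (u32, u32) {
    (entry >> RANK_BITS, entry & RANK_MASK)
}

/// O(1) random access into the flat file: slot i lives at byte offset 4 * i.
/// Little-endian storage is assumed here.
pub fn read_slot(evidence: &[u8], slot: usize) -> Option<u32> {
    let start = slot.checked_mul(4)?;
    let end = start.checked_add(4)?;
    let b = evidence.get(start..end)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}
```

Because every entry has the same width, no offset table is needed: the MPHF slot is the array index.
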
-<p>Layers are <strong>disjoint</strong>: a canonical kmer belongs to exactly one layer. Layer 0 is built from dataset A. Adding dataset B proceeds as follows:</p>
+<p>Layers are <strong>disjoint</strong>: a canonical kmer belongs to exactly one layer. Layer 0 is built from dataset A. Adding dataset B:</p>
 <ol>
-<li>For each kmer in B: query layer 0 — if found, accumulate count into <code>counts_0[MPHF_0(kmer)]</code>.</li>
-<li>Collect all kmers of B not present in any existing layer → set <code>B \ A</code>.</li>
-<li>Build layer 1 from <code>B \ A</code> using the standard two-phase pipeline (spectrum, filter, ptr_hash).</li>
+<li>For each kmer in B: probe existing layers. If found, the kmer is already indexed.</li>
+<li>Collect kmers of B not present in any layer → set <code>B \ A</code>.</li>
+<li>Build layer 1 from <code>B \ A</code> (dereplicate → count → De Bruijn → unitigs → <code>MphfLayer::build</code>).</li>
 </ol>
-<p>Adding a third dataset C repeats the process: probe layer 0, then layer 1, then build layer 2 from <code>C \ A \ B</code>.</p>
 <h3 id="membership-verification">Membership verification</h3>
-<p>ptr_hash maps any input to a valid slot — it does not natively detect absent keys. Membership is verified using the evidence entry: decode the kmer from <code>(unitig_id, rank)</code> and compare to the query. A mismatch means the kmer is absent from this layer; probe the next layer.</p>
-<p>This makes the evidence layer load-bearing for correctness, not only for locality.</p>
+<p>ptr_hash maps any input to a valid slot — it does not natively detect absent keys. Membership is verified using the evidence entry: decode the kmer from <code>(chunk_id, rank)</code> and compare to the query. A mismatch means the kmer is absent from this layer; probe the next layer.</p>
 <h3 id="query-algorithm">Query algorithm</h3>
-<div class="highlight"><pre><span></span><code>fn query(kmer) → Option&lt;count&gt;:
-    for layer in layers:
-        slot = layer.mphf.query(kmer)
-        if layer.evidence.decode(slot) == kmer:
-            return Some(layer.counts[slot])
+<div class="highlight"><pre><span></span><code>fn query(kmer) → Option&lt;(layer_index, slot)&gt;:
+    for (i, layer) in layers.iter().enumerate():
+        slot = layer.mphf.index(kmer)
+        if layer.evidence.decode(slot) matches kmer:
+            return Some((i, slot))
    return None
 </code></pre></div>
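
The pseudocode above can be made concrete with a toy layer type. Everything here is a stand-in: `Layer::slot_of` plays the role of the MPHF lookup (it returns a slot for any input, present or not) and `Layer::decode` plays the role of reading the kmer back through `evidence.bin` and `unitigs.bin`. Neither reflects the real `obilayeredmap` API.

```rust
/// Toy layer: `stored[slot]` is the kmer that owns that slot.
pub struct Layer {
    pub stored: Vec<u64>,
}

impl Layer {
    /// Stand-in for the MPHF: always yields a valid slot, even for absent keys.
    /// Callers must ensure the layer is non-empty.
    fn slot_of(&self, kmer: u64) -> usize {
        (kmer % self.stored.len() as u64) as usize
    }

    /// Stand-in for decoding the kmer referenced by the evidence entry.
    fn decode(&self, slot: usize) -> u64 {
        self.stored[slot]
    }
}

/// Probe layers in order; a hit requires the decoded kmer to equal the query.
pub fn query(layers: &[Layer], kmer: u64) -> Option<(usize, usize)> {
    for (i, layer) in layers.iter().enumerate() {
        if layer.stored.is_empty() {
            continue; // nothing to probe in an empty layer
        }
        let slot = layer.slot_of(kmer);
        if layer.decode(slot) == kmer {
            return Some((i, slot));
        }
        // Mismatch: the slot belongs to another kmer, so try the next layer.
    }
    None
}
```

The equality check is what turns "some slot" into a membership answer; without it an absent kmer would silently alias onto an unrelated slot.
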
-<p>Expected probe depth: 1 for kmers present in layer 0, increasing for rare kmers added in later layers. In practice, the dominant dataset (largest A) should be layer 0 to minimise average probe depth.</p>
-<h3 id="layer-count-and-probe-cost">Layer count and probe cost</h3>
-<p>Each probe is a ptr_hash lookup (~10 ns) plus one evidence decode (two array accesses). For L layers the worst case is L probes + 1 None. In practice L is small (2–5 for typical multi-species databases). No global data structure is needed to route queries; the layer chain is traversed in order.</p>
+<p>Expected probe depth: 1 for kmers in layer 0. Each probe is a ptr_hash lookup (~10 ns) plus one evidence decode.</p>
 <h3 id="merging-layers">Merging layers</h3>
-<p>Two layer chains can be merged by re-indexing their union through the standard pipeline. This is expensive (full rebuild) but produces an optimal single-layer index. Merge is a maintenance operation, not a query-path requirement.</p>
-<h2 id="open-questions">Open questions</h2>
-<ul>
-<li>Confirm actual partition sizes and overestimation factor on representative metagenomic datasets.</li>
-<li><strong>rkyv integration</strong>: all flat arrays in a layer (evidence, counts, presence/absence matrix) map trivially to <code>rkyv::Archive</code> — fixed-size element types, no heap indirection. The presence/absence matrix is the strongest case: at 10 M kmers × 1 000 samples ≈ 1.25 GB per partition, zero-copy mmap via rkyv avoids loading the entire matrix at open time, letting the OS page cache serve only accessed pages. ptr_hash itself is internally a flat bit array and is structurally compatible with rkyv, but requires either native crate support or a wrapper. Assess the wrapper cost and whether ptr_hash is willing to adopt rkyv upstream.</li>
-<li>Keep SSHash in mind if the indexing architecture is reconsidered at a higher level.</li>
-<li>Determine optimal layer ordering heuristic (by kmer count? by query frequency?) for multi-species databases.</li>
-</ul>
+<p>Two layer chains can be merged by re-indexing their union through the full pipeline. This is expensive (full rebuild) but produces an optimal single-layer index. Merge is a maintenance operation, not a query-path requirement.</p>