docs: clarify MPHF indexing, storage layout, and distance traits

Formalize the two-phase MPHF indexing architecture and update Phase 6 to use `evidence.bin` for direct kmer extraction. Simplify the evidence and unitig storage layouts to flat packed formats enabling O(1) random access. Introduce aggregation traits (`ColumnWeights`, `CountPartials`, `BitPartials`) to support additive distance metric decomposition across partitions. Narrow the documented scope from metagenomic to individual genome datasets, and replace speculative open questions with concrete implementation specifications.
Eric Coissac
2026-05-17 10:20:22 +08:00
parent cf693f17f2
commit f36b095ce2
17 changed files with 916 additions and 1031 deletions
+210 -307
@@ -721,49 +721,10 @@
</li>
<li class="md-nav__item">
<a href="#four-usage-modes" class="md-nav__link">
<a href="#three-usage-modes" class="md-nav__link">
<span class="md-ellipsis">
Four usage modes
</span>
</a>
<nav class="md-nav" aria-label="Four usage modes">
<ul class="md-nav__list">
<li class="md-nav__item">
<a href="#payload-for-modes-24-persistentcompactintmatrix" class="md-nav__link">
<span class="md-ellipsis">
Payload for modes 2/4: PersistentCompactIntMatrix
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#payload-for-mode-3-persistentbitmatrix" class="md-nav__link">
<span class="md-ellipsis">
Payload for mode 3: PersistentBitMatrix
</span>
</a>
</li>
</ul>
</nav>
</li>
<li class="md-nav__item">
<a href="#payload-architecture" class="md-nav__link">
<span class="md-ellipsis">
Payload architecture
Three usage modes
</span>
</a>
@@ -771,10 +732,10 @@
</li>
<li class="md-nav__item">
<a href="#three-level-hierarchy" class="md-nav__link">
<a href="#mphflayer-autonomous-kmer-slot-mapping" class="md-nav__link">
<span class="md-ellipsis">
Three-level hierarchy
MphfLayer — autonomous kmer → slot mapping
</span>
</a>
@@ -782,18 +743,39 @@
</li>
<li class="md-nav__item">
<a href="#layer-file-layout" class="md-nav__link">
<a href="#layerd-layerdata-mphf-payload" class="md-nav__link">
<span class="md-ellipsis">
Layer file layout
Layer&lt;D: LayerData&gt; — MPHF + payload
</span>
</a>
<nav class="md-nav" aria-label="Layer file layout">
<ul class="md-nav__list">
<li class="md-nav__item">
</li>
<li class="md-nav__item">
<a href="#layeredstores-and-aggregation-traits" class="md-nav__link">
<span class="md-ellipsis">
LayeredStore&lt;S&gt; and aggregation traits
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#on-disk-structure" class="md-nav__link">
<span class="md-ellipsis">
On-disk structure
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#evidence-encoding" class="md-nav__link">
<span class="md-ellipsis">
@@ -802,11 +784,6 @@
</span>
</a>
</li>
</ul>
</nav>
</li>
<li class="md-nav__item">
@@ -818,17 +795,6 @@
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#build-path" class="md-nav__link">
<span class="md-ellipsis">
Build path
</span>
</a>
</li>
<li class="md-nav__item">
@@ -862,28 +828,6 @@
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#relationship-to-target-architecture" class="md-nav__link">
<span class="md-ellipsis">
Relationship to target architecture
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#open-questions" class="md-nav__link">
<span class="md-ellipsis">
Open questions
</span>
</a>
</li>
</ul>
@@ -1106,49 +1050,10 @@
</li>
<li class="md-nav__item">
<a href="#four-usage-modes" class="md-nav__link">
<a href="#three-usage-modes" class="md-nav__link">
<span class="md-ellipsis">
Four usage modes
</span>
</a>
<nav class="md-nav" aria-label="Four usage modes">
<ul class="md-nav__list">
<li class="md-nav__item">
<a href="#payload-for-modes-24-persistentcompactintmatrix" class="md-nav__link">
<span class="md-ellipsis">
Payload for modes 2/4: PersistentCompactIntMatrix
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#payload-for-mode-3-persistentbitmatrix" class="md-nav__link">
<span class="md-ellipsis">
Payload for mode 3: PersistentBitMatrix
</span>
</a>
</li>
</ul>
</nav>
</li>
<li class="md-nav__item">
<a href="#payload-architecture" class="md-nav__link">
<span class="md-ellipsis">
Payload architecture
Three usage modes
</span>
</a>
@@ -1156,10 +1061,10 @@
</li>
<li class="md-nav__item">
<a href="#three-level-hierarchy" class="md-nav__link">
<a href="#mphflayer-autonomous-kmer-slot-mapping" class="md-nav__link">
<span class="md-ellipsis">
Three-level hierarchy
MphfLayer — autonomous kmer → slot mapping
</span>
</a>
@@ -1167,18 +1072,39 @@
</li>
<li class="md-nav__item">
<a href="#layer-file-layout" class="md-nav__link">
<a href="#layerd-layerdata-mphf-payload" class="md-nav__link">
<span class="md-ellipsis">
Layer file layout
Layer&lt;D: LayerData&gt; — MPHF + payload
</span>
</a>
<nav class="md-nav" aria-label="Layer file layout">
<ul class="md-nav__list">
<li class="md-nav__item">
</li>
<li class="md-nav__item">
<a href="#layeredstores-and-aggregation-traits" class="md-nav__link">
<span class="md-ellipsis">
LayeredStore&lt;S&gt; and aggregation traits
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#on-disk-structure" class="md-nav__link">
<span class="md-ellipsis">
On-disk structure
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#evidence-encoding" class="md-nav__link">
<span class="md-ellipsis">
@@ -1187,11 +1113,6 @@
</span>
</a>
</li>
</ul>
</nav>
</li>
<li class="md-nav__item">
@@ -1203,17 +1124,6 @@
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#build-path" class="md-nav__link">
<span class="md-ellipsis">
Build path
</span>
</a>
</li>
<li class="md-nav__item">
@@ -1247,28 +1157,6 @@
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#relationship-to-target-architecture" class="md-nav__link">
<span class="md-ellipsis">
Relationship to target architecture
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#open-questions" class="md-nav__link">
<span class="md-ellipsis">
Open questions
</span>
</a>
</li>
</ul>
@@ -1290,10 +1178,10 @@
<h1 id="obilayeredmap-layered-kmer-index-crate">obilayeredmap — layered kmer index crate</h1>
<h2 id="purpose">Purpose</h2>
<p><code>obilayeredmap</code> implements a persistent, incrementally extensible kmer index. The index is organised in three levels: <strong>collection → partition → layer</strong>. Each layer covers a disjoint kmer set (kmers absent from all earlier layers), wrapping a <code>ptr_hash</code> MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.</p>
<p><code>obilayeredmap</code> implements a persistent, incrementally extensible kmer index. The index is organised in three levels: <strong>index root → partition → layer</strong>. Each layer covers a disjoint kmer set and wraps a <code>ptr_hash</code> MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.</p>
<hr />
<h2 id="four-usage-modes">Four usage modes</h2>
<p>The MPHF + evidence infrastructure is fixed for all modes. The <strong>payload</strong> — data associated with each slot — is orthogonal and varies by mode.</p>
<h2 id="three-usage-modes">Three usage modes</h2>
<p>The MPHF + evidence infrastructure is the same for all modes. The <strong>payload</strong> varies.</p>
<table>
<thead>
<tr>
@@ -1317,29 +1205,46 @@
<td><code>counts/</code> directory</td>
</tr>
<tr>
<td>3. Presence/absence matrix</td>
<td>3. Presence/absence</td>
<td>which genomes contain each kmer</td>
<td><code>PersistentBitMatrix</code></td>
<td><code>presence/</code> directory</td>
</tr>
<tr>
<td>4. Count matrix</td>
<td>occurrences per kmer per genome</td>
<td><code>PersistentCompactIntMatrix</code></td>
<td><code>counts/</code> directory</td>
</tr>
</tbody>
</table>
<p>Both <code>PersistentCompactIntMatrix</code> and <code>PersistentBitMatrix</code> come from the <code>obicompactvec</code> crate. Mode 3 has a build path (<code>Layer::&lt;PersistentBitMatrix&gt;::build_presence</code>); mode 4 is not yet implemented.</p>
<h3 id="payload-for-modes-24-persistentcompactintmatrix">Payload for modes 2/4: PersistentCompactIntMatrix</h3>
<p><code>PersistentCompactIntMatrix</code> is a column-major matrix stored in a directory: one <code>col_NNNNNN.pciv</code> file per column, plus a <code>meta.json</code>. Each column is a <code>PersistentCompactIntVec</code> — a mmap'd PCIV file with a <code>u8</code> primary array (255 = overflow sentinel), a sorted overflow section of <code>(slot: u64, value: u32)</code> entries, and a sparse L1-fitting index.</p>
<p>Mode 2 writes 1 column per layer (one sample). Mode 4 writes G columns (one per genome). <code>read(slot)</code> returns <code>Box&lt;[u32]&gt;</code> — the full row across all columns.</p>
<h3 id="payload-for-mode-3-persistentbitmatrix">Payload for mode 3: PersistentBitMatrix</h3>
<p><code>PersistentBitMatrix</code> is a column-major bit matrix stored in a directory: one <code>col_NNNNNN.pbiv</code> per genome, plus <code>meta.json</code>. Each column is a <code>PersistentBitVec</code> — a mmap'd PBIV file with u64 word-level bulk operations (AND, OR, XOR, NOT, POPCNT, Jaccard, Hamming). <code>read(slot)</code> returns <code>Box&lt;[bool]&gt;</code> — the presence vector across all genomes.</p>
<p>Column-major layout makes per-genome set operations cache-friendly; the full row is assembled on demand at query time.</p>
<p>Both <code>PersistentCompactIntMatrix</code> and <code>PersistentBitMatrix</code> come from the <code>obicompactvec</code> crate.</p>
<hr />
<h2 id="payload-architecture">Payload architecture</h2>
<p>The payload is orthogonal to the MPHF + evidence layer. <code>Layer</code> is parameterised by <code>D: LayerData</code>:</p>
<h2 id="mphflayer-autonomous-kmer-slot-mapping">MphfLayer — autonomous kmer → slot mapping</h2>
<p><code>MphfLayer</code> encapsulates the MPHF + evidence + unitig spine for one layer. It is independent of any payload data.</p>
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="nc">MphfLayer</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">mphf</span><span class="p">:</span><span class="w"> </span><span class="nc">Mphf</span><span class="p">,</span>
<span class="w"> </span><span class="n">evidence</span><span class="p">:</span><span class="w"> </span><span class="nc">Evidence</span><span class="p">,</span>
<span class="w"> </span><span class="n">unitigs</span><span class="p">:</span><span class="w"> </span><span class="nc">UnitigFileReader</span><span class="p">,</span>
<span class="w"> </span><span class="n">n</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">,</span><span class="w"> </span><span class="c1">// number of indexed kmers = number of MPHF slots</span>
<span class="p">}</span>
</code></pre></div>
<p>Public API:</p>
<div class="highlight"><pre><span></span><code><span class="k">impl</span><span class="w"> </span><span class="n">MphfLayer</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">open</span><span class="p">(</span><span class="n">dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&amp;</span><span class="nc">Path</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">OLMResult</span><span class="o">&lt;</span><span class="bp">Self</span><span class="o">&gt;</span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">find</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">kmer</span><span class="p">:</span><span class="w"> </span><span class="nc">CanonicalKmer</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nb">Option</span><span class="o">&lt;</span><span class="kt">usize</span><span class="o">&gt;</span><span class="w"> </span><span class="c1">// Some(slot) or None</span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">n</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="kt">usize</span>
<span class="w"> </span><span class="nc">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">unitig_writer</span><span class="p">(</span><span class="n">dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&amp;</span><span class="nc">Path</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">OLMResult</span><span class="o">&lt;</span><span class="n">UnitigFileWriter</span><span class="o">&gt;</span>
<span class="w"> </span><span class="k">pub</span><span class="p">(</span><span class="k">crate</span><span class="p">)</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build</span><span class="p">(</span>
<span class="w"> </span><span class="n">dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&amp;</span><span class="nc">Path</span><span class="p">,</span>
<span class="w"> </span><span class="n">fill_slot</span><span class="p">:</span><span class="w"> </span><span class="kp">&amp;</span><span class="nc">mut</span><span class="w"> </span><span class="k">impl</span><span class="w"> </span><span class="nb">FnMut</span><span class="p">(</span><span class="kt">usize</span><span class="p">,</span><span class="w"> </span><span class="n">CanonicalKmer</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">OLMResult</span><span class="o">&lt;</span><span class="p">()</span><span class="o">&gt;</span><span class="p">,</span>
<span class="w"> </span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">OLMResult</span><span class="o">&lt;</span><span class="kt">usize</span><span class="o">&gt;</span>
<span class="p">}</span>
</code></pre></div>
<p><code>find</code> returns <code>Some(slot)</code> only after verifying via evidence that the kmer is actually indexed. It returns <code>None</code> for absent keys (ptr_hash maps any input to a valid slot; evidence verification is the only correct-membership test).</p>
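<p>The membership check can be illustrated with a minimal sketch. It is not the crate's implementation: <code>ToyLayer</code> and its methods are hypothetical, a plain modulo stands in for the <code>ptr_hash</code> MPHF, and the evidence stores the key itself rather than a location in the unitig file. What it shows is why verification is needed: the hash maps every input to a valid slot, so only the comparison against the stored evidence separates indexed keys from absent ones.</p>

```rust
/// Toy stand-in for one layer (hypothetical, for illustration only).
/// `key % n` plays the role of the MPHF: it maps ANY key to a slot in 0..n.
/// `evidence[slot]` records which key actually owns that slot.
pub struct ToyLayer {
    evidence: Vec<u64>,
}

impl ToyLayer {
    /// Places each key at `key % n`; returns None if two keys collide,
    /// i.e. if the toy hash is not injective on this key set.
    pub fn build(keys: &[u64]) -> Option<Self> {
        let n = keys.len();
        let mut slots: Vec<Option<u64>> = vec![None; n];
        for &key in keys {
            let slot = (key % n as u64) as usize;
            if slots[slot].is_some() {
                return None;
            }
            slots[slot] = Some(key);
        }
        // n keys in n distinct slots: every slot is filled.
        Some(Self { evidence: slots.into_iter().flatten().collect() })
    }

    /// Returns Some(slot) only if the evidence confirms the key is indexed.
    pub fn find(&self, key: u64) -> Option<usize> {
        let n = self.evidence.len();
        if n == 0 {
            return None;
        }
        let slot = (key % n as u64) as usize; // always a valid slot
        (self.evidence[slot] == key).then_some(slot)
    }
}
```

<p>With the keys 10, 21, 32 and 43, the absent key 14 hashes to the same slot as 10 and is rejected only because the stored evidence differs.</p>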
<p><code>build</code> runs two sequential passes over <code>unitigs.bin</code>:</p>
<ol>
<li><strong>Pass 1</strong>: iterate all canonical kmers in parallel via rayon, construct and store <code>mphf.bin</code>. <code>new_from_par_iter</code> avoids materialising a full key <code>Vec</code>.</li>
<li><strong>Pass 2</strong>: iterate again sequentially, fill <code>evidence.bin</code>, call <code>fill_slot(slot, kmer)</code> once per kmer for payload population. A compact <code>n/8</code>-byte seen-bitset verifies MPHF injectivity inline.</li>
</ol>
<p>For empty layers (n = 0), <code>build</code> returns <code>Ok(0)</code> immediately after creating empty <code>mphf.bin</code> and <code>evidence.bin</code>.</p>
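<p>The inline injectivity check of pass 2 can be sketched as follows, assuming one bit per slot in a bitset of <code>n/8</code> bytes (rounded up). The function name <code>check_injective</code> and its error type are hypothetical; the crate reports such failures as <code>OLMError::Mphf</code>, whereas this sketch simply returns the offending slot.</p>

```rust
/// Verifies that `slots` hits each index in 0..n at most once, using one
/// bit per slot. Returns Err(slot) on the first duplicate or out-of-bounds
/// slot. Hypothetical helper, not the crate's API.
pub fn check_injective(
    n: usize,
    slots: impl IntoIterator<Item = usize>,
) -> Result<(), usize> {
    let mut seen = vec![0u8; (n + 7) / 8]; // n/8 bytes, rounded up
    for slot in slots {
        if slot >= n {
            return Err(slot); // out of bounds
        }
        let byte = slot / 8;
        let bit = 1u8 << (slot % 8);
        if seen[byte] & bit != 0 {
            return Err(slot); // duplicate: the mapping is not injective
        }
        seen[byte] |= bit;
    }
    Ok(())
}
```

<p>Because the bitset costs one bit per key, the check can run inside the same sequential pass that fills the evidence, with no extra pass over the data.</p>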
<hr />
<h2 id="layerd-layerdata-mphf-payload">Layer&lt;D: LayerData&gt; — MPHF + payload</h2>
<p><code>Layer&lt;D&gt;</code> pairs an <code>MphfLayer</code> with one payload store.</p>
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">trait</span><span class="w"> </span><span class="n">LayerData</span><span class="p">:</span><span class="w"> </span><span class="nb">Sized</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="k">type</span><span class="w"> </span><span class="nc">Item</span><span class="p">;</span>
<span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">open</span><span class="p">(</span><span class="n">layer_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&amp;</span><span class="nc">Path</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">OLMResult</span><span class="o">&lt;</span><span class="bp">Self</span><span class="o">&gt;</span><span class="p">;</span>
@@ -1347,10 +1252,8 @@
<span class="p">}</span>
<span class="k">pub</span><span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="nc">Layer</span><span class="o">&lt;</span><span class="n">D</span><span class="p">:</span><span class="w"> </span><span class="nc">LayerData</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">()</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">mphf</span><span class="p">:</span><span class="w"> </span><span class="nc">Mphf</span><span class="p">,</span>
<span class="w"> </span><span class="n">evidence</span><span class="p">:</span><span class="w"> </span><span class="nc">Evidence</span><span class="p">,</span>
<span class="w"> </span><span class="n">unitigs</span><span class="p">:</span><span class="w"> </span><span class="nc">UnitigFileReader</span><span class="p">,</span>
<span class="w"> </span><span class="n">data</span><span class="p">:</span><span class="w"> </span><span class="nc">D</span><span class="p">,</span>
<span class="w"> </span><span class="n">mphf</span><span class="p">:</span><span class="w"> </span><span class="nc">MphfLayer</span><span class="p">,</span>
<span class="w"> </span><span class="n">data</span><span class="p">:</span><span class="w"> </span><span class="nc">D</span><span class="p">,</span>
<span class="p">}</span>
<span class="k">pub</span><span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="nc">Hit</span><span class="o">&lt;</span><span class="n">T</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">()</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span>
@@ -1358,8 +1261,7 @@
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="n">data</span><span class="p">:</span><span class="w"> </span><span class="nc">T</span><span class="p">,</span>
<span class="p">}</span>
</code></pre></div>
<p><code>LayerData</code> covers the <strong>read path only</strong> (<code>open</code> + <code>read</code>). The write path (build) is intentionally not in the trait — build signatures differ between modes and forcing this into a trait would require an associated <code>Context</code> type with no benefit over specialized <code>impl</code> blocks.</p>
<p>Implemented concrete types:</p>
<p><code>LayerData</code> covers the <strong>read path only</strong> (<code>open</code> + <code>read</code>). Build signatures differ between modes and are not in the trait.</p>
<table>
<thead>
<tr>
@@ -1377,87 +1279,22 @@
<tr>
<td><code>PersistentCompactIntMatrix</code></td>
<td><code>Box&lt;[u32]&gt;</code></td>
<td>modes 2/4: one count per column</td>
<td>mode 2 — count matrix (one u32 per column per slot)</td>
</tr>
<tr>
<td><code>PersistentBitMatrix</code></td>
<td><code>Box&lt;[bool]&gt;</code></td>
<td>mode 3 — one presence bit per column</td>
<td>mode 3 — presence matrix (one bit per genome per slot)</td>
</tr>
</tbody>
</table>
<p><code>LayeredMap</code> mirrors the same parameterisation: <code>LayeredMap&lt;D: LayerData = ()&gt;</code>.</p>
<hr />
<h2 id="three-level-hierarchy">Three-level hierarchy</h2>
<div class="highlight"><pre><span></span><code>index_root/ ← LayeredMap (collection)
meta.json
part_00000/ ← Partition
layer_0/ ← Layer
mphf.bin
unitigs.bin
unitigs.bin.idx
evidence.bin
counts/ [modes 2/4]
meta.json {&quot;n&quot;: N, &quot;n_cols&quot;: 1}
col_000000.pciv
presence/ [mode 3]
meta.json {&quot;n&quot;: N, &quot;n_cols&quot;: G}
col_000000.pbiv
col_000001.pbiv
...
layer_1/
...
part_00001/
layer_0/
...
</code></pre></div>
<p><strong>Collection</strong> (<code>index_root/</code>): global metadata — kmer size k, number of partitions, layer count, sample registry.</p>
<p><strong>Partition</strong> (<code>part_XXXXX/</code>): one directory per hash bucket. All kmers whose canonical minimiser hashes to bucket X land in <code>part_XXXXX</code>. Partitions are independent and can be processed in parallel. The partition count and routing scheme (minimiser → bucket) are fixed at collection creation and recorded in <code>meta.json</code>.</p>
<p><strong>Layer</strong> (<code>layer_N/</code>): within a partition, a layer is the MPHF and its associated data for one dataset addition. Layer 0 is built from the first dataset A; layer 1 covers kmers in B not present in layer 0; and so on. Layers within a partition are disjoint: each kmer belongs to exactly one layer.</p>
<hr />
<h2 id="layer-file-layout">Layer file layout</h2>
<div class="highlight"><pre><span></span><code>layer_N/
mphf.bin — ptr_hash MPHF (epserde, ptr_hash native format)
unitigs.bin — packed 2-bit nucleotide sequences (obiskio binary format)
unitigs.bin.idx — UIDX index: n_unitigs, n_kmers, seqls[], packed_offsets[]
evidence.bin — u32 per MPHF slot: (unitig_id: 25 | rank: 7)
counts/ — [modes 2/4] PersistentCompactIntMatrix
presence/ — [mode 3] PersistentBitMatrix
</code></pre></div>
<p><code>unitigs.bin</code> is the packed-2-bit sequence file produced by <code>obiskio::UnitigFileWriter</code>. The companion <code>.idx</code> file stores: magic <code>UIDX</code>, <code>n_unitigs: u32</code>, <code>n_kmers: u64</code>, <code>seqls: [u8; n_unitigs]</code> (kmer count - 1 per chunk), and <code>packed_offsets: [u32; n_unitigs + 1]</code> (byte offsets into <code>unitigs.bin</code>, sentinel-terminated). This gives O(1) random access to any unitig and the total kmer count without scanning the sequence file.</p>
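<p>The O(1) access provided by a sentinel-terminated offset table can be sketched as below. This is an illustration under assumptions: <code>unitig_bytes</code> is a hypothetical helper, the offsets are taken as already loaded into memory, and parsing of the <code>UIDX</code> header is omitted.</p>

```rust
/// Returns the packed bytes of unitig `i`, given the raw sequence bytes and
/// the offset table with n_unitigs + 1 entries (the last one is the sentinel).
/// Hypothetical helper: the real reader may expose a different interface.
pub fn unitig_bytes<'a>(
    data: &'a [u8],
    packed_offsets: &[u32],
    i: usize,
) -> Option<&'a [u8]> {
    let start = *packed_offsets.get(i)? as usize;
    // The sentinel guarantees entry i + 1 exists for every valid unitig.
    let end = *packed_offsets.get(i.checked_add(1)?)? as usize;
    data.get(start..end)
}
```

<p>Two table lookups and one slice operation are enough, whatever the position of the unitig in the file.</p>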
<h3 id="evidence-encoding">Evidence encoding</h3>
<p>Evidence maps each MPHF slot to its kmer's location in the unitig file. It serves two roles: membership verification (ptr_hash maps any input to a valid slot; decoding evidence and comparing to the query detects absent keys) and kmer reconstruction.</p>
<div class="highlight"><pre><span></span><code>slot s → unitig_id: u25 | rank: u7
</code></pre></div>
<p>Packed into a <code>u32</code> (25 + 7 = 32 bits, none spare). Decoding:</p>
<div class="highlight"><pre><span></span><code>kmer = unitigs[unitig_id][rank .. rank + k] // 2-bit packed slice
</code></pre></div>
<p><code>rank</code> is the kmer's 0-based index within the unitig (kmer units, not nucleotides). For k=31, m=11, the structural maximum is k - m + 1 = 21 kmers per unitig; the empirical maximum observed is ~46 kmers. A <code>u7</code> (0 to 127) is sufficient.</p>
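<p>A minimal sketch of the evidence packing, assuming <code>unitig_id</code> occupies the high 25 bits and <code>rank</code> the low 7 bits. The field order and the names <code>pack</code> and <code>unpack</code> are assumptions made for illustration; the source only states the two field widths.</p>

```rust
const RANK_BITS: u32 = 7;
const RANK_MASK: u32 = (1 << RANK_BITS) - 1; // 127
const MAX_UNITIG_ID: u32 = (1 << 25) - 1;

/// Packs (unitig_id, rank) into one u32; None if either field overflows.
/// Assumed layout: unitig_id in the high 25 bits, rank in the low 7 bits.
pub fn pack(unitig_id: u32, rank: u32) -> Option<u32> {
    if unitig_id > MAX_UNITIG_ID || rank > RANK_MASK {
        return None;
    }
    Some((unitig_id << RANK_BITS) | rank)
}

/// Inverse of `pack`: returns (unitig_id, rank).
pub fn unpack(evidence: u32) -> (u32, u32) {
    (evidence >> RANK_BITS, evidence & RANK_MASK)
}
```

<p>Packing the largest values of both fields yields <code>u32::MAX</code>, which shows that the two fields together fill the whole word.</p>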
<hr />
<h2 id="ptr_hash-configuration">ptr_hash configuration</h2>
<p>The MPHF per layer is configured as:</p>
<div class="highlight"><pre><span></span><code><span class="k">type</span><span class="w"> </span><span class="nc">Mphf</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">PtrHash</span><span class="o">&lt;</span>
<span class="w"> </span><span class="kt">u64</span><span class="p">,</span><span class="w"> </span><span class="c1">// key type: canonical kmer raw encoding</span>
<span class="w"> </span><span class="n">CubicEps</span><span class="p">,</span><span class="w"> </span><span class="c1">// bucket fn: balanced (2.4 bits/key, λ=3.5)</span>
<span class="w"> </span><span class="n">CachelineEfVec</span><span class="o">&lt;</span><span class="nb">Vec</span><span class="o">&lt;</span><span class="n">CachelineEf</span><span class="o">&gt;&gt;</span><span class="p">,</span><span class="w"> </span><span class="c1">// remap: 11.6 bits/entry vs 32 for Vec&lt;u32&gt;</span>
<span class="w"> </span><span class="n">Xx64</span><span class="p">,</span><span class="w"> </span><span class="c1">// hasher: XXH3-64 with seed, handles structured keys</span>
<span class="w"> </span><span class="nb">Vec</span><span class="o">&lt;</span><span class="kt">u8</span><span class="o">&gt;</span><span class="p">,</span><span class="w"> </span><span class="c1">// pilots</span>
<span class="o">&gt;</span><span class="p">;</span>
</code></pre></div>
<p><strong>Hasher choice — <code>Xx64</code>:</strong> k-mer raw values are left-aligned u64 with structural zeros in low bits (42 zeros for k=11, 2 zeros for k=31). <code>FxHash</code> (single multiply) distributes these poorly. <code>Xx64</code> (XXH3 64-bit, seeded) handles structured input correctly.</p>
<p><strong>Bucket function — <code>CubicEps</code> with <code>PtrHashParams::&lt;CubicEps&gt;::default()</code>:</strong> λ=3.5, α=0.99. Balanced tradeoff: 2× slower construction than <code>Linear/λ=3.0</code> (the <code>default_fast</code> preset), 20% less space. <code>default_compact</code> (λ=4.0) saves a further 12.5% at 2× more construction time and reduced reliability — not chosen.</p>
<p><strong>Remap — <code>CachelineEfVec</code>:</strong> Elias-Fano variant packing 44 sorted 40-bit values per 64-byte cacheline (11.6 bits/value vs 32 for <code>Vec&lt;u32&gt;</code>). Already a transitive dependency of <code>ptr_hash</code>. One cacheline per query vs one u32 read; space win dominates for billion-scale key sets.</p>
<hr />
<h2 id="build-path">Build path</h2>
<p>The build path is not part of <code>LayerData</code>. Each mode exposes its own <code>impl Layer&lt;D&gt;::build</code> with the exact signature it needs. Two private module-level helpers avoid code duplication:</p>
<p><strong><code>build_mphf(out_dir, n) -&gt; OLMResult&lt;Mphf&gt;</code></strong>: first pass — opens <code>unitigs.bin</code>, iterates all canonical kmers in parallel via <code>new_from_par_iter</code>, stores <code>mphf.bin</code>. O(n).</p>
<p><strong><code>build_second_pass(out_dir, n, mphf, fill_slot) -&gt; OLMResult&lt;()&gt;</code></strong>: second pass — opens <code>unitigs.bin</code> again, fills <code>evidence.bin</code> and a compact n/8-byte seen-bitset (MPHF correctness check inline), calls <code>fill_slot(slot, kmer)</code> once per kmer for the mode-specific payload. O(n).</p>
<p><strong>Build signatures:</strong></p>
<div class="highlight"><pre><span></span><code><span class="c1">// mode 1</span>
<span class="k">impl</span><span class="w"> </span><span class="n">Layer</span><span class="o">&lt;</span><span class="p">()</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build</span><span class="p">(</span><span class="n">out_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&amp;</span><span class="nc">Path</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">OLMResult</span><span class="o">&lt;</span><span class="kt">usize</span><span class="o">&gt;</span>
<span class="p">}</span>
<span class="c1">// modes 2/4</span>
<span class="c1">// mode 2</span>
<span class="k">impl</span><span class="w"> </span><span class="n">Layer</span><span class="o">&lt;</span><span class="n">PersistentCompactIntMatrix</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build</span><span class="p">(</span><span class="n">out_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&amp;</span><span class="nc">Path</span><span class="p">,</span><span class="w"> </span><span class="n">count_of</span><span class="p">:</span><span class="w"> </span><span class="nc">impl</span><span class="w"> </span><span class="nb">Fn</span><span class="p">(</span><span class="n">CanonicalKmer</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="kt">u32</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">OLMResult</span><span class="o">&lt;</span><span class="kt">usize</span><span class="o">&gt;</span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build_from_map</span><span class="p">(</span><span class="n">out_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&amp;</span><span class="nc">Path</span><span class="p">,</span><span class="w"> </span><span class="n">counts</span><span class="p">:</span><span class="w"> </span><span class="kp">&amp;</span><span class="nc">HashMap</span><span class="o">&lt;</span><span class="n">CanonicalKmer</span><span class="p">,</span><span class="w"> </span><span class="kt">u32</span><span class="o">&gt;</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">OLMResult</span><span class="o">&lt;</span><span class="kt">usize</span><span class="o">&gt;</span>
@@ -1472,35 +1309,104 @@
<span class="w"> </span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">OLMResult</span><span class="o">&lt;</span><span class="kt">usize</span><span class="o">&gt;</span>
<span class="p">}</span>
</code></pre></div>
<p>Mode 2 creates a <code>PersistentCompactIntMatrixBuilder</code> with 1 column and fills it via <code>build_second_pass</code>. Mode 3 creates a <code>PersistentBitMatrixBuilder</code> with <code>n_genomes</code> columns and fills all columns in a single pass.</p>
<p>Any duplicate slot or out-of-bounds index detected during <code>build_second_pass</code> returns <code>OLMError::Mphf</code>. <code>new_from_par_iter</code> avoids materialising all keys as <code>Vec&lt;u64&gt;</code>.</p>
<p>All build impls delegate MPHF + evidence construction to <code>MphfLayer::build</code> via a mode-specific <code>fill_slot</code> callback. Mode 2 pre-reads <code>n_kmers</code> from <code>unitigs.bin</code> to size the <code>PersistentCompactIntMatrixBuilder</code> before calling <code>MphfLayer::build</code>. Mode 3 does the same for <code>PersistentBitMatrixBuilder</code>.</p>
<hr />
<h2 id="layeredstores-and-aggregation-traits">LayeredStore&lt;S&gt; and aggregation traits</h2>
<p><code>LayeredStore&lt;S&gt;</code> is a generic aggregation wrapper over <code>Vec&lt;S&gt;</code>. It propagates three traits from <code>obicompactvec::traits</code> up the hierarchy via blanket impls:</p>
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="nc">LayeredStore</span><span class="o">&lt;</span><span class="n">S</span><span class="o">&gt;</span><span class="p">(</span><span class="k">pub</span><span class="w"> </span><span class="nb">Vec</span><span class="o">&lt;</span><span class="n">S</span><span class="o">&gt;</span><span class="p">);</span>
<span class="k">impl</span><span class="o">&lt;</span><span class="n">S</span><span class="p">:</span><span class="w"> </span><span class="nc">ColumnWeights</span><span class="o">&gt;</span><span class="w"> </span><span class="n">ColumnWeights</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">LayeredStore</span><span class="o">&lt;</span><span class="n">S</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="err">…</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="c1">// Σ col_weights across inner stores</span>
<span class="k">impl</span><span class="o">&lt;</span><span class="n">S</span><span class="p">:</span><span class="w"> </span><span class="nc">CountPartials</span><span class="o">&gt;</span><span class="w"> </span><span class="n">CountPartials</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">LayeredStore</span><span class="o">&lt;</span><span class="n">S</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="err">…</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="c1">// element-wise Σ partials</span>
<span class="k">impl</span><span class="o">&lt;</span><span class="n">S</span><span class="p">:</span><span class="w"> </span><span class="nc">BitPartials</span><span class="o">&gt;</span><span class="w"> </span><span class="n">BitPartials</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">LayeredStore</span><span class="o">&lt;</span><span class="n">S</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="err">…</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="c1">// element-wise Σ partials</span>
</code></pre></div>
<p>Because blanket impls compose, <code>LayeredStore&lt;LayeredStore&lt;S&gt;&gt;</code> automatically inherits all three traits when <code>S</code> does — providing the partitioned level without a separate type.</p>
<p><strong>Aggregation hierarchy:</strong></p>
<div class="highlight"><pre><span></span><code>PersistentCompactIntMatrix implements CountPartials
LayeredStore&lt;PersistentCompactIntMatrix&gt; via blanket impl (one partition)
LayeredStore&lt;LayeredStore&lt;…&gt;&gt; via blanket impl (partitioned index)
</code></pre></div>
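<p>The composition can be illustrated with a minimal sketch. It assumes a simplified <code>ColumnWeights</code> trait with a single <code>col_weights</code> method returning one sum per column; the real trait in <code>obicompactvec::traits</code> may expose different method names and types, and <code>Leaf</code> is a hypothetical stand-in for a matrix type.</p>

```rust
// Simplified stand-in for obicompactvec::traits::ColumnWeights.
// Assumption: the real trait may use other method names and return types.
trait ColumnWeights {
    fn col_weights(&self) -> Vec<u64>;
}

struct LayeredStore<S>(pub Vec<S>);

// Blanket impl: the weights of a store of stores are the element-wise
// sums of the weights of the inner stores.
impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> {
    fn col_weights(&self) -> Vec<u64> {
        let mut total: Vec<u64> = Vec::new();
        for store in &self.0 {
            let w = store.col_weights();
            if total.is_empty() {
                total = w;
            } else {
                assert_eq!(total.len(), w.len(), "column counts must match");
                for (t, x) in total.iter_mut().zip(w) {
                    *t += x;
                }
            }
        }
        total
    }
}

// Hypothetical leaf holding precomputed per-column sums.
struct Leaf(Vec<u64>);

impl ColumnWeights for Leaf {
    fn col_weights(&self) -> Vec<u64> {
        self.0.clone()
    }
}
```

<p>Because the impl is generic over any <code>S: ColumnWeights</code>, the same code serves a single partition (a store of leaves) and a partitioned index (a store of stores) without a dedicated type.</p>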
<p><strong>Leaf implementors</strong> (in <code>obicompactvec</code>):</p>
<table>
<thead>
<tr>
<th>Type</th>
<th>Traits</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>PersistentCompactIntMatrix</code></td>
<td><code>ColumnWeights</code> (via <code>sum()</code>) + <code>CountPartials</code></td>
</tr>
<tr>
<td><code>PersistentBitMatrix</code></td>
<td><code>ColumnWeights</code> (via <code>count_ones()</code>) + <code>BitPartials</code></td>
</tr>
</tbody>
</table>
<p><code>PersistentCompactIntVec</code> and <code>PersistentBitVec</code> do not implement these traits — they are single-column primitives, not matrix-level aggregators.</p>
<p>See <a href="../../architecture/index_architecture/">Kmer index architecture</a> for the full trait API and the two-pass normalised-metric pattern.</p>
<hr />
<h2 id="on-disk-structure">On-disk structure</h2>
<div class="highlight"><pre><span></span><code>index_root/                  ← LayeredMap (collection)
  meta.json
  part_00000/                ← Partition
    layer_0/                 ← Layer
      mphf.bin               — ptr_hash MPHF (epserde format)
      unitigs.bin            — packed 2-bit nucleotide sequences
      unitigs.bin.idx        — UIDX index: n_unitigs, n_kmers, seqls[], packed_offsets[]
      evidence.bin           — n × u32, each = (chunk_id: 25 bits | rank: 7 bits), LE
      counts/                [mode 2] PersistentCompactIntMatrix
        meta.json            {&quot;n&quot;: N, &quot;n_cols&quot;: 1}
        col_000000.pciv
      presence/              [mode 3] PersistentBitMatrix
        meta.json            {&quot;n&quot;: N, &quot;n_cols&quot;: G}
        col_000000.pbiv
    layer_1/
  part_00001/
</code></pre></div>
<p><strong>Partition</strong> (<code>part_XXXXX/</code>): all kmers whose canonical minimiser hashes to this bucket. Partitions are independent and can be processed in parallel.</p>
<p><strong>Layer</strong> (<code>layer_N/</code>): one <code>MphfLayer</code> plus optional payload. Layer 0 covers dataset A; layer 1 covers kmers in B absent from A; etc. Layers within a partition are always disjoint.</p>
<hr />
<h2 id="evidence-encoding">Evidence encoding</h2>
<p><code>evidence.bin</code> is a flat <code>[u32; n]</code> array with no header. Each u32 encodes one slot:</p>
<div class="highlight"><pre><span></span><code>bits [31:7] = chunk_id (25 bits) — index of the unitig chunk
bits [6:0] = rank (7 bits) — kmer index within the chunk (0-based)
</code></pre></div>
<p>Decoding: <code>chunk_id = raw &gt;&gt; 7</code>, <code>rank = raw &amp; 0x7F</code>. Reconstructing the kmer: read k nucleotides at position <code>rank</code> within unitig <code>chunk_id</code>.</p>
<p>For k=31, m=11, the observed maximum is ~46 kmers per chunk, well within the range of the 7-bit rank field (ranks 0 to 127). The structural maximum from superkmer construction is k − m + 1 = 21 kmers per unitig; longer unitigs arise from paths spanning more than one superkmer.</p>
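<p>A minimal sketch of the bit layout described in this section, assuming the little-endian contents of <code>evidence.bin</code> are already available as a byte slice. The helper names <code>decode_evidence</code>, <code>encode_evidence</code> and <code>read_slot</code> are illustrative and are not part of the crate's API.</p>

```rust
/// Split one evidence word into (chunk_id, rank).
fn decode_evidence(raw: u32) -> (u32, u32) {
    (raw >> 7, raw & 0x7F)
}

/// Hypothetical inverse of `decode_evidence`, checking that both fields fit.
fn encode_evidence(chunk_id: u32, rank: u32) -> u32 {
    assert!(chunk_id >> 25 == 0, "chunk_id must fit in 25 bits");
    assert!(rank >> 7 == 0, "rank must fit in 7 bits");
    (chunk_id << 7) | rank
}

/// Read the u32 stored for `slot` from a headerless little-endian array.
fn read_slot(bytes: &[u8], slot: usize) -> u32 {
    let off = slot * 4;
    u32::from_le_bytes([bytes[off], bytes[off + 1], bytes[off + 2], bytes[off + 3]])
}
```

<p>Since the file has no header, slot <code>i</code> lives at byte offset <code>4 * i</code>, which is what makes random access O(1).</p>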
<hr />
<h2 id="ptr_hash-configuration">ptr_hash configuration</h2>
<div class="highlight"><pre><span></span><code><span class="k">type</span><span class="w"> </span><span class="nc">Mphf</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">PtrHash</span><span class="o">&lt;</span>
<span class="w"> </span><span class="kt">u64</span><span class="p">,</span><span class="w"> </span><span class="c1">// key type: canonical kmer raw encoding</span>
<span class="w"> </span><span class="n">CubicEps</span><span class="p">,</span><span class="w"> </span><span class="c1">// bucket fn: 2.4 bits/key, λ=3.5, α=0.99</span>
<span class="w"> </span><span class="n">CachelineEfVec</span><span class="o">&lt;</span><span class="nb">Vec</span><span class="o">&lt;</span><span class="n">CachelineEf</span><span class="o">&gt;&gt;</span><span class="p">,</span><span class="w"> </span><span class="c1">// remap: 11.6 bits/entry (Elias-Fano)</span>
<span class="w"> </span><span class="n">Xx64</span><span class="p">,</span><span class="w"> </span><span class="c1">// hasher: XXH3-64 with seed</span>
<span class="w"> </span><span class="nb">Vec</span><span class="o">&lt;</span><span class="kt">u8</span><span class="o">&gt;</span><span class="p">,</span><span class="w"> </span><span class="c1">// pilots</span>
<span class="o">&gt;</span><span class="p">;</span>
</code></pre></div>
<p><code>Xx64</code> is chosen over <code>FxHash</code> because canonical kmer raw values are left-aligned u64 with structural zeros in the low bits (42 zeros for k=11, 2 zeros for k=31), which single-multiply hashes distribute poorly.</p>
<p><code>CubicEps</code> with <code>PtrHashParams::&lt;CubicEps&gt;::default()</code> (λ=3.5) is a balanced tradeoff: construction is 2× slower than with <code>Linear/λ=3.0</code>, in exchange for 20% less space.</p>
<hr />
<h2 id="query-path">Query path</h2>
<p>A kmer query routes through all three levels:</p>
<ol>
<li><strong>Partition routing</strong>: hash canonical minimiser of the query kmer → partition index → open <code>part_XXXXX/</code>.</li>
<li><strong>Layer probing</strong>: iterate layers in order; for each layer compute <code>slot = mphf.index(kmer)</code>, decode evidence, compare to query. First match wins.</li>
<li><strong>Data access</strong>: <code>layer.data.read(slot)</code> returns <code>D::Item</code>.</li>
</ol>
<div class="highlight"><pre><span></span><code><span class="c1">// pseudo-code</span>
<span class="k">fn</span><span class="w"> </span><span class="nf">query</span><span class="p">(</span><span class="n">kmer</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nb">Option</span><span class="o">&lt;</span><span class="p">(</span><span class="kt">usize</span><span class="p">,</span><span class="w"> </span><span class="n">Hit</span><span class="o">&lt;</span><span class="n">D</span><span class="p">::</span><span class="n">Item</span><span class="o">&gt;</span><span class="p">)</span><span class="o">&gt;</span><span class="p">:</span>
<span class="w"> </span><span class="nc">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="n">layer</span><span class="p">)</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">layers</span><span class="p">.</span><span class="n">iter</span><span class="p">().</span><span class="n">enumerate</span><span class="p">():</span>
<span class="w"> </span><span class="nc">slot</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">layer</span><span class="p">.</span><span class="n">mphf</span><span class="p">.</span><span class="n">index</span><span class="p">(</span><span class="o">&amp;</span><span class="n">kmer</span><span class="p">.</span><span class="n">raw</span><span class="p">())</span>
<span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="n">layer</span><span class="p">.</span><span class="n">evidence</span><span class="p">.</span><span class="n">decode</span><span class="p">(</span><span class="n">slot</span><span class="p">)</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">kmer</span><span class="p">:</span>
<span class="w"> </span><span class="nc">return</span><span class="w"> </span><span class="nb">Some</span><span class="p">((</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="n">Hit</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">slot</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="p">:</span><span class="w"> </span><span class="nc">layer</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">read</span><span class="p">(</span><span class="n">slot</span><span class="p">)</span><span class="w"> </span><span class="p">}))</span>
<span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="nb">None</span>
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">query</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">kmer</span><span class="p">:</span><span class="w"> </span><span class="nc">CanonicalKmer</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nb">Option</span><span class="o">&lt;</span><span class="n">Hit</span><span class="o">&lt;</span><span class="n">D</span><span class="p">::</span><span class="n">Item</span><span class="o">&gt;&gt;</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">mphf</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="n">kmer</span><span class="p">).</span><span class="n">map</span><span class="p">(</span><span class="o">|</span><span class="n">slot</span><span class="o">|</span><span class="w"> </span><span class="n">Hit</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">slot</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="p">:</span><span class="w"> </span><span class="nc">self</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">read</span><span class="p">(</span><span class="n">slot</span><span class="p">)</span><span class="w"> </span><span class="p">})</span>
<span class="p">}</span>
</code></pre></div>
<p>Expected probe depth: 1 for kmers in layer 0, increasing for later layers.</p>
<p>For mode 2, <code>hit.data</code> is <code>Box&lt;[u32]&gt;</code> with 1 element; <code>hit.data[0]</code> is the count. For mode 3, <code>hit.data</code> is <code>Box&lt;[bool]&gt;</code> with G elements, one per genome.</p>
<p><code>MphfLayer::find</code> probes the MPHF, decodes evidence, and verifies the kmer — returning <code>Some(slot)</code> on match, <code>None</code> otherwise. <code>data.read(slot)</code> is called only on a confirmed hit.</p>
<p>In <code>LayeredMap</code>, layers are probed in order; the first match wins. Expected probe depth: 1 for kmers in layer 0.</p>
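<p>The probe-then-verify pattern can be sketched with a toy model. <code>ToyLayer</code> is hypothetical: a modulo plays the role of the MPHF and a plain vector plays the role of decoded evidence. It is only meant to show why verification is required, since a perfect hash returns some slot even for keys outside the indexed set.</p>

```rust
// Toy stand-in for MphfLayer (hypothetical, not the crate's type).
// `evidence[slot]` holds the kmer that owns the slot.
struct ToyLayer {
    evidence: Vec<u64>,
}

impl ToyLayer {
    // Build from a non-empty key set that the toy hash maps without collision.
    fn build(kmers: &[u64]) -> ToyLayer {
        let n = kmers.len();
        let mut evidence = vec![u64::MAX; n];
        for &k in kmers {
            let slot = (k % n as u64) as usize;
            assert_eq!(evidence[slot], u64::MAX, "toy hash must be collision-free");
            evidence[slot] = k;
        }
        ToyLayer { evidence }
    }

    // Probe the hash, then confirm that the slot really belongs to `kmer`.
    fn find(&self, kmer: u64) -> Option<usize> {
        let slot = (kmer % self.evidence.len() as u64) as usize;
        if self.evidence[slot] == kmer { Some(slot) } else { None }
    }
}

// Probe layers in order; the first verified match wins.
// Returns (layer index, slot).
fn query(layers: &[ToyLayer], kmer: u64) -> Option<(usize, usize)> {
    layers
        .iter()
        .enumerate()
        .find_map(|(i, layer)| layer.find(kmer).map(|slot| (i, slot)))
}
```

<p>A kmer absent from every layer costs one probe and one verification per layer before <code>None</code> is returned.</p>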
<hr />
<h2 id="add-layer-algorithm">Add-layer algorithm</h2>
<p>When adding dataset B to an existing index:</p>
<ol>
<li>For each partition, iterate kmers of B routed to that partition.</li>
<li>Probe existing layers; collect kmers absent from all layers → <code>B \ index</code>.</li>
<li>Build a new layer from <code>B \ index</code>.</li>
<li>Append the new layer directory under each <code>part_XXXXX/</code>.</li>
<li>Update <code>meta.json</code> (layer count, sample registry).</li>
<li>For each partition, probe existing layers for kmers of B routed to that partition.</li>
<li>Collect kmers absent from all layers → <code>B \ index</code>.</li>
<li>Write <code>B \ index</code> to a new <code>unitigs.bin</code> via <code>MphfLayer::unitig_writer</code>.</li>
<li>Call <code>Layer&lt;D&gt;::build</code> on the new directory.</li>
<li>Update <code>meta.json</code>.</li>
</ol>
<p>Each partition's new layer is built independently; the operation is fully parallel across partitions.</p>
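<p>The set difference at the heart of the add-layer steps can be sketched for a single partition. This is an illustration under simplifying assumptions: each existing layer is modelled as a <code>HashSet</code> of canonical kmers instead of an MPHF with evidence, and <code>new_layer_kmers</code> is a hypothetical helper name.</p>

```rust
use std::collections::HashSet;

// Return the kmers of dataset B that no existing layer contains,
// i.e. the key set of the layer to be built for this partition.
// Assumption: layers are modelled as hash sets for brevity.
fn new_layer_kmers(layers: &[HashSet<u64>], b: &[u64]) -> Vec<u64> {
    b.iter()
        .copied()
        .filter(|k| !layers.iter().any(|layer| layer.contains(k)))
        .collect()
}
```

<p>The function touches only one partition's layers, so running it once per partition in parallel needs no shared mutable state.</p>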
<hr />
@@ -1515,19 +1421,19 @@
<tbody>
<tr>
<td><code>ptr_hash 1.1</code></td>
<td>MPHF per layer (epserde serialisation)</td>
<td>MPHF per layer</td>
</tr>
<tr>
<td><code>cacheline-ef 1.1</code></td>
<td>compact remap storage inside ptr_hash</td>
<td>compact remap inside ptr_hash</td>
</tr>
<tr>
<td><code>epserde 0.8</code></td>
<td>zero-copy serialisation of MPHF</td>
<td>zero-copy MPHF serialisation</td>
</tr>
<tr>
<td><code>memmap2</code></td>
<td>mmap of layer files</td>
<td><code>memmap2 0.9</code></td>
<td>mmap of evidence and payload files</td>
</tr>
<tr>
<td><code>obiskio</code></td>
@@ -1535,21 +1441,18 @@
</tr>
<tr>
<td><code>obicompactvec</code></td>
<td>payload types: <code>PersistentCompactIntMatrix</code>, <code>PersistentBitMatrix</code></td>
<td>payload types + aggregation traits</td>
</tr>
<tr>
<td><code>rayon 1</code></td>
<td>parallel MPHF construction pass</td>
</tr>
<tr>
<td><code>ndarray 0.16</code></td>
<td>aggregation output arrays</td>
</tr>
</tbody>
</table>
<hr />
<h2 id="relationship-to-target-architecture">Relationship to target architecture</h2>
<p>The target architecture (see <a href="../../architecture/index_architecture/">Kmer index architecture</a>) separates <code>MphfLayer</code> from data stores entirely and introduces a <code>PartitionedIndex</code> with parallel dispatch and an <code>Aggregator</code> pattern. The current implementation is a stepping stone: <code>obicompactvec</code> types are already fully decoupled from the MPHF; the remaining refactoring is within <code>obilayeredmap</code> itself.</p>
<hr />
<h2 id="open-questions">Open questions</h2>
<ul>
<li><strong>Mode 4</strong>: count matrix (n_kmers × n_genomes × bytes_per_count) is structurally identical to mode 3 but uses <code>PersistentCompactIntMatrix</code> with G columns. Build API not yet implemented. Scale concern: hundreds of GB for large collections — a sparse representation may be required at high genome counts.</li>
<li><strong>Layer merge</strong>: merging two <code>LayeredMap</code> instances into a single-layer index requires full rebuild. Define API and cost model.</li>
<li><strong>Canonical kmer orientation</strong>: evidence stores canonical kmer; strand recovery requires one 64-bit revcomp comparison at query time.</li>
<li><strong><code>try_new_from_par_iter</code></strong>: <code>ptr_hash::new_from_par_iter</code> silently discards construction failure. Post-construction verification (current workaround) is correct but does not allow retry. A <code>try_new_from_par_iter</code> PR upstream would close this gap.</li>
</ul>