docs: document k-mer index architecture and refactor distance metrics

Add comprehensive documentation for the `obilayeredmap` crate, `PersistentCompactIntVec`, `PersistentBitVec`, and the hierarchical k-mer index architecture, including sidebar navigation updates across all documentation pages. Refactor the Bray-Curtis distance computation in `obicompactvec` to decouple numerator and denominator calculations, replacing direct pairwise calls with explicit loops over precomputed sums. Update tests to verify column sum accuracy and align with the simplified API.
2026-05-15 21:07:23 +08:00
parent 8409c852ef
commit 45d49ed501
25 changed files with 8842 additions and 117 deletions
@@ -745,6 +745,89 @@
    </span>
  </a>
  
+</li>
+      
+        <li class="md-nav__item">
+  <a href="#multilayer-index-architecture" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Multilayer index architecture
+      
+    </span>
+  </a>
+  
+    <nav class="md-nav" aria-label="Multilayer index architecture">
+      <ul class="md-nav__list">
+        
+          <li class="md-nav__item">
+  <a href="#motivation" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Motivation
+      
+    </span>
+  </a>
+  
+</li>
+        
+          <li class="md-nav__item">
+  <a href="#layer-structure" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Layer structure
+      
+    </span>
+  </a>
+  
+</li>
+        
+          <li class="md-nav__item">
+  <a href="#membership-verification" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Membership verification
+      
+    </span>
+  </a>
+  
+</li>
+        
+          <li class="md-nav__item">
+  <a href="#query-algorithm" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Query algorithm
+      
+    </span>
+  </a>
+  
+</li>
+        
+          <li class="md-nav__item">
+  <a href="#layer-count-and-probe-cost" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Layer count and probe cost
+      
+    </span>
+  </a>
+  
+</li>
+        
+          <li class="md-nav__item">
+  <a href="#merging-layers" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Merging layers
+      
+    </span>
+  </a>
+  
+</li>
+        
+      </ul>
+    </nav>
+  
 </li>
      
        <li class="md-nav__item">
@@ -795,6 +878,90 @@

              
            
+              
+                
+  
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../obilayeredmap/" class="md-nav__link">
+        
+  
+  
+  <span class="md-ellipsis">
+    
+  
+    obilayeredmap crate
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../persistent_compact_int_vec/" class="md-nav__link">
+        
+  
+  
+  <span class="md-ellipsis">
+    
+  
+    PersistentCompactIntVec
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../persistent_bit_vec/" class="md-nav__link">
+        
+  
+  
+  <span class="md-ellipsis">
+    
+  
+    PersistentBitVec
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
          </ul>
        </nav>
      
@@ -877,6 +1044,34 @@

              
            
+              
+                
+  
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../architecture/index_architecture/" class="md-nav__link">
+        
+  
+  
+  <span class="md-ellipsis">
+    
+  
+    Kmer index
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
          </ul>
        </nav>
      
@@ -1002,6 +1197,89 @@
    </span>
  </a>
  
+</li>
+      
+        <li class="md-nav__item">
+  <a href="#multilayer-index-architecture" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Multilayer index architecture
+      
+    </span>
+  </a>
+  
+    <nav class="md-nav" aria-label="Multilayer index architecture">
+      <ul class="md-nav__list">
+        
+          <li class="md-nav__item">
+  <a href="#motivation" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Motivation
+      
+    </span>
+  </a>
+  
+</li>
+        
+          <li class="md-nav__item">
+  <a href="#layer-structure" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Layer structure
+      
+    </span>
+  </a>
+  
+</li>
+        
+          <li class="md-nav__item">
+  <a href="#membership-verification" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Membership verification
+      
+    </span>
+  </a>
+  
+</li>
+        
+          <li class="md-nav__item">
+  <a href="#query-algorithm" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Query algorithm
+      
+    </span>
+  </a>
+  
+</li>
+        
+          <li class="md-nav__item">
+  <a href="#layer-count-and-probe-cost" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Layer count and probe cost
+      
+    </span>
+  </a>
+  
+</li>
+        
+          <li class="md-nav__item">
+  <a href="#merging-layers" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Merging layers
+      
+    </span>
+  </a>
+  
+</li>
+        
+      </ul>
+    </nav>
+  
 </li>
      
        <li class="md-nav__item">
@@ -1071,7 +1349,7 @@
 </ul>
 <h2 id="mphf-choice-per-phase">MPHF choice per phase</h2>
 <p><strong>Phase 1</strong> (provisional, discarded after spectrum computation): FMPHGO. Tolerates overestimated capacity, compact, no need to optimise for query speed on a temporary structure.</p>
-<p><strong>Phase 2</strong> (persistent, queried repeatedly): open between FMPHGO and ptr_hash. Exact key count is available, so both operate optimally. ptr_hash's query speed advantage (2.1–3.3×) is meaningful for the persistent index but carries the risk of a very young crate. FMPHGO is the conservative default; ptr_hash is worth revisiting once it has broader production use.</p>
+<p><strong>Phase 2</strong> (persistent, queried repeatedly): <strong>ptr_hash</strong>. The exact key count is available at phase 2, so ptr_hash operates optimally. Its query speed advantage (≥2.1× over FMPHGO) and construction speed advantage (≥3.1×) are meaningful for the persistent index, and the space overhead of 2.4 bits/key is acceptable. The crate's youth (Feb 2025) was previously a concern; that risk is now accepted given the performance profile and the fact that each layer's MPHF can be rebuilt independently from its unitig file if needed.</p>
 <p>boomphf is effectively eliminated: its space overhead is the largest and its streaming-construction advantage does not apply here.</p>
 <hr />
 <h2 id="space-at-scale">Space at scale</h2>
@@ -1106,12 +1384,47 @@
 <h2 id="on-disk-and-mmap-considerations">On-disk and mmap considerations</h2>
 <p>All three are in-memory structures. Their internal representation is flat bit arrays (no heap pointers), making them serialisable as contiguous byte blobs and mmappable per partition. True zero-copy access would require rkyv integration; the <code>ph</code> crate currently uses serde, so loading involves a copy. Given per-partition MPHF sizes of 1–8 MB, the OS page cache handles this transparently — strict zero-copy is a refinement, not a blocker.</p>
 <p>No established Rust crate provides a natively on-disk MPHF. <strong>SSHash</strong> (Sparse and Skew Hash) is a complete kmer dictionary designed for disk access and is order-preserving (overlapping kmers receive consecutive indices → cache-friendly count access), but it is C++-only and covers more than just the MPHF layer.</p>
+<hr />
+<h2 id="multilayer-index-architecture">Multilayer index architecture</h2>
+<h3 id="motivation">Motivation</h3>
+<p>An index built from a single dataset A can be extended with a new dataset B without rebuilding. This supports incremental construction (adding species, samples, or sequencing runs) and enables set operations across heterogeneous sources.</p>
+<h3 id="layer-structure">Layer structure</h3>
+<p>Each layer is a self-contained unit:</p>
+<div class="highlight"><pre><span></span><code>layer_i/
+  unitigs.bin     — packed 2-bit nucleotide sequences
+  mphf.bin        — ptr_hash index (phase-2, exact key count)
+  evidence.bin    — [(unitig_id, rank)] per MPHF slot  (see unitig_evidence.md)
+  counts.bin      — [u32] per MPHF slot
+</code></pre></div>
+<p>Layers are <strong>disjoint</strong>: a canonical kmer belongs to exactly one layer. Layer 0 is built from dataset A. Adding dataset B proceeds as follows:</p>
+<ol>
+<li>For each kmer in B, query layer 0; if the kmer is found, add its count to <code>counts_0[MPHF_0(kmer)]</code>.</li>
+<li>Collect the kmers of B that are absent from every existing layer; this is the set <code>B \ A</code>.</li>
+<li>Build layer 1 from <code>B \ A</code> using the standard two-phase pipeline (spectrum, filter, ptr_hash).</li>
+</ol>
+<p>Adding a third dataset C repeats the process: probe layer 0, then layer 1, then build layer 2 from <code>C \ A \ B</code>.</p>
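+<p>The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the crate's implementation: each layer is modelled by a <code>HashMap</code> standing in for the MPHF, evidence, and counts files, canonical kmers are assumed to be pre-encoded as <code>u64</code>, and the name <code>add_dataset</code> is hypothetical.</p>

```rust
use std::collections::HashMap;

/// Stand-in for one layer: a HashMap replaces mphf.bin + evidence.bin + counts.bin.
/// (Assumption for illustration only; the real layer is a set of flat files.)
type Layer = HashMap<u64, u32>;

/// Add one dataset (canonical kmers as u64, repeats allowed) to a layer chain.
/// Calling it on an empty chain builds layer 0.
fn add_dataset(layers: &mut Vec<Layer>, dataset: &[u64]) {
    let mut novel: Layer = HashMap::new();
    for &kmer in dataset {
        // Probe existing layers in order; on a hit, accumulate into that layer.
        match layers.iter_mut().find_map(|layer| layer.get_mut(&kmer)) {
            Some(count) => *count += 1,
            // Absent from every existing layer: remember it for the new layer.
            None => *novel.entry(kmer).or_insert(0) += 1,
        }
    }
    // Build the new layer from the kmers no existing layer holds (B \ A).
    if !novel.is_empty() {
        layers.push(novel);
    }
}
```

+<p>Because a kmer is only placed in a new layer when no earlier layer holds it, the layers stay disjoint by construction.</p>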
+<h3 id="membership-verification">Membership verification</h3>
+<p>ptr_hash maps any input to a valid slot — it does not natively detect absent keys. Membership is verified using the evidence entry: decode the kmer from <code>(unitig_id, rank)</code> and compare to the query. A mismatch means the kmer is absent from this layer; probe the next layer.</p>
+<p>This makes the evidence layer load-bearing for correctness, not only for locality.</p>
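+<p>A minimal sketch of this check, assuming a hypothetical in-memory <code>Layer</code> view, a 2-bit encoding of A=0, C=1, G=2, T=3 packed four bases per byte (low bits first), and kmers of at most 32 bases held in a <code>u64</code>. The MPHF lookup itself is left out: <code>slot</code> is whatever it returned. Canonicalisation (reverse complements) is ignored for brevity.</p>

```rust
/// Hypothetical in-memory view of one layer; field names are illustrative.
struct Layer {
    k: usize,
    /// Unitigs as 2-bit codes, four bases per byte, low bits first (assumed layout).
    unitigs: Vec<Vec<u8>>,
    /// One (unitig_id, rank) pair per MPHF slot.
    evidence: Vec<(u32, u32)>,
}

/// Pack an ASCII DNA string into 2-bit codes (helper for building examples).
fn pack(seq: &str) -> Vec<u8> {
    let mut out = vec![0u8; (seq.len() + 3) / 4];
    for (i, b) in seq.bytes().enumerate() {
        let code: u8 = match b { b'A' => 0, b'C' => 1, b'G' => 2, _ => 3 };
        out[i / 4] |= code << (2 * (i % 4));
    }
    out
}

/// Encode a kmer (k <= 32) as a u64, first base in the highest used bits.
fn encode(kmer: &str) -> u64 {
    kmer.bytes().fold(0u64, |acc, b| {
        let code: u64 = match b { b'A' => 0, b'C' => 1, b'G' => 2, _ => 3 };
        (acc << 2) | code
    })
}

impl Layer {
    /// Rebuild the kmer stored behind `slot` from its (unitig_id, rank) evidence.
    fn decode(&self, slot: usize) -> u64 {
        let (unitig_id, rank) = self.evidence[slot];
        let unitig = &self.unitigs[unitig_id as usize];
        (0..self.k).fold(0u64, |acc, j| {
            let i = rank as usize + j;
            let code = (unitig[i / 4] >> (2 * (i % 4))) & 3;
            (acc << 2) | code as u64
        })
    }

    /// True only if the kmer behind `slot` is the query; otherwise the MPHF
    /// returned a slot for a key it was never built on.
    fn verify(&self, slot: usize, kmer: u64) -> bool {
        self.decode(slot) == kmer
    }
}
```

+<p>With a single unitig <code>ACGTA</code> and <code>k = 3</code>, slot evidence <code>(0, 1)</code> decodes to <code>CGT</code>; a query for any other kmer that happens to land on that slot fails the comparison and falls through to the next layer.</p>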
+<h3 id="query-algorithm">Query algorithm</h3>
+<div class="highlight"><pre><span></span><code>fn query(kmer) → Option&lt;count&gt;:
+    for layer in layers:
+        slot = layer.mphf.query(kmer)
+        if layer.evidence.decode(slot) == kmer:
+            return Some(layer.counts[slot])
+    return None
+</code></pre></div>
+<p>Expected probe depth is 1 for kmers present in layer 0 and grows for kmers that were first added in later layers. In practice, the dominant (largest) dataset should be layer 0 to minimise the average probe depth.</p>
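+<p>The effect of layer order on probe depth can be illustrated with a small sketch. It assumes each layer is a <code>HashMap</code> that stands in for the MPHF lookup plus evidence check; <code>query</code> here also returns the number of layers probed, and <code>mean_probe_depth</code> is a hypothetical helper, not part of the crate.</p>

```rust
use std::collections::HashMap;

/// Stand-in for one layer: the HashMap plays the role of MPHF lookup + evidence check.
type Layer = HashMap<u64, u32>;

/// Return (count, layers probed) for a kmer, or None if no layer holds it.
fn query(layers: &[Layer], kmer: u64) -> Option<(u32, usize)> {
    layers
        .iter()
        .enumerate()
        .find_map(|(i, layer)| layer.get(&kmer).map(|&count| (count, i + 1)))
}

/// Mean number of layers probed over a query stream.
/// A miss probes every layer before giving up.
fn mean_probe_depth(layers: &[Layer], queries: &[u64]) -> f64 {
    let total: usize = queries
        .iter()
        .map(|&q| query(layers, q).map_or(layers.len(), |(_, probes)| probes))
        .sum();
    total as f64 / queries.len() as f64
}
```

+<p>For a query stream that hits three kmers of a large layer and one kmer of a small layer, placing the large layer first gives a mean depth of 1.25 probes, against 1.75 with the order reversed.</p>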
+<h3 id="layer-count-and-probe-cost">Layer count and probe cost</h3>
+<p>Each probe is a ptr_hash lookup (~10 ns) plus one evidence decode (two array accesses). With L layers, the worst case is a kmer absent from the index: all L layers are probed before <code>None</code> is returned. In practice L is small (2–5 for typical multi-species databases). No global data structure is needed to route queries; the layer chain is traversed in order.</p>
+<h3 id="merging-layers">Merging layers</h3>
+<p>Two layer chains can be merged by re-indexing their union through the standard pipeline. This is expensive (full rebuild) but produces an optimal single-layer index. Merge is a maintenance operation, not a query-path requirement.</p>
 <h2 id="open-questions">Open questions</h2>
 <ul>
 <li>Confirm actual partition sizes and overestimation factor on representative metagenomic datasets.</li>
-<li>Revisit ptr_hash for phase 2 once the crate has broader production track record.</li>
-<li>Assess rkyv integration cost for FMPHGO if true zero-copy mmap becomes necessary for the persistent index.</li>
+<li><strong>rkyv integration</strong>: all flat arrays in a layer (evidence, counts, presence/absence matrix) map trivially to <code>rkyv::Archive</code>, since they have fixed-size element types and no heap indirection. The presence/absence matrix is the strongest case: at 10 M kmers × 1 000 samples, one bit per cell comes to ≈ 1.25 GB per partition, and zero-copy mmap via rkyv avoids loading the entire matrix at open time, letting the OS page cache serve only the accessed pages. ptr_hash itself is internally a flat bit array and is structurally compatible with rkyv, but it requires either native crate support or a wrapper. Assess the wrapper cost and whether the ptr_hash maintainers would accept rkyv support upstream.</li>
 <li>Keep SSHash in mind if the indexing architecture is reconsidered at a higher level.</li>
+<li>Determine optimal layer ordering heuristic (by kmer count? by query frequency?) for multi-species databases.</li>
 </ul>
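+<p>The 1.25 GB figure quoted for the presence/absence matrix follows from storing one bit per (kmer, sample) cell. A minimal check, with <code>bit_matrix_bytes</code> as a hypothetical helper name:</p>

```rust
/// Bytes needed for a kmers x samples bit matrix, one bit per cell, rounded up.
fn bit_matrix_bytes(kmers: u64, samples: u64) -> u64 {
    (kmers * samples + 7) / 8
}
```

+<p>For 10 M kmers and 1 000 samples this gives 1 250 000 000 bytes, i.e. 1.25 GB per partition before any alignment or header overhead.</p>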