docs: expand kmer indexing, filtering, and merging documentation

Expands MkDocs navigation and documentation for evidence elimination, the merge command, and kmer filtering. Refactors kmer representation to a generic `KmerOf<L>` type with a bitwise reverse complement algorithm. Unifies MPHF construction, introduces approximate fingerprint-based indexing, and updates the pipeline, chunkreader, and storage layouts. Adds code coverage reports and clarifies architectural invariants for improved maintainability.
2026-06-04 21:27:01 +02:00
parent 9306ec1c56
commit bb7adc1154
50 changed files with 34226 additions and 1576 deletions
@@ -757,6 +757,28 @@
    </span>
  </a>
  
+</li>
+        
+          <li class="md-nav__item">
+  <a href="#evidence-modes" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Evidence modes
+      
+    </span>
+  </a>
+  
+</li>
+        
+          <li class="md-nav__item">
+  <a href="#build-functions" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Build functions
+      
+    </span>
+  </a>
+  
 </li>
        
          <li class="md-nav__item">
@@ -840,6 +862,34 @@
  
  
  
+    <li class="md-nav__item">
+      <a href="../evidence_elimination/" class="md-nav__link">
+        
+  
+  
+  <span class="md-ellipsis">
+    
+  
+    Evidence elimination (discussion)
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+  
    <li class="md-nav__item">
      <a href="../obilayeredmap/" class="md-nav__link">
        
@@ -918,6 +968,62 @@

              
            
+              
+                
+  
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../merge/" class="md-nav__link">
+        
+  
+  
+  <span class="md-ellipsis">
+    
+  
+    Merge command
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../rebuild_filter/" class="md-nav__link">
+        
+  
+  
+  <span class="md-ellipsis">
+    
+  
+    Kmer filtering (rebuild/dump/unitig)
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
          </ul>
        </nav>
      
@@ -1165,6 +1271,28 @@
    </span>
  </a>
  
+</li>
+        
+          <li class="md-nav__item">
+  <a href="#evidence-modes" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Evidence modes
+      
+    </span>
+  </a>
+  
+</li>
+        
+          <li class="md-nav__item">
+  <a href="#build-functions" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Build functions
+      
+    </span>
+  </a>
+  
 </li>
        
          <li class="md-nav__item">
@@ -1226,26 +1354,26 @@
 <h2 id="why-two-phases-are-needed">Why two phases are needed</h2>
 <p>Kmer indexing per partition proceeds in two phases. The separation is necessary because the exact number of surviving unique kmers is not known until after counting and filtering low-abundance kmers.</p>
 <h3 id="phase-1-provisional-mphf-kmer-spectrum">Phase 1 — provisional MPHF + kmer spectrum</h3>
-<p>Implemented in <code>obikpartitionner::KmerPartition::count_kmer()</code>.</p>
+<p>Implemented in <code>obikpartitionner::KmerPartition::count_kmer()</code> → <code>count_partition()</code>.</p>
 <ol>
-<li><strong>Pass 1</strong>: read the dereplicated superkmer file; enumerate all unique canonical kmers into a <code>HashSet</code>. Exact count known after this pass.</li>
-<li><strong>Build a provisional MPHF</strong> (<code>GOFunction</code> from the <code>ph</code> crate) over the exact kmer set. Produces <code>mphf1.bin</code>.</li>
-<li><strong>Create <code>counts1.bin</code></strong>: one zero-initialised <code>u32</code> per MPHF slot (mmap'd).</li>
-<li><strong>Pass 2</strong>: re-read the dereplicated file; for each kmer, query <code>mphf1.get(kmer)</code> and atomically accumulate the superkmer count into <code>counts1[slot]</code>.</li>
+<li><strong>External sort</strong>: read the dereplicated superkmer file; extract the raw <code>u64</code> canonical kmer value for every kmer of every superkmer. Sort in RAM-bounded chunks (adaptive budget: 40% of available RAM ÷ n_threads, minimum 1 M kmers per chunk), then k-way merge with inline dedup. Result: <code>sorted_unique.bin</code> — a flat array of f0 distinct sorted <code>u64</code> values. Exact kmer count f0 is known at this point.</li>
+<li><strong>Build provisional MPHF</strong> (ptr_hash, same configuration as phase 2) over <code>sorted_unique.bin</code> using <code>new_from_par_iter</code>. Delete <code>sorted_unique.bin</code> immediately after. Persist to <code>mphf1.bin</code>.</li>
+<li><strong>Create <code>counts1.bin</code></strong>: <code>PersistentCompactIntVec</code> with f0 slots, zero-initialised.</li>
+<li><strong>Accumulation pass</strong>: re-read the dereplicated superkmer file; for each kmer in each superkmer, compute <code>slot = mphf.index(kmer.raw())</code> and increment <code>counts1[slot]</code> by the superkmer's COUNT.</li>
 <li><strong>Build kmer frequency spectrum</strong> from <code>counts1</code>: histogram <code>{count → n_kmers}</code>, totals f0 (distinct kmers) and f1 (total abundance). Written to <code>kmer_spectrum_raw.json</code> per partition, then merged globally.</li>
 </ol>
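The external sort in step 1 can be pictured with a small in-memory sketch. It is an illustration only, not the project's implementation: the names `sort_chunks` and `merge_dedup` are hypothetical, the chunks live in memory rather than on disk, and the chunk length is fixed instead of being derived from the RAM budget.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Sort each fixed-size chunk independently (stand-in for the RAM-bounded chunk sort).
fn sort_chunks(values: &[u64], chunk_len: usize) -> Vec<Vec<u64>> {
    values
        .chunks(chunk_len.max(1))
        .map(|chunk| {
            let mut run = chunk.to_vec();
            run.sort_unstable();
            run
        })
        .collect()
}

/// K-way merge of sorted runs, dropping duplicates as values are emitted.
fn merge_dedup(runs: &[Vec<u64>]) -> Vec<u64> {
    // Min-heap of (value, run index, position inside that run).
    let mut heap = BinaryHeap::new();
    for (r, run) in runs.iter().enumerate() {
        if let Some(&v) = run.first() {
            heap.push(Reverse((v, r, 0usize)));
        }
    }
    let mut out: Vec<u64> = Vec::new();
    while let Some(Reverse((v, r, i))) = heap.pop() {
        // Inline dedup: values arrive in sorted order, so compare with the last one kept.
        if out.last() != Some(&v) {
            out.push(v);
        }
        if let Some(&next) = runs[r].get(i + 1) {
            heap.push(Reverse((next, r, i + 1)));
        }
    }
    out
}

fn main() {
    let raw = [7u64, 3, 7, 1, 9, 3, 3, 8, 1];
    let runs = sort_chunks(&raw, 4);
    let unique = merge_dedup(&runs);
    // unique.len() plays the role of f0, the exact number of distinct kmers.
    println!("f0 = {}, sorted_unique = {:?}", unique.len(), unique);
}
```

Because the merge emits values in sorted order, deduplication only needs to look at the last value written, which is why the exact count f0 is available as soon as the merge finishes.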
 <p>Files produced per partition:</p>
 <div class="highlight"><pre><span></span><code>part_XXXXX/
-  mphf1.bin               — GOFunction (provisional MPHF, discarded after phase 2)
-  counts1.bin             — [u32; n_kmers] kmer counts, mmap&#39;d
+  mphf1.bin               — ptr_hash provisional MPHF (discarded after phase 2)
+  counts1.bin             — PersistentCompactIntVec, f0 × u32 kmer counts
  kmer_spectrum_raw.json  — local frequency spectrum
 </code></pre></div>
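As a rough sketch of how the spectrum in `kmer_spectrum_raw.json` relates to the per-slot counts, the following derives the histogram and the f0/f1 totals from a plain slice of counts. The `Spectrum` struct, its field names, and the `surviving` helper are hypothetical; the real code reads from `counts1.bin` and its JSON layout is not shown here.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory form of the frequency spectrum.
struct Spectrum {
    /// count -> number of distinct kmers observed with that count
    histogram: BTreeMap<u32, u64>,
    /// number of distinct kmers (one per MPHF slot)
    f0: u64,
    /// total abundance (sum of all counts)
    f1: u64,
}

impl Spectrum {
    /// Number of distinct kmers that would survive a min-count threshold.
    fn surviving(&self, min_count: u32) -> u64 {
        self.histogram.range(min_count..).map(|(_, n)| *n).sum()
    }
}

fn spectrum(counts: &[u32]) -> Spectrum {
    let mut histogram = BTreeMap::new();
    let mut f1 = 0u64;
    for &c in counts {
        *histogram.entry(c).or_insert(0u64) += 1;
        f1 += u64::from(c);
    }
    Spectrum { histogram, f0: counts.len() as u64, f1 }
}

fn main() {
    let s = spectrum(&[1, 1, 2, 5]);
    println!("f0 = {}, f1 = {}, histogram = {:?}", s.f0, s.f1, s.histogram);
    println!("kmers with count >= 2: {}", s.surviving(2));
}
```

The `surviving` helper shows why phase 2 cannot start earlier: the size of the filtered kmer set depends on a threshold that is only chosen once the spectrum exists.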
 <h3 id="phase-2-definitive-mphf">Phase 2 — definitive MPHF</h3>
 <p>After filtering (applying a min-count threshold derived from the spectrum) and building the local De Bruijn graph + unitigs (see <a href="../pipeline/">Construction pipeline</a>), the exact filtered kmer set is available via <code>unitigs.bin</code>.</p>
-<p><code>MphfLayer::build</code> is called on the unitig file:</p>
+<p><code>MphfLayer::build(dir, block_bits, mode: &amp;IndexMode, fill_slot)</code> is called on the unitig directory:</p>
 <ol>
-<li><strong>Pass 1</strong>: iterate all canonical kmers from <code>unitigs.bin</code> in parallel, build and store <code>mphf.bin</code> (ptr_hash).</li>
-<li><strong>Pass 2</strong>: iterate sequentially, fill <code>evidence.bin</code>, call the mode-specific <code>fill_slot</code> callback.</li>
+<li><strong>Pass 1</strong> (parallel): a <code>CanonicalKmerIter</code> — clonable via <code>Arc&lt;Mmap&gt;</code>, no file reopening — is passed directly to <code>new_from_par_iter</code> via <code>par_bridge()</code>. No <code>.idx</code> is read or created at this stage; parallelism is at partition/layer level, not within a single MPHF. Produces <code>mphf.bin</code>.</li>
+<li><strong>Pass 2</strong> (sequential): iterate with <code>iter_indexed_canonical_kmers</code>; fill evidence files; call <code>fill_slot(slot, kmer)</code> callback per kmer. For Exact/Hybrid, <code>.idx</code> is written at the end of this pass — never earlier.</li>
 </ol>
 <p><code>mphf1.bin</code> and <code>counts1.bin</code> are no longer needed after phase 2 and can be deleted.</p>
 <hr />
@@ -1265,13 +1393,11 @@
 <p><strong>FMPH/FMPHGO</strong> (<code>ph</code> crate, Beling, ACM JEA 2023):</p>
 <ul>
 <li>~2.1 bits/key — most compact; good query speed; deterministic construction</li>
-<li>Works well from an exact or slightly overestimated count</li>
-<li><code>GOFunction</code> (group-oriented variant) is the specific type used</li>
+<li><code>GOFunction</code> (group-oriented variant) was the original phase-1 choice; eliminated when the external sort made the exact count available at phase 1 as well</li>
 </ul>
 <h2 id="mphf-choice-per-phase">MPHF choice per phase</h2>
-<p><strong>Phase 1</strong> (provisional, discarded after spectrum computation): <code>ph::fmph::GOFunction</code>. Compact, fast to build from the exact post-dedup kmer set. Query speed is secondary — the structure is only used during pass 2 of <code>count_kmer</code>.</p>
-<p><strong>Phase 2</strong> (persistent, queried repeatedly): <strong>ptr_hash</strong>. Exact key count is available from the unitig index; ptr_hash query speed (≥2.1×) and construction speed (≥3.1× over FMPH) are the decisive factors. The 2.4 bits/key overhead is acceptable.</p>
-<p>boomphf is eliminated: largest space overhead, streaming advantage does not apply.</p>
+<p><strong>Both phases</strong>: <strong>ptr_hash</strong>, same type alias and construction parameters. The external sort (phase 1) and the unitig index (phase 2) both provide the exact key count before MPHF construction, so ptr_hash's need for a known key count is satisfied in both cases. Using a single MPHF implementation removes the <code>ph</code> crate dependency.</p>
+<p>boomphf: eliminated — largest space overhead, streaming advantage no longer needed. FMPH/GOFunction: eliminated — exact count available, ptr_hash is faster at equivalent compactness.</p>
 <hr />
 <h2 id="space-at-scale">Space at scale</h2>
 <p>For 1 024 partitions × 100 M kmers/partition (phase 2 index, after filtering):</p>
@@ -1320,9 +1446,12 @@
 <h3 id="layer-structure">Layer structure</h3>
 <p>Each layer is a self-contained unit. See <a href="../obilayeredmap/">obilayeredmap</a> for the full on-disk layout. The MPHF-relevant files are:</p>
 <div class="highlight"><pre><span></span><code>layer_i/
-  unitigs.bin      — packed 2-bit nucleotide sequences (kmer evidence)
+  unitigs.bin      — packed 2-bit nucleotide sequences (kmer evidence source)
+  unitigs.bin.idx  — random-access block index (block_bits controls granularity)
  mphf.bin         — ptr_hash phase-2 MPHF
-  evidence.bin     — n × u32: (chunk_id: 25 bits | rank: 7 bits) per slot
+  evidence.bin     — n × (chunk_id: 25 bits | rank: 7 bits) per slot  [exact mode]
+  fingerprint.bin  — n × b-bit fingerprints per slot                  [approx mode]
+  [no layer_meta.json — mode stored once in partition-level meta.json]
 </code></pre></div>
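The `(chunk_id: 25 bits | rank: 7 bits)` entry fits in a single `u32`. The sketch below assumes `chunk_id` occupies the high 25 bits and `rank` the low 7 bits; the listing does not state the bit order, so treat that as an assumption, and `pack`/`unpack` as hypothetical helper names.

```rust
const RANK_BITS: u32 = 7;
const RANK_MASK: u32 = (1 << RANK_BITS) - 1; // 0x7F
const MAX_CHUNK_ID: u32 = (1 << 25) - 1;

/// Pack (chunk_id, rank) into one u32; returns None if either field overflows.
/// Assumed layout: chunk_id in the high 25 bits, rank in the low 7 bits.
fn pack(chunk_id: u32, rank: u32) -> Option<u32> {
    if chunk_id > MAX_CHUNK_ID || rank > RANK_MASK {
        return None;
    }
    Some((chunk_id << RANK_BITS) | rank)
}

/// Inverse of `pack`.
fn unpack(entry: u32) -> (u32, u32) {
    (entry >> RANK_BITS, entry & RANK_MASK)
}

fn main() {
    let entry = pack(3, 5).expect("fields fit in 25 + 7 bits");
    println!("packed = {entry}, unpacked = {:?}", unpack(entry));
}
```

With 7 bits for the rank, a chunk can address at most 128 kmer positions, and 25 bits allow about 33.5 million chunks per layer.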
 <p>Layers are <strong>disjoint</strong>: a canonical kmer belongs to exactly one layer. Layer 0 is built from dataset A. Adding dataset B:</p>
 <ol>
@@ -1330,17 +1459,43 @@
 <li>Collect kmers of B not present in any layer → set <code>B \ A</code>.</li>
 <li>Build layer 1 from <code>B \ A</code> (dereplicate → count → De Bruijn → unitigs → <code>MphfLayer::build</code>).</li>
 </ol>
+<h3 id="evidence-modes">Evidence modes</h3>
+<p>Three evidence modes are supported via <code>IndexMode</code>; the chosen mode is stored once in <code>PartitionMeta</code> at the partition root. There is no <code>layer_meta.json</code>.</p>
+<p><strong>Exact</strong> (<code>IndexMode::Exact</code>): <code>evidence.bin</code> stores one <code>(chunk_id, rank)</code> pair per MPHF slot. Verification reconstructs the kmer and compares to the query. Zero false positives. <code>.idx</code> required at query time.</p>
+<p><strong>Approx</strong> (<code>IndexMode::Approx { b, z }</code>): <code>fingerprint.bin</code> stores a b-bit hash per slot. False-positive rate 1/2^b per query; Findere z-parameter reduces window FP to ≈ 1/2^(b·z). No <code>.idx</code> written or needed.</p>
+<p><strong>Hybrid</strong> (<code>IndexMode::Hybrid { b, z }</code>): writes both <code>fingerprint.bin</code> and <code>evidence.bin</code> + <code>.idx</code>. <code>find()</code> uses the fingerprint (O(1)); <code>find_strict()</code> uses exact evidence (O(1)).</p>
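To get a feel for the false-positive figures quoted for the approximate mode, the following sketch evaluates 1/2^b and 1/2^(b·z) for a few parameter choices. It assumes fingerprint collisions are independent across the z kmers of a window, and the function names are hypothetical.

```rust
/// Per-kmer false-positive rate for a b-bit fingerprint: 2^-b.
fn fp_per_kmer(b: u32) -> f64 {
    2f64.powi(-(b as i32))
}

/// Approximate window-level rate for parameter z: 2^-(b*z),
/// assuming independent collisions across the window.
fn fp_window(b: u32, z: u32) -> f64 {
    2f64.powi(-((b * z) as i32))
}

fn main() {
    for (b, z) in [(8u32, 1u32), (8, 3), (16, 2)] {
        println!(
            "b={b} z={z}: per-kmer {:.3e}, window {:.3e}",
            fp_per_kmer(b),
            fp_window(b, z)
        );
    }
}
```

For example, b = 8 gives a per-kmer rate of 1/256, while b = 8 with z = 3 brings the window-level rate down to 1/2^24.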
+<h3 id="build-functions">Build functions</h3>
+<div class="highlight"><pre><span></span><code>MphfLayer::build(dir, block_bits, mode: &amp;IndexMode, fill_slot)
+    Pass 1: CanonicalKmerIter + par_bridge() → build mphf.bin  (no .idx used)
+    Pass 2: sequential iter → fill evidence files + call fill_slot
+            .idx written last for Exact/Hybrid (query-time only)
+
+MphfLayer::build_exact_evidence(dir, block_bits)
+    Post-hoc: builds evidence.bin + .idx from existing mphf.bin + unitigs.bin
+    Uses open_sequential(); no .idx required on entry
+
+MphfLayer::build_approx_evidence(dir, b, z)
+    Post-hoc: builds fingerprint.bin from existing mphf.bin + unitigs.bin
+    Uses open_sequential(); never writes .idx
+</code></pre></div>
+<p>There is no <code>build_evidence</code> dispatch wrapper. Callers choose the appropriate post-hoc build directly.</p>
+<p>In <code>obikpartitionner</code>, <code>build_index_layer</code> receives <code>block_bits: u8</code> from <code>IndexConfig::block_bits</code> and forwards it directly to <code>Layer::build</code> and <code>Layer::build_approx_evidence</code>.</p>
 <h3 id="membership-verification">Membership verification</h3>
-<p>ptr_hash maps any input to a valid slot — it does not natively detect absent keys. Membership is verified using the evidence entry: decode the kmer from <code>(chunk_id, rank)</code> and compare to the query. A mismatch means the kmer is absent from this layer; probe the next layer.</p>
+<p>ptr_hash maps any input to a valid slot — it does not natively detect absent keys. Membership is verified using the evidence entry:</p>
+<ul>
+<li><strong>Exact</strong>: decode <code>(chunk_id, rank)</code> from <code>evidence.bin</code>; reconstruct the kmer via <code>unitigs.verify_canonical_kmer</code>; compare to query.</li>
+<li><strong>Approx</strong>: compare <code>kmer.seq_hash()</code> to the b-bit fingerprint stored at the slot.</li>
+</ul>
+<p>A mismatch in either mode means the kmer is absent from this layer; probe the next layer.</p>
 <h3 id="query-algorithm">Query algorithm</h3>
 <div class="highlight"><pre><span></span><code>fn query(kmer) → Option&lt;(layer_index, slot)&gt;:
    for (i, layer) in layers.iter().enumerate():
        slot = layer.mphf.index(kmer)
-        if layer.evidence.decode(slot) matches kmer:
+        if layer.evidence.matches(slot, kmer):   // exact or approx dispatch
            return Some((i, slot))
    return None
 </code></pre></div>
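The pseudocode above can be exercised with a toy model. Everything in this sketch is a stand-in: `ToyLayer` replaces `MphfLayer`, a `HashMap` plus a modulo fallback imitates an MPHF that maps absent keys to some valid slot, the exact evidence stores the kmer itself instead of decoding `(chunk_id, rank)` from unitigs, and `fingerprint` is a hypothetical hash rather than `seq_hash()`.

```rust
use std::collections::HashMap;

/// Per-slot evidence, mirroring the exact/approx split.
enum Evidence {
    /// Stand-in for decoding (chunk_id, rank) and reading the kmer back from unitigs.
    Exact(Vec<u64>),
    /// One b-bit fingerprint per slot.
    Approx { b: u32, fingerprints: Vec<u64> },
}

struct ToyLayer {
    slots: HashMap<u64, usize>,
    evidence: Evidence,
}

/// Hypothetical b-bit fingerprint (1 <= b <= 63): top b bits of a multiplicative hash.
fn fingerprint(kmer: u64, b: u32) -> u64 {
    kmer.wrapping_mul(0x9E37_79B9_7F4A_7C15) >> (64 - b)
}

impl ToyLayer {
    /// `approx_bits = None` builds exact evidence, `Some(b)` builds b-bit fingerprints.
    fn new(kmers: &[u64], approx_bits: Option<u32>) -> Self {
        let slots: HashMap<u64, usize> =
            kmers.iter().enumerate().map(|(i, &k)| (k, i)).collect();
        let evidence = match approx_bits {
            None => Evidence::Exact(kmers.to_vec()),
            Some(b) => Evidence::Approx {
                b,
                fingerprints: kmers.iter().map(|&k| fingerprint(k, b)).collect(),
            },
        };
        ToyLayer { slots, evidence }
    }

    /// Like an MPHF, this returns a slot even for keys that were never indexed.
    fn index(&self, kmer: u64) -> usize {
        let n = self.slots.len().max(1) as u64;
        self.slots.get(&kmer).copied().unwrap_or((kmer % n) as usize)
    }

    /// Membership check against the evidence stored at `slot`.
    fn matches(&self, slot: usize, kmer: u64) -> bool {
        match &self.evidence {
            Evidence::Exact(kmers) => kmers.get(slot) == Some(&kmer),
            Evidence::Approx { b, fingerprints } => {
                fingerprints.get(slot) == Some(&fingerprint(kmer, *b))
            }
        }
    }
}

/// Probe layers in order; the first layer whose evidence matches wins.
fn query(layers: &[ToyLayer], kmer: u64) -> Option<(usize, usize)> {
    layers.iter().enumerate().find_map(|(i, layer)| {
        let slot = layer.index(kmer);
        layer.matches(slot, kmer).then_some((i, slot))
    })
}

fn main() {
    let layers = [ToyLayer::new(&[10, 20, 30], None), ToyLayer::new(&[40, 50], None)];
    println!("{:?} {:?} {:?}", query(&layers, 20), query(&layers, 50), query(&layers, 99));
}
```

In this model an absent kmer still lands on a slot in every layer, and only the evidence comparison rejects it, which is the reason the evidence files exist at all.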
-<p>Expected probe depth: 1 for kmers in layer 0. Each probe is a ptr_hash lookup (~10 ns) plus one evidence decode.</p>
+<p><code>MphfLayer::find</code> dispatches on <code>LayerEvidence</code> in O(1); there are no separate <code>find_exact</code>/<code>find_approx</code> methods that could panic when called in the wrong mode. <code>find_strict</code> always performs an exact check: O(1) for Exact/Hybrid, O(n) sequential scan for Approx. Expected probe depth: 1 for kmers in layer 0. Each probe is a ptr_hash lookup (~10 ns) plus one evidence check.</p>
 <h3 id="merging-layers">Merging layers</h3>
 <p>Two layer chains can be merged by re-indexing their union through the full pipeline. This is expensive (full rebuild) but produces an optimal single-layer index. Merge is a maintenance operation, not a query-path requirement.</p>