docs: expand kmer indexing, filtering, and merging documentation

Expands MkDocs navigation and documentation for evidence elimination, the merge command, and kmer filtering. Refactors kmer representation to a generic `KmerOf<L>` type with a bitwise reverse complement algorithm. Unifies MPHF construction, introduces approximate fingerprint-based indexing, and updates the pipeline, chunkreader, and storage layouts. Adds code coverage reports and clarifies architectural invariants for improved maintainability.
2026-06-04 21:27:01 +02:00
parent 9306ec1c56
commit bb7adc1154
50 changed files with 34226 additions and 1576 deletions
@@ -773,6 +773,34 @@
  
  
  
+    <li class="md-nav__item">
+      <a href="../evidence_elimination/" class="md-nav__link">
+        
+  
+  
+  <span class="md-ellipsis">
+    
+  
+    Evidence elimination (discussion)
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+  
    <li class="md-nav__item">
      <a href="../obilayeredmap/" class="md-nav__link">
        
@@ -851,6 +879,62 @@

              
            
+              
+                
+  
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../merge/" class="md-nav__link">
+        
+  
+  
+  <span class="md-ellipsis">
+    
+  
+    Merge command
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../rebuild_filter/" class="md-nav__link">
+        
+  
+  
+  <span class="md-ellipsis">
+    
+  
+    Kmer filtering (rebuild/dump/unitig)
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
          </ul>
        </nav>
      
@@ -1104,7 +1188,9 @@
 <li><strong>error valley</strong> → suggests min_count (typically the local minimum between the error peak and the coverage peak)</li>
 </ul>
 <h2 id="phase-1-scatter">Phase 1 — Scatter</h2>
-<p>Single streaming pass over raw input files (FASTA/FASTQ, gzip). FASTQ quality scores are ignored. For each read:</p>
+<p>Single streaming pass over raw input files (FASTA/FASTQ, gzip). FASTQ quality scores are ignored.</p>
+<p>Input files are read via <code>open_nuc_stream</code>, which opens and decompresses the file, auto-detects the format (FASTA / FASTQ / GenBank), and yields a sequence of <code>NucPage</code> buffers. Each <code>NucPage</code> is a flat 64 KB buffer of normalised bytes (<code>ACGT</code> + <code>\x00</code> separators), carrying a k−1 byte overlap from the preceding page so that no k-mer is lost at page boundaries. Per-record identity (sequence id, raw bytes) is not preserved; this is intentional — the scatter phase only needs normalised bases to produce superkmers.</p>
+<p>For each read fragment within a page:</p>
 <ol>
 <li><strong>Ambiguous base filter</strong>: cut at any non-ACGT base; discard fragments shorter than k.</li>
 <li><strong>Entropy filter</strong>: scan each fragment with a sliding window of size k. When the kmer <span class="arithmatex">\(K_i = S[i \mathinner{..} i+k-1]\)</span> ended by nucleotide <span class="arithmatex">\(S[j]\)</span> (with <span class="arithmatex">\(j = i+k-1\)</span>) has entropy below threshold <span class="arithmatex">\(\theta\)</span>, emit the current segment and start a new one (see algorithm below). <span class="arithmatex">\(K_i\)</span> belongs to neither segment, and no valid kmer is lost.</li>
@@ -1154,8 +1240,13 @@ B ≈ 100 is tunable; RAM needed ≈ partition_size / B.</p>
    for each kmer in sequence:
        kmer_counts[canonical(kmer)] += COUNT
 </code></pre></div>
-<p>Implemented as an external sort or a temporary HashMap, depending on partition size. At the end of this phase, each distinct canonical kmer has its exact total count.</p>
-<p>Abundance filter applied here: kmers with <code>total_count &lt; q</code> are discarded. <code>q</code> is a collection parameter (0 = keep all, including singletons for ≤1x data).</p>
+<p>Implemented as a three-step pipeline in <code>count_partition()</code>:</p>
+<ol>
+<li><strong>External sort</strong> (<code>kmer_sort::sort_unique_kmers</code>): read dereplicated superkmers, extract canonical kmer raw <code>u64</code> values, sort in RAM-bounded chunks (adaptive: 40% of available RAM ÷ n_threads, min 1 M kmers/chunk), k-way merge with inline dedup → <code>sorted_unique.bin</code>. f0 is now known exactly.</li>
+<li><strong>Provisional MPHF</strong> (ptr_hash): built from <code>sorted_unique.bin</code> via <code>new_from_par_iter(f0, ...)</code>. Stored to <code>mphf1.bin</code>; <code>sorted_unique.bin</code> deleted immediately.</li>
+<li><strong>Accumulation pass</strong>: re-read dereplicated superkmers; for each kmer, <code>slot = mphf.index(kmer.raw())</code>, increment <code>counts1[slot]</code> by the superkmer COUNT. Stored in a <code>PersistentCompactIntVec</code> (<code>counts1.bin</code>).</li>
+</ol>
+<p>At the end of this phase, each distinct canonical kmer has its exact total count, and the frequency spectrum (<code>spectrums/{label}.json</code>) is written to the index root.</p>
 <p>No pre-filter on super-kmer COUNT is possible at phase 2: a super-kmer with COUNT=1 may contain only high-abundance kmers, each present in many other super-kmers across the partition.</p>
 <h2 id="phase-4-super-kmer-compaction">Phase 4 — Super-kmer compaction</h2>
 <p>The valid kmer set from phase 3 is used as a mask to rewrite the super-kmer files:</p>
@@ -1188,14 +1279,52 @@ branching / dead-end → unitig start or end
 <p>Output: <code>unitigs.bin</code> — the permanent evidence structure for the partition. Each kmer in the partition appears at exactly one (unitig_id, offset) location.</p>
 <p><strong>Scope of local unitigs:</strong> these are unitigs of the partition's local de Bruijn graph, not global unitigs. A kmer whose k-1 successor or predecessor falls in another partition appears as a dead end locally and terminates the unitig. This does not affect correctness of verification but means partition-local unitigs cannot be directly reused for global assembly.</p>
 <h2 id="phase-6-mphf-construction-and-index-finalisation">Phase 6 — MPHF construction and index finalisation</h2>
-<p>Built once on the definitive kmer set (all kmers in all unitigs of the partition). See <a href="../obilayeredmap/">obilayeredmap</a> and <a href="../mphf/">MPHF selection</a> for the current implementation.</p>
-<div class="highlight"><pre><span></span><code>kmers from unitigs → MPHF → mphf.bin
-                   → evidence.bin : n × u32, each = (chunk_id: 25 bits | rank: 7 bits)
-                   → payload      : counts/ (mode 2) or presence/ (mode 3)
+<p><code>build_index_layer</code> is called per partition (in parallel via <code>build_layers</code>) with the following parameters sourced from <code>IndexConfig</code>:</p>
+<ul>
+<li><code>block_bits</code> — from <code>IndexConfig::block_bits</code>; controls the <code>.idx</code> block size (2^block_bits unitig chunks per block) for exact evidence</li>
+<li><code>evidence</code> — <code>EvidenceKind::Exact</code> or <code>EvidenceKind::Approx { b, z }</code>; propagated unchanged from <code>IndexConfig::evidence</code></li>
+<li><code>min_ab</code> / <code>max_ab</code> — abundance bounds applied before graph construction</li>
+<li><code>with_counts</code> — whether to store kmer counts alongside set membership</li>
+</ul>
+<p><strong>Abundance filtering:</strong> when <code>min_ab &gt; 1</code> or <code>max_ab.is_some()</code>, the provisional <code>mphf1.bin</code> and <code>counts1.bin</code> produced in phase 3 are memory-mapped. Each canonical kmer is accepted only if its count in <code>counts1</code> satisfies the bounds. If either file is absent, filtering is skipped (all kmers accepted).</p>
+<div class="highlight"><pre><span></span><code>for each kmer in dereplicated super-kmer:
+    ab = counts1[mphf1.index(kmer.raw())]
+    if ab &lt; min_ab || ab &gt; max_ab: skip
+    graph.push(kmer)
 </code></pre></div>
-<p>The MPHF is built in two passes over <code>unitigs.bin</code>: parallel pass for <code>mphf.bin</code>, sequential pass for <code>evidence.bin</code> and payload. The exact kmer count is available from the unitig index (<code>unitigs.bin.idx</code>) before the passes begin.</p>
-<p><strong>Exact verification via unitig evidence:</strong></p>
-<p><code>unitigs.bin</code> serves as the evidence structure. The MPHF maps every input to <code>[0, N)</code> including absent kmers — the unitig read-back (via <code>evidence.bin</code>) is the only correct membership test.</p>
+<p><strong>Graph build and unitig write:</strong></p>
+<p>The surviving kmers are fed into <code>GraphDeBruijn</code>, which computes degrees and yields unitigs. Unitigs are written to <code>layer_0/unitigs.bin</code> via a <code>UnitigFileWriter</code>.</p>
+<p><strong>MPHF and evidence build:</strong></p>
+<p><code>Layer::build</code> (membership-only) or <code>Layer::&lt;PersistentCompactIntMatrix&gt;::build</code> (with counts) is called next. Internally, <code>MphfLayer::build</code> performs two passes:</p>
+<ol>
+<li><strong>Pass 1 (parallel):</strong> build <code>unitigs.bin.idx</code> (block size = 2^<code>block_bits</code>) then construct the MPHF from all canonical kmers in <code>unitigs.bin</code>; store to <code>mphf.bin</code>.</li>
+<li><strong>Pass 2 (sequential):</strong> for each kmer in <code>unitigs.bin</code>, compute its slot and write <code>evidence.bin</code> (<code>chunk_id: 25 bits | rank: 7 bits</code> packed into a <code>u32</code>); also invoke the payload callback (<code>fill_slot</code>) to populate <code>counts/</code> if <code>with_counts</code>.</li>
+</ol>
+<p>After <code>Layer::build</code> completes, <code>layer_meta.json</code> records <code>EvidenceKind::Exact</code>.</p>
+<p><strong>Approximate evidence override:</strong></p>
+<p>If <code>evidence</code> is <code>EvidenceKind::Approx { b, z }</code>, <code>build_approx_evidence</code> is called immediately after <code>Layer::build</code>. It overwrites the exact evidence bundle with <code>fingerprint.bin</code> (b-bit hash per slot) and rewrites <code>layer_meta.json</code> with <code>EvidenceKind::Approx { b, z }</code>. No <code>.idx</code> file is needed at query time in this mode.</p>
+<div class="highlight"><pre><span></span><code>// Exact path  →  evidence.bin + unitigs.bin.idx + layer_meta.json(Exact)
+// Approx path →  fingerprint.bin + layer_meta.json(Approx{b,z})
+//               (evidence.bin left on disk but not used)
+</code></pre></div>
+<p><strong>Partition metadata:</strong></p>
+<p>After all layer files are written, <code>PartitionMeta { n_layers: 1 }</code> is serialised to <code>index/meta.json</code> inside the partition directory. This file is required by <code>LayeredMap::open</code> for subsequent merge operations.</p>
+<p><strong>File layout per partition after phase 6:</strong></p>
+<div class="highlight"><pre><span></span><code>part_XXXXX/
+  index/
+    meta.json                ← PartitionMeta { n_layers: 1 }
+    layer_0/
+      unitigs.bin            ← permanent evidence (all modes)
+      unitigs.bin.idx        ← block index (exact mode only)
+      mphf.bin               ← MPHF
+      evidence.bin           ← exact evidence (exact mode)
+      fingerprint.bin        ← b-bit fingerprints (approx mode)
+      layer_meta.json        ← EvidenceKind tag
+      counts/                ← PersistentCompactIntMatrix (with_counts only)
+</code></pre></div>
+<p><strong>Cleanup:</strong> unless <code>--keep-intermediate</code> is set, <code>remove_build_artifacts</code> deletes <code>dereplicated.skmer.zst</code>, <code>mphf1.bin</code>, and <code>counts1.bin</code> after all partitions are indexed.</p>
+<p>See <a href="../obilayeredmap/">obilayeredmap</a> and <a href="../mphf/">MPHF selection</a> for data structure details.</p>
+<p><strong>Query path (exact evidence):</strong></p>
 <div class="highlight"><pre><span></span><code>query kmer q
  → canonical_minimizer(q) → hash → PART → part_XXXXX/
  → MPHF(q) → slot s
@@ -1204,7 +1333,13 @@ branching / dead-end → unitig start or end
  → match   : return payload[s]   ← exact hit
  → no match: kmer absent         ← MPHF collision on absent kmer
 </code></pre></div>
-<p><code>superkmers.bin.gz</code> is no longer needed at this point and can be deleted.</p>
+<p><strong>Query path (approximate evidence):</strong></p>
+<div class="highlight"><pre><span></span><code>query kmer q
+  → MPHF(q) → slot s
+  → fingerprint[s] matches seq_hash(q)?
+  → yes : probable hit (FP rate = 1/2^b per kmer, 1/2^(b·z) per z-window)
+  → no  : kmer absent
+</code></pre></div>
 <div class="footnote">
 <hr />
 <ol>