docs: expand kmer indexing, filtering, and merging documentation

Expands MkDocs navigation and documentation for evidence elimination, the merge command, and kmer filtering. Refactors kmer representation to a generic `KmerOf<L>` type with a bitwise reverse complement algorithm. Unifies MPHF construction, introduces approximate fingerprint-based indexing, and updates the pipeline, chunkreader, and storage layouts. Adds code coverage reports and clarifies architectural invariants for improved maintainability.
2026-06-04 21:27:01 +02:00
parent 9306ec1c56
commit bb7adc1154
50 changed files with 34226 additions and 1576 deletions
@@ -213,6 +213,17 @@
    </label>
    <ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
      
+        <li class="md-nav__item">
+  <a href="#subcommands" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Subcommands
+      
+    </span>
+  </a>
+  
+</li>
+      
        <li class="md-nav__item">
  <a href="#constraints" class="md-nav__link">
    <span class="md-ellipsis">
@@ -222,6 +233,28 @@
    </span>
  </a>
  
+</li>
+      
+        <li class="md-nav__item">
+  <a href="#parameter-constraints-enforced-at-cli" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Parameter constraints (enforced at CLI)
+      
+    </span>
+  </a>
+  
+</li>
+      
+        <li class="md-nav__item">
+  <a href="#genome-label-constraints" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Genome label constraints
+      
+    </span>
+  </a>
+  
 </li>
      
        <li class="md-nav__item">
@@ -714,6 +747,34 @@
  
  
  
+    <li class="md-nav__item">
+      <a href="implementation/evidence_elimination/" class="md-nav__link">
+        
+  
+  
+  <span class="md-ellipsis">
+    
+  
+    Evidence elimination (discussion)
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+  
    <li class="md-nav__item">
      <a href="implementation/obilayeredmap/" class="md-nav__link">
        
@@ -792,6 +853,62 @@

              
            
+              
+                
+  
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="implementation/merge/" class="md-nav__link">
+        
+  
+  
+  <span class="md-ellipsis">
+    
+  
+    Merge command
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="implementation/rebuild_filter/" class="md-nav__link">
+        
+  
+  
+  <span class="md-ellipsis">
+    
+  
+    Kmer filtering (rebuild/dump/unitig)
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
          </ul>
        </nav>
      
@@ -935,6 +1052,17 @@
    </label>
    <ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
      
+        <li class="md-nav__item">
+  <a href="#subcommands" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Subcommands
+      
+    </span>
+  </a>
+  
+</li>
+      
        <li class="md-nav__item">
  <a href="#constraints" class="md-nav__link">
    <span class="md-ellipsis">
@@ -944,6 +1072,28 @@
    </span>
  </a>
  
+</li>
+      
+        <li class="md-nav__item">
+  <a href="#parameter-constraints-enforced-at-cli" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Parameter constraints (enforced at CLI)
+      
+    </span>
+  </a>
+  
+</li>
+      
+        <li class="md-nav__item">
+  <a href="#genome-label-constraints" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Genome label constraints
+      
+    </span>
+  </a>
+  
 </li>
      
        <li class="md-nav__item">
@@ -976,12 +1126,155 @@

 <h1 id="obikmer">obikmer</h1>
 <p><code>obikmer</code> is a Rust tool for manipulation, counting, indexing, and set operations on DNA sequences represented as kmer sets.</p>
+<h2 id="subcommands">Subcommands</h2>
+<table>
+<thead>
+<tr>
+<th>Subcommand</th>
+<th>Purpose</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td><code>superkmer</code></td>
+<td>Extract super-kmers from a sequence file and write them to stdout</td>
+</tr>
+<tr>
+<td><code>index</code></td>
+<td>Build a complete genome index (scatter → dereplicate → count → layered MPHF)</td>
+</tr>
+<tr>
+<td><code>merge</code></td>
+<td>Merge multiple built indexes into one</td>
+</tr>
+<tr>
+<td><code>rebuild</code></td>
+<td>Filter and compact an existing index into a new single-layer index; supports ingroup/outgroup predicates on genome metadata</td>
+</tr>
+<tr>
+<td><code>query</code></td>
+<td>Query an index with sequences and annotate matches</td>
+</tr>
+<tr>
+<td><code>dump</code></td>
+<td>Dump all indexed k-mers as CSV (kmer + per-genome counts or presence); supports the same ingroup/outgroup filtering as <code>rebuild</code></td>
+</tr>
+<tr>
+<td><code>annotate</code></td>
+<td>Add or update genome metadata from a CSV file, or dump the metadata as CSV</td>
+</tr>
+<tr>
+<td><code>distance</code></td>
+<td>Compute a pairwise distance matrix between genomes; optionally build NJ/UPGMA trees</td>
+</tr>
+<tr>
+<td><code>unitig</code></td>
+<td>Build a global de Bruijn graph across all partitions and enumerate its unitigs as FASTA; supports the same ingroup/outgroup filtering as <code>rebuild</code></td>
+</tr>
+<tr>
+<td><code>estimate</code></td>
+<td>Estimate approximate-index parameters (z, evidence bits, FP rates) before indexing</td>
+</tr>
+<tr>
+<td><code>reindex</code></td>
+<td>Convert an index's evidence in place: exact ↔ approx</td>
+</tr>
+<tr>
+<td><code>utils</code></td>
+<td>Miscellaneous index utilities: <code>--new-label NEW=OLD</code> renames a genome label; <code>--upgrade-index</code> adds missing <code>layer_meta.json</code> to old indexes</td>
+</tr>
+<tr>
+<td><code>pack</code></td>
+<td>Pack per-column matrix files into single-file format to reduce query I/O</td>
+</tr>
+</tbody>
+</table>
 <h2 id="constraints">Constraints</h2>
 <ul>
 <li>Target scale: individual genome datasets, tens of Gbases</li>
 <li>Maximum efficiency in computation, memory, and disk usage</li>
-<li>Input formats: FASTA, FASTQ, gzip, streaming stdin</li>
+<li>k odd, k ∈ [11, 31], fixed at runtime; kmer fits in a u64 (2 bits/base)</li>
+<li>Canonical form: <code>min(kmer, revcomp(kmer))</code>; a kmer and its reverse complement share one representative, which roughly halves the kmer space</li>
+<li>Input formats for <code>index</code>/<code>superkmer</code>: FASTA (<code>.fa</code>, <code>.fasta</code>), FASTQ (<code>.fq</code>, <code>.fastq</code>), GenBank flat file (<code>.gb</code>, <code>.gbk</code>, <code>.gbff</code>), all optionally gzip-compressed; directories expanded recursively; streaming stdin via <code>-</code></li>
+<li>Input formats for <code>query</code>: FASTA, FASTQ, optionally gzip-compressed; streaming stdin via <code>-</code></li>
 </ul>
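+<p>The 2-bit packing and canonical form listed above can be sketched as follows. This is a minimal illustration, not obikmer's actual <code>KmerOf&lt;L&gt;</code> implementation: the encoding (A=0, C=1, G=2, T=3), the right-aligned layout, and the function names <code>encode</code>, <code>revcomp</code>, and <code>canonical</code> are all assumptions made for the example.</p>

```rust
/// Encode a DNA string (1..=32 bases) as a right-aligned, 2-bit-per-base u64.
/// Assumed encoding for this sketch only: A=0, C=1, G=2, T=3.
fn encode(seq: &str) -> Option<u64> {
    if seq.is_empty() || seq.len() > 32 {
        return None;
    }
    let mut v: u64 = 0;
    for b in seq.bytes() {
        let code: u64 = match b.to_ascii_uppercase() {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None, // ambiguous or invalid base
        };
        v = (v << 2) | code;
    }
    Some(v)
}

/// Bitwise reverse complement of a right-aligned kmer of length k (1..=32).
fn revcomp(kmer: u64, k: u32) -> u64 {
    assert!((1..=32).contains(&k));
    // With A=0, C=1, G=2, T=3 the complement of a base is its bitwise NOT.
    let mut x = !kmer;
    // Reverse the order of the 32 two-bit groups in the word:
    // swap adjacent groups, then nibbles, then whole bytes.
    x = ((x >> 2) & 0x3333_3333_3333_3333) | ((x & 0x3333_3333_3333_3333) << 2);
    x = ((x >> 4) & 0x0F0F_0F0F_0F0F_0F0F) | ((x & 0x0F0F_0F0F_0F0F_0F0F) << 4);
    x = x.swap_bytes();
    // The k bases now occupy the top 2k bits; shift them back down.
    x >> (64 - 2 * k)
}

/// Canonical form: the smaller of a kmer and its reverse complement.
fn canonical(kmer: u64, k: u32) -> u64 {
    kmer.min(revcomp(kmer, k))
}
```

+<p>Under these assumptions, <code>ACG</code> and <code>CGT</code> map to the same canonical value, so both strands of a sequence yield the same set of kmers.</p>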
+<h2 id="parameter-constraints-enforced-at-cli">Parameter constraints (enforced at CLI)</h2>
+<p>All constraints below are checked by <code>CommonArgs::validate()</code> at the start of <code>superkmer</code> and <code>index</code>. An invalid value makes the command exit immediately with an error.</p>
+<table>
+<thead>
+<tr>
+<th>Parameter</th>
+<th>Constraint</th>
+<th>Reason</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>k (<code>--kmer-size</code>)</td>
+<td>odd</td>
+<td>even k allows palindromic k-mers: kmer == revcomp(kmer), breaking the canonical form invariant</td>
+</tr>
+<tr>
+<td>k (<code>--kmer-size</code>)</td>
+<td>k ∈ [11, 31]</td>
+<td>k = 32 is even, and the next odd value, k = 33, needs 66 bits at 2 bits/base and overflows u64; k &lt; 11 gives insufficient specificity</td>
+</tr>
+<tr>
+<td>m (<code>--minimizer-size</code>)</td>
+<td>odd</td>
+<td>same palindrome argument as k</td>
+</tr>
+<tr>
+<td>m (<code>--minimizer-size</code>)</td>
+<td>3 ≤ m ≤ k−1</td>
+<td>minimizer must be strictly shorter than the kmer</td>
+</tr>
+<tr>
+<td>z (<code>-z</code>, Findere, <code>index --approx</code> only)</td>
+<td>z ≤ k−1</td>
+<td>effective indexed kmer size is k−z+1; z ≥ k would make it ≤ 1</td>
+</tr>
+</tbody>
+</table>
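+<p>The checks in the table above could be expressed roughly as below. This is a hedged sketch only: the free function <code>validate_params</code>, its signature, and its error messages are hypothetical and do not reproduce the real <code>CommonArgs::validate()</code>.</p>

```rust
/// Hypothetical stand-in for the CLI checks listed in the table.
/// `z` is `None` when approximate (Findere) indexing is not requested.
fn validate_params(k: u32, m: u32, z: Option<u32>) -> Result<(), String> {
    // k must lie in [11, 31] and be odd.
    if !(11..=31).contains(&k) {
        return Err(format!("k must be in [11, 31], got {k}"));
    }
    if k % 2 == 0 {
        return Err(format!("k must be odd, got {k}"));
    }
    // m must be odd and satisfy 3 <= m <= k-1.
    if m % 2 == 0 {
        return Err(format!("m must be odd, got {m}"));
    }
    if m < 3 || m > k - 1 {
        return Err(format!("m must satisfy 3 <= m <= {}, got {m}", k - 1));
    }
    // z, when given, must satisfy z <= k-1.
    if let Some(z) = z {
        if z > k - 1 {
            return Err(format!("z must be <= {}, got {z}", k - 1));
        }
    }
    Ok(())
}
```

+<p>Returning a <code>Result</code> rather than exiting inside the function keeps the checks testable; the caller decides how to report the error and terminate.</p>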
+<h2 id="genome-label-constraints">Genome label constraints</h2>
+<p>Genome labels are arbitrary Unicode strings with the following restrictions:</p>
+<table>
+<thead>
+<tr>
+<th>Character</th>
+<th>Forbidden</th>
+<th>Reason / note</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td><code>/</code></td>
+<td>yes</td>
+<td>filesystem path separator</td>
+</tr>
+<tr>
+<td><code>=</code></td>
+<td>yes</td>
+<td><code>--new-label</code> parser separator</td>
+</tr>
+<tr>
+<td><code>\0</code></td>
+<td>yes</td>
+<td>null byte</td>
+</tr>
+<tr>
+<td><code>\n</code> <code>\r</code> <code>\t</code></td>
+<td>yes</td>
+<td>break CSV output</td>
+</tr>
+<tr>
+<td>spaces</td>
+<td><strong>allowed</strong></td>
+<td>use shell quoting: <code>--new-label 'new label=old label'</code></td>
+</tr>
+</tbody>
+</table>
+<p>Empty labels are also rejected. Labels derived automatically from the index directory name (when <code>--label</code> is omitted) are not validated since they come from the filesystem and are already safe.</p>
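+<p>A label check following the restrictions above might look like the sketch below. The names <code>validate_label</code> and <code>parse_rename</code> are hypothetical, as is the choice to validate only the new label when parsing a <code>NEW=OLD</code> argument; obikmer's own code may differ.</p>

```rust
/// Characters rejected in genome labels, per the table above.
const FORBIDDEN: [char; 6] = ['/', '=', '\0', '\n', '\r', '\t'];

/// Reject empty labels and labels containing a forbidden character.
/// Spaces and other Unicode characters are accepted.
fn validate_label(label: &str) -> Result<(), String> {
    if label.is_empty() {
        return Err("label must not be empty".to_string());
    }
    if let Some(c) = label.chars().find(|c| FORBIDDEN.contains(c)) {
        return Err(format!("label contains forbidden character {:?}", c));
    }
    Ok(())
}

/// Split a `NEW=OLD` argument at the first '='.
/// Because '=' is forbidden in labels, NEW can never contain one.
/// Only NEW is validated here (an assumption of this sketch);
/// OLD is returned as-is for lookup in the index.
fn parse_rename(arg: &str) -> Result<(String, String), String> {
    let (new, old) = arg
        .split_once('=')
        .ok_or_else(|| format!("expected NEW=OLD, got {:?}", arg))?;
    validate_label(new)?;
    Ok((new.to_string(), old.to_string()))
}
```

+<p>With shell quoting as in <code>--new-label 'new label=old label'</code>, the program receives the argument as one string, so embedded spaces survive the split.</p>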
 <h2 id="priority-operations">Priority operations</h2>
 <ul>
 <li>Kmer counting (frequencies)</li>