docs: expand kmer indexing, filtering, and merging documentation

Expands MkDocs navigation and documentation for evidence elimination, the merge command, and kmer filtering. Refactors kmer representation to a generic `KmerOf<L>` type with a bitwise reverse complement algorithm. Unifies MPHF construction, introduces approximate fingerprint-based indexing, and updates the pipeline, chunkreader, and storage layouts. Adds code coverage reports and clarifies architectural invariants for improved maintainability.
Eric Coissac
2026-06-04 21:27:01 +02:00
parent 9306ec1c56
commit bb7adc1154
50 changed files with 34226 additions and 1576 deletions
+294 -1
@@ -213,6 +213,17 @@
</label>
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
<li class="md-nav__item">
<a href="#subcommands" class="md-nav__link">
<span class="md-ellipsis">
Subcommands
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#constraints" class="md-nav__link">
<span class="md-ellipsis">
@@ -222,6 +233,28 @@
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#parameter-constraints-enforced-at-cli" class="md-nav__link">
<span class="md-ellipsis">
Parameter constraints (enforced at CLI)
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#genome-label-constraints" class="md-nav__link">
<span class="md-ellipsis">
Genome label constraints
</span>
</a>
</li>
<li class="md-nav__item">
@@ -714,6 +747,34 @@
<li class="md-nav__item">
<a href="implementation/evidence_elimination/" class="md-nav__link">
<span class="md-ellipsis">
Evidence elimination (discussion)
</span>
</a>
</li>
<li class="md-nav__item">
<a href="implementation/obilayeredmap/" class="md-nav__link">
@@ -792,6 +853,62 @@
<li class="md-nav__item">
<a href="implementation/merge/" class="md-nav__link">
<span class="md-ellipsis">
Merge command
</span>
</a>
</li>
<li class="md-nav__item">
<a href="implementation/rebuild_filter/" class="md-nav__link">
<span class="md-ellipsis">
Kmer filtering (rebuild/dump/unitig)
</span>
</a>
</li>
</ul>
</nav>
@@ -935,6 +1052,17 @@
</label>
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
<li class="md-nav__item">
<a href="#subcommands" class="md-nav__link">
<span class="md-ellipsis">
Subcommands
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#constraints" class="md-nav__link">
<span class="md-ellipsis">
@@ -944,6 +1072,28 @@
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#parameter-constraints-enforced-at-cli" class="md-nav__link">
<span class="md-ellipsis">
Parameter constraints (enforced at CLI)
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#genome-label-constraints" class="md-nav__link">
<span class="md-ellipsis">
Genome label constraints
</span>
</a>
</li>
<li class="md-nav__item">
@@ -976,12 +1126,155 @@
<h1 id="obikmer">obikmer</h1>
<p><code>obikmer</code> is a Rust tool for manipulating, counting, and indexing DNA sequences represented as kmer sets, and for performing set operations on them.</p>
<h2 id="subcommands">Subcommands</h2>
<table>
<thead>
<tr>
<th>Subcommand</th>
<th>Purpose</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>superkmer</code></td>
<td>Extract super-kmers from a sequence file and write to stdout</td>
</tr>
<tr>
<td><code>index</code></td>
<td>Build a complete genome index (scatter → dereplicate → count → layered MPHF)</td>
</tr>
<tr>
<td><code>merge</code></td>
<td>Merge multiple built indexes into one</td>
</tr>
<tr>
<td><code>rebuild</code></td>
<td>Filter and compact an existing index into a new single-layer index; supports ingroup/outgroup predicates on genome metadata</td>
</tr>
<tr>
<td><code>query</code></td>
<td>Query an index with sequences and annotate matches</td>
</tr>
<tr>
<td><code>dump</code></td>
<td>Dump all indexed k-mers as CSV (kmer + per-genome counts or presence); supports the same ingroup/outgroup filtering as <code>rebuild</code></td>
</tr>
<tr>
<td><code>annotate</code></td>
<td>Add or update genome metadata from a CSV file, or dump the metadata as CSV</td>
</tr>
<tr>
<td><code>distance</code></td>
<td>Compute pairwise distance matrix between genomes; optionally build NJ/UPGMA trees</td>
</tr>
<tr>
<td><code>unitig</code></td>
<td>Build a global de Bruijn graph across all partitions and enumerate its unitigs as FASTA; supports the same ingroup/outgroup filtering as <code>rebuild</code></td>
</tr>
<tr>
<td><code>estimate</code></td>
<td>Estimate approximate-index parameters (z, evidence bits, FP rates) before indexing</td>
</tr>
<tr>
<td><code>reindex</code></td>
<td>Convert an index's evidence in-place: exact ↔ approx</td>
</tr>
<tr>
<td><code>utils</code></td>
<td>Miscellaneous index utilities: <code>--new-label NEW=OLD</code> renames a genome label; <code>--upgrade-index</code> adds missing <code>layer_meta.json</code> to old indexes</td>
</tr>
<tr>
<td><code>pack</code></td>
<td>Pack per-column matrix files into single-file format to reduce query I/O</td>
</tr>
</tbody>
</table>
<h2 id="constraints">Constraints</h2>
<ul>
<li>Target scale: individual genome datasets, tens of Gbases</li>
<li>Maximum efficiency in computation, memory, and disk usage</li>
<li>Input formats: FASTA, FASTQ, gzip, streaming stdin</li>
<li>k odd, k ∈ [11, 31], fixed at runtime; kmer fits in a u64 (2 bits/base)</li>
<li>Canonical form: <code>min(kmer, revcomp(kmer))</code>, which halves the strand-symmetric kmer space</li>
<li>Input formats for <code>index</code>/<code>superkmer</code>: FASTA (<code>.fa</code>, <code>.fasta</code>), FASTQ (<code>.fq</code>, <code>.fastq</code>), GenBank flat file (<code>.gb</code>, <code>.gbk</code>, <code>.gbff</code>), all optionally gzip-compressed; directories expanded recursively; streaming stdin via <code>-</code></li>
<li>Input formats for <code>query</code>: FASTA, FASTQ, optionally gzip-compressed; streaming stdin via <code>-</code></li>
</ul>
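<p>The 2 bits/base packing and the canonical form listed above can be illustrated with a minimal sketch. It assumes the common encoding A=0, C=1, G=2, T=3 and uses hypothetical free functions (<code>encode</code>, <code>revcomp</code>, <code>canonical</code>); it is not <code>obikmer</code>'s actual <code>KmerOf&lt;L&gt;</code> implementation, whose encoding and bit layout may differ.</p>

```rust
/// Pack an ASCII DNA string into a u64 at 2 bits per base.
/// Assumed encoding (not confirmed by the docs): A=0, C=1, G=2, T=3.
/// The first base ends up in the most significant used bits.
/// Returns None for empty input, more than 31 bases, or non-ACGT bytes.
fn encode(seq: &[u8]) -> Option<u64> {
    if seq.is_empty() || seq.len() > 31 {
        return None;
    }
    let mut packed = 0u64;
    for &base in seq {
        let code: u64 = match base {
            b'A' | b'a' => 0,
            b'C' | b'c' => 1,
            b'G' | b'g' => 2,
            b'T' | b't' => 3,
            _ => return None,
        };
        packed = (packed << 2) | code;
    }
    Some(packed)
}

/// Reverse complement of a k-mer stored in the low 2*k bits, 1 <= k <= 32.
fn revcomp(kmer: u64, k: u32) -> u64 {
    debug_assert!((1..=32).contains(&k));
    // With this encoding, complementing a base is a bitwise NOT of its 2 bits.
    let mut x = !kmer;
    // Reverse the order of the 32 two-bit groups in the word:
    // swap pairs inside each nibble, nibbles inside each byte, then bytes.
    x = ((x >> 2) & 0x3333_3333_3333_3333) | ((x & 0x3333_3333_3333_3333) << 2);
    x = ((x >> 4) & 0x0F0F_0F0F_0F0F_0F0F) | ((x & 0x0F0F_0F0F_0F0F_0F0F) << 4);
    x = x.swap_bytes();
    // The k bases now sit in the high 2*k bits; shift them back down.
    // The unused positions (all ones after the NOT) are shifted out.
    x >> (64 - 2 * k)
}

/// Canonical form: the numerically smaller of a k-mer and its reverse complement.
fn canonical(kmer: u64, k: u32) -> u64 {
    kmer.min(revcomp(kmer, k))
}
```

<p>Under these assumptions, <code>ACG</code> packs to 6 and its reverse complement <code>CGT</code> packs to 27, so both strands map to the canonical value 6. Which strand counts as "smaller" depends entirely on the chosen base encoding.</p>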
<h2 id="parameter-constraints-enforced-at-cli">Parameter constraints (enforced at CLI)</h2>
<p>All constraints below are checked by <code>CommonArgs::validate()</code> at the start of <code>superkmer</code> and <code>index</code>. Invalid values exit immediately with an error.</p>
<table>
<thead>
<tr>
<th>Parameter</th>
<th>Constraint</th>
<th>Reason</th>
</tr>
</thead>
<tbody>
<tr>
<td>k (<code>--kmer-size</code>)</td>
<td>odd</td>
<td>even k allows palindromic k-mers: kmer == revcomp(kmer), breaking the canonical form invariant</td>
</tr>
<tr>
<td>k (<code>--kmer-size</code>)</td>
<td>k ∈ [11, 31]</td>
<td>k &gt; 31 overflows u64 at 2 bits/base; k &lt; 11 gives insufficient specificity</td>
</tr>
<tr>
<td>m (<code>--minimizer-size</code>)</td>
<td>odd</td>
<td>same palindrome argument as k</td>
</tr>
<tr>
<td>m (<code>--minimizer-size</code>)</td>
<td>3 ≤ m ≤ k−1</td>
<td>minimizer must be strictly shorter than the kmer</td>
</tr>
<tr>
<td>z (<code>-z</code>, Findere, <code>index --approx</code> only)</td>
<td>z ≤ k−1</td>
<td>effective indexed kmer size is k−z+1; z ≥ k would make it ≤ 0</td>
</tr>
</tbody>
</table>
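<p>The checks in the table can be sketched as a single function. This is an illustration under stated assumptions, not the body of <code>CommonArgs::validate()</code>: the function name <code>validate_params</code>, its signature, and the error messages are hypothetical, and <code>z</code> is modelled as an <code>Option</code> because it only applies to <code>index --approx</code>.</p>

```rust
/// Hypothetical stand-in for the documented CLI checks on k, m and z.
/// Returns Ok(()) when every constraint from the table holds.
fn validate_params(k: usize, m: usize, z: Option<usize>) -> Result<(), String> {
    // k must be odd so that no k-mer equals its own reverse complement.
    if k % 2 == 0 {
        return Err(format!("k = {} must be odd", k));
    }
    // k must lie in the documented range [11, 31].
    if !(11..=31).contains(&k) {
        return Err(format!("k = {} must be in [11, 31]", k));
    }
    // Same palindrome argument for the minimizer size.
    if m % 2 == 0 {
        return Err(format!("m = {} must be odd", m));
    }
    // The minimizer must be strictly shorter than the k-mer (k >= 11 here,
    // so k - 1 cannot underflow).
    if m < 3 || m > k - 1 {
        return Err(format!("m = {} must satisfy 3 <= m <= {}", m, k - 1));
    }
    // z is only checked when approximate indexing is requested.
    if let Some(z) = z {
        if z > k - 1 {
            return Err(format!("z = {} must satisfy z <= {}", z, k - 1));
        }
    }
    Ok(())
}
```

<p>For example, <code>validate_params(31, 11, None)</code> is accepted, while <code>validate_params(21, 10, None)</code> is rejected because the minimizer size is even. Since both k and m must be odd, the bound m ≤ k−1 is in practice m ≤ k−2.</p>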
<h2 id="genome-label-constraints">Genome label constraints</h2>
<p>Genome labels are arbitrary Unicode strings with the following restrictions:</p>
<table>
<thead>
<tr>
<th>Character</th>
<th>Forbidden</th>
<th>Reason</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>/</code></td>
<td>yes</td>
<td>filesystem path separator</td>
</tr>
<tr>
<td><code>=</code></td>
<td>yes</td>
<td><code>--new-label</code> parser separator</td>
</tr>
<tr>
<td><code>\0</code></td>
<td>yes</td>
<td>null byte</td>
</tr>
<tr>
<td><code>\n</code> <code>\r</code> <code>\t</code></td>
<td>yes</td>
<td>break CSV output</td>
</tr>
<tr>
<td>spaces</td>
<td><strong>allowed</strong></td>
<td>use shell quoting: <code>--new-label 'new label=old label'</code></td>
</tr>
</tbody>
</table>
<p>Empty labels are also rejected. Labels derived automatically from the index directory name (when <code>--label</code> is omitted) are not validated since they come from the filesystem and are already safe.</p>
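<p>A minimal sketch of the label rules above, assuming a hypothetical helper <code>validate_label</code> (the name, signature, and error text are illustrative and not taken from <code>obikmer</code>):</p>

```rust
/// Hypothetical check for the documented genome label restrictions:
/// non-empty, and free of '/', '=', NUL, newline, carriage return and tab.
/// Spaces and any other Unicode characters are accepted.
fn validate_label(label: &str) -> Result<(), String> {
    if label.is_empty() {
        return Err("genome label must not be empty".to_string());
    }
    for c in label.chars() {
        if matches!(c, '/' | '=' | '\0' | '\n' | '\r' | '\t') {
            return Err(format!(
                "genome label {:?} contains forbidden character {:?}",
                label, c
            ));
        }
    }
    Ok(())
}
```

<p>With this sketch, <code>"E. coli K-12"</code> passes (spaces are allowed), whereas <code>"new=old"</code> and <code>"a/b"</code> are rejected.</p>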
<h2 id="priority-operations">Priority operations</h2>
<ul>
<li>Kmer counting (frequencies)</li>