docs: expand kmer indexing, filtering, and merging documentation
Expands MkDocs navigation and documentation for evidence elimination, the merge command, and kmer filtering. Refactors kmer representation to a generic `KmerOf<L>` type with a bitwise reverse complement algorithm. Unifies MPHF construction, introduces approximate fingerprint-based indexing, and updates the pipeline, chunkreader, and storage layouts. Adds code coverage reports and clarifies architectural invariants for improved maintainability.
+294
-1
@@ -213,6 +213,17 @@
</label>
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
<li class="md-nav__item">
<a href="#subcommands" class="md-nav__link">
<span class="md-ellipsis">
Subcommands
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#constraints" class="md-nav__link">
<span class="md-ellipsis">
@@ -222,6 +233,28 @@
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#parameter-constraints-enforced-at-cli" class="md-nav__link">
<span class="md-ellipsis">
Parameter constraints (enforced at CLI)
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#genome-label-constraints" class="md-nav__link">
<span class="md-ellipsis">
Genome label constraints
</span>
</a>
</li>
<li class="md-nav__item">
@@ -714,6 +747,34 @@
<li class="md-nav__item">
<a href="implementation/evidence_elimination/" class="md-nav__link">
<span class="md-ellipsis">
Evidence elimination (discussion)
</span>
</a>
</li>
<li class="md-nav__item">
<a href="implementation/obilayeredmap/" class="md-nav__link">
@@ -792,6 +853,62 @@
<li class="md-nav__item">
<a href="implementation/merge/" class="md-nav__link">
<span class="md-ellipsis">
Merge command
</span>
</a>
</li>
<li class="md-nav__item">
<a href="implementation/rebuild_filter/" class="md-nav__link">
<span class="md-ellipsis">
Kmer filtering (rebuild/dump/unitig)
</span>
</a>
</li>
</ul>
</nav>
@@ -935,6 +1052,17 @@
</label>
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
<li class="md-nav__item">
<a href="#subcommands" class="md-nav__link">
<span class="md-ellipsis">
Subcommands
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#constraints" class="md-nav__link">
<span class="md-ellipsis">
@@ -944,6 +1072,28 @@
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#parameter-constraints-enforced-at-cli" class="md-nav__link">
<span class="md-ellipsis">
Parameter constraints (enforced at CLI)
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#genome-label-constraints" class="md-nav__link">
<span class="md-ellipsis">
Genome label constraints
</span>
</a>
</li>
<li class="md-nav__item">
@@ -976,12 +1126,155 @@
<h1 id="obikmer">obikmer</h1>
<p><code>obikmer</code> is a Rust tool for manipulation, counting, indexing, and set operations on DNA sequences represented as kmer sets.</p>
<h2 id="subcommands">Subcommands</h2>
<table>
<thead>
<tr>
<th>Subcommand</th>
<th>Purpose</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>superkmer</code></td>
<td>Extract super-kmers from a sequence file and write to stdout</td>
</tr>
<tr>
<td><code>index</code></td>
<td>Build a complete genome index (scatter → dereplicate → count → layered MPHF)</td>
</tr>
<tr>
<td><code>merge</code></td>
<td>Merge multiple built indexes into one</td>
</tr>
<tr>
<td><code>rebuild</code></td>
<td>Filter and compact an existing index into a new single-layer index; supports ingroup/outgroup predicates on genome metadata</td>
</tr>
<tr>
<td><code>query</code></td>
<td>Query an index with sequences and annotate matches</td>
</tr>
<tr>
<td><code>dump</code></td>
<td>Dump all indexed k-mers as CSV (kmer + per-genome counts or presence); supports the same ingroup/outgroup filtering as <code>rebuild</code></td>
</tr>
<tr>
<td><code>annotate</code></td>
<td>Add or update genome metadata from a CSV file; or dump metadata as CSV</td>
</tr>
<tr>
<td><code>distance</code></td>
<td>Compute pairwise distance matrix between genomes; optionally build NJ/UPGMA trees</td>
</tr>
<tr>
<td><code>unitig</code></td>
<td>Build a global de Bruijn graph across all partitions and enumerate its unitigs as FASTA; supports the same ingroup/outgroup filtering as <code>rebuild</code></td>
</tr>
<tr>
<td><code>estimate</code></td>
<td>Estimate approximate-index parameters (z, evidence bits, FP rates) before indexing</td>
</tr>
<tr>
<td><code>reindex</code></td>
<td>Convert an index's evidence in-place: exact ↔ approx</td>
</tr>
<tr>
<td><code>utils</code></td>
<td>Miscellaneous index utilities: <code>--new-label NEW=OLD</code> renames a genome label; <code>--upgrade-index</code> adds missing <code>layer_meta.json</code> to old indexes</td>
</tr>
<tr>
<td><code>pack</code></td>
<td>Pack per-column matrix files into single-file format to reduce query I/O</td>
</tr>
</tbody>
</table>
<h2 id="constraints">Constraints</h2>
<ul>
<li>Target scale: individual genome datasets, tens of Gbases</li>
<li>Maximum efficiency in computation, memory, and disk usage</li>
<li>Input formats: FASTA, FASTQ, gzip, streaming stdin</li>
<li>k odd, k ∈ [11, 31], fixed at runtime; kmer fits in a u64 (2 bits/base)</li>
<li>Canonical form: <code>min(kmer, revcomp(kmer))</code> reduces strand-symmetric space by half</li>
<li>Input formats for <code>index</code>/<code>superkmer</code>: FASTA (<code>.fa</code>, <code>.fasta</code>), FASTQ (<code>.fq</code>, <code>.fastq</code>), GenBank flat file (<code>.gb</code>, <code>.gbk</code>, <code>.gbff</code>), all optionally gzip-compressed; directories expanded recursively; streaming stdin via <code>-</code></li>
<li>Input formats for <code>query</code>: FASTA, FASTQ, optionally gzip-compressed; streaming stdin via <code>-</code></li>
</ul>
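<p>The 2-bit packing and the canonical form listed above can be sketched as follows. This is a minimal illustration, not obikmer's implementation: the base codes (A=0, C=1, G=2, T=3) and the function names <code>encode</code>, <code>revcomp</code>, and <code>canonical</code> are assumptions made for the example, and the reverse complement uses a plain loop rather than any bitwise shortcut the tool may use.</p>

```rust
/// Pack a DNA string of 1..=32 bases into a u64, 2 bits per base.
/// Assumed encoding (hypothetical, not taken from obikmer): A=0, C=1, G=2, T=3.
/// Returns None for an empty string, a string longer than 32 bases,
/// or any character other than A/C/G/T (case-insensitive).
fn encode(seq: &str) -> Option<u64> {
    if seq.is_empty() || seq.len() > 32 {
        return None;
    }
    let mut kmer = 0u64;
    for b in seq.bytes() {
        let code: u64 = match b.to_ascii_uppercase() {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None,
        };
        // Shift the bases read so far and append the new one.
        kmer = (kmer << 2) | code;
    }
    Some(kmer)
}

/// Reverse complement of a kmer stored in the low 2*k bits.
/// With the encoding above, complementing a base is a bitwise NOT of its 2 bits.
fn revcomp(kmer: u64, k: u32) -> u64 {
    let mut x = !kmer;
    let mut rc = 0u64;
    for _ in 0..k {
        // Take the last base of x and append it to rc, reversing the order.
        rc = (rc << 2) | (x & 3);
        x >>= 2;
    }
    rc
}

/// Canonical form: the smaller of a kmer and its reverse complement.
fn canonical(kmer: u64, k: u32) -> u64 {
    kmer.min(revcomp(kmer, k))
}
```

<p>Under these assumptions, <code>ACG</code> and its reverse complement <code>CGT</code> map to the same canonical value, which is what lets a single entry represent both strands.</p>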
<h2 id="parameter-constraints-enforced-at-cli">Parameter constraints (enforced at CLI)</h2>
<p>All constraints below are checked by <code>CommonArgs::validate()</code> at the start of <code>superkmer</code> and <code>index</code>. Invalid values exit immediately with an error.</p>
<table>
<thead>
<tr>
<th>Parameter</th>
<th>Constraint</th>
<th>Reason</th>
</tr>
</thead>
<tbody>
<tr>
<td>k (<code>--kmer-size</code>)</td>
<td>odd</td>
<td>even k allows palindromic k-mers: kmer == revcomp(kmer), breaking the canonical form invariant</td>
</tr>
<tr>
<td>k (<code>--kmer-size</code>)</td>
<td>k ∈ [11, 31]</td>
<td>an odd k &gt; 31 (k ≥ 33) needs more than 64 bits at 2 bits/base and overflows u64; k &lt; 11 gives insufficient specificity</td>
</tr>
<tr>
<td>m (<code>--minimizer-size</code>)</td>
<td>odd</td>
<td>same palindrome argument as k</td>
</tr>
<tr>
<td>m (<code>--minimizer-size</code>)</td>
<td>3 ≤ m ≤ k−1</td>
<td>minimizer must be strictly shorter than the kmer</td>
</tr>
<tr>
<td>z (<code>-z</code>, Findere, <code>index --approx</code> only)</td>
<td>z ≤ k−1</td>
<td>effective indexed kmer size is k−z+1; z ≥ k would make it ≤ 1</td>
</tr>
</tbody>
</table>
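<p>The checks in the table above can be sketched as a single function. This is an illustration only: <code>validate_params</code>, its signature, and its error messages are hypothetical and are not the actual body of <code>CommonArgs::validate()</code>; only the constraints themselves come from the table.</p>

```rust
/// Hypothetical stand-in for the parameter checks listed in the table.
/// `z` is None when the approximate (Findere) mode is not requested.
fn validate_params(k: usize, m: usize, z: Option<usize>) -> Result<(), String> {
    // k must be odd so that no kmer equals its own reverse complement.
    if k % 2 == 0 {
        return Err(format!("k = {} must be odd", k));
    }
    // k must fit in a u64 at 2 bits/base and remain specific enough.
    if !(11..=31).contains(&k) {
        return Err(format!("k = {} must be in [11, 31]", k));
    }
    // Same palindrome argument for the minimizer size.
    if m % 2 == 0 {
        return Err(format!("m = {} must be odd", m));
    }
    // The minimizer must be strictly shorter than the kmer.
    if m < 3 || m > k - 1 {
        return Err(format!("m = {} must satisfy 3 <= m <= k-1 = {}", m, k - 1));
    }
    // Only the upper bound on z is stated in the table, so only that is checked.
    if let Some(z) = z {
        if z > k - 1 {
            return Err(format!("z = {} must satisfy z <= k-1 = {}", z, k - 1));
        }
    }
    Ok(())
}
```

<p>Checking k first keeps the later <code>k - 1</code> expressions safe from underflow, since k is known to be at least 11 by that point.</p>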
<h2 id="genome-label-constraints">Genome label constraints</h2>
<p>Genome labels are arbitrary Unicode strings with the following restrictions:</p>
<table>
<thead>
<tr>
<th>Character</th>
<th>Forbidden</th>
<th>Reason</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>/</code></td>
<td>yes</td>
<td>filesystem path separator</td>
</tr>
<tr>
<td><code>=</code></td>
<td>yes</td>
<td><code>--new-label</code> parser separator</td>
</tr>
<tr>
<td><code>\0</code></td>
<td>yes</td>
<td>null byte</td>
</tr>
<tr>
<td><code>\n</code> <code>\r</code> <code>\t</code></td>
<td>yes</td>
<td>break CSV output</td>
</tr>
<tr>
<td>spaces</td>
<td><strong>allowed</strong></td>
<td>use shell quoting: <code>--new-label 'new label=old label'</code></td>
</tr>
</tbody>
</table>
<p>Empty labels are also rejected. Labels derived automatically from the index directory name (when <code>--label</code> is omitted) are not validated: they come from the filesystem and are assumed to be safe (a directory name cannot contain <code>/</code> or a null byte).</p>
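<p>The label rules above can be sketched as a small validator plus a parser for the <code>NEW=OLD</code> argument of <code>--new-label</code>. Both function names (<code>validate_label</code>, <code>parse_new_label</code>) and the error messages are hypothetical; the sketch only mirrors the restrictions in the table and is not the tool's actual code.</p>

```rust
/// Reject empty labels and labels containing a forbidden character.
/// Spaces are allowed, as stated in the table.
fn validate_label(label: &str) -> Result<(), String> {
    if label.is_empty() {
        return Err("label must not be empty".to_string());
    }
    for c in label.chars() {
        if matches!(c, '/' | '=' | '\0' | '\n' | '\r' | '\t') {
            return Err(format!("label contains forbidden character {:?}", c));
        }
    }
    Ok(())
}

/// Split a `NEW=OLD` argument at the first '=' and validate both halves.
/// Because '=' is forbidden inside labels, the split is unambiguous.
fn parse_new_label(arg: &str) -> Result<(String, String), String> {
    let (new, old) = arg
        .split_once('=')
        .ok_or_else(|| "expected NEW=OLD".to_string())?;
    validate_label(new)?;
    validate_label(old)?;
    Ok((new.to_string(), old.to_string()))
}
```

<p>With this sketch, <code>'new label=old label'</code> parses into the pair (<code>new label</code>, <code>old label</code>), while an argument with a second <code>=</code> is rejected because the remainder after the first separator fails validation.</p>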
<h2 id="priority-operations">Priority operations</h2>
<ul>
<li>Kmer counting (frequencies)</li>