# obikmer

`obikmer` is a Rust tool for manipulation, counting, indexing, and set operations on DNA sequences represented as kmer sets.

## Subcommands

| Subcommand | Purpose |
|-------------|---------|
| `superkmer` | Extract super-kmers from a sequence file and write to stdout |
| `index` | Build a complete genome index (scatter → dereplicate → count → layered MPHF) |
| `merge` | Merge multiple built indexes into one |
| `rebuild` | Filter and compact an existing index into a new single-layer index; supports ingroup/outgroup predicates on genome metadata |
| `query` | Query an index with sequences and annotate matches |
| `dump` | Dump all indexed k-mers as CSV (kmer + per-genome counts or presence); supports the same ingroup/outgroup filtering as `rebuild` |
| `annotate` | Add or update genome metadata from a CSV file, or dump metadata as CSV |
| `distance` | Compute pairwise distance matrix between genomes; optionally build NJ/UPGMA trees |
| `unitig` | Build a global de Bruijn graph across all partitions and enumerate its unitigs as FASTA; supports the same ingroup/outgroup filtering as `rebuild` |
| `estimate` | Estimate approximate-index parameters (z, evidence bits, FP rates) before indexing |
| `reindex` | Convert an index's evidence in-place: exact ↔ approx |
| `utils` | Miscellaneous index utilities: `--new-label NEW=OLD` renames a genome label; `--upgrade-index` adds missing `layer_meta.json` to old indexes |
| `pack` | Pack per-column matrix files into single-file format to reduce query I/O |

## Constraints

- Target scale: individual genome datasets, tens of Gbases
- Maximum efficiency in computation, memory, and disk usage
- k odd, k ∈ [11, 31], fixed at runtime; kmer fits in a u64 (2 bits/base)
- Canonical form: `min(kmer, revcomp(kmer))` reduces strand-symmetric space by half
- Input formats for `index`/`superkmer`: FASTA (`.fa`, `.fasta`), FASTQ (`.fq`, `.fastq`), GenBank flat file (`.gb`, `.gbk`, `.gbff`), all optionally gzip-compressed; directories expanded recursively; streaming stdin via `-`
- Input formats for `query`: FASTA, FASTQ, optionally gzip-compressed; streaming stdin via `-`

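
The 2 bits/base encoding and the canonical form listed above can be illustrated with a minimal sketch. The function names `encode`, `revcomp`, and `canonical`, the base codes (A=0, C=1, G=2, T=3), and the simple loop-based reverse complement are assumptions made for this example; they are not obikmer's actual implementation.

```rust
/// Encode a DNA k-mer into a u64 at 2 bits per base.
/// Assumed codes for this sketch: A=0, C=1, G=2, T=3.
/// Returns None for non-ACGT characters or for lengths above the
/// documented maximum k = 31.
fn encode(seq: &str) -> Option<u64> {
    if seq.len() > 31 {
        return None;
    }
    let mut kmer = 0u64;
    for b in seq.bytes() {
        let code: u64 = match b {
            b'A' | b'a' => 0,
            b'C' | b'c' => 1,
            b'G' | b'g' => 2,
            b'T' | b't' => 3,
            _ => return None,
        };
        // Shift previous bases left and append the new one in the low bits.
        kmer = (kmer << 2) | code;
    }
    Some(kmer)
}

/// Reverse complement of a k-mer stored in the low 2*k bits.
/// With the codes above, the complement of a base `c` is `3 - c`.
fn revcomp(kmer: u64, k: u32) -> u64 {
    let mut remaining = kmer;
    let mut rc = 0u64;
    for _ in 0..k {
        // Take the last base, complement it, and append it to the result.
        rc = (rc << 2) | (3 - (remaining & 3));
        remaining >>= 2;
    }
    rc
}

/// Canonical form: the smaller of the k-mer and its reverse complement.
fn canonical(kmer: u64, k: u32) -> u64 {
    kmer.min(revcomp(kmer, k))
}
```

Under these assumptions, `ACG` and its reverse complement `CGT` share one canonical value, whereas the even-length `ACGT` equals its own reverse complement.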
## Parameter constraints (enforced at CLI)

All constraints below are checked by `CommonArgs::validate()` at the start of `superkmer` and `index`. Invalid values cause an immediate exit with an error.

| Parameter | Constraint | Reason |
|-----------|-----------|--------|
| k (`--kmer-size`) | odd | even k allows palindromic k-mers: kmer == revcomp(kmer), breaking the canonical form invariant |
| k (`--kmer-size`) | k ∈ [11, 31] | k > 31 overflows u64 at 2 bits/base; k < 11 gives insufficient specificity |
| m (`--minimizer-size`) | odd | same palindrome argument as k |
| m (`--minimizer-size`) | 3 ≤ m ≤ k−1 | minimizer must be strictly shorter than the kmer |
| z (`-z`, Findere, `index --approx` only) | z ≤ k−1 | effective indexed kmer size is k−z+1; z ≥ k would make it ≤ 1 |

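
The rules in the table can be expressed as one small check. The sketch below is a hypothetical stand-alone function, `validate_params`, written only to illustrate the constraints; its name, signature, and error messages are assumptions and do not reproduce the real `CommonArgs::validate()`.

```rust
/// Hypothetical check of the k / m / z constraints listed above.
/// `z` is `None` when approximate indexing is not requested.
fn validate_params(k: u32, m: u32, z: Option<u32>) -> Result<(), String> {
    if k % 2 == 0 {
        return Err(format!("k = {} must be odd", k));
    }
    if !(11..=31).contains(&k) {
        return Err(format!("k = {} must be in [11, 31]", k));
    }
    if m % 2 == 0 {
        return Err(format!("m = {} must be odd", m));
    }
    // k >= 11 at this point, so k - 1 cannot underflow.
    if m < 3 || m > k - 1 {
        return Err(format!("m = {} must satisfy 3 <= m <= k-1 = {}", m, k - 1));
    }
    if let Some(z) = z {
        if z > k - 1 {
            return Err(format!("z = {} must satisfy z <= k-1 = {}", z, k - 1));
        }
    }
    Ok(())
}
```

For example, `validate_params(31, 11, None)` passes, while `validate_params(32, 11, None)` fails on the odd-k rule and `validate_params(21, 21, None)` fails because the minimizer is not shorter than the kmer.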
## Genome label constraints

Genome labels are arbitrary Unicode strings with the following restrictions:

| Character | Forbidden | Reason |
|-----------|-----------|--------|
| `/` | yes | filesystem path separator |
| `=` | yes | `--new-label` parser separator |
| `\0` | yes | null byte |
| `\n` `\r` `\t` | yes | break CSV output |
| spaces | **allowed** | use shell quoting: `--new-label 'new label=old label'` |

Empty labels are also rejected. Labels derived automatically from the index directory name (when `--label` is omitted) are not validated, since they come from the filesystem and are already safe.

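
A label check following these rules might look like the sketch below. The function name `validate_label` and its error messages are hypothetical and chosen for illustration; obikmer's own validation code may differ.

```rust
/// Hypothetical label check implementing the restrictions listed above:
/// non-empty, and free of '/', '=', NUL, newline, carriage return, and tab.
fn validate_label(label: &str) -> Result<(), String> {
    if label.is_empty() {
        return Err("label must not be empty".to_string());
    }
    for c in label.chars() {
        let reason = match c {
            '/' => "filesystem path separator",
            '=' => "`--new-label` parser separator",
            '\0' => "null byte",
            '\n' | '\r' | '\t' => "breaks CSV output",
            // Every other character, including spaces, is accepted.
            _ => continue,
        };
        return Err(format!("forbidden character {:?} in label: {}", c, reason));
    }
    Ok(())
}
```

Under these assumptions, `validate_label("new label")` succeeds because spaces are allowed, while `validate_label("a/b")` and `validate_label("")` both return an error.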
## Priority operations

- Kmer counting (frequencies)
- Fast search / query
- Set operations: union, intersection, difference