# obikmer

`obikmer` is a Rust tool for manipulation, counting, indexing, and set operations on DNA sequences represented as kmer sets.

## Subcommands

| Subcommand | Purpose |
|-------------|---------|
| `superkmer` | Extract super-kmers from a sequence file and write to stdout |
| `index` | Build a complete genome index (scatter → dereplicate → count → layered MPHF) |
| `merge` | Merge multiple built indexes into one |
| `rebuild` | Filter and compact an existing index into a new single-layer index |
| `query` | Query an index with sequences and annotate matches |
| `dump` | Dump all indexed kmers as CSV (kmer + per-genome counts or presence) |
| `annotate` | Add or update genome metadata from a CSV file; or dump metadata as CSV |
| `distance` | Compute pairwise distance matrix between genomes; optionally build NJ/UPGMA trees |
| `unitig` | Dump unitigs from a built index to stdout (debug) |
| `estimate` | Estimate approximate-index parameters (z, evidence bits, FP rates) before indexing |
| `reindex` | Convert an index's evidence in-place: exact ↔ approx |

## Constraints

- Target scale: individual genome datasets, tens of Gbases
- Maximum efficiency in computation, memory, and disk usage
- k odd, k ∈ [11, 31], fixed at runtime; kmer fits in a u64 (2 bits/base)
- Canonical form: `min(kmer, revcomp(kmer))` reduces strand-symmetric space by half
- Input formats: FASTA, FASTQ, gzip, streaming stdin

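
The 2-bit packing and canonical-form constraints above can be illustrated with a minimal sketch. It is not taken from the `obikmer` source: the function names (`encode_base`, `encode_kmer`, `revcomp`, `canonical`) and the base coding A=0, C=1, G=2, T=3 are assumptions chosen for illustration, and the real implementation may use a different layout or faster bitwise tricks.

```rust
/// Encode one base into 2 bits. The coding A=0, C=1, G=2, T=3 is an
/// assumption of this sketch, not necessarily the one obikmer uses.
fn encode_base(b: u8) -> Option<u64> {
    match b.to_ascii_uppercase() {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None, // N and other IUPAC codes are rejected here
    }
}

/// Pack a sequence of exactly k bases into a u64, first base in the
/// highest used bits. Enforces the README constraints: k odd, 11 <= k <= 31.
fn encode_kmer(seq: &[u8], k: usize) -> Option<u64> {
    if k % 2 == 0 || !(11..=31).contains(&k) || seq.len() != k {
        return None;
    }
    let mut kmer = 0u64;
    for &b in seq {
        kmer = (kmer << 2) | encode_base(b)?;
    }
    Some(kmer)
}

/// Reverse complement of a 2-bit packed kmer of length k.
fn revcomp(kmer: u64, k: usize) -> u64 {
    let mut x = kmer;
    let mut rc = 0u64;
    for _ in 0..k {
        // With the coding above, the complement of code c is 3 - c.
        rc = (rc << 2) | (3 - (x & 3));
        x >>= 2;
    }
    rc
}

/// Canonical form: the smaller of a kmer and its reverse complement.
fn canonical(kmer: u64, k: usize) -> u64 {
    kmer.min(revcomp(kmer, k))
}

fn main() {
    let k = 11;
    let fwd = encode_kmer(b"ACGTACGTACG", k).expect("valid 11-mer");
    let rev = encode_kmer(b"CGTACGTACGT", k).expect("valid 11-mer");
    // Both strands map to the same canonical code.
    assert_eq!(revcomp(fwd, k), rev);
    assert_eq!(canonical(fwd, k), canonical(rev, k));
    println!("canonical = {:#x}", canonical(fwd, k));
}
```

Because k is odd, no kmer equals its own reverse complement, so every kmer pairs with a distinct partner and the canonical form keeps exactly one code of each pair. With k ≤ 31 the packed value uses at most 62 bits, which is why it fits in a `u64`.
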
## Priority operations

- Kmer counting (frequencies)
- Fast search / query
- Set operations: union, intersection, difference
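
As a rough illustration of what these three operations mean on canonical kmer codes, here is a small in-memory sketch using only the Rust standard library. It is an assumption-laden toy: the helper names `count_kmers` and `kmer_set` are hypothetical, the input codes are made up, and `obikmer` itself works on on-disk indexes (the subcommand table mentions a layered MPHF) rather than hash maps and tree sets.

```rust
use std::collections::{BTreeSet, HashMap};

/// Count how many times each kmer code occurs (hypothetical helper).
fn count_kmers(kmers: &[u64]) -> HashMap<u64, u32> {
    let mut counts = HashMap::new();
    for &km in kmers {
        *counts.entry(km).or_insert(0) += 1;
    }
    counts
}

/// Collapse a list of kmer codes into a sorted set (hypothetical helper).
fn kmer_set(kmers: &[u64]) -> BTreeSet<u64> {
    kmers.iter().copied().collect()
}

fn main() {
    // Made-up canonical kmer codes standing in for two genomes.
    let genome_a = [1u64, 2, 2, 3, 5];
    let genome_b = [2u64, 3, 3, 8];

    // Counting: frequency of each kmer in genome A.
    let counts_a = count_kmers(&genome_a);
    assert_eq!(counts_a[&2], 2);

    // Query: look up a kmer; absent kmers return None.
    assert_eq!(counts_a.get(&8), None);

    // Set operations on the distinct kmers of each genome.
    let a = kmer_set(&genome_a);
    let b = kmer_set(&genome_b);
    let union: Vec<u64> = a.union(&b).copied().collect();
    let inter: Vec<u64> = a.intersection(&b).copied().collect();
    let diff: Vec<u64> = a.difference(&b).copied().collect();

    assert_eq!(union, vec![1, 2, 3, 5, 8]);
    assert_eq!(inter, vec![2, 3]);
    assert_eq!(diff, vec![1, 5]);
    println!("union={union:?} intersection={inter:?} difference={diff:?}");
}
```

Sorted sets make union, intersection, and difference a single linear merge pass, which is one reason sorted or indexed kmer representations are attractive at the tens-of-Gbases scale targeted here.
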