Files
Eric Coissac 970460be42 refactor: rename rebuild subcommand to filter
Rename the `rebuild` CLI subcommand to `filter` to better reflect its primary purpose of row-level selection and k-mer filtering. Update all associated CLI arguments, logging, error messages, and module registrations accordingly. Introduce a dedicated `Rebuild` subcommand for index compaction, fully decoupling it from the filtering logic. Also refine related documentation to align with the new naming and semantics.
2026-06-09 15:26:37 +02:00

64 lines
4.1 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# obikmer
`obikmer` is a Rust tool for manipulation, counting, indexing, and set operations on DNA sequences represented as kmer sets.
## Subcommands
| Subcommand | Purpose |
|-------------|---------|
| `superkmer` | Extract super-kmers from a sequence file and write to stdout |
| `index` | Build a complete genome index (scatter → dereplicate → count → layered MPHF) |
| `merge` | Merge multiple built indexes into one |
| `filter` | Apply row-level selection (σ) to an index: retain only k-mers matching the ingroup/outgroup predicates. Output is a new single-layer index — compaction is a consequence, not the goal. Supports the shared [kmer filtering](implementation/filtering.md) system |
| `query` | Query an index with sequences and annotate matches |
| `dump` | Dump all indexed k-mers as CSV (kmer + per-genome counts or presence); supports the shared [kmer filtering](implementation/filtering.md) system; `--head N` limits output to the first N k-mers |
| `annotate` | Add or update genome metadata from a CSV file; or dump metadata as CSV |
| `distance` | Compute pairwise distance matrix between genomes; optionally build NJ/UPGMA trees; `--presence-threshold N` sets the minimum count to consider a k-mer present when computing Jaccard on count indexes (default 1) |
| `unitig` | Build a global de Bruijn graph across all partitions and enumerate its unitigs as FASTA; supports the shared [kmer filtering](implementation/filtering.md) system |
| `select` | Project and/or aggregate genome columns into a new or in-place index; the column-axis counterpart of `filter` (see [select](implementation/select.md)) |
| `estimate` | Estimate approximate-index parameters (z, evidence bits, FP rates) before indexing |
| `reindex` | Convert an index's evidence in-place: exact ↔ approx |
| `utils` | Miscellaneous index utilities: `--new-label NEW=OLD` renames a genome label; `--upgrade-index` adds missing `layer_meta.json` to old indexes |
| `pack` | Pack per-column matrix files into single-file format to reduce query I/O |
## Constraints
- Target scale: individual genome datasets, tens of Gbases
- Maximum efficiency in computation, memory, and disk usage
- k odd, k ∈ [11, 31], fixed at runtime; kmer fits in a u64 (2 bits/base)
- Canonical form: `min(kmer, revcomp(kmer))` reduces strand-symmetric space by half
- Input formats for `index`/`superkmer`: FASTA (`.fa`, `.fasta`), FASTQ (`.fq`, `.fastq`), GenBank flat file (`.gb`, `.gbk`, `.gbff`), all optionally gzip-compressed; directories expanded recursively; streaming stdin via `-`
- Input formats for `query`: FASTA, FASTQ, optionally gzip-compressed; streaming stdin via `-`
## Parameter constraints (enforced at CLI)
All constraints below are checked by `CommonArgs::validate()` at the start of `superkmer` and `index`. Invalid values exit immediately with an error.
| Parameter | Constraint | Reason |
|-----------|-----------|--------|
| k (`--kmer-size`) | odd | even k allows palindromic k-mers: kmer == revcomp(kmer), breaking the canonical form invariant |
| k (`--kmer-size`) | k ∈ [11, 31] | k > 31 overflows u64 at 2 bits/base; k < 11 gives insufficient specificity |
| m (`--minimizer-size`) | odd | same palindrome argument as k |
| m (`--minimizer-size`) | 3 ≤ m ≤ k1 | minimizer must be strictly shorter than the kmer |
| z (`-z`, Findere, `index --approx` only) | z ≤ k1 | effective indexed kmer size is kz+1; z ≥ k would make it ≤ 0 |
## Genome label constraints
Genome labels are arbitrary Unicode strings with the following restrictions:
| Character | Forbidden | Reason |
|-----------|-----------|--------|
| `/` | yes | filesystem path separator |
| `=` | yes | `--new-label` parser separator |
| `\0` | yes | null byte |
| `\n` `\r` `\t` | yes | break CSV output |
| spaces | **allowed** | use shell quoting: `--new-label 'new label=old label'` |
Empty labels are also rejected. Labels derived automatically from the index directory name (when `--label` is omitted) are not validated since they come from the filesystem and are already safe.
## Priority operations
- Kmer counting (frequencies)
- Fast search / query
- Set operations: union, intersection, difference