# obikmer

`obikmer` is a Rust tool for manipulation, counting, indexing, and set operations on DNA sequences represented as kmer sets.

## Subcommands

| Subcommand  | Purpose |
|-------------|---------|
| `superkmer` | Extract super-kmers from a sequence file and write to stdout |
| `index`     | Build a complete genome index (scatter → dereplicate → count → layered MPHF) |
| `merge`     | Merge multiple built indexes into one |
| `rebuild` / `filter` | Filter and compact an existing index into a new single-layer index; supports the shared [kmer filtering](implementation/filtering.md) system (`filter` is an alias for `rebuild`) |
| `query`     | Query an index with sequences and annotate matches |
| `dump`      | Dump all indexed k-mers as CSV (kmer + per-genome counts or presence); supports the shared [kmer filtering](implementation/filtering.md) system; `--head N` limits output to the first N k-mers |
| `annotate`  | Add or update genome metadata from a CSV file, or dump the metadata as CSV |
| `distance`  | Compute pairwise distance matrix between genomes; optionally build NJ/UPGMA trees; `--presence-threshold N` sets the minimum count to consider a k-mer present when computing Jaccard on count indexes (default 1) |
| `unitig`    | Build a global de Bruijn graph across all partitions and enumerate its unitigs as FASTA; supports the shared [kmer filtering](implementation/filtering.md) system |
| `select`    | Project and/or aggregate genome columns into a new or in-place index; the column-axis counterpart of `filter` (see [select](implementation/select.md)) |
| `estimate`  | Estimate approximate-index parameters (z, evidence bits, FP rates) before indexing |
| `reindex`   | Convert an index's evidence in-place: exact ↔ approx |
| `utils`     | Miscellaneous index utilities: `--new-label NEW=OLD` renames a genome label; `--upgrade-index` adds missing `layer_meta.json` to old indexes |
| `pack`      | Pack per-column matrix files into single-file format to reduce query I/O |

## Constraints

- Target scale: individual genome datasets, tens of Gbases
- Maximum efficiency in computation, memory, and disk usage
- k odd, k ∈ [11, 31], fixed at runtime; kmer fits in a u64 (2 bits/base)
- Canonical form: `min(kmer, revcomp(kmer))` maps a k-mer and its reverse complement to the same value, halving the strand-symmetric k-mer space
- Input formats for `index`/`superkmer`: FASTA (`.fa`, `.fasta`), FASTQ (`.fq`, `.fastq`), GenBank flat file (`.gb`, `.gbk`, `.gbff`), all optionally gzip-compressed; directories expanded recursively; streaming stdin via `-`
- Input formats for `query`: FASTA, FASTQ, optionally gzip-compressed; streaming stdin via `-`
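
The 2-bit encoding and the canonical form listed above can be sketched as follows. This is a minimal illustration, not the code used by `obikmer`: the function names `encode`, `revcomp`, and `canonical`, and the base-to-code mapping (A=0, C=1, G=2, T=3), are assumptions chosen for the example.

```rust
/// Encode a DNA string into 2 bits per base, first base in the highest bits.
/// Assumed mapping (illustrative only): A=0, C=1, G=2, T=3.
/// Returns None for sequences longer than 32 bases or containing other characters.
fn encode(seq: &str) -> Option<u64> {
    if seq.len() > 32 {
        return None;
    }
    let mut kmer = 0u64;
    for b in seq.bytes() {
        let code: u64 = match b {
            b'A' | b'a' => 0,
            b'C' | b'c' => 1,
            b'G' | b'g' => 2,
            b'T' | b't' => 3,
            _ => return None,
        };
        kmer = (kmer << 2) | code;
    }
    Some(kmer)
}

/// Reverse complement of a 2-bit encoded k-mer of length `k`.
/// With the mapping above, the complement of code c is 3 - c.
fn revcomp(kmer: u64, k: u32) -> u64 {
    let mut fwd = kmer;
    let mut rc = 0u64;
    for _ in 0..k {
        rc = (rc << 2) | (3 - (fwd & 3));
        fwd >>= 2;
    }
    rc
}

/// Canonical form: the smaller of the k-mer and its reverse complement.
fn canonical(kmer: u64, k: u32) -> u64 {
    kmer.min(revcomp(kmer, k))
}
```

Under this mapping, `ACG` and its reverse complement `CGT` share the same canonical value. An even-length k-mer such as `ACGT` equals its own reverse complement, whereas an odd-length k-mer cannot, because its middle base would have to be its own complement.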

## Parameter constraints (enforced at CLI)

All constraints below are checked by `CommonArgs::validate()` at the start of `superkmer` and `index`. An invalid value makes the command exit immediately with an error.

| Parameter | Constraint | Reason |
|-----------|-----------|--------|
| k (`--kmer-size`) | odd | even k allows palindromic k-mers: kmer == revcomp(kmer), breaking the canonical form invariant |
| k (`--kmer-size`) | k ∈ [11, 31] | an odd k > 31 needs at least 66 bits at 2 bits/base and overflows u64; k < 11 gives insufficient specificity |
| m (`--minimizer-size`) | odd | same palindrome argument as k |
| m (`--minimizer-size`) | 3 ≤ m ≤ k−1 | minimizer must be strictly shorter than the kmer |
| z (`-z`, Findere, `index --approx` only) | z ≤ k−1 | effective indexed kmer size is k−z+1; z ≥ k would make it ≤ 1 |
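
The checks in the table can be expressed as a short function. The sketch below is a hypothetical stand-in written for illustration; it is not the actual `CommonArgs::validate()`, and the name `validate_params`, its signature, and its error messages are assumptions.

```rust
/// Hypothetical re-statement of the table above (not the real implementation).
/// `z` is `Some(..)` only when an approximate (Findere) index is requested.
fn validate_params(k: u32, m: u32, z: Option<u32>) -> Result<(), String> {
    // k must be odd so that no k-mer equals its own reverse complement.
    if k % 2 == 0 {
        return Err(format!("k = {} must be odd", k));
    }
    // k must lie in [11, 31] so that a k-mer fits in a u64 at 2 bits/base.
    if !(11..=31).contains(&k) {
        return Err(format!("k = {} must be in [11, 31]", k));
    }
    // m must be odd for the same palindrome reason as k.
    if m % 2 == 0 {
        return Err(format!("m = {} must be odd", m));
    }
    // The minimizer must be strictly shorter than the k-mer: 3 <= m <= k-1.
    if m < 3 || m > k - 1 {
        return Err(format!("m = {} must satisfy 3 <= m <= {}", m, k - 1));
    }
    // z is bounded by k-1; the effective indexed size is k-z+1.
    if let Some(z) = z {
        if z > k - 1 {
            return Err(format!("z = {} must satisfy z <= {}", z, k - 1));
        }
    }
    Ok(())
}
```

For instance, `validate_params(31, 11, None)` passes, while `validate_params(32, 11, None)` fails on the parity check and `validate_params(21, 21, None)` fails because the minimizer is not shorter than the k-mer.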

## Genome label constraints

Genome labels are arbitrary Unicode strings with the following restrictions:

| Character | Forbidden | Reason |
|-----------|-----------|--------|
| `/` | yes | filesystem path separator |
| `=` | yes | `--new-label` parser separator |
| `\0` | yes | null byte |
| `\n` `\r` `\t` | yes | break CSV output |
| spaces | **allowed** | use shell quoting: `--new-label 'new label=old label'` |

Empty labels are also rejected. Labels derived automatically from the index directory name (when `--label` is omitted) are not validated since they come from the filesystem and are already safe.
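
The label rules above can be sketched as a small check. This is a hypothetical illustration rather than the validation code shipped with `obikmer`: the names `validate_label` and `parse_new_label` are assumptions, and the second function only shows why `=` has to be reserved when parsing a `NEW=OLD` argument.

```rust
/// Hypothetical label check following the table above (not the real implementation).
fn validate_label(label: &str) -> Result<(), String> {
    if label.is_empty() {
        return Err("genome label must not be empty".to_string());
    }
    // Forbidden characters: path separator, `--new-label` separator,
    // null byte, and characters that break CSV output.
    const FORBIDDEN: [char; 6] = ['/', '=', '\0', '\n', '\r', '\t'];
    if let Some(c) = label.chars().find(|c| FORBIDDEN.contains(c)) {
        return Err(format!("genome label contains forbidden character {:?}", c));
    }
    Ok(())
}

/// Split a `NEW=OLD` argument at the first `=`.
/// Because labels cannot contain `=`, the split is unambiguous.
fn parse_new_label(arg: &str) -> Option<(&str, &str)> {
    arg.split_once('=')
}
```

Spaces pass the check, so `parse_new_label("new label=old label")` yields the pair `("new label", "old label")` once the shell quoting has been removed.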

## Priority operations

- Kmer counting (frequencies)
- Fast search / query
- Set operations: union, intersection, difference