970460be42
Rename the `rebuild` CLI subcommand to `filter` to better reflect its primary purpose of row-level selection and k-mer filtering. Update all associated CLI arguments, logging, error messages, and module registrations accordingly. Introduce a dedicated `Rebuild` subcommand for index compaction, fully decoupling it from the filtering logic. Also refine related documentation to align with the new naming and semantics.
235 lines
7.6 KiB
Markdown
235 lines
7.6 KiB
Markdown
# `select` — column projection and aggregation
|
||
|
||
`select` transforms an index by operating on its **genome columns**: projecting a
|
||
subset of columns, aggregating groups of genomes into synthetic columns, or both.
|
||
It is the column-axis counterpart of `filter` (row-axis operations).
|
||
|
||
Following relational algebra conventions:
|
||
|
||
| Command | Relational operation | Axis |
|
||
|----------|---------------------|----------|
|
||
| `filter` | σ — selection | rows (k-mers) |
|
||
| `select` | π — projection | columns (genomes) |
|
||
|
||
The two commands compose naturally: run `filter` first to restrict the kmer set,
|
||
then `select` to reshape the genome columns.
|
||
|
||
`select` never changes the kmer set. The MPHF and `unitigs.bin` of each layer
|
||
are preserved unchanged; only the data matrices are rewritten.
|
||
|
||
---
|
||
|
||
## Synopsis
|
||
|
||
```sh
|
||
obikmer select <input-index>
|
||
{ --output <dir> | --in-place }
|
||
[--group <name>:<pred> ...]
|
||
[--group-op <name>:<op> ...]
|
||
[--aggregate-by <key> ]
|
||
[--aggregate-op <op> ]
|
||
[--select <col1,col2,...> ]
|
||
[--presence-threshold <N> ]
|
||
```
|
||
|
||
---
|
||
|
||
## Output destination
|
||
|
||
Exactly one of `--output` or `--in-place` must be specified.
|
||
|
||
**`--output <dir>`** — writes a new index to `<dir>`. The source index is
|
||
unchanged. The MPHF and unitig files are copied; only the data matrices are
|
||
rewritten with the new column layout.
|
||
|
||
**`--in-place`** — rewrites the data matrices of the source index directly.
|
||
Removed or replaced columns are lost. The operation writes to temporary files
|
||
first, then renames atomically, so an interrupted run leaves the index intact.
|
||
|
||
---
|
||
|
||
## Defining output columns
|
||
|
||
### Named groups — `--group`
|
||
|
||
```
|
||
--group <name>:<pred>
|
||
```
|
||
|
||
Defines a named group of genomes using the same predicate syntax as `filter`.
|
||
Repeatable; a genome can belong to several groups.
|
||
|
||
```sh
|
||
--group "pub:species=Betula_pubescens"
|
||
--group "nan:species=Betula_nana"
|
||
```
|
||
|
||
### Per-group operator — `--group-op`
|
||
|
||
```
|
||
--group-op <name>:<op>
|
||
```
|
||
|
||
Assigns an aggregation operator to a named group. Optional; if absent, the
|
||
default operator applies (see below).
|
||
|
||
```sh
|
||
--group-op "pub:any"
|
||
--group-op "nan:all"
|
||
```
|
||
|
||
### Shorthand — `--aggregate-by` / `--aggregate-op`
|
||
|
||
`--aggregate-by <key>` automatically creates one group per unique value of the
|
||
metadata key `<key>`. Equivalent to one `--group <val>:<key>=<val>` per distinct
|
||
value. `--aggregate-op <op>` sets the operator for all auto-generated groups.
|
||
|
||
`--aggregate-by` and `--group` are mutually exclusive.
|
||
|
||
### Column selection and ordering — `--select`
|
||
|
||
```
|
||
--select col1,col2,...
|
||
```
|
||
|
||
Lists the output columns in order. Each element is either a group name (defined
|
||
by `--group` or generated by `--aggregate-by`) or a genome label from the source
|
||
index (pass-through, no aggregation).
|
||
|
||
**Default when `--select` is absent:**
|
||
all defined groups in declaration order (for `--group`), or all generated groups
|
||
in metadata-value order (for `--aggregate-by`). Individual genomes not in any
|
||
group are excluded unless named explicitly.
|
||
|
||
**When neither `--group` nor `--aggregate-by` is specified:**
|
||
`--select` can still reference genome labels for pure column projection (no
|
||
aggregation). If `--select` is also absent, all genomes are output unchanged
|
||
(identity transform — useful combined with row filtering via a prior `filter`
|
||
run).
|
||
|
||
---
|
||
|
||
## Aggregation operators
|
||
|
||
| Operator | Input | Output | Semantics |
|
||
|----------|-------------|----------|-----------|
|
||
| `any` | pres / count | presence | 1 if ≥ 1 genome in group carries the k-mer |
|
||
| `all` | pres / count | presence | 1 if every genome in group carries the k-mer |
|
||
| `none` | pres / count | presence | 1 if no genome in group carries the k-mer |
|
||
| `sum` | count | count | sum of counts across the group |
|
||
| `min` | count | count | minimum count |
|
||
| `max` | count | count | maximum count |
|
||
|
||
**Default operator:**
|
||
- Presence index: `any`
|
||
- Count index: `sum`
|
||
|
||
Logical operators (`any`/`all`/`none`) on a count index use
|
||
`--presence-threshold N` (default 0): a genome "carries" the k-mer if its count
|
||
is > N.
|
||
|
||
**Output index type:**
|
||
- If the source is a presence index, the output is always a presence index.
|
||
- If the source is a count index and every output column uses a logical operator
|
||
or is a pass-through from a presence source, the output is a presence index.
|
||
- Otherwise (at least one arithmetic operator on a count source), the output is
|
||
a count index.
|
||
|
||
---
|
||
|
||
## Behaviour for edge cases
|
||
|
||
| Situation | Behaviour |
|
||
|-----------|-----------|
|
||
| Genome missing the metadata key in `--aggregate-by` | genome ignored (no `NA` group) |
|
||
| Genome in multiple groups | contributes independently to each |
|
||
| `--group-op` references undefined group | error |
|
||
| `--select` element is neither group name nor genome label | error |
|
||
| `--output` and `--in-place` both specified | error |
|
||
| Neither `--output` nor `--in-place` | error |
|
||
| Group with zero matching genomes | column is all-zeros (or all-ones for `none`) |
|
||
|
||
---
|
||
|
||
## Examples
|
||
|
||
### Aggregate by metadata group, default operators
|
||
|
||
```sh
|
||
obikmer select myindex --output out --aggregate-by group
|
||
# one column per unique value of "group"; presence→any, count→sum
|
||
```
|
||
|
||
### Named groups with different operators
|
||
|
||
```sh
|
||
obikmer select myindex --output out \
|
||
--group "pub:species=Betula_pubescens" \
|
||
--group "nan:species=Betula_nana" \
|
||
--group-op "pub:any" \
|
||
--group-op "nan:all" \
|
||
--select "pub,nan"
|
||
```
|
||
|
||
### Mix aggregated group and individual genome
|
||
|
||
```sh
|
||
obikmer select myindex --output out \
|
||
--group "A:group=A" \
|
||
--select "A,Betula_nana--IGA-24-39"
|
||
```
|
||
|
||
### Pure column projection (no aggregation)
|
||
|
||
```sh
|
||
obikmer select myindex --output out \
|
||
--select "Betula_nana--TROM-V-149986,Betula_nana--AG-P04-25-01"
|
||
```
|
||
|
||
### In-place: keep only group A
|
||
|
||
```sh
|
||
obikmer select myindex --in-place --group "A:group=A" --select "A"
|
||
```
|
||
|
||
### Compose with filter
|
||
|
||
```sh
|
||
# Step 1: keep only B. nana-specific k-mers
|
||
obikmer filter myindex --output filtered \
|
||
--ingroup "species=Betula_nana" --outgroup "*"
|
||
|
||
# Step 2: aggregate genome columns by collection site
|
||
obikmer select filtered --output final --aggregate-by site
|
||
```
|
||
|
||
---
|
||
|
||
## Implementation notes
|
||
|
||
`select` does not rebuild the MPHF. The 256 partitions are processed in parallel
|
||
(rayon), each writing its output independently; results require no synchronisation
|
||
because every partition owns a distinct set of files.
|
||
|
||
For each layer in each partition:
|
||
|
||
1. The slot count `n` is read by opening the source data matrix.
|
||
2. A new data matrix is built with M columns (M = number of output columns).
|
||
3. For each slot `s` in `0..n`:
|
||
- `old_row = matrix.fill_row(s)` — reads the original `N`-column row without allocating.
|
||
- For each output column `j`:
|
||
- `new_row[j] = aggregate(op, old_row[group_indices])`.
|
||
- Pass-through columns are represented as single-element groups with the
|
||
default operator (`any` for presence, `sum` for count) — same code path.
|
||
- The new row is written slot by slot into each column builder.
|
||
4. All plain files in the source layer directory (`mphf.bin`, `unitigs.bin`,
|
||
evidence files, `layer_meta.json`) are copied verbatim; only the `presence/`
|
||
or `counts/` subdirectory is rewritten.
|
||
5. `index.meta` is rewritten with the new genome list and updated `with_counts`.
|
||
|
||
**`--in-place` write strategy:** new data is written to a temporary sibling
|
||
directory (`presence_new/` or `counts_new/`); on success the old directory is
|
||
removed and the temporary one is renamed into place. An interrupted run leaves
|
||
at most one stale `*_new/` directory; the original data is intact until the
|
||
rename step.
|