e66adef23d
Introduces the `select` CLI command to project and aggregate genome-level k-mer data by column. Adds `filter` as an alias for `rebuild`. The implementation uses parallel partition processing, supports metadata-driven grouping with configurable aggregation operators, and performs atomic in-place rewrites or filtered exports. Updates documentation and navigation accordingly.
235 lines
7.6 KiB
Markdown
235 lines
7.6 KiB
Markdown
# `select` — column projection and aggregation
|
||
|
||
`select` transforms an index by operating on its **genome columns**: projecting a
|
||
subset of columns, aggregating groups of genomes into synthetic columns, or both.
|
||
It is the column-axis counterpart of `filter` (row-axis operations).
|
||
|
||
Following relational algebra conventions:
|
||
|
||
| Command | Relational operation | Axis |
|
||
|----------|---------------------|----------|
|
||
| `filter` | σ — selection | rows (k-mers) |
|
||
| `select` | π — projection | columns (genomes) |
|
||
|
||
The two commands compose naturally: run `filter` first to restrict the kmer set,
|
||
then `select` to reshape the genome columns.
|
||
|
||
`select` never changes the kmer set. The MPHF and `unitigs.bin` of each layer
|
||
are preserved unchanged; only the data matrices are rewritten.
|
||
|
||
---
|
||
|
||
## Synopsis
|
||
|
||
```sh
|
||
obikmer select <input-index>
|
||
{ --output <dir> | --in-place }
|
||
[--group <name>:<pred> ...]
|
||
[--group-op <name>:<op> ...]
|
||
[--aggregate-by <key> ]
|
||
[--aggregate-op <op> ]
|
||
[--select <col1,col2,...> ]
|
||
[--presence-threshold <N> ]
|
||
```
|
||
|
||
---
|
||
|
||
## Output destination
|
||
|
||
Exactly one of `--output` or `--in-place` must be specified.
|
||
|
||
**`--output <dir>`** — writes a new index to `<dir>`. The source index is
|
||
unchanged. The MPHF and unitig files are copied; only the data matrices are
|
||
rewritten with the new column layout.
|
||
|
||
**`--in-place`** — rewrites the data matrices of the source index directly.
|
||
Removed or replaced columns are lost. The operation writes to temporary files
|
||
first, then renames atomically, so an interrupted run leaves the index intact.
|
||
|
||
---
|
||
|
||
## Defining output columns
|
||
|
||
### Named groups — `--group`
|
||
|
||
```
|
||
--group <name>:<pred>
|
||
```
|
||
|
||
Defines a named group of genomes using the same predicate syntax as
|
||
`filter`/`rebuild`. Repeatable; a genome can belong to several groups.
|
||
|
||
```sh
|
||
--group "pub:species=Betula_pubescens"
|
||
--group "nan:species=Betula_nana"
|
||
```
|
||
|
||
### Per-group operator — `--group-op`
|
||
|
||
```
|
||
--group-op <name>:<op>
|
||
```
|
||
|
||
Assigns an aggregation operator to a named group. Optional; if absent, the
|
||
default operator applies (see below).
|
||
|
||
```sh
|
||
--group-op "pub:any"
|
||
--group-op "nan:all"
|
||
```
|
||
|
||
### Shorthand — `--aggregate-by` / `--aggregate-op`
|
||
|
||
`--aggregate-by <key>` automatically creates one group per unique value of the
|
||
metadata key `<key>`. Equivalent to one `--group <val>:<key>=<val>` per distinct
|
||
value. `--aggregate-op <op>` sets the operator for all auto-generated groups.
|
||
|
||
`--aggregate-by` and `--group` are mutually exclusive.
|
||
|
||
### Column selection and ordering — `--select`
|
||
|
||
```
|
||
--select col1,col2,...
|
||
```
|
||
|
||
Lists the output columns in order. Each element is either a group name (defined
|
||
by `--group` or generated by `--aggregate-by`) or a genome label from the source
|
||
index (pass-through, no aggregation).
|
||
|
||
**Default when `--select` is absent:**
|
||
all defined groups in declaration order (for `--group`), or all generated groups
|
||
in metadata-value order (for `--aggregate-by`). Individual genomes not in any
|
||
group are excluded unless named explicitly.
|
||
|
||
**When neither `--group` nor `--aggregate-by` is specified:**
|
||
`--select` can still reference genome labels for pure column projection (no
|
||
aggregation). If `--select` is also absent, all genomes are output unchanged
|
||
(identity transform — useful combined with row filtering via a prior `filter`
|
||
run).
|
||
|
||
---
|
||
|
||
## Aggregation operators
|
||
|
||
| Operator | Input | Output | Semantics |
|
||
|----------|-------------|----------|-----------|
|
||
| `any` | pres / count | presence | 1 if ≥ 1 genome in group carries the k-mer |
|
||
| `all` | pres / count | presence | 1 if every genome in group carries the k-mer |
|
||
| `none` | pres / count | presence | 1 if no genome in group carries the k-mer |
|
||
| `sum` | count | count | sum of counts across the group |
|
||
| `min` | count | count | minimum count |
|
||
| `max` | count | count | maximum count |
|
||
|
||
**Default operator:**
|
||
- Presence index: `any`
|
||
- Count index: `sum`
|
||
|
||
Logical operators (`any`/`all`/`none`) on a count index use
|
||
`--presence-threshold N` (default 0): a genome "carries" the k-mer if its count
|
||
is > N.
|
||
|
||
**Output index type:**
|
||
- If the source is a presence index, the output is always a presence index.
|
||
- If the source is a count index and every output column uses a logical operator
|
||
or is a pass-through from a presence source, the output is a presence index.
|
||
- Otherwise (at least one arithmetic operator on a count source), the output is
|
||
a count index.
|
||
|
||
---
|
||
|
||
## Behaviour for edge cases
|
||
|
||
| Situation | Behaviour |
|
||
|-----------|-----------|
|
||
| Genome missing the metadata key in `--aggregate-by` | genome ignored (no `NA` group) |
|
||
| Genome in multiple groups | contributes independently to each |
|
||
| `--group-op` references undefined group | error |
|
||
| `--select` element is neither group name nor genome label | error |
|
||
| `--output` and `--in-place` both specified | error |
|
||
| Neither `--output` nor `--in-place` | error |
|
||
| Group with zero matching genomes | column is all-zeros (or all-ones for `none`) |
|
||
|
||
---
|
||
|
||
## Examples
|
||
|
||
### Aggregate by metadata group, default operators
|
||
|
||
```sh
|
||
obikmer select myindex --output out --aggregate-by group
|
||
# one column per unique value of "group"; presence→any, count→sum
|
||
```
|
||
|
||
### Named groups with different operators
|
||
|
||
```sh
|
||
obikmer select myindex --output out \
|
||
--group "pub:species=Betula_pubescens" \
|
||
--group "nan:species=Betula_nana" \
|
||
--group-op "pub:any" \
|
||
--group-op "nan:all" \
|
||
--select "pub,nan"
|
||
```
|
||
|
||
### Mix aggregated group and individual genome
|
||
|
||
```sh
|
||
obikmer select myindex --output out \
|
||
--group "A:group=A" \
|
||
--select "A,Betula_nana--IGA-24-39"
|
||
```
|
||
|
||
### Pure column projection (no aggregation)
|
||
|
||
```sh
|
||
obikmer select myindex --output out \
|
||
--select "Betula_nana--TROM-V-149986,Betula_nana--AG-P04-25-01"
|
||
```
|
||
|
||
### In-place: keep only group A
|
||
|
||
```sh
|
||
obikmer select myindex --in-place --group "A:group=A" --select "A"
|
||
```
|
||
|
||
### Compose with filter
|
||
|
||
```sh
|
||
# Step 1: keep only B. nana-specific k-mers
|
||
obikmer filter myindex --output filtered \
|
||
--ingroup "species=Betula_nana" --outgroup "*"
|
||
|
||
# Step 2: aggregate genome columns by collection site
|
||
obikmer select filtered --output final --aggregate-by site
|
||
```
|
||
|
||
---
|
||
|
||
## Implementation notes
|
||
|
||
`select` does not rebuild the MPHF. The 256 partitions are processed in parallel
|
||
(rayon), each writing its output independently; results require no synchronisation
|
||
because every partition owns a distinct set of files.
|
||
|
||
For each layer in each partition:
|
||
|
||
1. The slot count `n` is read by opening the source data matrix.
|
||
2. A new data matrix is built with M columns (M = number of output columns).
|
||
3. For each slot `s` in `0..n`:
|
||
- `old_row = matrix.fill_row(s)` — reads the original `N`-column row without allocating.
|
||
- For each output column `j`:
|
||
- `new_row[j] = aggregate(op, old_row[group_indices])`.
|
||
- Pass-through columns are represented as single-element groups with the
|
||
default operator (`any` for presence, `sum` for count) — same code path.
|
||
- The new row is written slot by slot into each column builder.
|
||
4. All plain files in the source layer directory (`mphf.bin`, `unitigs.bin`,
|
||
evidence files, `layer_meta.json`) are copied verbatim; only the `presence/`
|
||
or `counts/` subdirectory is rewritten.
|
||
5. `index.meta` is rewritten with the new genome list and updated `with_counts`.
|
||
|
||
**`--in-place` write strategy:** new data is written to a temporary sibling
|
||
directory (`presence_new/` or `counts_new/`); on success the old directory is
|
||
removed and the temporary one is renamed into place. An interrupted run leaves
|
||
at most one stale `*_new/` directory; the original data is intact until the
|
||
rename step.
|