Files
obikmer/docmd/implementation/select.md
T

235 lines
7.6 KiB
Markdown
Raw Normal View History

# `select` — column projection and aggregation
`select` transforms an index by operating on its **genome columns**: projecting a
subset of columns, aggregating groups of genomes into synthetic columns, or both.
It is the column-axis counterpart of `filter` (row-axis operations).
Following relational algebra conventions:
| Command | Relational operation | Axis |
|----------|---------------------|----------|
| `filter` | σ — selection | rows (k-mers) |
| `select` | π — projection | columns (genomes) |
The two commands compose naturally: run `filter` first to restrict the kmer set,
then `select` to reshape the genome columns.
`select` never changes the kmer set. The MPHF and `unitigs.bin` of each layer
are preserved unchanged; only the data matrices are rewritten.
---
## Synopsis
```sh
obikmer select <input-index>
{ --output <dir> | --in-place }
[--group <name>:<pred> ...]
[--group-op <name>:<op> ...]
[--aggregate-by <key> ]
[--aggregate-op <op> ]
[--select <col1,col2,...> ]
[--presence-threshold <N> ]
```
---
## Output destination
Exactly one of `--output` or `--in-place` must be specified.
**`--output <dir>`** — writes a new index to `<dir>`. The source index is
unchanged. The MPHF and unitig files are copied; only the data matrices are
rewritten with the new column layout.
**`--in-place`** — rewrites the data matrices of the source index directly.
Removed or replaced columns are lost. The operation writes to temporary files
first, then renames atomically, so an interrupted run leaves the index intact.
---
## Defining output columns
### Named groups — `--group`
```
--group <name>:<pred>
```
Defines a named group of genomes using the same predicate syntax as `filter`.
Repeatable; a genome can belong to several groups.
```sh
--group "pub:species=Betula_pubescens"
--group "nan:species=Betula_nana"
```
### Per-group operator — `--group-op`
```
--group-op <name>:<op>
```
Assigns an aggregation operator to a named group. Optional; if absent, the
default operator applies (see below).
```sh
--group-op "pub:any"
--group-op "nan:all"
```
### Shorthand — `--aggregate-by` / `--aggregate-op`
`--aggregate-by <key>` automatically creates one group per unique value of the
metadata key `<key>`. Equivalent to one `--group <val>:<key>=<val>` per distinct
value. `--aggregate-op <op>` sets the operator for all auto-generated groups.
`--aggregate-by` and `--group` are mutually exclusive.
### Column selection and ordering — `--select`
```
--select col1,col2,...
```
Lists the output columns in order. Each element is either a group name (defined
by `--group` or generated by `--aggregate-by`) or a genome label from the source
index (pass-through, no aggregation).
**Default when `--select` is absent:**
all defined groups in declaration order (for `--group`), or all generated groups
in metadata-value order (for `--aggregate-by`). Individual genomes not in any
group are excluded unless named explicitly.
**When neither `--group` nor `--aggregate-by` is specified:**
`--select` can still reference genome labels for pure column projection (no
aggregation). If `--select` is also absent, all genomes are output unchanged
(identity transform — useful combined with row filtering via a prior `filter`
run).
---
## Aggregation operators
| Operator | Input | Output | Semantics |
|----------|-------------|----------|-----------|
| `any` | pres / count | presence | 1 if ≥ 1 genome in group carries the k-mer |
| `all` | pres / count | presence | 1 if every genome in group carries the k-mer |
| `none` | pres / count | presence | 1 if no genome in group carries the k-mer |
| `sum` | count | count | sum of counts across the group |
| `min` | count | count | minimum count |
| `max` | count | count | maximum count |
**Default operator:**
- Presence index: `any`
- Count index: `sum`
Logical operators (`any`/`all`/`none`) on a count index use
`--presence-threshold N` (default 0): a genome "carries" the k-mer if its count
is > N.
**Output index type:**
- If the source is a presence index, the output is always a presence index.
- If the source is a count index and every output column uses a logical operator
or is a pass-through from a presence source, the output is a presence index.
- Otherwise (at least one arithmetic operator on a count source), the output is
a count index.
---
## Behaviour for edge cases
| Situation | Behaviour |
|-----------|-----------|
| Genome missing the metadata key in `--aggregate-by` | genome ignored (no `NA` group) |
| Genome in multiple groups | contributes independently to each |
| `--group-op` references undefined group | error |
| `--select` element is neither group name nor genome label | error |
| `--output` and `--in-place` both specified | error |
| Neither `--output` nor `--in-place` | error |
| Group with zero matching genomes | column is all-zeros (or all-ones for `none`) |
---
## Examples
### Aggregate by metadata group, default operators
```sh
obikmer select myindex --output out --aggregate-by group
# one column per unique value of "group"; presence→any, count→sum
```
### Named groups with different operators
```sh
obikmer select myindex --output out \
--group "pub:species=Betula_pubescens" \
--group "nan:species=Betula_nana" \
--group-op "pub:any" \
--group-op "nan:all" \
--select "pub,nan"
```
### Mix aggregated group and individual genome
```sh
obikmer select myindex --output out \
--group "A:group=A" \
--select "A,Betula_nana--IGA-24-39"
```
### Pure column projection (no aggregation)
```sh
obikmer select myindex --output out \
--select "Betula_nana--TROM-V-149986,Betula_nana--AG-P04-25-01"
```
### In-place: keep only group A
```sh
obikmer select myindex --in-place --group "A:group=A" --select "A"
```
### Compose with filter
```sh
# Step 1: keep only B. nana-specific k-mers
obikmer filter myindex --output filtered \
--ingroup "species=Betula_nana" --outgroup "*"
# Step 2: aggregate genome columns by collection site
obikmer select filtered --output final --aggregate-by site
```
---
## Implementation notes
`select` does not rebuild the MPHF. The 256 partitions are processed in parallel
(rayon), each writing its output independently; results require no synchronisation
because every partition owns a distinct set of files.
For each layer in each partition:
1. The slot count `n` is read by opening the source data matrix.
2. A new data matrix is built with M columns (M = number of output columns).
3. For each slot `s` in `0..n`:
- `old_row = matrix.fill_row(s)` — reads the original `N`-column row without allocating.
- For each output column `j`:
- `new_row[j] = aggregate(op, old_row[group_indices])`.
- Pass-through columns are represented as single-element groups with the
default operator (`any` for presence, `sum` for count) — same code path.
- The new row is written slot by slot into each column builder.
4. All plain files in the source layer directory (`mphf.bin`, `unitigs.bin`,
evidence files, `layer_meta.json`) are copied verbatim; only the `presence/`
or `counts/` subdirectory is rewritten.
5. `index.meta` is rewritten with the new genome list and updated `with_counts`.
**`--in-place` write strategy:** new data is written to a temporary sibling
directory (`presence_new/` or `counts_new/`); on success the old directory is
removed and the temporary one is renamed into place. An interrupted run leaves
at most one stale `*_new/` directory; the original data is intact until the
rename step.