2026-06-09 15:05:08 +02:00
|
|
|
|
# `select` — column projection and aggregation
|
|
|
|
|
|
|
|
|
|
|
|
`select` transforms an index by operating on its **genome columns**: projecting a
|
|
|
|
|
|
subset of columns, aggregating groups of genomes into synthetic columns, or both.
|
|
|
|
|
|
It is the column-axis counterpart of `filter` (row-axis operations).
|
|
|
|
|
|
|
|
|
|
|
|
Following relational algebra conventions:
|
|
|
|
|
|
|
|
|
|
|
|
| Command | Relational operation | Axis |
|
|
|
|
|
|
|----------|---------------------|----------|
|
|
|
|
|
|
| `filter` | σ — selection | rows (k-mers) |
|
|
|
|
|
|
| `select` | π — projection | columns (genomes) |
|
|
|
|
|
|
|
|
|
|
|
|
The two commands compose naturally: run `filter` first to restrict the kmer set,
|
|
|
|
|
|
then `select` to reshape the genome columns.
|
|
|
|
|
|
|
|
|
|
|
|
`select` never changes the kmer set. The MPHF and `unitigs.bin` of each layer
|
|
|
|
|
|
are preserved unchanged; only the data matrices are rewritten.
|
|
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
|
|
## Synopsis
|
|
|
|
|
|
|
|
|
|
|
|
```sh
|
|
|
|
|
|
obikmer select <input-index>
|
|
|
|
|
|
{ --output <dir> | --in-place }
|
|
|
|
|
|
[--group <name>:<pred> ...]
|
|
|
|
|
|
[--group-op <name>:<op> ...]
|
|
|
|
|
|
[--aggregate-by <key> ]
|
|
|
|
|
|
[--aggregate-op <op> ]
|
|
|
|
|
|
[--select <col1,col2,...> ]
|
|
|
|
|
|
[--presence-threshold <N> ]
|
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
|
|
## Output destination
|
|
|
|
|
|
|
|
|
|
|
|
Exactly one of `--output` or `--in-place` must be specified.
|
|
|
|
|
|
|
|
|
|
|
|
**`--output <dir>`** — writes a new index to `<dir>`. The source index is
|
|
|
|
|
|
unchanged. The MPHF and unitig files are copied; only the data matrices are
|
|
|
|
|
|
rewritten with the new column layout.
|
|
|
|
|
|
|
|
|
|
|
|
**`--in-place`** — rewrites the data matrices of the source index directly.
|
|
|
|
|
|
Removed or replaced columns are lost. The operation writes to temporary files
|
|
|
|
|
|
first, then renames atomically, so an interrupted run leaves the index intact.
|
|
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
|
|
## Defining output columns
|
|
|
|
|
|
|
|
|
|
|
|
### Named groups — `--group`
|
|
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
|
--group <name>:<pred>
|
|
|
|
|
|
```
|
|
|
|
|
|
|
2026-06-09 15:22:57 +02:00
|
|
|
|
Defines a named group of genomes using the same predicate syntax as `filter`.
|
|
|
|
|
|
Repeatable; a genome can belong to several groups.
|
2026-06-09 15:05:08 +02:00
|
|
|
|
|
|
|
|
|
|
```sh
|
|
|
|
|
|
--group "pub:species=Betula_pubescens"
|
|
|
|
|
|
--group "nan:species=Betula_nana"
|
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
### Per-group operator — `--group-op`
|
|
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
|
--group-op <name>:<op>
|
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
Assigns an aggregation operator to a named group. Optional; if absent, the
|
|
|
|
|
|
default operator applies (see below).
|
|
|
|
|
|
|
|
|
|
|
|
```sh
|
|
|
|
|
|
--group-op "pub:any"
|
|
|
|
|
|
--group-op "nan:all"
|
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
### Shorthand — `--aggregate-by` / `--aggregate-op`
|
|
|
|
|
|
|
|
|
|
|
|
`--aggregate-by <key>` automatically creates one group per unique value of the
|
|
|
|
|
|
metadata key `<key>`. Equivalent to one `--group <val>:<key>=<val>` per distinct
|
|
|
|
|
|
value. `--aggregate-op <op>` sets the operator for all auto-generated groups.
|
|
|
|
|
|
|
|
|
|
|
|
`--aggregate-by` and `--group` are mutually exclusive.
|
|
|
|
|
|
|
|
|
|
|
|
### Column selection and ordering — `--select`
|
|
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
|
--select col1,col2,...
|
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
Lists the output columns in order. Each element is either a group name (defined
|
|
|
|
|
|
by `--group` or generated by `--aggregate-by`) or a genome label from the source
|
|
|
|
|
|
index (pass-through, no aggregation).
|
|
|
|
|
|
|
|
|
|
|
|
**Default when `--select` is absent:**
|
|
|
|
|
|
all defined groups in declaration order (for `--group`), or all generated groups
|
|
|
|
|
|
in metadata-value order (for `--aggregate-by`). Individual genomes not in any
|
|
|
|
|
|
group are excluded unless named explicitly.
|
|
|
|
|
|
|
|
|
|
|
|
**When neither `--group` nor `--aggregate-by` is specified:**
|
|
|
|
|
|
`--select` can still reference genome labels for pure column projection (no
|
|
|
|
|
|
aggregation). If `--select` is also absent, all genomes are output unchanged
|
|
|
|
|
|
(identity transform — useful combined with row filtering via a prior `filter`
|
|
|
|
|
|
run).
|
|
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
|
|
## Aggregation operators
|
|
|
|
|
|
|
|
|
|
|
|
| Operator | Input | Output | Semantics |
|
|
|
|
|
|
|----------|-------------|----------|-----------|
|
|
|
|
|
|
| `any` | pres / count | presence | 1 if ≥ 1 genome in group carries the k-mer |
|
|
|
|
|
|
| `all` | pres / count | presence | 1 if every genome in group carries the k-mer |
|
|
|
|
|
|
| `none` | pres / count | presence | 1 if no genome in group carries the k-mer |
|
|
|
|
|
|
| `sum` | count | count | sum of counts across the group |
|
|
|
|
|
|
| `min` | count | count | minimum count |
|
|
|
|
|
|
| `max` | count | count | maximum count |
|
|
|
|
|
|
|
|
|
|
|
|
**Default operator:**
|
|
|
|
|
|
- Presence index: `any`
|
|
|
|
|
|
- Count index: `sum`
|
|
|
|
|
|
|
|
|
|
|
|
Logical operators (`any`/`all`/`none`) on a count index use
|
|
|
|
|
|
`--presence-threshold N` (default 0): a genome "carries" the k-mer if its count
|
|
|
|
|
|
is > N.
|
|
|
|
|
|
|
|
|
|
|
|
**Output index type:**
|
|
|
|
|
|
- If the source is a presence index, the output is always a presence index.
|
|
|
|
|
|
- If the source is a count index and every output column uses a logical operator
|
|
|
|
|
|
or is a pass-through from a presence source, the output is a presence index.
|
|
|
|
|
|
- Otherwise (at least one arithmetic operator on a count source), the output is
|
|
|
|
|
|
a count index.
|
|
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
|
|
## Behaviour for edge cases
|
|
|
|
|
|
|
|
|
|
|
|
| Situation | Behaviour |
|
|
|
|
|
|
|-----------|-----------|
|
|
|
|
|
|
| Genome missing the metadata key in `--aggregate-by` | genome ignored (no `NA` group) |
|
|
|
|
|
|
| Genome in multiple groups | contributes independently to each |
|
|
|
|
|
|
| `--group-op` references undefined group | error |
|
|
|
|
|
|
| `--select` element is neither group name nor genome label | error |
|
|
|
|
|
|
| `--output` and `--in-place` both specified | error |
|
|
|
|
|
|
| Neither `--output` nor `--in-place` | error |
|
|
|
|
|
|
| Group with zero matching genomes | column is all-zeros (or all-ones for `none`) |
|
|
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
|
|
## Examples
|
|
|
|
|
|
|
|
|
|
|
|
### Aggregate by metadata group, default operators
|
|
|
|
|
|
|
|
|
|
|
|
```sh
|
|
|
|
|
|
obikmer select myindex --output out --aggregate-by group
|
|
|
|
|
|
# one column per unique value of "group"; presence→any, count→sum
|
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
### Named groups with different operators
|
|
|
|
|
|
|
|
|
|
|
|
```sh
|
|
|
|
|
|
obikmer select myindex --output out \
|
|
|
|
|
|
--group "pub:species=Betula_pubescens" \
|
|
|
|
|
|
--group "nan:species=Betula_nana" \
|
|
|
|
|
|
--group-op "pub:any" \
|
|
|
|
|
|
--group-op "nan:all" \
|
|
|
|
|
|
--select "pub,nan"
|
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
### Mix aggregated group and individual genome
|
|
|
|
|
|
|
|
|
|
|
|
```sh
|
|
|
|
|
|
obikmer select myindex --output out \
|
|
|
|
|
|
--group "A:group=A" \
|
|
|
|
|
|
--select "A,Betula_nana--IGA-24-39"
|
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
### Pure column projection (no aggregation)
|
|
|
|
|
|
|
|
|
|
|
|
```sh
|
|
|
|
|
|
obikmer select myindex --output out \
|
|
|
|
|
|
--select "Betula_nana--TROM-V-149986,Betula_nana--AG-P04-25-01"
|
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
### In-place: keep only group A
|
|
|
|
|
|
|
|
|
|
|
|
```sh
|
|
|
|
|
|
obikmer select myindex --in-place --group "A:group=A" --select "A"
|
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
### Compose with filter
|
|
|
|
|
|
|
|
|
|
|
|
```sh
|
|
|
|
|
|
# Step 1: keep only B. nana-specific k-mers
|
|
|
|
|
|
obikmer filter myindex --output filtered \
|
|
|
|
|
|
--ingroup "species=Betula_nana" --outgroup "*"
|
|
|
|
|
|
|
|
|
|
|
|
# Step 2: aggregate genome columns by collection site
|
|
|
|
|
|
obikmer select filtered --output final --aggregate-by site
|
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
|
|
## Implementation notes
|
|
|
|
|
|
|
|
|
|
|
|
`select` does not rebuild the MPHF. The 256 partitions are processed in parallel
|
|
|
|
|
|
(rayon), each writing its output independently; results require no synchronisation
|
|
|
|
|
|
because every partition owns a distinct set of files.
|
|
|
|
|
|
|
|
|
|
|
|
For each layer in each partition:
|
|
|
|
|
|
|
|
|
|
|
|
1. The slot count `n` is read by opening the source data matrix.
|
|
|
|
|
|
2. A new data matrix is built with M columns (M = number of output columns).
|
|
|
|
|
|
3. For each slot `s` in `0..n`:
|
|
|
|
|
|
- `old_row = matrix.fill_row(s)` — reads the original `N`-column row without allocating.
|
|
|
|
|
|
- For each output column `j`:
|
|
|
|
|
|
- `new_row[j] = aggregate(op, old_row[group_indices])`.
|
|
|
|
|
|
- Pass-through columns are represented as single-element groups with the
|
|
|
|
|
|
default operator (`any` for presence, `sum` for count) — same code path.
|
|
|
|
|
|
- The new row is written slot by slot into each column builder.
|
|
|
|
|
|
4. All plain files in the source layer directory (`mphf.bin`, `unitigs.bin`,
|
|
|
|
|
|
evidence files, `layer_meta.json`) are copied verbatim; only the `presence/`
|
|
|
|
|
|
or `counts/` subdirectory is rewritten.
|
|
|
|
|
|
5. `index.meta` is rewritten with the new genome list and updated `with_counts`.
|
|
|
|
|
|
|
|
|
|
|
|
**`--in-place` write strategy:** new data is written to a temporary sibling
|
|
|
|
|
|
directory (`presence_new/` or `counts_new/`); on success the old directory is
|
|
|
|
|
|
removed and the temporary one is renamed into place. An interrupted run leaves
|
|
|
|
|
|
at most one stale `*_new/` directory; the original data is intact until the
|
|
|
|
|
|
rename step.
|