Files
Eric Coissac 970460be42 refactor: rename rebuild subcommand to filter
Rename the `rebuild` CLI subcommand to `filter` to better reflect its primary purpose of row-level selection and k-mer filtering. Update all associated CLI arguments, logging, error messages, and module registrations accordingly. Introduce a dedicated `Rebuild` subcommand for index compaction, fully decoupling it from the filtering logic. Also refine related documentation to align with the new naming and semantics.
2026-06-09 15:26:37 +02:00

7.6 KiB
Raw Permalink Blame History

select — column projection and aggregation

select transforms an index by operating on its genome columns: projecting a subset of columns, aggregating groups of genomes into synthetic columns, or both. It is the column-axis counterpart of filter (row-axis operations).

Following relational algebra conventions:

Command Relational operation Axis
filter σ — selection rows (k-mers)
select π — projection columns (genomes)

The two commands compose naturally: run filter first to restrict the kmer set, then select to reshape the genome columns.

select never changes the kmer set. The MPHF and unitigs.bin of each layer are preserved unchanged; only the data matrices are rewritten.


Synopsis

obikmer select <input-index>
        { --output <dir> | --in-place }
        [--group    <name>:<pred>  ...]
        [--group-op <name>:<op>    ...]
        [--aggregate-by <key>          ]
        [--aggregate-op <op>           ]
        [--select   <col1,col2,...>    ]
        [--presence-threshold <N>      ]

Output destination

Exactly one of --output or --in-place must be specified.

--output <dir> — writes a new index to <dir>. The source index is unchanged. The MPHF and unitig files are copied; only the data matrices are rewritten with the new column layout.

--in-place — rewrites the data matrices of the source index directly. Removed or replaced columns are lost. The operation writes to temporary files first, then renames atomically, so an interrupted run leaves the index intact.


Defining output columns

Named groups — --group

--group <name>:<pred>

Defines a named group of genomes using the same predicate syntax as filter. Repeatable; a genome can belong to several groups.

--group "pub:species=Betula_pubescens"
--group "nan:species=Betula_nana"

Per-group operator — --group-op

--group-op <name>:<op>

Assigns an aggregation operator to a named group. Optional; if absent, the default operator applies (see below).

--group-op "pub:any"
--group-op "nan:all"

Shorthand — --aggregate-by / --aggregate-op

--aggregate-by <key> automatically creates one group per unique value of the metadata key <key>. Equivalent to one --group <val>:<key>=<val> per distinct value. --aggregate-op <op> sets the operator for all auto-generated groups.

--aggregate-by and --group are mutually exclusive.

Column selection and ordering — --select

--select col1,col2,...

Lists the output columns in order. Each element is either a group name (defined by --group or generated by --aggregate-by) or a genome label from the source index (pass-through, no aggregation).

Default when --select is absent: all defined groups in declaration order (for --group), or all generated groups in metadata-value order (for --aggregate-by). Individual genomes not in any group are excluded unless named explicitly.

When neither --group nor --aggregate-by is specified: --select can still reference genome labels for pure column projection (no aggregation). If --select is also absent, all genomes are output unchanged (identity transform — useful combined with row filtering via a prior filter run).


Aggregation operators

Operator Input Output Semantics
any pres / count presence 1 if ≥ 1 genome in group carries the k-mer
all pres / count presence 1 if every genome in group carries the k-mer
none pres / count presence 1 if no genome in group carries the k-mer
sum count count sum of counts across the group
min count count minimum count
max count count maximum count

Default operator:

  • Presence index: any
  • Count index: sum

Logical operators (any/all/none) on a count index use --presence-threshold N (default 0): a genome "carries" the k-mer if its count is > N.

Output index type:

  • If the source is a presence index, the output is always a presence index.
  • If the source is a count index and every output column uses a logical operator or is a pass-through from a presence source, the output is a presence index.
  • Otherwise (at least one arithmetic operator on a count source), the output is a count index.

Behaviour for edge cases

Situation Behaviour
Genome missing the metadata key in --aggregate-by genome ignored (no NA group)
Genome in multiple groups contributes independently to each
--group-op references undefined group error
--select element is neither group name nor genome label error
--output and --in-place both specified error
Neither --output nor --in-place error
Group with zero matching genomes column is all-zeros (or all-ones for none)

Examples

Aggregate by metadata group, default operators

obikmer select myindex --output out --aggregate-by group
# one column per unique value of "group"; presence→any, count→sum

Named groups with different operators

obikmer select myindex --output out \
  --group    "pub:species=Betula_pubescens" \
  --group    "nan:species=Betula_nana" \
  --group-op "pub:any" \
  --group-op "nan:all" \
  --select   "pub,nan"

Mix aggregated group and individual genome

obikmer select myindex --output out \
  --group  "A:group=A" \
  --select "A,Betula_nana--IGA-24-39"

Pure column projection (no aggregation)

obikmer select myindex --output out \
  --select "Betula_nana--TROM-V-149986,Betula_nana--AG-P04-25-01"

In-place: keep only group A

obikmer select myindex --in-place --group "A:group=A" --select "A"

Compose with filter

# Step 1: keep only B. nana-specific k-mers
obikmer filter myindex --output filtered \
  --ingroup "species=Betula_nana" --outgroup "*"

# Step 2: aggregate genome columns by collection site
obikmer select filtered --output final --aggregate-by site

Implementation notes

select does not rebuild the MPHF. The 256 partitions are processed in parallel (rayon), each writing its output independently; results require no synchronisation because every partition owns a distinct set of files.

For each layer in each partition:

  1. The slot count n is read by opening the source data matrix.
  2. A new data matrix is built with M columns (M = number of output columns).
  3. For each slot s in 0..n:
    • old_row = matrix.fill_row(s) — reads the original N-column row without allocating.
    • For each output column j:
      • new_row[j] = aggregate(op, old_row[group_indices]).
      • Pass-through columns are represented as single-element groups with the default operator (any for presence, sum for count) — same code path.
    • The new row is written slot by slot into each column builder.
  4. All plain files in the source layer directory (mphf.bin, unitigs.bin, evidence files, layer_meta.json) are copied verbatim; only the presence/ or counts/ subdirectory is rewritten.
  5. index.meta is rewritten with the new genome list and updated with_counts.

--in-place write strategy: new data is written to a temporary sibling directory (presence_new/ or counts_new/); on success the old directory is removed and the temporary one is renamed into place. An interrupted run leaves at most one stale *_new/ directory; the original data is intact until the rename step.