feat: add select command for genome column projection and aggregation

Introduces the `select` CLI command to project and aggregate genome-level k-mer data by column. Adds `filter` as an alias for `rebuild`. The implementation uses parallel partition processing, supports metadata-driven grouping with configurable aggregation operators, and performs atomic in-place rewrites or filtered exports. Updates documentation and navigation accordingly.
2026-06-09 15:05:08 +02:00
parent b0dab452f6
commit e66adef23d
11 changed files with 958 additions and 1 deletions
@@ -0,0 +1,234 @@
+# `select` — column projection and aggregation
+
+`select` transforms an index by operating on its **genome columns**: projecting a
+subset of columns, aggregating groups of genomes into synthetic columns, or both.
+It is the column-axis counterpart of `filter` (row-axis operations).
+
+Following relational algebra conventions:
+
+| Command  | Relational operation | Axis     |
+|----------|---------------------|----------|
+| `filter` | σ — selection       | rows (k-mers) |
+| `select` | π — projection      | columns (genomes) |
+
+The two commands compose naturally: run `filter` first to restrict the kmer set,
+then `select` to reshape the genome columns.
+
+`select` never changes the kmer set. The MPHF and `unitigs.bin` of each layer
+are preserved unchanged; only the data matrices are rewritten.
+
+---
+
+## Synopsis
+
+```sh
+obikmer select <input-index>
+        { --output <dir> | --in-place }
+        [--group    <name>:<pred>  ...]
+        [--group-op <name>:<op>    ...]
+        [--aggregate-by <key>          ]
+        [--aggregate-op <op>           ]
+        [--select   <col1,col2,...>    ]
+        [--presence-threshold <N>      ]
+```
+
+---
+
+## Output destination
+
+Exactly one of `--output` or `--in-place` must be specified.
+
+**`--output <dir>`** — writes a new index to `<dir>`. The source index is
+unchanged. The MPHF and unitig files are copied; only the data matrices are
+rewritten with the new column layout.
+
+**`--in-place`** — rewrites the data matrices of the source index directly.
+Removed or replaced columns are lost. The operation writes to temporary files
+first, then renames atomically, so an interrupted run leaves the index intact.
+
+---
+
+## Defining output columns
+
+### Named groups — `--group`
+
+```
+--group <name>:<pred>
+```
+
+Defines a named group of genomes using the same predicate syntax as
+`filter`/`rebuild`. Repeatable; a genome can belong to several groups.
+
+```sh
+--group "pub:species=Betula_pubescens"
+--group "nan:species=Betula_nana"
+```
+
+### Per-group operator — `--group-op`
+
+```
+--group-op <name>:<op>
+```
+
+Assigns an aggregation operator to a named group. Optional; if absent, the
+default operator applies (see below).
+
+```sh
+--group-op "pub:any"
+--group-op "nan:all"
+```
+
+### Shorthand — `--aggregate-by` / `--aggregate-op`
+
+`--aggregate-by <key>` automatically creates one group per unique value of the
+metadata key `<key>`. Equivalent to one `--group <val>:<key>=<val>` per distinct
+value. `--aggregate-op <op>` sets the operator for all auto-generated groups.
+
+`--aggregate-by` and `--group` are mutually exclusive.
+
+### Column selection and ordering — `--select`
+
+```
+--select col1,col2,...
+```
+
+Lists the output columns in order. Each element is either a group name (defined
+by `--group` or generated by `--aggregate-by`) or a genome label from the source
+index (pass-through, no aggregation).
+
+**Default when `--select` is absent:**
+all defined groups in declaration order (for `--group`), or all generated groups
+in metadata-value order (for `--aggregate-by`). Individual genomes not in any
+group are excluded unless named explicitly.
+
+**When neither `--group` nor `--aggregate-by` is specified:**
+`--select` can still reference genome labels for pure column projection (no
+aggregation). If `--select` is also absent, all genomes are output unchanged
+(identity transform — useful combined with row filtering via a prior `filter`
+run).
+
+---
+
+## Aggregation operators
+
+| Operator | Input       | Output   | Semantics |
+|----------|-------------|----------|-----------|
+| `any`    | pres / count | presence | 1 if ≥ 1 genome in group carries the k-mer |
+| `all`    | pres / count | presence | 1 if every genome in group carries the k-mer |
+| `none`   | pres / count | presence | 1 if no genome in group carries the k-mer |
+| `sum`    | count        | count    | sum of counts across the group |
+| `min`    | count        | count    | minimum count |
+| `max`    | count        | count    | maximum count |
+
+**Default operator:**
+- Presence index: `any`
+- Count index: `sum`
+
+Logical operators (`any`/`all`/`none`) on a count index use
+`--presence-threshold N` (default 0): a genome "carries" the k-mer if its count
+is > N.
+
+**Output index type:**
+- If the source is a presence index, the output is always a presence index.
+- If the source is a count index and every output column uses a logical operator
+  or is a pass-through from a presence source, the output is a presence index.
+- Otherwise (at least one arithmetic operator on a count source), the output is
+  a count index.
+
+---
+
+## Behaviour for edge cases
+
+| Situation | Behaviour |
+|-----------|-----------|
+| Genome missing the metadata key in `--aggregate-by` | genome ignored (no `NA` group) |
+| Genome in multiple groups | contributes independently to each |
+| `--group-op` references undefined group | error |
+| `--select` element is neither group name nor genome label | error |
+| `--output` and `--in-place` both specified | error |
+| Neither `--output` nor `--in-place` | error |
+| Group with zero matching genomes | column is all-zeros (or all-ones for `none`) |
+
+---
+
+## Examples
+
+### Aggregate by metadata group, default operators
+
+```sh
+obikmer select myindex --output out --aggregate-by group
+# one column per unique value of "group"; presence→any, count→sum
+```
+
+### Named groups with different operators
+
+```sh
+obikmer select myindex --output out \
+  --group    "pub:species=Betula_pubescens" \
+  --group    "nan:species=Betula_nana" \
+  --group-op "pub:any" \
+  --group-op "nan:all" \
+  --select   "pub,nan"
+```
+
+### Mix aggregated group and individual genome
+
+```sh
+obikmer select myindex --output out \
+  --group  "A:group=A" \
+  --select "A,Betula_nana--IGA-24-39"
+```
+
+### Pure column projection (no aggregation)
+
+```sh
+obikmer select myindex --output out \
+  --select "Betula_nana--TROM-V-149986,Betula_nana--AG-P04-25-01"
+```
+
+### In-place: keep only group A
+
+```sh
+obikmer select myindex --in-place --group "A:group=A" --select "A"
+```
+
+### Compose with filter
+
+```sh
+# Step 1: keep only B. nana-specific k-mers
+obikmer filter myindex --output filtered \
+  --ingroup "species=Betula_nana" --outgroup "*"
+
+# Step 2: aggregate genome columns by collection site
+obikmer select filtered --output final --aggregate-by site
+```
+
+---
+
+## Implementation notes
+
+`select` does not rebuild the MPHF. The 256 partitions are processed in parallel
+(rayon), each writing its output independently; results require no synchronisation
+because every partition owns a distinct set of files.
+
+For each layer in each partition:
+
+1. The slot count `n` is read by opening the source data matrix.
+2. A new data matrix is built with M columns (M = number of output columns).
+3. For each slot `s` in `0..n`:
+   - `old_row = matrix.fill_row(s)` — reads the original `N`-column row without allocating.
+   - For each output column `j`:
+     - `new_row[j] = aggregate(op, old_row[group_indices])`.
+     - Pass-through columns are represented as single-element groups with the
+       default operator (`any` for presence, `sum` for count) — same code path.
+   - The new row is written slot by slot into each column builder.
+4. All plain files in the source layer directory (`mphf.bin`, `unitigs.bin`,
+   evidence files, `layer_meta.json`) are copied verbatim; only the `presence/`
+   or `counts/` subdirectory is rewritten.
+5. `index.meta` is rewritten with the new genome list and updated `with_counts`.
+
+**`--in-place` write strategy:** new data is written to a temporary sibling
+directory (`presence_new/` or `counts_new/`); on success the old directory is
+removed and the temporary one is renamed into place. An interrupted run leaves
+at most one stale `*_new/` directory; the original data is intact until the
+rename step.
@@ -9,12 +9,13 @@
 | `superkmer` | Extract super-kmers from a sequence file and write to stdout |
 | `index`     | Build a complete genome index (scatter → dereplicate → count → layered MPHF) |
 | `merge`     | Merge multiple built indexes into one |
-| `rebuild`   | Filter and compact an existing index into a new single-layer index; supports the shared [kmer filtering](implementation/filtering.md) system |
+| `rebuild` / `filter` | Filter and compact an existing index into a new single-layer index; supports the shared [kmer filtering](implementation/filtering.md) system (`filter` is an alias for `rebuild`) |
 | `query`     | Query an index with sequences and annotate matches |
 | `dump`      | Dump all indexed k-mers as CSV (kmer + per-genome counts or presence); supports the shared [kmer filtering](implementation/filtering.md) system; `--head N` limits output to the first N k-mers |
 | `annotate`  | Add or update genome metadata from a CSV file; or dump metadata as CSV |
 | `distance`  | Compute pairwise distance matrix between genomes; optionally build NJ/UPGMA trees; `--presence-threshold N` sets the minimum count to consider a k-mer present when computing Jaccard on count indexes (default 1) |
 | `unitig`    | Build a global de Bruijn graph across all partitions and enumerate its unitigs as FASTA; supports the shared [kmer filtering](implementation/filtering.md) system |
+| `select`    | Project and/or aggregate genome columns into a new or in-place index; the column-axis counterpart of `filter` (see [select](implementation/select.md)) |
 | `estimate`  | Estimate approximate-index parameters (z, evidence bits, FP rates) before indexing |
 | `reindex`   | Convert an index's evidence in-place: exact ↔ approx |
 | `utils`     | Miscellaneous index utilities: `--new-label NEW=OLD` renames a genome label; `--upgrade-index` adds missing `layer_meta.json` to old indexes |
@@ -50,6 +50,7 @@ nav:
      - PersistentBitVec: implementation/persistent_bit_vec.md
      - Merge command: implementation/merge.md
      - Kmer filtering: implementation/filtering.md
+      - Select command: implementation/select.md
  - Architecture:
      - Sequences: architecture/sequences/invariant.md
      - Kmer index: architecture/index_architecture.md
@@ -7,6 +7,7 @@ mod index;
 mod merge;
 mod rebuild;
 mod reindex;
+mod select;
 mod stats;

 pub use error::{OKIError, OKIResult};
@@ -0,0 +1,166 @@
+use std::fs;
+use std::io;
+use std::path::Path;
+
+use obikpartitionner::{KmerPartition, OutputCol, PARTITIONS_SUBDIR};
+use obisys::{Stage, progress_bar};
+use rayon::prelude::*;
+use tracing::info;
+
+use crate::error::{OKIError, OKIResult};
+use crate::index::KmerIndex;
+use crate::meta::{GenomeInfo, IndexMeta};
+use crate::state::{IndexState, SENTINEL_INDEXED};
+
+impl KmerIndex {
+    /// Create a new index at `output` by projecting/aggregating the genome columns
+    /// of `src` according to `specs`.
+    ///
+    /// `output_presence` — if true, output uses bit matrices (0/1), regardless of
+    /// whether the source stores counts. The caller is responsible for ensuring all
+    /// specs use logical operators when `output_presence=true` on a count source.
+    pub fn select<P: AsRef<Path>>(
+        output: P,
+        src: &KmerIndex,
+        specs: &[OutputCol],
+        threshold: u32,
+        output_presence: bool,
+        force: bool,
+    ) -> OKIResult<Self> {
+        let output = output.as_ref();
+
+        if src.state() != IndexState::Indexed {
+            return Err(OKIError::NotIndexed(src.root_path.clone()));
+        }
+
+        if output.exists() {
+            if force {
+                fs::remove_dir_all(output)?;
+            } else {
+                return Err(OKIError::Io(io::Error::new(
+                    io::ErrorKind::AlreadyExists,
+                    format!("{}: output directory already exists", output.display()),
+                )));
+            }
+        }
+
+        fs::create_dir_all(output)?;
+        let mut meta = IndexMeta::new(src.meta.config.clone());
+        meta.config.with_counts = !output_presence;
+        meta.genomes = specs.iter()
+            .map(|s| GenomeInfo::new(s.label.clone()))
+            .collect();
+        meta.write(output)?;
+
+        let n_src_genomes  = src.meta.genomes.len();
+        let n_partitions   = src.partition.n_partitions();
+
+        fs::create_dir_all(output.join(PARTITIONS_SUBDIR))?;
+        let dst_partition = KmerPartition::open_with_config(
+            output,
+            meta.config.kmer_size,
+            meta.config.minimizer_size,
+            meta.config.n_bits,
+        )?;
+
+        info!(
+            "select: {} partition(s), {} source genome(s) → {} output column(s)",
+            n_partitions, n_src_genomes, specs.len(),
+        );
+
+        let t   = Stage::start("select");
+        let pb  = progress_bar("select", n_partitions as u64, "partitions");
+        let src_partition = &src.partition;
+
+        let errors: Vec<obiskio::SKError> = (0..n_partitions)
+            .into_par_iter()
+            .filter_map(|i| {
+                let result = dst_partition.select_partition(
+                    src_partition, i, specs,
+                    n_src_genomes, threshold, output_presence,
+                    false,
+                );
+                pb.inc(1);
+                result.err()
+            })
+            .collect();
+
+        pb.finish_and_clear();
+
+        if let Some(e) = errors.into_iter().next() {
+            return Err(OKIError::Partition(e));
+        }
+
+        let _ = t.stop();
+
+        fs::File::create(output.join(SENTINEL_INDEXED))?;
+
+        let idx = KmerIndex::open(output)?;
+        idx.pack_matrices()?;
+        Ok(idx)
+    }
+
+    /// Rewrite the genome columns of this index in-place according to `specs`.
+    ///
+    /// The MPHF and unitig files are unchanged; only data matrices are rewritten.
+    pub fn select_in_place(
+        &mut self,
+        specs: &[OutputCol],
+        threshold: u32,
+        output_presence: bool,
+    ) -> OKIResult<()> {
+        if self.state() != IndexState::Indexed {
+            return Err(OKIError::NotIndexed(self.root_path.clone()));
+        }
+
+        let n_src_genomes = self.meta.genomes.len();
+        let n_partitions  = self.partition.n_partitions();
+
+        // Open a second handle to the same path so we can borrow src and dst simultaneously.
+        let src_partition = KmerPartition::open_with_config(
+            &self.root_path,
+            self.meta.config.kmer_size,
+            self.meta.config.minimizer_size,
+            self.meta.config.n_bits,
+        )?;
+
+        info!(
+            "select (in-place): {} partition(s), {} source genome(s) → {} output column(s)",
+            n_partitions, n_src_genomes, specs.len(),
+        );
+
+        let t  = Stage::start("select");
+        let pb = progress_bar("select", n_partitions as u64, "partitions");
+
+        let errors: Vec<obiskio::SKError> = (0..n_partitions)
+            .into_par_iter()
+            .filter_map(|i| {
+                let result = self.partition.select_partition(
+                    &src_partition, i, specs,
+                    n_src_genomes, threshold, output_presence,
+                    true,
+                );
+                pb.inc(1);
+                result.err()
+            })
+            .collect();
+
+        pb.finish_and_clear();
+
+        if let Some(e) = errors.into_iter().next() {
+            return Err(OKIError::Partition(e));
+        }
+
+        let _ = t.stop();
+
+        // Update index.meta with new genome list and with_counts flag.
+        self.meta.config.with_counts = !output_presence;
+        self.meta.genomes = specs.iter()
+            .map(|s| GenomeInfo::new(s.label.clone()))
+            .collect();
+        self.meta.write(&self.root_path)?;
+
+        self.pack_matrices()?;
+        Ok(())
+    }
+}
@@ -1,6 +1,7 @@
 pub mod annotate;
 pub mod pack;
 pub(crate) mod predicate;
+pub mod select;
 pub mod utils;
 pub mod distance;
 pub mod dump;
@@ -230,6 +230,16 @@ impl FilterArgs {
    }
 }

+/// Returns indices of genomes matching `pred_str` (single predicate).
+pub fn matching_genome_indices(pred_str: &str, genomes: &[GenomeInfo]) -> Result<Vec<usize>, String> {
+    let pred = MetaPred::parse(pred_str)?;
+    Ok(genomes.iter().enumerate()
+        .filter_map(|(i, g)| {
+            if pred.eval(&g.meta) == Some(true) { Some(i) } else { std::option::Option::None }
+        })
+        .collect())
+}
+
 pub struct GroupFilterParams {
    pub threshold:          u32,
    pub min_count:          Option<usize>,
@@ -0,0 +1,248 @@
+use std::collections::{BTreeMap, HashMap};
+use std::path::PathBuf;
+
+use clap::{Args, ValueEnum};
+use obikindex::{GenomeInfo, KmerIndex};
+use obikpartitionner::{AggOp, OutputCol};
+use tracing::info;
+
+use super::predicate::matching_genome_indices;
+
+// ── CLI types ─────────────────────────────────────────────────────────────────
+
+#[derive(Debug, Clone, Copy, PartialEq, Eq, ValueEnum)]
+pub enum AggOpArg {
+    Any,
+    All,
+    None,
+    Sum,
+    Min,
+    Max,
+}
+
+impl From<AggOpArg> for AggOp {
+    fn from(a: AggOpArg) -> Self {
+        match a {
+            AggOpArg::Any  => AggOp::Any,
+            AggOpArg::All  => AggOp::All,
+            AggOpArg::None => AggOp::None,
+            AggOpArg::Sum  => AggOp::Sum,
+            AggOpArg::Min  => AggOp::Min,
+            AggOpArg::Max  => AggOp::Max,
+        }
+    }
+}
+
+#[derive(Args)]
+pub struct SelectArgs {
+    /// Source index directory
+    pub source: PathBuf,
+
+    /// Output index directory (mutually exclusive with --in-place)
+    #[arg(long, conflicts_with = "in_place")]
+    pub output: Option<PathBuf>,
+
+    /// Rewrite the source index in-place (mutually exclusive with --output)
+    #[arg(long)]
+    pub in_place: bool,
+
+    /// Define a named group: `<name>:<pred>` (repeatable; mutually exclusive with --aggregate-by)
+    #[arg(long, value_name = "NAME:PRED", conflicts_with = "aggregate_by")]
+    pub group: Vec<String>,
+
+    /// Per-group aggregation operator: `<name>:<op>` (repeatable)
+    #[arg(long, value_name = "NAME:OP")]
+    pub group_op: Vec<String>,
+
+    /// Auto-create one group per unique value of metadata key <KEY>
+    #[arg(long, value_name = "KEY", conflicts_with = "group")]
+    pub aggregate_by: Option<String>,
+
+    /// Aggregation operator for all auto-generated groups
+    #[arg(long, value_name = "OP")]
+    pub aggregate_op: Option<AggOpArg>,
+
+    /// Output columns in order: group names or genome labels, comma-separated
+    #[arg(long, value_name = "COL,...", value_delimiter = ',')]
+    pub select: Option<Vec<String>>,
+
+    /// Minimum count to consider a genome as "carrying" the k-mer (logical ops only)
+    #[arg(long, default_value = "0")]
+    pub presence_threshold: u32,
+
+    /// Overwrite existing output directory
+    #[arg(short, long)]
+    pub force: bool,
+}
+
+// ── Helpers ───────────────────────────────────────────────────────────────────
+
+fn parse_name_value(s: &str, flag: &str) -> (String, String) {
+    match s.find(':') {
+        Some(pos) => (s[..pos].trim().to_string(), s[pos + 1..].to_string()),
+        std::option::Option::None => {
+            eprintln!("error in {flag}: expected <name>:<value>, got: {s}");
+            std::process::exit(1);
+        }
+    }
+}
+
+fn parse_agg_op(s: &str) -> AggOp {
+    match s.to_lowercase().as_str() {
+        "any"  => AggOp::Any,
+        "all"  => AggOp::All,
+        "none" => AggOp::None,
+        "sum"  => AggOp::Sum,
+        "min"  => AggOp::Min,
+        "max"  => AggOp::Max,
+        other  => {
+            eprintln!("unknown aggregation operator: {other}; valid: any, all, none, sum, min, max");
+            std::process::exit(1);
+        }
+    }
+}
+
+fn default_op(src_is_count: bool) -> AggOp {
+    if src_is_count { AggOp::Sum } else { AggOp::Any }
+}
+
+// ── build_specs ───────────────────────────────────────────────────────────────
+
+/// Resolve CLI arguments into an ordered list of `OutputCol`.
+///
+/// Returns `(specs, output_presence)`.
+fn build_specs(
+    args: &SelectArgs,
+    genomes: &[GenomeInfo],
+    src_is_count: bool,
+) -> (Vec<OutputCol>, bool) {
+    // ── 1. Build group_indices: name → Vec<usize> ────────────────────────────
+    // Also keep insertion order for the default `--select *` case.
+    let mut group_order: Vec<String> = Vec::new();
+    let mut group_indices: HashMap<String, Vec<usize>> = HashMap::new();
+
+    if let Some(ref key) = args.aggregate_by {
+        // One group per unique value of `key`, in sorted order.
+        let mut value_to_indices: BTreeMap<String, Vec<usize>> = BTreeMap::new();
+        for (i, g) in genomes.iter().enumerate() {
+            if let Some(v) = g.meta.get(key) {
+                value_to_indices.entry(v.clone()).or_default().push(i);
+            }
+        }
+        for (v, idxs) in value_to_indices {
+            group_order.push(v.clone());
+            group_indices.insert(v, idxs);
+        }
+    } else {
+        for raw in &args.group {
+            let (name, pred) = parse_name_value(raw, "--group");
+            let idxs = matching_genome_indices(&pred, genomes).unwrap_or_else(|e| {
+                eprintln!("error in --group {name}: {e}");
+                std::process::exit(1);
+            });
+            if !group_indices.contains_key(&name) {
+                group_order.push(name.clone());
+            }
+            group_indices.insert(name, idxs);
+        }
+    }
+
+    // ── 2. Build per-group ops ────────────────────────────────────────────────
+    let global_op = args.aggregate_op.map(AggOp::from);
+    let mut group_op: HashMap<String, AggOp> = HashMap::new();
+    for raw in &args.group_op {
+        let (name, op_str) = parse_name_value(raw, "--group-op");
+        if !group_indices.contains_key(&name) {
+            eprintln!("--group-op references undefined group: {name}");
+            std::process::exit(1);
+        }
+        group_op.insert(name, parse_agg_op(&op_str));
+    }
+
+    // ── 3. Genome label → index map for pass-through columns ─────────────────
+    let label_to_idx: HashMap<&str, usize> = genomes.iter().enumerate()
+        .map(|(i, g)| (g.label.as_str(), i))
+        .collect();
+
+    // ── 4. Determine output column names ─────────────────────────────────────
+    let col_names: Vec<String> = if let Some(ref sel) = args.select {
+        sel.clone()
+    } else if !group_order.is_empty() {
+        group_order.clone()
+    } else {
+        // Identity: all genomes in original order
+        genomes.iter().map(|g| g.label.clone()).collect()
+    };
+
+    // ── 5. Build OutputCol list ───────────────────────────────────────────────
+    let mut specs: Vec<OutputCol> = Vec::with_capacity(col_names.len());
+
+    for name in &col_names {
+        if let Some(idxs) = group_indices.get(name) {
+            let op = group_op.get(name)
+                .copied()
+                .or(global_op)
+                .unwrap_or_else(|| default_op(src_is_count));
+            specs.push(OutputCol { label: name.clone(), indices: idxs.clone(), op });
+        } else if let Some(&idx) = label_to_idx.get(name.as_str()) {
+            // Pass-through: single-element group with default op.
+            let op = default_op(src_is_count);
+            specs.push(OutputCol { label: name.clone(), indices: vec![idx], op });
+        } else {
+            eprintln!("--select: unknown column '{name}' (not a group name or genome label)");
+            std::process::exit(1);
+        }
+    }
+
+    if specs.is_empty() {
+        eprintln!("select: no output columns defined");
+        std::process::exit(1);
+    }
+
+    // ── 6. Determine output type ──────────────────────────────────────────────
+    let output_presence = !src_is_count
+        || specs.iter().all(|s| s.op.is_logical());
+
+    (specs, output_presence)
+}
+
+// ── run ───────────────────────────────────────────────────────────────────────
+
+pub fn run(args: SelectArgs) {
+    if !args.in_place && args.output.is_none() {
+        eprintln!("error: one of --output or --in-place must be specified");
+        std::process::exit(1);
+    }
+
+    let mut src = KmerIndex::open(&args.source).unwrap_or_else(|e| {
+        eprintln!("error opening source index: {e}");
+        std::process::exit(1);
+    });
+
+    let src_is_count = src.meta().config.with_counts;
+    let (specs, output_presence) = build_specs(&args, &src.meta().genomes.clone(), src_is_count);
+
+    info!(
+        "select: {} genome(s) → {} output column(s), output={}",
+        src.meta().genomes.len(),
+        specs.len(),
+        if output_presence { "presence" } else { "count" },
+    );
+
+    if args.in_place {
+        src.select_in_place(&specs, args.presence_threshold, output_presence)
+            .unwrap_or_else(|e| {
+                eprintln!("select error: {e}");
+                std::process::exit(1);
+            });
+        info!("selected in-place → {}", args.source.display());
+    } else {
+        let output = args.output.unwrap();
+        KmerIndex::select(&output, &src, &specs, args.presence_threshold, output_presence, args.force)
+            .unwrap_or_else(|e| {
+                eprintln!("select error: {e}");
+                std::process::exit(1);
+            });
+        info!("selected index → {}", output.display());
+    }
+}
@@ -22,6 +22,10 @@ enum Commands {
    Merge(cmd::merge::MergeArgs),
    /// Filter and compact an existing index into a new single-layer index
    Rebuild(cmd::rebuild::RebuildArgs),
+    /// Alias for rebuild
+    Filter(cmd::rebuild::RebuildArgs),
+    /// Project and/or aggregate genome columns into a new or in-place index
+    Select(cmd::select::SelectArgs),
    /// Query an index with sequences and annotate matches
    Query(cmd::query::QueryArgs),
    /// Dump all indexed kmers as CSV (kmer + per-genome counts or presence)
@@ -66,6 +70,8 @@ fn main() {
        Commands::Merge(args)     => cmd::merge::run(args),
        Commands::Dump(args)      => cmd::dump::run(args),
        Commands::Rebuild(args)   => cmd::rebuild::run(args),
+        Commands::Filter(args)    => cmd::rebuild::run(args),
+        Commands::Select(args)    => cmd::select::run(args),
        Commands::Query(args)     => cmd::query::run(args),
        Commands::Annotate(args)  => cmd::annotate::run(args),
        Commands::Distance(args)  => cmd::distance::run(args),
@@ -7,7 +7,9 @@ mod merge_layer;
 mod partition;
 mod query_layer;
 mod rebuild_layer;
+mod select_layer;

 pub use filter::{GroupQuorumFilter, KmerFilter, passes_all};
 pub use merge_layer::MergeMode;
 pub use partition::{KmerPartition, KmerSpectrum, PARTITIONS_SUBDIR};
+pub use select_layer::{AggOp, OutputCol};
@@ -0,0 +1,287 @@
+use std::fs;
+use std::io;
+use std::path::{Path, PathBuf};
+
+use obicompactvec::{
+    PersistentBitMatrix, PersistentBitMatrixBuilder, PersistentBitVecBuilder,
+    PersistentCompactIntMatrix, PersistentCompactIntMatrixBuilder, PersistentCompactIntVecBuilder,
+};
+use obilayeredmap::meta::PartitionMeta;
+use obilayeredmap::OLMError;
+use obiskio::{SKError, SKResult};
+
+use crate::partition::KmerPartition;
+
+const INDEX_SUBDIR: &str = "index";
+
+// ── AggOp ─────────────────────────────────────────────────────────────────────
+
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+pub enum AggOp {
+    Any,
+    All,
+    None,
+    Sum,
+    Min,
+    Max,
+}
+
+impl AggOp {
+    pub fn is_logical(self) -> bool {
+        matches!(self, AggOp::Any | AggOp::All | AggOp::None)
+    }
+}
+
+// ── OutputCol ─────────────────────────────────────────────────────────────────
+
+pub struct OutputCol {
+    pub label:   String,
+    pub indices: Vec<usize>,
+    pub op:      AggOp,
+}
+
+// ── Aggregation ───────────────────────────────────────────────────────────────
+
+#[inline]
+fn aggregate(op: AggOp, indices: &[usize], src_row: &[u32], threshold: u32) -> u32 {
+    match op {
+        AggOp::Any => {
+            if indices.iter().any(|&i| src_row[i] > threshold) { 1 } else { 0 }
+        }
+        AggOp::All => {
+            if indices.is_empty() { return 0; }
+            if indices.iter().all(|&i| src_row[i] > threshold) { 1 } else { 0 }
+        }
+        AggOp::None => {
+            if indices.iter().all(|&i| src_row[i] <= threshold) { 1 } else { 0 }
+        }
+        AggOp::Sum => {
+            indices.iter().map(|&i| src_row[i]).fold(0u32, |a, b| a.saturating_add(b))
+        }
+        AggOp::Min => indices.iter().map(|&i| src_row[i]).min().unwrap_or(0),
+        AggOp::Max => indices.iter().map(|&i| src_row[i]).max().unwrap_or(0),
+    }
+}
+
+// ── ColBuilder ────────────────────────────────────────────────────────────────
+
+enum ColBuilder {
+    Bit(PersistentBitVecBuilder),
+    Int(PersistentCompactIntVecBuilder),
+}
+
+impl ColBuilder {
+    fn set_val(&mut self, slot: usize, value: u32) {
+        match self {
+            ColBuilder::Bit(b) => b.set(slot, value > 0),
+            ColBuilder::Int(b) => b.set(slot, value),
+        }
+    }
+
+    fn close(self) -> SKResult<()> {
+        match self {
+            ColBuilder::Bit(b) => b.close().map_err(SKError::Io),
+            ColBuilder::Int(b) => b.close().map_err(SKError::Io),
+        }
+    }
+}
+
+// ── Helpers ───────────────────────────────────────────────────────────────────
+
+fn olm_to_sk(e: OLMError) -> SKError {
+    match e {
+        OLMError::Io(e) => SKError::Io(e),
+        other => SKError::InvalidData { context: "select", detail: other.to_string() },
+    }
+}
+
+fn col_path_bit(dir: &Path, col: usize) -> PathBuf {
+    dir.join(format!("col_{col:06}.pbiv"))
+}
+
+fn col_path_int(dir: &Path, col: usize) -> PathBuf {
+    dir.join(format!("col_{col:06}.pciv"))
+}
+
+fn write_matrix_meta(dir: &Path, n: usize, n_cols: usize) -> io::Result<()> {
+    fs::write(
+        dir.join("meta.json"),
+        format!("{{\"n\":{n},\"n_cols\":{n_cols}}}\n"),
+    )
+}
+
+/// Copy all plain files (not subdirectories) from `src_dir` to `dst_dir`.
+fn copy_layer_files(src_dir: &Path, dst_dir: &Path) -> io::Result<()> {
+    for entry in fs::read_dir(src_dir)? {
+        let entry = entry?;
+        let path = entry.path();
+        if path.is_file() {
+            fs::copy(&path, dst_dir.join(entry.file_name()))?;
+        }
+    }
+    Ok(())
+}
+
+// ── fill_builders ─────────────────────────────────────────────────────────────
+
+fn fill_builders(
+    builders: &mut [ColBuilder],
+    specs: &[OutputCol],
+    n: usize,
+    n_src: usize,
+    src_layer_dir: &Path,
+    src_is_count: bool,
+    threshold: u32,
+) -> SKResult<()> {
+    let mut src_buf = vec![0u32; n_src];
+
+    if src_is_count {
+        let mat = PersistentCompactIntMatrix::open(src_layer_dir).map_err(SKError::Io)?;
+        for slot in 0..n {
+            mat.fill_row(slot, &mut src_buf);
+            for (col, spec) in specs.iter().enumerate() {
+                builders[col].set_val(slot, aggregate(spec.op, &spec.indices, &src_buf, threshold));
+            }
+        }
+    } else {
+        let mat = PersistentBitMatrix::open(src_layer_dir).map_err(SKError::Io)?;
+        for slot in 0..n {
+            mat.fill_row(slot, &mut src_buf);
+            for (col, spec) in specs.iter().enumerate() {
+                builders[col].set_val(slot, aggregate(spec.op, &spec.indices, &src_buf, threshold));
+            }
+        }
+    }
+    Ok(())
+}
+
+// ── KmerPartition::select_partition ──────────────────────────────────────────
+
+impl KmerPartition {
+    /// Rewrite the data matrices of partition `i` in `src` into `self`.
+    ///
+    /// `specs` defines the output columns (projection/aggregation).
+    /// `output_presence` — if true, all output builders use bit (0/1) format.
+    /// `in_place` — `self` and `src` share the same root; write to temp dirs then swap.
+    pub fn select_partition(
+        &self,
+        src: &KmerPartition,
+        i: usize,
+        specs: &[OutputCol],
+        n_src_genomes: usize,
+        threshold: u32,
+        output_presence: bool,
+        in_place: bool,
+    ) -> SKResult<()> {
+        let src_index_dir = src.part_dir(i).join(INDEX_SUBDIR);
+        if !src_index_dir.exists() {
+            return Ok(());
+        }
+
+        let src_meta = PartitionMeta::load(&src_index_dir).map_err(olm_to_sk)?;
+        if src_meta.n_layers == 0 {
+            return Ok(());
+        }
+
+        let dst_index_dir = self.part_dir(i).join(INDEX_SUBDIR);
+        if !in_place {
+            fs::create_dir_all(&dst_index_dir)?;
+        }
+
+        let n_out = specs.len();
+        let data_subdir = if output_presence { "presence" } else { "counts" };
+
+        for l in 0..src_meta.n_layers {
+            let src_layer_dir = src_index_dir.join(format!("layer_{l}"));
+            if !src_layer_dir.exists() { continue; }
+
+            let dst_layer_dir = dst_index_dir.join(format!("layer_{l}"));
+
+            let counts_dir   = src_layer_dir.join("counts");
+            let presence_dir = src_layer_dir.join("presence");
+            let src_is_count = counts_dir.exists() && !presence_dir.exists();
+
+            // Determine number of slots from the source matrix.
+            let n = if counts_dir.exists() {
+                PersistentCompactIntMatrix::open(&src_layer_dir).map_err(SKError::Io)?.n()
+            } else if presence_dir.exists() {
+                PersistentBitMatrix::open(&src_layer_dir).map_err(SKError::Io)?.n()
+            } else {
+                // Implicit single-genome layer: no data matrix needed in output either.
+                if !in_place {
+                    fs::create_dir_all(&dst_layer_dir)?;
+                    copy_layer_files(&src_layer_dir, &dst_layer_dir)?;
+                }
+                continue;
+            };
+
+            // Choose the output data directory (temp name for in-place).
+            let (dst_data_dir, final_data_dir) = if in_place {
+                let tmp  = dst_layer_dir.join(format!("{data_subdir}_new"));
+                let perm = dst_layer_dir.join(data_subdir);
+                (tmp, perm)
+            } else {
+                let perm = dst_layer_dir.join(data_subdir);
+                (perm.clone(), perm)
+            };
+
+            if !in_place {
+                fs::create_dir_all(&dst_layer_dir)?;
+                copy_layer_files(&src_layer_dir, &dst_layer_dir)?;
+            }
+            fs::create_dir_all(&dst_data_dir)?;
+
+            // Initialise packed-format skeleton.
+            if output_presence {
+                PersistentBitMatrixBuilder::new(n, &dst_data_dir)
+                    .map_err(SKError::Io)?.close().map_err(SKError::Io)?;
+            } else {
+                PersistentCompactIntMatrixBuilder::new(n, &dst_data_dir)
+                    .map_err(SKError::Io)?.close().map_err(SKError::Io)?;
+            }
+
+            // Create column builders.
+            let mut builders: Vec<ColBuilder> = (0..n_out)
+                .map(|col| -> SKResult<ColBuilder> {
+                    if output_presence {
+                        Ok(ColBuilder::Bit(PersistentBitVecBuilder::new(
+                            n, &col_path_bit(&dst_data_dir, col),
+                        )?))
+                    } else {
+                        Ok(ColBuilder::Int(PersistentCompactIntVecBuilder::new(
+                            n, &col_path_int(&dst_data_dir, col),
+                        )?))
+                    }
+                })
+                .collect::<SKResult<_>>()?;
+
+            fill_builders(
+                &mut builders, specs, n, n_src_genomes,
+                &src_layer_dir, src_is_count, threshold,
+            )?;
+
+            for b in builders { b.close()?; }
+            write_matrix_meta(&dst_data_dir, n, n_out).map_err(SKError::Io)?;
+
+            // In-place: swap old data dir for new.
+            if in_place {
+                let old_data_dir = if src_is_count {
+                    dst_layer_dir.join("counts")
+                } else {
+                    dst_layer_dir.join("presence")
+                };
+                if old_data_dir.exists() {
+                    fs::remove_dir_all(&old_data_dir)?;
+                }
+                fs::rename(&dst_data_dir, &final_data_dir)?;
+            }
+        }
+
+        if !in_place {
+            PartitionMeta::load(&src_index_dir).map_err(olm_to_sk)?
+                .save(&dst_index_dir).map_err(olm_to_sk)?;
+        }
+
+        Ok(())
+    }
+}