feat: add select command for genome column projection and aggregation

Introduces the `select` CLI command to project and aggregate genome-level k-mer data by column. Adds `filter` as an alias for `rebuild`. The implementation uses parallel partition processing, supports metadata-driven grouping with configurable aggregation operators, and performs atomic in-place rewrites or filtered exports. Updates documentation and navigation accordingly.
Eric Coissac
2026-06-09 15:05:08 +02:00
parent b0dab452f6
commit e66adef23d
11 changed files with 958 additions and 1 deletions
+234
@@ -0,0 +1,234 @@
# `select` — column projection and aggregation
`select` transforms an index by operating on its **genome columns**: projecting a
subset of columns, aggregating groups of genomes into synthetic columns, or both.
It is the column-axis counterpart of `filter` (row-axis operations).
Following relational algebra conventions:
| Command | Relational operation | Axis |
|----------|---------------------|----------|
| `filter` | σ — selection | rows (k-mers) |
| `select` | π — projection | columns (genomes) |
The two commands compose naturally: run `filter` first to restrict the k-mer set,
then `select` to reshape the genome columns.
`select` never changes the k-mer set. The MPHF and `unitigs.bin` of each layer
are preserved unchanged; only the data matrices are rewritten.
---
## Synopsis
```sh
obikmer select <input-index>
{ --output <dir> | --in-place }
[--group <name>:<pred> ...]
[--group-op <name>:<op> ...]
[--aggregate-by <key> ]
[--aggregate-op <op> ]
[--select <col1,col2,...> ]
[--presence-threshold <N> ]
[--force ]
```
---
## Output destination
Exactly one of `--output` or `--in-place` must be specified.
**`--output <dir>`** — writes a new index to `<dir>`. The source index is
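unchanged. The MPHF and unitig files are copied; only the data matrices are
rewritten with the new column layout. If `<dir>` already exists, the command
fails unless `--force` is given, in which case the existing directory is
removed first.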
**`--in-place`** — rewrites the data matrices of the source index directly.
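Removed or replaced columns are lost. Each layer's new matrix is written to a
temporary directory and swapped in only once it is complete, so an interruption
never leaves a half-written matrix in place. The swap is done layer by layer,
however, so a run interrupted midway can leave some layers rewritten and others
still in their original form.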
---
## Defining output columns
### Named groups — `--group`
```
--group <name>:<pred>
```
Defines a named group of genomes using the same predicate syntax as
`filter`/`rebuild`. Repeatable; a genome can belong to several groups.
```sh
--group "pub:species=Betula_pubescens"
--group "nan:species=Betula_nana"
```
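A minimal sketch of how a `<name>:<pred>` argument could be split, assuming the name ends at the first colon so that the predicate may itself contain colons. `split_name_value` is a hypothetical helper written for this illustration, and rejecting an empty name is an added assumption rather than documented behaviour.

```rust
/// Split a `<name>:<value>` argument at the first colon.
/// Hypothetical helper: returns an error instead of terminating the process.
fn split_name_value(arg: &str) -> Result<(String, String), String> {
    match arg.split_once(':') {
        // The name is trimmed; the value (predicate) is kept verbatim.
        Some((name, value)) if !name.trim().is_empty() => {
            Ok((name.trim().to_string(), value.to_string()))
        }
        // No colon, or an empty name (assumed invalid in this sketch).
        _ => Err(format!("expected <name>:<value>, got: {arg}")),
    }
}
```

With this split, `pub:species=Betula_pubescens` yields the group name `pub` and the predicate `species=Betula_pubescens`.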
### Per-group operator — `--group-op`
```
--group-op <name>:<op>
```
Assigns an aggregation operator to a named group. Optional; if absent, the
default operator applies (see below).
```sh
--group-op "pub:any"
--group-op "nan:all"
```
### Shorthand — `--aggregate-by` / `--aggregate-op`
`--aggregate-by <key>` automatically creates one group per unique value of the
metadata key `<key>`, equivalent to one `--group <val>:<key>=<val>` per distinct
value. `--aggregate-op <op>` sets the operator for all auto-generated groups; a
`--group-op` entry naming one of those values takes precedence for that group.
`--aggregate-by` and `--group` are mutually exclusive.
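The grouping performed by `--aggregate-by` can be sketched as follows. This is an illustration under simplifying assumptions: genome metadata is modelled as a plain `BTreeMap<String, String>`, and `groups_by_key` and `meta` are hypothetical names. A `BTreeMap` keeps the groups sorted by metadata value, which gives a deterministic column order.

```rust
use std::collections::BTreeMap;

/// Group genome indices by the value of a metadata key.
/// `metadata[i]` is a simplified stand-in for the metadata of genome `i`.
fn groups_by_key(
    metadata: &[BTreeMap<String, String>],
    key: &str,
) -> BTreeMap<String, Vec<usize>> {
    let mut groups: BTreeMap<String, Vec<usize>> = BTreeMap::new();
    for (i, meta) in metadata.iter().enumerate() {
        // Genomes lacking the key are skipped: no "NA" group is created.
        if let Some(value) = meta.get(key) {
            groups.entry(value.clone()).or_default().push(i);
        }
    }
    groups
}

/// Small constructor used only to build example metadata.
fn meta(pairs: &[(&str, &str)]) -> BTreeMap<String, String> {
    pairs.iter().map(|(k, v)| (k.to_string(), v.to_string())).collect()
}
```

Each resulting entry maps a metadata value (the output column name) to the source genome columns it aggregates.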
### Column selection and ordering — `--select`
```
--select col1,col2,...
```
Lists the output columns in order. Each element is either a group name (defined
by `--group` or generated by `--aggregate-by`) or a genome label from the source
index (pass-through, no aggregation).
**Default when `--select` is absent:**
all defined groups in declaration order (for `--group`), or all generated groups
in metadata-value order (for `--aggregate-by`). Individual genomes not in any
group are excluded unless named explicitly.
**When neither `--group` nor `--aggregate-by` is specified:**
`--select` can still reference genome labels for pure column projection (no
aggregation). If `--select` is also absent, all genomes are output unchanged
(identity transform — useful combined with row filtering via a prior `filter`
run).
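The defaulting rules above amount to a three-way choice, sketched here. `resolve_columns` is a hypothetical helper and its parameters are simplified stand-ins for the parsed command-line arguments and the source genome list.

```rust
/// Decide the ordered list of output column names.
/// `select`: the parsed `--select` list, if given.
/// `group_order`: group names in declaration (or sorted metadata-value) order.
/// `genome_labels`: labels of the source index, in original order.
fn resolve_columns(
    select: Option<&[&str]>,
    group_order: &[&str],
    genome_labels: &[&str],
) -> Vec<String> {
    let names: &[&str] = match select {
        // An explicit list always wins and fixes the column order.
        Some(cols) => cols,
        // No list but some groups: output every group, nothing else.
        None if !group_order.is_empty() => group_order,
        // No list and no groups: identity transform over all genomes.
        None => genome_labels,
    };
    names.iter().map(|s| s.to_string()).collect()
}
```

Each returned name still has to be resolved afterwards as either a group name or a genome label; a name that is neither is an error.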
---
## Aggregation operators
| Operator | Input | Output | Semantics |
|----------|-------------|----------|-----------|
| `any` | pres / count | presence | 1 if ≥ 1 genome in group carries the k-mer |
| `all` | pres / count | presence | 1 if every genome in group carries the k-mer |
| `none` | pres / count | presence | 1 if no genome in group carries the k-mer |
| `sum` | count | count | sum of counts across the group |
| `min` | count | count | minimum count |
| `max` | count | count | maximum count |
**Default operator:**
- Presence index: `any`
- Count index: `sum`
Logical operators (`any`/`all`/`none`) on a count index use
`--presence-threshold N` (default 0): a genome "carries" the k-mer if its count
is strictly greater than N, so the default treats any non-zero count as present.
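A worked sketch of the operator semantics on a single k-mer, assuming the counts of the genomes in one group have already been gathered into a slice. The names `Op` and `aggregate` are chosen for this example only; an empty group is handled as described in the edge-case table (zero everywhere, except `none`, which yields 1).

```rust
#[derive(Clone, Copy)]
enum Op { Any, All, None, Sum, Min, Max }

/// Aggregate the counts of one group for one k-mer.
/// A genome "carries" the k-mer when its count is strictly above `threshold`.
fn aggregate(op: Op, group: &[u32], threshold: u32) -> u32 {
    let carries = |c: &u32| *c > threshold;
    match op {
        Op::Any => group.iter().any(carries) as u32,
        // An empty group yields 0 for `all`.
        Op::All => (!group.is_empty() && group.iter().all(carries)) as u32,
        // An empty group yields 1 for `none`.
        Op::None => (!group.iter().any(carries)) as u32,
        // Saturating sum: the result never wraps around.
        Op::Sum => group.iter().fold(0u32, |a, &b| a.saturating_add(b)),
        Op::Min => group.iter().copied().min().unwrap_or(0),
        Op::Max => group.iter().copied().max().unwrap_or(0),
    }
}
```

For a group with counts `[0, 2, 5]` and the default threshold 0, `any` gives 1, `all` gives 0, `sum` gives 7, `min` gives 0 and `max` gives 5.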
**Output index type:**
- If the source is a presence index, the output is always a presence index.
- If the source is a count index and every output column uses a logical operator,
the output is a presence index.
- Otherwise (at least one arithmetic operator on a count source, including
pass-through columns, which use `sum`), the output is a count index.
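The rule for the output index type reduces to a single predicate, sketched below. `IndexKind`, `ColOp` and `output_kind` are hypothetical names used only for this illustration.

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum IndexKind { Presence, Count }

#[derive(Clone, Copy)]
enum ColOp { Any, All, None, Sum, Min, Max }

impl ColOp {
    /// Logical operators produce 0/1 values; the others produce counts.
    fn is_logical(self) -> bool {
        matches!(self, ColOp::Any | ColOp::All | ColOp::None)
    }
}

/// Type of the output index, given the source type and the operator of
/// every output column.
fn output_kind(source: IndexKind, ops: &[ColOp]) -> IndexKind {
    if source == IndexKind::Presence || ops.iter().all(|op| op.is_logical()) {
        IndexKind::Presence
    } else {
        IndexKind::Count
    }
}
```

A single arithmetic column on a count source is therefore enough to keep the whole output a count index.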
---
## Behaviour for edge cases
| Situation | Behaviour |
|-----------|-----------|
| Genome missing the metadata key in `--aggregate-by` | genome ignored (no `NA` group) |
| Genome in multiple groups | contributes independently to each |
| `--group-op` references undefined group | error |
| `--select` element is neither group name nor genome label | error |
| `--output` and `--in-place` both specified | error |
| Neither `--output` nor `--in-place` | error |
| Group with zero matching genomes | column is all-zeros (or all-ones for `none`) |
---
## Examples
### Aggregate by metadata group, default operators
```sh
obikmer select myindex --output out --aggregate-by group
# one column per unique value of "group"; presence→any, count→sum
```
### Named groups with different operators
```sh
obikmer select myindex --output out \
--group "pub:species=Betula_pubescens" \
--group "nan:species=Betula_nana" \
--group-op "pub:any" \
--group-op "nan:all" \
--select "pub,nan"
```
### Mix aggregated group and individual genome
```sh
obikmer select myindex --output out \
--group "A:group=A" \
--select "A,Betula_nana--IGA-24-39"
```
### Pure column projection (no aggregation)
```sh
obikmer select myindex --output out \
--select "Betula_nana--TROM-V-149986,Betula_nana--AG-P04-25-01"
```
### In-place: keep only group A
```sh
obikmer select myindex --in-place --group "A:group=A" --select "A"
```
### Compose with filter
```sh
# Step 1: keep only B. nana-specific k-mers
obikmer filter myindex --output filtered \
--ingroup "species=Betula_nana" --outgroup "*"
# Step 2: aggregate genome columns by collection site
obikmer select filtered --output final --aggregate-by site
```
---
## Implementation notes
`select` does not rebuild the MPHF. The 256 partitions are processed in parallel
(rayon), each writing its output independently; results require no synchronisation
because every partition owns a distinct set of files.
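The independence of partitions can be illustrated without rayon. The sketch below deliberately swaps in scoped threads from the standard library so that it stays self-contained; `for_each_partition` and the strided work split are assumptions of this example, not the scheduling rayon performs. It reports the error of the lowest-numbered failing partition, if any.

```rust
use std::thread;

/// Run `work` on every partition id in `0..n_partitions` using `n_threads`
/// scoped threads. Returns the error of the lowest failing partition.
fn for_each_partition<F>(n_partitions: usize, n_threads: usize, work: F) -> Result<(), String>
where
    F: Fn(usize) -> Result<(), String> + Sync,
{
    let n_threads = n_threads.max(1);
    let mut errors: Vec<(usize, String)> = thread::scope(|s| {
        // Spawn all workers first; worker `t` handles t, t + n_threads, ...
        let handles: Vec<_> = (0..n_threads)
            .map(|t| {
                let work = &work;
                s.spawn(move || {
                    (t..n_partitions)
                        .step_by(n_threads)
                        .filter_map(|i| work(i).err().map(|e| (i, e)))
                        .collect::<Vec<_>>()
                })
            })
            .collect();
        // No shared state: each worker only returns its own errors.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker panicked"))
            .collect::<Vec<(usize, String)>>()
    });
    errors.sort_by_key(|(i, _)| *i);
    match errors.into_iter().next() {
        Some((_, e)) => Err(e),
        None => Ok(()),
    }
}
```

Because each partition writes only to its own files, the workers need no locks; the only coordination is collecting errors at the end.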
For each layer in each partition:
1. The slot count `n` is read by opening the source data matrix.
2. A new data matrix is built with M columns (M = number of output columns).
3. For each slot `s` in `0..n`:
- `matrix.fill_row(s, &mut old_row)` fills a reusable `N`-column buffer with the original row, without allocating.
- For each output column `j`:
- `new_row[j] = aggregate(op, old_row[group_indices])`.
- Pass-through columns are represented as single-element groups with the
default operator (`any` for presence, `sum` for count) — same code path.
- The new row is written slot by slot into each column builder.
4. All plain files in the source layer directory (`mphf.bin`, `unitigs.bin`,
evidence files, `layer_meta.json`) are copied verbatim; only the `presence/`
or `counts/` subdirectory is rewritten.
5. `index.meta` is rewritten with the new genome list and updated `with_counts`.
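Step 3 can be sketched in memory as follows. This is a simplified illustration: the on-disk matrix is replaced by a `Vec<Vec<u32>>` of rows, only the `sum` operator is shown, and `Column` and `project_sum` are hypothetical names. The point is the single row buffer reused for every slot and the column-wise output.

```rust
/// One output column: the source column indices it aggregates.
struct Column {
    indices: Vec<usize>,
}

/// Rebuild a matrix with one output vector per column, summing each group.
/// `source` holds one row of `n_src` values per slot.
fn project_sum(source: &[Vec<u32>], n_src: usize, columns: &[Column]) -> Vec<Vec<u32>> {
    let n = source.len();
    // One vector of `n` slots per output column.
    let mut out = vec![vec![0u32; n]; columns.len()];
    // Row buffer allocated once and reused for every slot.
    let mut row = vec![0u32; n_src];
    for slot in 0..n {
        row.copy_from_slice(&source[slot]);
        for (j, col) in columns.iter().enumerate() {
            out[j][slot] = col
                .indices
                .iter()
                .map(|&i| row[i])
                .fold(0u32, |a, b| a.saturating_add(b));
        }
    }
    out
}
```

A pass-through column is simply a `Column` with a single index, so it follows the same code path as an aggregated group.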
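**`--in-place` write strategy:** for each layer, new data is written to a
temporary sibling directory (`presence_new/` or `counts_new/`); on success the
old directory is removed and the temporary one is renamed into place. A layer's
original data is intact until that removal. Because partitions are processed in
parallel and `index.meta` is rewritten only after all of them succeed, an
interrupted run can leave stale `*_new/` directories and a mix of rewritten and
original layers.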
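The write-then-swap sequence for one layer can be sketched with the standard library alone. `swap_data_dir` is a hypothetical helper; clearing a leftover temporary directory before writing is an assumption added here for robustness, not something the description above promises.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Replace `layer/<old_name>` with data freshly written by `write`, which
/// ends up in `layer/<new_name>` (the two names may be identical).
fn swap_data_dir(
    layer: &Path,
    old_name: &str,
    new_name: &str,
    write: impl FnOnce(&Path) -> io::Result<()>,
) -> io::Result<()> {
    let tmp = layer.join(format!("{new_name}_new"));
    // Assumption of this sketch: discard leftovers from an interrupted run.
    if tmp.exists() {
        fs::remove_dir_all(&tmp)?;
    }
    fs::create_dir_all(&tmp)?;
    // If writing fails, the original data has not been touched yet.
    write(&tmp)?;
    let old = layer.join(old_name);
    if old.exists() {
        fs::remove_dir_all(&old)?;
    }
    // Renaming within the same parent directory does not copy the data.
    fs::rename(&tmp, layer.join(new_name))
}
```

Note the short window between removing the old directory and the rename: the removal and the rename are two separate filesystem operations, so the swap as a whole is not a single atomic step.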
+2 -1
@@ -9,12 +9,13 @@
| `superkmer` | Extract super-kmers from a sequence file and write to stdout |
| `index` | Build a complete genome index (scatter → dereplicate → count → layered MPHF) |
| `merge` | Merge multiple built indexes into one |
| `rebuild` | Filter and compact an existing index into a new single-layer index; supports the shared [kmer filtering](implementation/filtering.md) system |
| `rebuild` / `filter` | Filter and compact an existing index into a new single-layer index; supports the shared [kmer filtering](implementation/filtering.md) system (`filter` is an alias for `rebuild`) |
| `query` | Query an index with sequences and annotate matches |
| `dump` | Dump all indexed k-mers as CSV (kmer + per-genome counts or presence); supports the shared [kmer filtering](implementation/filtering.md) system; `--head N` limits output to the first N k-mers |
| `annotate` | Add or update genome metadata from a CSV file; or dump metadata as CSV |
| `distance` | Compute pairwise distance matrix between genomes; optionally build NJ/UPGMA trees; `--presence-threshold N` sets the minimum count to consider a k-mer present when computing Jaccard on count indexes (default 1) |
| `unitig` | Build a global de Bruijn graph across all partitions and enumerate its unitigs as FASTA; supports the shared [kmer filtering](implementation/filtering.md) system |
| `select` | Project and/or aggregate genome columns into a new or in-place index; the column-axis counterpart of `filter` (see [select](implementation/select.md)) |
| `estimate` | Estimate approximate-index parameters (z, evidence bits, FP rates) before indexing |
| `reindex` | Convert an index's evidence in-place: exact ↔ approx |
| `utils` | Miscellaneous index utilities: `--new-label NEW=OLD` renames a genome label; `--upgrade-index` adds missing `layer_meta.json` to old indexes |
+1
@@ -50,6 +50,7 @@ nav:
- PersistentBitVec: implementation/persistent_bit_vec.md
- Merge command: implementation/merge.md
- Kmer filtering: implementation/filtering.md
- Select command: implementation/select.md
- Architecture:
- Sequences: architecture/sequences/invariant.md
- Kmer index: architecture/index_architecture.md
+1
@@ -7,6 +7,7 @@ mod index;
mod merge;
mod rebuild;
mod reindex;
mod select;
mod stats;
pub use error::{OKIError, OKIResult};
+166
@@ -0,0 +1,166 @@
use std::fs;
use std::io;
use std::path::Path;
use obikpartitionner::{KmerPartition, OutputCol, PARTITIONS_SUBDIR};
use obisys::{Stage, progress_bar};
use rayon::prelude::*;
use tracing::info;
use crate::error::{OKIError, OKIResult};
use crate::index::KmerIndex;
use crate::meta::{GenomeInfo, IndexMeta};
use crate::state::{IndexState, SENTINEL_INDEXED};
impl KmerIndex {
/// Create a new index at `output` by projecting/aggregating the genome columns
/// of `src` according to `specs`.
///
/// `output_presence` — if true, output uses bit matrices (0/1), regardless of
/// whether the source stores counts. The caller is responsible for ensuring all
/// specs use logical operators when `output_presence=true` on a count source.
///
/// `threshold`: a genome carries a k-mer when its value is strictly greater
/// than `threshold` (logical operators only).
/// `force`: remove an existing `output` directory instead of failing.
pub fn select<P: AsRef<Path>>(
output: P,
src: &KmerIndex,
specs: &[OutputCol],
threshold: u32,
output_presence: bool,
force: bool,
) -> OKIResult<Self> {
let output = output.as_ref();
if src.state() != IndexState::Indexed {
return Err(OKIError::NotIndexed(src.root_path.clone()));
}
if output.exists() {
if force {
fs::remove_dir_all(output)?;
} else {
return Err(OKIError::Io(io::Error::new(
io::ErrorKind::AlreadyExists,
format!("{}: output directory already exists", output.display()),
)));
}
}
fs::create_dir_all(output)?;
let mut meta = IndexMeta::new(src.meta.config.clone());
meta.config.with_counts = !output_presence;
meta.genomes = specs.iter()
.map(|s| GenomeInfo::new(s.label.clone()))
.collect();
meta.write(output)?;
let n_src_genomes = src.meta.genomes.len();
let n_partitions = src.partition.n_partitions();
fs::create_dir_all(output.join(PARTITIONS_SUBDIR))?;
let dst_partition = KmerPartition::open_with_config(
output,
meta.config.kmer_size,
meta.config.minimizer_size,
meta.config.n_bits,
)?;
info!(
"select: {} partition(s), {} source genome(s) → {} output column(s)",
n_partitions, n_src_genomes, specs.len(),
);
let t = Stage::start("select");
let pb = progress_bar("select", n_partitions as u64, "partitions");
let src_partition = &src.partition;
let errors: Vec<obiskio::SKError> = (0..n_partitions)
.into_par_iter()
.filter_map(|i| {
let result = dst_partition.select_partition(
src_partition, i, specs,
n_src_genomes, threshold, output_presence,
false,
);
pb.inc(1);
result.err()
})
.collect();
pb.finish_and_clear();
if let Some(e) = errors.into_iter().next() {
return Err(OKIError::Partition(e));
}
let _ = t.stop();
fs::File::create(output.join(SENTINEL_INDEXED))?;
let idx = KmerIndex::open(output)?;
idx.pack_matrices()?;
Ok(idx)
}
/// Rewrite the genome columns of this index in-place according to `specs`.
///
/// The MPHF and unitig files are unchanged; only data matrices are rewritten.
pub fn select_in_place(
&mut self,
specs: &[OutputCol],
threshold: u32,
output_presence: bool,
) -> OKIResult<()> {
if self.state() != IndexState::Indexed {
return Err(OKIError::NotIndexed(self.root_path.clone()));
}
let n_src_genomes = self.meta.genomes.len();
let n_partitions = self.partition.n_partitions();
// Open a second handle to the same path so we can borrow src and dst simultaneously.
let src_partition = KmerPartition::open_with_config(
&self.root_path,
self.meta.config.kmer_size,
self.meta.config.minimizer_size,
self.meta.config.n_bits,
)?;
info!(
"select (in-place): {} partition(s), {} source genome(s) → {} output column(s)",
n_partitions, n_src_genomes, specs.len(),
);
let t = Stage::start("select");
let pb = progress_bar("select", n_partitions as u64, "partitions");
let errors: Vec<obiskio::SKError> = (0..n_partitions)
.into_par_iter()
.filter_map(|i| {
let result = self.partition.select_partition(
&src_partition, i, specs,
n_src_genomes, threshold, output_presence,
true,
);
pb.inc(1);
result.err()
})
.collect();
pb.finish_and_clear();
if let Some(e) = errors.into_iter().next() {
return Err(OKIError::Partition(e));
}
let _ = t.stop();
// Update index.meta with new genome list and with_counts flag.
self.meta.config.with_counts = !output_presence;
self.meta.genomes = specs.iter()
.map(|s| GenomeInfo::new(s.label.clone()))
.collect();
self.meta.write(&self.root_path)?;
self.pack_matrices()?;
Ok(())
}
}
+1
@@ -1,6 +1,7 @@
pub mod annotate;
pub mod pack;
pub(crate) mod predicate;
pub mod select;
pub mod utils;
pub mod distance;
pub mod dump;
+10
@@ -230,6 +230,16 @@ impl FilterArgs {
}
}
/// Returns indices of genomes matching `pred_str` (single predicate).
pub fn matching_genome_indices(pred_str: &str, genomes: &[GenomeInfo]) -> Result<Vec<usize>, String> {
let pred = MetaPred::parse(pred_str)?;
Ok(genomes.iter().enumerate()
// Keep only genomes for which the predicate is definitely true.
.filter(|(_, g)| pred.eval(&g.meta) == Some(true))
.map(|(i, _)| i)
.collect())
}
pub struct GroupFilterParams {
pub threshold: u32,
pub min_count: Option<usize>,
+248
@@ -0,0 +1,248 @@
use std::collections::{BTreeMap, HashMap};
use std::path::PathBuf;
use clap::{Args, ValueEnum};
use obikindex::{GenomeInfo, KmerIndex};
use obikpartitionner::{AggOp, OutputCol};
use tracing::info;
use super::predicate::matching_genome_indices;
// ── CLI types ─────────────────────────────────────────────────────────────────
#[derive(Debug, Clone, Copy, PartialEq, Eq, ValueEnum)]
pub enum AggOpArg {
Any,
All,
None,
Sum,
Min,
Max,
}
impl From<AggOpArg> for AggOp {
fn from(a: AggOpArg) -> Self {
match a {
AggOpArg::Any => AggOp::Any,
AggOpArg::All => AggOp::All,
AggOpArg::None => AggOp::None,
AggOpArg::Sum => AggOp::Sum,
AggOpArg::Min => AggOp::Min,
AggOpArg::Max => AggOp::Max,
}
}
}
#[derive(Args)]
pub struct SelectArgs {
/// Source index directory
pub source: PathBuf,
/// Output index directory (mutually exclusive with --in-place)
#[arg(long, conflicts_with = "in_place")]
pub output: Option<PathBuf>,
/// Rewrite the source index in-place (mutually exclusive with --output)
#[arg(long)]
pub in_place: bool,
/// Define a named group: `<name>:<pred>` (repeatable; mutually exclusive with --aggregate-by)
#[arg(long, value_name = "NAME:PRED", conflicts_with = "aggregate_by")]
pub group: Vec<String>,
/// Per-group aggregation operator: `<name>:<op>` (repeatable)
#[arg(long, value_name = "NAME:OP")]
pub group_op: Vec<String>,
/// Auto-create one group per unique value of metadata key <KEY>
#[arg(long, value_name = "KEY", conflicts_with = "group")]
pub aggregate_by: Option<String>,
/// Aggregation operator for all auto-generated groups
#[arg(long, value_name = "OP")]
pub aggregate_op: Option<AggOpArg>,
/// Output columns in order: group names or genome labels, comma-separated
#[arg(long, value_name = "COL,...", value_delimiter = ',')]
pub select: Option<Vec<String>>,
/// Count threshold: a genome "carries" the k-mer when its count is strictly greater than this value (logical ops only)
#[arg(long, default_value = "0")]
pub presence_threshold: u32,
/// Overwrite existing output directory
#[arg(short, long)]
pub force: bool,
}
// ── Helpers ───────────────────────────────────────────────────────────────────
fn parse_name_value(s: &str, flag: &str) -> (String, String) {
match s.find(':') {
Some(pos) => (s[..pos].trim().to_string(), s[pos + 1..].to_string()),
std::option::Option::None => {
eprintln!("error in {flag}: expected <name>:<value>, got: {s}");
std::process::exit(1);
}
}
}
fn parse_agg_op(s: &str) -> AggOp {
match s.to_lowercase().as_str() {
"any" => AggOp::Any,
"all" => AggOp::All,
"none" => AggOp::None,
"sum" => AggOp::Sum,
"min" => AggOp::Min,
"max" => AggOp::Max,
other => {
eprintln!("unknown aggregation operator: {other}; valid: any, all, none, sum, min, max");
std::process::exit(1);
}
}
}
fn default_op(src_is_count: bool) -> AggOp {
if src_is_count { AggOp::Sum } else { AggOp::Any }
}
// ── build_specs ───────────────────────────────────────────────────────────────
/// Resolve CLI arguments into an ordered list of `OutputCol`.
///
/// Returns `(specs, output_presence)`.
fn build_specs(
args: &SelectArgs,
genomes: &[GenomeInfo],
src_is_count: bool,
) -> (Vec<OutputCol>, bool) {
// ── 1. Build group_indices: name → Vec<usize> ────────────────────────────
// Also keep insertion order for the default case where `--select` is absent.
let mut group_order: Vec<String> = Vec::new();
let mut group_indices: HashMap<String, Vec<usize>> = HashMap::new();
if let Some(ref key) = args.aggregate_by {
// One group per unique value of `key`, in sorted order.
let mut value_to_indices: BTreeMap<String, Vec<usize>> = BTreeMap::new();
for (i, g) in genomes.iter().enumerate() {
if let Some(v) = g.meta.get(key) {
value_to_indices.entry(v.clone()).or_default().push(i);
}
}
for (v, idxs) in value_to_indices {
group_order.push(v.clone());
group_indices.insert(v, idxs);
}
} else {
for raw in &args.group {
let (name, pred) = parse_name_value(raw, "--group");
let idxs = matching_genome_indices(&pred, genomes).unwrap_or_else(|e| {
eprintln!("error in --group {name}: {e}");
std::process::exit(1);
});
if !group_indices.contains_key(&name) {
group_order.push(name.clone());
}
group_indices.insert(name, idxs);
}
}
// ── 2. Build per-group ops ────────────────────────────────────────────────
let global_op = args.aggregate_op.map(AggOp::from);
let mut group_op: HashMap<String, AggOp> = HashMap::new();
for raw in &args.group_op {
let (name, op_str) = parse_name_value(raw, "--group-op");
if !group_indices.contains_key(&name) {
eprintln!("--group-op references undefined group: {name}");
std::process::exit(1);
}
group_op.insert(name, parse_agg_op(&op_str));
}
// ── 3. Genome label → index map for pass-through columns ─────────────────
let label_to_idx: HashMap<&str, usize> = genomes.iter().enumerate()
.map(|(i, g)| (g.label.as_str(), i))
.collect();
// ── 4. Determine output column names ─────────────────────────────────────
let col_names: Vec<String> = if let Some(ref sel) = args.select {
sel.clone()
} else if !group_order.is_empty() {
group_order.clone()
} else {
// Identity: all genomes in original order
genomes.iter().map(|g| g.label.clone()).collect()
};
// ── 5. Build OutputCol list ───────────────────────────────────────────────
let mut specs: Vec<OutputCol> = Vec::with_capacity(col_names.len());
for name in &col_names {
if let Some(idxs) = group_indices.get(name) {
let op = group_op.get(name)
.copied()
.or(global_op)
.unwrap_or_else(|| default_op(src_is_count));
specs.push(OutputCol { label: name.clone(), indices: idxs.clone(), op });
} else if let Some(&idx) = label_to_idx.get(name.as_str()) {
// Pass-through: single-element group with default op.
let op = default_op(src_is_count);
specs.push(OutputCol { label: name.clone(), indices: vec![idx], op });
} else {
eprintln!("--select: unknown column '{name}' (not a group name or genome label)");
std::process::exit(1);
}
}
if specs.is_empty() {
eprintln!("select: no output columns defined");
std::process::exit(1);
}
// ── 6. Determine output type ──────────────────────────────────────────────
let output_presence = !src_is_count
|| specs.iter().all(|s| s.op.is_logical());
(specs, output_presence)
}
// ── run ───────────────────────────────────────────────────────────────────────
pub fn run(args: SelectArgs) {
if !args.in_place && args.output.is_none() {
eprintln!("error: one of --output or --in-place must be specified");
std::process::exit(1);
}
let mut src = KmerIndex::open(&args.source).unwrap_or_else(|e| {
eprintln!("error opening source index: {e}");
std::process::exit(1);
});
let src_is_count = src.meta().config.with_counts;
let (specs, output_presence) = build_specs(&args, &src.meta().genomes, src_is_count);
info!(
"select: {} genome(s) → {} output column(s), output={}",
src.meta().genomes.len(),
specs.len(),
if output_presence { "presence" } else { "count" },
);
if args.in_place {
src.select_in_place(&specs, args.presence_threshold, output_presence)
.unwrap_or_else(|e| {
eprintln!("select error: {e}");
std::process::exit(1);
});
info!("selected in-place → {}", args.source.display());
} else {
let output = args.output.unwrap();
KmerIndex::select(&output, &src, &specs, args.presence_threshold, output_presence, args.force)
.unwrap_or_else(|e| {
eprintln!("select error: {e}");
std::process::exit(1);
});
info!("selected index → {}", output.display());
}
}
+6
@@ -22,6 +22,10 @@ enum Commands {
Merge(cmd::merge::MergeArgs),
/// Filter and compact an existing index into a new single-layer index
Rebuild(cmd::rebuild::RebuildArgs),
/// Alias for rebuild
Filter(cmd::rebuild::RebuildArgs),
/// Project and/or aggregate genome columns into a new or in-place index
Select(cmd::select::SelectArgs),
/// Query an index with sequences and annotate matches
Query(cmd::query::QueryArgs),
/// Dump all indexed kmers as CSV (kmer + per-genome counts or presence)
@@ -66,6 +70,8 @@ fn main() {
Commands::Merge(args) => cmd::merge::run(args),
Commands::Dump(args) => cmd::dump::run(args),
Commands::Rebuild(args) => cmd::rebuild::run(args),
Commands::Filter(args) => cmd::rebuild::run(args),
Commands::Select(args) => cmd::select::run(args),
Commands::Query(args) => cmd::query::run(args),
Commands::Annotate(args) => cmd::annotate::run(args),
Commands::Distance(args) => cmd::distance::run(args),
+2
@@ -7,7 +7,9 @@ mod merge_layer;
mod partition;
mod query_layer;
mod rebuild_layer;
mod select_layer;
pub use filter::{GroupQuorumFilter, KmerFilter, passes_all};
pub use merge_layer::MergeMode;
pub use partition::{KmerPartition, KmerSpectrum, PARTITIONS_SUBDIR};
pub use select_layer::{AggOp, OutputCol};
+287
@@ -0,0 +1,287 @@
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use obicompactvec::{
PersistentBitMatrix, PersistentBitMatrixBuilder, PersistentBitVecBuilder,
PersistentCompactIntMatrix, PersistentCompactIntMatrixBuilder, PersistentCompactIntVecBuilder,
};
use obilayeredmap::meta::PartitionMeta;
use obilayeredmap::OLMError;
use obiskio::{SKError, SKResult};
use crate::partition::KmerPartition;
const INDEX_SUBDIR: &str = "index";
// ── AggOp ─────────────────────────────────────────────────────────────────────
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AggOp {
Any,
All,
None,
Sum,
Min,
Max,
}
impl AggOp {
pub fn is_logical(self) -> bool {
matches!(self, AggOp::Any | AggOp::All | AggOp::None)
}
}
// ── OutputCol ─────────────────────────────────────────────────────────────────
pub struct OutputCol {
pub label: String,
pub indices: Vec<usize>,
pub op: AggOp,
}
// ── Aggregation ───────────────────────────────────────────────────────────────
#[inline]
fn aggregate(op: AggOp, indices: &[usize], src_row: &[u32], threshold: u32) -> u32 {
match op {
AggOp::Any => {
if indices.iter().any(|&i| src_row[i] > threshold) { 1 } else { 0 }
}
AggOp::All => {
if indices.is_empty() { return 0; }
if indices.iter().all(|&i| src_row[i] > threshold) { 1 } else { 0 }
}
AggOp::None => {
if indices.iter().all(|&i| src_row[i] <= threshold) { 1 } else { 0 }
}
AggOp::Sum => {
indices.iter().map(|&i| src_row[i]).fold(0u32, |a, b| a.saturating_add(b))
}
AggOp::Min => indices.iter().map(|&i| src_row[i]).min().unwrap_or(0),
AggOp::Max => indices.iter().map(|&i| src_row[i]).max().unwrap_or(0),
}
}
// ── ColBuilder ────────────────────────────────────────────────────────────────
enum ColBuilder {
Bit(PersistentBitVecBuilder),
Int(PersistentCompactIntVecBuilder),
}
impl ColBuilder {
fn set_val(&mut self, slot: usize, value: u32) {
match self {
ColBuilder::Bit(b) => b.set(slot, value > 0),
ColBuilder::Int(b) => b.set(slot, value),
}
}
fn close(self) -> SKResult<()> {
match self {
ColBuilder::Bit(b) => b.close().map_err(SKError::Io),
ColBuilder::Int(b) => b.close().map_err(SKError::Io),
}
}
}
// ── Helpers ───────────────────────────────────────────────────────────────────
fn olm_to_sk(e: OLMError) -> SKError {
match e {
OLMError::Io(e) => SKError::Io(e),
other => SKError::InvalidData { context: "select", detail: other.to_string() },
}
}
fn col_path_bit(dir: &Path, col: usize) -> PathBuf {
dir.join(format!("col_{col:06}.pbiv"))
}
fn col_path_int(dir: &Path, col: usize) -> PathBuf {
dir.join(format!("col_{col:06}.pciv"))
}
fn write_matrix_meta(dir: &Path, n: usize, n_cols: usize) -> io::Result<()> {
fs::write(
dir.join("meta.json"),
format!("{{\"n\":{n},\"n_cols\":{n_cols}}}\n"),
)
}
/// Copy all plain files (not subdirectories) from `src_dir` to `dst_dir`.
fn copy_layer_files(src_dir: &Path, dst_dir: &Path) -> io::Result<()> {
for entry in fs::read_dir(src_dir)? {
let entry = entry?;
let path = entry.path();
if path.is_file() {
fs::copy(&path, dst_dir.join(entry.file_name()))?;
}
}
Ok(())
}
// ── fill_builders ─────────────────────────────────────────────────────────────
fn fill_builders(
builders: &mut [ColBuilder],
specs: &[OutputCol],
n: usize,
n_src: usize,
src_layer_dir: &Path,
src_is_count: bool,
threshold: u32,
) -> SKResult<()> {
let mut src_buf = vec![0u32; n_src];
if src_is_count {
let mat = PersistentCompactIntMatrix::open(src_layer_dir).map_err(SKError::Io)?;
for slot in 0..n {
mat.fill_row(slot, &mut src_buf);
for (col, spec) in specs.iter().enumerate() {
builders[col].set_val(slot, aggregate(spec.op, &spec.indices, &src_buf, threshold));
}
}
} else {
let mat = PersistentBitMatrix::open(src_layer_dir).map_err(SKError::Io)?;
for slot in 0..n {
mat.fill_row(slot, &mut src_buf);
for (col, spec) in specs.iter().enumerate() {
builders[col].set_val(slot, aggregate(spec.op, &spec.indices, &src_buf, threshold));
}
}
}
Ok(())
}
// ── KmerPartition::select_partition ──────────────────────────────────────────
impl KmerPartition {
/// Rewrite the data matrices of partition `i` in `src` into `self`.
///
/// `specs` defines the output columns (projection/aggregation).
/// `output_presence` — if true, all output builders use bit (0/1) format.
/// `in_place` — `self` and `src` share the same root; write to temp dirs then swap.
pub fn select_partition(
&self,
src: &KmerPartition,
i: usize,
specs: &[OutputCol],
n_src_genomes: usize,
threshold: u32,
output_presence: bool,
in_place: bool,
) -> SKResult<()> {
let src_index_dir = src.part_dir(i).join(INDEX_SUBDIR);
if !src_index_dir.exists() {
return Ok(());
}
let src_meta = PartitionMeta::load(&src_index_dir).map_err(olm_to_sk)?;
if src_meta.n_layers == 0 {
return Ok(());
}
let dst_index_dir = self.part_dir(i).join(INDEX_SUBDIR);
if !in_place {
fs::create_dir_all(&dst_index_dir)?;
}
let n_out = specs.len();
let data_subdir = if output_presence { "presence" } else { "counts" };
for l in 0..src_meta.n_layers {
let src_layer_dir = src_index_dir.join(format!("layer_{l}"));
if !src_layer_dir.exists() { continue; }
let dst_layer_dir = dst_index_dir.join(format!("layer_{l}"));
let counts_dir = src_layer_dir.join("counts");
let presence_dir = src_layer_dir.join("presence");
let src_is_count = counts_dir.exists() && !presence_dir.exists();
// Determine number of slots from the source matrix.
let n = if src_is_count {
PersistentCompactIntMatrix::open(&src_layer_dir).map_err(SKError::Io)?.n()
} else if presence_dir.exists() {
PersistentBitMatrix::open(&src_layer_dir).map_err(SKError::Io)?.n()
} else {
// Implicit single-genome layer: no data matrix needed in output either.
if !in_place {
fs::create_dir_all(&dst_layer_dir)?;
copy_layer_files(&src_layer_dir, &dst_layer_dir)?;
}
continue;
};
// Choose the output data directory (temp name for in-place).
let (dst_data_dir, final_data_dir) = if in_place {
let tmp = dst_layer_dir.join(format!("{data_subdir}_new"));
let perm = dst_layer_dir.join(data_subdir);
(tmp, perm)
} else {
let perm = dst_layer_dir.join(data_subdir);
(perm.clone(), perm)
};
if !in_place {
fs::create_dir_all(&dst_layer_dir)?;
copy_layer_files(&src_layer_dir, &dst_layer_dir)?;
}
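// Discard leftovers from a previously interrupted in-place run.
if in_place && dst_data_dir.exists() {
fs::remove_dir_all(&dst_data_dir)?;
}
fs::create_dir_all(&dst_data_dir)?;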
// Initialise packed-format skeleton.
if output_presence {
PersistentBitMatrixBuilder::new(n, &dst_data_dir)
.map_err(SKError::Io)?.close().map_err(SKError::Io)?;
} else {
PersistentCompactIntMatrixBuilder::new(n, &dst_data_dir)
.map_err(SKError::Io)?.close().map_err(SKError::Io)?;
}
// Create column builders.
let mut builders: Vec<ColBuilder> = (0..n_out)
.map(|col| -> SKResult<ColBuilder> {
if output_presence {
Ok(ColBuilder::Bit(PersistentBitVecBuilder::new(
n, &col_path_bit(&dst_data_dir, col),
)?))
} else {
Ok(ColBuilder::Int(PersistentCompactIntVecBuilder::new(
n, &col_path_int(&dst_data_dir, col),
)?))
}
})
.collect::<SKResult<_>>()?;
fill_builders(
&mut builders, specs, n, n_src_genomes,
&src_layer_dir, src_is_count, threshold,
)?;
for b in builders { b.close()?; }
write_matrix_meta(&dst_data_dir, n, n_out).map_err(SKError::Io)?;
// In-place: swap old data dir for new.
if in_place {
let old_data_dir = if src_is_count {
dst_layer_dir.join("counts")
} else {
dst_layer_dir.join("presence")
};
if old_data_dir.exists() {
fs::remove_dir_all(&old_data_dir)?;
}
fs::rename(&dst_data_dir, &final_data_dir)?;
}
}
if !in_place {
PartitionMeta::load(&src_index_dir).map_err(olm_to_sk)?
.save(&dst_index_dir).map_err(olm_to_sk)?;
}
Ok(())
}
}