feat: add pairwise distance computation and phylogenetic trees

This commit adds a `distance` CLI subcommand that computes pairwise genomic distance matrices with a configurable metric (Jaccard, Hamming, Bray-Curtis, Euclidean, or Hellinger). It can also build a phylogenetic tree (NJ or UPGMA) in Newick format, and it writes results as CSV. The distance computation dispatches to an optimized backend according to the index configuration, supports parallel iteration, and handles missing data. The commit also adds a `dump` task that exports k-mer to genome mappings as CSV, introduces an `InvalidInput` error variant, updates dependencies for numerical operations and tree construction, and makes minor module reorganizations.
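
The doc comment added in this commit states that the Euclidean distance in the Hellinger (√relative-frequency) space equals √2 × the Hellinger distance. A minimal sketch of that relationship follows. It uses plain slices rather than the crate's matrix types, and every function name except the two distance names is hypothetical. It also assumes each count vector has a positive total, and that relative frequencies are counts divided by that total; the commit's own normalisation via `col_weights()` is not shown in this hunk.

```rust
/// Convert raw counts to relative frequencies.
/// Hypothetical helper; assumes the counts sum to a positive total.
fn relative_freqs(counts: &[f64]) -> Vec<f64> {
    let total: f64 = counts.iter().sum();
    counts.iter().map(|c| c / total).collect()
}

/// Sum of squared differences between square-rooted relative frequencies.
/// Hypothetical stand-in for one entry of the partial Hellinger matrix.
fn hellinger_partial(a: &[f64], b: &[f64]) -> f64 {
    let pa = relative_freqs(a);
    let pb = relative_freqs(b);
    pa.iter()
        .zip(pb.iter())
        .map(|(x, y)| (x.sqrt() - y.sqrt()).powi(2))
        .sum()
}

/// Normalised Hellinger distance, bounded by 1.
fn hellinger_dist(a: &[f64], b: &[f64]) -> f64 {
    hellinger_partial(a, b).sqrt() / std::f64::consts::SQRT_2
}

/// Unnormalised variant: Euclidean distance in the square-root space.
fn hellinger_euclidean_dist(a: &[f64], b: &[f64]) -> f64 {
    hellinger_partial(a, b).sqrt()
}
```

For two samples with no shared features, such as `[1.0, 0.0]` and `[0.0, 1.0]`, the normalised form reaches its maximum of 1 and the unnormalised form reaches √2.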
Eric Coissac
2026-05-21 11:47:35 +02:00
parent 9e1d6f2f25
commit 3fa1dbf8cc
13 changed files with 512 additions and 7 deletions
@@ -80,6 +80,13 @@ pub trait CountPartials: ColumnWeights {
        let sq2 = std::f64::consts::SQRT_2;
        self.partial_hellinger(&global).mapv(|v| v.sqrt() / sq2)
    }
    /// Euclidean distance in the Hellinger (√relative-frequency) space.
    /// Equal to √2 × hellinger_dist — unnormalised variant.
    fn hellinger_euclidean_dist_matrix(&self) -> Array2<f64> {
        let global = self.col_weights();
        self.partial_hellinger(&global).mapv(|v| v.sqrt())
    }
}
/// Partial distance matrices for bit-based data (`PersistentBitMatrix`).