Commit Graph

14 Commits

Author SHA1 Message Date
Eric Coissac 3fa1dbf8cc feat: add pairwise distance computation and phylogenetic trees
This commit introduces a new `distance` CLI subcommand that computes pairwise genomic distance matrices using configurable metrics (Jaccard, Hamming, Bray-Curtis, Euclidean, and Hellinger). It optionally generates phylogenetic trees (NJ or UPGMA) in Newick format and outputs results as CSV. The implementation adds a robust distance computation backend that dynamically routes to optimized backends based on index configuration, supports parallel iteration, and gracefully handles missing data. Additionally, it adds a `dump` task for exporting k-mer to genome mappings as CSV, introduces an `InvalidInput` error variant, updates dependencies to support numerical operations and tree construction, and performs minor module reorganizations.
2026-05-21 15:03:08 +02:00
Eric Coissac e1d59fde54 feat: add merge command to consolidate k-mer indexes
Introduces a new `merge` CLI subcommand and underlying implementation to consolidate multiple pre-indexed k-mer indexes into a single output. Adds `append_column` methods to persistent bit and int matrices to enable incremental genome column expansion without rebuilding the MPHF. Includes new error variants for index readiness and configuration mismatches, adds a `--force` flag to the index command, and updates documentation and navigation structure accordingly.
2026-05-21 13:44:50 +02:00
Eric Coissac 13e69e23c9 feat: introduce trait-based distance aggregation and layered store
Introduces ColumnWeights, CountPartials, and BitPartials traits to compute and finalize partial distance matrices. Implements these traits for PersistentBitMatrix, PersistentCompactIntMatrix, and a new LayeredStore<S> wrapper that aggregates metrics across layers via parallel reduction. Adds ndarray for numerical aggregation and updates architecture documentation to reflect the trait-driven design and pending refactoring roadmap.
2026-05-16 14:41:49 +08:00
Eric Coissac 45d49ed501 docs: document k-mer index architecture and refactor distance metrics
Add comprehensive documentation for the `obilayeredmap` crate, `PersistentCompactIntVec`, `PersistentBitVec`, and the hierarchical k-mer index architecture, including sidebar navigation updates across all documentation pages. Refactor the Bray-Curtis distance computation in `obicompactvec` to decouple numerator and denominator calculations, replacing direct pairwise calls with explicit loops over precomputed sums. Update tests to verify column sum accuracy and align with the simplified API.
2026-05-15 21:24:30 +08:00
Eric Coissac 8409c852ef feat: Add parallel column counts and partial distance metrics
Introduces parallel `count_ones` for `BitMatrix` and parallel column-sum aggregation alongside three pairwise distance constructors (Bray-Curtis, Euclidean, Hellinger) for `IntMatrix`. These methods support partial, layer-wise data by accepting precomputed global column sums for normalization, enabling additive decomposition across partitions. Includes unit tests verifying mathematical equivalence and partition additivity.
2026-05-15 20:44:58 +08:00
Eric Coissac 8bee9f3017 feat: add parallel distance matrix computation for bit and int matrices
Introduce parallel distance matrix generation using `ndarray` and `rayon` for both `BitMatrix` and `IntMatrix`. Adds full and additive-partial variants for Jaccard, Hamming, Bray-Curtis, Euclidean, and Hellinger metrics. Includes comprehensive unit tests verifying matrix symmetry, zero diagonals, and numerical correctness against pairwise calculations.
2026-05-15 17:23:12 +08:00
Eric Coissac 1881e98bad feat(bitvec): add partial Jaccard, fix padding, optimize constructor
Introduces `partial_jaccard_dist` to return raw intersection and union counts, improving Jaccard distance flexibility. Corrects `not()` to explicitly zero padding bits in the final word, ensuring accurate bit-counting for partially-filled words. Adds an optimized `build_from_counts` constructor.
2026-05-15 15:04:10 +08:00
Eric Coissac b218bf012b feat: introduce column-major matrix storage and migrate layered map
Introduces `PersistentBitMatrix` and `PersistentCompactIntMatrix` to replace single-file vector storage with a column-major, directory-based layout. Each column is persisted as an individual file alongside a lightweight `meta.json` for dimension tracking. Migrates `obilayeredmap` to use these multi-column structures, updating Rust APIs, query return types, and build signatures. Includes comprehensive documentation, unit and integration tests for persistence and accessors, and refactors distance calculation helpers.
2026-05-14 21:19:18 +08:00
Eric Coissac 0b3fcf3cf0 feat: add PersistentBitVec and upgrade PersistentCompactIntVec format
Introduces PersistentBitVec, a dense, memory-mapped bit vector optimized for bulk u64-word operations and SIMD acceleration, complete with bitwise operators and Jaccard/Hamming distance metrics. Upgrades PersistentCompactIntVec to a unified .pciv format using 64-bit indices and offsets, consolidating the binary layout and updating builder/reader lifecycles accordingly. Adds corresponding documentation, updates MkDocs navigation, and implements a comprehensive test suite for persistence round-trips, edge cases, and metric accuracy.
2026-05-14 09:01:36 +08:00
Eric Coissac c18c5d2600 feat: add sum and sumadd methods to PersistentCompactIntVec
Adds a `sum()` method to compute the aggregate of all elements using u64 arithmetic to safely handle potential overflows. Introduces a `sumadd()` method for element-wise addition into the builder, enforcing strict length equality and using `checked_add` for safe accumulation. Includes unit tests to verify correct aggregation and overflow safety.
2026-05-13 21:49:22 +08:00
Eric Coissac 0733287de5 feat(obicompactvec): migrate to memory-mapped file storage
Refactor `PersistentCompactIntVecBuilder` to replace in-memory `Vec<u8>` with `MmapMut` for primary storage. Decouples initialization from finalization by updating `finalize_pciv` to truncate, append overflow data, and overwrite the header placeholder. Adds file path tracking via a new `path()` accessor, implements `build_from` for efficient copying, and introduces arithmetic/set operations (`min`, `max`, `sum`, `diff`). Expands test coverage for persistence roundtrips, mutations, iteration, and the new vector operations.
2026-05-13 10:56:36 +08:00
Eric Coissac dfce956162 feat(obicompactvec): implement Iterator for PersistentCompactIntVec
Add an `Iter` struct that implements `Iterator` and `ExactSizeIterator` to enable idiomatic traversal of `&PersistentCompactIntVec`. The iterator maintains `slot` and `overflow_pos` state to correctly yield `u32` values from both primary and overflow memory regions. Includes three unit tests validating iteration correctness against direct indexing, accurate `len()` tracking, and proper reference-based iteration.
2026-05-13 10:28:17 +08:00
Eric Coissac 4d5fcd4340 refactor: split obicompactvec storage into primary and overflow files
Refactors the storage format to separate primary and overflow data into distinct files. Introduces a cache-friendly sparse index with dynamically computed step and entry counts. Consolidates dual memory-mapped regions into a single file with explicit header parsing and validation, replacing unsafe slice casting with direct byte-offset indexing. Updates the test suite to accommodate the new file structure.
2026-05-13 10:25:14 +08:00
Eric Coissac f2de79acde Add persistent compact integer vector and cache-line-optimized MPHF
Introduce the `obicompactvec` crate, featuring a two-tier, memory-mapped integer vector that uses a primary `u8` array with a sentinel for overflow dispatch and a sparse L1-resident index for fast random access. Implement builder and reader modules with zero-copy serialization and comprehensive test coverage. Update `obilayeredmap` to replace the default hash function with a cache-line-optimized `Mphf`, adding explicit bounds checking and duplicate-slot detection. Add documentation for both modules and update project configuration files accordingly.
2026-05-13 10:09:46 +08:00