feat(bitvec): add partial Jaccard, fix padding, optimize constructor

Introduces `partial_jaccard_dist`, which returns the raw intersection and union counts so that callers can derive the Jaccard distance themselves. Corrects `not()` to explicitly zero the padding bits in the final word, so bit counts stay accurate when that word is only partially filled. Adds an optimized `build_from_counts` constructor.
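
The two bit-vector changes can be illustrated with a minimal sketch. It is not the code from this commit: the `BitVec` struct, its `u64` word layout, the method signatures, and the `jaccard_dist` helper are all assumptions made for illustration. It shows why `not()` must mask the final word and how raw intersection and union counts can be turned into a distance afterwards.

```rust
/// Hypothetical fixed-length bit vector backed by 64-bit words
/// (layout assumed for illustration, not taken from the commit).
#[derive(Clone, Debug, PartialEq)]
struct BitVec {
    words: Vec<u64>,
    len: usize, // number of valid bits; the rest of the last word is padding
}

impl BitVec {
    fn new(len: usize) -> Self {
        BitVec { words: vec![0; (len + 63) / 64], len }
    }

    fn set(&mut self, i: usize) {
        assert!(i < self.len);
        self.words[i / 64] |= 1u64 << (i % 64);
    }

    /// Bitwise complement. Inverting every word also flips the padding
    /// bits to 1, so the final word is masked back to its valid bits.
    fn not(&self) -> Self {
        let mut words: Vec<u64> = self.words.iter().map(|w| !w).collect();
        let used = self.len % 64;
        if used != 0 {
            if let Some(last) = words.last_mut() {
                *last &= (1u64 << used) - 1;
            }
        }
        BitVec { words, len: self.len }
    }

    fn count_ones(&self) -> u64 {
        self.words.iter().map(|w| u64::from(w.count_ones())).sum()
    }

    /// Raw (intersection, union) bit counts; signature assumed.
    fn partial_jaccard_dist(&self, other: &Self) -> (u64, u64) {
        assert_eq!(self.len, other.len);
        self.words
            .iter()
            .zip(&other.words)
            .fold((0, 0), |(inter, union), (a, b)| {
                (
                    inter + u64::from((a & b).count_ones()),
                    union + u64::from((a | b).count_ones()),
                )
            })
    }
}

/// Hypothetical helper: Jaccard distance from raw counts.
/// Two empty sets are treated as identical (distance 0).
fn jaccard_dist(inter: u64, union: u64) -> f64 {
    if union == 0 {
        0.0
    } else {
        1.0 - inter as f64 / union as f64
    }
}
```

Because the counts are plain integers, pairs obtained from several vectors can be summed component-wise before `jaccard_dist` is applied once to the totals.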
Eric Coissac
2026-05-14 21:28:25 +08:00
parent b218bf012b
commit 1881e98bad
4 changed files with 290 additions and 39 deletions
@@ -246,6 +246,12 @@ Each partition's new layer is built independently; the operation is fully parall
---
## Relationship to target architecture
The target architecture (see [Kmer index architecture](../architecture/index_architecture.md)) separates `MphfLayer` from data stores entirely and introduces a `PartitionedIndex` with parallel dispatch and an `Aggregator` pattern. The current implementation is a stepping stone: `obicompactvec` types are already fully decoupled from the MPHF; the remaining refactoring is within `obilayeredmap` itself.
---
## Open questions
- **Mode 4**: count matrix (n_kmers × n_genomes × bytes_per_count) is structurally identical to mode 3 but uses `PersistentCompactIntMatrix` with G columns. Build API not yet implemented. Scale concern: hundreds of GB for large collections — a sparse representation may be required at high genome counts.
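
The scale concern follows directly from the product n_kmers × n_genomes × bytes_per_count. A small sketch of that arithmetic is given here; the function name and the example figures are hypothetical, not measurements from this project.

```rust
/// Size in bytes of a dense count matrix:
/// n_kmers × n_genomes × bytes_per_count.
/// Returns `None` if the product overflows `u64`.
/// (Hypothetical helper, not part of the codebase.)
fn dense_matrix_bytes(n_kmers: u64, n_genomes: u64, bytes_per_count: u64) -> Option<u64> {
    n_kmers
        .checked_mul(n_genomes)?
        .checked_mul(bytes_per_count)
}
```

With illustrative inputs of 1,000,000,000 k-mers, 500 genomes, and 1 byte per count, the function returns `Some(500_000_000_000)`, i.e. 500 GB for the dense layout.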