Refactor: Extract utility function for string reversal

Extracted `inverser_chaine` into a reusable utility function with docstring and added unit test to ensure correctness.
2026-04-27 20:23:44 +02:00
parent e7fa60a3a2
commit ebbfe35cbc
22 changed files with 1807 additions and 62 deletions
@@ -1,6 +1,29 @@
-# MPHF selection — analysis in progress
+# MPHF selection — two-phase indexing architecture

-The choice of Minimal Perfect Hash Function for phase 6 is not yet settled. Three candidates were evaluated.
+## Indexing architecture
+
+Kmer indexing per partition proceeds in two phases. The separation is necessary because the exact number of unique kmers in a partition is not known until after counting and filtering.
+
+### Superkmer vs kmer counts
+
+The `SKFileMeta` sidecar written by `SKFileWriter` records `instances` (unique superkmers) and `length_sum` (total nucleotides). A superkmer of length L contains L − k + 1 kmers, so the kmer count per partition can be estimated as `length_sum − instances × (k − 1)`. This is an **overestimate** of unique kmers: two distinct superkmers (different flanking contexts, same minimizer) can share kmers. The exact count of unique kmers is only known after enumerating and deduplicating them.
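
The estimate above can be sketched as a small helper. This is a minimal illustration, not the project's code: `SidecarCounts` is a hypothetical struct that only mirrors the two sidecar fields named in the text, and `estimated_kmer_count` is an assumed name.

```rust
/// Hypothetical mirror of the two sidecar fields described above;
/// the real `SKFileMeta` layout is not shown in this document.
struct SidecarCounts {
    /// Number of unique superkmers in the partition.
    instances: u64,
    /// Total number of nucleotides over those superkmers.
    length_sum: u64,
}

/// Upper bound on the number of kmers in a partition.
///
/// Each superkmer of length L contributes L - k + 1 kmers, so summing over
/// all superkmers gives length_sum - instances * (k - 1). Saturating
/// arithmetic avoids underflow on inconsistent inputs.
fn estimated_kmer_count(meta: &SidecarCounts, k: u64) -> u64 {
    meta.length_sum
        .saturating_sub(meta.instances.saturating_mul(k.saturating_sub(1)))
}
```

For example, three superkmers of lengths 40, 35 and 31 with k = 31 give `length_sum = 106` and an estimate of 106 − 3 × 30 = 16 kmers (10 + 5 + 1).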
+
+Note: two superkmers sharing a kmer necessarily share the same minimizer and therefore always land in the same partition — no kmer can appear in two different partitions.
+
+### Phase 1 — provisional index and spectrum
+
+1. Enumerate all kmers from the dereplicated superkmers of the partition.
+2. Build a provisional MPHF over this key set; capacity is pre-allocated from the sidecar estimate (slight overestimate, harmless).
+3. Accumulate counts: for each kmer in each superkmer, `count[MPHF(kmer)] += sk.count()`.
+4. Compute the kmer frequency spectrum (histogram: occurrences → number of kmers).
+5. Apply count filter (e.g. discard singletons). After filtering, the exact number of surviving kmers is known.
+6. Discard the provisional MPHF.
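
The steps above can be sketched as follows. This is a simplified illustration under stated assumptions: a `HashMap` stands in for the provisional MPHF and its count array, `Superkmer` is a hypothetical type, kmers are kept as raw byte slices, and canonical (reverse-complement) handling is ignored.

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical stand-in for a dereplicated superkmer: its sequence and
/// the number of times it was observed.
struct Superkmer {
    seq: Vec<u8>,
    count: u64,
}

/// Steps 1 to 3: enumerate every kmer of every superkmer and add the
/// superkmer's count to that kmer's total. The HashMap replaces the
/// provisional MPHF plus count array of the real design.
fn accumulate_kmer_counts(superkmers: &[Superkmer], k: usize) -> HashMap<Vec<u8>, u64> {
    let mut counts = HashMap::new();
    for sk in superkmers {
        // `windows(0)` would panic, and short sequences contain no kmer.
        if k == 0 || sk.seq.len() < k {
            continue;
        }
        for kmer in sk.seq.windows(k) {
            *counts.entry(kmer.to_vec()).or_insert(0) += sk.count;
        }
    }
    counts
}

/// Step 4: frequency spectrum, mapping occurrences to number of kmers.
fn frequency_spectrum(counts: &HashMap<Vec<u8>, u64>) -> BTreeMap<u64, u64> {
    let mut spectrum = BTreeMap::new();
    for &c in counts.values() {
        *spectrum.entry(c).or_insert(0) += 1;
    }
    spectrum
}

/// Step 5: keep only kmers seen at least `min_count` times. The size of
/// the result is the exact key count needed for phase 2.
fn filter_by_count(counts: HashMap<Vec<u8>, u64>, min_count: u64) -> HashMap<Vec<u8>, u64> {
    counts.into_iter().filter(|(_, c)| *c >= min_count).collect()
}
```

With two toy superkmers `ACGTA` (count 2) and `CGTAC` (count 1) and k = 4, the shared kmer `CGTA` accumulates a count of 3, and the sidecar formula would have estimated 4 kmers where only 3 are unique.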
+
+### Phase 2 — definitive index
+
+Build a new MPHF over the filtered kmer set only, with the exact key count available. This is the persistent per-partition index used for all downstream operations (queries, set operations).
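
To make the contract of the definitive index concrete, here is a toy stand-in that maps each surviving kmer to a distinct slot in `0..n`. It is not an MPHF: it stores the keys and uses binary search, whereas a real MPHF stores only a few bits per key and returns an arbitrary slot for keys outside the set. `DenseIndex` and its methods are hypothetical names.

```rust
/// Toy stand-in for the phase-2 index: sorted, deduplicated keys, where the
/// slot of a key is its rank. A real MPHF gives the same bijection onto
/// 0..n without storing the keys.
struct DenseIndex {
    keys: Vec<Vec<u8>>,
}

impl DenseIndex {
    /// Build from the filtered kmer set; the exact key count is `keys.len()`
    /// after deduplication.
    fn build(mut keys: Vec<Vec<u8>>) -> Self {
        keys.sort_unstable();
        keys.dedup();
        DenseIndex { keys }
    }

    /// Number of indexed kmers.
    fn len(&self) -> usize {
        self.keys.len()
    }

    /// Slot of `kmer` in 0..len(), or None if it was not indexed.
    fn slot(&self, kmer: &[u8]) -> Option<usize> {
        self.keys
            .binary_search_by(|probe| probe.as_slice().cmp(kmer))
            .ok()
    }
}
```

Per-kmer payloads (counts, set membership bits) would then live in a plain array of length `len()` addressed by `slot`.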
+
+---

 ## Candidates

@@ -8,31 +31,41 @@ The choice of Minimal Perfect Hash Function for phase 6 is not yet settled. Thre

 - ~3.7 bits/key; mature crate, used in production bioinformatics (Pufferfish, Piscem)
 - Parallel construction; well-tested with DNA kmer data at scale
- Drawback: largest space footprint of the three
+- Drawback: largest space footprint of the three. Its main differentiator, streaming construction without an exact key count, is irrelevant here because the exact count is available at phase 2

 **ptr_hash** (PtrHash algorithm, Groot Koerkamp, SEA 2025):

 - ~2.4 bits/key; fastest queries (≥2.1× over alternatives, 8–12 ns/key for u64 in tight loops) and fastest construction (≥3.1×)
- Theoretical foundation solid; paper and Rust crate from the same author
+- Requires exact key count at construction — available at phase 2
 - Drawback: published February 2025 — very young, no production track record

 **FMPHGO** (`ph` crate, Beling, ACM JEA 2023):

 - ~2.1 bits/key — most compact of the three; good query speed; parallelisable construction
 - More established than ptr_hash; actively maintained
- Currently preferred candidate
+- Works well with overestimated capacity → natural fit for phase 1
+
+## MPHF choice per phase
+
+**Phase 1** (provisional, discarded after spectrum computation): FMPHGO. Tolerates overestimated capacity, compact, no need to optimise for query speed on a temporary structure.
+
+**Phase 2** (persistent, queried repeatedly): open between FMPHGO and ptr_hash. Exact key count is available, so both operate optimally. ptr_hash's query speed advantage (2.1–3.3×) is meaningful for the persistent index but carries the risk of a very young crate. FMPHGO is the conservative default; ptr_hash is worth revisiting once it has broader production use.
+
+boomphf is effectively eliminated: its space overhead is the largest and its streaming-construction advantage does not apply here.
+
+---

 ## Space at scale

-For 1 024 partitions × 100 M kmers/partition:
+For 1 024 partitions × 100 M kmers/partition (phase 2 index, after filtering):

-| MPHF    | bits/key | Total MPHF size |
-|---------|----------|-----------------|
-| boomphf | 3.7      | ~47 GB          |
-| ptr_hash | 2.4     | ~31 GB          |
-| FMPHGO  | 2.1      | ~27 GB          |
+| MPHF     | bits/key | Total MPHF size |
+|----------|----------|-----------------|
+| boomphf  | 3.7      | ~47 GB          |
+| ptr_hash | 2.4      | ~31 GB          |
+| FMPHGO   | 2.1      | ~27 GB          |
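
The totals in the table follow from keys × bits/key ÷ 8, expressed in decimal gigabytes. A minimal helper to reproduce them (the function name is illustrative):

```rust
/// Total MPHF size in decimal gigabytes (1 GB = 1e9 bytes) for a given
/// number of partitions, keys per partition, and bits per key.
fn total_mphf_gb(partitions: u64, keys_per_partition: u64, bits_per_key: f64) -> f64 {
    let total_keys = (partitions * keys_per_partition) as f64;
    // bits -> bytes -> gigabytes
    total_keys * bits_per_key / 8.0 / 1e9
}
```

For 1 024 partitions of 100 M keys this gives about 47.4 GB at 3.7 bits/key, 30.7 GB at 2.4 bits/key, and 26.9 GB at 2.1 bits/key, matching the rounded values above.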

-In practice, partition sizes depend on the dataset. For a human genome at 30× coverage with p=10 (1 024 partitions), realistic partition sizes are 3–30 M kmers → 1–8 MB per MPHF, well within RAM.
+For a human genome at 30× coverage with 1 024 partitions, realistic partition sizes are 3–30 M unique kmers, i.e. roughly 1–8 MB per phase-2 MPHF at ~2.1 bits/key (up to ~14 MB at 3.7 bits/key), well within RAM.

 ## On-disk and mmap considerations

@@ -42,7 +75,7 @@ No established Rust crate provides a natively on-disk MPHF. **SSHash** (Sparse a

 ## Open questions

- Confirm actual partition sizes on representative metagenomic datasets before fixing the choice.
- Evaluate whether ptr_hash's query speed advantage (2.1–3.3×) justifies adopting a crate that is less than a year old.
- Assess rkyv integration cost for FMPHGO if true zero-copy mmap becomes necessary.
+- Confirm actual partition sizes and the overestimation factor (sidecar estimate vs. exact unique kmer count) on representative metagenomic datasets.
+- Revisit ptr_hash for phase 2 once the crate has a broader production track record.
+- Assess rkyv integration cost for FMPHGO if true zero-copy mmap becomes necessary for the persistent index.
 - Keep SSHash in mind if the indexing architecture is reconsidered at a higher level.