refactor: restructure k-mer partitioning pipeline for memory efficiency

Replace in-memory hashing with a disk-backed external merge sort and `PersistentCompactIntVec` to drastically reduce peak RAM. Unify both phases using a custom `PtrHash` MPHF, eliminating `GOFunction` and `boomphf`. Introduce a concrete three-step `count_partition()` pipeline with adaptive chunk sizing based on available system memory. Update dependencies to `memmap2`, `ptr_hash`, and `obicompactvec`. Additionally, document strict genomics-only memory constraints and enforce an architectural feedback workflow requiring explicit user authorization before structural changes.
Eric Coissac
2026-05-17 15:34:44 +08:00
parent f36b095ce2
commit 4736a7b6de
10 changed files with 230 additions and 114 deletions
+10 -13
@@ -6,20 +6,20 @@ Kmer indexing per partition proceeds in two phases. The separation is necessary
### Phase 1 — provisional MPHF + kmer spectrum
-Implemented in `obikpartitionner::KmerPartition::count_kmer()`.
+Implemented in `obikpartitionner::KmerPartition::count_kmer()` / `count_partition()`.
-1. **Pass 1**: read the dereplicated superkmer file; enumerate all unique canonical kmers into a `HashSet`. Exact count known after this pass.
-2. **Build a provisional MPHF** (`GOFunction` from the `ph` crate) over the exact kmer set. Produces `mphf1.bin`.
-3. **Create `counts1.bin`**: one zero-initialised `u32` per MPHF slot (mmap'd).
-4. **Pass 2**: re-read the dereplicated file; for each kmer, query `mphf1.get(kmer)` and atomically accumulate the superkmer count into `counts1[slot]`.
+1. **External sort**: read the dereplicated superkmer file; extract the raw `u64` canonical kmer value for every kmer of every superkmer. Sort in RAM-bounded chunks (adaptive budget: 40% of available RAM ÷ n_threads, minimum 1 M kmers per chunk), then k-way merge with inline dedup. Result: `sorted_unique.bin` — a flat array of f0 distinct sorted `u64` values. Exact kmer count f0 is known at this point.
+2. **Build provisional MPHF** (ptr_hash, same configuration as phase 2) over `sorted_unique.bin` using `new_from_par_iter`. Delete `sorted_unique.bin` immediately after. Persist to `mphf1.bin`.
+3. **Create `counts1.bin`**: `PersistentCompactIntVec` with f0 slots, zero-initialised.
+4. **Accumulation pass**: re-read the dereplicated superkmer file; for each kmer in each superkmer, compute `slot = mphf.index(kmer.raw())` and increment `counts1[slot]` by the superkmer's COUNT.
5. **Build kmer frequency spectrum** from `counts1`: histogram `{count → n_kmers}`, totals f0 (distinct kmers) and f1 (total abundance). Written to `kmer_spectrum_raw.json` per partition, then merged globally.
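The external-sort step in the new list can be sketched as follows. This is a minimal in-memory illustration, not the code from this commit: the function names `chunk_len`, `sorted_runs` and `merge_dedup` are hypothetical, runs are held in `Vec`s rather than spilled to disk, and the caller is assumed to supply the amount of available RAM.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical helper: number of kmers per chunk under the budget rule
/// from step 1 (40% of available RAM split across threads, with a floor
/// of 1 M kmers). Each kmer is one raw `u64`.
fn chunk_len(available_ram_bytes: u64, n_threads: u64) -> usize {
    const MIN_KMERS: u64 = 1_000_000;
    // Divide before multiplying so very large inputs cannot overflow.
    let budget_bytes = available_ram_bytes / 5 * 2 / n_threads.max(1);
    let kmers = budget_bytes / std::mem::size_of::<u64>() as u64;
    kmers.max(MIN_KMERS) as usize
}

/// Sort each chunk independently. In a disk-backed version every run
/// would be written to a temporary file instead of kept in memory.
fn sorted_runs(kmers: &[u64], chunk_len: usize) -> Vec<Vec<u64>> {
    kmers
        .chunks(chunk_len.max(1))
        .map(|chunk| {
            let mut run = chunk.to_vec();
            run.sort_unstable();
            run
        })
        .collect()
}

/// K-way merge of sorted runs with inline dedup: a min-heap holds the
/// current head of every run, and a value is emitted only if it differs
/// from the last value emitted.
fn merge_dedup(runs: &[Vec<u64>]) -> Vec<u64> {
    let mut heap = BinaryHeap::new();
    for (run_idx, run) in runs.iter().enumerate() {
        if let Some(&value) = run.first() {
            heap.push(Reverse((value, run_idx, 0usize)));
        }
    }
    let mut out: Vec<u64> = Vec::new();
    while let Some(Reverse((value, run_idx, pos))) = heap.pop() {
        if out.last() != Some(&value) {
            out.push(value);
        }
        if let Some(&next) = runs[run_idx].get(pos + 1) {
            heap.push(Reverse((next, run_idx, pos + 1)));
        }
    }
    out
}
```

The length of the merged output is the exact distinct-kmer count f0, which is what allows the MPHF to be built from a known key count in step 2.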
Files produced per partition:
```
part_XXXXX/
-mphf1.bin — GOFunction (provisional MPHF, discarded after phase 2)
-counts1.bin — [u32; n_kmers] kmer counts, mmap'd
+mphf1.bin — ptr_hash provisional MPHF (discarded after phase 2)
+counts1.bin — PersistentCompactIntVec, f0 × u32 kmer counts
kmer_spectrum_raw.json — local frequency spectrum
```
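How `counts1.bin` and `kmer_spectrum_raw.json` relate can be illustrated with a small sketch of the accumulation pass and the spectrum computation. The assumptions are stated loudly: a binary search over the sorted key array stands in for the ptr_hash lookup, a plain `Vec<u32>` stands in for `PersistentCompactIntVec`, and the `Spectrum` struct and both function names are hypothetical.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory view of a per-partition spectrum.
struct Spectrum {
    /// count -> number of distinct kmers observed with that count
    histogram: BTreeMap<u32, u64>,
    /// number of distinct kmers
    f0: u64,
    /// total abundance (sum of all counts)
    f1: u64,
}

/// Accumulation pass: each superkmer is a list of raw kmer values plus
/// its COUNT. Every kmer adds that COUNT to its slot.
fn accumulate(sorted_unique: &[u64], superkmers: &[(Vec<u64>, u32)]) -> Vec<u32> {
    let mut counts = vec![0u32; sorted_unique.len()];
    for (kmers, count) in superkmers {
        for kmer in kmers {
            // Stand-in for the MPHF query; the real pipeline would ask
            // the provisional MPHF for the slot instead.
            let slot = sorted_unique
                .binary_search(kmer)
                .expect("kmer missing from the key set");
            counts[slot] = counts[slot].saturating_add(*count);
        }
    }
    counts
}

/// Build the histogram and the f0 / f1 totals from the per-slot counts.
fn spectrum(counts: &[u32]) -> Spectrum {
    let mut histogram = BTreeMap::new();
    let mut f1 = 0u64;
    for &c in counts {
        *histogram.entry(c).or_insert(0u64) += 1;
        f1 += u64::from(c);
    }
    Spectrum { histogram, f0: counts.len() as u64, f1 }
}
```

Because the key set is built from the same file that the accumulation pass re-reads, every queried kmer is guaranteed to be a key, which is the precondition for using an MPHF without storing the keys.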
@@ -53,16 +53,13 @@ After filtering (applying a min-count threshold derived from the spectrum) and b
**FMPH/FMPHGO** (`ph` crate, Beling, ACM JEA 2023):
- ~2.1 bits/key — most compact; good query speed; deterministic construction
- Works well from an exact or slightly overestimated count
-- `GOFunction` (group-oriented variant) is the specific type used
+- `GOFunction` (group-oriented variant) was the original phase-1 choice; eliminated when the external sort made the exact count available at phase 1 as well
## MPHF choice per phase
-**Phase 1** (provisional, discarded after spectrum computation): `ph::fmph::GOFunction`. Compact, fast to build from the exact post-dedup kmer set. Query speed is secondary — the structure is only used during pass 2 of `count_kmer`.
+**Both phases**: **ptr_hash**, same type alias and construction parameters. The external sort (phase 1) and the unitig index (phase 2) both provide the exact key count before MPHF construction, so ptr_hash's requirement is satisfied in both cases. Using a single MPHF implementation removes the `ph` crate dependency.
-**Phase 2** (persistent, queried repeatedly): **ptr_hash**. Exact key count is available from the unitig index; ptr_hash query speed (≥2.1×) and construction speed (≥3.1× over FMPH) are the decisive factors. The 2.4 bits/key overhead is acceptable.
-boomphf is eliminated: largest space overhead, streaming advantage does not apply.
+boomphf: eliminated — largest space overhead, streaming advantage no longer needed. FMPH/GOFunction: eliminated — exact count available, ptr_hash is faster at equivalent compactness.
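To put the bits-per-key figures quoted in this document (about 2.1 for FMPH, 2.4 for ptr_hash) into concrete terms, here is a rough size estimate. The helper `mphf_bytes` is hypothetical, and the key count used below is an illustrative round number, not a measurement from this project.

```rust
/// Hypothetical helper: approximate MPHF size in bytes for `n_keys` keys.
/// Bits per key are passed in tenths (24 means 2.4 bits/key) so that the
/// arithmetic stays in integers; the result is rounded up.
fn mphf_bytes(n_keys: u64, tenths_of_bit_per_key: u64) -> u64 {
    // 80 = 8 bits per byte * 10 tenths per bit.
    (n_keys * tenths_of_bit_per_key + 79) / 80
}
```

For one billion keys this gives 300,000,000 bytes at 2.4 bits/key against 262,500,000 bytes at 2.1 bits/key, a difference of 37.5 MB per billion keys.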
---