refactor: restructure k-mer partitioning pipeline for memory efficiency

Replace in-memory hashing with a disk-backed external merge sort and `PersistentCompactIntVec` to drastically reduce peak RAM. Unify both phases using a custom `PtrHash` MPHF, eliminating `GOFunction` and `boomphf`. Introduce a concrete three-step `count_partition()` pipeline with adaptive chunk sizing based on available system memory. Update dependencies to `memmap2`, `ptr_hash`, and `obicompactvec`. Additionally, document strict genomics-only memory constraints and enforce an architectural feedback workflow requiring explicit user authorization before structural changes.
2026-05-17 15:34:44 +08:00
parent f36b095ce2
commit 4736a7b6de
10 changed files with 230 additions and 114 deletions
@@ -6,20 +6,20 @@ Kmer indexing per partition proceeds in two phases. The separation is necessary

 ### Phase 1 — provisional MPHF + kmer spectrum

-Implemented in `obikpartitionner::KmerPartition::count_kmer()`.
+Implemented in `obikpartitionner::KmerPartition::count_kmer()` → `count_partition()`.

-1. **Pass 1**: read the dereplicated superkmer file; enumerate all unique canonical kmers into a `HashSet`. Exact count known after this pass.
-2. **Build a provisional MPHF** (`GOFunction` from the `ph` crate) over the exact kmer set. Produces `mphf1.bin`.
-3. **Create `counts1.bin`**: one zero-initialised `u32` per MPHF slot (mmap'd).
-4. **Pass 2**: re-read the dereplicated file; for each kmer, query `mphf1.get(kmer)` and atomically accumulate the superkmer count into `counts1[slot]`.
+1. **External sort**: read the dereplicated superkmer file; extract the raw `u64` canonical kmer value for every kmer of every superkmer. Sort in RAM-bounded chunks (adaptive budget: 40% of available RAM ÷ n_threads, minimum 1 M kmers per chunk), then k-way merge with inline dedup. Result: `sorted_unique.bin` — a flat array of f0 distinct sorted `u64` values. Exact kmer count f0 is known at this point.
+2. **Build provisional MPHF** (ptr_hash, same configuration as phase 2) over `sorted_unique.bin` using `new_from_par_iter`. Delete `sorted_unique.bin` immediately after. Persist to `mphf1.bin`.
+3. **Create `counts1.bin`**: `PersistentCompactIntVec` with f0 slots, zero-initialised.
+4. **Accumulation pass**: re-read the dereplicated superkmer file; for each kmer in each superkmer, compute `slot = mphf.index(kmer.raw())` and increment `counts1[slot]` by the superkmer's COUNT.
 5. **Build kmer frequency spectrum** from `counts1`: histogram `{count → n_kmers}`, totals f0 (distinct kmers) and f1 (total abundance). Written to `kmer_spectrum_raw.json` per partition, then merged globally.

 Files produced per partition:

 ```
 part_XXXXX/
-  mphf1.bin               — GOFunction (provisional MPHF, discarded after phase 2)
-  counts1.bin             — [u32; n_kmers] kmer counts, mmap'd
+  mphf1.bin               — ptr_hash provisional MPHF (discarded after phase 2)
+  counts1.bin             — PersistentCompactIntVec, f0 × u32 kmer counts
  kmer_spectrum_raw.json  — local frequency spectrum
 ```

@@ -53,16 +53,13 @@ After filtering (applying a min-count threshold derived from the spectrum) and b
 **FMPH/FMPHGO** (`ph` crate, Beling, ACM JEA 2023):

 - ~2.1 bits/key — most compact; good query speed; deterministic construction
- Works well from an exact or slightly overestimated count
- `GOFunction` (group-oriented variant) is the specific type used
+- `GOFunction` (group-oriented variant) was the original phase-1 choice; eliminated when the external sort made the exact count available at phase 1 as well

 ## MPHF choice per phase

-**Phase 1** (provisional, discarded after spectrum computation): `ph::fmph::GOFunction`. Compact, fast to build from the exact post-dedup kmer set. Query speed is secondary — the structure is only used during pass 2 of `count_kmer`.
+**Both phases**: **ptr_hash**, same type alias and construction parameters. The external sort (phase 1) and the unitig index (phase 2) both provide the exact key count before MPHF construction, so ptr_hash's requirement is satisfied in both cases. Using a single MPHF implementation removes the `ph` crate dependency.

-**Phase 2** (persistent, queried repeatedly): **ptr_hash**. Exact key count is available from the unitig index; ptr_hash query speed (≥2.1×) and construction speed (≥3.1× over FMPH) are the decisive factors. The 2.4 bits/key overhead is acceptable.
-
-boomphf is eliminated: largest space overhead, streaming advantage does not apply.
+boomphf: eliminated — largest space overhead, streaming advantage no longer needed. FMPH/GOFunction: eliminated — exact count available, ptr_hash is faster at equivalent compactness.

 ---