refactor: restructure k-mer partitioning pipeline for memory efficiency

Replace in-memory hashing with a disk-backed external merge sort and `PersistentCompactIntVec` to drastically reduce peak RAM. Unify both phases using a custom `PtrHash` MPHF, eliminating `GOFunction` and `boomphf`. Introduce a concrete three-step `count_partition()` pipeline with adaptive chunk sizing based on available system memory. Update dependencies to `memmap2`, `ptr_hash`, and `obicompactvec`. Additionally, document strict genomics-only memory constraints and enforce an architectural feedback workflow requiring explicit user authorization before structural changes.
2026-05-17 15:34:44 +08:00
parent f36b095ce2
commit 4736a7b6de
10 changed files with 230 additions and 114 deletions
@@ -6,20 +6,20 @@ Kmer indexing per partition proceeds in two phases. The separation is necessary

 ### Phase 1 — provisional MPHF + kmer spectrum

-Implemented in `obikpartitionner::KmerPartition::count_kmer()`.
+Implemented in `obikpartitionner::KmerPartition::count_kmer()` → `count_partition()`.

-1. **Pass 1**: read the dereplicated superkmer file; enumerate all unique canonical kmers into a `HashSet`. Exact count known after this pass.
-2. **Build a provisional MPHF** (`GOFunction` from the `ph` crate) over the exact kmer set. Produces `mphf1.bin`.
-3. **Create `counts1.bin`**: one zero-initialised `u32` per MPHF slot (mmap'd).
-4. **Pass 2**: re-read the dereplicated file; for each kmer, query `mphf1.get(kmer)` and atomically accumulate the superkmer count into `counts1[slot]`.
+1. **External sort**: read the dereplicated superkmer file; extract the raw `u64` canonical kmer value for every kmer of every superkmer. Sort in RAM-bounded chunks (adaptive budget: 40% of available RAM ÷ n_threads, minimum 1 M kmers per chunk), then k-way merge with inline dedup. Result: `sorted_unique.bin` — a flat array of f0 distinct sorted `u64` values. Exact kmer count f0 is known at this point.
+2. **Build provisional MPHF** (ptr_hash, same configuration as phase 2) over `sorted_unique.bin` using `new_from_par_iter`. Delete `sorted_unique.bin` immediately after. Persist to `mphf1.bin`.
+3. **Create `counts1.bin`**: `PersistentCompactIntVec` with f0 slots, zero-initialised.
+4. **Accumulation pass**: re-read the dereplicated superkmer file; for each kmer in each superkmer, compute `slot = mphf.index(kmer.raw())` and increment `counts1[slot]` by the superkmer's COUNT.
 5. **Build kmer frequency spectrum** from `counts1`: histogram `{count → n_kmers}`, totals f0 (distinct kmers) and f1 (total abundance). Written to `kmer_spectrum_raw.json` per partition, then merged globally.

 Files produced per partition:

 ```
 part_XXXXX/
-  mphf1.bin               — GOFunction (provisional MPHF, discarded after phase 2)
-  counts1.bin             — [u32; n_kmers] kmer counts, mmap'd
+  mphf1.bin               — ptr_hash provisional MPHF (discarded after phase 2)
+  counts1.bin             — PersistentCompactIntVec, f0 × u32 kmer counts
  kmer_spectrum_raw.json  — local frequency spectrum
 ```

@@ -53,16 +53,13 @@ After filtering (applying a min-count threshold derived from the spectrum) and b
 **FMPH/FMPHGO** (`ph` crate, Beling, ACM JEA 2023):

 - ~2.1 bits/key — most compact; good query speed; deterministic construction
- Works well from an exact or slightly overestimated count
- `GOFunction` (group-oriented variant) is the specific type used
+- `GOFunction` (group-oriented variant) was the original phase-1 choice; eliminated when the external sort made the exact count available at phase 1 as well

 ## MPHF choice per phase

-**Phase 1** (provisional, discarded after spectrum computation): `ph::fmph::GOFunction`. Compact, fast to build from the exact post-dedup kmer set. Query speed is secondary — the structure is only used during pass 2 of `count_kmer`.
+**Both phases**: **ptr_hash**, same type alias and construction parameters. The external sort (phase 1) and the unitig index (phase 2) both provide the exact key count before MPHF construction, so ptr_hash's requirement is satisfied in both cases. Using a single MPHF implementation removes the `ph` crate dependency.

-**Phase 2** (persistent, queried repeatedly): **ptr_hash**. Exact key count is available from the unitig index; ptr_hash query speed (≥2.1×) and construction speed (≥3.1× over FMPH) are the decisive factors. The 2.4 bits/key overhead is acceptable.
-
-boomphf is eliminated: largest space overhead, streaming advantage does not apply.
+boomphf: eliminated — largest space overhead, streaming advantage no longer needed. FMPH/GOFunction: eliminated — exact count available, ptr_hash is faster at equivalent compactness.

 ---

@@ -82,7 +82,13 @@ for each super-kmer (sequence, COUNT):
        kmer_counts[canonical(kmer)] += COUNT
 ```

-Implemented as an external sort or a temporary HashMap, depending on partition size. At the end of this phase, each distinct canonical kmer has its exact total count.
+Implemented as a three-step pipeline in `count_partition()`:
+
+1. **External sort** (`kmer_sort::sort_unique_kmers`): read dereplicated superkmers, extract canonical kmer raw `u64` values, sort in RAM-bounded chunks (adaptive: 40% of available RAM ÷ n_threads, min 1 M kmers/chunk), k-way merge with inline dedup → `sorted_unique.bin`. f0 is now known exactly.
+2. **Provisional MPHF** (ptr_hash): built from `sorted_unique.bin` via `new_from_par_iter(f0, ...)`. Stored to `mphf1.bin`; `sorted_unique.bin` deleted immediately.
+3. **Accumulation pass**: re-read dereplicated superkmers; for each kmer, `slot = mphf.index(kmer.raw())`, increment `counts1[slot]` by the superkmer COUNT. Stored in a `PersistentCompactIntVec` (`counts1.bin`).
+
+At the end of this phase, each distinct canonical kmer has its exact total count, and the frequency spectrum (`kmer_spectrum_raw.json`) is written.

 Abundance filter applied here: kmers with `total_count < q` are discarded. `q` is a collection parameter (0 = keep all, including singletons for ≤1x data).