refactor: restructure k-mer partitioning pipeline for memory efficiency
Replace in-memory hashing with a disk-backed external merge sort and `PersistentCompactIntVec` to drastically reduce peak RAM. Unify both phases using a custom `PtrHash` MPHF, eliminating `GOFunction` and `boomphf`. Introduce a concrete three-step `count_partition()` pipeline with adaptive chunk sizing based on available system memory. Update dependencies to `memmap2`, `ptr_hash`, and `obicompactvec`. Additionally, document strict genomics-only memory constraints and enforce an architectural feedback workflow requiring explicit user authorization before structural changes.
This commit is contained in:
@@ -6,20 +6,20 @@ Kmer indexing per partition proceeds in two phases. The separation is necessary
|
||||
|
||||
### Phase 1 — provisional MPHF + kmer spectrum
|
||||
|
||||
Implemented in `obikpartitionner::KmerPartition::count_kmer()`.
|
||||
Implemented in `obikpartitionner::KmerPartition::count_kmer()` → `count_partition()`.
|
||||
|
||||
1. **Pass 1**: read the dereplicated superkmer file; enumerate all unique canonical kmers into a `HashSet`. Exact count known after this pass.
|
||||
2. **Build a provisional MPHF** (`GOFunction` from the `ph` crate) over the exact kmer set. Produces `mphf1.bin`.
|
||||
3. **Create `counts1.bin`**: one zero-initialised `u32` per MPHF slot (mmap'd).
|
||||
4. **Pass 2**: re-read the dereplicated file; for each kmer, query `mphf1.get(kmer)` and atomically accumulate the superkmer count into `counts1[slot]`.
|
||||
1. **External sort**: read the dereplicated superkmer file; extract the raw `u64` canonical kmer value for every kmer of every superkmer. Sort in RAM-bounded chunks (adaptive budget: 40% of available RAM ÷ n_threads, minimum 1 M kmers per chunk), then k-way merge with inline dedup. Result: `sorted_unique.bin` — a flat array of f0 distinct sorted `u64` values. Exact kmer count f0 is known at this point.
|
||||
2. **Build provisional MPHF** (ptr_hash, same configuration as phase 2) over `sorted_unique.bin` using `new_from_par_iter`. Delete `sorted_unique.bin` immediately after. Persist to `mphf1.bin`.
|
||||
3. **Create `counts1.bin`**: `PersistentCompactIntVec` with f0 slots, zero-initialised.
|
||||
4. **Accumulation pass**: re-read the dereplicated superkmer file; for each kmer in each superkmer, compute `slot = mphf.index(kmer.raw())` and increment `counts1[slot]` by the superkmer's COUNT.
|
||||
5. **Build kmer frequency spectrum** from `counts1`: histogram `{count → n_kmers}`, totals f0 (distinct kmers) and f1 (total abundance). Written to `kmer_spectrum_raw.json` per partition, then merged globally.
|
||||
|
||||
Files produced per partition:
|
||||
|
||||
```
|
||||
part_XXXXX/
|
||||
mphf1.bin — GOFunction (provisional MPHF, discarded after phase 2)
|
||||
counts1.bin — [u32; n_kmers] kmer counts, mmap'd
|
||||
mphf1.bin — ptr_hash provisional MPHF (discarded after phase 2)
|
||||
counts1.bin — PersistentCompactIntVec, f0 × u32 kmer counts
|
||||
kmer_spectrum_raw.json — local frequency spectrum
|
||||
```
|
||||
|
||||
@@ -53,16 +53,13 @@ After filtering (applying a min-count threshold derived from the spectrum) and b
|
||||
**FMPH/FMPHGO** (`ph` crate, Beling, ACM JEA 2023):
|
||||
|
||||
- ~2.1 bits/key — most compact; good query speed; deterministic construction
|
||||
- Works well from an exact or slightly overestimated count
|
||||
- `GOFunction` (group-oriented variant) is the specific type used
|
||||
- `GOFunction` (group-oriented variant) was the original phase-1 choice; eliminated when the external sort made the exact count available at phase 1 as well
|
||||
|
||||
## MPHF choice per phase
|
||||
|
||||
**Phase 1** (provisional, discarded after spectrum computation): `ph::fmph::GOFunction`. Compact, fast to build from the exact post-dedup kmer set. Query speed is secondary — the structure is only used during pass 2 of `count_kmer`.
|
||||
**Both phases**: **ptr_hash**, same type alias and construction parameters. The external sort (phase 1) and the unitig index (phase 2) both provide the exact key count before MPHF construction, so ptr_hash's requirement is satisfied in both cases. Using a single MPHF implementation removes the `ph` crate dependency.
|
||||
|
||||
**Phase 2** (persistent, queried repeatedly): **ptr_hash**. Exact key count is available from the unitig index; ptr_hash query speed (≥2.1×) and construction speed (≥3.1× over FMPH) are the decisive factors. The 2.4 bits/key overhead is acceptable.
|
||||
|
||||
boomphf is eliminated: largest space overhead, streaming advantage does not apply.
|
||||
boomphf: eliminated — largest space overhead, streaming advantage no longer needed. FMPH/GOFunction: eliminated — exact count available, ptr_hash is faster at equivalent compactness.
|
||||
|
||||
---
|
||||
|
||||
|
||||
@@ -82,7 +82,13 @@ for each super-kmer (sequence, COUNT):
|
||||
kmer_counts[canonical(kmer)] += COUNT
|
||||
```
|
||||
|
||||
Implemented as an external sort or a temporary HashMap, depending on partition size. At the end of this phase, each distinct canonical kmer has its exact total count.
|
||||
Implemented as a three-step pipeline in `count_partition()`:
|
||||
|
||||
1. **External sort** (`kmer_sort::sort_unique_kmers`): read dereplicated superkmers, extract canonical kmer raw `u64` values, sort in RAM-bounded chunks (adaptive: 40% of available RAM ÷ n_threads, min 1 M kmers/chunk), k-way merge with inline dedup → `sorted_unique.bin`. f0 is now known exactly.
|
||||
2. **Provisional MPHF** (ptr_hash): built from `sorted_unique.bin` via `new_from_par_iter(f0, ...)`. Stored to `mphf1.bin`; `sorted_unique.bin` deleted immediately.
|
||||
3. **Accumulation pass**: re-read dereplicated superkmers; for each kmer, `slot = mphf.index(kmer.raw())`, increment `counts1[slot]` by the superkmer COUNT. Stored in a `PersistentCompactIntVec` (`counts1.bin`).
|
||||
|
||||
At the end of this phase, each distinct canonical kmer has its exact total count, and the frequency spectrum (`kmer_spectrum_raw.json`) is written.
|
||||
|
||||
Abundance filter applied here: kmers with `total_count < q` are discarded. `q` is a collection parameter (0 = keep all, including singletons for ≤1x data).
|
||||
|
||||
|
||||
Reference in New Issue
Block a user