refactor: restructure k-mer partitioning pipeline for memory efficiency

Replace in-memory hashing with a disk-backed external merge sort and `PersistentCompactIntVec` to drastically reduce peak RAM. Unify both phases using a custom `PtrHash` MPHF, eliminating `GOFunction` and `boomphf`. Introduce a concrete three-step `count_partition()` pipeline with adaptive chunk sizing based on available system memory. Update dependencies to `memmap2`, `ptr_hash`, and `obicompactvec`. Additionally, document strict genomics-only memory constraints and enforce an architectural feedback workflow requiring explicit user authorization before structural changes.
Eric Coissac
2026-05-17 15:34:44 +08:00
parent f36b095ce2
commit 4736a7b6de
10 changed files with 230 additions and 114 deletions
@@ -0,0 +1,17 @@
---
name: No architectural decisions without explicit authorization
description: Never make architectural or design decisions without explicit user approval — code decisions are the user's alone
type: feedback
---
Never make architectural decisions unilaterally. This includes:
- Memory layout or footprint changes
- Algorithm or data structure choices (HashSet vs streaming, etc.)
- Dependency additions or substitutions
- Structural refactors that go beyond the exact task requested
If a bug or inefficiency is observed, **report it and propose alternatives** — do not fix it without explicit authorization.
**Why:** The user optimizes for minimal memory footprint at all times. Introducing a HashSet in `count_kmer()` (replacing the intended streaming GOFunction construction from the sidecar estimate) caused a serious memory regression that went unreported. This is unacceptable on a project where memory efficiency is a core constraint.
**How to apply:** When editing code and noticing an architectural issue (even a clear improvement), stop, describe the problem and options, and wait for explicit go-ahead before touching anything.
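
**Illustration (hypothetical, not project code):** a minimal sketch of why the HashSet substitution matters for memory. Both function names are invented for this example, and k-mers are assumed to be encoded as `u64` values. The second function additionally assumes its input arrives already sorted, for instance from an external sort. The HashSet version holds every distinct k-mer in RAM at once, while the streaming version keeps only the previous value.

```rust
use std::collections::HashSet;

/// Counts distinct k-mers by materialising every one of them in a HashSet.
/// Peak memory grows with the number of distinct k-mers.
fn count_distinct_hashset<I: IntoIterator<Item = u64>>(kmers: I) -> usize {
    let seen: HashSet<u64> = kmers.into_iter().collect();
    seen.len()
}

/// Counts distinct k-mers from an already sorted stream (assumption).
/// Only the previous value is retained, so memory use stays constant.
fn count_distinct_sorted<I: IntoIterator<Item = u64>>(sorted_kmers: I) -> usize {
    let mut previous: Option<u64> = None;
    let mut distinct = 0usize;
    for kmer in sorted_kmers {
        // A new distinct value starts whenever the stream changes.
        if previous != Some(kmer) {
            distinct += 1;
            previous = Some(kmer);
        }
    }
    distinct
}
```

Both functions return the same count on sorted input. The difference is only in what they keep resident, which is exactly the kind of trade-off that should be reported and discussed rather than changed silently.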