feat: enhance memory budgeting and add rebuild diagnostics

This commit improves memory management by respecting Linux cgroup v1/v2 limits and adds a configurable memory budget to the new `rebuild` subcommand to prevent out-of-memory (OOM) failures during index reconstruction. The rebuild process now supports filtering, compaction, and parallelization. Diagnostics are expanded with debug-level tracing for partition merges and k-mer expansion tracking, plus utility flags for label renaming, matrix size breakdowns, per-genome counts, and partition distribution reporting. The stats struct also gains accessor methods for active and remaining memory.
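
As a rough illustration of what respecting cgroup limits can look like, the sketch below reads the cgroup v2 and v1 limit files at their usual mount points and derives a budget from the smaller of physical RAM and the cgroup limit. The function names (`parse_limit`, `cgroup_limit`, `memory_budget`) and the "unlimited" threshold are assumptions made for this example; they are not taken from the commit, and real systems may mount cgroups elsewhere or nest the process in a sub-cgroup.

```python
from pathlib import Path
from typing import Optional

# Usual mount points; assumed here, not taken from the commit.
CGROUP_V2_LIMIT = Path("/sys/fs/cgroup/memory.max")
CGROUP_V1_LIMIT = Path("/sys/fs/cgroup/memory/memory.limit_in_bytes")


def parse_limit(text: str) -> Optional[int]:
    """Parse a cgroup memory limit; return None when unlimited or malformed."""
    text = text.strip()
    if text == "max":  # cgroup v2 spelling of "no limit"
        return None
    try:
        value = int(text)
    except ValueError:
        return None
    # cgroup v1 reports "no limit" as a very large number close to 2**63.
    if value >= 1 << 60:
        return None
    return value


def cgroup_limit() -> Optional[int]:
    """Return the first finite limit found, trying v2 before v1."""
    for path in (CGROUP_V2_LIMIT, CGROUP_V1_LIMIT):
        try:
            limit = parse_limit(path.read_text())
        except OSError:
            continue  # file missing or unreadable: try the next layout
        if limit is not None:
            return limit
    return None


def memory_budget(physical_bytes: int, limit: Optional[int],
                  fraction: float = 0.5) -> int:
    """Budget = fraction of whichever is smaller: physical RAM or cgroup limit."""
    available = physical_bytes if limit is None else min(physical_bytes, limit)
    return int(available * fraction)
```

With 16 GiB of physical RAM inside a container limited to 4 GiB, `memory_budget(16 * 2**30, 4 * 2**30)` yields a 2 GiB budget at the default fraction of 0.5.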
2026-06-12 15:18:37 +02:00
parent 97e3fb9761
commit 52fd2cf801
3 changed files with 104 additions and 9 deletions
@@ -51,7 +51,13 @@ Non-ACGT characters act as hard breaks between k-mer segments in all formats.
               Runs scatter → dereplicate → count → layered MPHF.
               Resumes automatically if interrupted.
    merge      Merge multiple independently built indexes into one.
-    rebuild    Filter and compact an existing index: apply count thresholds,
+               Schedules partitions largest-first under a memory budget semaphore
+               to avoid OOM on machines with many cores. The worst partition runs
+               alone first to calibrate the expansion estimator; subsequent
+               partitions run in parallel within the budget.
+               --budget-fraction F  fraction of available RAM to use as budget
+                                    (default 0.5; reduce if OOM persists).
+    filter     Filter and compact an existing index: apply count thresholds,
               drop layers, rewrite as a single-layer index.
    reindex    Convert evidence in-place across all layers:
               exact (evidence.bin) ↔ approximate (fingerprint.bin).
@@ -74,7 +80,14 @@ Non-ACGT characters act as hard breaks between k-mer segments in all formats.
               Diagnostic / pipeline use.
    unitig     Dump the unitig sequences stored in a built index. Debug use.
    utils      Miscellaneous utilities.
-               --new-label NEW=OLD  renames a genome label in-place.
+               --new-label NEW=OLD      rename a genome label in-place.
+               --bits-per-kmer          print MPHF / evidence / matrix size breakdown.
+               --stats                  per-genome k-mer counts as CSV.
+               --partition-stats        partition size distribution across one or more
+                                        indexes (markdown report to stdout). Useful to
+                                        diagnose minimizer imbalance before a large merge.
+               --csv FILE               write per-(partition, source) raw data to FILE
+                                        (used with --partition-stats).

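The largest-first scheduling described for `merge` above can be sketched roughly as follows. This is a minimal illustration, not the tool's implementation: `MemoryBudget`, `run_partitions`, and the `process` callback are hypothetical names, and treating calibration as a single peak-to-input ratio measured on the largest partition is an assumption about how the expansion estimator might work.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class MemoryBudget:
    """A semaphore counted in bytes instead of permits."""

    def __init__(self, total):
        self.total = total
        self.free = total
        self.cond = threading.Condition()

    def acquire(self, amount):
        # Cap the request so an oversized job can still run, just alone.
        amount = min(amount, self.total)
        with self.cond:
            self.cond.wait_for(lambda: self.free >= amount)
            self.free -= amount
        return amount

    def release(self, amount):
        with self.cond:
            self.free += amount
            self.cond.notify_all()


def run_partitions(sizes, budget_bytes, process, workers=4):
    """sizes: partition name -> input bytes.
    process(name) does the work and returns the peak memory it observed."""
    order = sorted(sizes, key=sizes.get, reverse=True)  # largest first
    results = {}
    if not order:
        return order, 1.0, results

    # Calibration: the largest partition runs alone, and its observed
    # peak-to-input ratio becomes the estimate for everything else.
    first = order[0]
    results[first] = process(first)
    ratio = results[first] / sizes[first] if sizes[first] else 1.0

    budget = MemoryBudget(budget_bytes)

    def job(name):
        reserved = budget.acquire(int(sizes[name] * ratio))
        try:
            results[name] = process(name)
        finally:
            budget.release(reserved)

    # Remaining partitions run in parallel, each blocking until its
    # estimated footprint fits into what is left of the budget.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(job, order[1:]))
    return order, ratio, results
```

Submitting jobs in descending size order means the big reservations are taken while the budget is still mostly free, and small partitions fill the gaps afterwards. Lowering the budget (for example through a smaller `--budget-fraction`) reduces how many partitions run at once without changing the order.
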
 ## Quick start