Merge pull request 'Push ylnwstyzqwrt' (#22) from push-ylnwstyzqwrt into main

Reviewed-on: #22
2026-06-12 10:10:03 +00:00
parent 94e0a370b3 b5e027f23b
commit 97e3fb9761
11 changed files with 865 additions and 81 deletions
@@ -0,0 +1,207 @@
 # Merge parallelism and memory pressure
 ## Problem observed
 Running `obikmer merge` over 109 indexes (108 sources + 1 bootstrap) on a 192-core machine
 produces a fatal OOM during the `merge_partitions` stage:
 ```
 memory allocation of 9126805520 bytes failed
 ```
 A single allocation of ~8.5 GiB (9.13 GB) fails. This is not an aggregate; it is one `malloc` call
 from hashbrown during a HashMap resize.
 ---
 ## Root cause
 ### The merge pipeline per partition
 ```
 source unitigs.bin
  → iter_indexed_canonical_kmers()
  → GraphDeBruijn::push()       ← HashSet<u64> + 1 byte flags, all in RAM
  → compute_degrees_and_mark_starts()
  → try_for_each_unitig()
  → unitigs.bin (new layer)
  → Layer::build() → MPHF + evidence
 ```
 `GraphDeBruijn` is a `FastHashMap<CanonicalKmer, AtomicU8>`: in effect a `HashSet<u64>` with
 one flag byte per node. Neighbor lookup is implicit: 4 probes into the same map.
 No edges are stored. The full kmer set of one partition must reside in RAM
 simultaneously to compute degrees and mark unitig starts.
 The matrix builders that follow (pass 2) are mmapped files — they do **not** consume
 significant RAM. The pressure is entirely in pass 1.
 ### Unbounded Rayon parallelism
 With 192 cores, Rayon ran up to 192 partitions concurrently. Each partition built its
 own `GraphDeBruijn` accumulating all kmers absent from the destination. Peak memory =
 192 × peak_partition_hashset.
 ### The 8.5 GB single allocation
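 hashbrown allocates the entire backing array in one call when rehashing:
 `buckets × sizeof((K, V))` for the slots, then `buckets + 16` control bytes
 (one per bucket plus one trailing SIMD group, 16 bytes wide on x86-64). `buckets` is a
 power of two, and the table grows once occupancy would exceed the 7/8 load factor.
 For `(u64, AtomicU8)` with alignment: 16 bytes per slot, 17 with its control byte.
 ```
 9 126 805 520 B = 2^29 slots × 17 B + 16 B
 growth past 7/8 × 2^28 ≈ 235 M entries → ≥ 235 M new kmers in one partition
 ```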
 Plausible for the largest partition of 108 Salix/Betula sources (~450 Mbp each).
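The failed allocation size can be cross-checked against hashbrown's table layout. The following is a minimal sketch, assuming 16-byte `(u64, AtomicU8)` slots and the 16-byte SSE2 control group of x86-64; the helper names `table_alloc_bytes` and `capacity` are illustrative and belong neither to obikmer nor to hashbrown.

```rust
/// Bytes requested for a table of `buckets` slots (a power of two):
/// the slot array, one control byte per bucket, and one trailing group.
/// Assumption: x86-64 with SSE2, where the control group is 16 bytes wide.
fn table_alloc_bytes(buckets: u64, slot_size: u64) -> u64 {
    const GROUP_WIDTH: u64 = 16;
    buckets * slot_size + buckets + GROUP_WIDTH
}

/// Largest number of entries a table of `buckets` slots holds before it
/// must grow (7/8 load factor; valid for tables of at least 8 buckets).
fn capacity(buckets: u64) -> u64 {
    buckets / 8 * 7
}
```

With `buckets = 1 << 29` and `slot_size = 16`, `table_alloc_bytes` gives exactly the 9 126 805 520 bytes of the error message, and `capacity(1 << 28)` gives the number of entries that must already be present for that growth to start.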
 ---
 ## Partition size distribution
 `obikmer utils --partition-stats` measures the sum of `unitigs.bin` file sizes
 per partition across all source indexes (pure `stat()` syscalls, negligible cost).
 Observed on a 9-genome pilot (256 partitions):
 | Stat | Value |
 |---|---|
 | min    | 30.5 MB |
 | max    | 232.1 MB |
 | mean   | 40.1 MB |
 | median | 37.2 MB |
 | p95    | 47.1 MB |
 | max/median ratio | 6.23× |
 The distribution is **bimodal with a heavy tail**:
 - 238/256 partitions in a narrow 30–50 MB band
 - 4 structurally extreme partitions (3–6× the median): 221, 233, 135, 191
 These correspond to minimizers over-represented in repetitive regions shared across
 all sources. They are extreme in every run on this dataset.
 With 109 sources, outlier partitions do not scale linearly: only kmers **absent from
 the destination** enter the GraphDeBruijn, and inter-source overlap is high for closely
 related species. Partition 221 is the likely trigger for the 8.5 GB crash.
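The summary statistics in the table can be reproduced from a list of per-partition byte totals. This is a sketch only: `SizeSummary` and `summarize` are hypothetical names, and the nearest-rank percentile rule is an assumption, since the method used by `--partition-stats` is not shown here.

```rust
/// Summary of per-partition sizes, in bytes.
#[derive(Debug, PartialEq)]
pub struct SizeSummary {
    pub min: u64,
    pub max: u64,
    pub mean: f64,
    pub median: u64,
    pub p95: u64,
}

/// Returns `None` for an empty input.
pub fn summarize(sizes: &[u64]) -> Option<SizeSummary> {
    if sizes.is_empty() {
        return None;
    }
    let mut s = sizes.to_vec();
    s.sort_unstable();
    let n = s.len();
    // Nearest-rank percentile: the value at 1-based rank ceil(p * n / 100).
    let rank = |p: usize| s[((p * n + 99) / 100).max(1) - 1];
    Some(SizeSummary {
        min: s[0],
        max: s[n - 1],
        mean: s.iter().sum::<u64>() as f64 / n as f64,
        median: rank(50),
        p95: rank(95),
    })
}
```

Dividing `max` by `median` gives the ratio reported in the last table row, which is the quickest indicator of structurally extreme partitions.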
 ---
 ## Solution: LFD scheduling + memory budget semaphore
 ### Principle
 Pre-sort partitions by **decreasing estimated size** (the LFD order of the heading; the same idea as First Fit Decreasing, FFD, in bin packing),
 then schedule them through a **continuous memory budget semaphore**. Each worker
 acquires an estimated cost before starting and releases it on completion.
 Large partitions run first when the full budget is available; small partitions fill
 the gaps. No hard outlier threshold is needed.
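The pre-sort itself is a few lines. A minimal sketch, assuming sizes are available as a plain slice indexed by partition id (`largest_first_order` is an illustrative name):

```rust
use std::cmp::Reverse;

/// Partition ids ordered by decreasing size.
/// The stable sort keeps equal-sized partitions in id order.
fn largest_first_order(sizes: &[u64]) -> Vec<usize> {
    let mut order: Vec<usize> = (0..sizes.len()).collect();
    order.sort_by_key(|&i| Reverse(sizes[i]));
    order
}
```

Sorting ids rather than the sizes themselves keeps the size table addressable by partition id during scheduling.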
 ### `MemoryBudget` (`obisys`)
 ```rust
 pub struct MemoryBudget { … }
 impl MemoryBudget {
    pub fn new(total: u64) -> Self;
    pub fn acquire(&self, cost: u64);  // blocks until budget available
    pub fn release(&self, cost: u64);
    pub fn peak_active(&self) -> usize;
 }
 ```
 Non-deadlock guarantee: when `active == 0`, acquire always succeeds regardless of cost.
 Without this, a partition whose estimated cost exceeds the total budget would block forever.
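The `obisys` implementation is not part of this document. The sketch below shows one way such a budget could be built on `Mutex` and `Condvar`, including the idle-admission rule; the internal `State` type and the non-blocking `try_acquire` are hypothetical additions made for illustration.

```rust
use std::sync::{Condvar, Mutex};

struct State {
    used: u64,
    active: usize,
    peak_active: usize,
}

impl State {
    /// A request fits if nothing is running (non-deadlock rule)
    /// or if it stays within the total budget.
    fn fits(&self, total: u64, cost: u64) -> bool {
        self.active == 0 || self.used.checked_add(cost).map_or(false, |u| u <= total)
    }

    fn admit(&mut self, cost: u64) {
        self.used = self.used.saturating_add(cost);
        self.active += 1;
        self.peak_active = self.peak_active.max(self.active);
    }
}

pub struct MemoryBudget {
    total: u64,
    state: Mutex<State>,
    cv: Condvar,
}

impl MemoryBudget {
    pub fn new(total: u64) -> Self {
        let state = State { used: 0, active: 0, peak_active: 0 };
        Self { total, state: Mutex::new(state), cv: Condvar::new() }
    }

    /// Blocks until the reservation fits.
    pub fn acquire(&self, cost: u64) {
        let mut s = self.state.lock().unwrap();
        while !s.fits(self.total, cost) {
            s = self.cv.wait(s).unwrap();
        }
        s.admit(cost);
    }

    /// Hypothetical non-blocking variant: reserves and returns true if it fits.
    pub fn try_acquire(&self, cost: u64) -> bool {
        let mut s = self.state.lock().unwrap();
        let ok = s.fits(self.total, cost);
        if ok {
            s.admit(cost);
        }
        ok
    }

    /// Returns a reservation and wakes every waiter so each can re-check.
    pub fn release(&self, cost: u64) {
        let mut s = self.state.lock().unwrap();
        s.used = s.used.saturating_sub(cost);
        s.active = s.active.saturating_sub(1);
        drop(s);
        self.cv.notify_all();
    }

    pub fn peak_active(&self) -> usize {
        self.state.lock().unwrap().peak_active
    }
}
```

`notify_all` is used rather than `notify_one` because waiters ask for different costs: the one that happens to wake first may not be the one that fits.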
 ### Adaptive expansion factor
 The expansion factor converts raw `unitigs.bin` bytes into an estimated GraphDeBruijn
 RAM footprint. hashbrown stores each kmer as `(u64, AtomicU8)` ≈ 16 bytes/kmer at 7/8
 load factor; unitig files encode ≈ 2 bits/base. The ratio depends on average unitig
 length (short unitigs: ~2×; long unitigs: up to ~50×).
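The dependence on unitig length can be made explicit. The sketch below is an upper bound under stated assumptions: every kmer of the unitig is new, the file costs exactly 2 bits per base with no per-record overhead, and each kmer costs 16 bytes of RAM. The value of `k` is a parameter because it is not stated here; `expansion_upper_bound` is an illustrative name.

```rust
/// RAM bytes per unitig-file byte for a single unitig of `unitig_len` bases.
/// A unitig of length L contains L - k + 1 kmers and occupies L / 4 bytes.
fn expansion_upper_bound(unitig_len: u64, k: u64) -> f64 {
    assert!(k > 0 && unitig_len >= k, "a unitig holds at least one kmer");
    let kmers = (unitig_len - k + 1) as f64;
    let file_bytes = unitig_len as f64 / 4.0; // 2 bits per base
    kmers * 16.0 / file_bytes
}
```

For `k = 31`, a unitig holding a single kmer gives about 2×, and the ratio approaches 64× as unitigs grow. Observed factors are lower whenever part of the kmers already exist in the destination.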
 **Phase 1 — sequential pilot (worst partition)**
 The largest partition runs alone first. Its actual `g.len()` seeds the expansion factor
 before any parallel job starts. `FALLBACK_EXPANSION = 4×` is used only when the pilot partition is itself empty (zero unitig bytes).
 ```rust
 let worst_g_len = dst_partition.merge_partition(worst_id, …)?;
 //                              ↑ now returns SKResult<usize> (was SKResult<()>)
 let seed_expansion = worst_g_len as u64 * 16 * 1000 / worst_bytes;
 let max_expansion = AtomicU64::new(seed_expansion);
 ```
 **Phase 2 — parallel with adaptive updates**
 ```rust
 order[1..].into_par_iter().for_each(|&i| {
    let cost = partition_sizes[i] * max_expansion.load(Relaxed) / 1000;
    budget.acquire(cost);
    let g_len = dst_partition.merge_partition(i, …)?;
    budget.release(cost);          // releases estimated cost, not actual
    let actual = g_len as u64 * 16 * 1000 / partition_sizes[i];
    max_expansion.fetch_max(actual, Relaxed);  // always pessimistic (max)
 });
 ```
 `budget.release(cost)` uses the estimated cost, not the actual one. The budget tracks
 reservations, not physical RAM; each partition pays what it promised at acquisition.
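The fixed-point arithmetic of both phases can be isolated into small helpers. This is a sketch with illustrative names (`observed_expansion`, `estimated_cost`, `record_observation`); the widening to `u128` is an extra precaution of the sketch against overflow on very large partitions, not something taken from the source.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const SCALE: u64 = 1_000; // fixed-point: 4_200 means 4.2×
const BYTES_PER_KMER: u64 = 16; // per-kmer RAM estimate used in the text

/// Expansion factor (×1000) measured after a partition finished.
/// Returns `None` for a partition with no unitig bytes.
fn observed_expansion(g_len: usize, unitig_bytes: u64) -> Option<u64> {
    if unitig_bytes == 0 {
        return None;
    }
    let scaled = g_len as u128 * u128::from(BYTES_PER_KMER) * u128::from(SCALE);
    Some((scaled / u128::from(unitig_bytes)) as u64)
}

/// Reservation for a partition, using the largest factor seen so far.
fn estimated_cost(unitig_bytes: u64, max_expansion: &AtomicU64) -> u64 {
    let exp = max_expansion.load(Ordering::Relaxed);
    (u128::from(unitig_bytes) * u128::from(exp) / u128::from(SCALE)) as u64
}

/// Raises the shared factor if this partition expanded more than any before.
fn record_observation(max_expansion: &AtomicU64, g_len: usize, unitig_bytes: u64) {
    if let Some(e) = observed_expansion(g_len, unitig_bytes) {
        max_expansion.fetch_max(e, Ordering::Relaxed);
    }
}
```

Because `fetch_max` never lowers the factor, later reservations can only become more conservative as the run progresses.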
 **On the safety margin**
 There is no separate multiplier `k`. It is redundant with `budget_fraction`: both
 reduce effective concurrency by the same amount. A single parameter is easier to
 calibrate. `budget_fraction = 0.5` (default) reserves half of available RAM for the
 OS, MPHF build, pass 2, and estimation error.
 `--budget-fraction` is exposed as a CLI flag — the only escape hatch for pathological
 cases (extreme repetitive content, unusually long unitigs) that still cause OOM.
 ### RAM source
 `obisys::available_memory_bytes()` — wraps `sysinfo::System::available_memory()`,
 falls back to `total / 2` on macOS when the memory compressor returns 0.
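How the budget follows from the RAM reading can be sketched as a pure function. `budget_bytes` is a hypothetical helper; the zero-reading fallback mirrors the rule above, while clamping the fraction to `[0, 1]` is an assumption of this sketch and not behaviour confirmed by the source.

```rust
/// Budget in bytes from an available-memory reading.
/// A reading of 0 is treated as "unknown" and replaced by half of total RAM.
fn budget_bytes(available: u64, total: u64, budget_fraction: f64) -> u64 {
    let usable = if available == 0 { total / 2 } else { available };
    // Assumption of this sketch: out-of-range fractions are clamped.
    let fraction = budget_fraction.clamp(0.0, 1.0);
    (usable as f64 * fraction) as u64
}
```

With 512 GB available and the default fraction of 0.5 this yields the 256 GB budget shown in the sample report.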
 ---
 ## Diagnostic report
 After the parallel phase, `KmerIndex::merge` calls `print_merge_partition_report`, which emits a structured report via `tracing::info!`:
 ```
 ─── merge_partitions memory report ───
  available RAM : 512.0 GB   budget 50% = 256.0 GB
  expansion factor — seed: 4.2×  final max: 6.1×  (mean: 1.8×  median: 1.6×)
  peak concurrent workers: 42
  expansion factor distribution (256 partitions with data):
     0.50× –  1.25× │██████████████████████████████ 148
     1.25× –  2.00× │████████████████████████        82
     …
     5.50× –  6.25× │█                                2
  top partitions by actual expansion factor:
    partition  221 : 6.10×  (232.1 MB unitigs → 48M kmers, reserved at 4.20×)
    partition  135 : 5.82×  (127.3 MB unitigs → 24M kmers, reserved at 4.20×)
    …
 ──────────────────────────────────────
 ```
 Fields useful for diagnosis:
 | Field | Interpretation |
 |---|---|
 | `seed` vs `final max` expansion | gap indicates partitions with higher expansion than the worst-by-size |
 | `reserved at X×` | the factor used at acquisition; if much lower than actual, the budget was under-reserved for that partition |
 | `peak concurrent workers` | effective parallelism achieved under the budget constraint |
 | `mean` / `median` expansion | typical dataset characteristic; stable across runs on the same data |
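The `reserved at X×` reading can be automated when many partitions are listed. A minimal sketch, assuming the report rows have been parsed into a hypothetical `Row` type; the tolerance is a free parameter, not a value taken from the tool.

```rust
/// One parsed line of the "top partitions" list (hypothetical type).
pub struct Row {
    pub id: usize,
    pub actual: f64,   // measured expansion factor
    pub reserved: f64, // factor in force when the budget was acquired
}

/// Ids of partitions whose actual expansion exceeded the reserved factor
/// by more than `tolerance` (for example 1.25 for a 25 % margin).
pub fn under_reserved(rows: &[Row], tolerance: f64) -> Vec<usize> {
    rows.iter()
        .filter(|r| r.actual > r.reserved * tolerance)
        .map(|r| r.id)
        .collect()
}
```

A long list of under-reserved partitions suggests lowering `--budget-fraction`, since the budget then systematically tracks less memory than is actually used.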
 ---
 ## Parameters
 | Parameter | Default | CLI flag | Notes |
 |---|---|---|---|
 | `fallback_expansion` | 4× | — | seed used only when the pilot partition is empty |
 | `budget_fraction` | 0.5 | `--budget-fraction` | reduce if OOM persists |
 | RAM source | `obisys::available_memory_bytes()` | — | falls back to `total/2` on macOS |
@@ -49,6 +49,7 @@ nav:
      - PersistentCompactIntVec: implementation/persistent_compact_int_vec.md
      - PersistentBitVec: implementation/persistent_bit_vec.md
      - Merge command: implementation/merge.md
      - Merge parallelism & memory: implementation/merge_parallelism.md
      - Kmer filtering: implementation/filtering.md
      - Select command: implementation/select.md
  - Architecture:
@@ -54,6 +54,14 @@ impl ColumnarCompactIntMatrix {
        Array1::from_vec(sums)
    }
    pub(crate) fn count_nonzero(&self) -> Array1<u64> {
        let counts: Vec<u64> = (0..self.n_cols())
            .into_par_iter()
            .map(|c| self.col(c).count_nonzero())
            .collect();
        Array1::from_vec(counts)
    }
    pub(crate) fn partial_bray_dist_matrix(&self) -> Array2<u64> {
        self.pairwise_u64(|i, j| self.col(i).partial_bray_dist(self.col(j)))
    }
@@ -234,6 +242,14 @@ impl PackedCompactIntMatrix {
        )
    }
    pub(crate) fn count_nonzero(&self) -> Array1<u64> {
        Array1::from_vec(
            (0..self.n_cols).into_par_iter()
                .map(|c| (0..self.n_rows).filter(|&s| self.get(c, s) > 0).count() as u64)
                .collect()
        )
    }
    // ── Pair primitives ───────────────────────────────────────────────────────
    fn pair_partial_bray(&self, i: usize, j: usize) -> u64 {
@@ -421,6 +437,10 @@ impl PersistentCompactIntMatrix {
        match self { Self::Columnar(m) => m.sum(), Self::Packed(m) => m.sum() }
    }
    pub fn count_nonzero(&self) -> Array1<u64> {
        match self { Self::Columnar(m) => m.count_nonzero(), Self::Packed(m) => m.count_nonzero() }
    }
    pub fn partial_bray_dist_matrix(&self) -> Array2<u64> {
        match self { Self::Columnar(m) => m.partial_bray_dist_matrix(), Self::Packed(m) => m.partial_bray_dist_matrix() }
    }
@@ -451,6 +471,7 @@ use crate::traits::{ColumnWeights, CountPartials};
 impl ColumnWeights for PersistentCompactIntMatrix {
    fn col_weights(&self) -> Array1<u64> { self.sum() }
    fn partial_kmer_counts(&self) -> Array1<u64> { self.count_nonzero() }
 }
 impl CountPartials for PersistentCompactIntMatrix {
@@ -133,11 +133,15 @@ impl PersistentCompactIntVec {
    }
    #[inline]
    /// Returns the sum of all values in the compact int vector.
    pub fn sum(&self) -> u64 {
        self.iter().map(|v| v as u64).sum()
    }
    #[inline]
    /// Returns the number of non-zero values in the compact int vector.
    pub fn count_nonzero(&self) -> u64 {
        self.iter().filter(|&v| v > 0).count() as u64
    }
    #[inline]
    /// Returns the Bray-Curtis distance between two compact int vectors.
    pub fn bray_dist(&self, other: &PersistentCompactIntVec) -> f64 {
@@ -2,8 +2,16 @@ use ndarray::{Array1, Array2};
 /// Column-level weight statistic — total count or presence count per column.
 /// Additive across layers and partitions; used as denominator in normalised distances.
 ///
 /// `partial_kmer_counts` returns the number of **distinct k-mers** present per
 /// column (presence = 1 entries; count > 0 entries). For presence matrices this
 /// equals `col_weights`; for count matrices it differs (count_nonzero vs sum).
 pub trait ColumnWeights: Send + Sync {
    fn col_weights(&self) -> Array1<u64>;
    fn partial_kmer_counts(&self) -> Array1<u64> {
        self.col_weights()
    }
 }
 /// Partial distance matrices for count-based data (`PersistentCompactIntMatrix`).
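The effect of the default method can be seen on toy data. The sketch below restates the trait with `Vec<u64>` in place of `Array1<u64>` so that it needs no external crate; `ToyPresence` and `ToyCounts` are hypothetical in-memory stand-ins for the persistent matrices.

```rust
/// Stand-in for the trait above, with `Vec<u64>` replacing `Array1<u64>`.
trait ColumnWeights {
    fn col_weights(&self) -> Vec<u64>;
    /// Defaults to `col_weights`, which is correct for presence data.
    fn partial_kmer_counts(&self) -> Vec<u64> {
        self.col_weights()
    }
}

/// One inner `Vec` per column (genome).
struct ToyPresence(Vec<Vec<bool>>);
struct ToyCounts(Vec<Vec<u32>>);

impl ColumnWeights for ToyPresence {
    fn col_weights(&self) -> Vec<u64> {
        self.0
            .iter()
            .map(|col| col.iter().filter(|&&b| b).count() as u64)
            .collect()
    }
}

impl ColumnWeights for ToyCounts {
    fn col_weights(&self) -> Vec<u64> {
        self.0
            .iter()
            .map(|col| col.iter().map(|&v| u64::from(v)).sum::<u64>())
            .collect()
    }
    /// Count matrices override the default: distinct kmers, not total counts.
    fn partial_kmer_counts(&self) -> Vec<u64> {
        self.0
            .iter()
            .map(|col| col.iter().filter(|&&v| v > 0).count() as u64)
            .collect()
    }
}
```

Only the count matrix needs an override; the presence matrix inherits the default because every set bit is exactly one distinct kmer.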
@@ -2,7 +2,10 @@ use std::collections::HashMap;
 use std::fs;
 use std::io;
 use std::path::Path;
-use obisys::{Reporter, Stage, progress_bar, spinner};
+use std::sync::atomic::{AtomicU64, Ordering};
 use std::sync::{Arc, Mutex};
 use obisys::{MemoryBudget, Reporter, Stage, available_memory_bytes, progress_bar, spinner};
 use rayon::prelude::*;
 use tracing::info;
@@ -15,23 +18,26 @@ use crate::state::IndexState;
 pub use obikpartitionner::MergeMode;
 // ── per-partition diagnostic record ──────────────────────────────────────────
 #[derive(Debug)]
 struct PartStat {
    id:             usize,
    unitig_bytes:   u64,   // sum of unitigs.bin across remaining sources
    g_len:          usize, // actual new kmers inserted into GraphDeBruijn
    exp_at_acquire: f64,   // expansion factor used to size the budget reservation
 }
 // ── main merge entry point ────────────────────────────────────────────────────
 impl KmerIndex {
    /// Merge `sources` into a new index at `output`.
    ///
    /// All sources must be in `Indexed` state and share the same `kmer_size`,
    /// `minimizer_size`, and `n_partitions`. Count mode additionally requires
    /// every source to have `with_counts = true`.
    ///
    /// Genome labels must be unique across all sources. If `rename_duplicates`
    /// is true, repeated labels are disambiguated by appending `.1`, `.2`, …
    /// to the second and subsequent occurrences. Otherwise a
    /// `DuplicateGenomeLabel` error is returned on the first conflict.
    pub fn merge<P: AsRef<Path>>(
        output: P,
        sources: &[&KmerIndex],
        mode: MergeMode,
        force: bool,
        rename_duplicates: bool,
        budget_fraction: f64,
        rep: &mut Reporter,
    ) -> OKIResult<Self> {
        let output = output.as_ref();
@@ -98,7 +104,7 @@ impl KmerIndex {
        let sources: &[&KmerIndex] = &ordered;
        let evidence = sources[0].meta.config.evidence.clone();
-        // ── Compute final genome labels (rename duplicates if requested) ───────
+        // ── Compute final genome labels ────────────────────────────────────────
        let (source_labels, all_genomes) = compute_labels(sources, rename_duplicates)?;
        // ── Prepare output directory ──────────────────────────────────────────
@@ -125,23 +131,19 @@ impl KmerIndex {
        pb.set_message("copying index …");
        copy_dir_all(&sources[0].root_path, output)?;
        // Rewrite index.meta with final genome labels and the effective mode.
        let mut meta = IndexMeta::read(output).map_err(OKIError::Io)?;
        meta.genomes = all_genomes;
        meta.config.with_counts = mode == MergeMode::Count;
        meta.config.evidence = evidence.clone();
        meta.write(output)?;
        // In presence/absence mode, purge counts/ directories inherited from
        // source_0 — they are stale data from the source's count index.
        if mode == MergeMode::Presence {
            remove_dirs_named(output, "counts")?;
        }
        pb.finish_and_clear();
        rep.push(t.stop());
-        // Rebuild spectrums/ from all sources using the (possibly renamed) labels.
+        // ── Rebuild spectrums ─────────────────────────────────────────────────
        // Drop the spectrums/ that were copied from source_0 and rebuild from scratch.
        info!("rebuilding spectrums for {} source(s)", sources.len());
        let t = Stage::start("spectrums");
        let pb = spinner("spectrums");
@@ -157,12 +159,12 @@ impl KmerIndex {
        pb.finish_and_clear();
        rep.push(t.stop());
-        // Open the destination index.
+        // ── Open destination ──────────────────────────────────────────────────
        let dst = KmerIndex::open(output)?;
        let n_partitions = dst.n_partitions();
        let n_dst_genomes = sources[0].meta.genomes.len();
-        // ── Merge each subsequent source partition-by-partition ───────────────
+        // ── Merge partitions ──────────────────────────────────────────────────
        let remaining_sources: Vec<&KmerIndex> = sources[1..].to_vec();
        if !remaining_sources.is_empty() {
            let n_src_genomes: usize = remaining_sources.iter().map(|s| s.meta.genomes.len()).sum();
@@ -176,22 +178,118 @@ impl KmerIndex {
            let dst_partition = &dst.partition;
            let block_bits = dst.meta.config.block_bits;
-            let errors: Vec<obiskio::SKError> = (0..n_partitions)
-                .into_par_iter()
-                .filter_map(|i| {
-                    let srcs: Vec<(&obikpartitionner::KmerPartition, usize)> =
-                        remaining_sources.iter().map(|s| (&s.partition, s.meta.genomes.len())).collect();
-                    let result = dst_partition.merge_partition(i, &srcs, mode, n_dst_genomes, block_bits, &evidence).err();
+            // Pre-build source list once (avoid rebuilding per partition)
+            let srcs: Vec<(&obikpartitionner::KmerPartition, usize)> = remaining_sources
+                .iter()
+                .map(|s| (&s.partition, s.meta.genomes.len()))
+                .collect();
+
            // Per-partition unitig byte sizes across remaining sources (stat() only)
            let partition_sizes: Vec<u64> = (0..n_partitions)
                .map(|i| remaining_sources.iter()
                    .map(|s| partition_unitig_bytes(s, i))
                    .sum())
                .collect();
            // LFD sort: largest partition first
            let mut order: Vec<usize> = (0..n_partitions).collect();
            order.sort_unstable_by_key(|&i| std::cmp::Reverse(partition_sizes[i]));
            // ── Sequential pilot: worst partition → seed expansion factor ─────
            const FALLBACK_EXPANSION: u64 = 4_000; // 4× in fixed-point ×1000
            let worst_id    = order[0];
            let worst_bytes = partition_sizes[worst_id];
            let worst_g_len = dst_partition
                .merge_partition(worst_id, &srcs, mode, n_dst_genomes, block_bits, &evidence)
                .map_err(OKIError::Partition)?;
            pb.inc(1);
            let seed_expansion = if worst_bytes > 0 {
                worst_g_len as u64 * 16 * 1000 / worst_bytes
            } else {
                FALLBACK_EXPANSION
            };
            info!(
                "merge_partitions: pilot partition {} — {} unitig bytes → {} new kmers, \
                 expansion {:.2}×",
                worst_id, worst_bytes, worst_g_len,
                seed_expansion as f64 / 1000.0,
            );
            let part_stats: Arc<Mutex<Vec<PartStat>>> = Arc::new(Mutex::new({
                let mut v = Vec::with_capacity(n_partitions);
                v.push(PartStat {
                    id:             worst_id,
                    unitig_bytes:   worst_bytes,
                    g_len:          worst_g_len,
                    exp_at_acquire: seed_expansion as f64 / 1000.0,
                });
                v
            }));
            let max_expansion = AtomicU64::new(seed_expansion);
            // ── Parallel remainder under memory budget ────────────────────────
            let available = available_memory_bytes();
            let budget_bytes = (available as f64 * budget_fraction) as u64;
            let budget = Arc::new(MemoryBudget::new(budget_bytes));
            info!(
                "merge_partitions: available RAM {}, budget {:.0}% = {}",
                fmt_bytes(available),
                budget_fraction * 100.0,
                fmt_bytes(budget_bytes),
            );
            let errors: Vec<OKIError> = order[1..].into_par_iter()
                .filter_map(|&i| {
                    let ubytes = partition_sizes[i];
                    let exp    = max_expansion.load(Ordering::Relaxed);
                    let cost   = ubytes * exp / 1000;
                    budget.acquire(cost);
                    let result = dst_partition
                        .merge_partition(i, &srcs, mode, n_dst_genomes, block_bits, &evidence);
                    budget.release(cost);
                    pb.inc(1);
-                    result
+
                    match result {
                        Ok(g_len) => {
                            if ubytes > 0 {
                                let actual = g_len as u64 * 16 * 1000 / ubytes;
                                max_expansion.fetch_max(actual, Ordering::Relaxed);
                            }
                            part_stats.lock().unwrap().push(PartStat {
                                id:             i,
                                unitig_bytes:   ubytes,
                                g_len,
                                exp_at_acquire: exp as f64 / 1000.0,
                            });
                            None
                        }
                        Err(e) => Some(OKIError::Partition(e)),
                    }
                })
                .collect();
            pb.finish_and_clear();
            if let Some(e) = errors.into_iter().next() {
-                return Err(OKIError::Partition(e));
+                return Err(e);
            }
            // ── Diagnostic report ─────────────────────────────────────────────
            let stats = Arc::try_unwrap(part_stats).unwrap().into_inner().unwrap();
            print_merge_partition_report(
                &stats,
                available,
                budget_fraction,
                seed_expansion as f64 / 1000.0,
                max_expansion.load(Ordering::Relaxed) as f64 / 1000.0,
                budget.peak_active(),
            );
            rep.push(t.stop());
        }
@@ -206,19 +304,110 @@ impl KmerIndex {
            rep.push(t.stop());
        }
        // Re-open to get the updated state.
        KmerIndex::open(output)
    }
 }
-// ── Helpers ───────────────────────────────────────────────────────────────────
+// ── Diagnostic report ─────────────────────────────────────────────────────────
 fn print_merge_partition_report(
    stats:          &[PartStat],
    available_ram:  u64,
    budget_fraction: f64,
    seed_expansion: f64,
    final_expansion: f64,
    peak_active:    usize,
 ) {
    // Compute actual expansion per partition (skip empty partitions)
    let expansions: Vec<(usize, f64)> = stats
        .iter()
        .filter(|s| s.unitig_bytes > 0)
        .map(|s| (s.id, s.g_len as f64 * 16.0 / s.unitig_bytes as f64))
        .collect();
    if expansions.is_empty() {
        info!("merge_partitions report: no data (all partitions empty)");
        return;
    }
    let mut sorted_exp: Vec<f64> = expansions.iter().map(|(_, e)| *e).collect();
    sorted_exp.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let n = sorted_exp.len();
    let mean_exp  = sorted_exp.iter().sum::<f64>() / n as f64;
    let median_exp = sorted_exp[n / 2];
    let max_exp   = sorted_exp[n - 1];
    info!("─── merge_partitions memory report ───");
    info!(
        "  available RAM : {}   budget {:.0}% = {}",
        fmt_bytes(available_ram),
        budget_fraction * 100.0,
        fmt_bytes((available_ram as f64 * budget_fraction) as u64),
    );
    info!(
        "  expansion factor — seed: {:.2}×  final max: {:.2}×  \
         (mean: {:.2}×  median: {:.2}×  observed max: {:.2}×)",
        seed_expansion, final_expansion, mean_exp, median_exp, max_exp,
    );
    info!("  peak concurrent workers: {}", peak_active);
    // Histogram of actual expansion factors
    let min_e = sorted_exp[0];
    let max_e = sorted_exp[n - 1];
    let n_buckets = 8usize;
    let bucket_w = (max_e - min_e).max(0.01) / n_buckets as f64;
    let mut counts = vec![0usize; n_buckets];
    for &e in &sorted_exp {
        let b = (((e - min_e) / bucket_w) as usize).min(n_buckets - 1);
        counts[b] += 1;
    }
    let max_count = *counts.iter().max().unwrap();
    info!("  expansion factor distribution ({} partitions with data):", n);
    for (i, &c) in counts.iter().enumerate() {
        let lo  = min_e + i as f64 * bucket_w;
        let hi  = min_e + (i + 1) as f64 * bucket_w;
        let bar = "█".repeat(if max_count > 0 { c * 30 / max_count } else { 0 });
        info!("    {:5.2}× – {:5.2}× │{:<30} {}", lo, hi, bar, c);
    }
    // Top 8 by actual expansion
    let mut by_exp: Vec<(usize, f64)> = expansions.clone();
    by_exp.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    info!("  top partitions by actual expansion factor:");
    for (id, exp) in by_exp.iter().take(8) {
        let s = stats.iter().find(|s| s.id == *id).unwrap();
        info!(
            "    partition {:4} : {:.2}×  ({} unitigs → {}M kmers, \
             reserved at {:.2}×)",
            id, exp,
            fmt_bytes(s.unitig_bytes),
            s.g_len / 1_000_000,
            s.exp_at_acquire,
        );
    }
    info!("──────────────────────────────────────");
 }
 // ── helpers ───────────────────────────────────────────────────────────────────
 fn fmt_bytes(b: u64) -> String {
    if b >= 1 << 30      { format!("{:.1} GB", b as f64 / (1u64 << 30) as f64) }
    else if b >= 1 << 20 { format!("{:.1} MB", b as f64 / (1u64 << 20) as f64) }
    else if b >= 1 << 10 { format!("{:.1} KB", b as f64 / (1u64 << 10) as f64) }
    else                 { format!("{b} B") }
 }
 /// Sum of all unitigs.bin sizes across all layers of partition `i` in `src`.
 fn partition_unitig_bytes(src: &KmerIndex, i: usize) -> u64 {
    let mut total = 0u64;
    for l in 0.. {
        let p = src.layer_unitigs_path(i, l);
        if !p.exists() { break; }
        if let Ok(m) = std::fs::metadata(&p) { total += m.len(); }
    }
    total
 }
 /// Compute the final genome label lists for all sources.
 ///
 /// Returns `(per_source_labels, all_genomes_flat)`.
 /// The first occurrence of a label keeps the original name. Subsequent
 /// occurrences receive `.1`, `.2`, … suffixes when `rename_duplicates` is true,
 /// or trigger a `DuplicateGenomeLabel` error otherwise.
 fn compute_labels(
    sources: &[&KmerIndex],
    rename_duplicates: bool,
@@ -249,8 +438,6 @@ fn compute_labels(
    Ok((source_labels, all_genomes))
 }
 /// Copy spectrum JSON files from `src_root/spectrums/` to `dst_root/spectrums/`,
 /// mapping each `old_labels[i]` filename to `new_labels[i]`.
 fn copy_spectrums(
    src_root: &Path,
    dst_root: &Path,
@@ -269,7 +456,6 @@ fn copy_spectrums(
    Ok(())
 }
 /// Recursively remove every directory named `name` under `root`.
 fn remove_dirs_named(root: &Path, name: &str) -> io::Result<()> {
    for entry in fs::read_dir(root)? {
        let entry = entry?;
@@ -285,7 +471,6 @@ fn remove_dirs_named(root: &Path, name: &str) -> io::Result<()> {
    Ok(())
 }
 fn format_evidence(ev: &IndexMode) -> String {
    match ev {
        IndexMode::Exact             => "exact".to_string(),
@@ -294,37 +479,15 @@ fn format_evidence(ev: &IndexMode) -> String {
    }
 }
 /// A source is "trivial" if its presence/count values carry no approximation:
 /// single-genome presence index (SetMembership — all values are 1 by construction).
 fn is_trivial(src: &KmerIndex, mode: MergeMode) -> bool {
    src.meta.genomes.len() == 1 && mode == MergeMode::Presence
 }
 /// Sum of all `unitigs.bin` sizes across every partition and layer.
 /// Used as a proxy for the number of indexed smers.
 fn index_unitig_size(src: &KmerIndex) -> u64 {
    let n = src.partition.n_partitions();
-    let mut total = 0u64;
-    for i in 0..n {
-        let index_dir = src.partition.part_dir(i).join("index");
-        let mut l = 0usize;
-        loop {
-            let p = index_dir.join(format!("layer_{l}")).join("unitigs.bin");
-            if !p.exists() { break; }
-            if let Ok(m) = std::fs::metadata(&p) { total += m.len(); }
-            l += 1;
-        }
-    }
-    total
+    (0..n).map(|i| partition_unitig_bytes(src, i)).sum()
 }
 /// Choose the index to use as bootstrap base.
 ///
 /// Rule ("mieux-disant": the most demanding source wins): if any non-trivial source uses approximate evidence
 /// (Approx or Hybrid), the output must also be approximate; the base must
 /// therefore come from an approximate source so its layers carry the right
 /// evidence files.  Among qualifying candidates, the largest (by unitig size)
 /// is chosen to minimise the number of new smers in the merge layer.
 fn choose_base(sources: &[&KmerIndex], mode: MergeMode) -> usize {
    let needs_approx = sources.iter().any(|src| {
        !is_trivial(src, mode)
@@ -1,7 +1,8 @@
 use std::fs;
 use std::path::Path;
-use obicompactvec::LayerMeta;
+use obicompactvec::{LayerMeta, PersistentBitMatrix, PersistentCompactIntMatrix};
 use obicompactvec::traits::ColumnWeights;
 use obilayeredmap::meta::PartitionMeta;
 use rayon::prelude::*;
@@ -124,4 +125,68 @@ impl KmerIndex {
            total:             bpk(mphf_b + evidence_b + matrix_b),
        })
    }
    /// Return `(total_distinct_kmers, per_genome_kmer_counts)`.
    ///
    /// For each genome, the count is the number of distinct k-mers for which
    /// that genome has a non-zero value (presence = 1, count > 0).
    /// Partitions are scanned in parallel; results are summed across partitions.
    pub fn genome_kmer_counts(&self) -> OKIResult<(usize, Vec<u64>)> {
        let n          = self.n_partitions();
        let n_genomes  = self.meta.genomes.len();
        let partials: Vec<(usize, Vec<u64>)> = (0..n)
            .into_par_iter()
            .map(|i| {
                let mut counts  = vec![0u64; n_genomes];
                let mut n_kmers = 0usize;
                let index_dir = self.partition.part_dir(i).join("index");
                if !index_dir.exists() { return (0, counts); }
                let n_layers = PartitionMeta::load(&index_dir)
                    .map(|m| m.n_layers)
                    .unwrap_or(0);
                for l in 0..n_layers {
                    let layer_dir = index_dir.join(format!("layer_{l}"));
                    if !layer_dir.exists() { continue; }
                    n_kmers += LayerMeta::load(&layer_dir).map(|m| m.n).unwrap_or(0);
                    let mat: Box<dyn ColumnWeights> =
                        if layer_dir.join("counts").exists()
                            && !layer_dir.join("presence").exists()
                        {
                            match PersistentCompactIntMatrix::open(&layer_dir) {
                                Ok(m)  => Box::new(m),
                                Err(_) => continue,
                            }
                        } else {
                            match PersistentBitMatrix::open(&layer_dir) {
                                Ok(m)  => Box::new(m),
                                Err(_) => continue,
                            }
                        };
                    let col_counts = mat.partial_kmer_counts();
                    for (c, &v) in col_counts.iter().enumerate() {
                        if c < n_genomes { counts[c] += v; }
                    }
                }
                (n_kmers, counts)
            })
            .collect();
        let total_kmers: usize = partials.iter().map(|(n, _)| n).sum();
        let mut total_counts   = vec![0u64; n_genomes];
        for (_, counts) in partials {
            for (i, v) in counts.into_iter().enumerate() {
                total_counts[i] += v;
            }
        }
        Ok((total_kmers, total_counts))
    }
 }
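The final reduction in `genome_kmer_counts` is an element-wise sum of per-partition results. A self-contained sketch of that step, with `sum_partials` as an illustrative name and plain vectors in place of the index types:

```rust
/// Combines `(n_kmers, per_genome_counts)` pairs from every partition.
/// Columns beyond `n_genomes` are ignored, as in the scan itself.
fn sum_partials(partials: Vec<(usize, Vec<u64>)>, n_genomes: usize) -> (usize, Vec<u64>) {
    let mut total_kmers = 0usize;
    let mut totals = vec![0u64; n_genomes];
    for (n_kmers, counts) in partials {
        total_kmers += n_kmers;
        // `zip` stops at the shorter side, so extra columns are dropped.
        for (t, v) in totals.iter_mut().zip(counts) {
            *t += v;
        }
    }
    (total_kmers, totals)
}
```

Since addition is associative and commutative, the order in which partitions finish does not affect the result, which is what makes the parallel scan safe.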
@@ -26,6 +26,11 @@ pub struct MergeArgs {
    /// Disambiguate duplicate genome labels by appending .1, .2, … instead of erroring
    #[arg(long, default_value_t = false)]
    pub rename_duplicates: bool,
    /// Fraction of available RAM reserved as memory budget for parallel partition merging.
    /// Reduce if OOM occurs despite the adaptive scheduler (e.g. --budget-fraction 0.3).
    #[arg(long, default_value_t = 0.5)]
    pub budget_fraction: f64,
 }
 pub fn run(args: MergeArgs) {
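The flag above accepts any `f64`. If range checking were wanted, a validator along the following lines could be attached to the argument; this is a sketch, `parse_budget_fraction` is a hypothetical function, and the accepted range `(0, 1]` is an assumption rather than a documented constraint.

```rust
/// Parses a budget fraction and rejects values outside (0, 1].
/// NaN and infinities fail the range test and are rejected as well.
fn parse_budget_fraction(s: &str) -> Result<f64, String> {
    let f = s
        .trim()
        .parse::<f64>()
        .map_err(|e| format!("not a number: {e}"))?;
    if f > 0.0 && f <= 1.0 {
        Ok(f)
    } else {
        Err(format!("budget fraction must be in (0, 1], got {f}"))
    }
}
```

Rejecting the value at parse time reports the mistake before any index is opened, instead of letting a zero or negative budget reach the scheduler.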
@@ -60,7 +65,7 @@ pub fn run(args: MergeArgs) {
    );
    let mut rep = Reporter::new();
-    KmerIndex::merge(&args.output, &source_refs, mode, args.force, args.rename_duplicates, &mut rep).unwrap_or_else(|e| {
+    KmerIndex::merge(&args.output, &source_refs, mode, args.force, args.rename_duplicates, args.budget_fraction, &mut rep).unwrap_or_else(|e| {
        eprintln!("error merging: {e}");
        std::process::exit(1);
    });
@@ -1,3 +1,4 @@
 use std::io::{self, Write};
 use std::path::PathBuf;
 use clap::Args;
@@ -6,20 +7,33 @@ use tracing::info;
 #[derive(Args)]
 pub struct UtilsArgs {
-    /// Index directory to operate on
+    /// Index directories to operate on (one or more)
-    pub index: PathBuf,
+    #[arg(required = true, num_args = 1..)]
    pub indexes: Vec<PathBuf>,
-    /// Set a new genome label: NEW_LABEL=OLD_LABEL
+    /// Set a new genome label: NEW_LABEL=OLD_LABEL  (single-index only)
    #[arg(long, value_name = "NEW=OLD")]
    pub new_label: Option<String>,
-    /// Add missing layer_meta.json files to each layer (required after upgrading from old indexes)
+    /// Add missing layer_meta.json files to each layer  (single-index only)
    #[arg(long)]
    pub upgrade_index: bool,
-    /// Print bits-per-kmer statistics (MPHF, evidence, matrix, total)
+    /// Print bits-per-kmer statistics  (single-index only)
    #[arg(long)]
    pub bits_per_kmer: bool,
    /// Print per-genome k-mer counts as CSV  (single-index only)
    #[arg(long)]
    pub stats: bool,
    /// Print partition size distribution report (accepts multiple indexes)
    #[arg(long)]
    pub partition_stats: bool,
    /// Write per-(partition, source) raw data as CSV to FILE  (used with --partition-stats)
    #[arg(long, value_name = "FILE")]
    pub csv: Option<PathBuf>,
 }
 pub fn run(args: UtilsArgs) {
@@ -27,25 +41,266 @@ pub fn run(args: UtilsArgs) {
    if let Some(spec) = &args.new_label {
        any = true;
-        run_rename(&args.index, spec);
+        run_rename(single_index(&args), spec);
    }
    if args.upgrade_index {
        any = true;
-        run_upgrade_index(&args.index);
+        run_upgrade_index(single_index(&args));
    }
    if args.bits_per_kmer {
        any = true;
-        run_bits_per_kmer(&args.index);
+        run_bits_per_kmer(single_index(&args));
    }
    if args.stats {
        any = true;
        run_stats(single_index(&args));
    }
    if args.partition_stats {
        any = true;
        run_partition_stats(&args.indexes, args.csv.as_deref());
    }
    if !any {
-        eprintln!("utils: no operation specified. Available options: --new-label NEW=OLD, --upgrade-index, --bits-per-kmer");
+        eprintln!(
            "utils: no operation specified. \
             Available: --new-label, --upgrade-index, --bits-per-kmer, --stats, --partition-stats"
        );
        std::process::exit(1);
    }
 }
 // ── helpers ───────────────────────────────────────────────────────────────────
 fn single_index(args: &UtilsArgs) -> &PathBuf {
    if args.indexes.len() != 1 {
        eprintln!("utils: this option requires exactly one index (got {})", args.indexes.len());
        std::process::exit(1);
    }
    &args.indexes[0]
 }
 // ── --partition-stats ─────────────────────────────────────────────────────────
 /// One row per (partition, source): total size in bytes of that partition's
 /// unitigs.bin files, summed across layers.
 struct PartRow {
    partition: usize,
    source: String,
    bytes: u64,
 }
 fn collect_rows(indexes: &[PathBuf]) -> Vec<PartRow> {
    let mut rows = Vec::new();
    for path in indexes {
        let idx = KmerIndex::open(path).unwrap_or_else(|e| {
            eprintln!("error opening index {}: {e}", path.display());
            std::process::exit(1);
        });
        let name = path
            .file_name()
            .map(|n| n.to_string_lossy().into_owned())
            .unwrap_or_else(|| path.display().to_string());
        let n_parts = idx.n_partitions();
        for i in 0..n_parts {
            let mut bytes = 0u64;
            for l in 0.. {
                let p = idx.layer_unitigs_path(i, l);
                if !p.exists() {
                    break;
                }
                if let Ok(m) = std::fs::metadata(&p) {
                    bytes += m.len();
                }
            }
            rows.push(PartRow { partition: i, source: name.clone(), bytes });
        }
    }
    rows
 }
 /// Sum bytes per partition across all sources.
 fn partition_totals(rows: &[PartRow], n_parts: usize) -> Vec<u64> {
    let mut totals = vec![0u64; n_parts];
    for r in rows {
        totals[r.partition] += r.bytes;
    }
    totals
 }
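 /// Returns `(min, max, mean, median, p95, p99, std_dev)` over the per-partition totals.
 /// Percentiles are index-based (`sorted[floor(n × p)]`); `std_dev` is the population
 /// standard deviation truncated to whole bytes. Panics if `totals` is empty.
 fn stats_summary(totals: &[u64]) -> (u64, u64, f64, f64, u64, u64, u64) {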
    let mut sorted = totals.to_vec();
    sorted.sort_unstable();
    let n = sorted.len();
    let min = sorted[0];
    let max = sorted[n - 1];
    let mean = sorted.iter().sum::<u64>() as f64 / n as f64;
    let median = if n % 2 == 0 {
        (sorted[n / 2 - 1] + sorted[n / 2]) as f64 / 2.0
    } else {
        sorted[n / 2] as f64
    };
    let p95 = sorted[(n as f64 * 0.95) as usize];
    let p99 = sorted[(n as f64 * 0.99) as usize];
    let variance = sorted
        .iter()
        .map(|&v| (v as f64 - mean).powi(2))
        .sum::<f64>()
        / n as f64;
    let std_dev = variance.sqrt();
    (min, max, mean, median, p95, p99, std_dev as u64)
 }
 fn human_bytes(b: u64) -> String {
    if b >= 1 << 30 {
        format!("{:.1} GB", b as f64 / (1u64 << 30) as f64)
    } else if b >= 1 << 20 {
        format!("{:.1} MB", b as f64 / (1u64 << 20) as f64)
    } else if b >= 1 << 10 {
        format!("{:.1} KB", b as f64 / (1u64 << 10) as f64)
    } else {
        format!("{b} B")
    }
 }
 fn ascii_histogram(totals: &[u64], n_buckets: usize, bar_width: usize) -> String {
    let min = *totals.iter().min().unwrap();
    let max = *totals.iter().max().unwrap();
    if min == max {
        return format!("  (all partitions identical: {})\n", human_bytes(min));
    }
    let bucket_size = (max - min).max(1) as f64 / n_buckets as f64;
    let mut counts = vec![0usize; n_buckets];
    for &v in totals {
        let b = (((v - min) as f64 / bucket_size) as usize).min(n_buckets - 1);
        counts[b] += 1;
    }
    let max_count = *counts.iter().max().unwrap();
    let mut out = String::new();
    for (i, &c) in counts.iter().enumerate() {
        let lo = min + (i as f64 * bucket_size) as u64;
        let hi = min + ((i + 1) as f64 * bucket_size) as u64;
        let bar_len = if max_count > 0 { c * bar_width / max_count } else { 0 };
        let bar = "█".repeat(bar_len);
        out.push_str(&format!(
            "  {:>8} – {:>8} │{:<width$} {}\n",
            human_bytes(lo),
            human_bytes(hi),
            bar,
            c,
            width = bar_width
        ));
    }
    out
 }
 fn run_partition_stats(indexes: &[PathBuf], csv_path: Option<&std::path::Path>) {
    let rows = collect_rows(indexes);
    if rows.is_empty() {
        eprintln!("partition-stats: no data found");
        std::process::exit(1);
    }
    let n_parts = rows.iter().map(|r| r.partition).max().unwrap() + 1;
    let totals = partition_totals(&rows, n_parts);
    let (min, max, mean, median, p95, p99, std_dev) = stats_summary(&totals);
    // outliers: > Q3 + 1.5 × IQR (upper Tukey fence), with index-based quartiles
    let mut sorted_t = totals.clone();
    sorted_t.sort_unstable();
    let q1 = sorted_t[n_parts / 4] as f64;
    let q3 = sorted_t[3 * n_parts / 4] as f64;
    let iqr = q3 - q1;
    let outlier_threshold = q3 + 1.5 * iqr;
    let mut out = String::new();
    out.push_str("# Partition size report\n\n");
    out.push_str(&format!(
        "Sources: {}  \nPartitions: {}  \n\n",
        indexes.len(),
        n_parts
    ));
    out.push_str("## Summary statistics (total unitigs.bin bytes per partition, sum across sources)\n\n");
    out.push_str("| Stat | Value |\n|---|---|\n");
    out.push_str(&format!("| min    | {} |\n", human_bytes(min)));
    out.push_str(&format!("| max    | {} |\n", human_bytes(max)));
    out.push_str(&format!("| mean   | {} |\n", human_bytes(mean as u64)));
    out.push_str(&format!("| median | {} |\n", human_bytes(median as u64)));
    out.push_str(&format!("| p95    | {} |\n", human_bytes(p95)));
    out.push_str(&format!("| p99    | {} |\n", human_bytes(p99)));
    out.push_str(&format!("| std    | {} |\n", human_bytes(std_dev)));
    out.push_str(&format!("| max/median ratio | {:.2}× |\n\n", max as f64 / median));
    out.push_str("## Histogram\n\n```\n");
    out.push_str(&ascii_histogram(&totals, 30, 40));
    out.push_str("```\n\n");
    let outliers: Vec<(usize, u64)> = totals
        .iter()
        .enumerate()
        .filter(|(_, v)| **v as f64 > outlier_threshold)
        .map(|(i, v)| (i, *v))
        .collect();
    if outliers.is_empty() {
        out.push_str("## Outliers\n\nNone (threshold: Q3 + 1.5×IQR = ");
        out.push_str(&human_bytes(outlier_threshold as u64));
        out.push_str(").\n");
    } else {
        out.push_str(&format!(
            "## Outliers (> Q3 + 1.5×IQR = {})\n\n| Partition | Total size | Ratio to median |\n|---|---|---|\n",
            human_bytes(outlier_threshold as u64)
        ));
        for (i, v) in &outliers {
            out.push_str(&format!(
                "| {} | {} | {:.2}× |\n",
                i,
                human_bytes(*v),
                *v as f64 / median
            ));
        }
        out.push('\n');
    }
    print!("{out}");
    if let Some(csv_out) = csv_path {
        let file = std::fs::File::create(csv_out).unwrap_or_else(|e| {
            eprintln!("error creating CSV file {}: {e}", csv_out.display());
            std::process::exit(1);
        });
        let mut w = io::BufWriter::new(file);
        writeln!(w, "partition,source,bytes").unwrap();
        for r in &rows {
            writeln!(w, "{},{},{}", r.partition, r.source, r.bytes).unwrap();
        }
        eprintln!("CSV written to {}", csv_out.display());
    }
 }
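
The outlier rule used in `run_partition_stats` can be isolated as a small helper. The sketch below is illustrative only: `upper_fence` and `outliers` are hypothetical names, not functions in this commit, and the quartiles use the same index-based approximation (`sorted[n / 4]`, `sorted[3 * n / 4]`) as the report.

```rust
/// Upper Tukey fence Q3 + 1.5 × IQR, or None for an empty slice.
/// Quartiles are taken by index, mirroring the report code (an approximation,
/// not an interpolated quantile).
pub fn upper_fence(values: &[u64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    let mut sorted = values.to_vec();
    sorted.sort_unstable();
    let n = sorted.len();
    let q1 = sorted[n / 4] as f64;
    let q3 = sorted[3 * n / 4] as f64;
    Some(q3 + 1.5 * (q3 - q1))
}

/// (index, value) pairs strictly above the fence, in input order.
pub fn outliers(values: &[u64]) -> Vec<(usize, u64)> {
    match upper_fence(values) {
        None => Vec::new(),
        Some(fence) => values
            .iter()
            .copied()
            .enumerate()
            .filter(|&(_, v)| v as f64 > fence)
            .collect(),
    }
}
```

For the totals `[10, 12, 11, 13, 12, 11, 10, 100]`, the sorted quartiles are Q1 = 11 and Q3 = 13, so the fence is 13 + 1.5 × 2 = 16 and only the partition at index 7 is flagged.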
 // ── existing single-index operations ─────────────────────────────────────────
 fn run_stats(index_path: &PathBuf) {
    let idx = KmerIndex::open(index_path).unwrap_or_else(|e| {
        eprintln!("error opening index: {e}");
        std::process::exit(1);
    });
    let (total, per_genome) = idx.genome_kmer_counts().unwrap_or_else(|e| {
        eprintln!("error computing stats: {e}");
        std::process::exit(1);
    });
    println!("genome,n_kmers");
    for (g, &n) in idx.meta().genomes.iter().zip(per_genome.iter()) {
        println!("{},{}", g.label, n);
    }
    println!("total,{total}");
 }
 fn run_bits_per_kmer(index_path: &PathBuf) {
    let idx = KmerIndex::open(index_path).unwrap_or_else(|e| {
        eprintln!("error opening index: {e}");
@@ -59,8 +314,10 @@ fn run_bits_per_kmer(index_path: &PathBuf) {
    println!("genomes   : {}", stats.n_genomes);
    println!("mphf      : {:6.2} bits/kmer", stats.mphf);
    println!("evidence  : {:6.2} bits/kmer", stats.evidence);
-    println!("matrix    : {:6.2} bits/kmer  ({:.2} bits/kmer/genome)",
+    println!(
-             stats.matrix, stats.matrix_per_genome);
+        "matrix    : {:6.2} bits/kmer  ({:.2} bits/kmer/genome)",
        stats.matrix, stats.matrix_per_genome
    );
    println!("total     : {:6.2} bits/kmer", stats.total);
 }
@@ -99,7 +356,6 @@ fn run_rename(index_path: &PathBuf, spec: &str) {
        std::process::exit(1);
    });
    // Check the new label is not already taken.
    if idx.meta().genomes.iter().any(|g| g.label == new_label) {
        eprintln!("error: label '{new_label}' already exists in index");
        std::process::exit(1);
@@ -111,7 +367,6 @@ fn run_rename(index_path: &PathBuf, spec: &str) {
        std::process::exit(1);
    });
    // Rename the spectrum file if it exists.
    let spectrums_dir = index_path.join("spectrums");
    let old_spectrum = spectrums_dir.join(format!("{old_label}.json"));
    let new_spectrum = spectrums_dir.join(format!("{new_label}.json"));
@@ -166,10 +166,10 @@ impl KmerPartition {
        n_dst_genomes: usize,
        block_bits: u8,
        evidence: &IndexMode,
-    ) -> SKResult<()> {
+    ) -> SKResult<usize> {
        let dst_index_dir = self.part_dir(i).join(INDEX_SUBDIR);
        if !dst_index_dir.exists() {
-            return Ok(());
+            return Ok(0);
        }
        load_meta(&dst_index_dir)?; // ensure meta.json exists before LayeredMap::open
@@ -381,6 +381,6 @@ impl KmerPartition {
            part_meta.save(&dst_index_dir).map_err(olm_to_sk)?;
        }
-        Ok(())
+        Ok(n_new)
    }
 }
@@ -1,5 +1,6 @@
 use std::fmt;
 use std::sync::atomic::{AtomicU64, Ordering};
 use std::sync::{Condvar, Mutex};
 use std::time::{Duration, Instant};
 use indicatif::{ProgressBar, ProgressStyle};
@@ -309,6 +310,60 @@ fn fmt_efficiency(par: f64, n_cores: usize) -> String {
 // ── Display ───────────────────────────────────────────────────────────────────
 // ── MemoryBudget ──────────────────────────────────────────────────────────────
 struct BudgetInner {
    remaining:   u64,
    active:      usize,
    peak_active: usize,
 }
 /// Weighted semaphore that caps the sum of estimated memory costs (in bytes)
 /// of concurrently running workers.
 ///
 /// Each worker acquires its cost before starting and releases it on
 /// completion. Non-deadlock guarantee: when no worker is active the next
 /// acquire always succeeds regardless of cost vs. remaining budget, so a job
 /// whose estimate exceeds the whole budget still runs, alone.
 pub struct MemoryBudget {
    total:   u64,
    inner:   Mutex<BudgetInner>,
    condvar: Condvar,
 }
 impl MemoryBudget {
    pub fn new(total: u64) -> Self {
        Self {
            total,
            inner:   Mutex::new(BudgetInner { remaining: total, active: 0, peak_active: 0 }),
            condvar: Condvar::new(),
        }
    }
    pub fn acquire(&self, cost: u64) {
        let mut g = self.inner.lock().unwrap();
        loop {
            if g.active == 0 || g.remaining >= cost {
                g.remaining     = g.remaining.saturating_sub(cost);
                g.active       += 1;
                g.peak_active   = g.peak_active.max(g.active);
                return;
            }
            g = self.condvar.wait(g).unwrap();
        }
    }
    pub fn release(&self, cost: u64) {
        let mut g = self.inner.lock().unwrap();
        g.remaining = (g.remaining + cost).min(self.total);
        g.active   -= 1;
        self.condvar.notify_all();
    }
    pub fn total(&self)       -> u64   { self.total }
    pub fn peak_active(&self) -> usize { self.inner.lock().unwrap().peak_active }
 }
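
One way to make sure every `acquire` is paired with a `release`, including on early return or panic, is to wrap the pair in an RAII guard. The following is a minimal sketch under stated assumptions, not the way the commit itself calls the budget: `BudgetGuard`, `run_workers`, and the `remaining()` accessor are hypothetical additions, and the `MemoryBudget` shown is a compact stand-in for the one in the diff so that the example compiles on its own.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// Compact stand-in for the MemoryBudget added in this diff.
struct Inner { remaining: u64, active: usize, peak_active: usize }

pub struct MemoryBudget { total: u64, inner: Mutex<Inner>, condvar: Condvar }

impl MemoryBudget {
    pub fn new(total: u64) -> Self {
        let inner = Inner { remaining: total, active: 0, peak_active: 0 };
        Self { total, inner: Mutex::new(inner), condvar: Condvar::new() }
    }
    pub fn acquire(&self, cost: u64) {
        let mut g = self.inner.lock().unwrap();
        // An idle budget admits any cost, so an oversized job cannot deadlock.
        while g.active != 0 && g.remaining < cost {
            g = self.condvar.wait(g).unwrap();
        }
        g.remaining = g.remaining.saturating_sub(cost);
        g.active += 1;
        g.peak_active = g.peak_active.max(g.active);
    }
    pub fn release(&self, cost: u64) {
        let mut g = self.inner.lock().unwrap();
        g.remaining = (g.remaining + cost).min(self.total);
        g.active -= 1;
        self.condvar.notify_all();
    }
    // Hypothetical accessor, added only so the sketch can be checked.
    pub fn remaining(&self) -> u64 { self.inner.lock().unwrap().remaining }
    pub fn peak_active(&self) -> usize { self.inner.lock().unwrap().peak_active }
}

/// Hypothetical RAII guard: acquires on construction, releases on drop.
pub struct BudgetGuard<'a> { budget: &'a MemoryBudget, cost: u64 }

impl<'a> BudgetGuard<'a> {
    pub fn acquire(budget: &'a MemoryBudget, cost: u64) -> Self {
        budget.acquire(cost);
        Self { budget, cost }
    }
}

impl Drop for BudgetGuard<'_> {
    fn drop(&mut self) { self.budget.release(self.cost); }
}

/// Spawns one thread per cost; returns (peak concurrency, budget left afterwards).
pub fn run_workers(total: u64, costs: &[u64]) -> (usize, u64) {
    let budget = Arc::new(MemoryBudget::new(total));
    let handles: Vec<_> = costs
        .iter()
        .map(|&cost| {
            let budget = Arc::clone(&budget);
            thread::spawn(move || {
                let _guard = BudgetGuard::acquire(&budget, cost);
                // Partition work would run here while the cost is held.
                thread::yield_now();
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    (budget.peak_active(), budget.remaining())
}
```

With a budget of 100 and four workers costing 60 each, at most one worker can hold the budget at a time, so `run_workers(100, &[60, 60, 60, 60])` reports a peak concurrency of 1 and a fully restored budget. A single worker costing 250 is still admitted because the budget is idle when it arrives.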
 // ── Display ───────────────────────────────────────────────────────────────────
 impl fmt::Display for Reporter {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        if self.stages.is_empty() { return Ok(()); }