obikmer/docmd/implementation/merge_parallelism.md
Eric Coissac b5e027f23b feat: add memory-aware parallel merge scheduling and CLI flags
Introduces a memory-aware scheduling strategy for parallel partition merging that replaces unbounded concurrency with a First-Fit Decreasing approach gated by a thread-safe `MemoryBudget` semaphore. An adaptive expansion factor, seeded by a sequential pilot run, dynamically caps concurrent workers to prevent hashbrown OOMs. Adds a `--budget-fraction` CLI flag to configure RAM allocation, enhances the CLI to accept multiple indexes, and introduces comprehensive partition diagnostics including memory utilization tracking, concurrency metrics, and statistical summaries with ASCII histograms. Updates documentation and navigation accordingly.
2026-06-12 11:44:10 +02:00


Merge parallelism and memory pressure

Problem observed

Running obikmer merge over 109 indexes (108 sources + 1 bootstrap) on a 192-core machine produces a fatal OOM during the merge_partitions stage:

memory allocation of 9126805520 bytes failed

A single allocation of ~8.5 GB (strictly 8.5 GiB, i.e. about 9.1 × 10^9 bytes) fails. This is not an aggregate; it is one malloc call from hashbrown during a HashMap resize.


Root cause

The merge pipeline per partition

source unitigs.bin
  → iter_indexed_canonical_kmers()
  → GraphDeBruijn::push()       ← HashSet<u64> + 1 byte flags, all in RAM
  → compute_degrees_and_mark_starts()
  → try_for_each_unitig()
  → unitigs.bin (new layer)
  → Layer::build() → MPHF + evidence

GraphDeBruijn is a FastHashMap<CanonicalKmer, AtomicU8> — a HashSet<u64> with one flag byte per node. Neighbor lookup is implicit: 4 probes into the same map. No edges are stored. The full kmer set of one partition must reside in RAM simultaneously to compute degrees and mark unitig starts.

The matrix builders that follow (pass 2) are mmapped files — they do not consume significant RAM. The pressure is entirely in pass 1.

Unbounded Rayon parallelism

With 192 cores, Rayon ran up to 192 partitions concurrently. Each partition built its own GraphDeBruijn accumulating all kmers absent from the destination. Peak memory = 192 × peak_partition_hashset.

The 8.5 GB single allocation

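hashbrown allocates the entire backing array in one call when rehashing. Its size is buckets × (size_of::<(K, V)>() + 1 control byte), plus one trailing group of control bytes, and the table grows once it is more than 7/8 full. For (u64, AtomicU8), alignment pads each entry to 16 bytes, so each slot costs 17 bytes.

9 126 805 520 bytes = 2^29 × 17 + 16 → 2^29 ≈ 537 M slots; a growing table reaches this size once it holds more than 7/8 × 2^28 ≈ 235 M new kmers in one partition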

Plausible for the largest partition of 108 Salix/Betula sources (~450 Mbp each).
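
The size of the failed allocation can be checked with a few lines of arithmetic. The sketch below assumes hashbrown's usual layout on x86-64 (one data slot and one control byte per bucket, plus a trailing 16-byte control group); `table_bytes` and `max_entries` are illustrative helpers, not hashbrown or obikmer functions.

```rust
use std::mem::size_of;

/// Assumed width of the trailing control group (16 bytes with SSE2).
const GROUP_WIDTH: u64 = 16;

/// Estimated size in bytes of one table allocation with `buckets` slots of `T`.
fn table_bytes<T>(buckets: u64) -> u64 {
    // one padded entry plus one control byte per bucket, then the trailing group
    buckets * (size_of::<T>() as u64 + 1) + GROUP_WIDTH
}

/// Number of entries a table with `buckets` slots can hold before it must grow
/// (7/8 load factor; `buckets` is a power of two >= 8 here).
fn max_entries(buckets: u64) -> u64 {
    buckets / 8 * 7
}
```

Under these assumptions, `table_bytes::<(u64, AtomicU8)>(1 << 29)` evaluates to 9 126 805 520, exactly the size reported in the crash message, and `max_entries(1 << 28)` is 234 881 024, the number of kmers beyond which a 2^28-slot table doubles to 2^29 slots.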


Partition size distribution

obikmer utils --partition-stats measures the sum of unitigs.bin file sizes per partition across all source indexes (pure stat() syscalls, negligible cost).

Observed on a 9-genome pilot (256 partitions):

| Stat | Value |
|---|---|
| min | 30.5 MB |
| max | 232.1 MB |
| mean | 40.1 MB |
| median | 37.2 MB |
| p95 | 47.1 MB |
| max/median ratio | 6.23× |
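
For reference, summary statistics of this kind can be derived from the per-partition byte totals as in the sketch below. The `SizeStats` type, the function names, and the nearest-rank percentile rule are assumptions made for illustration; the actual `--partition-stats` implementation may differ.

```rust
/// Summary of per-partition sizes in bytes (hypothetical type).
#[derive(Debug)]
struct SizeStats {
    min: u64,
    max: u64,
    mean: f64,
    median: u64,
    p95: u64,
}

/// Nearest-rank percentile on an already sorted, non-empty slice.
fn percentile(sorted: &[u64], p: f64) -> u64 {
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    sorted[rank.clamp(1, sorted.len()) - 1]
}

/// Returns `None` when there is no partition to summarize.
fn size_stats(sizes: &[u64]) -> Option<SizeStats> {
    if sizes.is_empty() {
        return None;
    }
    let mut sorted = sizes.to_vec();
    sorted.sort_unstable();
    let n = sorted.len();
    let sum: u64 = sorted.iter().sum();
    Some(SizeStats {
        min: sorted[0],
        max: sorted[n - 1],
        mean: sum as f64 / n as f64,
        median: percentile(&sorted, 50.0),
        p95: percentile(&sorted, 95.0),
    })
}
```

The max/median ratio in the table is then `stats.max as f64 / stats.median as f64`.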

The distribution is bimodal with a heavy tail:

  • 238/256 partitions in a narrow 30–50 MB band
  • 4 structurally extreme partitions (3–6× the median): 221, 233, 135, 191

These correspond to minimizers over-represented in repetitive regions shared across all sources. They are extreme in every run on this dataset.

With 109 sources, outlier partitions do not scale linearly: only kmers absent from the destination enter the GraphDeBruijn, and inter-source overlap is high for closely related species. Partition 221 is the likely trigger for the 8.5 GB crash.


Solution: FFD scheduling + memory budget semaphore

Principle

Pre-sort partitions by decreasing estimated size (First Fit Decreasing — FFD), then schedule them through a continuous memory budget semaphore. Each worker acquires an estimated cost before starting and releases it on completion.

Large partitions run first when the full budget is available; small partitions fill the gaps. No hard outlier threshold is needed.
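
The pre-sort step can be sketched as follows. The function name and the tie-breaking rule (lower partition index first) are assumptions for illustration; only the decreasing-size ordering comes from the description above.

```rust
/// Returns partition indices ordered by decreasing estimated size.
/// Ties are broken by index so the schedule is deterministic.
fn decreasing_order(partition_sizes: &[u64]) -> Vec<usize> {
    let mut order: Vec<usize> = (0..partition_sizes.len()).collect();
    order.sort_by(|&a, &b| {
        partition_sizes[b]
            .cmp(&partition_sizes[a]) // larger partitions first
            .then(a.cmp(&b)) // stable, index-based tie-break
    });
    order
}
```

The first element of the returned vector is the largest partition, and the remaining elements are handed to the workers in that order.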

MemoryBudget (obisys)

pub struct MemoryBudget { … }

impl MemoryBudget {
    pub fn new(total: u64) -> Self;
    pub fn acquire(&self, cost: u64);  // blocks until budget available
    pub fn release(&self, cost: u64);
    pub fn peak_active(&self) -> usize;
}

Non-deadlock guarantee: when active == 0, acquire always succeeds regardless of cost. Without this, a partition whose estimated cost exceeds the total budget would block forever.
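
One way to obtain this behaviour is a counter protected by a mutex and a condition variable. The following is a minimal sketch under that assumption, not the obisys implementation: the internal `State` struct and its fields are hypothetical, and only the four public signatures are taken from the API above.

```rust
use std::sync::{Condvar, Mutex};

#[derive(Default)]
struct State {
    used: u64,          // sum of the costs currently reserved
    active: usize,      // number of jobs holding a reservation
    peak_active: usize, // highest value `active` has reached
}

pub struct MemoryBudget {
    total: u64,
    state: Mutex<State>,
    freed: Condvar,
}

impl MemoryBudget {
    pub fn new(total: u64) -> Self {
        Self { total, state: Mutex::new(State::default()), freed: Condvar::new() }
    }

    /// Blocks until `cost` fits in the remaining budget or no job is active.
    pub fn acquire(&self, cost: u64) {
        let mut s = self.state.lock().unwrap();
        // An idle budget admits any job, even one whose cost exceeds `total`.
        while s.active > 0 && s.used.saturating_add(cost) > self.total {
            s = self.freed.wait(s).unwrap();
        }
        s.used = s.used.saturating_add(cost);
        s.active += 1;
        s.peak_active = s.peak_active.max(s.active);
    }

    /// Returns a reservation; `cost` must be the value passed to `acquire`.
    pub fn release(&self, cost: u64) {
        let mut s = self.state.lock().unwrap();
        s.used = s.used.saturating_sub(cost);
        s.active = s.active.saturating_sub(1);
        drop(s);
        self.freed.notify_all(); // wake every waiter; each re-checks its own cost
    }

    pub fn peak_active(&self) -> usize {
        self.state.lock().unwrap().peak_active
    }
}
```

`notify_all` is used rather than `notify_one` because waiters have different costs: the job that happens to be woken first may still not fit, while a smaller one would.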

Adaptive expansion factor

The expansion factor converts raw unitigs.bin bytes into an estimated GraphDeBruijn RAM footprint. hashbrown stores each kmer as (u64, AtomicU8) ≈ 16 bytes/kmer at 7/8 load factor; unitig files encode ≈ 2 bits/base. The ratio depends on average unitig length (short unitigs: ~2×; long unitigs: up to ~50×).
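
The dependence on unitig length follows from a simple count, sketched below under stated assumptions: a unitig of L bases contains L − k + 1 kmers, costs L/4 bytes on disk (2 bits per base, headers ignored), and each kmer costs 16 bytes in RAM. The function name and the choice of k = 31 in the usage note are illustrative only.

```rust
/// Assumed RAM cost of one kmer in the hash table, in bytes.
const BYTES_PER_KMER: f64 = 16.0;

/// Estimated RAM/disk ratio for unitigs of `unitig_len` bases and kmer size `k`.
fn expansion_factor(unitig_len: u64, k: u64) -> f64 {
    assert!(k >= 1 && unitig_len >= k, "a unitig holds at least one kmer");
    let kmers = (unitig_len - k + 1) as f64;
    let disk_bytes = unitig_len as f64 / 4.0; // 2 bits per base
    kmers * BYTES_PER_KMER / disk_bytes
}
```

With k = 31, a unitig made of a single kmer gives about 2.06×, a 150-base unitig gives 51.2×, and the ratio approaches 64× as unitigs grow longer, which is consistent with the range quoted above.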

Phase 1 — sequential pilot (worst partition)

The largest partition runs alone first. Its actual g.len() seeds the expansion factor before any parallel job starts. FALLBACK_EXPANSION = 4× is used only for empty partitions.

let worst_g_len = dst_partition.merge_partition(worst_id, …)?;
//                              ↑ now returns SKResult<usize> (was SKResult<()>)

// Expansion factors are kept in thousandths (fixed point ×1000) so they fit in an AtomicU64.
let seed_expansion = worst_g_len as u64 * 16 * 1000 / worst_bytes;
let max_expansion = AtomicU64::new(seed_expansion);

Phase 2 — parallel with adaptive updates

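order[1..].par_iter().try_for_each(|&i| -> SKResult<()> {
    // Reserve RAM using the current (pessimistic) expansion estimate, in thousandths.
    let cost = partition_sizes[i] * max_expansion.load(Relaxed) / 1000;
    budget.acquire(cost);
    let result = dst_partition.merge_partition(i, …);
    budget.release(cost);          // releases estimated cost, not actual; runs even on error
    let g_len = result?;

    if partition_sizes[i] > 0 {    // empty partitions carry no information
        let actual = g_len as u64 * 16 * 1000 / partition_sizes[i];
        max_expansion.fetch_max(actual, Relaxed);  // always pessimistic (max)
    }
    Ok(())
})?;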

budget.release(cost) uses the estimated cost, not the actual one. The budget tracks reservations, not physical RAM; each partition pays what it promised at acquisition.
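
The fixed-point bookkeeping can be isolated into three small helpers, shown here as a sketch. The helper names are hypothetical; the ×1000 scaling, the 16 bytes per kmer, and the `fetch_max` update mirror the snippets above.

```rust
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};

/// Observed expansion factor in thousandths, or `None` for an empty partition.
fn observed_expansion(g_len: u64, file_bytes: u64) -> Option<u64> {
    if file_bytes == 0 {
        return None; // avoids a division by zero
    }
    Some(g_len * 16 * 1000 / file_bytes)
}

/// Cost to reserve for a partition, using the current maximum expansion factor.
fn estimated_cost(file_bytes: u64, max_expansion: &AtomicU64) -> u64 {
    file_bytes * max_expansion.load(Relaxed) / 1000
}

/// Raises the shared maximum if this partition expanded more than any before it.
fn record(max_expansion: &AtomicU64, g_len: u64, file_bytes: u64) {
    if let Some(x) = observed_expansion(g_len, file_bytes) {
        max_expansion.fetch_max(x, Relaxed); // never decreases
    }
}
```

Because the maximum only grows, a partition that starts later never reserves less per byte than one that started earlier. Each worker must still release the cost it computed at acquisition, since the factor may have changed in between.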

On the safety margin

There is no separate safety multiplier k on the estimated cost. It would be redundant with budget_fraction: multiplying every cost by k reduces effective concurrency exactly as dividing the budget by k does. A single parameter is easier to calibrate. budget_fraction = 0.5 (default) reserves half of available RAM for the OS, MPHF build, pass 2, and estimation error.

--budget-fraction is exposed as a CLI flag — the only escape hatch for pathological cases (extreme repetitive content, unusually long unitigs) that still cause OOM.

RAM source

obisys::available_memory_bytes() — wraps sysinfo::System::available_memory(), falls back to total / 2 on macOS when the memory compressor returns 0.
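
Putting the two parameters together, the total budget is the usable RAM scaled by the fraction. The sketch below is an assumption about how the pieces combine, not the obisys code: `memory_budget_bytes` is a hypothetical helper, and the `available` and `total` values are passed in rather than read from the OS so that the example has no external dependency.

```rust
/// Total reservation budget in bytes.
/// `available` and `total` are the figures reported by the OS.
fn memory_budget_bytes(available: u64, total: u64, budget_fraction: f64) -> u64 {
    // Assumption: the fraction is a proportion in (0, 1].
    assert!(budget_fraction > 0.0 && budget_fraction <= 1.0);
    // Fallback described above: if no available figure is reported, use total / 2.
    let usable = if available == 0 { total / 2 } else { available };
    (usable as f64 * budget_fraction) as u64
}
```

With 512 GB available and the default fraction of 0.5, this yields a 256 GB budget.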


Diagnostic report

After the parallel phase, merge_partitions emits a structured report via tracing::info!:

─── merge_partitions memory report ───
  available RAM : 512.0 GB   budget 50% = 256.0 GB
  expansion factor — seed: 4.2×  final max: 6.1×  (mean: 1.8×  median: 1.6×)
  peak concurrent workers: 42
  expansion factor distribution (256 partitions with data):
     0.50× –  1.25× │██████████████████████████████ 148
     1.25× –  2.00× │████████████████████████        82
     …
     5.50× –  6.25× │█                                2
  top partitions by actual expansion factor:
    partition  221 : 6.10×  (232.1 MB unitigs → 48M kmers, reserved at 4.20×)
    partition  135 : 5.82×  (127.3 MB unitigs → 24M kmers, reserved at 4.20×)
    …
──────────────────────────────────────

Fields useful for diagnosis:

| Field | Interpretation |
|---|---|
| seed vs final max expansion | gap indicates partitions with higher expansion than the worst-by-size |
| reserved at X× | the factor used at acquisition; if much lower than actual, the budget was under-reserved for that partition |
| peak concurrent workers | effective parallelism achieved under the budget constraint |
| mean / median expansion | typical dataset characteristic; stable across runs on the same data |
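
A histogram in the style of the report can be produced with fixed-width bins, as in the sketch below. The function name, its parameters, and the bar-scaling rule are assumptions for illustration and are not taken from the obikmer source.

```rust
/// Renders one line per bin, e.g. "  0.50× –  1.25× │████ 12".
/// Values below `lo`, beyond the last bin, or NaN are ignored.
fn ascii_histogram(values: &[f64], lo: f64, bin_width: f64, bins: usize, max_bar: usize) -> Vec<String> {
    let mut counts = vec![0usize; bins];
    for &v in values {
        if !(v >= lo) {
            continue; // also skips NaN
        }
        let idx = ((v - lo) / bin_width) as usize;
        if idx < bins {
            counts[idx] += 1;
        }
    }
    // Scale bars so the fullest bin uses `max_bar` characters.
    let peak = counts.iter().copied().max().unwrap_or(0).max(1);
    counts
        .iter()
        .enumerate()
        .map(|(i, &c)| {
            let start = lo + i as f64 * bin_width;
            let bar = "█".repeat(c * max_bar / peak);
            format!("{:>6.2}× – {:>5.2}× │{} {}", start, start + bin_width, bar, c)
        })
        .collect()
}
```

Bins with very few entries may render an empty bar after integer scaling; the trailing count keeps them visible.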

Parameters

| Parameter | Default | CLI flag | Notes |
|---|---|---|---|
| fallback_expansion | 4× | | seed for empty partitions only |
| budget_fraction | 0.5 | --budget-fraction | reduce if OOM persists |
| RAM source | obisys::available_memory_bytes() | | falls back to total/2 on macOS |