Files

T

Eric Coissac ea767376bd feat: implement NUMA-aware worker pools for merge command

Replaces the global Rayon pool with per-NUMA-node thread pools that pin worker threads to their respective nodes, leveraging Linux first-touch allocation to reduce cross-NUMA memory contention and improve cache locality. Integrates the `hwlocality` crate with a vendored build, includes graceful fallbacks for single-socket or non-Linux systems, and updates dependency constraints. Also adds installation and architecture documentation, and corrects parallelism detection in the partitioner.

2026-06-14 23:56:52 +02:00

5.4 KiB

Raw Blame History

NUMA-aware worker pools for merge

Problem

The merge command's bottleneck is compute_degrees in obidebruinj: a random pointer-chase over 20–70 M node hash maps that saturates DRAM bandwidth. When multiple partition workers run concurrently, they contend for the shared memory bus, causing super-linear slowdown (measured: 0.016 µs/node solo → 0.95 µs/node with 4–5 concurrent workers, ×60 degradation).

Modern HPC nodes are multi-socket NUMA machines (observed: 2 sockets × 4 NUMA nodes × 24 cores = 192 cores). Cross-NUMA memory traffic compounds the contention:

Full 192-core run: ~15 min/partition (×10 worse than M3 Mac)
taskset restricted to 4 NUMA nodes (96 cores): ~90 s/partition
OAR job on 1 NUMA node (24 cores): ~80 s/partition, same throughput as 96 cores

Conclusion: the bottleneck is memory bandwidth per NUMA node, not core count. 24 cores on one NUMA node achieve the same throughput as 96 cores across four.

Strategy

Run N worker groups in parallel, one per NUMA node, each with its own Rayon thread pool whose threads are pinned to the NUMA node's CPUs. Linux's first-touch policy then places graph allocations on local DRAM automatically — no explicit NUMA allocator needed.

Expected throughput: N × single-NUMA throughput. On the 8-NUMA-node HPC: 8 × ~80 s = 9–10 min total instead of >60 min with the current single-pool approach.

Rayon thread pool isolation

Rayon provides ThreadPool::install(|| { ... }): any Rayon call (par_iter, current_num_threads, etc.) inside the closure uses that pool exclusively. Wrapping merge_partition in pool.install() redirects all downstream Rayon calls — including those in debruijn.rs and partition.rs — without touching those crates.

// worker thread, assigned to NUMA pool `pool`
pool.install(|| {
    dst_partition.merge_partition(i, srcs, mode, n_dst_genomes, block_bits, evidence)
})

rayon::current_num_threads() inside merge_partition will return the pool size (e.g. 24), not the global thread count — which is the right value for buffer sizing.

Thread pinning

ThreadPoolBuilder::spawn_handler provides a hook executed for each thread at creation. Inside, libc::sched_setaffinity pins the thread to a CPU set:

let cpus: Vec<usize> = numa_node_cpus(node); // from /sys/devices/system/node/nodeN/cpulist
rayon::ThreadPoolBuilder::new()
    .num_threads(cpus.len())
    .spawn_handler(move |thread| {
        let mut b = std::thread::Builder::new();
        std::thread::Builder::new().spawn(move || {
            pin_to_cpus(&cpus);   // sched_setaffinity via libc
            thread.run()
        })?;
        Ok(())
    })
    .build()?

NUMA topology is read from /sys/devices/system/node/node*/cpulist — no libnuma dependency required. If the numa crate is linked, numa_available() / numa_run_on_node() are an alternative.

Memory locality

Linux allocates pages on the NUMA node of the thread that first writes them (first-touch policy). Once Rayon threads are pinned to node N, all graph data built by those threads lands on node N's DRAM. No changes to the allocator, no explicit numa_alloc_onnode calls.

Adaptive spawn criterion

The current criterion uses std::thread::available_parallelism() (returns total cores = 192) and max_workers = n_cores / 2. With NUMA pools:

n_cores per pool = cores per NUMA node (e.g. 24)
max_workers per pool = pool size / 2 (e.g. 12)
CPU efficiency is measured per pool, not globally

Each NUMA group runs its own independent adaptive pool. Workers are distributed across NUMA groups round-robin or by workload (partition assignment can be pre-split by NUMA group index).

Required changes

File	Change
`obikindex/src/merge.rs`	Detect NUMA topology; build N `ThreadPool`s with pinned threads; assign each pre-spawned worker to a pool; wrap `merge_partition` in `pool.install()`
`obikindex/src/merge.rs`	Replace `available_parallelism()` with per-NUMA core count for spawn criterion
`obikpartitionner/src/merge_layer.rs`	No change — `merge_partition` already works inside any Rayon context
`obidebruinj/src/debruijn.rs`	No change — `par_iter` and `current_num_threads` are pool-context-aware
`obikpartitionner/src/partition.rs`	No change — same reason

Platform guard

NUMA pinning is Linux-only. The fallback is the current single global pool:

#[cfg(target_os = "linux")]
fn build_numa_pools() -> Option<Vec<rayon::ThreadPool>> { ... }

#[cfg(not(target_os = "linux"))]
fn build_numa_pools() -> Option<Vec<rayon::ThreadPool>> { None }

When build_numa_pools() returns None (macOS, UMA, or single-socket), merge.rs uses the existing code path unchanged.

Open questions

Partition assignment: split partitions by NUMA group up-front (static) or use a shared queue with per-group workers stealing from a common pool? Static split is simpler; stealing is better for load balance when partitions vary widely in size.
Intra-NUMA adaptive criterion: with 24 cores and ~3–5 effective workers per NUMA node, the current marginal-gain criterion needs re-tuning or can be left as-is with per-pool n_cores = 24.
I/O: partition data (unitig files) is on a shared filesystem. With 8 concurrent NUMA groups, I/O concurrency increases 8× — need to verify the filesystem (Lustre or local SSD) can absorb it without becoming the new bottleneck.

5.4 KiB Raw Blame History Unescape Escape