feat: implement NUMA-aware worker pools for merge command

Replaces the global Rayon pool with per-NUMA-node thread pools that pin worker threads to their respective nodes, leveraging Linux first-touch allocation to reduce cross-NUMA memory contention and improve cache locality. Integrates the `hwlocality` crate with a vendored build, includes graceful fallbacks for single-socket or non-Linux systems, and updates dependency constraints. Also adds installation and architecture documentation, and corrects parallelism detection in the partitioner.
2026-06-14 23:40:09 +02:00
parent f1d76f3203
commit ea767376bd
9 changed files with 654 additions and 34 deletions
@@ -0,0 +1,97 @@
+# NUMA-aware worker pools for merge
+
+## Problem
+
+The merge command's bottleneck is `compute_degrees` in `obidebruinj`: a random pointer-chase over 20–70 M node hash maps that saturates DRAM bandwidth. When multiple partition workers run concurrently, they contend for the shared memory bus, causing super-linear slowdown (measured: 0.016 µs/node solo → 0.95 µs/node with 4–5 concurrent workers, ×60 degradation).
+
+Modern HPC nodes are multi-socket NUMA machines (observed: 2 sockets × 4 NUMA nodes × 24 cores = 192 cores). Cross-NUMA memory traffic compounds the contention:
+
+- Full 192-core run: ~15 min/partition (×10 worse than M3 Mac)
+- `taskset` restricted to 4 NUMA nodes (96 cores): ~90 s/partition
+- OAR job on 1 NUMA node (24 cores): ~80 s/partition, same throughput as 96 cores
+
+**Conclusion**: the bottleneck is memory bandwidth per NUMA node, not core count. 24 cores on one NUMA node achieve the same throughput as 96 cores across four.
+
+## Strategy
+
+Run N worker groups in parallel, one per NUMA node, each with its own Rayon thread pool whose threads are pinned to the NUMA node's CPUs. Linux's first-touch policy then places graph allocations on local DRAM automatically — no explicit NUMA allocator needed.
+
+Expected throughput: N × single-NUMA throughput. On the 8-NUMA-node HPC: 8 × ~80 s = 9–10 min total instead of >60 min with the current single-pool approach.
+
+## Rayon thread pool isolation
+
+Rayon provides `ThreadPool::install(|| { ... })`: any Rayon call (`par_iter`, `current_num_threads`, etc.) inside the closure uses *that* pool exclusively. Wrapping `merge_partition` in `pool.install()` redirects all downstream Rayon calls — including those in `debruijn.rs` and `partition.rs` — without touching those crates.
+
+```rust
+// worker thread, assigned to NUMA pool `pool`
+pool.install(|| {
+    dst_partition.merge_partition(i, srcs, mode, n_dst_genomes, block_bits, evidence)
+})
+```
+
+`rayon::current_num_threads()` inside `merge_partition` will return the pool size (e.g. 24), not the global thread count — which is the right value for buffer sizing.
+
+## Thread pinning
+
+`ThreadPoolBuilder::spawn_handler` provides a hook executed for each thread at creation. Inside, `libc::sched_setaffinity` pins the thread to a CPU set:
+
+```rust
+let cpus: Vec<usize> = numa_node_cpus(node); // from /sys/devices/system/node/nodeN/cpulist
+rayon::ThreadPoolBuilder::new()
+    .num_threads(cpus.len())
+    .spawn_handler(move |thread| {
+        let mut b = std::thread::Builder::new();
+        std::thread::Builder::new().spawn(move || {
+            pin_to_cpus(&cpus);   // sched_setaffinity via libc
+            thread.run()
+        })?;
+        Ok(())
+    })
+    .build()?
+```
+
+NUMA topology is read from `/sys/devices/system/node/node*/cpulist` — no `libnuma` dependency required. If the `numa` crate is linked, `numa_available()` / `numa_run_on_node()` are an alternative.
+
+## Memory locality
+
+Linux allocates pages on the NUMA node of the thread that first writes them (first-touch policy). Once Rayon threads are pinned to node N, all graph data built by those threads lands on node N's DRAM. No changes to the allocator, no explicit `numa_alloc_onnode` calls.
+
+## Adaptive spawn criterion
+
+The current criterion uses `std::thread::available_parallelism()` (returns total cores = 192) and `max_workers = n_cores / 2`. With NUMA pools:
+
+- `n_cores` per pool = cores per NUMA node (e.g. 24)
+- `max_workers` per pool = pool size / 2 (e.g. 12)
+- CPU efficiency is measured per pool, not globally
+
+Each NUMA group runs its own independent adaptive pool. Workers are distributed across NUMA groups round-robin or by workload (partition assignment can be pre-split by NUMA group index).
+
+## Required changes
+
+| File | Change |
+|------|--------|
+| `obikindex/src/merge.rs` | Detect NUMA topology; build N `ThreadPool`s with pinned threads; assign each pre-spawned worker to a pool; wrap `merge_partition` in `pool.install()` |
+| `obikindex/src/merge.rs` | Replace `available_parallelism()` with per-NUMA core count for spawn criterion |
+| `obikpartitionner/src/merge_layer.rs` | No change — `merge_partition` already works inside any Rayon context |
+| `obidebruinj/src/debruijn.rs` | No change — `par_iter` and `current_num_threads` are pool-context-aware |
+| `obikpartitionner/src/partition.rs` | No change — same reason |
+
+## Platform guard
+
+NUMA pinning is Linux-only. The fallback is the current single global pool:
+
+```rust
+#[cfg(target_os = "linux")]
+fn build_numa_pools() -> Option<Vec<rayon::ThreadPool>> { ... }
+
+#[cfg(not(target_os = "linux"))]
+fn build_numa_pools() -> Option<Vec<rayon::ThreadPool>> { None }
+```
+
+When `build_numa_pools()` returns `None` (macOS, UMA, or single-socket), `merge.rs` uses the existing code path unchanged.
+
+## Open questions
+
+- **Partition assignment**: split partitions by NUMA group up-front (static) or use a shared queue with per-group workers stealing from a common pool? Static split is simpler; stealing is better for load balance when partitions vary widely in size.
+- **Intra-NUMA adaptive criterion**: with 24 cores and ~3–5 effective workers per NUMA node, the current marginal-gain criterion needs re-tuning or can be left as-is with per-pool `n_cores = 24`.
+- **I/O**: partition data (unitig files) is on a shared filesystem. With 8 concurrent NUMA groups, I/O concurrency increases 8× — need to verify the filesystem (Lustre or local SSD) can absorb it without becoming the new bottleneck.
@@ -0,0 +1,68 @@
+# Installation
+
+## Prerequisites
+
+### Rust toolchain
+
+`obikmer` requires **Rust 1.85 or later** (edition 2024). Install or update via [rustup](https://rustup.rs):
+
+```bash
+curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
+rustup update stable
+```
+
+### C build environment (required for hwloc)
+
+`obikmer` embeds [hwloc](https://www.open-mpi.org/projects/hwloc/) (Hardware Locality) for NUMA-aware thread placement on multi-socket machines. hwloc is built from source at compile time via the `vendored` feature of the `hwlocality` crate. This requires a standard C build environment.
+
+#### Linux (Debian/Ubuntu)
+
+```bash
+apt install build-essential automake libtool autoconf pkg-config
+```
+
+#### Linux (RHEL/Rocky/AlmaLinux)
+
+```bash
+dnf install gcc make automake libtool autoconf pkgconfig
+```
+
+#### HPC clusters
+
+Most HPC clusters provide these tools via the module system:
+
+```bash
+module load gcc automake libtool autoconf
+```
+
+If in doubt, check whether `autoreconf --version` and `libtool --version` return successfully.
+
+#### macOS
+
+```bash
+brew install automake libtool autoconf pkg-config
+```
+
+## Building
+
+```bash
+git clone <repository-url>
+cd obikmer/src
+cargo build --release
+```
+
+The compiled binary is at `target/release/obikmer`.
+
+## NUMA support
+
+NUMA-aware thread placement is active automatically on multi-socket Linux machines (detected at runtime via hwloc). No special build flag is required — the detection is built in and falls back gracefully to the single-pool adaptive strategy on:
+
+- macOS (Apple Silicon, unified memory)
+- single-socket Linux machines
+- any system where hwloc reports only one NUMA node
+
+## Verifying the installation
+
+```bash
+obikmer --help
+```