Eric Coissac 7a87e911b6 feat: introduce NUMA-aware PartitionRunner for adaptive parallelism
Replace NUMA-naive Rayon loops and ad-hoc adaptive pools with a unified `PartitionRunner` that manages a NUMA-aware worker pool. The implementation uses pinned Rayon thread pools per node and activates dormant threads based on real-time CPU efficiency metrics. This standardizes partition-level parallelism, optimizes memory locality, and eliminates cross-socket traffic. Includes architecture documentation and updates mkdocs navigation.
2026-06-15 11:34:41 +02:00


# NUMA-aware partition runner
## Problem
All partition-level parallel loops in obikindex currently fall into two
categories:

**Naive Rayon** (used in `build_layers`, `pack_matrices`, `dump`, `select`,
`stats`, `rebuild`, `reindex`):
```rust
(0..n).into_par_iter().for_each(|i| work(i));
```
Threads come from the global Rayon pool with no NUMA awareness. On
multi-socket machines this produces cross-socket memory traffic and degrades
performance super-linearly (see [NUMA-aware worker pools](numa_worker_pools.md)).

**Ad-hoc adaptive pool** (used in `merge`): a bespoke implementation with
pre-spawned workers, channel-based dispatch, and activation control. It handles
NUMA correctly but is not reusable.

Both cases should be replaced by a single generic mechanism.
## Unified model
The key insight is that **UMA is just the NUMA case with a single node**. The
runner always works the same way: one controller thread per node, each
independently managing its own workers with the same adaptive logic. The only
difference between UMA and NUMA is the number of nodes and whether workers are
pinned.
```
NUMA (k nodes)                     UMA (1 node)
controller-0    controller-1    …  controller-0
     │               │                  │
 workers[0]      workers[1]         workers[0]
  (pinned)        (pinned)        (global pool)
     └───────────────┴──────────────────┘
              shared work queue
```
On each node, the Rayon `ThreadPool` is pinned to that node's CPUs.
`pool.install()` ensures that all internal Rayon calls (inside the work
function) use the node-local pool. The Linux first-touch policy then places
heap pages in the DRAM local to the thread that first writes them.

On UMA the global Rayon pool is used directly, with no pinning and no overhead.
## Adaptive mechanism
Each controller follows the same logic regardless of node count:
1. Pre-spawn `workers_per_node` dormant worker threads (blocked on `activate_rx`).
2. Activate the first worker immediately.
3. Loop on the result channel with a `SPAWN_POLL` timeout:
   - On result: call `on_done`; check whether to activate the next worker.
   - On timeout: same check.
   - Activation criterion: `should_spawn_worker(active, global_efficiency, prev_efficiency)`.
4. Drop `activate_tx` when done, so that dormant workers exit cleanly.
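
The four steps above can be sketched with standard-library channels only. This is a minimal illustration, not the actual controller: `run_node`, `should_spawn_worker_stub`, and the 20 ms `SPAWN_POLL` value are assumptions made for the example, and the stub ignores CPU efficiency and only caps activation at `max_workers`.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical poll interval; the real `SPAWN_POLL` value is not stated here.
const SPAWN_POLL: Duration = Duration::from_millis(20);

/// Simplified stand-in for the real criterion: efficiency inputs are omitted.
fn should_spawn_worker_stub(active: usize, max_workers: usize) -> bool {
    active < max_workers
}

/// Runs `work(i)` for every index in `order`; results come back in completion order.
pub fn run_node(order: Vec<usize>, max_workers: usize, work: fn(usize) -> u64) -> Vec<(usize, u64)> {
    let total = order.len();
    let queue = Arc::new(Mutex::new(order.into_iter()));
    let (result_tx, result_rx) = mpsc::channel();

    // Step 1: pre-spawn dormant workers, each blocked on its activation channel.
    let mut activators = Vec::new();
    let mut handles = Vec::new();
    for _ in 0..max_workers {
        let (activate_tx, activate_rx) = mpsc::channel::<()>();
        let (queue, result_tx) = (Arc::clone(&queue), result_tx.clone());
        handles.push(thread::spawn(move || {
            // An error means the sender was dropped: exit without doing any work.
            if activate_rx.recv().is_err() {
                return;
            }
            loop {
                // The lock is released at the end of this statement.
                let next = queue.lock().unwrap().next();
                match next {
                    Some(i) => {
                        if result_tx.send((i, work(i))).is_err() {
                            return;
                        }
                    }
                    None => return,
                }
            }
        }));
        activators.push(activate_tx);
    }
    drop(result_tx);

    // Step 2: activate the first worker immediately.
    let mut activators = activators.into_iter();
    let mut active = 0;
    if let Some(tx) = activators.next() {
        let _ = tx.send(());
        active += 1;
    }

    // Step 3: re-check the activation criterion on every result and every timeout.
    let mut results = Vec::with_capacity(total);
    while results.len() < total {
        match result_rx.recv_timeout(SPAWN_POLL) {
            Ok(r) => results.push(r), // the real runner would call `on_done` here
            Err(RecvTimeoutError::Timeout) => {}
            Err(RecvTimeoutError::Disconnected) => break,
        }
        if should_spawn_worker_stub(active, max_workers) {
            if let Some(tx) = activators.next() {
                let _ = tx.send(());
                active += 1;
            }
        }
    }

    // Step 4: dropping the remaining senders wakes dormant workers with an error.
    drop(activators);
    for handle in handles {
        handle.join().unwrap();
    }
    results
}
```

Workers that are never activated cost one blocked thread each and no CPU time, which is what makes pre-spawning cheap compared with creating threads on demand.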
**Global CPU efficiency** (`CpuSample`, reads `/proc/stat` on Linux) is used by
all controllers — no per-node measurement needed. The signal is coarser than
per-node efficiency but correct in practice: if any node saturates memory
bandwidth, the global efficiency drops and all controllers stop activating
workers. Using a standard portable primitive avoids platform-specific CPU
accounting and keeps the implementation clean.
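
As an illustration of the global efficiency signal, the sketch below derives a busy fraction from two snapshots of the aggregate `cpu` line in `/proc/stat`. `CpuTimes` is a hypothetical stand-in for `CpuSample`, whose fields are not described here, and counting `iowait` as idle time is an assumption of this example.

```rust
/// Aggregate CPU counters from the first line of `/proc/stat`, in clock ticks.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct CpuTimes {
    pub busy: u64,
    pub total: u64,
}

impl CpuTimes {
    /// Parses the aggregate `cpu` line; returns `None` if it is absent or malformed.
    pub fn parse(proc_stat: &str) -> Option<Self> {
        // "cpu " (with a space) matches the aggregate line but not "cpu0", "cpu1", ...
        let line = proc_stat.lines().find(|l| l.starts_with("cpu "))?;
        let fields: Vec<u64> = line
            .split_whitespace()
            .skip(1)
            .map(|f| f.parse().ok())
            .collect::<Option<_>>()?;
        if fields.len() < 5 {
            return None;
        }
        // Field order: user nice system idle iowait irq softirq steal [guest guest_nice].
        // Guest time is already included in user time, so only eight fields are summed.
        let total: u64 = fields.iter().take(8).sum();
        let idle = fields[3] + fields[4]; // idle + iowait (assumption: iowait is not busy)
        Some(Self { busy: total - idle, total })
    }

    /// Busy fraction between `earlier` and `self`, or `None` if no ticks elapsed.
    pub fn efficiency_since(&self, earlier: &Self) -> Option<f64> {
        let total = self.total.checked_sub(earlier.total)?;
        let busy = self.busy.checked_sub(earlier.busy)?;
        if total == 0 {
            None
        } else {
            Some(busy as f64 / total as f64)
        }
    }
}

/// Reads the live counters; returns `None` where `/proc/stat` does not exist.
pub fn sample() -> Option<CpuTimes> {
    CpuTimes::parse(&std::fs::read_to_string("/proc/stat").ok()?)
}
```

A controller would take one sample per poll and compare the resulting fraction with the previous one before deciding whether to activate another worker.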
## Proposed API
```rust
use std::sync::Arc;
use std::time::Duration;

// API sketch: signatures only, bodies omitted.
pub struct PartitionRunner {
    // One entry per NUMA node; one entry total on UMA.
    nodes: Vec<NodeConfig>,
}

struct NodeConfig {
    pool: Option<Arc<rayon::ThreadPool>>, // None = global Rayon pool (UMA)
    cpu_ids: Vec<usize>,                  // empty = no pinning (UMA)
    max_workers: usize,
}

impl PartitionRunner {
    /// Detect topology and build the runner.
    /// Returns a single-node runner on UMA / macOS / hwloc failure.
    pub fn new() -> Self;

    /// Run `f(i)` for every index in `order`, passing each result to `on_done`.
    ///
    /// `on_done(i, result, elapsed)` is called under an internal mutex as
    /// each partition completes; use it for progress bars and aggregation.
    /// The runner serialises all calls to `on_done` via an internal
    /// `Arc<Mutex<C>>`, so no `Sync` bound is required on the callback.
    /// `Send` is required because the Arc clone crosses thread boundaries.
    ///
    /// Serialisation is free in practice: a partition takes seconds to
    /// minutes; the callback takes microseconds. Contention is negligible.
    ///
    /// Returns the first error from `f`, if any.
    pub fn run<F, R, E, C>(
        &self,
        order: &[usize],
        f: F,
        on_done: C,
    ) -> Result<(), E>
    where
        F: Fn(usize) -> Result<R, E> + Send + Sync,
        R: Send,
        E: Send,
        C: FnMut(usize, R, Duration) + Send; // Send required, Sync is not
}
```
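
The bounds on `C` can be checked with a small standard-library sketch. `run_serialised` is a hypothetical helper, not part of the proposed API: it spawns one scoped thread per index, drops the elapsed-time argument, and shows that an `FnMut` callback that is `Send` but not `Sync` can be shared through `Arc<Mutex<C>>`.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Calls `f(i)` on one thread per index and funnels every result through
/// `on_done`, which is only ever invoked while the mutex is held.
pub fn run_serialised<R, C>(indices: &[usize], f: fn(usize) -> R, on_done: C)
where
    R: Send,
    C: FnMut(usize, R) + Send,
{
    let callback = Arc::new(Mutex::new(on_done));
    thread::scope(|s| {
        for &i in indices {
            let callback = Arc::clone(&callback);
            s.spawn(move || {
                // The expensive part runs outside the lock.
                let result = f(i);
                // Only the cheap callback is serialised.
                let mut guard = callback.lock().unwrap();
                (*guard)(i, result);
            });
        }
    });
}
```

Because `Mutex<C>` is `Sync` whenever `C` is `Send`, the callback may capture a plain `&mut Vec` and call `push` on it without any atomic or lock of its own.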
`order` is caller-supplied so each command chooses its scheduling strategy:
largest-first for `merge`, sequential for `build_layers`, etc.
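
Both strategies named above reduce to building an index vector before calling `run`. The helpers below are illustrative only: `largest_first` and `sequential` are assumed names, and `sizes` stands for whatever per-partition size estimate the caller has available.

```rust
use std::cmp::Reverse;

/// Partition indices sorted by decreasing size; equal sizes keep index order.
pub fn largest_first(sizes: &[u64]) -> Vec<usize> {
    let mut order: Vec<usize> = (0..sizes.len()).collect();
    // `sort_by_key` is stable, so ties stay in ascending index order.
    order.sort_by_key(|&i| Reverse(sizes[i]));
    order
}

/// Partition indices in their natural order.
pub fn sequential(n: usize) -> Vec<usize> {
    (0..n).collect()
}
```

Starting the largest partitions first is a common way to keep one long-running partition from dominating the tail of a run.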
## Migration examples
### merge.rs (before: ~180 lines of bespoke machinery)
```rust
let runner = PartitionRunner::new();
runner.run(
    &order,
    |i| dst_partition.merge_partition(i, srcs, mode, n_dst_genomes, block_bits, evidence)
        .map_err(OKIError::Partition),
    |i, g_len, dur| {
        pb.inc(1);
        debug!("partition {i}: done in {:.1}s — {g_len} new kmers", dur.as_secs_f64());
        part_stats.push(PartStat { id: i, unitig_bytes: partition_sizes[i], g_len });
    },
)?;
```
### index.rs build_layers (before: naive into_par_iter)
```rust
let order: Vec<usize> = (0..n).collect();
let runner = PartitionRunner::new();
runner.run(
    &order,
    |i| self.partition.build_index_layer(i, min_ab, max_ab, with_counts, &evidence, block_bits)
        .map_err(OKIError::Partition),
    |_, n_kmers, _| {
        total_kmers.fetch_add(n_kmers, Ordering::Relaxed);
        pb.inc(1);
    },
)?;
```
All other sites (`pack_matrices`, `dump`, `select`, etc.) follow the same
pattern.
## Placement
`PartitionRunner` lives in `obikindex/src/numa.rs` alongside `NumaSetup`.
It depends only on standard library primitives and Rayon — no new dependencies.
A single `PartitionRunner` instance can be built once per command invocation
and reused across multiple `run()` calls (e.g. `merge` runs
`merge_partitions` then `pack_matrices`).
## Open questions
- **Error handling**: `run` currently returns the first error; remaining errors
are dropped. A `Vec<E>` return would give complete diagnostics.
- **`workers_per_node` tuning**: currently `(cpus / 8).max(3).min(8)`, calibrated
for merge on BeeGFS. I/O-bound commands (`dump`, `select`) may benefit from
a higher value. A per-call override could be added to the API.
- **`on_done` ordering**: the runner serialises calls to `on_done` via an
  internal `Arc<Mutex<C>>`, so callbacks arrive in completion order rather than
  in the order given by `order`. `Send` is required (the Arc clone crosses
  thread boundaries); `Sync` is not (only one thread holds the lock at a time).
  Contention is negligible because a partition takes seconds while the callback
  takes microseconds. The callback is therefore simple to write (plain
  `Vec::push`, plain `FnMut`) with no measurable performance cost.
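
For the `workers_per_node` question in the list above, the current heuristic and a possible per-call override can be written out as follows. The `override_workers` parameter is hypothetical, and the text does not say whether `cpus` counts CPUs per node or per machine, so the sketch leaves that to the caller.

```rust
/// Current heuristic `(cpus / 8).max(3).min(8)`, with an optional override.
/// A zero override is ignored so that at least one worker can always run.
pub fn workers_per_node(cpus: usize, override_workers: Option<usize>) -> usize {
    override_workers
        .filter(|&n| n > 0)
        .unwrap_or_else(|| (cpus / 8).max(3).min(8))
}
```

With no override, 8 CPUs give 3 workers (the floor), 32 CPUs give 4, and 128 CPUs give 8 (the cap); an I/O-bound command could pass a larger value such as `Some(12)`.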