obikmer/docmd/architecture/numa_partition_runner.md
Eric Coissac f84dd539bf
feat(numa): introduce I/O sampling to prevent activation stalls
Replaces the monolithic CPU scaling threshold with separate CPU and I/O spawn thresholds. Introduces an `IoSample` struct with platform-specific byte reading and a relative throughput growth heuristic. Adds a 0.1s wall-clock guard to `CpuSample` to suppress artificial efficiency spikes, and updates `maybe_activate` to trigger worker scaling when either resource indicates headroom. Bumps `obikmer` to v1.1.33 and updates architecture documentation.
2026-07-02 10:07:22 +02:00


NUMA-aware partition runner

Problem

All partition-level parallel loops in obikindex currently fall into two categories:

Naive Rayon — used in build_layers, pack_matrices, dump, select, stats, rebuild, reindex:

(0..n).into_par_iter().for_each(|i| work(i));

Threads come from the global Rayon pool with no NUMA awareness. On multi-socket machines this produces cross-socket memory traffic and degrades performance super-linearly (see NUMA-aware worker pools).

Ad-hoc adaptive pool — used in merge:

A bespoke implementation with pre-spawned workers, channel-based dispatch, and activation control. It handles NUMA correctly but is not reusable.

Both cases should be replaced by a single generic mechanism.

Unified model

The key insight is that UMA is just the NUMA case with a single node. The runner always works the same way: one controller thread per node, each independently managing its own workers with the same adaptive logic. The only difference between UMA and NUMA is the number of nodes and whether workers are pinned.

NUMA (k nodes)                    UMA (1 node)

controller-0  controller-1  …     controller-0
    │               │                  │
workers[0]     workers[1]         workers[0]
(pinned)       (pinned)           (global pool)
    └───────────────┴──────────────────┘
              shared work queue

On each node, the Rayon ThreadPool is pinned to that node's CPUs. pool.install() ensures all internal Rayon calls (inside the work function) use the node-local pool. Linux first-touch then places heap allocations in local DRAM automatically.

On UMA the global Rayon pool is used directly — no pinning, no overhead.

Adaptive mechanism

Each controller follows the same logic regardless of node count:

  1. Pre-spawn workers_per_node dormant worker threads (blocked on activate_rx).
  2. Activate the first worker immediately.
  3. Loop on result channel with a SPAWN_POLL timeout:
    • On result: call on_done; check whether to activate the next worker.
    • On timeout: same check.
    • Activation criterion: should_spawn_worker(active, global_efficiency, prev_efficiency).
  4. Drop activate_tx when done — dormant workers exit cleanly.
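The four steps above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the obikmer implementation: `run_controller`, the `wants_more` closure (standing in for the activation criterion), the atomic counter used as the shared work queue, the one-channel-per-worker activation scheme, and the 20 ms `SPAWN_POLL` value are all hypothetical. Pinning, `on_done`, and error handling are left out.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{mpsc, Arc};
use std::thread;
use std::time::Duration;

// Assumed polling interval; the real SPAWN_POLL value is not given in this document.
const SPAWN_POLL: Duration = Duration::from_millis(20);

/// Runs `work(i)` for every `i` in `0..n_items` on up to `max_workers` threads.
/// Returns the results and the number of workers that were activated.
fn run_controller(
    n_items: usize,
    max_workers: usize,
    work: fn(usize) -> u64,
    wants_more: impl Fn(usize) -> bool,
) -> (Vec<u64>, usize) {
    let next = Arc::new(AtomicUsize::new(0)); // shared queue: next index to claim
    let (result_tx, result_rx) = mpsc::channel::<(usize, u64)>();

    // Step 1: pre-spawn dormant workers, each blocked on its own activation channel.
    let mut activators = Vec::new();
    let mut handles = Vec::new();
    for _ in 0..max_workers {
        let (activate_tx, activate_rx) = mpsc::channel::<()>();
        let (next, result_tx) = (Arc::clone(&next), result_tx.clone());
        handles.push(thread::spawn(move || {
            // Err means the sender was dropped: exit without ever working.
            if activate_rx.recv().is_err() {
                return;
            }
            loop {
                let i = next.fetch_add(1, Ordering::Relaxed);
                if i >= n_items || result_tx.send((i, work(i))).is_err() {
                    break;
                }
            }
        }));
        activators.push(activate_tx);
    }
    drop(result_tx); // the controller keeps only the receiving end

    // Step 2: activate the first worker immediately.
    let mut pending = activators.into_iter();
    let mut active = 0;
    if let Some(tx) = pending.next() {
        tx.send(()).ok();
        active += 1;
    }

    // Step 3: wait for results, re-checking the criterion on result and on timeout.
    let mut results = vec![0u64; n_items];
    let mut done = 0;
    while done < n_items {
        match result_rx.recv_timeout(SPAWN_POLL) {
            Ok((i, r)) => {
                results[i] = r;
                done += 1;
            }
            Err(mpsc::RecvTimeoutError::Timeout) => {}
            Err(mpsc::RecvTimeoutError::Disconnected) => break,
        }
        if done < n_items && wants_more(active) {
            if let Some(tx) = pending.next() {
                tx.send(()).ok();
                active += 1;
            }
        }
    }

    // Step 4: dropping the remaining senders wakes the dormant workers, which exit.
    drop(pending);
    for h in handles {
        h.join().unwrap();
    }
    (results, active)
}
```

Calling `run_controller(16, 4, |i| (i as u64) * 2, |_| true)` returns the sixteen doubled indices together with the number of workers that were activated; that count depends on timing, but never exceeds `max_workers`.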

Global CPU efficiency (CpuSample, reads /proc/stat on Linux) is used by all controllers — no per-node measurement needed. The signal is coarser than per-node efficiency but correct in practice: if any node saturates memory bandwidth, the global efficiency drops and all controllers stop activating workers. Using a standard portable primitive avoids platform-specific CPU accounting and keeps the implementation clean.

Proposed API

pub struct PartitionRunner {
    // One entry per NUMA node; one entry total on UMA.
    nodes: Vec<NodeConfig>,
}

struct NodeConfig {
    pool:       Option<Arc<rayon::ThreadPool>>,  // None = global Rayon pool (UMA)
    cpu_ids:    Vec<usize>,                      // empty = no pinning (UMA)
    max_workers: usize,
}

impl PartitionRunner {
    /// Detect topology and build the runner.
    /// Returns a single-node runner on UMA / macOS / hwloc failure.
    pub fn new() -> Self;

    /// Run `f(i)` for every index in `order`, passing each result to `on_done`.
    ///
    /// `on_done(i, result, elapsed)` is called under an internal mutex as
    /// each partition completes — use it for progress bars and aggregation.
    /// The runner serialises all calls to `on_done` via an internal
    /// `Arc<Mutex<C>>`, so no `Sync` bound is required on the callback.
    /// `Send` is required because the Arc clone crosses thread boundaries.
    ///
    /// Serialisation is free in practice: a partition takes seconds to
    /// minutes; the callback takes microseconds.  Contention is negligible.
    ///
    /// Returns the first error from `f`, if any.
    pub fn run<F, R, E, C>(
        &self,
        order:   &[usize],
        f:       F,
        on_done: C,
    ) -> Result<(), E>
    where
        F: Fn(usize) -> Result<R, E> + Send + Sync,
        R: Send,
        E: Send,
        C: FnMut(usize, R, Duration) + Send;   // Send required, Sync is not
}

order is caller-supplied so each command chooses its scheduling strategy: largest-first for merge, sequential for build_layers, etc.
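To make the trait bounds concrete, here is a reduced, std-only sketch of a function with the same signature shape. It is an assumption-laden illustration, not the proposed implementation: `run_serialised` and its `workers` parameter are hypothetical, scoped threads replace the per-node controllers, and there is no adaptive activation. It only shows why `C: FnMut + Send` suffices once the callback sits behind a mutex, and how "first error wins" can be realised.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical reduced form of `PartitionRunner::run` with a fixed worker count.
fn run_serialised<F, R, E, C>(order: &[usize], workers: usize, f: F, on_done: C) -> Result<(), E>
where
    F: Fn(usize) -> Result<R, E> + Send + Sync,
    R: Send,
    E: Send,
    C: FnMut(usize, R, Duration) + Send,
{
    let next = AtomicUsize::new(0); // position in `order` of the next item to claim
    let on_done = Mutex::new(on_done); // Mutex<C> is Sync as soon as C is Send
    let first_err: Mutex<Option<E>> = Mutex::new(None);

    thread::scope(|s| {
        for _ in 0..workers.max(1) {
            s.spawn(|| loop {
                let slot = next.fetch_add(1, Ordering::Relaxed);
                let Some(&i) = order.get(slot) else { break };
                // Stop claiming work once any worker has recorded an error.
                if first_err.lock().unwrap().is_some() {
                    break;
                }
                let start = Instant::now();
                match f(i) {
                    Ok(r) => {
                        // Only one thread at a time holds the guard, so a plain
                        // FnMut callback is enough.
                        let mut callback = on_done.lock().unwrap();
                        (*callback)(i, r, start.elapsed());
                    }
                    Err(e) => {
                        let mut err_slot = first_err.lock().unwrap();
                        if err_slot.is_none() {
                            *err_slot = Some(e); // later errors are dropped
                        }
                        break;
                    }
                }
            });
        }
    });

    match first_err.into_inner().unwrap() {
        Some(e) => Err(e),
        None => Ok(()),
    }
}
```

A caller can push into a local `Vec` from the callback, for example `run_serialised(&order, 4, work, |i, r, _| seen.push((i, r)))`, without wrapping the vector in any synchronisation primitive of its own.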

Migration examples

merge.rs (before: ~180 lines of bespoke machinery)

let runner = PartitionRunner::new();
runner.run(
    &order,
    |i| dst_partition.merge_partition(i, srcs, mode, n_dst_genomes, block_bits, evidence)
            .map_err(OKIError::Partition),
    |i, g_len, dur| {
        pb.inc(1);
        debug!("partition {i}: done in {:.1}s — {g_len} new kmers", dur.as_secs_f64());
        part_stats.push(PartStat { id: i, unitig_bytes: partition_sizes[i], g_len });
    },
)?;
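The `order` slice passed to `run` in the merge example is built by the caller. A largest-first order could be derived from the partition sizes roughly as follows; the helper name `largest_first` and the `u64` element type are assumptions made for the illustration.

```rust
use std::cmp::Reverse;

/// Returns partition indices sorted by decreasing size.
/// The sort is stable, so partitions of equal size keep their index order.
fn largest_first(partition_sizes: &[u64]) -> Vec<usize> {
    let mut order: Vec<usize> = (0..partition_sizes.len()).collect();
    order.sort_by_key(|&i| Reverse(partition_sizes[i]));
    order
}
```

For sizes `[10, 50, 20, 50]` this yields `[1, 3, 2, 0]`: the long-running partitions start first, which tends to shorten the tail of the run.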

index.rs build_layers (before: naive into_par_iter)

let order: Vec<usize> = (0..n).collect();
let runner = PartitionRunner::new();
runner.run(
    &order,
    |i| self.partition.build_index_layer(i, min_ab, max_ab, with_counts, &evidence, block_bits)
            .map_err(OKIError::Partition),
    |_, n_kmers, _| {
        total_kmers.fetch_add(n_kmers, Ordering::Relaxed);
        pb.inc(1);
    },
)?;

All other sites (pack_matrices, dump, select, etc.) follow the same pattern.

Placement

PartitionRunner lives in obikindex/src/numa.rs alongside NumaSetup. It depends only on standard library primitives and Rayon — no new dependencies.

A single PartitionRunner instance can be built once per command invocation and reused across multiple run() calls (e.g. merge runs merge_partitions then pack_matrices).

Known issue: CPU-only activation signal stalls on I/O-bound stages

Observed on a real filter run (109 genomes, 256 partitions, 8×24-core NUMA):

  • rebuild (CPU-bound: k-mer construction) scales cleanly from 9 to 43 active workers as CpuSample::do_i_activate (obisys::lib.rs) sees efficiency climb.
  • pack_matrices (I/O-bound: reopens and recomposes per-genome column files into .pbmx/.pcmx) activates one extra worker, then flatlines at 10/192 for the rest of the stage, even though 256 partitions keep completing over several minutes.

This matches the documented intent (§ Adaptive mechanism: "avoids over-provisioning ... I/O-bound ... workloads") but conflates two different things: "CPU is not the bottleneck" and "more workers would not help". On storage with real queue depth (NVMe, RAID, parallel FS), pack_matrices could still benefit from more concurrent workers even with flat CPU usage, a signal the current mechanism cannot see.

A one-off artefact was also found in the same log: right after a stage transition, do_i_activate produced a physically impossible spike (efficiency ~94 cores on a 192-core box) because it has no minimum-window guard — unlike its sibling cpu_efficiency, which returns 0.0 if wall < 0.1s (obisys::lib.rs:260). do_i_activate unconditionally overwrites self.wall/self.user_secs/self.sys_secs even when the elapsed window is too short to be meaningful, so a burst of rapid completions right after activating a worker can divide a real CPU delta by a near-zero wall delta.
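The arithmetic behind that spike is easy to reproduce. The sketch below is a hypothetical, simplified stand-in for CpuSample (the `WindowedCpu` name, its fields, and the injected readings are assumptions, not the obisys code). Readings are passed in as plain numbers so the effect of a near-zero window is visible.

```rust
/// Minimum window, in seconds, below which a sample is treated as meaningless.
const MIN_WINDOW_SECS: f64 = 0.1;

/// Hypothetical stand-in for `CpuSample`: remembers the last accepted reading.
struct WindowedCpu {
    wall_secs: f64, // wall-clock time of the last accepted sample
    cpu_secs: f64,  // cumulative user + system CPU seconds at that time
}

impl WindowedCpu {
    /// Cores in use since the last accepted sample, with no window guard.
    fn unguarded_efficiency(&self, now_wall: f64, now_cpu: f64) -> f64 {
        (now_cpu - self.cpu_secs) / (now_wall - self.wall_secs)
    }

    /// Same computation, but a too-short window returns `None` and leaves the
    /// stored reading untouched so the window keeps accumulating.
    fn guarded_efficiency(&mut self, now_wall: f64, now_cpu: f64) -> Option<f64> {
        let elapsed = now_wall - self.wall_secs;
        if elapsed < MIN_WINDOW_SECS {
            return None;
        }
        let efficiency = (now_cpu - self.cpu_secs) / elapsed;
        self.wall_secs = now_wall;
        self.cpu_secs = now_cpu;
        Some(efficiency)
    }
}
```

With made-up readings, 4.7 CPU-seconds accumulated over a 0.05 s window read as roughly 94 cores through `unguarded_efficiency`, while `guarded_efficiency` rejects the same window and waits for a longer one.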

Implemented: I/O signal + shared debounce guard

IoSample (obisys::lib.rs, alongside CpuSample) reads a per-platform byte counter:

  • Linux: read_bytes/write_bytes from /proc/self/io. These are the bytes actually submitted to the block layer, unlike rchar/wchar (which also count page-cache hits) and ru_inblock/ru_oublock (unreliable on macOS).
  • macOS: proc_pid_rusage(RUSAGE_INFO_V4), fields ri_diskio_bytesread/ri_diskio_byteswritten. The FFI goes through libc only, so there is no new dependency (same pattern as the existing getrusage bindings).
  • Any other target: degrades gracefully to a signal that never triggers, which amounts to CPU-only activation (same pattern as cgroup_v2_available).
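On Linux, the counter can be extracted from /proc/self/io with a small parser. The following is a sketch under assumptions: the function names `parse_proc_io` and `disk_bytes` are hypothetical, and summing reads and writes into a single number is a choice made for the illustration, not something this document specifies.

```rust
/// Sums `read_bytes` and `write_bytes` from the text of `/proc/self/io`.
/// Returns `None` if either field is missing or not a valid integer.
fn parse_proc_io(text: &str) -> Option<u64> {
    let mut read = None;
    let mut write = None;
    for line in text.lines() {
        let Some((key, value)) = line.split_once(':') else { continue };
        let value = value.trim().parse::<u64>().ok();
        // Exact key match: `cancelled_write_bytes` must not be mistaken
        // for `write_bytes`.
        match key.trim() {
            "read_bytes" => read = value,
            "write_bytes" => write = value,
            _ => {}
        }
    }
    Some(read?.saturating_add(write?))
}

/// Reads the counter for the current process.
/// Returns `None` where `/proc/self/io` does not exist or cannot be read.
fn disk_bytes() -> Option<u64> {
    let text = std::fs::read_to_string("/proc/self/io").ok()?;
    parse_proc_io(&text)
}
```

Returning `Option` keeps the "signal that never triggers" fallback explicit: a caller can map `None` to a constant zero and the rate never grows.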

maybe_activate (numa.rs) activates a worker if either signal still shows headroom, making PartitionRunner adapt to whichever resource is actually the bottleneck without per-call configuration. Both samplers are called unconditionally — no || short-circuit — so neither window starves behind whichever signal fires first:

let cpu_wants_more = cpu_sample.do_i_activate(CPU_SPAWN_THRESHOLD);
let io_wants_more  = io_sample.do_i_activate(IO_SPAWN_THRESHOLD);
if cpu_wants_more || io_wants_more {
    activate_tx.send(()).ok();
    ...
}
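The difference between sampling both signals first and writing the two calls directly around `||` can be checked with counting test doubles. The `Sampler` trait and the `Fixed` type below are hypothetical names introduced for this illustration only.

```rust
/// Hypothetical common interface for the two samplers.
trait Sampler {
    fn do_i_activate(&mut self, threshold: f64) -> bool;
}

/// Test double: returns a fixed answer and counts how often it was sampled.
struct Fixed {
    answer: bool,
    calls: u32,
}

impl Sampler for Fixed {
    fn do_i_activate(&mut self, _threshold: f64) -> bool {
        self.calls += 1;
        self.answer
    }
}

/// Samples both signals unconditionally, then combines the answers.
fn wants_more(cpu: &mut dyn Sampler, io: &mut dyn Sampler, threshold: f64) -> bool {
    let cpu_wants_more = cpu.do_i_activate(threshold);
    let io_wants_more = io.do_i_activate(threshold);
    cpu_wants_more || io_wants_more
}

/// Counter-example: `||` short-circuits, so `io` is not sampled at all
/// whenever `cpu` already answered `true`.
fn wants_more_short_circuit(cpu: &mut dyn Sampler, io: &mut dyn Sampler, threshold: f64) -> bool {
    cpu.do_i_activate(threshold) || io.do_i_activate(threshold)
}
```

Since each sampler resets its own measurement window when called, skipping one of them would let its window grow across several activation decisions, which is the starvation the paragraph above refers to.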

Unlike the CPU signal (an absolute delta in cores — a bounded, portable unit), raw I/O throughput has no natural scale across devices, so IoSample uses a relative growth threshold instead of an absolute one:

pub fn do_i_activate(&mut self, threshold: f64) -> bool {
    let elapsed = self.wall.elapsed().as_secs_f64();
    if elapsed < 0.1 { return false; }        // state untouched — window keeps accumulating

    let n = Self::read_bytes();
    let rate = n.saturating_sub(self.bytes) as f64 / elapsed;
    let activate = if self.previous_rate == 0.0 {
        rate > 0.0                            // bootstrap: any measured throughput is signal
    } else {
        (rate - self.previous_rate) / self.previous_rate >= threshold
    };

    self.bytes = n;
    self.previous_rate = rate;                // remember this window's rate for the next comparison
    self.wall  = Instant::now();              // reset only on a real sample
    activate
}
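To see the relative-growth rule on concrete numbers, the same logic can be written with the clock and the byte counter passed in as arguments. `IoWindow` is a hypothetical, test-friendly variant introduced here; it records the measured rate after each accepted sample so that the next call has something to compare against.

```rust
/// Hypothetical pure-logic variant of `IoSample`: time and bytes are injected.
struct IoWindow {
    bytes: u64,         // byte counter at the last accepted sample
    wall_secs: f64,     // time of the last accepted sample, in seconds
    previous_rate: f64, // bytes per second measured over the previous window
}

impl IoWindow {
    fn do_i_activate(&mut self, now_secs: f64, now_bytes: u64, threshold: f64) -> bool {
        let elapsed = now_secs - self.wall_secs;
        if elapsed < 0.1 {
            return false; // window too short: keep accumulating, state untouched
        }
        let rate = now_bytes.saturating_sub(self.bytes) as f64 / elapsed;
        let activate = if self.previous_rate == 0.0 {
            rate > 0.0 // bootstrap: any measured throughput counts
        } else {
            (rate - self.previous_rate) / self.previous_rate >= threshold
        };
        self.bytes = now_bytes;
        self.wall_secs = now_secs;
        self.previous_rate = rate;
        activate
    }
}
```

With a threshold of 0.2 and one-second windows, a rate going from 100 to 125 bytes/s is a 25 % increase and activates a worker, while a further step from 125 to 130 bytes/s is only 4 % and does not.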

The same guard (return false without mutating state when elapsed < 0.1 s) was also back-ported into CpuSample::do_i_activate, where its absence caused the ~94-core artefact above. One fix covers both problems, and it removes the need for any arbitrary I/O-rate floor: a short or noisy window is rejected outright rather than papered over with a hardware-dependent constant.

Both spawn thresholds (CPU_SPAWN_THRESHOLD, IO_SPAWN_THRESHOLD, both 0.2) are defined as const in PartitionRunner::run (numa.rs).

Starting threshold: 0.2 (20 % relative growth) for IoSample, same order of magnitude as the CPU threshold's implicit relative sensitivity (in the observed log, an 8→9 worker step raised efficiency by ~12 %). This is a starting point, not a derived value — I/O throughput is lumpier than CPU time (buffered writes flush in bursts), so it needs empirical validation against a real pack run before being considered final.

Open questions

  • Error handling: run currently returns the first error; remaining errors are dropped. A Vec<E> return would give complete diagnostics.

  • workers_per_node tuning: currently (cpus / 8).max(3).min(8), calibrated for merge on BeeGFS. Superseded by the I/O signal above for the "more workers would help despite flat CPU" case — a per-call override may still be worth keeping as a manual escape hatch.

  • on_done ordering: the runner serialises calls to on_done via an internal Arc<Mutex<C>>. Send is required (the Arc clone crosses thread boundaries); Sync is not (only one thread holds the lock at a time). Contention is negligible because a partition takes seconds while the callback takes microseconds. The callback is therefore simple to write (plain Vec::push, plain FnMut) with no measurable performance cost.
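For reference, the workers_per_node expression quoted in the open questions can be evaluated for a few CPU counts. The function wrapper is an assumption made for the illustration; the expression itself is copied from the bullet above, and what `cpus` counts (per node or per machine) is not stated in this document.

```rust
/// The expression from the open question, wrapped in a function for testing.
/// It is equivalent to `(cpus / 8).clamp(3, 8)`.
fn workers_per_node(cpus: usize) -> usize {
    (cpus / 8).max(3).min(8)
}
```

The result is 3 for 8 or 24 CPUs, 6 for 48 CPUs, and saturates at 8 from 64 CPUs upward (192 CPUs also gives 8).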