refactor(numa): replace flat runner with per-node activation channels

Shifts the NUMA-aware runner from a flat, round-robin model to a per-node architecture using dedicated `NodeActivation` channels. Replaces absolute deltas with relative scaling based on the previous growth step's worker count, decoupling growth from node count to fix slow ramp-up and enforce per-node fairness. Updates architecture documentation to reflect these changes and focus tuning questions on `INITIAL`/`GROWTH_DIVISOR` parameters for I/O-bound validation.
2026-07-03 12:47:56 +02:00
parent 66ab4c6db1
commit 67b4e4da53
2 changed files with 285 additions and 108 deletions
@@ -205,14 +205,28 @@ unconditionally — no `||` short-circuit — so neither window starves behind
 whichever signal fires first:

 ```rust
-let cpu_wants_more = cpu_sample.do_i_activate(CPU_SPAWN_THRESHOLD);
+let cpu_threshold = CPU_SPAWN_THRESHOLD * activation.last_step() as f64;
+let cpu_wants_more = cpu_sample.do_i_activate(cpu_threshold);
 let io_wants_more  = io_sample.do_i_activate(IO_SPAWN_THRESHOLD);
 if cpu_wants_more || io_wants_more {
-    activate_tx.send(()).ok();
-    ...
+    activation.grow(GROWTH_DIVISOR, n_total);
 }
 ```

+The CPU threshold is *not* the flat absolute delta it started as: it scales
+with `activation.last_step()` — the number of workers activated in the last
+growth step, tracked by `NodeActivation` (`numa.rs`) and updated every time
+`grow()` actually grows something. Growing by 8 workers should add ~8 cores of
+efficiency if the workload is truly CPU-bound; requiring only
+`CPU_SPAWN_THRESHOLD` (20 %) of that expected gain confirms the growth was
+useful without demanding perfect linear scaling. Scaling by the *last step's
+size* rather than the cumulative total keeps the bar equally meaningful
+whether it's the 2nd growth step or the 20th — a flat absolute threshold
+(0.2 core) is a strong signal at 8 active workers but pure noise at 150; a
+threshold scaled by the *cumulative* total instead (considered and rejected)
+would have made the bar essentially impossible to clear late in the ramp,
+strangling exactly the CPU-bound saturation the mechanism exists to allow.
+
 Unlike the CPU signal (an absolute delta in cores — a bounded, portable unit),
 raw I/O throughput has no natural scale across devices, so `IoSample` uses a
 **relative** growth threshold instead of an absolute one:
@@ -242,27 +256,64 @@ the ~94-core artefact above) — one fix for both problems, and it removes the
 need for any arbitrary I/O-rate floor: a short/noisy window is rejected
 outright rather than papered over with a hardware-dependent constant.

-Both spawn thresholds (`CPU_SPAWN_THRESHOLD`, `IO_SPAWN_THRESHOLD`, both `0.2`)
-are defined as `const` in `PartitionRunner::run` (`numa.rs`). The I/O value is
-a starting point, not a derived one — needs empirical validation against a
-real `pack` run.
+Both spawn thresholds (`CPU_SPAWN_THRESHOLD`, `IO_SPAWN_THRESHOLD`, module-level
+`const` in `numa.rs`, both `0.2`) are a starting point, not a derived value:
+`0.2` (20 % relative growth) for `IoSample` was chosen to match the CPU
+threshold's *implicit* relative sensitivity (in the observed log, an 8→9
+worker step raised efficiency by ~12 %) — but I/O throughput is lumpier than
+CPU time (buffered writes flush in bursts), so it needs empirical validation
+against a real `pack` run before being considered final.

-Starting threshold: `0.2` (20 % relative growth) for `IoSample`, same order of
-magnitude as the CPU threshold's *implicit* relative sensitivity (in the
-observed log, an 8→9 worker step raised efficiency by ~12 %). This is a
-starting point, not a derived value — I/O throughput is lumpier than CPU time
-(buffered writes flush in bursts), so it needs empirical validation against a
-real `pack` run before being considered final.
+## Known issue: ramp-up too slow, and confused with node count
+
+The original design started `n_nodes` workers (one per node) and grew one
+worker at a time. On a real `filter` run this took ~10 minutes to climb from
+9 to ~40 active workers even on the CPU-bound `rebuild` stage — most of a
+35-minute stage spent under-provisioned while waiting for evidence to
+accumulate one worker at a time. There is no scale-down mechanism (`n_active`
+only grows), so the original caution was deliberate — but a quarter of
+available cores is still far from saturation, and the real risk zone (over-provisioning
+a memory-bandwidth-bound stage) only shows up much later in the ramp, near
+full occupancy — not at 25 %.
+
+The fix decouples ramp speed from node *count*: both the initial size and the
+growth step are a fraction of `workers_per_node` (node *size*), applied
+identically on every node. A single-NUMA-node (UMA) machine ramps exactly as
+fast as an 8-node one — growing by `n_nodes` per step, as first considered,
+would have degenerated to "grow by 1" on UMA, reproducing the original
+problem for exactly the machines that need the fix most.
+
+```rust
+// NodeActivation::grow — called both at startup (activate_initial) and on
+// every CPU/IO-triggered growth step, with a different divisor each time.
+let wanted = (self.caps[idx] / divisor).max(1);   // INITIAL_DIVISOR=4 at startup, GROWTH_DIVISOR=8 per step
+let room   = self.caps[idx].saturating_sub(self.active[idx]);
+let grow   = wanted.min(room).min(n_total.saturating_sub(self.total));
+```
+
+This also fixed a latent correctness gap: the original single shared
+`activate_tx`/`activate_rx` pair had *no* per-node addressing — sending one
+activation signal woke up whichever dormant worker (from any node) happened
+to win the race on that channel. `crossbeam_channel` gives no fairness
+guarantee across competing receivers, so "round-robin across nodes" was an
+assumption the code never actually enforced. `PartitionRunner::run` now opens
+one activation channel per node (`activate_txs`/`activate_rxs`, one pair per
+`NodeConfig`); `NodeActivation` (`numa.rs`) tracks how many of each node's
+dormant workers have been woken and grows every node by the same amount per
+step, capped by that node's remaining dormant workers and by the run's total
+budget (`n_total`) — balance across nodes is now guaranteed by construction,
+not incidental to channel implementation details.

 ## Open questions

 - **Error handling**: `run` currently returns the first error; remaining errors
  are dropped.  A `Vec<E>` return would give complete diagnostics.

- **`workers_per_node` tuning**: currently `(cpus / 8).max(3).min(8)`, calibrated
-  for merge on BeeGFS.  Superseded by the I/O signal above for the "more
-  workers would help despite flat CPU" case — a per-call override may still be
-  worth keeping as a manual escape hatch.
+- **`INITIAL_DIVISOR` / `GROWTH_DIVISOR` tuning**: currently `4` and `8`
+  (start at 1/4 of a node's cores, grow by 1/8 per step), chosen to fix an
+  observed too-slow ramp — not yet validated against a real `pack` (I/O-bound)
+  run, where over-provisioning risk is different from the CPU-bound `rebuild`
+  case this was tuned against.

 - **`on_done` ordering**: the runner serialises calls to `on_done` via an
  internal `Arc<Mutex<C>>`.  `Send` is required (the Arc clone crosses thread