refactor(numa): replace flat runner with per-node activation channels
Shifts the NUMA-aware runner from a flat, round-robin model to a per-node architecture using dedicated `NodeActivation` channels. Replaces absolute deltas with relative scaling based on the previous growth step's worker count, decoupling growth from node count to fix slow ramp-up and enforce per-node fairness. Updates architecture documentation to reflect these changes and focus tuning questions on `INITIAL`/`GROWTH_DIVISOR` parameters for I/O-bound validation.
This commit is contained in:
@@ -205,14 +205,28 @@ unconditionally — no `||` short-circuit — so neither window starves behind
|
||||
whichever signal fires first:
|
||||
|
||||
```rust
|
||||
let cpu_wants_more = cpu_sample.do_i_activate(CPU_SPAWN_THRESHOLD);
|
||||
let cpu_threshold = CPU_SPAWN_THRESHOLD * activation.last_step() as f64;
|
||||
let cpu_wants_more = cpu_sample.do_i_activate(cpu_threshold);
|
||||
let io_wants_more = io_sample.do_i_activate(IO_SPAWN_THRESHOLD);
|
||||
if cpu_wants_more || io_wants_more {
|
||||
activate_tx.send(()).ok();
|
||||
...
|
||||
activation.grow(GROWTH_DIVISOR, n_total);
|
||||
}
|
||||
```
|
||||
|
||||
The CPU threshold is *not* the flat absolute delta it started as: it scales
|
||||
with `activation.last_step()` — the number of workers activated in the last
|
||||
growth step, tracked by `NodeActivation` (`numa.rs`) and updated every time
|
||||
`grow()` actually grows something. Growing by 8 workers should add ~8 cores of
|
||||
efficiency if the workload is truly CPU-bound; requiring only
|
||||
`CPU_SPAWN_THRESHOLD` (20 %) of that expected gain confirms the growth was
|
||||
useful without demanding perfect linear scaling. Scaling by the *last step's
|
||||
size* rather than the cumulative total keeps the bar equally meaningful
|
||||
whether it's the 2nd growth step or the 20th — a flat absolute threshold
|
||||
(0.2 core) is a strong signal at 8 active workers but pure noise at 150; a
|
||||
threshold scaled by the *cumulative* total instead (considered and rejected)
|
||||
would have made the bar essentially impossible to clear late in the ramp,
|
||||
strangling exactly the CPU-bound saturation the mechanism exists to allow.
|
||||
|
||||
Unlike the CPU signal (an absolute delta in cores — a bounded, portable unit),
|
||||
raw I/O throughput has no natural scale across devices, so `IoSample` uses a
|
||||
**relative** growth threshold instead of an absolute one:
|
||||
@@ -242,27 +256,64 @@ the ~94-core artefact above) — one fix for both problems, and it removes the
|
||||
need for any arbitrary I/O-rate floor: a short/noisy window is rejected
|
||||
outright rather than papered over with a hardware-dependent constant.
|
||||
|
||||
Both spawn thresholds (`CPU_SPAWN_THRESHOLD`, `IO_SPAWN_THRESHOLD`, both `0.2`)
|
||||
are defined as `const` in `PartitionRunner::run` (`numa.rs`). The I/O value is
|
||||
a starting point, not a derived one — needs empirical validation against a
|
||||
real `pack` run.
|
||||
Both spawn thresholds (`CPU_SPAWN_THRESHOLD`, `IO_SPAWN_THRESHOLD`, module-level
|
||||
`const` in `numa.rs`, both `0.2`) are a starting point, not a derived value:
|
||||
`0.2` (20 % relative growth) for `IoSample` was chosen to match the CPU
|
||||
threshold's *implicit* relative sensitivity (in the observed log, an 8→9
|
||||
worker step raised efficiency by ~12 %) — but I/O throughput is lumpier than
|
||||
CPU time (buffered writes flush in bursts), so it needs empirical validation
|
||||
against a real `pack` run before being considered final.
|
||||
|
||||
Starting threshold: `0.2` (20 % relative growth) for `IoSample`, same order of
|
||||
magnitude as the CPU threshold's *implicit* relative sensitivity (in the
|
||||
observed log, an 8→9 worker step raised efficiency by ~12 %). This is a
|
||||
starting point, not a derived value — I/O throughput is lumpier than CPU time
|
||||
(buffered writes flush in bursts), so it needs empirical validation against a
|
||||
real `pack` run before being considered final.
|
||||
## Known issue: ramp-up too slow, and confused with node count
|
||||
|
||||
The original design started `n_nodes` workers (one per node) and grew one
|
||||
worker at a time. On a real `filter` run this took ~10 minutes to climb from
|
||||
9 to ~40 active workers even on the CPU-bound `rebuild` stage — most of a
|
||||
35-minute stage spent under-provisioned while waiting for evidence to
|
||||
accumulate one worker at a time. There is no scale-down mechanism (`n_active`
|
||||
only grows), so the original caution was deliberate — but a quarter of
|
||||
available cores is still far from saturation, and the real risk zone (over-provisioning
|
||||
a memory-bandwidth-bound stage) only shows up much later in the ramp, near
|
||||
full occupancy — not at 25 %.
|
||||
|
||||
The fix decouples ramp speed from node *count*: both the initial size and the
|
||||
growth step are a fraction of `workers_per_node` (node *size*), applied
|
||||
identically on every node. A single-NUMA-node (UMA) machine ramps exactly as
|
||||
fast as an 8-node one — growing by `n_nodes` per step, as first considered,
|
||||
would have degenerated to "grow by 1" on UMA, reproducing the original
|
||||
problem for exactly the machines that need the fix most.
|
||||
|
||||
```rust
|
||||
// NodeActivation::grow — called both at startup (activate_initial) and on
|
||||
// every CPU/IO-triggered growth step, with a different divisor each time.
|
||||
let wanted = (self.caps[idx] / divisor).max(1); // INITIAL_DIVISOR=4 at startup, GROWTH_DIVISOR=8 per step
|
||||
let room = self.caps[idx].saturating_sub(self.active[idx]);
|
||||
let grow = wanted.min(room).min(n_total.saturating_sub(self.total));
|
||||
```
|
||||
|
||||
This also fixed a latent correctness gap: the original single shared
|
||||
`activate_tx`/`activate_rx` pair had *no* per-node addressing — sending one
|
||||
activation signal woke up whichever dormant worker (from any node) happened
|
||||
to win the race on that channel. `crossbeam_channel` gives no fairness
|
||||
guarantee across competing receivers, so "round-robin across nodes" was an
|
||||
assumption the code never actually enforced. `PartitionRunner::run` now opens
|
||||
one activation channel per node (`activate_txs`/`activate_rxs`, one pair per
|
||||
`NodeConfig`); `NodeActivation` (`numa.rs`) tracks how many of each node's
|
||||
dormant workers have been woken and grows every node by the same amount per
|
||||
step, capped by that node's remaining dormant workers and by the run's total
|
||||
budget (`n_total`) — balance across nodes is now guaranteed by construction,
|
||||
not incidental to channel implementation details.
|
||||
|
||||
## Open questions
|
||||
|
||||
- **Error handling**: `run` currently returns the first error; remaining errors
|
||||
are dropped. A `Vec<E>` return would give complete diagnostics.
|
||||
|
||||
- **`workers_per_node` tuning**: currently `(cpus / 8).max(3).min(8)`, calibrated
|
||||
for merge on BeeGFS. Superseded by the I/O signal above for the "more
|
||||
workers would help despite flat CPU" case — a per-call override may still be
|
||||
worth keeping as a manual escape hatch.
|
||||
- **`INITIAL_DIVISOR` / `GROWTH_DIVISOR` tuning**: currently `4` and `8`
|
||||
(start at 1/4 of a node's cores, grow by 1/8 per step), chosen to fix an
|
||||
observed too-slow ramp — not yet validated against a real `pack` (I/O-bound)
|
||||
run, where over-provisioning risk is different from the CPU-bound `rebuild`
|
||||
case this was tuned against.
|
||||
|
||||
- **`on_done` ordering**: the runner serialises calls to `on_done` via an
|
||||
internal `Arc<Mutex<C>>`. `Send` is required (the Arc clone crosses thread
|
||||
|
||||
Reference in New Issue
Block a user