Workers and Workloads
Workers are the processes that generate load for scenarios. They run
inside the VM, each placed in a cgroup, and each one reports detailed
telemetry (WorkerReport) when the workload stops. WorkloadHandle
is the RAII handle that owns their whole lifecycle: spawn → place →
start → stop and collect → drop.
Spawning
let config = WorkloadConfig {
    num_workers: 4,
    work_type: WorkType::Mixed,
    ..Default::default()
};
let mut handle = WorkloadHandle::spawn(&config)?;
Set only the fields that matter for the test and let
..Default::default() fill in the rest — WorkloadConfig’s default
is a known-good single-worker SpinWait baseline, and the struct
update syntax keeps examples pinned to intent as fields are added. Consult the
WorkloadConfig rustdoc for the current field list. (Do not
extrapolate this to every ktstr type: CgroupDef deliberately has no
Default because a derived empty name would silently produce an
invalid cgroup — use CgroupDef::named(...).)
The worker-creation primitive is selected by CloneMode:
- CloneMode::Fork (default): forks N child processes; each child installs a SIGUSR1 handler, then blocks on a pipe waiting for the start signal. Each worker has its own tgid, so cgroup.procs placement is per-worker.
- CloneMode::Thread: spawns N threads inside the harness; each blocks on a rendezvous channel until start(). Workers share the harness’s tgid, so cgroup placement must go through cgroup.threads (see Placement below).
For grouped work types (PipeIo, FutexPingPong, FutexFanOut,
MutexContention, ThunderingHerd, and the rest of the
communicating families), spawn() validates that num_workers is
divisible by the variant’s group size and sets up the inter-worker
plumbing the variant requires (pipes, shared futex pages). See
Work Types for choosing a variant.
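The divisibility rule can be sketched as a small check. This is an illustration only: `validate_group_size` is a hypothetical helper, not the ktstr API, and the per-variant group sizes are defined by the crate, not here.

```rust
/// Hypothetical stand-in for the grouped-work-type check described above.
/// `group_size` is whatever the chosen variant requires (for example 2 for
/// an imagined pairwise variant).
fn validate_group_size(num_workers: usize, group_size: usize) -> Result<usize, String> {
    if group_size == 0 {
        return Err("group size must be non-zero".to_string());
    }
    if num_workers % group_size != 0 {
        return Err(format!(
            "num_workers ({num_workers}) is not divisible by group size ({group_size})"
        ));
    }
    // Number of complete groups, each of which needs its own plumbing.
    Ok(num_workers / group_size)
}
```

With a group size of 2, four workers form two groups, while five workers are rejected before anything is spawned.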
pcomm containers are not created by spawn() — it bails when a
composed WorkSpec::pcomm is set, pointing at
WorkloadHandle::spawn_pcomm_cgroup (or the CgroupDef::pcomm
path), which spawns one thread-group-leader process hosting N worker
threads so the group’s comm matches the pcomm name.
Two-phase start
Workers wait for a “start” signal after spawn:
- Parent spawns the worker (fork or thread), which blocks.
- Parent moves the worker to its target cgroup.
- Parent calls start(), releasing all workers at once.
This ensures workers run inside their target cgroup from the first instruction of their workload — there is no window where load runs in the wrong cgroup and pollutes the measurement.
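The gating idea can be sketched with plain threads and a condition variable. This is a minimal model under stated assumptions, not the WorkloadHandle implementation: `GatedWorkers` and all of its fields are hypothetical, and the real Thread mode is described as using a rendezvous channel instead of a `Condvar`.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread::{self, JoinHandle};

/// Hypothetical two-phase handle: workers are spawned blocked, then released together.
struct GatedWorkers {
    gate: Arc<(Mutex<bool>, Condvar)>,
    stop: Arc<AtomicBool>,
    iterations: Arc<AtomicU64>,
    threads: Vec<JoinHandle<()>>,
}

impl GatedWorkers {
    /// Phase 1: spawn `n` workers that park on the gate before doing any work.
    fn spawn(n: usize) -> Self {
        let gate = Arc::new((Mutex::new(false), Condvar::new()));
        let stop = Arc::new(AtomicBool::new(false));
        let iterations = Arc::new(AtomicU64::new(0));
        let threads = (0..n)
            .map(|_| {
                let (gate, stop, iterations) = (gate.clone(), stop.clone(), iterations.clone());
                thread::spawn(move || {
                    let (lock, cvar) = &*gate;
                    let mut started = lock.lock().unwrap();
                    while !*started {
                        started = cvar.wait(started).unwrap();
                    }
                    drop(started);
                    // The workload proper: counts iterations until told to stop.
                    while !stop.load(Ordering::Relaxed) {
                        iterations.fetch_add(1, Ordering::Relaxed);
                        thread::yield_now();
                    }
                })
            })
            .collect();
        Self { gate, stop, iterations, threads }
    }

    /// Phase 2: release every worker at once. A second call changes nothing.
    fn start(&self) {
        let (lock, cvar) = &*self.gate;
        *lock.lock().unwrap() = true;
        cvar.notify_all();
    }

    /// Stop the workers and return the total iteration count.
    fn stop(self) -> u64 {
        self.start(); // release never-started workers so join cannot hang
        self.stop.store(true, Ordering::Relaxed);
        for t in self.threads {
            t.join().unwrap();
        }
        self.iterations.load(Ordering::Relaxed)
    }
}
```

Between `spawn` and `start` the iteration counter stays at zero, which is the window in which placement can happen without any load running in the wrong place.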
Placement
// 1. Spawn workers (blocked, waiting for start signal)
let mut handle = WorkloadHandle::spawn(&config)?;
// 2. Move workers into their target cgroup. `cgroup.procs` is
// tgid-scoped, so use `worker_pids_for_cgroup_procs()` — it
// bails for Thread-mode workers (whose pids share the harness's
// tgid) and points at `cgroup.threads` instead. Plain
// `worker_pids()` returns the raw pid set without that check.
ctx.cgroups.move_tasks("cg_0", &handle.worker_pids_for_cgroup_procs()?)?;
// 3. Signal workers to start
handle.start();
// 4. Wait for the workload duration
std::thread::sleep(ctx.duration);
// 5. Stop workers and collect telemetry
let reports: Vec<WorkerReport> = handle.stop_and_collect();
Step 2’s Thread-mode bail exists because the kernel resolves any pid
written to cgroup.procs to its thread-group leader — writing a
Thread-mode worker’s pid there would migrate the entire harness
into the test cgroup.
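The guard can be modeled as a check that every pid is its own thread-group leader. The sketch below is an assumption about the shape of such a check, not the body of `worker_pids_for_cgroup_procs()`; `TaskId` and `pids_for_cgroup_procs` are hypothetical names.

```rust
/// Hypothetical view of a worker task: its own pid and its thread-group id.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct TaskId {
    pid: i32,
    tgid: i32,
}

/// Return pids that are safe to write to `cgroup.procs`, or an error naming
/// the first worker that is not its own thread-group leader.
fn pids_for_cgroup_procs(workers: &[TaskId]) -> Result<Vec<i32>, String> {
    for w in workers {
        if w.pid != w.tgid {
            return Err(format!(
                "pid {} belongs to thread group {}; writing it to cgroup.procs \
                 would move the whole group, use cgroup.threads instead",
                w.pid, w.tgid
            ));
        }
    }
    Ok(workers.iter().map(|w| w.pid).collect())
}
```

Forked workers pass because each is the leader of its own group; a thread whose tgid is the harness's fails before anything is written.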
Which placement tool, when:
| You want to | Use |
|---|---|
| Pin one worker to CPUs | handle.set_affinity(idx, cpus) |
| Pin a whole cgroup of workers | CgroupGroup::add_cgroup (writes cpuset.cpus once, RAII-removes on drop) |
| A cgroup that outlives the current scope | CgroupManager directly |
Start and observing progress
start() signals all workers to begin (a start-pipe byte for
fork children, a channel send for threads). Idempotent — the second
call is a no-op. Call it after cgroup placement.
snapshot_iterations() reads every worker’s current iteration
count from a shared-memory region without stopping anything. Call it
periodically during the run window to detect stalls or compute
instantaneous rates; final totals come from stop_and_collect().
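One way to use two successive snapshots is sketched below. It assumes each snapshot is a slice of per-worker `u64` counters in a stable order, which the text above does not confirm; `Progress` and `progress_between` are hypothetical helpers.

```rust
use std::time::Duration;

/// Per-worker progress between two snapshots taken `elapsed` apart.
#[derive(Debug, PartialEq)]
struct Progress {
    /// Iterations per second for each worker over the interval.
    rates: Vec<f64>,
    /// Indices of workers whose counter did not advance.
    stalled: Vec<usize>,
}

fn progress_between(prev: &[u64], curr: &[u64], elapsed: Duration) -> Progress {
    let secs = elapsed.as_secs_f64();
    let mut rates = Vec::with_capacity(curr.len());
    let mut stalled = Vec::new();
    for (idx, (&before, &after)) in prev.iter().zip(curr).enumerate() {
        // saturating_sub guards against a counter that appears to go backwards.
        let delta = after.saturating_sub(before);
        if delta == 0 {
            stalled.push(idx);
        }
        rates.push(if secs > 0.0 { delta as f64 / secs } else { 0.0 });
    }
    Progress { rates, stalled }
}
```

A single zero-delta interval is only a hint; a caller would normally require several consecutive stalled intervals before treating a worker as stuck.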
Stop and collect
stop_and_collect(self) signals workers to stop (SIGUSR1 flips a
stop flag in fork children; a per-thread flag for thread workers),
then collects each worker’s WorkerReport — read from a report pipe
under a shared 5-second deadline for fork children, returned from the
thread join for thread workers. It auto-starts workers if start()
was never called, and consumes the handle — workers cannot be
restarted.
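The "shared deadline" idea can be illustrated with standard channels standing in for the report pipes. This is a sketch of the technique only: `collect_with_shared_deadline` is a hypothetical helper, and the real collector reads from pipes, not `mpsc` receivers.

```rust
use std::sync::mpsc::Receiver;
use std::time::{Duration, Instant};

/// Collect one report per worker under a single shared deadline.
/// Each `recv_timeout` only gets the time that is still left, so the total
/// wait is bounded by `budget` no matter how many workers are slow.
fn collect_with_shared_deadline<T>(pipes: &[Receiver<T>], budget: Duration) -> Vec<Option<T>> {
    let deadline = Instant::now() + budget;
    pipes
        .iter()
        .map(|rx| {
            let remaining = deadline.saturating_duration_since(Instant::now());
            // `None` marks a worker that missed the deadline or hung up without reporting.
            rx.recv_timeout(remaining).ok()
        })
        .collect()
}
```

A per-worker timeout would let N slow workers cost N times the budget; the shared deadline caps the whole collection at one budget.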
A worker that fails to produce a report (died, timed out, wrote
corrupt data) gets a zeroed sentinel report: completed: false,
work_units: 0, and exit_info: Some(_) preserving how it ended
(Exited(code) / Signaled(sig) / TimedOut / WaitFailed /
Panicked). Live-worker reports always carry exit_info: None, so
consumers can distinguish “ran to completion and did nothing” from
“died before reporting” — and the starvation gate counts dead workers
as starved instead of silently passing.
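A consumer might separate the three cases as sketched below. The types are local mirrors written for illustration: the variant payloads (`i32`), the `ReportStatus` struct, and the choice to count idle workers as starved alongside dead ones are assumptions, not the ktstr definitions.

```rust
/// Local mirror of the exit states named above; payload types are assumed.
#[derive(Debug, Clone, PartialEq)]
enum ExitInfo {
    Exited(i32),
    Signaled(i32),
    TimedOut,
    WaitFailed,
    Panicked,
}

/// Local mirror of the three report fields discussed above.
#[derive(Debug, Clone, PartialEq)]
struct ReportStatus {
    completed: bool,
    work_units: u64,
    exit_info: Option<ExitInfo>,
}

#[derive(Debug, PartialEq)]
enum Outcome {
    /// Reported normally and did work.
    Healthy,
    /// Reported normally but did no work.
    Idle,
    /// Sentinel: the worker never produced a report.
    Dead(ExitInfo),
}

fn classify(r: &ReportStatus) -> Outcome {
    match &r.exit_info {
        Some(how) => Outcome::Dead(how.clone()),
        None if r.work_units == 0 => Outcome::Idle,
        None => Outcome::Healthy,
    }
}

/// Count the workers a starvation check would flag under this sketch's rule.
fn starved(reports: &[ReportStatus]) -> usize {
    reports.iter().filter(|r| classify(r) != Outcome::Healthy).count()
}
```

Matching on `exit_info` first is what keeps a sentinel from being mistaken for a live worker that happened to do nothing.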
After collection, SIGKILL is delivered to each fork worker’s process group unconditionally to reap stragglers.
Warning
The teardown SIGKILL is a process-group sweep. Every worker calls setpgid(0, 0) after fork, so any child a Custom work function spawns (a helper via execv, a subshell) inherits the worker’s pgid and is SIGKILLed at teardown. A child that must outlive the worker needs setpgid(child_pid, 0) after fork, or an explicit wait before the worker returns its report. Details are in the Custom section of Work Types.
Drop behavior
Dropping a WorkloadHandle without calling stop_and_collect()
sends SIGKILL to all child processes (the same process-group sweep)
and waits for them, so error paths never leak orphaned workers.
Shared mmap regions (futex pages, iteration counters) are unmapped on
drop. The type is #[must_use] — an accidentally dropped handle
tears its workload down immediately.
Telemetry: WorkerReport
Each worker produces one WorkerReport. The fields you will actually
assert on:
| Field | Meaning | Populated by |
|---|---|---|
| work_units | Cumulative work counter; feeds the starvation gate | Every framework work type |
| iterations | Outer-loop count; feeds throughput rates | Every framework work type |
| cpu_time_ns / wall_time_ns / off_cpu_ns | On-CPU vs total vs off-CPU time | Every framework work type |
| migration_count, migrations, cpus_used | Cross-CPU movement | Checked every 1024 work units |
| max_gap_ms (+ _cpu, _at_ms) | Longest wall-clock gap between checkpoints; the starvation/preemption tell | Every framework work type |
| wake_latencies_ns + wake_sample_total | Per-wakeup latency samples | Blocking work types only (futex, pipe, I/O, yield, sleep) |
| iteration_costs_ns + iteration_cost_sample_total | Per-iteration wall-clock cost | Pure-compute variants (AluHot, SmtSiblingSpin, IpcVariance) |
| timer_latencies_ns + timer_sample_total | Timer-wake jitter vs absolute deadline | TimerLatency only |
| schedstat_run_delay_ns / schedstat_run_count / schedstat_cpu_time_ns | /proc/self/schedstat deltas over the work loop | Every framework work type |
| numa_pages, vmstat_numa_pages_migrated | Per-node residency and migration counters | Every framework work type; feed the NUMA checks |
| completed, exit_info | Natural end vs sentinel (see above) | Framework |
| affinity_error, sched_policy_error | Setup calls that failed; worker ran anyway | Framework |
Consult the WorkerReport rustdoc for the full field list and
per-field semantics — the table above summarizes, the rustdoc is
authoritative.
Semantics worth knowing before asserting:
- Sampling caps. wake_latencies_ns is reservoir-sampled and capped at 100,000 entries; wake_sample_total keeps counting past the cap. Report “total wakeups” from the total; compute percentiles from the vector. (The cap is pinned by a unit test, max_wake_samples_pins_doc_value, so this paragraph cannot silently rot.)
- schedstat_run_count is pcount, not context switches. It increments each time the scheduler picks the task to run; a task that keeps running on one CPU does not advance it. For true context-switch counts read /proc/<pid>/status.
- Checkpoint cadence. Migration and gap checks run when work_units is a multiple of 1024, so a variant contributing N units per outer iteration checks every 1024 / gcd(N, 1024) iterations. Per-variant unit contributions live in the worker source and its rustdoc; the key defaults are pinned by unit tests.
- Custom populates nothing. The framework fills no telemetry for WorkType::Custom: migration tracking, gap detection, schedstat deltas, and iteration counts exist only if the user’s run function fills them.
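The checkpoint cadence formula is easy to evaluate by hand, and the sketch below does the same arithmetic. `checkpoint_every` is a hypothetical helper written for this page; the unit contributions passed to it are illustrative values, not the per-variant numbers from the worker source.

```rust
/// Greatest common divisor (Euclid's algorithm).
fn gcd(mut a: u64, mut b: u64) -> u64 {
    while b != 0 {
        (a, b) = (b, a % b);
    }
    a
}

/// Outer iterations between checkpoints for a variant that adds
/// `units_per_iter` (non-zero) work units per iteration.
fn checkpoint_every(units_per_iter: u64) -> u64 {
    const CHECK_INTERVAL: u64 = 1024;
    CHECK_INTERVAL / gcd(units_per_iter, CHECK_INTERVAL)
}
```

For example, a contribution of 6 units shares only a factor of 2 with 1024, so work_units lands on a multiple of 1024 once every 512 iterations, while a contribution of 256 checks every 4 iterations and a contribution of 1024 checks on every iteration.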
What the reports become
Test output rolls WorkerReports up per cgroup. From a real failing
run:
--- stats ---
2 workers, 4 cpus, 2 migrations, worst_spread=0.0%, worst_gap=21ms
cg0: workers=1 cpus=2 spread=0.0% gap=10ms migrations=1 iter=209600
cg1: workers=1 cpus=2 spread=0.0% gap=21ms migrations=1 iter=189252
iter sums iterations, gap is the worst max_gap_ms,
migrations sums migration_count, and cpus counts distinct
cpus_used entries. Reading this one: both cgroups made steady
progress with sub-25ms worst gaps — the workers were scheduled fine;
this failure came from a throughput floor, not starvation. A report
showing migrations=0 plus a growing gap on a multi-CPU cpuset
would tell the opposite story: the scheduler is not spreading.
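The rollup rules for one cgroup can be sketched as follows. The field types, the `ReportCounters` and `CgroupStats` structs, and the choice to take the union of cpus_used across workers are assumptions made for illustration; the spread column is left out because its formula is not given here.

```rust
use std::collections::BTreeSet;

/// Local mirror of the four report fields the rollup reads (types assumed).
struct ReportCounters {
    iterations: u64,
    max_gap_ms: u64,
    migration_count: u64,
    cpus_used: BTreeSet<usize>,
}

/// One per-cgroup line of the stats output, minus `spread`.
#[derive(Debug, PartialEq)]
struct CgroupStats {
    workers: usize,
    cpus: usize,
    gap_ms: u64,
    migrations: u64,
    iter: u64,
}

fn roll_up(reports: &[ReportCounters]) -> CgroupStats {
    // Distinct CPUs across every worker in the cgroup.
    let mut cpus = BTreeSet::new();
    for r in reports {
        cpus.extend(r.cpus_used.iter().copied());
    }
    CgroupStats {
        workers: reports.len(),
        cpus: cpus.len(),
        // Worst gap, not the sum: one starved worker should stand out.
        gap_ms: reports.iter().map(|r| r.max_gap_ms).max().unwrap_or(0),
        migrations: reports.iter().map(|r| r.migration_count).sum(),
        iter: reports.iter().map(|r| r.iterations).sum(),
    }
}
```

Feeding it a single report with two CPUs, a 21 ms gap, one migration, and 189252 iterations reproduces the non-spread columns of the cg1 line above.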
How reports become verdicts — thresholds, defaults, and the merge rules — is Checking’s territory.