Workers and Workloads

Workers are the processes that generate load for scenarios. They run inside the VM, each placed in a cgroup, and each one reports detailed telemetry (WorkerReport) when the workload stops. WorkloadHandle is the RAII handle that owns their whole lifecycle: spawn → place → start → stop and collect → drop.

Spawning

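// Four workers running the Mixed work type; every other field keeps
// its default value.
let config = WorkloadConfig {
    num_workers: 4,
    work_type: WorkType::Mixed,
    ..Default::default()
};
// Workers are created blocked; nothing runs until start() is called.
let mut handle = WorkloadHandle::spawn(&config)?;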

Set only the fields that matter for the test and let ..Default::default() fill in the rest — WorkloadConfig’s default is a known-good single-worker SpinWait baseline, and the spread form keeps examples pinned to intent as fields are added. Consult the WorkloadConfig rustdoc for the current field list. (Do not extrapolate this to every ktstr type: CgroupDef deliberately has no Default because a derived empty name would silently produce an invalid cgroup — use CgroupDef::named(...).)

The worker-creation primitive is selected by CloneMode:

  • CloneMode::Fork (default): forks N child processes; each child installs a SIGUSR1 handler, then blocks on a pipe waiting for the start signal. Each worker has its own tgid, so cgroup.procs placement is per-worker.
  • CloneMode::Thread: spawns N threads inside the harness; each blocks on a rendezvous channel until start(). Workers share the harness’s tgid, so cgroup placement must go through cgroup.threads (see Placement below).

For grouped work types (PipeIo, FutexPingPong, FutexFanOut, MutexContention, ThunderingHerd, and the rest of the communicating families), spawn() validates that num_workers is divisible by the variant’s group size and sets up the inter-worker plumbing the variant requires (pipes, shared futex pages). See Work Types for choosing a variant.
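The divisibility rule can be illustrated with a small standalone check. This is only a sketch of the idea, not ktstr's code: `validate_group_size` and its `group_size` parameter are hypothetical names, and the real group size comes from the chosen variant.

```rust
/// Hypothetical stand-in for the divisibility check described above.
/// `group_size` is the number of workers one communicating group needs
/// (an assumption here; the real value is defined per variant).
fn validate_group_size(num_workers: usize, group_size: usize) -> Result<usize, String> {
    if group_size == 0 {
        return Err("group_size must be nonzero".to_string());
    }
    if num_workers % group_size != 0 {
        return Err(format!(
            "num_workers ({}) is not divisible by group size ({})",
            num_workers, group_size
        ));
    }
    // Number of complete groups that would be wired together.
    Ok(num_workers / group_size)
}
```

With a group size of 2, four workers form two groups, while five workers are rejected before any plumbing is created.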

pcomm containers are not created by spawn() — it bails when a composed WorkSpec::pcomm is set, pointing at WorkloadHandle::spawn_pcomm_cgroup (or the CgroupDef::pcomm path), which spawns one thread-group-leader process hosting N worker threads so the group’s comm matches the pcomm name.

Two-phase start

Workers wait for a “start” signal after spawn:

  1. Parent spawns the worker (fork or thread), which blocks.
  2. Parent moves the worker to its target cgroup.
  3. Parent calls start(), releasing all workers at once.

This ensures workers run inside their target cgroup from the first instruction of their workload — there is no window where load runs in the wrong cgroup and pollutes the measurement.
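The ordering guarantee can be reproduced with plain threads and channels. The sketch below is a model of the spawn, place, start sequence under stated assumptions: `two_phase_start` is a hypothetical function, and an atomic flag stands in for the cgroup move.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc;
use std::sync::Arc;
use std::thread;

/// Runs `n` workers through spawn -> place -> start and returns, per worker,
/// whether placement had finished when its workload began.
fn two_phase_start(n: usize) -> Vec<bool> {
    let placed = Arc::new(AtomicBool::new(false));
    let mut start_txs = Vec::new();
    let mut joins = Vec::new();

    // Phase 1: spawn workers that block until they receive the start signal.
    for _ in 0..n {
        let (tx, rx) = mpsc::channel::<()>();
        let placed = Arc::clone(&placed);
        start_txs.push(tx);
        joins.push(thread::spawn(move || {
            rx.recv().expect("start signal"); // blocked here until released
            placed.load(Ordering::SeqCst) // first step of the "workload"
        }));
    }

    // Phase 2: placement happens while every worker is still blocked.
    // (Stand-in for moving the workers into their target cgroup.)
    placed.store(true, Ordering::SeqCst);

    // Phase 3: release all workers.
    for tx in &start_txs {
        tx.send(()).expect("worker alive");
    }

    joins
        .into_iter()
        .map(|j| j.join().expect("worker panicked"))
        .collect()
}
```

Because no worker can pass `recv()` before the parent sends, every worker observes `placed == true` on its first step.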

Placement

// 1. Spawn workers (blocked, waiting for start signal)
let mut handle = WorkloadHandle::spawn(&config)?;

// 2. Move workers into their target cgroup. `cgroup.procs` is
//    tgid-scoped, so use `worker_pids_for_cgroup_procs()` — it
//    bails for Thread-mode workers (whose pids share the harness's
//    tgid) and points at `cgroup.threads` instead. Plain
//    `worker_pids()` returns the raw pid set without that check.
ctx.cgroups.move_tasks("cg_0", &handle.worker_pids_for_cgroup_procs()?)?;

// 3. Signal workers to start
handle.start();

// 4. Wait for the workload duration
std::thread::sleep(ctx.duration);

// 5. Stop workers and collect telemetry
let reports: Vec<WorkerReport> = handle.stop_and_collect();

Step 2’s Thread-mode bail exists because the kernel resolves any pid written to cgroup.procs to its thread-group leader — writing a Thread-mode worker’s pid there would migrate the entire harness into the test cgroup.
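On Linux, a task's thread-group leader is visible as the Tgid field of /proc/<pid>/status. The sketch below shows how a caller could tell a non-leader thread from a process leader using that text. Both function names are hypothetical helpers and are not part of ktstr.

```rust
/// Extracts the thread-group id from the text of `/proc/<pid>/status`.
fn parse_tgid(status: &str) -> Option<u32> {
    status
        .lines()
        .find_map(|line| line.strip_prefix("Tgid:"))
        .and_then(|rest| rest.trim().parse().ok())
}

/// Returns Some(true) when `pid` is a non-leader thread, meaning a write of
/// `pid` to cgroup.procs would act on a different task (the leader).
fn is_non_leader_thread(pid: u32, status: &str) -> Option<bool> {
    parse_tgid(status).map(|tgid| tgid != pid)
}
```

A Thread-mode worker would report the harness's pid as its Tgid, which is exactly the case the bail guards against.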

Which placement tool, when:

| You want to | Use |
|---|---|
| Pin one worker to CPUs | handle.set_affinity(idx, cpus) |
| Pin a whole cgroup of workers | CgroupGroup::add_cgroup (writes cpuset.cpus once, RAII-removes on drop) |
| A cgroup that outlives the current scope | CgroupManager directly |

Starting and observing progress

start() signals all workers to begin (a start-pipe byte for fork children, a channel send for threads). Idempotent — the second call is a no-op. Call it after cgroup placement.

snapshot_iterations() reads every worker’s current iteration count from a shared-memory region without stopping anything. Call it periodically during the run window to detect stalls or compute instantaneous rates; final totals come from stop_and_collect().
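Stall detection and instantaneous rates both reduce to comparing two successive snapshots. The sketch below assumes a snapshot can be viewed as one `u64` count per worker, in worker order; that shape, and the names `stalled_workers` and `rates_per_sec`, are assumptions for illustration.

```rust
use std::time::Duration;

/// Returns the indices of workers whose iteration count did not advance
/// between two snapshots taken in the same worker order.
fn stalled_workers(prev: &[u64], curr: &[u64]) -> Vec<usize> {
    prev.iter()
        .zip(curr)
        .enumerate()
        .filter(|(_, (p, c))| c <= p)
        .map(|(i, _)| i)
        .collect()
}

/// Per-worker iterations per second over the interval between snapshots.
/// `elapsed` must be nonzero for the result to be meaningful.
fn rates_per_sec(prev: &[u64], curr: &[u64], elapsed: Duration) -> Vec<f64> {
    let secs = elapsed.as_secs_f64();
    prev.iter()
        .zip(curr)
        .map(|(p, c)| c.saturating_sub(*p) as f64 / secs)
        .collect()
}
```

A poll loop would keep the previous snapshot and its timestamp, call both helpers on each tick, and flag any worker that stays in the stalled list for several consecutive ticks.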

Stop and collect

stop_and_collect(self) signals workers to stop (SIGUSR1 flips a stop flag in fork children; a per-thread flag for thread workers), then collects each worker’s WorkerReport — read from a report pipe under a shared 5-second deadline for fork children, returned from the thread join for thread workers. It auto-starts workers if start() was never called, and consumes the handle — workers cannot be restarted.

A worker that fails to produce a report (died, timed out, wrote corrupt data) gets a zeroed sentinel report: completed: false, work_units: 0, and exit_info: Some(_) preserving how it ended (Exited(code) / Signaled(sig) / TimedOut / WaitFailed / Panicked). Live-worker reports always carry exit_info: None, so consumers can distinguish “ran to completion and did nothing” from “died before reporting” — and the starvation gate counts dead workers as starved instead of silently passing.
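A consumer can make that distinction with a single match on exit_info. The types below are hypothetical, cut-down mirrors of the fields named above (the payload types for Exited and Signaled are assumptions); consult the WorkerReport rustdoc for the real definitions.

```rust
/// Hypothetical mirror of the exit-info variants named in the text.
#[derive(Debug, Clone, PartialEq)]
enum ExitInfo {
    Exited(i32),
    Signaled(i32),
    TimedOut,
    WaitFailed,
    Panicked,
}

/// Hypothetical subset of `WorkerReport`, just enough for the distinction.
#[derive(Debug, Clone, PartialEq)]
struct Report {
    completed: bool,
    work_units: u64,
    exit_info: Option<ExitInfo>,
}

#[derive(Debug, PartialEq)]
enum Outcome {
    Progressed,
    RanButIdle,
    Died(ExitInfo),
}

/// Separates "died before reporting" from "ran and did nothing".
fn classify(r: &Report) -> Outcome {
    match &r.exit_info {
        // Sentinel report: the worker never delivered its own data.
        Some(info) => Outcome::Died(info.clone()),
        // Live report with no work done.
        None if r.work_units == 0 => Outcome::RanButIdle,
        None => Outcome::Progressed,
    }
}
```

Both the dead and the idle case have work_units == 0, so a check that looks only at the counter cannot tell them apart; exit_info is what carries the difference.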

After collection, SIGKILL is delivered to each fork worker’s process group unconditionally to reap stragglers.

Warning

The teardown SIGKILL is a process-group sweep. Every worker calls setpgid(0, 0) after fork, so any child that a Custom work function spawns (a helper via execv, a subshell) inherits the worker’s pgid and is SIGKILLed at teardown. A child that must outlive the worker has to leave that group: call setpgid(child_pid, 0) after forking it. A child that only needs to finish its own work should instead be waited on explicitly before the worker returns its report. Details are in the Custom section of Work Types.
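If the helper is launched through std::process::Command rather than a raw fork, the standard library can request the same process-group change. The sketch below assumes a Unix target and Rust 1.64 or later; `detached_helper` is a hypothetical name, and whether a given Custom work function may use Command at all is up to the test author.

```rust
use std::os::unix::process::CommandExt;
use std::process::Command;

/// Builds a command whose child will be placed in its own process group,
/// so a signal sent to the worker's group does not reach it.
fn detached_helper(program: &str) -> Command {
    let mut cmd = Command::new(program);
    // A pgroup of 0 means "use the child's own pid as its pgid", the same
    // effect as setpgid(child_pid, 0).
    cmd.process_group(0);
    cmd
}
```

The caller still decides when to `spawn()` the command and remains responsible for reaping the child, since it is no longer covered by the teardown sweep.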

Drop behavior

Dropping a WorkloadHandle without calling stop_and_collect() sends SIGKILL to all child processes (the same process-group sweep) and waits for them, so error paths never leak orphaned workers. Shared mmap regions (futex pages, iteration counters) are unmapped on drop. The type is #[must_use] — an accidentally dropped handle tears its workload down immediately.
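The teardown-on-drop pattern can be sketched with threads standing in for worker processes. `DemoHandle` is a hypothetical type: it uses a stop flag and join where the real handle uses SIGKILL and wait, but the ownership shape is the same.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};

/// Hypothetical RAII handle: owns its workers and stops them when dropped.
#[must_use = "dropping the handle tears the workload down immediately"]
struct DemoHandle {
    stop: Arc<AtomicBool>,
    live: Arc<AtomicUsize>,
    workers: Vec<JoinHandle<()>>,
}

impl DemoHandle {
    fn spawn(n: usize) -> Self {
        let stop = Arc::new(AtomicBool::new(false));
        let live = Arc::new(AtomicUsize::new(n));
        let workers = (0..n)
            .map(|_| {
                let (stop, live) = (Arc::clone(&stop), Arc::clone(&live));
                thread::spawn(move || {
                    // Spin until told to stop, then record the exit.
                    while !stop.load(Ordering::Acquire) {
                        thread::yield_now();
                    }
                    live.fetch_sub(1, Ordering::AcqRel);
                })
            })
            .collect();
        Self { stop, live, workers }
    }

    /// Shared counter of workers that have not yet exited.
    fn live_counter(&self) -> Arc<AtomicUsize> {
        Arc::clone(&self.live)
    }
}

impl Drop for DemoHandle {
    fn drop(&mut self) {
        // Signal every worker, then wait so none outlives the handle.
        self.stop.store(true, Ordering::Release);
        for w in self.workers.drain(..) {
            let _ = w.join();
        }
    }
}
```

Because the join happens inside `drop`, an early return or `?` on an error path still leaves no worker running once the handle goes out of scope.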

Telemetry: WorkerReport

Each worker produces one WorkerReport. The fields you will actually assert on:

| Field | Meaning | Populated by |
|---|---|---|
| work_units | Cumulative work counter; feeds the starvation gate | Every framework work type |
| iterations | Outer-loop count; feeds throughput rates | Every framework work type |
| cpu_time_ns / wall_time_ns / off_cpu_ns | On-CPU vs total vs off-CPU time | Every framework work type |
| migration_count, migrations, cpus_used | Cross-CPU movement | Checked every 1024 work units |
| max_gap_ms (+ _cpu, _at_ms) | Longest wall-clock gap between checkpoints: the starvation/preemption tell | Every framework work type |
| wake_latencies_ns + wake_sample_total | Per-wakeup latency samples | Blocking work types only (futex, pipe, I/O, yield, sleep) |
| iteration_costs_ns + iteration_cost_sample_total | Per-iteration wall-clock cost | Pure-compute variants (AluHot, SmtSiblingSpin, IpcVariance) |
| timer_latencies_ns + timer_sample_total | Timer-wake jitter vs absolute deadline | TimerLatency only |
| schedstat_run_delay_ns / schedstat_run_count / schedstat_cpu_time_ns | /proc/self/schedstat deltas over the work loop | Every framework work type |
| numa_pages, vmstat_numa_pages_migrated | Per-node residency and migration counters | Every framework work type; feed the NUMA checks |
| completed, exit_info | Natural end vs sentinel (see above) | Framework |
| affinity_error, sched_policy_error | Setup calls that failed; worker ran anyway | Framework |

Consult the WorkerReport rustdoc for the full field list and per-field semantics — the table above summarizes, the rustdoc is authoritative.

Semantics worth knowing before asserting:

  • Sampling caps. wake_latencies_ns is reservoir-sampled and capped at 100,000 entries; wake_sample_total keeps counting past the cap. Report “total wakeups” from the total; compute percentiles from the vector. (The cap is pinned by a unit test — max_wake_samples_pins_doc_value — so this paragraph cannot silently rot.)
  • schedstat_run_count is pcount, not context switches. It increments each time the scheduler picks the task to run; a task that keeps running on one CPU does not advance it. For true context-switch counts, read the voluntary_ctxt_switches and nonvoluntary_ctxt_switches fields of /proc/<pid>/status.
  • Checkpoint cadence. Migration and gap checks run when work_units is a multiple of 1024, so a variant contributing N units per outer iteration checks every 1024 / gcd(N, 1024) iterations. Per-variant unit contributions live in the worker source and its rustdoc; the key defaults are pinned by unit tests.
  • Custom populates nothing. The framework fills no telemetry for WorkType::Custom — migration tracking, gap detection, schedstat deltas, and iteration counts exist only if the user’s run function fills them.
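The checkpoint-cadence formula from the list above is easy to verify numerically. The sketch below computes 1024 / gcd(N, 1024) for a given per-iteration unit contribution; `checkpoint_every` is a hypothetical helper name, and the per-variant values of N are not reproduced here.

```rust
/// Greatest common divisor by the Euclidean algorithm.
fn gcd(mut a: u64, mut b: u64) -> u64 {
    while b != 0 {
        let t = a % b;
        a = b;
        b = t;
    }
    a
}

/// Outer iterations between checkpoints for a variant that adds
/// `units_per_iter` work units per iteration, given that checks fire
/// whenever work_units is a multiple of 1024.
fn checkpoint_every(units_per_iter: u64) -> u64 {
    1024 / gcd(units_per_iter, 1024)
}
```

A variant contributing 1 unit per iteration checks every 1024 iterations, one contributing 256 checks every 4, and any odd contribution also lands on 1024 because it shares no factor with 1024.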

What the reports become

Test output rolls WorkerReports up per cgroup. From a real failing run:

--- stats ---
2 workers, 4 cpus, 2 migrations, worst_spread=0.0%, worst_gap=21ms
  cg0: workers=1 cpus=2 spread=0.0% gap=10ms migrations=1 iter=209600
  cg1: workers=1 cpus=2 spread=0.0% gap=21ms migrations=1 iter=189252

iter sums iterations, gap is the worst max_gap_ms, migrations sums migration_count, and cpus counts distinct cpus_used entries. Reading this one: both cgroups made steady progress with sub-25ms worst gaps — the workers were scheduled fine; this failure came from a throughput floor, not starvation. A report showing migrations=0 plus a growing gap on a multi-CPU cpuset would tell the opposite story: the scheduler is not spreading.
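The four aggregation rules can be written down directly. The sketch below is an illustration under stated assumptions, not the framework's code: `Report` and `CgroupStats` are hypothetical, cut-down types, the field types are guesses, and the sample values in any usage are made up.

```rust
use std::collections::BTreeSet;

/// Hypothetical subset of `WorkerReport` holding only the rolled-up fields.
struct Report {
    iterations: u64,
    max_gap_ms: u64,
    migration_count: u64,
    cpus_used: BTreeSet<usize>,
}

/// One per-cgroup line of the stats block.
#[derive(Debug, PartialEq)]
struct CgroupStats {
    workers: usize,
    cpus: usize,
    gap_ms: u64,
    migrations: u64,
    iter: u64,
}

fn roll_up(reports: &[Report]) -> CgroupStats {
    // Union of every worker's CPU set, so shared CPUs are counted once.
    let mut cpus = BTreeSet::new();
    for r in reports {
        cpus.extend(r.cpus_used.iter().copied());
    }
    CgroupStats {
        workers: reports.len(),
        cpus: cpus.len(),
        // Worst gap across workers; 0 for an empty cgroup.
        gap_ms: reports.iter().map(|r| r.max_gap_ms).max().unwrap_or(0),
        migrations: reports.iter().map(|r| r.migration_count).sum(),
        iter: reports.iter().map(|r| r.iterations).sum(),
    }
}
```

Note that gap is a maximum while iter and migrations are sums, so a single badly starved worker dominates the gap column even when the cgroup's total throughput looks healthy.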

How reports become verdicts — thresholds, defaults, and the merge rules — is Checking’s territory.