pub enum WorkType {
SpinWait,
YieldHeavy,
Mixed,
IoSyncWrite,
IoRandRead,
IoConvoy,
Bursty {
burst_duration: Duration,
sleep_duration: Duration,
},
PipeIo {
burst_iters: u64,
},
FutexPingPong {
spin_iters: u64,
},
CachePressure {
size_kib: usize,
stride: usize,
},
CacheYield {
size_kib: usize,
stride: usize,
},
CachePipe {
size_kib: usize,
burst_iters: u64,
},
FutexFanOut {
fan_out: usize,
spin_iters: u64,
},
Sequence {
first: WorkPhase,
rest: Vec<WorkPhase>,
},
ForkExit,
NiceSweep,
AffinityChurn {
spin_iters: u64,
},
CrossAffinityChurn {
spin_iters: u64,
},
PolicyChurn {
spin_iters: u64,
},
FanOutCompute {
fan_out: usize,
cache_footprint_kib: usize,
operations: usize,
sleep_usec: u64,
},
Schbench {
config: SchbenchConfig,
},
Taobench {
config: TaobenchConfig,
},
PageFaultChurn {
region_kib: usize,
touches_per_cycle: usize,
spin_iters: u64,
},
MutexContention {
contenders: usize,
hold_iters: u64,
work_iters: u64,
},
Custom {
name: String,
run: CustomFn,
cfg: CustomCfg,
},
ThunderingHerd {
waiters: usize,
batches: u64,
inter_batch_ms: u64,
},
PriorityInversion {
high_count: usize,
medium_count: usize,
low_count: usize,
hold_iters: u64,
work_iters: u64,
pi_mode: FutexLockMode,
},
ProducerConsumerImbalance {
producers: usize,
consumers: usize,
produce_rate_hz: u64,
consume_iters: u64,
queue_depth_target: u64,
},
RtStarvation {
rt_workers: usize,
cfs_workers: usize,
rt_priority: i32,
burst_iters: u64,
},
AsymmetricWaker {
waker_class: SchedClass,
wakee_class: SchedClass,
burst_iters: u64,
},
WakeChain {
depth: usize,
wake: WakeMechanism,
work_per_hop: Duration,
},
NumaWorkingSetSweep {
region_kib: usize,
sweep_period_ms: u64,
target_nodes: Vec<usize>,
},
CgroupChurn {
groups: usize,
cycle_ms: u64,
},
CgroupAttachStorm {
dest: String,
reap: ReapMode,
},
SignalStorm {
signals_per_iter: u64,
work_iters: u64,
},
PreemptStorm {
cfs_workers: usize,
rt_burst_iters: u64,
rt_sleep_us: u64,
},
EpollStorm {
producers: usize,
consumers: usize,
events_per_burst: u64,
},
NumaMigrationChurn {
period_ms: u64,
},
IdleChurn {
burst_duration: Duration,
sleep_duration: Duration,
precise_timing: bool,
},
TimerLatency {
interval_us: u64,
},
NetTraffic {
interval_us: u64,
frame_bytes: u16,
},
IrqWake {
interval_us: u64,
frame_bytes: u16,
},
AluHot {
width: AluWidth,
},
SmtSiblingSpin,
IpcVariance {
hot_iters: u64,
cold_iters: u64,
period_iters: u64,
},
}
What each worker process does during a scenario.
Different work types exercise different scheduler code paths: CPU-bound, yield-heavy, I/O, bursty, or inter-process communication.
Variants ending in Churn cycle their target setting WITHOUT
ordering (random per-iteration); variants ending in Sweep
rotate through an ordered list or range deterministically. See
the module-level “Churn vs Sweep” section for the convention’s
rationale and the runtime contract for each suffix.
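The suffix convention can be illustrated with a minimal, self-contained sketch. The names `churn_pick` and `sweep_pick` and the xorshift step are hypothetical stand-ins, not the crate's implementation; the sketch only contrasts random per-iteration selection with deterministic rotation.

```rust
/// One xorshift64 step. This PRNG is an assumption made for illustration;
/// the state must be nonzero.
fn xorshift64(state: &mut u64) -> u64 {
    let mut x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    x
}

/// Churn: pick a setting at random each iteration, with no ordering.
/// Panics on an empty slice.
fn churn_pick<'a, T>(settings: &'a [T], rng: &mut u64) -> &'a T {
    let idx = (xorshift64(rng) % settings.len() as u64) as usize;
    &settings[idx]
}

/// Sweep: rotate through the settings in order, wrapping at the end.
/// Panics on an empty slice.
fn sweep_pick<T>(settings: &[T], iteration: usize) -> &T {
    &settings[iteration % settings.len()]
}
```

A sweep over `[0, 1, 2]` visits 0, 1, 2, 0, 1, and so on, while a churn over the same list yields members in no particular order.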
Migration: IoSync was replaced
IoSync was replaced by IoSyncWrite,
IoRandRead, and IoConvoy.
The old IoSync simulated IO via tmpfs+sleep — write 64 KB to
a temp file (page-cache memcpy on tmpfs) then sleep 100 µs to
imitate disk-fsync latency. The new variants do real
block-device IO on /dev/vda with O_SYNC/O_DIRECT
(sector-aligned 4 KiB pread/pwrite, optional fdatasync),
so the kernel paths under stress are the actual virtio-blk
submit/complete + BIO routing paths rather than a synthetic
page-cache + nanosleep loop. Tests that depended on the old
page-cache + sleep behavior should use a Sequence
with WorkPhase::Sleep (and an arbitrary CPU phase) to model
the simulated-IO-completion pause without doing real disk IO.
let wt = WorkType::from_name("SpinWait").unwrap();
assert!(matches!(wt, WorkType::SpinWait));
let bursty = WorkType::bursty(
std::time::Duration::from_millis(10),
std::time::Duration::from_millis(5),
);
assert!(matches!(bursty, WorkType::Bursty { .. }));
assert!(WorkType::from_name("nonexistent").is_none());
IO variants share the IoBacking open path but differ in the
open flag + IO shape used to detect them:
- IoSyncWrite: O_SYNC + sequential pwrite bursts followed by fdatasync.
- IoRandRead: O_DIRECT + random pread to a logical-block-aligned scratch buffer.
- IoConvoy: O_DIRECT + interleaved sequential pwrite and random pread, with an fdatasync every 16 iterations (the pathology cadence).
let cfg = WorkloadConfig {
work_type: WorkType::IoConvoy,
..Default::default()
};
assert!(matches!(cfg.work_type, WorkType::IoConvoy));
The VariantNames derive generates WorkType::VARIANTS: &[&str]
at compile time from the enum arm names, which this module
re-exposes as WorkType::ALL_NAMES so a new variant is picked
up automatically without editing a parallel list.
Variants
SpinWait
Tight CPU spin loop (1024 iterations per cycle).
YieldHeavy
Repeated sched_yield with minimal CPU work.
Mixed
CPU spin burst followed by sched_yield.
IoSyncWrite
Synchronous write workload against a real block device. Each
iteration issues 16 × 4 KB pwrites totaling 64 KB at the
worker’s stripe offset (per-worker striping prevents fdatasync
from coalescing across writers), then fdatasync()s. Drives
fsync-heavy D-state cycles. Opens /dev/vda with O_SYNC once
per worker; if /dev/vda is absent (host-side unit tests), a
per-worker tempfile is opened with the same flags and used as
the backing.
IoRandRead
Random-read workload against a real block device. Each
iteration issues a single 4 KB pread at a sector-aligned random
offset within the device capacity. Opens /dev/vda with
O_DIRECT once per worker; if /dev/vda is absent, a
per-worker tempfile is opened with the same flags and used as
the backing. Drives high-IOPS short-D-state cycles. Offsets
come from a per-worker xorshift PRNG seeded from tid; no
crate dependency on rand.
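A minimal sketch of how such offsets might be generated without the rand crate. The seeding scheme, the `XorShift64` type, and the 4 KiB alignment (a multiple of the sector size) are assumptions made for illustration; the crate's actual generator may differ.

```rust
/// Read size and alignment used by this sketch (4 KiB).
const BLOCK: u64 = 4096;

/// Per-worker xorshift64 generator (hypothetical stand-in).
struct XorShift64(u64);

impl XorShift64 {
    /// Seed from the worker tid. The mixing constant is an assumption;
    /// the low bit is forced on because xorshift state must be nonzero.
    fn from_tid(tid: u32) -> Self {
        XorShift64((tid as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15) | 1)
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Random BLOCK-aligned offset such that a BLOCK-sized read stays within
/// the device. Returns None when the device is smaller than one block.
fn random_aligned_offset(rng: &mut XorShift64, capacity_bytes: u64) -> Option<u64> {
    let blocks = capacity_bytes / BLOCK;
    if blocks == 0 {
        return None;
    }
    Some((rng.next_u64() % blocks) * BLOCK)
}
```

Choosing a block index first and multiplying by the block size keeps every offset aligned and in bounds, which O_DIRECT reads require.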
IoConvoy
Interleaved sequential pwrite and random pread with
periodic fdatasync via O_DIRECT. Each iteration alternates
between a 4 KB pwrite at the worker’s monotonic sequential
cursor and a 4 KB pread at a random offset; fdatasync()
runs every 16 iterations. Opens /dev/vda (or tempfile
fallback) with O_DIRECT once per worker.
The convoy pathology (writes batching behind a flush barrier) requires buffered writes; this variant currently uses direct IO so the pathology surface is the synchronous flush + the IO-mix latency distribution rather than the page-cache convoy build-up itself.
Bursty
Work hard for burst_duration, sleep for sleep_duration,
repeat. Frees CPUs during sleep for borrowing. Both fields
use Duration (humantime-serialised) so call sites and
captured configs carry units explicitly, matching
WakeChain and
IdleChurn.
Fields
burst_duration: Duration
Wall-clock duration of CPU work between sleeps.
Default 50ms (see crate::workload::config::defaults::BURSTY_BURST_DURATION).
sleep_duration: Duration
Wall-clock duration of each sleep period; the worker
off-CPUs via thread::sleep. Default 100ms (see
crate::workload::config::defaults::BURSTY_SLEEP_DURATION).
PipeIo
CPU burst then 1-byte pipe exchange with a partner worker. Sleep duration depends on partner scheduling, exercising cross-CPU wake placement. Requires even num_workers; workers are paired (0,1), (2,3), etc.
FutexPingPong
Paired futex wait/wake between partner workers. Each iteration does
spin_iters of CPU work then wakes the partner and waits on the
shared futex word. Exercises the non-WF_SYNC wake path.
Requires even num_workers.
CachePressure
Strided read-modify-write over a buffer, sized to pressure the L1 cache. Each worker allocates its own buffer post-fork.
CacheYield
Cache pressure burst followed by sched_yield(). Tests scheduler re-placement after voluntary yield with a cache-hot working set.
CachePipe
Cache pressure burst then 1-byte pipe exchange with a partner worker. Combines cache-hot working set with cross-CPU wake placement. Requires even num_workers.
FutexFanOut
1:N fan-out wake pattern without cache pressure. One messenger per
group does CPU spin work then wakes N receivers via FUTEX_WAKE.
Receivers measure wake-to-run latency as the interval from
stamping before_block = Instant::now() just before the wait
loop to observing the futex generation advance. Unlike
FanOutCompute, there is no shared messenger
timestamp — the measurement is receiver-local and excludes the
messenger’s pre-wake delay. For cache-aware fan-out with matrix
multiply work, see FanOutCompute. Requires num_workers divisible
by (fan_out + 1).
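The divisibility requirement can be sketched as a small layout helper. `fan_out_layout` and `Role` are hypothetical names, and treating the first worker of each group as the messenger is an assumption, not something the description above states.

```rust
/// Role of a worker inside one fan-out group.
#[derive(Debug, PartialEq, Eq)]
enum Role {
    Messenger,
    Receiver,
}

/// Maps a worker index to (group index, role), or explains why the
/// configuration is invalid. Assumes contiguous groups of fan_out + 1
/// workers with the messenger first.
fn fan_out_layout(
    num_workers: usize,
    fan_out: usize,
    worker_idx: usize,
) -> Result<(usize, Role), String> {
    let group_size = fan_out + 1;
    if num_workers == 0 || num_workers % group_size != 0 {
        return Err(format!(
            "num_workers ({}) must be a nonzero multiple of fan_out + 1 ({})",
            num_workers, group_size
        ));
    }
    if worker_idx >= num_workers {
        return Err(format!("worker_idx {} out of range", worker_idx));
    }
    let role = if worker_idx % group_size == 0 {
        Role::Messenger
    } else {
        Role::Receiver
    };
    Ok((worker_idx / group_size, role))
}
```

With `fan_out = 3`, eight workers form two groups of four, while seven workers are rejected.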
Sequence
Compound work pattern: loop through phases in order, repeat. Each phase runs for its duration before the next starts.
ForkExit
Rapid fork+_exit cycling. Each iteration forks a child that immediately calls _exit(0). The parent calls waitpid on the child, then repeats. Exercises wake_up_new_task, exit_group/do_group_exit, wait_task_zombie.
NiceSweep
Cycle nice level from -20 to 19 across iterations. Each iteration: spin_burst → setpriority → yield. Exercises reweight_task and dynamic priority reweighting. Skips negative nice values when CAP_SYS_NICE is absent.
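A rough sketch of the sweep order, assuming that "skipping negative values" means restricting the range to 0..=19 when CAP_SYS_NICE is absent. `nice_for_iteration` is a hypothetical helper; the worker may implement the skip differently.

```rust
/// Nice value for a given iteration: sweeps -20..=19 in order and wraps.
/// Without CAP_SYS_NICE the sweep is restricted to 0..=19 (an assumption
/// about how the skip is realised).
fn nice_for_iteration(iteration: u64, has_cap_sys_nice: bool) -> i32 {
    let (lowest, span) = if has_cap_sys_nice {
        (-20i64, 40u64)
    } else {
        (0i64, 20u64)
    };
    (lowest + (iteration % span) as i64) as i32
}
```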
AffinityChurn
Rapid self-directed sched_setaffinity to random CPUs from the effective cpuset. Each iteration: spin_burst → pick random CPU → sched_setaffinity → yield. Exercises affine_move_task and migration_cpu_stop.
CrossAffinityChurn
Rapid CROSS-task affinity churn: each worker rewrites every
SIBLING worker’s CPU affinity at high rate, toggling between
two cpuset sub-masks that differ by one CPU so each
sched_setaffinity is a genuine mask change. Distinct from
AffinityChurn, which churns the
worker’s OWN affinity; this churns its SIBLINGS’.
Each iteration: spin_burst → toggle the target mask → for
each sibling pid sched_setaffinity(sibling, mask) → yield.
Siblings are discovered once at worker entry from the worker’s
own cgroup.procs (excluding self) — a flipper sees only the
peers present when it starts. Declare the target WorkSpec(s)
BEFORE the CrossAffinityChurn WorkSpec in the cohort:
apply_setup spawns, moves into the cgroup, and STARTS each
WorkSpec serially in declaration order (one WorkloadHandle
per WorkSpec), so a flipper sees only the peers whose WorkSpec
started before its own (plus its co-flippers — workers within
one WorkSpec all fork before that spec starts). Targets declared
AFTER the flippers, or added by a later separate spawn into the
same cgroup, are NOT seen. Because the sibling set is
the worker’s cgroup membership, this WorkType MUST run in a
dedicated cgroup (a crate::scenario::ops::CgroupDef) — in a
shared or host cgroup it would rewrite the affinity of
unrelated tasks. A worker with no siblings, or a cpuset smaller
than 2 CPUs, is a no-op.
Kernel path: sched_setaffinity(pid) → __set_cpus_allowed_ptr
→ set_cpus_allowed_common + affine_move_task (→
migration_cpu_stop when the running CPU is toggled out); the
scheduler’s select_cpu re-places the task on its next wake. A
sched_ext scheduler that implements the ops.set_cpumask hook
additionally drives that hook on every change — the cross-CPU
set_cpumask race surface. The scx-ktstr fixture does not
implement set_cpumask, so against it only the generic
migration path runs.
Masks come from the FLIPPER’s own cpuset; the kernel clamps each
sched_setaffinity(sibling, mask) to the sibling’s cpuset
(cpumask_and in __sched_setaffinity), so a sibling whose
cpuset is fully disjoint from the flipper’s gets -EINVAL and is
silently skipped, and one sharing only CPUs outside the one-CPU
toggle delta sees no per-iteration change. Reliable churn
therefore wants the flipper and its targets in a shared cpuset —
the dedicated-CgroupDef pattern above guarantees it.
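The mask behaviour described above can be modelled with plain bitmasks. This is a sketch under stated assumptions: CPUs are bits in a `u64` (so at most 64 CPUs), the toggled CPU is the lowest one in the cpuset, and `toggle_masks` and `clamp_to_sibling` are hypothetical names that only model the kernel's intersection step.

```rust
/// Two masks over `cpuset` that differ by exactly one CPU (the lowest set
/// bit, chosen arbitrarily for this sketch). Returns None when the cpuset
/// has fewer than 2 CPUs, which is the no-op case.
fn toggle_masks(cpuset: u64) -> Option<(u64, u64)> {
    if cpuset.count_ones() < 2 {
        return None;
    }
    let lowest = cpuset & cpuset.wrapping_neg();
    Some((cpuset, cpuset & !lowest))
}

/// Models the clamp: the requested mask is ANDed with the sibling's
/// cpuset, and an empty intersection is reported as EINVAL.
fn clamp_to_sibling(requested: u64, sibling_cpuset: u64) -> Result<u64, &'static str> {
    match requested & sibling_cpuset {
        0 => Err("EINVAL"),
        mask => Ok(mask),
    }
}
```

A sibling whose cpuset contains the toggled CPU sees a different effective mask on each flip; one whose cpuset excludes it sees the same mask every time.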
PolicyChurn
Cycle through scheduling policies each iteration. Each iteration: spin_burst → sched_setscheduler to next policy → yield. Cycles SCHED_OTHER → SCHED_BATCH → SCHED_IDLE (and SCHED_FIFO/SCHED_RR when CAP_SYS_NICE is available). Exercises __sched_setscheduler and scheduling class transitions.
FanOutCompute
Messenger/worker fan-out with compute work. One messenger per group
wakes fan_out workers via shared futex. After recording the
wake-to-run latency, each worker sleeps for sleep_usec
microseconds (simulating think time), then does operations
matrix multiplications over a cache_footprint_kib-sized working
set. Wake-to-run latency is the interval from the messenger’s
timestamp to the worker observing the generation advance.
Requires num_workers divisible by (fan_out + 1).
Schbench
schbench’s default-mode benchmark, re-expressed natively (the
schbench_rs port). One worker process runs schbench’s
message-thread / worker-thread topology with native threads: message
threads batch-wake worker threads (measuring scheduler wakeup latency),
and each worker think-sleeps then does matrix work under a per-CPU lock
(measuring request latency). The carried SchbenchConfig sets the
thread counts, cache footprint, think-time, and locking; build it with
SchbenchConfig::default plus its chainable setters. Use a single
ktstr worker (workers(1)) – the message/worker parallelism is this
variant’s internal thread topology, not ktstr worker processes.
Fields
config: SchbenchConfig
Taobench
A bounded, evicting key-value cache workload, re-expressed natively (the
taobench_rs port of the taobench object-cache benchmark). One worker
process runs a closed-loop client population over an in-process sharded
cache: a fast in-cache hit path and a slow backing-store-miss path (a
dispatcher-thread sleep), driven to a steady-state hit ratio by sizing the
key range against the cache capacity. The carried TaobenchConfig sets
the thread counts, cache capacity, target hit ratio, and slow-path
latency; build it with TaobenchConfig::default plus its chainable
setters. Use a single ktstr worker (workers(1)) – the client/fast/slow
parallelism is this variant’s internal thread topology, not ktstr worker
processes.
Fields
config: TaobenchConfig
PageFaultChurn
Rapid page fault cycling. Workers mmap a region_kib KiB region with
MADV_NOHUGEPAGE (forcing 4 KiB pages), touch touches_per_cycle
random pages via write faults, then MADV_DONTNEED to zap PTEs and
repeat. Exercises do_anonymous_page, page allocator contention,
and TLB pressure on migration.
MutexContention
N-way futex mutex contention. contenders workers per group contend
on a shared AtomicU32 via CAS acquire / FUTEX_WAIT on failure.
Loop: spin_burst(work_iters) → CAS acquire → spin_burst(hold_iters)
→ store 0 + FUTEX_WAKE(1). Exercises convoy effect, lock-holder
preemption cascading stalls, and futex wait/wake contention paths.
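The loop shape can be sketched with threads and std atomics only. This is not the worker's implementation: it yields where the real workload parks in FUTEX_WAIT, it omits the FUTEX_WAKE(1) after release, and `contend` and this `spin_burst` are stand-ins defined for the example.

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Busy work standing in for the crate's spin_burst.
fn spin_burst(iters: u64) {
    for _ in 0..iters {
        std::hint::spin_loop();
    }
}

/// Runs `contenders` threads through `rounds` lock cycles each and returns
/// the final value of a counter that is only updated under the lock.
fn contend(contenders: usize, rounds: u64, work_iters: u64, hold_iters: u64) -> u64 {
    let lock = Arc::new(AtomicU32::new(0));
    let counter = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = (0..contenders)
        .map(|_| {
            let lock = Arc::clone(&lock);
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..rounds {
                    spin_burst(work_iters);
                    // CAS acquire 0 -> 1; yield on failure instead of FUTEX_WAIT.
                    while lock
                        .compare_exchange(0, 1, Ordering::Acquire, Ordering::Relaxed)
                        .is_err()
                    {
                        thread::yield_now();
                    }
                    // Critical section: a split read-modify-write that would
                    // lose updates if two threads ever held the lock at once.
                    let seen = counter.load(Ordering::Relaxed);
                    spin_burst(hold_iters);
                    counter.store(seen + 1, Ordering::Relaxed);
                    // Release: store 0.
                    lock.store(0, Ordering::Release);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    counter.load(Ordering::Relaxed)
}
```

Because the counter update is deliberately split into a load and a store, the final count equals `contenders * rounds` only if the CAS lock provides mutual exclusion.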
Custom
User-supplied work function. The function receives a reference to
the stop flag and returns a WorkerReport when signaled.
Function pointers are fork-safe (Copy), so Custom works with
the fork-based worker model without serialization.
name identifies this work type in logs and sidecar metadata.
from_name returns None for custom names.
Telemetry contract: Custom runs the user closure to
completion and returns its WorkerReport verbatim. None of the
built-in per-iteration instrumentation runs for this variant —
neither the reservoir-sampled wake latencies, the shared-memory
iter_slot publish that host sampling reads, nor the periodic
max-gap tracking. The custom closure owns its own telemetry and
must populate the WorkerReport fields it wants measured
(iterations, wake_latencies_ns, max_gap_ns, etc.); any
field left at WorkerReport::default() is reported as zero by
downstream evaluation. Assertions like
assert_not_starved that
compute wake-latency percentiles will produce zero/degenerate
numbers against a Custom report that did not record them.
work_units vs iterations — which assertion reads which: the
two WorkerReport counters are NOT interchangeable. Headline
throughput — CgroupStats::total_iterations and the derived rates
iterations_per_worker / iterations_per_cpu_sec and
migration_ratio — sums WorkerReport::iterations, NOT work_units.
The default fairness/starvation gate (assert_not_starved and the
min_work_units floor) and assert_throughput_parity read
WorkerReport::work_units, NOT iterations. Populate BOTH (set them
equal when the closure has a single loop counter, as the custom_spin_fn
fixture does), or set each to the quantity the assertions you
target will read. A report with work_units > 0, iterations == 0
passes the starvation gate but reports zero throughput, so
claim_total_iterations(..).at_least(N) silently fails; the inverse
(iterations > 0, work_units == 0) reports throughput but the
starvation gate flags every worker.
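A minimal sketch of a closure-style work function that populates both counters. The `WorkerReport` below is a local stand-in with only the two fields discussed here (the crate's type has more), and `spin_until_stopped` is a hypothetical name.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Local stand-in for the crate's WorkerReport; only the two counters
/// discussed above are modelled.
#[derive(Debug, Default, Clone, PartialEq)]
struct WorkerReport {
    iterations: u64,
    work_units: u64,
}

/// Does one unit of work per loop pass and checks the stop flag after
/// each pass, so at least one pass always runs.
fn spin_until_stopped(stop: &AtomicBool) -> WorkerReport {
    let mut count = 0u64;
    loop {
        std::hint::spin_loop(); // one unit of work
        count += 1;
        if stop.load(Ordering::Relaxed) {
            break;
        }
    }
    // Single loop counter: set both fields equal so throughput and
    // starvation assertions both see the work.
    WorkerReport {
        iterations: count,
        work_units: count,
    }
}
```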
Process-group lifecycle (per CloneMode):
Fork mode — every worker calls setpgid(0, 0) immediately
after fork, giving the worker its own process group
(pgid == worker_pid). Any child processes the custom
closure forks (a helper binary via execv, a subshell via
sh -c, etc.) inherit that pgid unless they explicitly
change it. On teardown, stop_and_collect issues
killpg(worker_pid, SIGKILL) unconditionally (on both the
graceful-exit and StillAlive-escalation paths) and
WorkloadHandle::drop issues another killpg on handle
teardown, so every descendant a Custom closure spawns
will be SIGKILLed at worker teardown — there is no opt-out.
Closures that need children to outlive the worker must
either detach them from the worker’s pgid
(setpgid(child_pid, 0) after fork) or wait on them
explicitly before returning the WorkerReport. The
grandchild reaping tests in this module pin this sweep
end-to-end.
Thread mode — setpgid(0, 0) does NOT run; thread workers
share the test runner’s pgid and cannot have one of their
own (pgid is per-process / per-tgid). killpg-based cleanup
is therefore unavailable: if a Thread-mode Custom closure
forks helpers (e.g. via Command::spawn), those helpers
inherit the test runner’s pgid and will not be reaped on
worker teardown. You own teardown for any helpers a
Thread-mode Custom closure spawns — wait on them before
returning, or arrange explicit kill/wait before returning
the WorkerReport.
Thread-mode prohibition on process-scoping syscalls:
under Thread mode, the closure runs as a thread inside the
parent (test-runner) process, sharing pid/tgid, the signal-
disposition table, the file descriptor table, cwd, and
every other process-scoped attribute with every sibling
worker AND with the test harness. Do NOT call
_exit()/exit(), setpgid()/setsid(), execve(),
chdir()/chroot(), setresuid()/setresgid(),
prctl(PR_SET_*) or any other process-scoping syscall —
these affect the entire process, including all sibling
workers and the test harness itself, and will produce silent
cross-worker corruption, unexpected test-harness exits, or
both. fork()/vfork()/clone() are equally unsupported but
for a distinct reason: a fork from a thread of this
multi-threaded process duplicates only the calling thread, so
any lock another thread holds at fork time (glibc malloc
arena, internal mutexes) stays locked forever in the child.
The supported
shutdown contract is: observe the &AtomicBool argument’s
stop.load() flag and return the WorkerReport when it
flips. This is a runtime contract, not a static check —
Custom closures are arbitrary user code and the framework
cannot detect violations at spawn time. If your workload
genuinely needs _exit/fork/etc., use CloneMode::Fork
where each worker IS its own process. The
WorkType::ForkExit + CloneMode::Thread combination
is rejected at spawn time precisely because of this — see
WorkloadHandle::spawn.
Serde: the Custom variant is #[serde(skip)] because
the run field is a fn pointer that has no portable wire
format. Serializing a WorkloadConfig with WorkType::Custom
emits an error; persisted configs (e.g. captured via
cargo ktstr export) must use a built-in variant. Test
authors who want a custom worker should keep WorkType::Custom
inline in the test body and not roundtrip the config.
Construction
Prefer the WorkType::custom constructor — it takes a
bare fn pointer and transparently wraps it in CustomFn,
defaulting cfg to CustomCfg::default:
WorkType::custom("my_workload", my_fn). To pass a fork-safe
config payload, use WorkType::custom_with with a CustomCfg
(Copy POD; for variable-length / shared state pass a MAP_SHARED
region address through a u64 slot). Struct-literal construction
requires the wrap explicitly: WorkType::Custom { name: "my_workload".into(), run: CustomFn(my_fn), cfg: CustomCfg::default() }. The constructor path is the supported
user-facing API; the struct-literal form exists for test-internal
construction where the call site already deals with the newtype.
ThunderingHerd
One waker, N waiters on a SINGLE global futex word, repeated
in batches with a sleep gap. Distinct from
FutexFanOut which uses one futex per
fan-out group: ThunderingHerd parks every worker on the same
queue, so a single FUTEX_WAKE rouses the entire herd
simultaneously. Exercises the broadcast-wake path through
try_to_wake_up and the scheduler’s ability to spread the
woken cohort across CPUs without convoying.
The first worker (index 0) is the waker; the remaining
num_workers - 1 are waiters. Pick waiters >= 5 so the
herd (5) + waker (1) = 6 tasks saturates a 4-core host,
making convoy effects observable; scale up further on
larger hosts so the runnable cohort exceeds the cgroup’s
CPU budget. worker_group_size = num_workers so every
worker shares the same shared-memory region; reuses the
existing futex MAP_SHARED allocator.
Fields
waiters: usize
Number of waiter workers (the herd). Must satisfy
num_workers == waiters + 1 (1 waker + waiters).
PriorityInversion
Three priority tiers contending for one shared lock. low
workers acquire the lock and hold it while doing CPU work;
medium workers do non-blocking CPU work (no lock) at a
higher priority so they can preempt low; high workers
try to acquire the lock at top priority. When medium keeps
preempting low, high waits on the lock indefinitely —
classic priority inversion.
pi_mode = FutexLockMode::Pi uses FUTEX_LOCK_PI (PI-aware
mutex); the kernel boosts low to high’s priority for the
duration of the hold, which both unblocks high and keeps
medium from preempting. FutexLockMode::Plain uses a plain
futex with no boost, so the inversion goes uncorrected.
Tests both halves of the rt_mutex PI chain under the same
workload shape.
Requires same-CPU pinning (e.g. AffinityIntent::SingleCpu)
for medium to actually preempt low. Without pinning, the
scheduler distributes the priorities across CPUs and the
inversion never materialises.
worker_group_size = high_count + medium_count + low_count
so all three tiers share one futex region.
Fields
medium_count: usize
Number of medium-priority workers. Run at a priority
above low_count so they preempt the lock holder.
low_count: usize
Number of low-priority workers. Each holds the shared
lock during its hold_iters CPU burst.
work_iters: u64
CPU-spin iterations every worker burns between
lock-acquire attempts (high/low) or between
non-blocking work cycles (medium).
pi_mode: FutexLockMode
Whether the workload uses a PI-aware futex (Pi,
invokes FUTEX_LOCK_PI and the rt_mutex PI boost
chain in kernel/futex/pi.c) or a plain non-PI futex
(Plain, uncorrected inversion). See FutexLockMode.
ProducerConsumerImbalance
Producer / consumer pipeline with deliberately-unbalanced
rates. producers workers push items at produce_rate_hz;
consumers workers pop items and burn consume_iters of
CPU work per pop. When producers * produce_rate_hz
exceeds consumers * (1 / consume_time), the queue grows
monotonically toward queue_depth_target, exercising
scheduler unfairness under sustained backpressure.
The shared queue is an SPSC/MPSC ring buffer in MAP_SHARED
memory sized to queue_depth_target * 8 bytes (u64 slots).
Worker indices [0, producers) are producers; indices
[producers, producers + consumers) are consumers.
worker_group_size = producers + consumers.
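The imbalance condition is simple arithmetic and can be sketched as follows. `consume_time_s` (seconds one consumer spends per pop) is an assumed, measured input, since the config carries only consume_iters; both function names are hypothetical.

```rust
/// Net queue growth in items per second: total production rate minus
/// total consumption rate.
fn net_growth_per_sec(
    producers: u64,
    produce_rate_hz: u64,
    consumers: u64,
    consume_time_s: f64,
) -> f64 {
    let produced = (producers * produce_rate_hz) as f64;
    let consumed = consumers as f64 / consume_time_s;
    produced - consumed
}

/// Seconds until the queue reaches queue_depth_target from empty, or None
/// when consumers keep up and the queue does not grow.
fn secs_to_target(growth_per_sec: f64, queue_depth_target: u64) -> Option<f64> {
    if growth_per_sec > 0.0 {
        Some(queue_depth_target as f64 / growth_per_sec)
    } else {
        None
    }
}
```

For example, 4 producers at 1000 Hz against 2 consumers that each need 1 ms per pop produce 4000 items/s and drain 2000 items/s, so a target depth of 10,000 is reached after about 5 seconds.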
Fields
produce_rate_hz: u64
Target rate per producer (items per second). Producers
pace themselves with nanosleep between pushes.
RtStarvation
rt_workers workers run as SCHED_FIFO at rt_priority
burning 100% CPU with burst_iters CPU work per iteration
(no yields). cfs_workers workers run as SCHED_NORMAL and
try to do work in the same scheduling domain. Without DL
server protection (sched_ext does not have one — see the
scx_ext docs), the SCHED_NORMAL workers starve.
Reproducer setup: pin both groups to the same CPU set
(e.g. via AffinityIntent::SingleCpu), and on the host set
sysctl_sched_rt_runtime_us=-1 for unlimited RT bandwidth
(otherwise the kernel rt_period throttle unsticks things
after 0.95s).
Worker indices [0, rt_workers) get SCHED_FIFO applied
post-fork via sched_setscheduler; the remainder stay on
SCHED_NORMAL. worker_group_size = rt_workers + cfs_workers.
Fields
cfs_workers: usize
Number of SCHED_NORMAL (CFS) workers competing on the same CPU set. Expected to starve.
AsymmetricWaker
Paired workers with mismatched scheduling classes share a
single futex word for hand-off. The waker (worker index 0)
runs as waker_class;
the wakee (worker index 1) runs as
wakee_class. After
burst_iters of CPU work the waker advances the futex word
and FUTEX_WAKEs the wakee; the wakee blocks in
FUTEX_WAIT between turns. Tests wake-affine placement
when waker and wakee live in different scheduling classes
(e.g. an RT waker waking an EXT wakee — does the scheduler
place the wakee on the waker’s CPU, the wakee’s last CPU,
or somewhere else?).
worker_group_size = 2. Wake latency is recorded into the
wakee’s wake_latencies_ns reservoir using the same
before_block → cur != expected measurement as
FutexPingPong.
Fields
waker_class: SchedClass
Scheduling class for the waker (worker index 0).
wakee_class: SchedClass
Scheduling class for the wakee (worker index 1).
WakeChain
Pipeline of waker-wakee hops forming a ring of depth stages.
Two wake mechanisms gated by the wake field — see
WakeMechanism for kernel citations:
- WakeMechanism::Pipe: anon-pipe ring (depth pipes per chain). Wakes carry WF_SYNC via wake_up_interruptible_sync_poll, biasing scheduler placement against migration. Tests the SCX_WAKE_SYNC path that scx variants must respect.
- WakeMechanism::Futex: single shared futex word per chain. The active stage advances the word and FUTEX_WAKEs; the stage whose pos matches runs, others re-park. No WF_SYNC.
Worker indices are partitioned into num_workers / depth
chains of depth workers each. worker_group_size = depth
so the spawn-side allocates one independent futex region
per chain. At the end of the chain the last worker loops
back to the first, forming a ring so the work pattern can
run for a long test window.
To run multiple parallel chains, set num_workers to a
multiple of depth greater than depth itself — the
spawn-side derives the chain count from the ratio.
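The partitioning can be sketched as index arithmetic. `ChainSlot` and `chain_slot` are hypothetical names, and the sketch assumes each chain occupies a contiguous run of worker indices.

```rust
/// Where a worker sits in the ring layout.
#[derive(Debug, PartialEq, Eq)]
struct ChainSlot {
    chain: usize,       // which chain the worker belongs to
    pos: usize,         // stage position within the chain
    next_worker: usize, // global index of the stage it signals next
}

/// Returns None when num_workers is not a multiple of depth or the index
/// is out of range. The last stage wraps back to the first, forming a ring.
fn chain_slot(num_workers: usize, depth: usize, worker_idx: usize) -> Option<ChainSlot> {
    if depth == 0 || num_workers % depth != 0 || worker_idx >= num_workers {
        return None;
    }
    let chain = worker_idx / depth;
    let pos = worker_idx % depth;
    let next_worker = chain * depth + (pos + 1) % depth;
    Some(ChainSlot { chain, pos, next_worker })
}
```

With `num_workers = 6` and `depth = 3` there are two chains; worker 2 closes the first ring by signalling worker 0, and worker 5 closes the second by signalling worker 3.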
When wake == WakeMechanism::Pipe, the spawn-side
additionally allocates depth pipes per chain — see
chain_pipe_depth and the
chain_pipes field on SpawnGuard (early-bail path) and
WorkloadHandle (success path).
Both CloneMode::Fork and CloneMode::Thread are
supported for WakeMechanism::Pipe. On a successful spawn
the chain-pipe fds transfer from the guard into
WorkloadHandle, and WorkloadHandle::drop closes them
only after every worker is reaped (Fork) or joined (Thread).
Under Thread mode each worker thread shares the parent’s fd
table, so the post-shutdown close is what guarantees workers
finish their read / write ops before the fds become
invalid.
Fields
depth: usize
Number of workers per chain. Each worker waits for its
predecessor’s signal, does work_per_hop of CPU work,
signals the next worker, and repeats.
wake: WakeMechanism
Selects the wake mechanism between stages; see
WakeMechanism.
WakeMechanism::Pipe allocates one anonymous pipe
per stage (a chain ring of depth pipes) and uses
write(1 byte) / read(1 byte) (poll-stop-pollable)
for stage handoffs. The kernel raises WF_SYNC on the
wake because anon_pipe_write (fs/pipe.c) calls
wake_up_interruptible_sync_poll (include/linux/wait.h)
which expands to __wake_up_sync_key (kernel/sched/wait.c)
and that passes WF_SYNC through __wake_up_common_lock
to try_to_wake_up. WF_SYNC biases scheduler placement
away from migrating the woken stage off the waker’s CPU
— testing the wake-affine cohabitation that scx variants
must respect.
WakeMechanism::Futex uses the existing futex-word
ring: FUTEX_WAKE fans out to every parked worker on
the same word, the active stage proceeds, the rest
re-park. No WF_SYNC; the scheduler is free to
migrate the woken stage.
The Pipe path needs depth pipes per chain — see
chain_pipe_depth — and
closes the inverse ends of every other stage’s pipe in
the worker post-fork. The kernel-side WF_SYNC raise
is verified by reading the call chain:
anon_pipe_write at fs/pipe.c:431-601,
wake_up_interruptible_sync_poll at
include/linux/wait.h:246-247, and
__wake_up_sync_key at kernel/sched/wait.c:186-193.
NumaWorkingSetSweep
Workers allocate a region_kib KiB region with set_mempolicy
pinned to one node, touch every page in that region, then
mbind(MPOL_BIND) the region to the next node in
target_nodes and re-touch — moving the working set across
NUMA nodes every sweep_period_ms. Exercises page migration
(migrate_pages / move_pages), the kernel’s NUMA-balancing
path (task_numa_work), and scheduler placement decisions
under sustained working-set churn.
Each worker rotates independently through the same
target_nodes list with a per-worker phase offset so the
cohort doesn’t bind every region to the same node at the
same instant. worker_group_size = None (any worker count
is valid; each worker mbinds its own region without shared
state).
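A small sketch of the rotation with a per-worker phase offset. Using the worker index itself as the offset is an assumption, and `node_for_step` is a hypothetical helper; the crate may derive the phase differently.

```rust
/// NUMA node a worker binds to at a given sweep step, or None when
/// target_nodes is empty (binding disabled).
fn node_for_step(target_nodes: &[usize], worker_idx: usize, step: usize) -> Option<usize> {
    if target_nodes.is_empty() {
        return None;
    }
    // Assumed phase offset: each worker starts worker_idx positions into the list.
    Some(target_nodes[(worker_idx + step) % target_nodes.len()])
}
```

With `target_nodes = [0, 2, 3]`, worker 0 starts on node 0 while worker 1 starts on node 2, so the cohort does not bind every region to the same node at the same step.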
Fields
region_kib: usize
Size of the working-set region per worker (KiB). Each worker allocates this much anonymous memory and re-binds it across NUMA nodes.
sweep_period_ms: u64
Wall-clock interval between binds. After every
sweep_period_ms, the worker rotates to the next node
in target_nodes and mbinds the region.
target_nodes: Vec<usize>
Ordered list of NUMA node IDs the working set rotates through. Empty list disables binding (the worker still touches the region every iteration; no migration is triggered). Single-node lists pin the region to one node permanently, which is useful as an A/B baseline against a rotating sweep.
CgroupChurn
Workers cycle their cgroup membership between sibling cgroups
every cycle_ms, rewriting cgroup.procs to drive
sched_move_task (kernel/sched/core.c) and the registered
scx_cgroup_move_task ops callback. Distinct from
AffinityChurn: that variant rotates
task_struct->cpus_ptr (cpuset membership) and never moves
the task between cgroup containers; CgroupChurn rotates
the cgroup itself, which takes the cgroup_threadgroup_rwsem
write lock and exercises the per-class sched_move_task /
task_change_group callbacks. Zero coverage today.
The worker auto-creates the rotation cgroups
wt-cgroup-churn-<i> for i in 0..groups under the workload
cgroup root (default /sys/fs/cgroup/ktstr, or the per-test
#[ktstr_test(workload_root_cgroup = "/path")] root) at entry,
as empty leaf cgroups (no subtree_control) so they accept
cgroup.procs migration. Each iteration the worker writes its
tid to the next sibling in rotation. worker_group_size = None
(any worker count
valid; each worker rotates independently). Per-iteration
budget is one write syscall to cgroup.procs.
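The naming scheme and per-iteration target can be sketched as pure path arithmetic (nothing here touches the filesystem). `rotation_cgroups` and `procs_target` are hypothetical helpers, and the index-order rotation is an assumption about how "next sibling" is chosen.

```rust
use std::path::{Path, PathBuf};

/// Paths of the rotation cgroups wt-cgroup-churn-<i> under the workload root.
fn rotation_cgroups(root: &Path, groups: usize) -> Vec<PathBuf> {
    (0..groups)
        .map(|i| root.join(format!("wt-cgroup-churn-{}", i)))
        .collect()
}

/// The cgroup.procs file written on a given iteration, or None when there
/// are no rotation cgroups.
fn procs_target(cgroups: &[PathBuf], iteration: usize) -> Option<PathBuf> {
    if cgroups.is_empty() {
        return None;
    }
    Some(cgroups[iteration % cgroups.len()].join("cgroup.procs"))
}
```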
Fields
CgroupAttachStorm
Each iteration the worker forks a transient child and migrates
it — the whole process — into a sibling cgroup by writing the
child’s pid to <dest>/cgroup.procs, while the child immediately
_exits. The cgroup.procs write drives the kernel’s
threadgroup-wide attach path: cgroup_procs_write →
cgroup_attach_task(dst, leader, threadgroup=true) walks
while_each_thread(leader, task) and migrates every member of the
tgid, then fires TRACE_CGROUP_PATH(attach_task, …)
(kernel/cgroup/cgroup.c). A whole-process cgroup.procs write of
an exiting child is the leader-acquire race a tp_btf /
cgroup_attach_task BPF handler must survive — migrating a task
that is concurrently tearing down.
Distinct from both sibling primitives, and not expressible by either:
- ForkExit forks and waitpids its child but never writes cgroup.procs: no migration, no attach path.
- CgroupChurn writes its own tid to rotate cgroups but never forks: no transient-child leader race.
reap (ReapMode) selects the disposition that decides whether
the migration races the child’s teardown:
- SigIgn (default) installs SIGCHLD = SIG_IGN once so children auto-reap concurrently with the write. This is the racing case.
- Waitpid blocking-reaps each child after the write. This is the non-racing A/B control.
dest is the name of a cgroup that must already exist under the
worker cgroup’s parent, typically created via
Op::add_cgroup. The target
resolves to <worker-cgroup>.parent()/<dest>/cgroup.procs from
the worker’s resolved cgroup-v2 dir (the same resolution
WorkerCtx::open_sibling_cgroup_procs uses), so the worker must
run in a dedicated cgroup — not the root. A single-component dest
names a sibling of the worker cgroup; a multi-component dest
(e.g. a/b) addresses a nested descendant of that parent. If
dest cannot be resolved or its cgroup.procs is not writable the
worker logs a warning once and the storm no-ops (a vacuous
“scheduler survived” is surfaced loudly, never silently);
work_units stays zero so a caller can detect the no-op.
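The dest resolution rule can be sketched as follows. This is an illustrative approximation, not `WorkerCtx::open_sibling_cgroup_procs` itself: the function name, the split into a mount point plus a relative cgroup path, and the rejection of empty or absolute names are all assumptions.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: resolve `<worker-cgroup>.parent()/<dest>/cgroup.procs`.
/// `mount` is the cgroup-v2 mount point; `worker_rel` is the worker's cgroup
/// path relative to it (the form found in /proc/self/cgroup, e.g. "/ktstr/worker").
fn resolve_dest_procs(mount: &Path, worker_rel: &str, dest: &str) -> Option<PathBuf> {
    let worker = Path::new(worker_rel.trim_start_matches('/'));
    // The root cgroup ("/" before trimming, "" after) has no parent, so a
    // worker running in the root cannot name a sibling.
    let parent = worker.parent()?;
    // Assumption: reject empty or absolute names, because joining an absolute
    // path would silently discard the parent prefix.
    if dest.is_empty() || dest.starts_with('/') {
        return None;
    }
    Some(mount.join(parent).join(dest).join("cgroup.procs"))
}
```

A `None` result corresponds to the documented no-op case: the caller would warn once and leave work_units at zero.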
Exclusive to CloneMode::Fork:
the worker installs SIGCHLD = SIG_IGN to auto-reap its forked
children (under ReapMode::SigIgn), and a thread-group worker
shares the harness sighand, so that install would corrupt the
harness’s own child reaping; fork from a thread of the harness is
also fragile. CloneMode::Thread
is therefore rejected at spawn. worker_group_size = None (any
worker count valid; each worker storms independently).
Fields
dest: String
Name of the cgroup whose cgroup.procs each forked child is
migrated into, resolved relative to the worker cgroup’s parent
(a single-component name is a sibling; a multi-component name a
nested descendant). Must already exist (e.g. via
Op::add_cgroup).
SignalStorm
Paired workers signal each other with kill(partner, SIGUSR1). Each worker installs a SIGUSR1 handler via
sigaction, then alternates: do work_iters of CPU work,
fire signals_per_iter signals at the partner, repeat.
Exercises signal_wake_up_state (kernel/signal.c) and the
per-task sighand->siglock, which is distinct from the
futex pi_lock path. The wake itself goes through
kick_process / smp_send_reschedule, not
ttwu_queue_wakelist.
Workers are paired (0,1), (2,3), … so worker_group_size = 2
and num_workers must be even. Partner tids are exchanged
via the existing pair shared-memory region. The signal
handler is a no-op SA_RESTART handler; its only purpose is
to trip TIF_SIGPENDING on the partner and force the
scheduler through the signal-delivery wake path.
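The (0,1), (2,3), … pairing reduces to flipping the lowest bit of the worker index. The sketch below shows that mapping together with the even-count check; the function name and error strings are hypothetical, and the real spawn path may validate differently.

```rust
/// Hypothetical helper: index of the worker that `worker` signals.
/// Pairs are (0,1), (2,3), ... so the partner differs only in the lowest bit.
fn partner_index(worker: usize, num_workers: usize) -> Result<usize, String> {
    if num_workers % 2 != 0 {
        return Err(format!("num_workers must be even, got {num_workers}"));
    }
    if worker >= num_workers {
        return Err(format!("worker index {worker} out of range"));
    }
    Ok(worker ^ 1)
}
```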
Fields
PreemptStorm
Mixed RT + CFS preemption pressure. One worker per group runs
as SCHED_FIFO doing rt_burst_iters of CPU work followed
by clock_nanosleep(rt_sleep_us); the remaining
cfs_workers workers run as SCHED_NORMAL and spin
continuously. Each RT wake (post-nanosleep) hits
wakeup_preempt (kernel/sched/core.c) → resched_curr,
preempting the CFS worker on the same CPU. Drives sustained
nonvoluntary_ctxt_switches on the CFS workers.
Distinct from RtStarvation which
monopolises the CPU at 100% RT (and relies on
sysctl_sched_rt_runtime_us=-1) and from
PriorityInversion which uses a
PI-aware lock chain. PreemptStorm is the
“RT-flickers-and-preempts” pathology: short bursts at high
frequency, no monopolisation.
worker_group_size = cfs_workers + 1. Worker index 0 in
each group is the RT worker; indices 1..=cfs_workers are
CFS spinners. RT priority defaults to 1 (the lowest level above
SCHED_NORMAL); raise the priority only when the host’s
RLIMIT_RTPRIO and CAP_SYS_NICE allow it.
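The group layout can be expressed as a small role lookup. This is a sketch under the assumption that groups are laid out contiguously by global worker index; the `Role` enum and `role_of` function are illustrative names, not part of the crate.

```rust
/// Hypothetical role tag for a PreemptStorm worker.
#[derive(Debug, PartialEq)]
enum Role {
    Rt,  // SCHED_FIFO burst-and-sleep worker
    Cfs, // SCHED_NORMAL spinner
}

/// Index 0 of each group of `cfs_workers + 1` workers is the RT worker;
/// the remaining indices are CFS spinners.
fn role_of(worker: usize, cfs_workers: usize) -> Role {
    let group_size = cfs_workers + 1;
    if worker % group_size == 0 {
        Role::Rt
    } else {
        Role::Cfs
    }
}
```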
Fields
EpollStorm
Producers / consumers connected by a single eventfd +
epoll_wait pair. Producers write(eventfd, &1u64) in a
burst loop; consumers wait in epoll_wait(maxevents=1),
read the counter, and burn one CPU-burst before re-arming
the wait. Exercises __wake_up_common (kernel/sched/wait.c)
with exclusive autoremove — ONE wake per event, distinct
from ThunderingHerd’s broadcast
futex wake. Hits scx_select_cpu_dfl WITHOUT the
SCX_WAKE_SYNC fast-path because epoll_wait is not a
sync wakeup primitive.
worker_group_size = producers + consumers; needs shared
memory for the eventfd / epoll fd handoff between sibling
workers. Producers’ events_per_burst controls how many
writes they issue back-to-back before one nanosleep gap
(paces production rate without per-event sleep overhead).
Fields
producers: usize
Number of producer workers per group. Each writes
events_per_burst events per cycle.
NumaMigrationChurn
Workers rotate sched_setaffinity across NUMA nodes every
period_ms. Reads online NUMA nodes from
/sys/devices/system/node/online at startup, then cycles
the worker through one node’s CPUs per period. Exercises
task migration via select_task_rq (kernel/sched/core.c)
with the WF_MIGRATED flag and, on sched_ext, the
SCX_OPS_BUILTIN_IDLE_PER_NODE branch of scx_select_cpu_dfl.
Distinct from NumaWorkingSetSweep
which moves the working-set MEMORY across nodes via mbind;
NumaMigrationChurn moves the TASK across nodes via
sched_setaffinity. worker_group_size = None. On hosts
with one NUMA node, the variant degenerates to a no-op
(every iteration re-pins to the same node).
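The startup parse and per-period node choice might look like the sketch below. It assumes the sysfs file uses the usual kernel list format (for example `0-1,3`); both function names are hypothetical, and the round-robin ordering is an assumption about how the worker cycles.

```rust
/// Parse a kernel id list such as "0-1,3" into individual ids.
/// Returns None on malformed input.
fn parse_id_list(text: &str) -> Option<Vec<usize>> {
    let mut ids = Vec::new();
    for part in text.trim().split(',').filter(|p| !p.is_empty()) {
        match part.split_once('-') {
            Some((lo, hi)) => {
                let (lo, hi): (usize, usize) = (lo.parse().ok()?, hi.parse().ok()?);
                if lo > hi {
                    return None;
                }
                ids.extend(lo..=hi);
            }
            None => ids.push(part.parse().ok()?),
        }
    }
    Some(ids)
}

/// Assumption: the worker visits nodes round-robin, one per period.
fn node_for_period(nodes: &[usize], period: u64) -> Option<usize> {
    if nodes.is_empty() {
        return None;
    }
    Some(nodes[(period % nodes.len() as u64) as usize])
}
```

On a single-node host `node_for_period` always returns the same node, which is the degenerate no-op case described above.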
IdleChurn
CPU burst for burst_duration followed by nanosleep for
sleep_duration, repeated. Exercises task off-CPU/on-CPU
transitions: nanosleep dequeues the worker into
TASK_INTERRUPTIBLE; on the pinned CPU, when no other tasks
are runnable, __pick_next_task selects the idle class
(pick_task_idle at kernel/sched/idle.c:501-505); on
nanosleep expiry the hrtimer callback hrtimer_wakeup
calls wake_up_process → try_to_wake_up.
§When to use IdleChurn
Reach for IdleChurn when the test needs the kernel’s hrtimer + idle-class scheduling path — exercising the nanosleep → schedule → idle → hrtimer-wakeup loop that the idle thread itself observes. Concrete pickers:
- You need to measure scheduler wake placement after a TASK_INTERRUPTIBLE dequeue: IdleChurn blocks via nanosleep directly, the same hrtimer path the idle thread enters when no work is runnable.
- You need to drive the tick-stop / C-state boundary on the pinned CPU: sleeps > 1ms exercise the full idle path including the tickless branch (tick_nohz_idle_enter).
- You’re A/B-testing scheduler behavior on the idle-class transition specifically (e.g. scx_lavd’s idle-CPU selection vs scx_simple’s), and need a reproducible workload that passes through the kernel idle path.
Choose Bursty instead when:
- The test measures THROUGHPUT under burst-then-sleep patterns at the millisecond regime: Bursty uses thread::sleep (which is itself nanosleep-backed but coarser-grained in libc) and matches the existing pthread/std-lib timing model most application benchmarks assume.
- The test needs >1 ms sleeps without caring about the idle-class transition specifically: Bursty is the simpler variant and has fewer caveats than those listed below.
IdleChurn is distinct from variants that block on
futex/pipe (FutexPingPong, PipeIo, WakeChain) — those
route the wake through futex_wake /
wake_up_interruptible_sync_poll, exercising
inter-task-coordination paths. IdleChurn’s blocking
primitive is the hrtimer expiry, not a peer’s wake call.
§Caveat impacts at a glance
NB: the five bullets below mirror the detailed sections that follow — keep both in sync when editing.
The five sections below detail the kernel-side mechanisms. For test authors picking thresholds, the practical per-iteration impact is:
- Timer slack: observed sleep is sleep_duration + current->timer_slack_ns. Default slack is 50µs, so a sleep_duration of 80µs produces ~130µs actual sleep. For sleep_duration ≥ 1ms the slack is < 5% noise; for sub-100µs sleeps the slack floor dominates.
- Task off-CPU vs CPU idle: the worker off-CPUs every iteration regardless of placement, but the CPU only enters the idle class under exclusive pinning. Without AffinityIntent::SingleCpu the CPU runs another runnable task during the sleep window, so the variant tests TASK transitions, not CPU-idle.
- Degenerate-input rejection: spawn-side rejects Duration::ZERO for either field with an actionable bail message. burst_duration=0 collapses the loop to pure nanosleep (worker accrues no runtime); sleep_duration=0 overlaps with two existing variants. SpinWait is the bail message’s forwarding target (no idle path exercised, pure spin loop), but the kernel-level semantic is closer to YieldHeavy since nanosleep(0) still calls set_current_state(TASK_INTERRUPTIBLE) + schedule() (sched_yield-equivalent).
- NO_HZ_FULL: workers pinned to a CPU in the nohz_full= mask see LOWER median wake_latencies_ns (tick re-arm is skipped) but heavier high-percentile tail (deferred jiffy-driven work catchup). Mixing pinned-vs-unpinned workers across the mask boundary produces a bimodal distribution.
- vCPU-in-KVM: wake latency aggregates guest + host scheduler costs. performance_mode=true disables HLT vmexits so the test measures guest scheduling in isolation; performance_mode=false exercises the cross-VM idle path but adds host-scheduler jitter bounded by one host scheduler tick.
§Task off-CPU is guaranteed; CPU idle is conditional
IdleChurn exercises the TASK off-CPU/back-on-CPU transition on every iteration — NOT necessarily the CPU idle/exit transition. The two are distinct paths in the scheduler and a test must pick the one the design requires:
- do_nanosleep at kernel/time/hrtimer.c:2284-2317 calls set_current_state(TASK_INTERRUPTIBLE | TASK_FREEZABLE) then schedule(). The current task IS dequeued and goes off-CPU on every iteration regardless of what else is runnable. nr_voluntary_ctxt_switches ticks per iteration unconditionally.
- Whether the CPU enters the idle class (__pick_next_task selecting pick_task_idle) depends on what else is on the runqueue. If any other task is runnable on the pinned CPU, schedule() picks it and the CPU never idles for that iteration.
Three concrete scenarios where the CPU does NOT enter the idle class even though IdleChurn fired:
1. Multi-worker on a single CPU: IdleChurn with num_workers=2 and overlapping affinity runs A and B on the same CPU. When A nanosleeps, B is runnable; CPU runs B, never idles. The variant tests “worker churn” rather than “CPU idle/exit transitions”.
2. Co-scheduled kernel threads: kworker, ksoftirqd, rcu_* kthreads (kthread_run on the same CPU) and deferred-work softirqs run on every CPU. ksoftirqd is woken from wakeup_softirqd (kernel/softirq.c) when irq_exit observes pending softirqs after inline processing, so its wake frequency tracks irq load, not a fixed cadence. Sleep durations short enough to overlap with steady-state softirq backlog (e.g. NIC interrupt pressure) may observe ksoftirqd preempting the IdleChurn worker between iterations, diluting the idle-transition signal.
3. Sibling test workloads in the same LLC: a peer test pinned to a different CPU within the same LLC can spawn kernel threads that get migrated onto IdleChurn’s CPU by the kernel’s load balancer. The migration is invisible to the IdleChurn worker but breaks the “CPU is exclusive” assumption.
For TASK-off-CPU testing (the default and the variant’s guaranteed semantic): no special pinning required — every iteration off-CPUs the worker.
For CPU-idle-class testing: ensure the worker has exclusive CPU affinity AND no co-scheduled kernel threads. Concrete recipe:
- Use AffinityIntent::SingleCpu or a one-CPU Exact mask so only this worker is pinned to the CPU.
- Run under performance_mode=true so the CPU lock budget reserves the CPU for this test.
- Set num_workers=1 (multiple IdleChurn workers on the same CPU break the assumption; see scenario 1 above).
- Be aware that kernel-side periodic work (RCU callbacks, vmstat updates, watchdog ticks) still runs on every CPU regardless of affinity, so sub-millisecond sleeps will sometimes observe a non-idle iteration even with exclusive pinning.
This is a runtime contract, not a static one. The spawn-side does not check the affinity policy because “exclusive” depends on the rest of the host’s load, which the framework cannot observe at spawn time.
§Timer slack expands the requested sleep
The kernel adds current->timer_slack_ns to the requested
sleep_duration inside hrtimer_nanosleep at
kernel/time/hrtimer.c:2331-2356, specifically the
hrtimer_set_expires_range_ns(&t.timer, rqtp, current->timer_slack_ns) call at L2338.
timer_slack_ns is inherited from the parent at fork; the
kernel default propagated from init_task is 50000ns
(50µs, set at init/init_task.c:173). So:
- sleep_duration is a lower bound on the observed idle interval: actual sleep extends by up to current->timer_slack_ns to let the kernel coalesce timer wakeups.
- Sub-50µs sleep_duration values do not produce sub-50µs idle periods, because the slack floor dominates.
- RT workers bypass slack. Under SchedPolicy::Fifo or SchedPolicy::RoundRobin the kernel forces timer_slack_ns to 0 (kernel/sched/syscalls.c:258), so RT IdleChurn workers get exact wake timing. CFS / SCHED_NORMAL workers inherit the 50µs default.
- IdleChurn calls prctl(PR_SET_TIMERSLACK, 1) ONLY when the variant’s precise_timing field is true. The default is false, preserving the inherited 50µs slack for CFS workers. Set precise_timing: true using the struct-literal form (the idle_churn constructor leaves the field at its default) to shrink slack to 1ns for sub-50µs sleep_duration measurements. See the field’s doc for the kernel-source citation that explains why 1 (not 0) is the value that narrows slack.
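The slack arithmetic above can be worked through with plain `Duration` values. This is only a sketch of the bound, with hypothetical helper names; it does not query the kernel, and the 50µs constant is the default quoted in the text.

```rust
use std::time::Duration;

/// Default slack quoted in the text (50µs inherited from init_task).
const DEFAULT_SLACK: Duration = Duration::from_micros(50);

/// Upper bound on the observed sleep: the request plus the full slack window.
fn worst_case_sleep(requested: Duration, slack: Duration) -> Duration {
    requested + slack
}

/// Slack as a fraction of the requested sleep, for judging whether it is noise.
fn slack_fraction(requested: Duration, slack: Duration) -> f64 {
    slack.as_secs_f64() / requested.as_secs_f64()
}
```

An 80µs request can stretch to 130µs (slack is more than half of the request), whereas for a 1ms request the same slack is about 5% of the requested time.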
§Tick-stop boundary
Sleeps > 1ms exercise the full idle path including tick
stop and (on configured platforms) C-state entry —
tick_nohz_idle_enter, cpuidle_idle_call, governor
selection. Sub-millisecond sleeps still produce
sched_switch transitions but skip the tick-stop branch
because the tick is reprogrammed for the imminent
expiry rather than stopped entirely.
§NO_HZ_FULL alters wake observation
Three kernel tick configurations affect wake latency differently:
- CONFIG_HZ_PERIODIC: the periodic timer tick fires every 1/CONFIG_HZ seconds regardless of CPU state. Wake-from-idle latency is bounded above by the tick period; the kernel may choose to delay wakes to the next tick. Most predictable wake population, useful for strict-bound assertions.
- CONFIG_NO_HZ_IDLE: tick stops when a CPU goes idle but resumes immediately on any wake event. Wake latency reflects the TASK_INTERRUPTIBLE → TASK_RUNNING transition cost plus tick re-arming. This is the default on modern x86_64 / arm64 distro kernels and the posture ktstr’s bundled ktstr.kconfig inherits (the fragment does not override NO_HZ_*).
- CONFIG_NO_HZ_FULL: for CPUs in the nohz_full= boot parameter mask, the tick stays stopped even when one task is runnable. Wake delivery routes through hrtimer expiry alone; the kernel skips tick re-arm on wake when no tick-dependent subsystem demands it (tick_nohz_idle_enter at kernel/time/tick-sched.c), so steady-state wake_latencies_ns reads LOWER on nohz_full CPUs than on NO_HZ_IDLE CPUs. The catch: deferred jiffy-driven work (RCU callbacks, vmstat updates, watchdog ticks) accumulates while the tick is stopped and produces visible long-tail jitter when it eventually runs, manifesting as occasional high-percentile spikes in the wake-latency distribution even though the median drops.
IdleChurn behavior is consistent under
CONFIG_NO_HZ_IDLE (the default). On hosts with
CONFIG_NO_HZ_FULL, samples from workers whose CPU is in
the nohz_full mask are NOT directly comparable to samples
from CPUs outside that mask — the populations differ in
both the median (lower on nohz_full) and the tail (heavier
on nohz_full from deferred-work catchup). Tests asserting
precise idle-duration scheduler decisions (e.g. “tasks
idle <1ms get latency-sensitive treatment”) must either:
- require NO_HZ_FULL on (and pin the worker into the mask),
- require NO_HZ_FULL off (CONFIG_HZ_PERIODIC or CONFIG_NO_HZ_IDLE), or
- tolerate both populations with looser thresholds.
The active mask is readable at runtime via
/sys/devices/system/cpu/nohz_full. The file only
exists when the kernel was built with
CONFIG_NO_HZ_FULL=y; on a CONFIG_NO_HZ_IDLE-only
kernel (the typical distro default) the file is absent
and the test author can assume no nohz_full effects.
IdleChurn does not adjust the mask itself, and mixing
pinned-vs-unpinned workers in the same scenario produces
a bimodal latency distribution if the host is configured
for nohz_full.
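A test could check whether its pinned CPU sits inside the mask before choosing thresholds. The sketch below is one way to do that with the standard library only; both function names are hypothetical, and treating an unparseable or empty file as "no CPUs in the mask" is an assumption.

```rust
use std::fs;

/// Returns true when `cpu` appears in a kernel cpulist such as "2-5,8".
/// Anything unparseable (empty text, "(null)") counts as "not in the mask".
fn cpu_in_list(list: &str, cpu: usize) -> bool {
    list.trim().split(',').any(|part| match part.split_once('-') {
        Some((lo, hi)) => match (lo.parse::<usize>(), hi.parse::<usize>()) {
            (Ok(lo), Ok(hi)) => lo <= cpu && cpu <= hi,
            _ => false,
        },
        None => part.parse::<usize>().map_or(false, |c| c == cpu),
    })
}

/// An absent file means the kernel lacks CONFIG_NO_HZ_FULL, so no CPU is in the mask.
fn cpu_is_nohz_full(cpu: usize) -> bool {
    match fs::read_to_string("/sys/devices/system/cpu/nohz_full") {
        Ok(text) => cpu_in_list(&text, cpu),
        Err(_) => false,
    }
}
```

Grouping samples by the result of `cpu_is_nohz_full` keeps the two latency populations separate instead of asserting against a bimodal mix.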
§vCPU-in-KVM amplifies wake latency
ktstr tests run inside KVM guests. IdleChurn’s
nanosleep inside a guest vCPU has a layered cost:
1. Guest task calls nanosleep → guest kernel arms a guest-side hrtimer.
2. Guest task off-CPUs (TASK_INTERRUPTIBLE → schedule()).
3. Guest CPU idles → guest kernel issues HLT (or MWAIT on x86, WFI on arm64).
4. The HLT either vmexits to host KVM or spins in-guest (see perf-mode interaction below).
5. On vmexit: host KVM blocks the vCPU thread on a wait queue.
6. Guest-side timer expires (in guest time) → host KVM injects a timer interrupt → vCPU thread wakes → vmenter back to guest.
7. Guest kernel’s hrtimer ISR fires → wake_up_process → guest scheduler reruns the IdleChurn task.
wake_latencies_ns (the dispatch arm subtracts
sleep_duration to isolate scheduler-resume overhead)
captures the SUM of guest scheduling cost +
vmexit-vmenter round-trip + host scheduling cost. The
SCHEDULER-UNDER-TEST is the GUEST scheduler, but the
host’s contribution can dominate under load.
Strict bound on host preemption. The guest’s hrtimer expiry routes through the emulated LAPIC (x86) or arch timer (arm64), both backed by host timers. If the host has descheduled the vCPU thread (PLE-induced eviction from a busy guest spinlock, host-side preemption by higher-priority work, or simple oversubscription), the guest’s hrtimer CANNOT fire until the host re-runs the vCPU thread. This is a hard additional latency bound added on top of guest-side scheduling cost — the guest scheduler under test cannot be observed through IdleChurn while the host has preempted its vCPU.
Performance-mode interaction. This subsection
describes x86_64 only. ktstr’s x86_64 VMM disables HLT
vmexits when performance_mode=true (see
src/vmm/x86_64/kvm.rs::Vm::new around the
KVM_X86_DISABLE_EXITS_HLT enable_cap call). The
aarch64 VMM accepts the performance_mode flag but does
NOT configure WFI trap behavior (no HCR_EL2.TWI tweak in
src/vmm/aarch64/kvm.rs::Vm::new), so on aarch64 every
guest WFI exits to host regardless of performance_mode
— IdleChurn always exercises the cross-VM idle path
there. With HLT exits disabled (x86_64 only):
- Step 4 stays in-guest: the vCPU spins on HLT without vmexit, consuming its assigned host CPU slot. The guest kernel still sees the CPU as idle, but the host never blocks the vCPU thread.
- Steps 5-6 collapse: no host wait queue, no guest-time-aware injection. The host runs the vCPU thread continuously, and the guest hrtimer expiry is handled inside the running vCPU.
- IdleChurn under performance_mode=true therefore tests ONLY the guest’s idle path. It does NOT exercise the cross-VM idle / host-scheduler interaction. This is the right config for measuring guest scheduler decisions in isolation.
With performance_mode=false, HLT vmexits fire and
IdleChurn DOES test the cross-VM idle path — but the
host scheduler’s contribution to wake latency
interferes with timing-sensitive guest measurements.
Test-author guidance:
- For tests measuring GUEST scheduler decisions in isolation (e.g. scx_lavd idle-CPU selection): set performance_mode=true so the host doesn’t perturb the measurement.
- For tests measuring CROSS-VM idle (e.g. how the host schedules a vCPU thread after a guest HLT): set performance_mode=false, run on a dedicated host (no noisy neighbors), and budget for host-scheduler-contributed jitter.
- On a heavily-loaded host (concurrent ktstr tests, or noisy neighbors), wake_latencies_ns reflects host contention even under performance_mode=true because the vCPU thread itself can be preempted on the host (the guest sees this as “the worker just took longer than it should”).
Distinguishing host vs guest contribution requires
host-side observation — e.g. perf sched on the vCPU
thread, or comparing
/proc/<vcpu_tid>/status::voluntary_ctxt_switches
before vs after the test window.
§Wake-latency interpretation
wake_latencies_ns samples for IdleChurn capture the
scheduler-resume overhead — the time the kernel spent
scheduling the worker back on-CPU after the requested
sleep_duration elapsed. The dispatch arm subtracts
sleep_duration from the measured nanosleep elapsed time,
leaving timer slack (default 50µs) plus
try_to_wake_up → on-CPU latency. This isolates the
signal a scheduler A/B test cares about: comparing
wake_latencies_ns distributions across schedulers
directly measures their idle-class → run-class transition
behavior without the requested-sleep duration dominating
the measurement.
saturating_sub guards against the rare case where
elapsed < sleep_duration. That can happen on early-EINTR
returns or sub-tick measurement windows; saturating to 0
matches the “no observable resume overhead” interpretation.
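The subtraction described above amounts to a single saturating operation on `Duration`. The sketch below uses a hypothetical function name; it shows only the arithmetic, not how the dispatch arm obtains `elapsed`.

```rust
use std::time::Duration;

/// Scheduler-resume overhead: measured nanosleep time minus the requested
/// sleep, clamped at zero when the sleep returned early.
fn resume_overhead(elapsed: Duration, sleep_duration: Duration) -> Duration {
    elapsed.saturating_sub(sleep_duration)
}
```

A 5.2ms measurement against a 5ms request yields a 200µs sample, while a 4.9ms measurement (for example after an early EINTR return) yields zero rather than underflowing.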
Samples are comparable in DIRECTION to
wake_latencies_ns from FutexPingPong, FutexFanOut,
and other wake-pair variants (lower = better scheduler
resume), but the IdleChurn distribution carries a
~50µs floor from current->timer_slack_ns that
event-driven futex variants don’t. Cross-variant
absolute comparisons must subtract the slack floor or
limit the comparison to percentiles above P50, where the
slack contribution is dwarfed by tail latency.
§Spawn-time validation
The spawn path rejects burst_duration == Duration::ZERO
(loop collapses to pure nanosleep, no runtime accrued)
and sleep_duration == Duration::ZERO (loop degenerates
to SpinWait, making the variant
useless as an idle-path test).
The sleep_duration == 0 rejection deserves an
implementation-rationale note: nanosleep(0) is NOT a
no-op — the kernel still calls
set_current_state(TASK_INTERRUPTIBLE) followed by
schedule(), which produces sched_yield-equivalent
semantics (yield to the next runnable task on the
runqueue, return immediately). That overlaps with
YieldHeavy and provides no idle-path
signal, so the rejection sends the caller to the variant
that already covers the yield case. Both rejections
produce actionable bail messages naming the field and
the degenerate semantics — see the spawn-side check in
WorkloadHandle::spawn.
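The two rejections can be sketched as a standalone check. This is not the code in `WorkloadHandle::spawn`: the function name, the `Result<(), String>` shape, and the message wording are all hypothetical stand-ins for the real bail messages.

```rust
use std::time::Duration;

/// Hypothetical spawn-side check: reject either zero-length field and name
/// the field plus the degenerate behaviour it would cause.
fn validate_idle_churn(burst_duration: Duration, sleep_duration: Duration) -> Result<(), String> {
    if burst_duration.is_zero() {
        return Err(
            "burst_duration must be non-zero: a zero burst collapses the loop to pure nanosleep"
                .to_string(),
        );
    }
    if sleep_duration.is_zero() {
        return Err(
            "sleep_duration must be non-zero: a zero sleep exercises no idle path".to_string(),
        );
    }
    Ok(())
}
```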
worker_group_size = None — every worker operates
independently with no shared-memory group; see
Self::worker_group_size for the framework-wide
semantics.
Fields
burst_duration: Duration
Wall-clock duration of CPU work between idle
periods. Use Duration to keep the unit visible at
the call site, matching
WakeChain’s work_per_hop.
Default 1ms (see
crate::workload::config::defaults::IDLE_CHURN_BURST_DURATION). Short
bursts (< 1ms) maximise idle-cycle frequency.
sleep_duration: Duration
Wall-clock duration of each idle period. Lower bound
— the kernel adds timer_slack_ns (~50µs) to the
requested duration. Default 5ms (see
crate::workload::config::defaults::IDLE_CHURN_SLEEP_DURATION). Sub-1ms
values produce sched_switch transitions but skip
tick-stop / C-state entry.
precise_timing: bool
Opt-in: shrink current->timer_slack_ns from the
inherited 50µs default to 1ns at worker entry via
prctl(PR_SET_TIMERSLACK, 1). Default false so
existing callers see the inherited slack the variant
doc describes.
When true, the IdleChurn dispatch arm calls
prctl(PR_SET_TIMERSLACK, 1) once before the work
loop. The kernel’s PR_SET_TIMERSLACK arm at
kernel/sys.c:2653 sets current->timer_slack_ns = arg2 when arg2 > 0; passing 0 is a RESET to
default_timer_slack_ns (the inherited 50µs), so
1 is the smallest value that actually shrinks the
slack. After the call, hrtimer_nanosleep
(kernel/time/hrtimer.c:2331-2356) coalesces
expiries within a 1ns window instead of the default
50µs, exposing the scheduler’s true wake-resume
latency for sub-100µs sleep_duration values.
This setting is most useful when the test measures
wake-latency distributions for sub-50µs sleeps,
where the inherited slack would otherwise dominate
the observed sleep time. For sleep_duration ≥ 1ms
the slack contribution is < 5% noise and
precise_timing=true makes no observable
difference.
RT/DL workers ignore this setting. The kernel
guard at kernel/sys.c:2647
(if (rt_or_dl_task_policy(current)) break;)
makes prctl(PR_SET_TIMERSLACK, ...) a no-op for
RT/DL tasks; their slack is independently forced to
0 at sched-class entry by
kernel/sched/syscalls.c:258. Setting
precise_timing=true for an RT IdleChurn worker is
harmless but redundant.
Field defaults to false so existing
from_name("IdleChurn") callers
see the historical (inherited-slack) behaviour. Opt
in via the struct-literal form
WorkType::IdleChurn { ..., precise_timing: true }.
TimerLatency
Cyclictest-style timer-latency probe. Each worker sleeps to an ABSOLUTE
deadline via clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, next) and
records the wake latency = observed wake time − the deadline (floored at
0), accumulating next += interval (NOT now + interval) so a late wake
shows up AS latency instead of pushing the next period out — the
coordinated-omission-free measurement cyclictest(8) makes.
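The deadline bookkeeping can be illustrated without touching the clock by replaying recorded wake times. This is a sketch with a hypothetical function name; timestamps are plain nanosecond counters rather than CLOCK_MONOTONIC readings.

```rust
/// Replay wake timestamps against absolute deadlines. Each deadline is the
/// previous deadline plus `interval_ns` (never "now + interval"), and a wake
/// that lands before its deadline is floored at zero latency.
fn timer_latencies(start_ns: u64, interval_ns: u64, wake_times_ns: &[u64]) -> Vec<u64> {
    let mut next = start_ns;
    wake_times_ns
        .iter()
        .map(|&woke| {
            next += interval_ns; // advance the absolute deadline
            woke.saturating_sub(next)
        })
        .collect()
}
```

With a 1000ns interval and wakes at 1010, 2500 and 3005, the deadlines stay at 1000, 2000 and 3000, so the 500ns late wake is recorded in full and does not shift the third deadline.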
Kernel path: clock_nanosleep(TIMER_ABSTIME) →
hrtimer_nanosleep(HRTIMER_MODE_ABS) → do_nanosleep
(kernel/time/hrtimer.c): schedule() blocks the task; on expiry
hrtimer_wakeup → wake_up_process → try_to_wake_up re-runs it. The
latency is the scheduler’s wake-to-on-CPU delay for a timer-woken task —
the canonical real-time-determinism signal.
vs IdleChurn: IdleChurn does a RELATIVE
nanosleep(sleep_duration) after a CPU burst and measures resume
OVERHEAD against an Instant deadline — an idle/run duty cycle that
frees CPUs for borrowing. TimerLatency does an absolute-deadline sleep
with no CPU burst and measures the timer wake-LATENCY distribution
(timer_latency_p50/p99/p999_us + worst), the RT-determinism shape. Use
IdleChurn to free CPUs; use TimerLatency to measure wake-up jitter under
load. Here the SLEEPING task is the one woken (self-timer-wake).
Metrics: the per-cycle latency feeds the distinct timer_latencies_ns
reservoir (NOT the shared wake_latencies_ns), so a TimerLatency run’s
timer_latency_p99_us never blurs with the blocking variants’
p99_wake_latency_us. Guest-resident (an intrinsic latency probe — the
documented observer-effect exception).
worker_group_size = None (any worker count; each worker runs an
independent cyclictest loop). Pin workers to dedicated CPUs
(e.g. crate::workload::AffinityIntent) to measure per-CPU wake
jitter. The default interval is 1000µs (1kHz, cyclictest’s default); see
crate::workload::config::defaults::TIMER_LATENCY_INTERVAL_US.
Fields
interval_us: u64
Inter-wake interval in microseconds — the absolute deadline advances
by this each cycle (next += interval_us). 1000 (1kHz) matches
cyclictest’s default; smaller intervals raise the wake frequency
(and the sample count per second). Validated > 0 at spawn (a zero
interval never advances the deadline and busy-spins).
NetTraffic
AF_PACKET traffic generator that drives the virtio-net NIC’s RX
hardirq and NAPI softirq. Each worker opens an AF_PACKET /
SOCK_RAW socket bound to the single non-loopback (virtio-net)
interface, brings it administratively up, and sends self-addressed
L2 frames in a loop. Every sendto is a virtio TX kick; the v0
in-VMM-loopback backend echoes the frame straight into RX and raises
the guest’s RX-completion interrupt, so the workload generates real
per-CPU hardirq + softirq load for the scheduler to absorb.
Kernel path: sendto → packet_sendmsg → dev_queue_xmit (an
L2 inject that bypasses the IP stack) → virtio-net start_xmit →
virtqueue_notify (an MMIO QUEUE_NOTIFY that exits to the host). The
host loopback echoes TX→RX and signals the RX virtqueue; the guest
virtio-mmio ISR runs vring_interrupt → skb_recv_done →
virtqueue_napi_schedule, raising NET_RX_SOFTIRQ (drained by
virtnet_poll in softirq context). NAPI coalesces, so a tight burst
yields fewer hardirqs but sustained softirq work.
Why AF_PACKET, not IP traffic: a guest sending to its own IP is
routed to lo (RTN_LOCAL) and never reaches the NIC, raising zero
virtio IRQs. Only an AF_PACKET raw socket bound to the interface
drives a real TX kick. Requires CONFIG_PACKET=y (ktstr.kconfig)
and CAP_NET_ADMIN for the interface-up ioctl — ktstr always runs as
root, so the capability is present.
Precondition: a NIC must be attached via
#[ktstr_test(networks = [...])] with a crate::prelude::NetConfig.
With no non-loopback interface present the worker is a LOUD no-op: it
warns once and returns work_units == 0 rather than silently doing
nothing.
worker_group_size = None (any worker count; each worker drives the
shared NIC independently). Frames sent are reported as work_units /
iterations; the IRQ-side signals (rq->avg_irq, per-CPU softirq
time, /proc/interrupts) are observed separately, not by this
variant.
Fields
IrqWake
Paired sender/receiver that wakes a blocked task from NET_RX softirq
(or ksoftirqd) context — the genuine “woken from softirq” wake that
Self::TimerLatency (a hardirq hrtimer wake) and Self::NetTraffic (a
wakee-less sender) do not exercise. Each pair: one worker reuses the
Self::NetTraffic sender (self-addressed AF_PACKET / SOCK_RAW frames
on the virtio-net NIC), the other BLOCKS in recvfrom on the same socket
and records a wake-presence sample per delivered frame.
Kernel path (the wake): the sender’s frame arrives via the virtio RX
IRQ, which only schedules NAPI (vring_interrupt → skb_recv_done →
__napi_schedule); delivery + wake run in the NET_RX_SOFTIRQ handler
net_rx_action → virtnet_poll → packet_rcv → sk_data_ready =
sock_def_readable → wake_up_interruptible_sync_poll, which ttwus the
receiver blocked in __skb_wait_for_more_packets. At a low rate the softirq
runs inline (on the receiving CPU at irq_exit); at saturation
(interval_us == 0) the softirq budget is exceeded and the work — and the
wake — defers to ksoftirqd. The regime is kernel-decided by rate, not a
flag. The wake is NEVER in hardirq context (the virtio ISR only schedules
NAPI).
Precondition + no-NIC behavior: like Self::NetTraffic — needs a NIC
via #[ktstr_test(networks = [...])]; with no non-loopback interface the pair
is a LOUD no-op (warn once, work_units == 0).
What the wake reservoir means: worker_group_size = Some(2) — workers
spawn in sender/receiver pairs (even count enforced at spawn). The receiver
pushes each recvfrom block-to-return duration into the wake reservoir
(wake_latencies_ns). A NON-EMPTY reservoir is a LIVENESS signal (softirq-delivered
frames scheduled the receiver to run), NOT a precise wake-to-run
latency: the magnitude is a block-duration proxy dominated by the
inter-frame pacing wait, and when the queue never empties (interval_us == 0) a recvfrom may return an already-queued frame WITHOUT blocking, so a
sample need not correspond to a softirq wake at all. The authoritative
“softirq fired” proof is the rising IRQ-observability count
(total_softirq_net_rx, rq->avg_irq, PSI-irq), not the reservoir.
Fields
interval_us: u64
Inter-frame pause (µs) on the sender. Default 1000 (1 kHz): paces
the sender so the receiver drains its queue and genuinely blocks
between frames — each frame is then a real empty-queue block woken by
the next NET_RX softirq, giving a usable (non-degenerate) wake
reservoir. 0 sends continuously (maximum softirq load, serviced by
ksoftirqd) but the receive queue rarely empties, so recvfrom mostly
returns an already-queued frame without blocking and the wake reservoir
degenerates to near-zero block durations. A larger > 0 value lowers
the rate. Paces the sender side only; the receiver always blocks.
AluHot
Sustained high-IPC ALU workload. Each worker runs four
independent multiply chains in parallel, with
std::hint::black_box wrapping every step to prevent
the optimizer from collapsing the chain into a closed-form
expression. Distinct from SpinWait —
SpinWait issues PAUSE (std::hint::spin_loop) whose
per-iteration retire is a single fused micro-op and which
signals the front-end to back off, depressing the IPC the
scheduler observes. AluHot retires real arithmetic at
IPC ≥ 2.0 on every modern x86_64 / aarch64 core, so
scheduler decisions that respond to per-task runtime
characteristics (lavd’s lat_cri per-task
latency-criticality scoring) see a meaningfully different
signal.
The width field selects the
data-path width — see AluWidth for the resolution
rules and the AVX-512 / AMX caveats. Workers do NOT
adjust frequency or voltage state themselves; the
package-wide frequency throttle on x86_64 is a kernel-
observable effect of running AVX-512 / AMX instructions.
All widths currently run a scalar four-stream multiply
chain; the width selector is preserved on WorkerReport
so a downstream classifier can distinguish runs that
requested SIMD from runs that requested scalar, even
though the dispatch is uniform in this revision.
worker_group_size = None (any worker count is valid;
each worker runs an independent multiply chain). No
shared-memory region; no per-iteration syscall overhead.
For duty-cycle modulation (e.g. ALU 90 % / Sleep 10 %), use
WorkPhase::AluHot inside a Sequence — the
composable counterpart with a per-phase duration.
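A minimal sketch of the four-stream multiply chain described above, assuming wrapping 64-bit multiplies. The function name, seeds, and multiplier constant are illustrative assumptions, not taken from the crate:

```rust
use std::hint::black_box;

/// Hypothetical stand-in for the AluHot inner loop: four independent
/// wrapping-multiply chains advanced in lockstep.
pub fn four_stream_multiply(iters: u64) -> [u64; 4] {
    // Distinct odd seeds keep the four chains independent and non-zero.
    let mut s: [u64; 4] = [3, 5, 7, 11];
    // Odd multiplier (illustrative constant, not from the source).
    const M: u64 = 0x9E37_79B9_7F4A_7C15;
    for _ in 0..iters {
        // black_box on every step keeps the optimizer from folding a
        // chain into a closed-form power computation.
        s[0] = black_box(s[0].wrapping_mul(M));
        s[1] = black_box(s[1].wrapping_mul(M));
        s[2] = black_box(s[2].wrapping_mul(M));
        s[3] = black_box(s[3].wrapping_mul(M));
    }
    s
}
```

Because no chain reads another chain's value, an out-of-order core can overlap the four multiplies, which is what lets this kind of loop retire more than one instruction per cycle.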
Fields
SmtSiblingSpin
Tight PAUSE-spin from a paired worker, intended to be
pinned to two SMT siblings of the same physical core so
the spinning thread contends for the core’s shared
front-end / execution resources with its sibling. Distinct
from SpinWait which is a single-
position spin: SmtSiblingSpin requires
worker_group_size == 2 and
is paired with an SMT-aware affinity that pins both
workers to the two siblings of one physical core.
The framework provides
AffinityIntent::SmtSiblingPair for this purpose: the
scenario engine resolves it against the host topology
(using sysfs’s
/sys/devices/system/cpu/cpu_a/topology/thread_siblings_list
when the topology was built from sysfs) and produces a
2-CPU AffinityIntent::Exact for the spawn pipeline.
Resolving on a non-SMT host (threads_per_core == 1)
returns an explicit error rather than silently degrading.
Test authors who want exact CPU IDs (e.g. comparing
same-core vs. cross-core behaviour on a known topology)
can still hand-pick via AffinityIntent::Exact.
Without one of those affinity intents the variant
degenerates to two independent SpinWait
workers and exercises no SMT contention.
worker_group_size = Some(2) so paired workers share
the position metadata the dispatch arm uses to assert
the partner exists; the variant carries no shared-memory
region itself.
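For test authors who hand-pick CPUs via AffinityIntent::Exact, the sibling list can be read from the sysfs file named above. The following is a sketch only: parse_cpu_list and smt_sibling_pair are hypothetical helpers, not the framework's resolver, and they assume the kernel's usual cpulist text format (comma-separated ids and inclusive ranges):

```rust
/// Parse a sysfs cpulist such as "0,4" or "0-1,8-9" into CPU ids.
/// Hypothetical helper, not part of the framework.
pub fn parse_cpu_list(s: &str) -> Result<Vec<usize>, String> {
    let mut cpus: Vec<usize> = Vec::new();
    for part in s.trim().split(',').filter(|p| !p.is_empty()) {
        let parse = |t: &str| {
            t.trim()
                .parse::<usize>()
                .map_err(|e| format!("bad CPU id {t:?}: {e}"))
        };
        match part.split_once('-') {
            Some((lo, hi)) => {
                let (lo, hi) = (parse(lo)?, parse(hi)?);
                if lo > hi {
                    return Err(format!("descending range {part:?}"));
                }
                cpus.extend(lo..=hi);
            }
            None => cpus.push(parse(part)?),
        }
    }
    Ok(cpus)
}

/// Return the first two siblings, or an explicit error on a non-SMT
/// core (a list with a single CPU) instead of silently degrading.
pub fn smt_sibling_pair(list: &str) -> Result<[usize; 2], String> {
    let cpus = parse_cpu_list(list)?;
    match cpus.as_slice() {
        [a, b, ..] => Ok([*a, *b]),
        _ => Err(format!("no SMT sibling in {list:?}")),
    }
}
```

On a core with more than two hardware threads this sketch simply takes the first two entries; a real resolver may choose differently.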
IpcVariance
Per-thread alternating high-IPC / low-IPC workload. Each
worker runs hot_iters of dependent integer multiplies
(high IPC, ALU-bound) followed by cold_iters of random
cache-line touches over a working-set region (low IPC,
memory-bound), repeating the alternation
period_iters times before checking
stop_requested. The phase split
is deterministic per worker — no shared state — so two
workers iterate at offset cadences only if they are
scheduled differently.
Drives task-level runtime variance between phases: any
scheduler that estimates a task’s “bursty” or
“memory-stall” character from a windowed runtime sample
(lavd’s lat_cri per-task latency-criticality field on
task_ctx) sees this task switch character every
hot_iters + cold_iters boundary. Tests scheduler
adaptation latency: how quickly does the scheduler
re-classify the task as the phase changes?
Field semantics:
- hot_iters: number of multiply-chain steps per hot phase. Chosen to span ~tens of microseconds on a modern core; e.g. 100_000 ≈ 50µs at IPC 2.0 / 2 GHz.
- cold_iters: number of random cache-line touches per cold phase. The cold phase reads a 512KB region (LLC pressure on most desktop hosts; spills to DRAM on hosts with smaller LLCs) at random offsets.
- period_iters: hot/cold pair count per outer iteration. Higher values reduce the per-stop-check overhead but increase shutdown latency.
All three must be > 0; both the
ipc_variance constructor and
WorkloadHandle::spawn reject zeros with
WorkTypeValidationError::ZeroIpcVarianceParam.
Stop responsiveness. The hot and cold inner loops do
not poll stop. The outer
period_iters loop checks stop_requested between each
hot/cold pair, so worst-case shutdown latency is one
hot-phase + one cold-phase. Large hot_iters /
cold_iters increase the shutdown-latency floor
proportionally; pick values that keep a single phase
under the test author’s tolerance for stop lag.
iterations counter semantics. Each completed outer
loop bumps the per-worker iterations counter by ONE,
regardless of how many period_iters the inner loop
actually completed before stop_requested fired. The
counter records ENTERED outer cycles, not completed
inner periods; the per-multiply / per-touch progress
flows through work_units instead. A worker that exits
during the inner period_iters loop still bumps
iterations by 1 for that outer cycle — the
iterations += 1 at the end of the dispatch arm is
unconditional.
worker_group_size = None. No shared memory; no
per-iteration syscall.
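The loop shape can be modelled as below. This is a sketch under stated assumptions, not the crate's dispatch arm: it follows the "Stop responsiveness" and "iterations counter semantics" paragraphs (stop polled between hot/cold pairs, iterations bumped unconditionally per outer cycle), and the names run_ipc_variance and Counters, the 64-byte line size, and the xorshift generator are all hypothetical:

```rust
use std::hint::black_box;

/// Per-worker counters, mirroring the iterations / work_units split.
pub struct Counters {
    pub iterations: u64,
    pub work_units: u64,
    pub checksum: u64,
}

/// Hypothetical model of the loop; `stop_requested` stands in for the
/// worker's stop flag.
pub fn run_ipc_variance(
    hot_iters: u64,
    cold_iters: u64,
    period_iters: u64,
    region: &[u8],
    mut stop_requested: impl FnMut() -> bool,
) -> Counters {
    const LINE: usize = 64; // assumed cache-line size
    assert!(hot_iters > 0 && cold_iters > 0 && period_iters > 0);
    assert!(region.len() >= LINE);
    let lines = region.len() / LINE;
    let mut c = Counters { iterations: 0, work_units: 0, checksum: 1 };
    let mut rng: u64 = 0x2545_F491_4F6C_DD1D; // xorshift64 state, non-zero
    let mut stopped = false;
    while !stopped {
        for _ in 0..period_iters {
            // Hot phase: dependent multiplies, no stop polling inside.
            for _ in 0..hot_iters {
                c.checksum = black_box(c.checksum.wrapping_mul(0x9E37_79B9_7F4A_7C15));
            }
            // Cold phase: one byte read per randomly chosen cache line.
            for _ in 0..cold_iters {
                rng ^= rng << 13;
                rng ^= rng >> 7;
                rng ^= rng << 17;
                let idx = (rng % lines as u64) as usize * LINE;
                c.checksum = c.checksum.wrapping_add(black_box(region[idx]) as u64);
            }
            c.work_units += hot_iters + cold_iters;
            // Stop is polled only here, between hot/cold pairs.
            if stop_requested() {
                stopped = true;
                break;
            }
        }
        // Unconditional: a cycle cut short by stop still counts once.
        c.iterations += 1;
    }
    c
}
```

With period_iters = 2 and a stop that fires on the third poll, the model reports two outer iterations but three hot/cold pairs of work_units, which is why work_units rather than iterations is the progress measure.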
Fields
hot_iters: u64
Multiply-chain steps per hot phase. Must be > 0.
Larger values increase shutdown latency
proportionally — the inner hot loop does not poll
stop between steps, so a worker mid-hot-phase
finishes the phase before the outer loop sees the
stop signal.
Implementations§
impl WorkType
pub const ALL_NAMES: &'static [&'static str] = <Self as strum::VariantNames>::VARIANTS
PascalCase names for all built-in variants, matching the enum arm names.
Generated by strum::VariantNames at compile time from the
WorkType enum definition, so a new variant appears here
automatically. Includes "Sequence" and "Custom" even though
from_name cannot construct them (sequences
require explicit phases; custom requires a function pointer).
pub fn from_name(s: &str) -> Option<WorkType>
Look up a variant by PascalCase name and return it with default
parameters. Returns None for unknown names, "Sequence"
(requires explicit phases), and "Custom" (requires a function
pointer).
pub fn suggest(s: &str) -> Option<&'static str>
Case-insensitive lookup that returns the canonical PascalCase
entry from ALL_NAMES matching the input,
or None when no entry matches.
Distinct from from_name in two ways:
- It matches case-insensitively, so "spinwait" / "SPINWAIT" / "SpinWait" all map to the same canonical "SpinWait".
- It returns the name string rather than a default-parameter WorkType value, so callers can quote the canonical spelling in error messages without also instantiating the variant.
Intended as a CLI / config-parser helper: when from_name
returns None for the user’s input, pass the same string
here to recover the canonical spelling (if any) for a
friendlier “did you mean SpinWait?” diagnostic. Includes
"Sequence" and "Custom" in the match space even though
from_name refuses to construct them — the point of
suggest is naming, not construction.
Whitespace handling: the match uses eq_ignore_ascii_case
without trimming, so surrounding whitespace in s
(" SpinWait", "SpinWait\n") suppresses a match. Callers
that accept user input with possible surrounding whitespace
must s.trim() before calling — the same convention
from_name follows. Keeping the predicate strict here
avoids confusing “suggested canonical spelling” reports for
inputs that were already nearly correct save for stray
whitespace the caller should have already normalized.
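The lookup described above can be sketched as follows. The ALL_NAMES array here is a shortened, hypothetical stand-in for the strum-generated list, and the body is an assumption about how such a helper could be written, not the crate's source:

```rust
/// Shortened stand-in for the generated name list (assumption).
const ALL_NAMES: &[&str] = &["SpinWait", "YieldHeavy", "Mixed", "Sequence", "Custom"];

/// Case-insensitive match without trimming; returns the canonical
/// PascalCase spelling when one exists.
pub fn suggest(s: &str) -> Option<&'static str> {
    ALL_NAMES
        .iter()
        .copied()
        .find(|name| name.eq_ignore_ascii_case(s))
}

/// Hypothetical CLI helper: trim first, then build a diagnostic.
pub fn did_you_mean(input: &str) -> Option<String> {
    suggest(input.trim()).map(|name| format!("did you mean {name}?"))
}
```

Trimming lives in the caller (did_you_mean), so suggest itself keeps the strict predicate.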
pub fn worker_group_size(&self) -> Option<usize>
Worker group size for this work type, or None if ungrouped.
num_workers must be divisible by this value. Paired types return 2,
fan-out returns fan_out + 1 (1 messenger + N receivers), and
MutexContention returns contenders.
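A small sketch of the divisibility rule, assuming a spawn-side check of roughly this shape. check_worker_count and fan_out_group_size are hypothetical names, not the crate's API:

```rust
/// Group size for a fan-out work type: 1 messenger + N receivers.
pub const fn fan_out_group_size(fan_out: usize) -> usize {
    fan_out + 1
}

/// Hypothetical spawn-side check: ungrouped types accept any count,
/// grouped types require num_workers to be a multiple of the group.
pub fn check_worker_count(num_workers: usize, group_size: Option<usize>) -> Result<(), String> {
    match group_size {
        None => Ok(()),
        Some(0) => Err("group size must be non-zero".to_string()),
        Some(g) if num_workers % g == 0 => Ok(()),
        Some(g) => Err(format!("num_workers {num_workers} is not divisible by group size {g}")),
    }
}
```

For example, a fan-out of 4 needs worker counts of 5, 10, 15, and so on, while a paired type needs an even count.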
Whether this work type needs a pre-fork shared memory region (MAP_SHARED mmap).
RtStarvation opts in even though its body never reads or
writes the futex word: the spawn-side (futex_ptr, pos)
tuple is the only mechanism that hands the worker its
per-position index, which RtStarvation consumes to
classify itself as RT or CFS. Allocating a single 4-byte
MAP_SHARED region per group is the cheapest way to get
pos plumbed through worker_main without a wider dispatch
contract change. IrqWake opts in for the same reason:
pos == 0 is the frame sender, pos == 1 the receiver that
blocks in recvfrom — neither touches the futex word.
pub fn chain_pipe_depth(&self) -> Option<usize>
Number of pipes per chain that the spawn-side must allocate
for this work type, or None when no per-stage pipe ring is
needed. The returned depth matches the variant’s depth
field for WakeChain { wake: WakeMechanism::Pipe, .. };
every other variant (and WakeChain with
wake: WakeMechanism::Futex) returns None.
When this returns Some(depth), the spawn-side allocates
depth pipes per chain so stage i holds
pipe[i].write_end (to wake stage i + 1) and
pipe[(i + depth - 1) % depth].read_end (predecessor’s
wake). WakeMechanism::Futex keeps the existing futex-word
ring and returns None.
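The ring indexing above can be written out as a small sketch. StageEnds and stage_pipe_indices are hypothetical names used only to illustrate which pipe each stage writes to and reads from:

```rust
/// Indices of the pipes a stage uses (illustrative type).
#[derive(Debug, PartialEq, Eq)]
pub struct StageEnds {
    /// Pipe whose write end this stage holds (wakes stage i + 1).
    pub write_pipe: usize,
    /// Pipe whose read end this stage holds (the predecessor's wake).
    pub read_pipe: usize,
}

/// Stage `i` writes pipe[i] and reads pipe[(i + depth - 1) % depth].
pub fn stage_pipe_indices(stage: usize, depth: usize) -> StageEnds {
    assert!(depth >= 2, "WakeChain requires depth >= 2");
    assert!(stage < depth);
    StageEnds {
        write_pipe: stage,
        read_pipe: (stage + depth - 1) % depth,
    }
}
```

Adding depth before subtracting 1 keeps the arithmetic in unsigned range for stage 0, which wraps around to read the last pipe and closes the ring.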
pub fn needs_cache_buf(&self) -> bool
Whether this work type allocates a per-worker cache buffer post-fork.
pub const fn bursty(burst_duration: Duration, sleep_duration: Duration) -> Self
Bursty work: CPU burst for burst_duration, sleep for
sleep_duration, repeat.
Validation fires at spawn time, not construction time; see
WorkType::Bursty variant doc for preconditions.
pub const fn pipe_io(burst_iters: u64) -> Self
Paired pipe I/O with CPU burst between exchanges.
Validation fires at spawn time, not construction time; see
WorkType::PipeIo variant doc for preconditions.
pub const fn futex_ping_pong(spin_iters: u64) -> Self
Paired futex ping-pong with CPU spin between wakes.
Validation fires at spawn time, not construction time; see
WorkType::FutexPingPong variant doc for preconditions.
pub const fn cache_pressure(size_kib: usize, stride: usize) -> Self
Strided read-modify-write over a size_kib KiB buffer.
Validation fires at spawn time, not construction time; see
WorkType::CachePressure variant doc for preconditions.
pub const fn cache_yield(size_kib: usize, stride: usize) -> Self
Cache pressure burst followed by sched_yield().
Validation fires at spawn time, not construction time; see
WorkType::CacheYield variant doc for preconditions.
pub const fn cache_pipe(size_kib: usize, burst_iters: u64) -> Self
Cache pressure burst then pipe exchange with a partner worker.
Validation fires at spawn time, not construction time; see
WorkType::CachePipe variant doc for preconditions.
pub const fn futex_fan_out(fan_out: usize, spin_iters: u64) -> Self
1:N fan-out wake pattern with CPU spin between wakes.
Validation fires at spawn time, not construction time; see
WorkType::FutexFanOut variant doc for preconditions.
pub const fn affinity_churn(spin_iters: u64) -> Self
Rapid self-directed affinity changes with spin_iters CPU work between.
Validation fires at spawn time, not construction time; see
WorkType::AffinityChurn variant doc for preconditions.
pub const fn cross_affinity_churn(spin_iters: u64) -> Self
Rapid cross-task affinity churn: each worker rewrites its
siblings’ affinity. Must run in a dedicated cgroup; see the
WorkType::CrossAffinityChurn variant doc for preconditions.
pub const fn policy_churn(spin_iters: u64) -> Self
Cycle scheduling policies with spin_iters CPU work between switches.
Validation fires at spawn time, not construction time; see
WorkType::PolicyChurn variant doc for preconditions.
pub const fn fan_out_compute( fan_out: usize, cache_footprint_kib: usize, operations: usize, sleep_usec: u64, ) -> Self
Messenger/worker fan-out with compute work using the given parameters.
fan_out is passed to futex_wake(ptr, N) where N: i32 is
the number of waiters to wake. Realistic values are tens of
workers; sched-test topologies that need more than i32::MAX
(~2.1B) receivers per messenger are not expressible.
The worker’s futex_wake call site clamps the cast via
clamp_futex_wake_n (worker/mod.rs) so a pathological
usize input wakes at most i32::MAX waiters instead of
wrapping to a negative N (FUTEX_WAKE broadcasts when passed a
negative N on some kernels, which would wake every waiter on
the futex rather than just this messenger’s receivers).
Validation fires at spawn time, not construction time; see
WorkType::FanOutCompute variant doc for preconditions.
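The clamping step can be illustrated as below. The real clamp_futex_wake_n lives in worker/mod.rs and is not shown on this page, so this body is an assumed reimplementation of the behaviour described above:

```rust
/// Assumed reimplementation: saturate a usize waiter count to the
/// i32 range that FUTEX_WAKE accepts, instead of letting an `as`
/// cast wrap to a negative value.
pub const fn clamp_futex_wake_n(n: usize) -> i32 {
    if n > i32::MAX as usize {
        i32::MAX
    } else {
        n as i32
    }
}
```

A bare cast shows the hazard the clamp avoids: usize::MAX as i32 evaluates to -1, whereas the clamped value stays at i32::MAX.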
pub const fn schbench(config: SchbenchConfig) -> Self
The schbench_rs workload (schbench’s default mode), configured by
config. Build the config with SchbenchConfig::default plus its
chainable setters, e.g.
WorkType::schbench(SchbenchConfig::default().message_threads(2)). Use
with a single ktstr worker; see the WorkType::Schbench variant doc.
pub const fn taobench(config: TaobenchConfig) -> Self
The taobench_rs workload (a bounded, evicting key-value cache),
configured by config. Build the config with TaobenchConfig::default
plus its chainable setters, e.g.
WorkType::taobench(TaobenchConfig::default().target_hit_pct(95)). Use
with a single ktstr worker; see the WorkType::Taobench variant doc.
pub const fn page_fault_churn( region_kib: usize, touches_per_cycle: usize, spin_iters: u64, ) -> Self
Rapid page fault cycling with spin_iters CPU work between cycles.
Validation fires at spawn time, not construction time; see
WorkType::PageFaultChurn variant doc for preconditions.
pub const fn mutex_contention( contenders: usize, hold_iters: u64, work_iters: u64, ) -> Self
N-way futex mutex contention with contenders workers per group.
Validation fires at spawn time, not construction time; see
WorkType::MutexContention variant doc for preconditions.
pub const fn thundering_herd( waiters: usize, batches: u64, inter_batch_ms: u64, ) -> Self
One waker, N waiters on a single global futex; broadcasts via
FUTEX_WAKE per batch. Pairs with
WorkType::ThunderingHerd.
Validation fires at spawn time, not construction time; see
WorkType::ThunderingHerd variant doc for preconditions.
pub const fn priority_inversion( high_count: usize, medium_count: usize, low_count: usize, hold_iters: u64, work_iters: u64, pi_mode: FutexLockMode, ) -> Self
Three priority tiers contending for one shared lock. See
WorkType::PriorityInversion for behavior; pass
FutexLockMode::Pi to invoke FUTEX_LOCK_PI or
FutexLockMode::Plain for a non-PI futex.
Validation fires at spawn time, not construction time; see
WorkType::PriorityInversion variant doc for preconditions.
pub const fn producer_consumer_imbalance( producers: usize, consumers: usize, produce_rate_hz: u64, consume_iters: u64, queue_depth_target: u64, ) -> Self
Producer/consumer pipeline with deliberately unbalanced
rates. See WorkType::ProducerConsumerImbalance.
Validation fires at spawn time, not construction time; see
WorkType::ProducerConsumerImbalance variant doc for preconditions.
pub const fn rt_starvation( rt_workers: usize, cfs_workers: usize, rt_priority: i32, burst_iters: u64, ) -> Self
rt_workers SCHED_FIFO workers vs. cfs_workers SCHED_NORMAL
workers competing on the same CPU set. See
WorkType::RtStarvation.
Validation fires at spawn time, not construction time; see
WorkType::RtStarvation variant doc for preconditions.
pub const fn asymmetric_waker( waker_class: SchedClass, wakee_class: SchedClass, burst_iters: u64, ) -> Self
Paired workers in mismatched scheduling classes. See
WorkType::AsymmetricWaker.
Validation fires at spawn time, not construction time; see
WorkType::AsymmetricWaker variant doc for preconditions.
pub const fn wake_chain( depth: usize, wake: WakeMechanism, work_per_hop: Duration, ) -> Self
Pipeline of waker-wakee hops with optional WF_SYNC. See
WorkType::WakeChain.
Validation fires at spawn time, not construction time; see
WorkType::WakeChain variant doc for preconditions
(depth >= 2, num_workers divisible by depth, etc.).
pub fn numa_working_set_sweep( region_kib: usize, sweep_period_ms: u64, target_nodes: impl IntoIterator<Item = usize>, ) -> Self
NUMA working-set sweep with periodic mbind rotation. See
WorkType::NumaWorkingSetSweep. target_nodes accepts
any IntoIterator<Item = usize> for ergonomic call sites
([0, 1, 2], 0..node_count, BTreeSet, etc.).
Validation fires at spawn time, not construction time; see
WorkType::NumaWorkingSetSweep variant doc for preconditions.
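A sketch of the call-site ergonomics, using a local stand-in type. NumaSweep and this free function are hypothetical; only the impl IntoIterator<Item = usize> parameter shape is taken from the signature above:

```rust
/// Local stand-in for the variant's payload (assumption).
#[derive(Debug, PartialEq, Eq)]
pub struct NumaSweep {
    pub region_kib: usize,
    pub sweep_period_ms: u64,
    pub target_nodes: Vec<usize>,
}

/// Accept any iterable of node ids and collect it into the owned Vec.
pub fn numa_working_set_sweep(
    region_kib: usize,
    sweep_period_ms: u64,
    target_nodes: impl IntoIterator<Item = usize>,
) -> NumaSweep {
    NumaSweep {
        region_kib,
        sweep_period_ms,
        target_nodes: target_nodes.into_iter().collect(),
    }
}
```

An array, a range such as 0..node_count, and a BTreeSet all satisfy the bound; a BTreeSet yields its nodes in ascending order, so the rotation order then follows node id rather than insertion order.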
pub fn sequence( first: WorkPhase, rest: impl IntoIterator<Item = WorkPhase>, ) -> Self
Construct a WorkType::Sequence from a head phase and an
iterator of follow-on phases.
The Sequence variant cannot use from_name
because phases require explicit construction; this constructor
is the only typed entry point. Accepts any IntoIterator<Item = WorkPhase> for rest so callers can pass arrays, Vec, or
builder-style chains.
Validation fires at spawn time, not construction time; see
WorkType::Sequence variant doc for preconditions.
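The head-plus-rest shape guarantees at least one phase by construction. The sketch below models that with local types; Phase, its two variants, and the helper methods are hypothetical and much simpler than the real WorkPhase:

```rust
use std::time::Duration;

/// Simplified stand-in for WorkPhase (assumption).
#[derive(Debug, Clone, PartialEq)]
pub enum Phase {
    AluHot(Duration),
    Sleep(Duration),
}

pub struct Sequence {
    pub first: Phase,
    pub rest: Vec<Phase>,
}

impl Sequence {
    /// `first` is mandatory, so an empty sequence cannot be expressed.
    pub fn new(first: Phase, rest: impl IntoIterator<Item = Phase>) -> Self {
        Sequence { first, rest: rest.into_iter().collect() }
    }

    /// Iterate the head followed by the remaining phases.
    pub fn phases(&self) -> impl Iterator<Item = &Phase> {
        std::iter::once(&self.first).chain(self.rest.iter())
    }

    /// Total duration of one pass over all phases.
    pub fn cycle_time(&self) -> Duration {
        self.phases()
            .map(|p| match p {
                Phase::AluHot(d) | Phase::Sleep(d) => *d,
            })
            .sum()
    }
}
```

A 90 ms hot phase followed by a 10 ms sleep gives a 100 ms cycle at a 90 % duty cycle.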
pub const fn cgroup_churn(groups: usize, cycle_ms: u64) -> Self
Construct a WorkType::CgroupChurn.
Validation fires at spawn time, not construction time; see
WorkType::CgroupChurn variant doc for preconditions.
pub fn cgroup_attach_storm(dest: impl Into<String>, reap: ReapMode) -> Self
Construct a WorkType::CgroupAttachStorm.
dest is the sibling cgroup name each forked child is migrated
into; it must already exist at run time (e.g. created via
Op::add_cgroup). Non-const
because Into::<String>::into allocates (mirrors
custom). Validation fires at spawn time, not
construction time; see the WorkType::CgroupAttachStorm variant
doc for preconditions.
pub const fn signal_storm(signals_per_iter: u64, work_iters: u64) -> Self
Construct a WorkType::SignalStorm.
Validation fires at spawn time, not construction time; see
WorkType::SignalStorm variant doc for preconditions.
pub const fn preempt_storm( cfs_workers: usize, rt_burst_iters: u64, rt_sleep_us: u64, ) -> Self
Construct a WorkType::PreemptStorm.
Validation fires at spawn time, not construction time; see
WorkType::PreemptStorm variant doc for preconditions.
pub const fn epoll_storm( producers: usize, consumers: usize, events_per_burst: u64, ) -> Self
Construct a WorkType::EpollStorm.
Validation fires at spawn time, not construction time; see
WorkType::EpollStorm variant doc for preconditions.
pub const fn numa_migration_churn(period_ms: u64) -> Self
Construct a WorkType::NumaMigrationChurn.
Validation fires at spawn time, not construction time; see
WorkType::NumaMigrationChurn variant doc for preconditions.
pub const fn idle_churn( burst_duration: Duration, sleep_duration: Duration, ) -> Self
Construct a WorkType::IdleChurn with the default
precise_timing = false.
§Spawn-time precondition
burst_duration and sleep_duration must both be
strictly greater than Duration::ZERO. The constructor
itself accepts any value (no early validation); the
rejection fires at WorkloadHandle::spawn time with an
actionable bail message naming the offending field. See
WorkType::IdleChurn variant doc for the rationale and
the kernel-source citation.
§precise_timing
This constructor sets precise_timing to
defaults::IDLE_CHURN_PRECISE_TIMING (false),
preserving the inherited current->timer_slack_ns
(~50µs default). To opt into 1ns timer slack, build the
variant directly via the struct-literal form:
WorkType::IdleChurn { burst_duration, sleep_duration, precise_timing: true }. See the variant’s
precise_timing field doc for the kernel-side
mechanism.
pub const fn alu_hot(width: AluWidth) -> Self
Construct a WorkType::AluHot at the given execution
width.
AluWidth::Widest resolves to the widest data-path the
host supports at worker entry. See AluWidth for the
per-variant data-path width and the runtime resolution
rules.
Validation fires at spawn time, not construction time;
see WorkType::AluHot variant doc for preconditions.
pub const fn timer_latency(interval_us: u64) -> Self
Construct a WorkType::TimerLatency with an inter-cycle interval
(interval_us, the cyclictest -i analogue). interval_us == 0 is
rejected at spawn (validate_workload_admission); pass a non-zero
period. See the WorkType::TimerLatency variant doc.
pub const fn net_traffic(interval_us: u64, frame_bytes: u16) -> Self
Construct a WorkType::NetTraffic with an inter-frame interval
(interval_us; 0 = continuous burst) and Ethernet frame_bytes.
frame_bytes is validated to [60, 1514] at spawn, not here; see the
WorkType::NetTraffic variant doc.
pub const fn irq_wake(interval_us: u64, frame_bytes: u16) -> Self
Construct a WorkType::IrqWake — a paired sender/receiver where the
receiver blocks in recvfrom and is woken from NET_RX softirq context.
interval_us paces the sender (default 1000 µs gives the receiver a clean
empty-queue block per frame; 0 maximizes softirq load into ksoftirqd
but degenerates the wake reservoir); frame_bytes is validated to
[60, 1514] at spawn. Spawn an even worker count (group size 2). See the
WorkType::IrqWake variant doc.
pub const fn ipc_variance( hot_iters: u64, cold_iters: u64, period_iters: u64, ) -> Result<Self, WorkTypeValidationError>
Construct a WorkType::IpcVariance with explicit hot,
cold, and period iteration counts.
Returns WorkTypeValidationError::ZeroIpcVarianceParam
when any of hot_iters, cold_iters, or period_iters
is 0. Construction-time validation matches the
spawn-time check so callers get immediate feedback at
the call site rather than discovering the rejection
only at WorkloadHandle::spawn time.
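A fallible const constructor of this kind can be sketched as follows. The IpcVariance struct and ValidationError enum are local stand-ins for the variant and for WorkTypeValidationError; only the zero-rejection rule is taken from the text above:

```rust
/// Stand-in for WorkTypeValidationError (assumption).
#[derive(Debug, PartialEq, Eq)]
pub enum ValidationError {
    ZeroIpcVarianceParam,
}

/// Stand-in for the IpcVariance variant payload (assumption).
#[derive(Debug, PartialEq, Eq)]
pub struct IpcVariance {
    pub hot_iters: u64,
    pub cold_iters: u64,
    pub period_iters: u64,
}

/// Reject any zero parameter at construction time.
pub const fn ipc_variance(
    hot_iters: u64,
    cold_iters: u64,
    period_iters: u64,
) -> Result<IpcVariance, ValidationError> {
    if hot_iters == 0 || cold_iters == 0 || period_iters == 0 {
        return Err(ValidationError::ZeroIpcVarianceParam);
    }
    Ok(IpcVariance { hot_iters, cold_iters, period_iters })
}

/// Because the constructor is const, it can be evaluated at compile time.
pub const EXAMPLE: Result<IpcVariance, ValidationError> = ipc_variance(100_000, 1_000, 4);
```

Callers that build the work type from literals can therefore surface a bad parameter in a const or static initializer rather than at spawn.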
pub fn custom( name: impl Into<String>, run: fn(&WorkerCtx<'_>) -> WorkerReport, ) -> Self
User-supplied work function with a display name.
run receives a WorkerCtx exposing the worker’s stop flag
(flipped per-mode: the SIGUSR1 handler for CloneMode::Fork, a
per-worker AtomicBool for CloneMode::Thread) plus its
effective cpuset, cgroup-sibling pids, own cgroup-v2 dir, and a
default-zero CustomCfg payload (use Self::custom_with to pass
a non-default cfg), and must return a WorkerReport when the stop
flag becomes true. The
framework handles fork / thread spawn, cgroup placement,
affinity, scheduling policy, and signal setup (Fork mode only);
run owns only the work loop.
The per-iteration built-in instrumentation (wake-latency samples,
iter_slot publish, gap tracking) runs only for built-in variants
and is bypassed for Custom. See the Custom
variant doc for the full telemetry contract and what run must
populate on WorkerReport to keep downstream assertions honest.
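The contract for run (loop until the stop flag is set, then return a report) can be sketched with reduced local types. WorkerCtx and WorkerReport below are hypothetical stand-ins that carry only a stop flag and two counters; the real types expose more, and run_custom merely imitates the framework side:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Reduced stand-in for the framework's WorkerCtx (assumption).
pub struct WorkerCtx<'a> {
    pub stop: &'a AtomicBool,
}

impl WorkerCtx<'_> {
    pub fn stop_requested(&self) -> bool {
        self.stop.load(Ordering::Relaxed)
    }
}

/// Reduced stand-in for WorkerReport (assumption).
#[derive(Debug, Default, PartialEq, Eq)]
pub struct WorkerReport {
    pub work_units: u64,
    pub iterations: u64,
}

/// A work function with the fn-pointer shape `custom` expects:
/// it owns only the work loop and fills in its own counters.
pub fn count_until_stop(ctx: &WorkerCtx<'_>) -> WorkerReport {
    let mut report = WorkerReport::default();
    while !ctx.stop_requested() {
        report.work_units = report.work_units.wrapping_add(std::hint::black_box(1));
        report.iterations += 1;
    }
    report
}

/// Imitation of the framework side: build a context and call `run`.
pub fn run_custom(run: fn(&WorkerCtx<'_>) -> WorkerReport, stop: &AtomicBool) -> WorkerReport {
    run(&WorkerCtx { stop })
}
```

Since the built-in instrumentation is bypassed for Custom, the counters in the returned report are whatever the work function recorded itself.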
pub fn custom_with( name: impl Into<String>, run: fn(&WorkerCtx<'_>) -> WorkerReport, cfg: CustomCfg, ) -> Self
Like Self::custom but carries a CustomCfg payload, surfaced
to the closure via WorkerCtx::cfg. The payload is Copy POD,
inherited byte-faithfully across fork, so a Custom worker reads its
per-worker config from ctx.cfg() instead of a static / global —
see tests/preempt_regression.rs for the shared-futex pattern.
Trait Implementations§
impl<'de> Deserialize<'de> for WorkType
fn deserialize<__D>(__deserializer: __D) -> Result<Self, __D::Error>
where __D: Deserializer<'de>,
impl StructuralPartialEq for WorkType
Auto Trait Implementations§
impl Freeze for WorkType
impl RefUnwindSafe for WorkType
impl Send for WorkType
impl Sync for WorkType
impl Unpin for WorkType
impl UnwindSafe for WorkType
Blanket Implementations§
impl<T> BorrowMut<T> for T
where T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> CloneToUninit for T
where T: Clone,
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.