Zero to ktstr
This tutorial walks through writing a complete #[ktstr_test] from
scratch. By the end you’ll have a scheduler test that runs two
cgroups with different lifecycle patterns across a multi-LLC
topology, asserts fairness, throughput parity, and cpuset isolation
— and you’ll have broken it on purpose once, so real failures look
familiar.
Already have a scheduler binary? This tutorial teaches ktstr from the ground up. If you have an existing
scx_X you want to test, jump to one of the targeted recipes instead: test-new-scheduler.md (5 minutes, validates basic behavior), ab-compare.md (compare two scheduler builds), or diagnose-slow-scheduler.md (debug performance regressions).
What you’ll build
A test named mixed_workloads that:
- Runs two cgroups on separate LLCs:
  - background_spinner: a persistent CPU-bound load that runs for the entire test duration.
  - phased_worker: a worker that loops through explicit Spin → Yield → Spin → Yield … phases via WorkType::Sequence.
- Targets a 2-LLC, 4-core topology so the scheduler has a real cache boundary to respect.
- Sets an explicit test duration.
- Asserts fairness (per-cgroup spread), throughput parity (CV across workers + minimum rate), and cpuset isolation (workers stay on their assigned CPUs).
- Fails once, deliberately, so you learn the failure output.
- Captures a snapshot of the scheduler’s BPF state after the workload.
The complete test is at the end of this page.
Prerequisites
Getting Started covers the toolchain, KVM
access, the dev-dependency, and building a bootable kernel
(Build a kernel). With those in
place, create a file under your crate’s tests/ directory (e.g.
tests/mixed_workloads.rs) and follow along.
Step 1: The skeleton
Every #[ktstr_test] is a Rust function that takes &Ctx and
returns Result<AssertResult>. Start with an empty body that passes
unconditionally:
use ktstr::prelude::*;

#[ktstr_test(llcs = 1, cores = 2, threads = 1)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    let _ = ctx;
    Ok(AssertResult::pass())
}
let _ = ctx; keeps the unused-variable lint quiet at the skeleton
stage; Step 2 onward uses ctx.
Try it. Once this file compiles, run just this test with
cargo ktstr test --kernel 7.0 -- -E 'test(mixed_workloads)'. A bare-skeleton test passes immediately — the rest of the tutorial adds the workload and assertions on top.
use ktstr::prelude::*; brings in every type the test body needs —
Ctx, AssertResult, CgroupDef, WorkType, CpusetSpec,
execute_defs, and the Result alias from anyhow. The
#[ktstr_test] attribute registers the function so cargo ktstr test discovers it and boots a VM with the requested topology.
A test without a scheduler = … attribute runs under the kernel’s
default EEVDF scheduler — a useful baseline (see
Overview). Step 2 swaps in a sched_ext scheduler so
the rest of the tutorial exercises that scheduler instead.
For the full attribute reference, see The #[ktstr_test] Attribute.
Step 2: Define your scheduler
To target a sched_ext scheduler, declare it with
declare_scheduler! and reference the generated const from
#[ktstr_test(scheduler = …)]. The example uses scx-ktstr, the
test-fixture scheduler shipped in the ktstr workspace; substitute
your own binary name to target a different scheduler.
use ktstr::prelude::*;

declare_scheduler!(KTSTR_SCHED, {
    name = "ktstr_sched",
    binary = "scx-ktstr",
});

#[ktstr_test(scheduler = KTSTR_SCHED, llcs = 1, cores = 2, threads = 1)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    let _ = ctx;
    Ok(AssertResult::pass())
}
declare_scheduler! emits a pub static KTSTR_SCHED: Scheduler and
registers it so the verifier sweep
discovers it automatically. The scheduler = slot expects the bare
const name. The fields used here:
- name: scheduler name for display and result files.
- binary: binary name, resolved on the host from target/{debug,release}/, the directory containing the test binary, or a KTSTR_SCHEDULER override path. The resolved binary is packed into the VM’s initramfs.
Other commonly used fields: topology = (numa, llcs, cores, threads) sets a default VM topology that per-test attributes can
override; sched_args = ["--flag"] prepends CLI args to every test
using this scheduler; kernels = [...] lists kernel specs for the
verifier sweep. For the full surface (sysctls, kargs,
config_file, gauntlet constraints, scheduler-level assertion
overrides) and the manual-builder path for programmatic composition,
see Scheduler Definitions.
Step 3: Add workloads
A CgroupDef declares a cgroup along with the workers that will run
inside it. The builder methods configure worker count, the work each
worker performs, scheduling policy, and cpuset assignment.
Add two cgroups — both running tight CPU spinners for now. Step 5 will swap one of them for a phased workload:
use ktstr::prelude::*;

#[ktstr_test(scheduler = KTSTR_SCHED, llcs = 1, cores = 2, threads = 1)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        CgroupDef::named("background_spinner")
            .workers(2)
            .work_type(WorkType::SpinWait),
        CgroupDef::named("phased_worker")
            .workers(2)
            .work_type(WorkType::SpinWait),
    ])
}
Without .cpuset(...), a cgroup’s workers run on every CPU in the
test’s topology — they share the VM’s full CPU set with all other
cgroups. .cpuset(CpusetSpec::Llc(idx)) (introduced in Step 4)
restricts a cgroup to one LLC’s CPUs.
WorkType::SpinWait runs a tight CPU spin loop; it is one of many
work primitives, each targeting a different kernel scheduling path —
see Work Types for the full set and how to
choose one.
execute_defs runs each cgroup concurrently for the test’s full
duration. Use execute_steps when you need to add cgroups mid-run
or swap cpusets between phases — see
Ops, Steps, and Backdrop.
Step 4: Set topology
The #[ktstr_test] attribute carries the VM’s CPU topology.
Dimensions are big-to-little: numa_nodes (default 1), llcs
(total across all NUMA nodes), cores per LLC, and threads per
core. Total CPU count is llcs * cores * threads.
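The CPU-count arithmetic can be checked with a small standalone sketch. The `Topology` struct and `total_cpus` method below are hypothetical names used only for illustration, not ktstr types; the fields simply mirror the attribute names. `numa_nodes` is left out because `llcs` is already the total across all nodes.

```rust
/// Hypothetical mirror of the #[ktstr_test] topology dimensions.
/// This is NOT a ktstr type; it only illustrates the arithmetic.
#[derive(Debug, Clone, Copy)]
struct Topology {
    llcs: u32,    // total LLCs across all NUMA nodes
    cores: u32,   // cores per LLC
    threads: u32, // threads per core
}

impl Topology {
    /// Total CPU count: llcs * cores * threads.
    fn total_cpus(&self) -> u32 {
        self.llcs * self.cores * self.threads
    }
}
```

Under this sketch, `llcs = 1, cores = 2, threads = 1` yields 2 CPUs, and `llcs = 2, cores = 2, threads = 1` yields 4.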
LLC count matters because the last-level cache is the primary
scheduling boundary — tasks sharing an LLC benefit from shared
cache lines, while cross-LLC migration carries a cold-cache penalty.
A scheduler that ignores LLC topology will look fine on llcs = 1
and start failing as soon as there is a real cache boundary to
respect.
Bump the topology to two LLCs with two cores each (4 CPUs total) so each cgroup can own its own LLC:
#[ktstr_test(scheduler = KTSTR_SCHED, llcs = 2, cores = 2, threads = 1)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        CgroupDef::named("background_spinner")
            .workers(2)
            .work_type(WorkType::SpinWait)
            .cpuset(CpusetSpec::Llc(0)),
        CgroupDef::named("phased_worker")
            .workers(2)
            .work_type(WorkType::SpinWait)
            .cpuset(CpusetSpec::Llc(1)),
    ])
}
CpusetSpec::Llc(idx) confines a cgroup to the CPUs that belong to
LLC idx. Other variants (Numa, Range, Disjoint, Overlap,
Exact) cover NUMA-node binding, fractional partitioning, and
hand-built CPU sets — see Topology.
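To make "confined to the CPUs of LLC idx" concrete, here is a minimal standalone sketch. It assumes CPUs are numbered contiguously, LLC by LLC, which is an assumption for illustration only; the real mapping comes from the VM topology. `llc_cpus` and `stays_in_cpuset` are hypothetical helpers, not ktstr functions.

```rust
use std::ops::Range;

/// CPUs belonging to LLC `idx`, ASSUMING contiguous numbering
/// (LLC 0 owns CPUs 0..n, LLC 1 owns n..2n, and so on).
fn llc_cpus(idx: u32, cores_per_llc: u32, threads_per_core: u32) -> Range<u32> {
    let per_llc = cores_per_llc * threads_per_core;
    idx * per_llc..(idx + 1) * per_llc
}

/// True when every CPU a worker was observed on lies inside its cpuset.
fn stays_in_cpuset(observed: &[u32], allowed: &Range<u32>) -> bool {
    observed.iter().all(|cpu| allowed.contains(cpu))
}
```

With two cores per LLC and one thread per core, this sketch gives LLC 0 the CPUs 0 and 1, and LLC 1 the CPUs 2 and 3; a worker seen on CPU 1 while assigned to LLC 1 would count as having left its cpuset.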
Step 5: Compose phased work inside a cgroup
So far both cgroups run identical CPU spinners. The point of this
test is to exercise a scheduler against different lifecycle
patterns at once, so swap phased_worker for a worker that loops
through explicit phases.
WorkType::Sequence runs each phase for its specified duration and
then advances to the next; when the last phase ends the loop
restarts. Phases: WorkPhase::Spin(Duration),
WorkPhase::Sleep(Duration), WorkPhase::Yield(Duration),
WorkPhase::Io(Duration), and WorkPhase::AluHot { .. }. Use the
WorkType::sequence(first, rest) constructor. Only
std::time::Duration needs an extra use line:
use std::time::Duration;
use ktstr::prelude::*;

#[ktstr_test(scheduler = KTSTR_SCHED, llcs = 2, cores = 2, threads = 1)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        // Persistent CPU pressure on LLC 0 for the whole run.
        CgroupDef::named("background_spinner")
            .workers(2)
            .work_type(WorkType::SpinWait)
            .cpuset(CpusetSpec::Llc(0)),
        // Phased worker on LLC 1: spin 100 ms, yield for 20 ms,
        // then loop. Stresses the scheduler's wake-after-yield
        // placement repeatedly while the LLC-0 spinner keeps
        // runqueue pressure constant.
        CgroupDef::named("phased_worker")
            .workers(2)
            .work_type(WorkType::sequence(
                WorkPhase::Spin(Duration::from_millis(100)),
                [WorkPhase::Yield(Duration::from_millis(20))],
            ))
            .cpuset(CpusetSpec::Llc(1)),
    ])
}
The two cgroups now exercise distinct paths concurrently:
background_spinner keeps two CPUs continuously busy on LLC 0,
while phased_worker alternates between burning CPU and yielding on
LLC 1, exercising voluntary preemption and wakeup placement.
Both cgroups still run for the entire scenario duration: the phasing
happens within each phased_worker worker’s loop. To express
phasing across cgroups (e.g. add phased_worker only for the
second half of the run), use execute_steps with multiple Step
entries — see Ops, Steps, and Backdrop.
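The looping behavior described for WorkType::Sequence can be modeled with a small standalone sketch. This is an idealized model under stated assumptions (phases switch exactly on their boundaries, no scheduling delay), not ktstr's implementation; `Phase`, `cycle_len`, `phase_at`, and `full_cycles` are hypothetical names.

```rust
use std::time::Duration;

/// Hypothetical stand-in for two of the phase kinds named above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Phase {
    Spin,
    Yield,
}

/// Length of one pass through all phases.
fn cycle_len(phases: &[(Phase, Duration)]) -> Duration {
    phases.iter().map(|(_, d)| *d).sum()
}

/// Which phase is active `elapsed` after the worker starts,
/// restarting from the first phase once the last one ends.
fn phase_at(phases: &[(Phase, Duration)], elapsed: Duration) -> Option<Phase> {
    let cycle = cycle_len(phases).as_nanos();
    if cycle == 0 {
        return None; // no phases, or all zero-length
    }
    let mut offset = elapsed.as_nanos() % cycle;
    for (phase, d) in phases {
        if offset < d.as_nanos() {
            return Some(*phase);
        }
        offset -= d.as_nanos();
    }
    None
}

/// Number of complete passes that fit in a run of length `run`.
fn full_cycles(phases: &[(Phase, Duration)], run: Duration) -> u128 {
    let cycle = cycle_len(phases).as_nanos();
    if cycle == 0 { 0 } else { run.as_nanos() / cycle }
}
```

For the 100 ms spin / 20 ms yield loop, one cycle is 120 ms, so in this idealized model a 12-second run fits 100 full cycles. Real workers will drift from these boundaries depending on how the scheduler places them.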
Step 6: Tune execution
Several #[ktstr_test] attributes control how the VM runs the
scenario. The defaults are tuned for fast iteration:
| Attribute | Default | What it does |
|---|---|---|
| duration_s | 12 | Per-scenario wall-clock seconds. Workers run for this long, then stop and report. |
| watchdog_timeout_s | 5 | sched_ext watchdog fire threshold. |
| memory_mib | 2048 | VM memory in MiB. |
watchdog_timeout_s is sched_ext’s per-task stall threshold — if a
runnable task is not picked for that many seconds, the scheduler
exits with SCX_EXIT_ERROR_STALL. The scenario duration and
watchdog are independent; a 12 s scenario with a 5 s watchdog is
normal. Tune the watchdog only when the scheduler under test is
expected to legitimately leave a runnable task parked longer than
the default 5 s.
For the run we’re building, set the duration to 20 s (so each phase iteration repeats many times):
#[ktstr_test(
    scheduler = KTSTR_SCHED,
    llcs = 2,
    cores = 2,
    threads = 1,
    duration_s = 20,
)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    // body unchanged from Step 5: two cgroups via execute_defs
}
For the full attribute reference (auto-repro, performance mode, topology constraints, etc.), see The #[ktstr_test] Attribute.
Step 7: Add assertions
Every check is opt-in — no threshold is compared until you turn its
check on, either at the scheduler level or on the per-test attribute
(Checking explains the model, and
Customize Checking the override
chain). The first check to opt into is not_starved = true, which
enables three related worker-level checks together:
- Starvation: any worker with zero work units fails the test.
- Fairness spread: per-cgroup max(off-CPU%) - min(off-CPU%) must stay under the spread threshold (release default 15%; debug default 35%, since debug builds in small VMs show higher spread and the threshold loosens automatically).
- Scheduling gaps: the longest wall-clock gap observed at work-unit checkpoints must stay under the gap threshold (release default 2000 ms; debug default 3000 ms).
Cpuset isolation is separate — enable it with isolation = true.
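The starvation and spread checks reduce to simple arithmetic over per-worker numbers. The sketch below is a standalone illustration, not ktstr's code: the function names are hypothetical, the inputs are assumed to be already-computed off-CPU percentages and work-unit counts, and using `cfg!(debug_assertions)` to pick between the release and debug defaults is an assumption about how such a switch could be made.

```rust
/// Spread of one cgroup: max(off-CPU%) - min(off-CPU%) across its workers.
/// Returns None when the cgroup has no workers.
fn spread_pct(off_cpu_pct: &[f64]) -> Option<f64> {
    if off_cpu_pct.is_empty() {
        return None;
    }
    let max = off_cpu_pct.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    let min = off_cpu_pct.iter().copied().fold(f64::INFINITY, f64::min);
    Some(max - min)
}

/// Default spread threshold quoted above: 15% in release, 35% in debug.
/// ASSUMPTION: the build profile is detected via debug_assertions.
fn default_spread_threshold() -> f64 {
    if cfg!(debug_assertions) { 35.0 } else { 15.0 }
}

/// Starvation: any worker that completed zero work units.
fn any_starved(work_units: &[u64]) -> bool {
    work_units.iter().any(|&units| units == 0)
}
```

For example, workers that were off-CPU 10.0%, 22.5%, and 18.0% of the time give a spread of 12.5 points, which sits under both default thresholds.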
Override the spread threshold and add throughput-parity gates:
use std::time::Duration;
use ktstr::prelude::*;

#[ktstr_test(
    scheduler = KTSTR_SCHED,
    llcs = 2,
    cores = 2,
    threads = 1,
    duration_s = 20,
    isolation = true,
    not_starved = true,
    max_spread_pct = 20.0,
    max_throughput_cv = 0.5,
    min_work_rate = 1.0,
)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        CgroupDef::named("background_spinner")
            .workers(2)
            .work_type(WorkType::SpinWait)
            .cpuset(CpusetSpec::Llc(0)),
        CgroupDef::named("phased_worker")
            .workers(2)
            .work_type(WorkType::sequence(
                WorkPhase::Spin(Duration::from_millis(100)),
                [WorkPhase::Yield(Duration::from_millis(20))],
            ))
            .cpuset(CpusetSpec::Llc(1)),
    ])
}
What each new attribute gates:
- isolation = true: workers must only run on CPUs in their assigned cpuset; any execution on an unexpected CPU fails the test.
- not_starved = true: enables the starvation/spread/gap trio described above, at the default thresholds.
- max_spread_pct = 20.0: custom fairness threshold. It replaces the default-threshold spread verdict from not_starved with your limit (and enables the spread check on its own even without not_starved). 20.0 loosens the release default of 15.0 slightly to absorb noise from the phased worker’s yield-driven re-placement.
- max_throughput_cv = 0.5: coefficient of variation of work_units / cpu_time across workers. Catches a scheduler that gives some workers disproportionately less effective CPU.
- min_work_rate = 1.0: minimum work units per CPU-second per worker. Catches the case where every worker is equally slow (CV passes but absolute throughput is too low).
Host-side monitor checks (imbalance ratio, DSQ depth, stall detection, event rates) also run on every test, but they are report-only by default — Checking covers what they observe and how to make them enforce.
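The two throughput gates, max_throughput_cv and min_work_rate, can be illustrated with a standalone sketch. It is not ktstr's implementation: the function names are hypothetical, each worker is assumed to be summarized as a (work_units, cpu_seconds) pair, and the coefficient of variation is assumed to use the population standard deviation.

```rust
/// Per-worker rate: work units per CPU-second.
/// Workers with no recorded CPU time are skipped to avoid dividing by zero.
fn work_rates(workers: &[(u64, f64)]) -> Vec<f64> {
    workers
        .iter()
        .filter(|(_, cpu_s)| *cpu_s > 0.0)
        .map(|(units, cpu_s)| *units as f64 / *cpu_s)
        .collect()
}

/// Coefficient of variation = standard deviation / mean.
/// ASSUMPTION: population standard deviation (divide by n).
fn coefficient_of_variation(rates: &[f64]) -> Option<f64> {
    if rates.is_empty() {
        return None;
    }
    let n = rates.len() as f64;
    let mean = rates.iter().sum::<f64>() / n;
    if mean == 0.0 {
        return None;
    }
    let variance = rates.iter().map(|r| (r - mean).powi(2)).sum::<f64>() / n;
    Some(variance.sqrt() / mean)
}

/// Slowest worker's rate, the value a minimum-rate gate would compare.
fn min_rate(rates: &[f64]) -> Option<f64> {
    rates.iter().copied().reduce(f64::min)
}
```

Two workers at 300 and 100 units per CPU-second have a mean of 200 and a standard deviation of 100, so the CV is 0.5. Two workers that are both at 0.5 units per CPU-second have a CV of 0 yet a minimum rate below 1.0, which is the "equally slow" case the minimum-rate gate exists for.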
Step 8: Run it
Run the test with cargo ktstr test, scoped to this one test name:
cargo ktstr test --kernel 7.0 -- -E 'test(mixed_workloads)'
cargo ktstr test resolves the kernel image, boots a VM with the
declared topology, runs the test as the guest’s init, and reports
the result. A real passing run looks like this (transcript captured
from ktstr’s own suite — your run shows ktstr/mixed_workloads on
the PASS line instead):
cargo ktstr: fetching latest 7.0.x kernel version
cargo ktstr: latest 7.0.x kernel: 7.0.14
cargo ktstr: resolved kernel "7.0"
...
────────────
 Nextest run ID 98581174-246f-4824-a170-50992df166d7 with nextest profile: default
    Starting 1 test across 121 binaries (12531 tests skipped)
        PASS [ 34.459s] (1/1) ktstr::failure_dump_e2e ktstr/failure_dump_renders_bss_fields
────────────
     Summary [ 34.498s] 1 test run: 1 passed, 12531 skipped
cargo ktstr: test outputs
...
(1 stats sidecar(s), 0 wprof trace(s) written this run)
That run took about 35 seconds end to end on a cached kernel — VM
boot, scenario, teardown, and evaluation included. The ktstr/
prefix on the test name marks the base variant; see
Running Tests for the name shapes and the
sidecar files each run writes.
If something goes wrong instead:
- “kernel not found”: the --kernel argument points at a directory without a built kernel, or at a version the cache cannot locate. Run cargo ktstr kernel build to populate the cache; see Getting Started: Build a kernel.
- “scheduler binary not found”: the declared binary = "..." from Step 2 didn’t land where the discovery cascade looks. Set KTSTR_SCHEDULER=/path/to/binary to pin an explicit path, or rebuild the scheduler crate so the binary lands under target/{debug,release}/.
- probe-related errors (“probe skeleton load failed”, “trigger attach failed”): re-run with RUST_LOG=ktstr=debug to see the underlying libbpf reason; see Troubleshooting.
Step 9: Break it on purpose
A green run tells you the harness works; it doesn’t teach you to read a failure. Crank one threshold to an impossible value and watch what comes out. Add an iteration-rate floor that no VM this small can meet:
#[ktstr_test(
    scheduler = KTSTR_SCHED,
    llcs = 2,
    cores = 2,
    threads = 1,
    duration_s = 20,
    isolation = true,
    not_starved = true,
    max_spread_pct = 20.0,
    max_throughput_cv = 0.5,
    min_work_rate = 1.0,
    min_iteration_rate = 50_000_000.0, // deliberately impossible
)]
Below is a real capture of exactly this experiment — a demo test with the same impossible floor, on a 2-CPU topology:
ktstr_test 'throughput_gate' [sched=scx-ktstr] [topo=1n1l2c1t] failed:
worker 71 iteration rate 41903.3/s below floor 50000000.0/s
worker 73 iteration rate 37834.5/s below floor 50000000.0/s
--- stats ---
2 workers, 4 cpus, 2 migrations, worst_spread=0.0%, worst_gap=21ms
cg0: workers=1 cpus=2 spread=0.0% gap=10ms migrations=1 iter=209600
cg1: workers=1 cpus=2 spread=0.0% gap=21ms migrations=1 iter=189252
...
--- monitor ---
samples=41 max_imbalance=2.00 max_dsq_depth=0 stuck=0
avg: imbalance=1.32 nr_running/cpu=1.2 dsq/cpu=0.0
events: fallback=0 (0.0/s) keep_last=210 (52.5/s) offline=0
...
verdict: monitor OK
...
cargo ktstr: test outputs
...
FAILED throughput_gate [my_sched 1n1l2c1t]
...
replay cargo ktstr replay --filter throughput_gate --exec
How to read it:
- The header names the test, the scheduler, and the topology variant. Every detail line under it names the check that tripped, the observed value, and the threshold — here, workers managing ~40k iterations/s against a 50M floor.
- --- stats --- gives the per-cgroup roll-up: worker counts, CPUs touched, fairness spread, worst scheduling gap, migrations, and iteration totals.
- verdict: monitor OK is worth noticing: the host-side monitor saw nothing wrong. The scheduler behaved fine; the test’s own gate was impossible. When a real scheduler bug trips a check, the monitor and timeline sections are usually where the story is.
- The footer hands you a ready-to-paste cargo ktstr replay line to re-run exactly the failing variant.
The full failure anatomy — timeline, scheduler log, auto-repro,
failure-dump artifacts — is covered in
Reading Failure Output. Now delete the
min_iteration_rate line and the test goes green again.
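The detail lines in the capture follow a simple pattern: observed rate, then floor. The standalone sketch below shows how such a floor check could produce a line of that shape. It is not ktstr's code; `check_iteration_rate` is a hypothetical function, and the message format is copied from the capture above purely for illustration.

```rust
use std::time::Duration;

/// Compare one worker's iteration rate against a floor.
/// Returns the rate on success, or a detail line shaped like the
/// ones in the failure capture when the rate is below the floor.
fn check_iteration_rate(
    worker: u32,
    iterations: u64,
    elapsed: Duration,
    floor_per_s: f64,
) -> Result<f64, String> {
    let rate = iterations as f64 / elapsed.as_secs_f64();
    if rate < floor_per_s {
        Err(format!(
            "worker {worker} iteration rate {rate:.1}/s below floor {floor_per_s:.1}/s"
        ))
    } else {
        Ok(rate)
    }
}
```

With made-up inputs of 419,033 iterations over 10 seconds and a floor of 50,000,000 per second, this sketch reports a rate of 41903.3/s below the floor, the same shape as the first detail line in the capture. The iteration count and elapsed time here are hypothetical; the capture does not state them.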
Step 10: Capture a snapshot
Threshold assertions tell you something is off; snapshots tell you
what the scheduler’s state actually was.
Op::capture_snapshot(name) freezes every vCPU long enough to read
the scheduler’s BPF map state, vCPU registers, and per-CPU counters
into a named report, then resumes the guest.
execute_defs (used so far) takes a flat list of cgroups. To inject
a snapshot, switch to execute_steps, which takes a list of Steps
— each with setup cgroups, an ops list, and a hold duration.
Warning
Within a step, ops fire before the setup cgroups are created. A single step with both the workload and a snapshot op named “after_workload” would capture an empty guest. Use two steps: a setup step that holds the workload, then a follow-up step whose op fires after the hold ends.
use std::time::Duration;
use ktstr::prelude::*;

#[ktstr_test(scheduler = KTSTR_SCHED, llcs = 2, cores = 2, threads = 1, duration_s = 20)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    execute_steps(ctx, vec![
        Step {
            setup: Setup::Defs(vec![
                CgroupDef::named("background_spinner")
                    .workers(2)
                    .work_type(WorkType::SpinWait)
                    .cpuset(CpusetSpec::Llc(0)),
                CgroupDef::named("phased_worker")
                    .workers(2)
                    .work_type(WorkType::sequence(
                        WorkPhase::Spin(Duration::from_millis(100)),
                        [WorkPhase::Yield(Duration::from_millis(20))],
                    ))
                    .cpuset(CpusetSpec::Llc(1)),
            ]),
            ops: vec![],
            hold: HoldSpec::FULL,
        },
        Step {
            setup: Setup::Defs(Vec::new()),
            ops: vec![Op::capture_snapshot("after_workload")],
            hold: HoldSpec::Fixed(Duration::ZERO),
        },
    ])
}
The first step creates the cgroups and holds them for the full
scenario duration; the second step’s op runs after that hold
finishes, so the snapshot reflects the post-workload guest state.
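The ordering rule behind the two-step layout (within a step: ops first, then cgroup setup, then the hold) can be traced with a toy model. This is a standalone sketch, not ktstr's executor; `SketchStep` and `timeline` are hypothetical names, and the hold is reduced to a plain millisecond count.

```rust
/// Toy stand-in for a step: cgroups to create, ops to fire, hold length.
struct SketchStep<'a> {
    cgroups: Vec<&'a str>,
    ops: Vec<&'a str>,
    hold_ms: u64,
}

/// Event order under the rule described above:
/// per step, ops fire first, then cgroups are created, then the hold runs.
fn timeline(steps: &[SketchStep<'_>]) -> Vec<String> {
    let mut events = Vec::new();
    for step in steps {
        for op in &step.ops {
            events.push(format!("op {op}"));
        }
        for cgroup in &step.cgroups {
            events.push(format!("create {cgroup}"));
        }
        events.push(format!("hold {}ms", step.hold_ms));
    }
    events
}
```

Putting the workload and the snapshot op in one step yields `op after_workload` before any `create` event, which is the empty-guest capture the warning describes. Splitting them into two steps yields `create`, `hold 20000ms`, then `op after_workload`.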
Downstream code reads the captured report by name and walks fields
with a dotted-path accessor — e.g.
snap.var("nr_dispatched").as_u64()? reads a scheduler global. For
the traversal API, error handling, and the write-driven
Op::watch_snapshot variant, see
Snapshots.
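For readers unfamiliar with dotted-path accessors, the idea is to split the path on `.` and descend one key per segment. The sketch below shows that technique on a toy value tree. It is not the ktstr snapshot API: the `Value` enum, its `get` and `as_u64` methods, and the `stats` key are all hypothetical and exist only to show the traversal.

```rust
use std::collections::BTreeMap;

/// Toy value tree: either a leaf number or a map of named fields.
#[derive(Debug, PartialEq)]
enum Value {
    U64(u64),
    Map(BTreeMap<String, Value>),
}

impl Value {
    /// Walk a dotted path such as "stats.nr_dispatched".
    /// Returns None as soon as a segment is missing or hits a leaf.
    fn get(&self, path: &str) -> Option<&Value> {
        path.split('.').try_fold(self, |node, key| match node {
            Value::Map(fields) => fields.get(key),
            Value::U64(_) => None,
        })
    }

    /// Read a leaf as u64; None when the node is a map.
    fn as_u64(&self) -> Option<u64> {
        match self {
            Value::U64(v) => Some(*v),
            Value::Map(_) => None,
        }
    }
}
```

Returning `None` for a missing segment keeps the caller in control of error handling, which pairs naturally with `?` in a test body.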
The complete test
The shape exercised by every step above, in one file — the Step 7 assertions plus the Step 10 snapshot steps:
use std::time::Duration;
use ktstr::prelude::*;

declare_scheduler!(KTSTR_SCHED, {
    name = "ktstr_sched",
    binary = "scx-ktstr",
});

#[ktstr_test(
    scheduler = KTSTR_SCHED,
    llcs = 2,
    cores = 2,
    threads = 1,
    duration_s = 20,
    isolation = true,
    not_starved = true,
    max_spread_pct = 20.0,
    max_throughput_cv = 0.5,
    min_work_rate = 1.0,
)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    execute_steps(ctx, vec![
        Step {
            setup: Setup::Defs(vec![
                CgroupDef::named("background_spinner")
                    .workers(2)
                    .work_type(WorkType::SpinWait)
                    .cpuset(CpusetSpec::Llc(0)),
                CgroupDef::named("phased_worker")
                    .workers(2)
                    .work_type(WorkType::sequence(
                        WorkPhase::Spin(Duration::from_millis(100)),
                        [WorkPhase::Yield(Duration::from_millis(20))],
                    ))
                    .cpuset(CpusetSpec::Llc(1)),
            ]),
            ops: vec![],
            hold: HoldSpec::FULL,
        },
        Step {
            setup: Setup::Defs(Vec::new()),
            ops: vec![Op::capture_snapshot("after_workload")],
            hold: HoldSpec::Fixed(Duration::ZERO),
        },
    ])
}
Run it:
cargo ktstr test --kernel 7.0 -- -E 'test(mixed_workloads)'
Going further
Each of these builds directly on the test you just wrote.
- Gauntlet. #[ktstr_test] doesn’t emit just one test; it also generates variants that run the same body across every accepted topology preset (gauntlet/mixed_workloads/smt-2llc, …), catching the bugs that only odd LLC counts, SMT siblings, or NUMA crossings expose. See Gauntlet.
- Worker identity. .comm("name"), .nice(n), and .pcomm("name") on CgroupDef give workers realistic names and priorities for schedulers that key on task->comm or nice values. See Work Types.
- Inline scheduler config. Schedulers like scx_layered take a JSON config file; config_file_def on the scheduler plus config = … on the test writes it into the guest. See The #[ktstr_test] Attribute.
- Periodic capture and temporal assertions. num_snapshots = N captures BPF state at evenly spaced points across the run, and a post_vm callback asserts temporal patterns over the series (nondecreasing counters, bounded rates, convergence). See Periodic Capture and Temporal Assertions.
- Performance mode. For benchmark-grade runs, ktstr pins vCPUs to reserved host cores and strips host scheduling noise; for topologies your host can’t mirror, no_perf_mode = true builds the virtual topology as declared. See Performance Mode.
- Stats and regression gates. Every run writes machine-readable sidecars; cargo ktstr stats aggregates them and cargo ktstr perf-delta gates HEAD against a baseline. See Runs and Regression Gates.
- Custom scenarios. When the declarative ops can’t express your scenario, the test body is arbitrary Rust: resize cpusets based on observed telemetry, assert on migrations directly. See Custom Scenarios and Ops, Steps, and Backdrop.