Zero to ktstr

This tutorial walks through writing a complete #[ktstr_test] from scratch. By the end you’ll have a scheduler test that runs two cgroups with different lifecycle patterns across a multi-LLC topology, asserts fairness, throughput parity, and cpuset isolation — and you’ll have broken it on purpose once, so real failures look familiar.

Already have a scheduler binary? This tutorial teaches ktstr from the ground up. If you have an existing scx_X you want to test, jump to one of the targeted recipes instead: test-new-scheduler.md (5 minutes, validates basic behavior), ab-compare.md (compare two scheduler builds), or diagnose-slow-scheduler.md (debug performance regressions).

What you’ll build

A test named mixed_workloads that:

  • Runs two cgroups on separate LLCs:
    • background_spinner — a persistent CPU-bound load that runs for the entire test duration.
    • phased_worker — a worker that loops through explicit Spin → Yield → Spin → Yield … phases via WorkType::Sequence.
  • Targets a 2-LLC, 4-core topology so the scheduler has a real cache boundary to respect.
  • Sets an explicit test duration.
  • Asserts fairness (per-cgroup spread), throughput parity (CV across workers + minimum rate), and cpuset isolation (workers stay on their assigned CPUs).
  • Fails once, deliberately, so you learn the failure output.
  • Captures a snapshot of the scheduler’s BPF state after the workload.

The complete test is at the end of this page.

Prerequisites

Getting Started covers the toolchain, KVM access, the dev-dependency, and building a bootable kernel (Build a kernel). With those in place, create a file under your crate’s tests/ directory (e.g. tests/mixed_workloads.rs) and follow along.

Step 1: The skeleton

Every #[ktstr_test] is a Rust function that takes &Ctx and returns Result<AssertResult>. Start with an empty body that passes unconditionally:

use ktstr::prelude::*;

#[ktstr_test(llcs = 1, cores = 2, threads = 1)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    let _ = ctx;
    Ok(AssertResult::pass())
}

let _ = ctx; keeps the unused-variable lint quiet at the skeleton stage; Step 2 onward uses ctx.

Try it. Once this file compiles, run just this test with cargo ktstr test --kernel 7.0 -- -E 'test(mixed_workloads)'. A bare-skeleton test passes immediately — the rest of the tutorial adds the workload and assertions on top.

use ktstr::prelude::*; brings in every type the test body needs — Ctx, AssertResult, CgroupDef, WorkType, CpusetSpec, execute_defs, and the Result alias from anyhow. The #[ktstr_test] attribute registers the function so cargo ktstr test discovers it and boots a VM with the requested topology.

A test without a scheduler = … attribute runs under the kernel’s default EEVDF scheduler — a useful baseline (see Overview). Step 2 swaps in a sched_ext scheduler so the rest of the tutorial exercises that scheduler instead.

For the full attribute reference, see The #[ktstr_test] Attribute.

Step 2: Define your scheduler

To target a sched_ext scheduler, declare it with declare_scheduler! and reference the generated const from #[ktstr_test(scheduler = …)]. The example uses scx-ktstr, the test-fixture scheduler shipped in the ktstr workspace; substitute your own binary name to target a different scheduler.

use ktstr::prelude::*;

declare_scheduler!(KTSTR_SCHED, {
    name = "ktstr_sched",
    binary = "scx-ktstr",
});

#[ktstr_test(scheduler = KTSTR_SCHED, llcs = 1, cores = 2, threads = 1)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    let _ = ctx;
    Ok(AssertResult::pass())
}

declare_scheduler! emits a pub static KTSTR_SCHED: Scheduler and registers it so the verifier sweep discovers it automatically. The scheduler = slot expects the bare const name. The fields used here:

  • name — scheduler name for display and result files.
  • binary — binary name, resolved on the host: target/{debug,release}/, the directory containing the test binary, or a KTSTR_SCHEDULER override path. The resolved binary is packed into the VM’s initramfs.

Other commonly used fields: topology = (numa, llcs, cores, threads) sets a default VM topology that per-test attributes can override; sched_args = ["--flag"] prepends CLI args to every test using this scheduler; kernels = [...] lists kernel specs for the verifier sweep. For the full surface (sysctls, kargs, config_file, gauntlet constraints, scheduler-level assertion overrides) and the manual-builder path for programmatic composition, see Scheduler Definitions.

Step 3: Add workloads

A CgroupDef declares a cgroup along with the workers that will run inside it. The builder methods configure worker count, the work each worker performs, scheduling policy, and cpuset assignment.

Add two cgroups — both running tight CPU spinners for now. Step 5 will swap one of them for a phased workload:

use ktstr::prelude::*;

#[ktstr_test(scheduler = KTSTR_SCHED, llcs = 1, cores = 2, threads = 1)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        CgroupDef::named("background_spinner")
            .workers(2)
            .work_type(WorkType::SpinWait),
        CgroupDef::named("phased_worker")
            .workers(2)
            .work_type(WorkType::SpinWait),
    ])
}

Without .cpuset(...), a cgroup’s workers may run on any CPU in the test’s topology, sharing the VM’s full CPU set with all other cgroups. .cpuset(CpusetSpec::Llc(idx)) (introduced in Step 4) restricts a cgroup to one LLC’s CPUs.

WorkType::SpinWait runs a tight CPU spin loop; it is one of many work primitives, each targeting a different kernel scheduling path — see Work Types for the full set and how to choose one.

execute_defs runs all cgroups concurrently for the test’s full duration. Use execute_steps when you need to add cgroups mid-run or swap cpusets between phases; see Ops, Steps, and Backdrop.

Step 4: Set topology

The #[ktstr_test] attribute carries the VM’s CPU topology. Dimensions are big-to-little: numa_nodes (default 1), llcs (total across all NUMA nodes), cores per LLC, and threads per core. Total CPU count is llcs * cores * threads.
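
The arithmetic above can be sketched in plain Rust. The Topology struct and its methods below are hypothetical helpers for illustration only, not ktstr types, and the label shape is an assumption inferred from the 1n1l2c1t variant name that shows up in failure output later on this page.

```rust
/// Hypothetical stand-in for the topology dimensions on #[ktstr_test].
/// Not a ktstr type; illustration only.
#[derive(Debug, Clone, Copy)]
struct Topology {
    numa_nodes: u32,
    llcs: u32,    // total across all NUMA nodes
    cores: u32,   // per LLC
    threads: u32, // per core
}

impl Topology {
    /// Total CPU count: llcs * cores * threads.
    fn total_cpus(&self) -> u32 {
        self.llcs * self.cores * self.threads
    }

    /// Compact label in the ASSUMED "<numa>n<llcs>l<cores>c<threads>t" shape.
    fn label(&self) -> String {
        format!(
            "{}n{}l{}c{}t",
            self.numa_nodes, self.llcs, self.cores, self.threads
        )
    }
}
```

For the topology this tutorial ends up with (one NUMA node, llcs = 2, cores = 2, threads = 1), total_cpus() gives 4, and under the assumed label shape the variant would read 1n2l2c1t.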

LLC count matters because the last-level cache is the primary scheduling boundary — tasks sharing an LLC benefit from shared cache lines, while cross-LLC migration carries a cold-cache penalty. A scheduler that ignores LLC topology will look fine on llcs = 1 and start failing as soon as there is a real cache boundary to respect.

Bump the topology to two LLCs with two cores each (4 CPUs total) so each cgroup can own its own LLC:

#[ktstr_test(scheduler = KTSTR_SCHED, llcs = 2, cores = 2, threads = 1)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        CgroupDef::named("background_spinner")
            .workers(2)
            .work_type(WorkType::SpinWait)
            .cpuset(CpusetSpec::Llc(0)),
        CgroupDef::named("phased_worker")
            .workers(2)
            .work_type(WorkType::SpinWait)
            .cpuset(CpusetSpec::Llc(1)),
    ])
}

CpusetSpec::Llc(idx) confines a cgroup to the CPUs that belong to LLC idx. Other variants (Numa, Range, Disjoint, Overlap, Exact) cover NUMA-node binding, fractional partitioning, and hand-built CPU sets — see Topology.
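
As a rough mental model of LLC confinement, the sketch below maps an LLC index to a CPU range. It assumes CPUs are numbered contiguously, LLC by LLC, which this page does not state; llc_cpus is a hypothetical helper and not part of ktstr.

```rust
use std::ops::Range;

/// Hypothetical helper: the CPUs belonging to LLC `idx`, ASSUMING contiguous
/// numbering (LLC 0 owns the first cores * threads CPUs, LLC 1 the next, ...).
fn llc_cpus(idx: u32, cores: u32, threads: u32) -> Range<u32> {
    let per_llc = cores * threads;
    (idx * per_llc)..((idx + 1) * per_llc)
}
```

Under that assumption, with cores = 2 and threads = 1, llc_cpus(0, 2, 1) covers CPUs 0 and 1 and llc_cpus(1, 2, 1) covers CPUs 2 and 3, so the two cgroups above would never share a CPU.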

Step 5: Compose phased work inside a cgroup

So far both cgroups run identical CPU spinners. The point of this test is to exercise a scheduler against different lifecycle patterns at once, so swap phased_worker for a worker that loops through explicit phases.

WorkType::Sequence runs each phase for its specified duration and then advances to the next; when the last phase ends, the loop restarts. The available phases are WorkPhase::Spin(Duration), WorkPhase::Sleep(Duration), WorkPhase::Yield(Duration), WorkPhase::Io(Duration), and WorkPhase::AluHot { .. }. Build the sequence with the WorkType::sequence(first, rest) constructor, which takes the first phase followed by the remaining phases. Only std::time::Duration needs an extra use line:

use std::time::Duration;
use ktstr::prelude::*;

#[ktstr_test(scheduler = KTSTR_SCHED, llcs = 2, cores = 2, threads = 1)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        // Persistent CPU pressure on LLC 0 for the whole run.
        CgroupDef::named("background_spinner")
            .workers(2)
            .work_type(WorkType::SpinWait)
            .cpuset(CpusetSpec::Llc(0)),
        // Phased worker on LLC 1: spin 100 ms, yield for 20 ms,
        // then loop. Stresses the scheduler's wake-after-yield
        // placement repeatedly while the LLC-0 spinner keeps
        // runqueue pressure constant.
        CgroupDef::named("phased_worker")
            .workers(2)
            .work_type(WorkType::sequence(
                WorkPhase::Spin(Duration::from_millis(100)),
                [WorkPhase::Yield(Duration::from_millis(20))],
            ))
            .cpuset(CpusetSpec::Llc(1)),
    ])
}

The two cgroups now exercise distinct paths concurrently: background_spinner keeps two CPUs continuously busy on LLC 0, while phased_worker alternates between burning CPU and yielding on LLC 1, exercising voluntary preemption and wakeup placement.

Both cgroups still run for the entire scenario duration: the phasing happens within each phased_worker worker’s loop. To express phasing across cgroups (e.g. add phased_worker only for the second half of the run), use execute_steps with multiple Step entries — see Ops, Steps, and Backdrop.

Step 6: Tune execution

Several #[ktstr_test] attributes control how the VM runs the scenario. The defaults are tuned for fast iteration:

Attribute           Default  What it does
duration_s          12       Per-scenario wall-clock seconds. Workers run for this long, then stop and report.
watchdog_timeout_s  5        sched_ext watchdog fire threshold.
memory_mib          2048     VM memory in MiB.

watchdog_timeout_s is sched_ext’s per-task stall threshold — if a runnable task is not picked for that many seconds, the scheduler exits with SCX_EXIT_ERROR_STALL. The scenario duration and watchdog are independent; a 12 s scenario with a 5 s watchdog is normal. Tune the watchdog only when the scheduler under test is expected to legitimately leave a runnable task parked longer than the default 5 s.

For the run we’re building, set the duration to 20 s so that the 120 ms phase loop (100 ms spin plus 20 ms yield) repeats many times:

#[ktstr_test(
    scheduler = KTSTR_SCHED,
    llcs = 2,
    cores = 2,
    threads = 1,
    duration_s = 20,
)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    // body unchanged from Step 5 — two cgroups via execute_defs
}

For the full attribute reference (auto-repro, performance mode, topology constraints, etc.), see The #[ktstr_test] Attribute.
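
To get a feel for how duration_s interacts with the phase loop from Step 5, here is a back-of-the-envelope sketch using only the standard library. full_cycles is a hypothetical helper; it assumes each phase lasts exactly its nominal duration and ignores startup and teardown overhead, so treat the result as an upper bound rather than what a real run reports.

```rust
use std::time::Duration;

/// Number of complete passes through a phase loop that fit in `total`.
/// Returns 0 for an empty or zero-length loop.
fn full_cycles(total: Duration, phases: &[Duration]) -> u128 {
    // Duration implements Sum over references, so this adds all phases.
    let cycle: Duration = phases.iter().sum();
    if cycle.is_zero() {
        return 0;
    }
    total.as_nanos() / cycle.as_nanos()
}
```

With phases of 100 ms and 20 ms, the default 12 s duration fits 100 complete loops and the 20 s duration fits 166, which is the reason for raising duration_s here.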

Step 7: Add assertions

Every check is opt-in — no threshold is compared until you turn its check on, either at the scheduler level or on the per-test attribute (Checking explains the model, and Customize Checking the override chain). The first check to opt into is not_starved = true, which enables three related worker-level checks together:

  • Starvation — any worker with zero work units fails the test.
  • Fairness spread — per-cgroup max(off-CPU%) - min(off-CPU%) must stay under the spread threshold (release default 15%; debug default 35%, because debug builds in small VMs show higher spread).
  • Scheduling gaps — the longest wall-clock gap observed at work-unit checkpoints must stay under the gap threshold (release default 2000 ms; debug default 3000 ms).
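
The fairness-spread formula in the list above can be sketched as follows. This is an illustration of the max minus min arithmetic only, not ktstr's implementation; spread_pct and spread_ok are hypothetical names, and treating "stay under" as a strict less-than comparison is an assumption.

```rust
/// Spread of per-worker off-CPU percentages within one cgroup: max - min.
/// Returns 0.0 for an empty slice.
fn spread_pct(off_cpu_pcts: &[f64]) -> f64 {
    if off_cpu_pcts.is_empty() {
        return 0.0;
    }
    let max = off_cpu_pcts.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    let min = off_cpu_pcts.iter().copied().fold(f64::INFINITY, f64::min);
    max - min
}

/// True when the spread stays under the threshold.
/// ASSUMPTION: strict comparison; the real boundary handling may differ.
fn spread_ok(off_cpu_pcts: &[f64], max_spread_pct: f64) -> bool {
    spread_pct(off_cpu_pcts) < max_spread_pct
}
```

For example, two workers that were off-CPU 10% and 22% of the time have a spread of 12, which passes the release default of 15, while workers at 5% and 30% have a spread of 25 and fail it.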

Cpuset isolation is a separate check; enable it with isolation = true. The test below also overrides the spread threshold and adds throughput-parity gates:

use std::time::Duration;
use ktstr::prelude::*;

#[ktstr_test(
    scheduler = KTSTR_SCHED,
    llcs = 2,
    cores = 2,
    threads = 1,
    duration_s = 20,
    isolation = true,
    not_starved = true,
    max_spread_pct = 20.0,
    max_throughput_cv = 0.5,
    min_work_rate = 1.0,
)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        CgroupDef::named("background_spinner")
            .workers(2)
            .work_type(WorkType::SpinWait)
            .cpuset(CpusetSpec::Llc(0)),
        CgroupDef::named("phased_worker")
            .workers(2)
            .work_type(WorkType::sequence(
                WorkPhase::Spin(Duration::from_millis(100)),
                [WorkPhase::Yield(Duration::from_millis(20))],
            ))
            .cpuset(CpusetSpec::Llc(1)),
    ])
}

What each new attribute gates:

  • isolation = true — workers must only run on CPUs in their assigned cpuset; any execution on an unexpected CPU fails the test.
  • not_starved = true — enables the starvation/spread/gap trio described above, at the default thresholds.
  • max_spread_pct = 20.0 — custom fairness threshold. It replaces the default-threshold spread verdict from not_starved with your limit (and enables the spread check on its own even without not_starved). 20.0 loosens the release default of 15.0 slightly to absorb noise from the phased worker’s yield-driven re-placement.
  • max_throughput_cv = 0.5 — coefficient of variation of work_units / cpu_time across workers. Catches a scheduler that gives some workers disproportionately less effective CPU.
  • min_work_rate = 1.0 — minimum work units per CPU-second per worker. Catches the case where every worker is equally slow (CV passes but absolute throughput is too low).
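
To make the two throughput gates concrete, the sketch below computes a per-worker rate and a coefficient of variation over those rates. It is not ktstr's code: the function names are hypothetical, and using the population standard deviation is an assumption, since this page does not say which variant ktstr uses.

```rust
/// Per-worker rate: work units per CPU-second.
fn work_rate(work_units: u64, cpu_secs: f64) -> f64 {
    work_units as f64 / cpu_secs
}

/// Coefficient of variation of a set of rates: stddev / mean.
/// ASSUMPTION: population standard deviation (divide by n).
/// Returns 0.0 for an empty slice or a zero mean.
fn coefficient_of_variation(rates: &[f64]) -> f64 {
    if rates.is_empty() {
        return 0.0;
    }
    let n = rates.len() as f64;
    let mean = rates.iter().sum::<f64>() / n;
    if mean == 0.0 {
        return 0.0;
    }
    let variance = rates.iter().map(|r| (r - mean).powi(2)).sum::<f64>() / n;
    variance.sqrt() / mean
}
```

Rates of 1.0 and 3.0 give a CV of 0.5, right at the limit configured above. Rates of 0.5 and 0.5 give a CV of 0.0, so the parity gate alone would pass, but both fall below a min_work_rate of 1.0, which is the case the second gate exists to catch.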

Host-side monitor checks (imbalance ratio, DSQ depth, stall detection, event rates) also run on every test, but they are report-only by default — Checking covers what they observe and how to make them enforce.

Step 8: Run it

Run the test with cargo ktstr test, scoped to this one test name:

cargo ktstr test --kernel 7.0 -- -E 'test(mixed_workloads)'

cargo ktstr test resolves the kernel image, boots a VM with the declared topology, runs the test as the guest’s init, and reports the result. A real passing run looks like this (transcript captured from ktstr’s own suite — your run shows ktstr/mixed_workloads on the PASS line instead):

cargo ktstr test --kernel 7.0 -- -E 'test(=ktstr/failure_dump_renders_bss_fields)'
cargo ktstr: fetching latest 7.0.x kernel version
cargo ktstr: latest 7.0.x kernel: 7.0.14
cargo ktstr: resolved kernel "7.0"
...
────────────
 Nextest run ID 98581174-246f-4824-a170-50992df166d7 with nextest profile: default
    Starting 1 test across 121 binaries (12531 tests skipped)
        PASS [  34.459s] (1/1) ktstr::failure_dump_e2e ktstr/failure_dump_renders_bss_fields
────────────
     Summary [  34.498s] 1 test run: 1 passed, 12531 skipped

cargo ktstr: test outputs
...
    (1 stats sidecar(s), 0 wprof trace(s) written this run)

That run took about 35 seconds end to end on a cached kernel — VM boot, scenario, teardown, and evaluation included. The ktstr/ prefix on the test name marks the base variant; see Running Tests for the name shapes and the sidecar files each run writes.

If something goes wrong instead:

  • “kernel not found” — the --kernel argument points at a directory without a built kernel, or at a version the cache cannot locate. Run cargo ktstr kernel build to populate the cache — see Getting Started: Build a kernel.
  • “scheduler binary not found” — the declared binary = "..." from Step 2 didn’t land where the discovery cascade looks. Set KTSTR_SCHEDULER=/path/to/binary to pin an explicit path, or rebuild the scheduler crate so the binary lands under target/{debug,release}/.
  • probe-related errors (“probe skeleton load failed”, “trigger attach failed”) — re-run with RUST_LOG=ktstr=debug to see the underlying libbpf reason; see Troubleshooting.

Step 9: Break it on purpose

A green run tells you the harness works; it doesn’t teach you to read a failure. Crank one threshold to an impossible value and watch what comes out. Add an iteration-rate floor that no VM of this size can meet:

#[ktstr_test(
    scheduler = KTSTR_SCHED,
    llcs = 2,
    cores = 2,
    threads = 1,
    duration_s = 20,
    isolation = true,
    not_starved = true,
    max_spread_pct = 20.0,
    max_throughput_cv = 0.5,
    min_work_rate = 1.0,
    min_iteration_rate = 50_000_000.0,   // deliberately impossible
)]

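Below is a real capture of the same experiment run against a smaller demo test, throughput_gate, with the same impossible floor on a 2-CPU topology (1n1l2c1t). The test name, worker IDs, and per-cgroup counts therefore differ from what mixed_workloads prints: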

ktstr_test 'throughput_gate' [sched=scx-ktstr] [topo=1n1l2c1t] failed:
  worker 71 iteration rate 41903.3/s below floor 50000000.0/s
  worker 73 iteration rate 37834.5/s below floor 50000000.0/s

--- stats ---
2 workers, 4 cpus, 2 migrations, worst_spread=0.0%, worst_gap=21ms
  cg0: workers=1 cpus=2 spread=0.0% gap=10ms migrations=1 iter=209600
  cg1: workers=1 cpus=2 spread=0.0% gap=21ms migrations=1 iter=189252
...
--- monitor ---
samples=41 max_imbalance=2.00 max_dsq_depth=0 stuck=0
avg: imbalance=1.32 nr_running/cpu=1.2 dsq/cpu=0.0
events: fallback=0 (0.0/s) keep_last=210 (52.5/s) offline=0
...
verdict: monitor OK

...
cargo ktstr: test outputs
...
    FAILED  throughput_gate  [my_sched 1n1l2c1t]
      ...
      replay        cargo ktstr replay --filter throughput_gate --exec

How to read it:

  • The header names the test, the scheduler, and the topology variant. Every detail line under it names the check that tripped, the observed value, and the threshold — here, workers managing ~40k iterations/s against a 50M floor.
  • --- stats --- gives the per-cgroup roll-up: worker counts, CPUs touched, fairness spread, worst scheduling gap, migrations, and iteration totals.
  • verdict: monitor OK is worth noticing: the host-side monitor saw nothing wrong. The scheduler behaved fine — the test’s own gate was impossible. When a real scheduler bug trips a check, the monitor and timeline sections are usually where the story is.
  • The footer hands you a ready-to-paste cargo ktstr replay line to re-run exactly the failing variant.
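
If you want to pull the numbers out of a detail line when triaging many failures, a small parser along these lines may help. It is a sketch under a loud assumption: the line wording is copied from the capture above and is not a documented, stable format. RateFailure and parse_rate_failure are hypothetical names, not part of ktstr.

```rust
/// Fields pulled from one "iteration rate ... below floor ..." detail line.
#[derive(Debug, PartialEq)]
struct RateFailure {
    worker: u32,
    observed_per_s: f64,
    floor_per_s: f64,
}

/// Parse a line shaped like the capture above; returns None on any mismatch.
/// ASSUMPTION: the wording comes from this page's transcript only.
fn parse_rate_failure(line: &str) -> Option<RateFailure> {
    let rest = line.trim().strip_prefix("worker ")?;
    let (worker, rest) = rest.split_once(" iteration rate ")?;
    let (observed, floor) = rest.split_once("/s below floor ")?;
    let floor = floor.strip_suffix("/s")?;
    Some(RateFailure {
        worker: worker.parse().ok()?,
        observed_per_s: observed.parse().ok()?,
        floor_per_s: floor.parse().ok()?,
    })
}
```

Dividing floor_per_s by observed_per_s for worker 71 in the capture gives a shortfall of more than 1000x, which is a quick signal that the threshold, not the scheduler, is the outlier.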

The full failure anatomy — timeline, scheduler log, auto-repro, failure-dump artifacts — is covered in Reading Failure Output. Now delete the min_iteration_rate line and the test goes green again.

Step 10: Capture a snapshot

Threshold assertions tell you something is off; snapshots tell you what the scheduler’s state actually was. Op::capture_snapshot(name) freezes every vCPU long enough to read the scheduler’s BPF map state, vCPU registers, and per-CPU counters into a named report, then resumes the guest.

execute_defs (used so far) takes a flat list of cgroups. To inject a snapshot, switch to execute_steps, which takes a list of Steps — each with setup cgroups, an ops list, and a hold duration.

Warning

Within a step, ops fire before the setup cgroups are created. A single step with both the workload and a snapshot op named “after_workload” would capture an empty guest. Use two steps: a setup step that holds the workload, then a follow-up step whose op fires after the hold ends.

use std::time::Duration;
use ktstr::prelude::*;

#[ktstr_test(scheduler = KTSTR_SCHED, llcs = 2, cores = 2, threads = 1, duration_s = 20)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    execute_steps(ctx, vec![
        Step {
            setup: Setup::Defs(vec![
                CgroupDef::named("background_spinner")
                    .workers(2)
                    .work_type(WorkType::SpinWait)
                    .cpuset(CpusetSpec::Llc(0)),
                CgroupDef::named("phased_worker")
                    .workers(2)
                    .work_type(WorkType::sequence(
                        WorkPhase::Spin(Duration::from_millis(100)),
                        [WorkPhase::Yield(Duration::from_millis(20))],
                    ))
                    .cpuset(CpusetSpec::Llc(1)),
            ]),
            ops: vec![],
            hold: HoldSpec::FULL,
        },
        Step {
            setup: Setup::Defs(Vec::new()),
            ops: vec![Op::capture_snapshot("after_workload")],
            hold: HoldSpec::Fixed(Duration::ZERO),
        },
    ])
}

The first step creates the cgroups and holds them for the full scenario duration; the second step’s op runs after that hold finishes, so the snapshot reflects the post-workload guest state. Downstream code reads the captured report by name and walks fields with a dotted-path accessor — e.g. snap.var("nr_dispatched").as_u64()? reads a scheduler global. For the traversal API, error handling, and the write-driven Op::watch_snapshot variant, see Snapshots.
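
The ordering rule from the warning above is easy to get backwards, so here is a toy model of it in plain Rust. ToyStep and timeline are hypothetical and deliberately simplified; they only replay the stated rule (within a step, ops fire, then setup cgroups are created, then the step holds) and do not reflect how ktstr executes steps internally.

```rust
/// Toy model of one step. Hypothetical; not ktstr's Step type.
struct ToyStep {
    ops: Vec<&'static str>,
    setup: Vec<&'static str>,
    hold_ms: u64,
}

/// Event log for a list of steps under the rule stated in the warning:
/// ops first, then cgroup creation, then the hold.
fn timeline(steps: &[ToyStep]) -> Vec<String> {
    let mut log = Vec::new();
    for step in steps {
        for op in &step.ops {
            log.push(format!("op:{op}"));
        }
        for cgroup in &step.setup {
            log.push(format!("create:{cgroup}"));
        }
        log.push(format!("hold:{}ms", step.hold_ms));
    }
    log
}
```

Feeding it a single step that carries both the workload and the snapshot op puts op:after_workload first in the log, before any create: entry. Splitting it into two steps, as the test above does, moves the op after the first step's hold.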

The complete test

Here is everything from the steps above in one file: the Step 2 scheduler declaration, the Step 7 assertions, and the Step 10 snapshot steps.

use std::time::Duration;
use ktstr::prelude::*;

declare_scheduler!(KTSTR_SCHED, {
    name = "ktstr_sched",
    binary = "scx-ktstr",
});

#[ktstr_test(
    scheduler = KTSTR_SCHED,
    llcs = 2,
    cores = 2,
    threads = 1,
    duration_s = 20,
    isolation = true,
    not_starved = true,
    max_spread_pct = 20.0,
    max_throughput_cv = 0.5,
    min_work_rate = 1.0,
)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    execute_steps(ctx, vec![
        Step {
            setup: Setup::Defs(vec![
                CgroupDef::named("background_spinner")
                    .workers(2)
                    .work_type(WorkType::SpinWait)
                    .cpuset(CpusetSpec::Llc(0)),
                CgroupDef::named("phased_worker")
                    .workers(2)
                    .work_type(WorkType::sequence(
                        WorkPhase::Spin(Duration::from_millis(100)),
                        [WorkPhase::Yield(Duration::from_millis(20))],
                    ))
                    .cpuset(CpusetSpec::Llc(1)),
            ]),
            ops: vec![],
            hold: HoldSpec::FULL,
        },
        Step {
            setup: Setup::Defs(Vec::new()),
            ops: vec![Op::capture_snapshot("after_workload")],
            hold: HoldSpec::Fixed(Duration::ZERO),
        },
    ])
}

Run it:

cargo ktstr test --kernel 7.0 -- -E 'test(mixed_workloads)'

Going further

Each of these builds directly on the test you just wrote.

  • Gauntlet. #[ktstr_test] doesn’t emit just one test — it also generates variants that run the same body across every accepted topology preset (gauntlet/mixed_workloads/smt-2llc, …), catching the bugs only odd LLC counts, SMT siblings, or NUMA crossings expose. See Gauntlet.
  • Worker identity. .comm("name"), .nice(n), and .pcomm("name") on CgroupDef give workers realistic names and priorities for schedulers that key on task->comm or nice values. See Work Types.
  • Inline scheduler config. Schedulers like scx_layered take a JSON config file; config_file_def on the scheduler plus config = … on the test writes it into the guest. See The #[ktstr_test] Attribute.
  • Periodic capture and temporal assertions. num_snapshots = N captures BPF state at evenly spaced points across the run, and a post_vm callback asserts temporal patterns over the series (nondecreasing counters, bounded rates, convergence). See Periodic Capture and Temporal Assertions.
  • Performance mode. For benchmark-grade runs, ktstr pins vCPUs to reserved host cores and strips host scheduling noise; for topologies your host can’t mirror, no_perf_mode = true builds the virtual topology as declared. See Performance Mode.
  • Stats and regression gates. Every run writes machine-readable sidecars; cargo ktstr stats aggregates them and cargo ktstr perf-delta gates HEAD against a baseline. See Runs and Regression Gates.
  • Custom scenarios. When the declarative ops can’t express your scenario, the test body is arbitrary Rust — resize cpusets based on observed telemetry, assert on migrations directly. See Custom Scenarios and Ops, Steps, and Backdrop.