Overview

ktstr

Test Linux schedulers like code. Every test boots a real kernel in a KVM micro-VM with the topology it declares — and ktstr watches what your scheduler does from the host, without touching the guest.

Scheduler bugs hide in topology: the fairness regression that only shows up on an odd LLC count, the starvation that needs SMT siblings, the crash that wants a NUMA crossing. Testing a sched_ext scheduler against those shapes has meant scrounging hardware and hand-running repro scripts. ktstr turns it into cargo test: declare the topology on the test, and the VM actually has it.

Quick taste

use ktstr::prelude::*;

declare_scheduler!(MY_SCHED, {
    name = "my_sched",
    binary = "scx_mine",
});

#[ktstr_test(scheduler = MY_SCHED, llcs = 1, cores = 2, threads = 1)]
fn steady_under_my_sched(ctx: &Ctx) -> Result<AssertResult> {
    scenarios::steady(ctx)
}

Run it against any kernel — a released version, a local source tree, or a git URL:

cargo ktstr test --kernel 7.0
cargo ktstr: fetching latest 7.0.x kernel version
cargo ktstr: latest 7.0.x kernel: 7.0.14
cargo ktstr: resolved kernel "7.0"
...
 Nextest run ID 98581174-… with nextest profile: default
    Starting 1 test across 121 binaries (12531 tests skipped)
        PASS [  34.459s] (1/1) ktstr::failure_dump_e2e ktstr/failure_dump_renders_bss_fields
...
     Summary [  34.498s] 1 test run: 1 passed, 12531 skipped

Without a scheduler attribute, tests run under the kernel’s default scheduler (EEVDF) — useful for baselines and A/B comparisons.

When it breaks, you see why

A crash log tells you where the scheduler died. ktstr also tells you what the state looked like on the way there: on a crash it boots a second VM, attaches BPF probes along the crash path, and reruns the scenario. Each probed function prints decoded struct fields, and an arrow (→) marks fields that changed between entry and exit:

cargo ktstr test — auto-repro output after a scheduler crash
=== AUTO-PROBE: scx_exit fired ===

  ktstr_enqueue                                                   main.bpf.c:21
    task_struct *p
      pid         97
      cpus_ptr    0xf(0-3)
      dsq_id      SCX_DSQ_INVALID
      ...
      scx_flags   QUEUED|ENABLED
  do_enqueue_task                                               kernel/sched/ext.c
    rq *rq
      cpu         1
    task_struct *p
      pid         97
      cpus_ptr    0xf(0-3)
      dsq_id      SCX_DSQ_INVALID          →  SCX_DSQ_LOCAL
      ...
      scx_flags   QUEUED|DEQD_FOR_SLEEP    →  QUEUED

Auto-repro is on by default and needs a kernel with the sched_ext_exit tracepoint — see Auto-Repro. For the anatomy of ordinary failures (stats, timeline, monitor verdict), see Reading Failure Output.

Real kernels under KVM

Each test gets a fresh micro-VM booting the exact kernel you target. Real cgroups, real BPF, no shared state.

How it works →

Topology as code

NUMA nodes, LLCs, cores, SMT — declared on the test attribute, realized in the guest down to the ACPI tables.

Topology →

Gauntlet

One test declaration fans out across a matrix of topology presets — odd LLC counts, SMT, NUMA crossings — with budget-aware selection for CI.

Gauntlet →

Auto-repro

Crashes rerun themselves in a probe VM that captures function arguments and struct state along the crash path.

Auto-Repro →

Design

Fidelity without overhead. Every test boots a real Linux kernel in a KVM VM with real cgroups and real BPF programs — no mocking, no containers, no state carried between tests. The VMM is purpose-built for this job; see VMM.

Direct access over tooling layers. The host-side monitor reads guest memory through BTF-resolved struct offsets — runqueues, DSQ depths, schedstat counters — loading nothing into the guest, so observation does not perturb the scheduler under test. See Monitor.

What it tests

  • Fair scheduling — workers get CPU time without starvation or excessive scheduling gaps.
  • Cpuset isolation — workers stay on assigned CPUs.
  • Dynamic operations — cgroups created, destroyed, and resized mid-run.
  • Affinity — the scheduler respects thread affinity constraints.
  • Stress — many cgroups, many workers, rapid topology changes.
  • Stall detection — the scheduler doesn’t drop tasks.

Note

ktstr is pre-release. 0.x APIs change between releases, so pin the exact version — Getting Started shows how.

Next steps

ktstr in Action

The feature tour: each capability in a few sentences, with real output where output is the point — captured from actual runs (ktstr 0.23.0, kernel 7.0.14 unless noted), trimmed but never edited.

Testing

Real kernel, clean slate, x86/arm parity

Every VM test boots its own Linux kernel in KVM — fresh state each run, no shared daemons, no containers. Topology is configurable per test: NUMA nodes, LLCs, cores per LLC, threads per core, with real ACPI SRAT/SLIT tables on x86_64 and FDT cpu nodes on aarch64 — 24 topology presets on x86_64, 14 on aarch64. The framework drives the whole lifecycle: boot, attach, scenario, collect, teardown. See Topology.

Fast boot

The initramfs base (test binary + busybox + shared libraries) is LZ4-compressed and cached in shared memory; concurrent VMs COW-map the cached base instead of rebuilding it, so boot is dominated by kernel init, not initramfs preparation. Measured on a 64-CPU host:

  initramfs spawn: 55.583µs
  kvm+kernel: 867.005µs
  setup_memory (joins initramfs): 1.409360963s
  setup_vcpus: 1.409565321s
VM setup total: 1.409619773s

Declarative scheduler registration

One macro declares a scheduler — binary, default topology, kernel filter for the verifier sweep, assertion overrides, always-on CLI args — and tests reference the const it emits:

use ktstr::declare_scheduler;

declare_scheduler!(MITOSIS, {
    name = "mitosis",
    binary = "scx_mitosis",
    topology = (1, 2, 4, 1),
    sched_args = ["--exit-dump-len", "1048576"],
});

Schedulers that take a --config JSON file declare the arg template once; each test supplies its config inline via config = …. See Scheduler Definitions and The #[ktstr_test] Attribute.

Data-driven scenarios

You declare intent as data — cgroups, cpusets, workloads, mid-run ops — and the framework creates cgroups, spawns workers, applies policies and affinity, and tears it all down. Canned scenarios grouped by scheduling concern — affinity, cpusets, dynamic cgroups, nesting, contention, stress — cover the common patterns; the ops DSL underneath (Step, Op, Backdrop) expresses the rest. See Scenarios and Ops, Steps, and Backdrop.

45 work types

Each work type targets a specific scheduling pressure, so a test can pin the kernel path a regression lives in. A sample of the 45:

Pressure                 Examples
CPU and IPC              SpinWait, AluHot, IpcVariance
Wakeup placement         FutexPingPong, WakeChain, PipeIo
Task churn               ForkExit, CgroupAttachStorm, AffinityChurn
Priority and starvation  PreemptStorm, RtStarvation, PriorityInversion
Memory and NUMA          CachePressure, PageFaultChurn, NumaWorkingSetSweep
IRQ and I/O              IrqWake, NetTraffic, IoConvoy
Benchmarks               Schbench, Taobench

Workers can also set comm, nice level, and thread-group-leader name (pcomm) to model real applications, and Custom takes a user-supplied work function. Full catalog: Work Types.

Gauntlet

A single #[ktstr_test] auto-expands across topology presets, and multi-kernel runs (--kernel A --kernel B) add the kernel as another matrix dimension. Budget-based selection maximizes coverage within a CI time limit; multi-NUMA and very large presets are opt-in via constraint attributes. See Gauntlet.

Real-kernel BPF verifier analysis

The verifier sweep boots a VM per (scheduler × kernel × topology preset), loads the scheduler through struct_ops — the same path production uses — and reads actual verified instruction counts from guest memory. Topology is a real verification axis: values baked into .rodata (like CPU counts) change what the verifier explores, so a scheduler can attach on one topology and be rejected on another.

cargo ktstr verifier --kernel 7.0 --scheduler ktstr_sched
        PASS [  12.406s] (1/4) ktstr::kaslr_axis_e2e verifier/ktstr_sched/kernel_7_0/odd-3llc
...
verifier verified_insns (per scheduler; rows: kernel, cols: BPF program, cell: range across topologies):

ktstr_sched:
 kernel      ktstr_dispatch  ktstr_dump  ktstr_dump_cpu  ktstr_dump_task  ktstr_enqueue  ktstr_exit  ktstr_exit_task  ktstr_init  ktstr_init_task  ktstr_select_cp  ktstr_yield 
 kernel_7_0  102             81          13              70               74             25          419              2296        29077            39               8           

verifier summary: 4 ✅  0 ❌  0 🇽
 topology   ktstr_sched 
 odd-3llc   ✅          
 smt-2llc   ✅          
 tiny-1llc  ✅          
 tiny-2llc  ✅          

On rejection, the log is cycle-collapsed — repeated loop-unrolling iterations are deduplicated so the offending access is readable:

Global function ktstr_dispatch() doesn't return scalar. Only those are supported.
...
--- 8x of the following 25 lines ---
; u64 t = bpf_ktime_get_ns(); @ main.bpf.c:453
38: (85) call bpf_ktime_get_ns#5      ; R0=scalar()
...
--- 6 identical iterations omitted ---
...
--- end repeat ---
192: (63) *(u32 *)(r1 +0) = r2
R1 invalid mem access 'scalar'
processed 186 insns (limit 1000000) max_states_per_insn 0 total_states 7 peak_states 7 mark_read 0

See BPF Verifier Sweep.
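
The idea behind cycle-collapsing can be sketched in a few lines of plain Rust. This is an illustration of the technique, not ktstr's implementation: `collapse_cycles` and its `max_period` parameter are hypothetical names, and the sketch prints each repeated block once instead of also eliding identical iterations inside it. Only the two marker lines follow the format shown in the log above.

```rust
/// Collapse back-to-back repeats of a block of lines into one copy of the
/// block wrapped in repeat markers. `max_period` caps the block length that
/// is searched for (an assumption made to keep the sketch cheap).
pub fn collapse_cycles(lines: &[&str], max_period: usize) -> Vec<String> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < lines.len() {
        // Find the (period, repeats) pair at position i that covers the most lines.
        let mut best: Option<(usize, usize)> = None;
        for period in 1..=max_period {
            if i + 2 * period > lines.len() {
                break; // no room left for even two copies of a block this long
            }
            let block = &lines[i..i + period];
            let mut repeats = 1;
            while i + (repeats + 1) * period <= lines.len()
                && &lines[i + repeats * period..i + (repeats + 1) * period] == block
            {
                repeats += 1;
            }
            if repeats >= 2 {
                let covered = period * repeats;
                if best.map_or(true, |(p, r)| covered > p * r) {
                    best = Some((period, repeats));
                }
            }
        }
        match best {
            Some((period, repeats)) => {
                out.push(format!("--- {repeats}x of the following {period} lines ---"));
                out.extend(lines[i..i + period].iter().map(|l| l.to_string()));
                out.push("--- end repeat ---".to_string());
                i += period * repeats;
            }
            None => {
                out.push(lines[i].to_string());
                i += 1;
            }
        }
    }
    out
}
```

Fed the lines a, b, c, b, c, b, c, d with a maximum period of 4, the sketch keeps a and d as-is and replaces the three b, c pairs with a single marked copy.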

Bare-metal export

cargo ktstr export packages a registered test as a self-extracting .run script that reproduces the scenario on real hardware, no VM. The script freezes the scheduler binary, its args and config files, and the required topology; it validates the host and refuses to displace an already-attached sched_ext scheduler.

wrote /tmp/sched_basic_proportional.run (90074903 bytes archive, 0 include files)
----- head -40 of the generated script -----
#!/bin/bash
# Generated by `cargo ktstr export`. Do not edit; regenerate to update.
...
# --- frozen test specification ---
KTSTR_TEST_NAME=sched_basic_proportional
KTSTR_SCHED_NAME=ktstr_sched
KTSTR_GIT_HASH=73730e0
NEED_LLCS=1
NEED_CORES_PER_LLC=2
NEED_THREADS_PER_CORE=1
NEED_NUMA_NODES=1
...

See cargo ktstr.

Observability

Zero-perturbation introspection

Everything is built on direct reads of guest physical memory from the host via the KVM memory mapping. Kernel state — per-CPU runqueues, sched_domain trees, schedstat and sched_ext event counters — is read through BTF-resolved struct offsets; BPF maps get typed field access via program BTF. No guest-side instrumentation, no BPF syscalls: the observer does not perturb the scheduler under test. See Monitor.
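
As a rough picture of what an offset-based read looks like, the sketch below decodes one 32-bit field from a byte slice standing in for mapped guest memory. It is not ktstr's monitor code: `read_u32_field` is a hypothetical helper, the base address and field offset are made-up inputs (in ktstr they would come from BTF), and a little-endian guest is assumed.

```rust
/// Read a little-endian u32 at `struct_base + field_offset` from `mem`,
/// a byte slice standing in for the host's mapping of guest memory.
/// Returns None when the read would fall outside the mapping.
pub fn read_u32_field(mem: &[u8], struct_base: usize, field_offset: usize) -> Option<u32> {
    let start = struct_base.checked_add(field_offset)?;
    let end = start.checked_add(4)?;
    let bytes = mem.get(start..end)?;
    let mut buf = [0u8; 4];
    buf.copy_from_slice(bytes);
    Some(u32::from_le_bytes(buf))
}
```

The point of the shape is that the read is a bounds-checked copy on the host side; nothing in it executes in, or is loaded into, the guest.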

Cast analysis

Schedulers stash kernel and arena pointers in BPF map fields declared as u64, because BTF cannot express those pointer types. The cast analyzer walks the scheduler’s instruction stream and proves which fields are really pointers and to what — so dumps chase through them and print typed structs annotated (cast→arena) or (cast→kernel) instead of raw hex. It runs on every scheduler load, no configuration (the failure-dump excerpt below shows it in action). See Monitor.

Periodic capture and temporal assertions

#[ktstr_test(num_snapshots = N)] samples BPF map fields and scheduler stats at N points across the workload window, from outside the guest. Temporal patterns — nondecreasing, rate_within, steady_within, converges_to, and friends — assert over the whole series; on-demand and write-triggered snapshots share the same machinery. See Periodic Capture, Temporal Assertions, and Snapshots.
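
To make "assert over the whole series" concrete, here is a minimal sketch of two such checks over a plain slice of samples. The function names are local stand-ins and the exact semantics are assumptions; ktstr's own nondecreasing and steady_within patterns may define tolerance and settling differently.

```rust
/// True when no sample is smaller than the one before it.
pub fn is_nondecreasing(series: &[f64]) -> bool {
    series.windows(2).all(|w| w[0] <= w[1])
}

/// True when every sample from `settle_idx` onward stays within `tolerance`
/// of the mean of that tail. An empty tail counts as a failure.
pub fn is_steady_within(series: &[f64], settle_idx: usize, tolerance: f64) -> bool {
    let tail = match series.get(settle_idx..) {
        Some(t) if !t.is_empty() => t,
        _ => return false,
    };
    let mean = tail.iter().sum::<f64>() / tail.len() as f64;
    tail.iter().all(|v| (v - mean).abs() <= tolerance)
}
```

A counter such as a dispatch total fits the first check; a value that ramps up and then settles fits the second once the ramp samples are skipped with `settle_idx`.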

Statistical regression detection

Every run writes machine-readable results; cross-run comparison with dual-gate significance thresholds (absolute and relative) catches regressions single-run assertions miss. Gated metrics include worst_spread, worst_gap_ms, worst_p99_wake_latency_us, and the duration-invariant total_run_delay_ns_per_sched; run cargo ktstr stats list-metrics for the registry. See Runs and Regression Gates and Assertable Metrics.
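
One way to read "dual-gate" is that a change is flagged only when it clears both an absolute and a relative threshold, so tiny metrics with large percentage swings and large metrics with small percentage swings both stay quiet. The sketch below implements that reading; the `Gate` type, its field names, and the "higher is worse" convention are assumptions, not ktstr's actual comparison code.

```rust
/// Hypothetical thresholds: both must be met for a change to count.
pub struct Gate {
    pub min_abs: f64, // minimum absolute increase, in the metric's own unit
    pub min_rel: f64, // minimum relative increase, e.g. 0.10 for 10%
}

/// Returns true when `candidate` is worse than `baseline` by at least
/// `min_abs` AND by at least `min_rel` of the baseline.
/// Assumes a metric where larger values are worse (a gap or a latency).
pub fn is_regression(baseline: f64, candidate: f64, gate: &Gate) -> bool {
    let delta = candidate - baseline;
    if delta <= 0.0 {
        return false;
    }
    let rel = if baseline > 0.0 { delta / baseline } else { f64::INFINITY };
    delta >= gate.min_abs && rel >= gate.min_rel
}
```

With `min_abs = 2.0` and `min_rel = 0.10`, a move from 10 to 21 is flagged, while 1.0 to 1.5 (large percentage, small absolute change) and 100 to 103 (large absolute, small percentage change) are not.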

Debugging

Failure dumps

A failing test’s stderr carries the whole story: the tripped check, per-cgroup stats, a phase timeline, the scheduler log, the monitor’s verdict. On scheduler crash, ktstr also snapshots every BPF map with fields rendered by name through BTF — .bss globals, arena allocator state, typed pointer chases — plus vCPU registers at the instant of death:

cargo ktstr test — failure dump excerpt
--- repro VM failure dump ---
DualFailureDumpReport: early=absent (max_age never crossed threshold (peak=66j, threshold=2500j)), late=(12 maps, 2 vcpu_regs)
...
map bpf_bpf.bss (type=array, value_size=448, max_entries=1)
.bss:
  scx_arena_verify_once=true   ktstr_alloc_count=76   nr_dispatched=907
  nr_enqueued=495              nr_select_cpu=372      stats_magic=6004496034161779060
...
  scx_task_allocator scx_allocator:
...
    root 0x100000006000 → sdt_desc:
      nr_free=512
      chunk 0x100000007000 (sdt_alloc) → ktstr_arena_ctx{}
...
vcpu_regs:
  vcpu 0: ip=0xffffffff96347fbf sp=0xffffffff97203e78 ptroot=0x0000000001e85003
  vcpu 1: ip=0xffffffff9560bdc5 sp=0xff3b18cb8000f778 ptroot=0x0000000001e85003

The same report is written as a JSON artifact next to the run’s stats sidecar. See Reading Failure Output and Snapshots.

Auto-repro

On a scheduler crash, ktstr extracts the crash stack, discovers the struct_ops callbacks, and reruns the scenario in a second VM with BPF probes attached along the crash path — decoded function arguments and struct state at each call site, arrows marking entry-to-exit changes:

do_enqueue_task                                               kernel/sched/ext.c
  rq *rq
    cpu         1
  task_struct *p
    pid         97
    cpus_ptr    0xf(0-3)
    dsq_id      SCX_DSQ_INVALID          →  SCX_DSQ_LOCAL
...
    scx_flags   QUEUED|DEQD_FOR_SLEEP    →  QUEUED

On by default; requires a kernel with the sched_ext_exit tracepoint. See Auto-Repro.

Interactive shell

ktstr shell boots a busybox VM and drops you into it. --include-files injects host binaries with their shared-library closure resolved automatically (recursive DT_NEEDED discovery); --exec "cmd" runs one command non-interactively. For debugging, not tests. See ktstr (standalone).
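
Recursive DT_NEEDED discovery amounts to a transitive closure over "this object needs that library" edges. The sketch below shows that closure with a breadth-first walk over an in-memory map. It is illustrative only: the map stands in for parsing ELF dynamic sections, and `library_closure` is a hypothetical name, not a ktstr function.

```rust
use std::collections::{BTreeMap, BTreeSet, VecDeque};

/// Given each object's direct DT_NEEDED entries, return every library
/// reachable from `roots` (the roots themselves are not included).
pub fn library_closure(
    needed: &BTreeMap<&str, Vec<&str>>,
    roots: &[&str],
) -> BTreeSet<String> {
    let mut seen = BTreeSet::new();
    let mut queue: VecDeque<&str> = roots.iter().copied().collect();
    while let Some(obj) = queue.pop_front() {
        for &dep in needed.get(obj).map(Vec::as_slice).unwrap_or(&[]) {
            // insert() returns false for libraries already visited, which
            // also keeps dependency cycles from looping forever.
            if seen.insert(dep.to_string()) {
                queue.push_back(dep);
            }
        }
    }
    seen
}
```

For a binary that needs libbpf and libc, where libbpf in turn needs libelf and libz, the closure contains all four libraries even though the binary names only two.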

ctprof

ktstr ctprof capture snapshots per-task and per-cgroup scheduler telemetry on the host — no VM involved. Capture before and after a change, then compare to see which processes scheduled differently:

## Primary metrics
 comm                              threads  metric             value                delta      %         %uptime 
 kworker/{N}:{N}-mm_percpu_wq                                                                                    
     kworker/{N}:{N}-mm_percpu_wq  11→37    voluntary_csw      8.697K → 101.154K    +92.457K   +1063.1%  93%     
     kworker/{N}:{N}-mm_percpu_wq  11→37    timeslices         8.699K → 101.166K    +92.467K   +1063.0%  93%     
     kworker/{N}:{N}-mm_percpu_wq  11→37    wait_time_ns       2.684s → 27.653s     +24.969s   +930.2%   93%     
...

See ctprof.

Infrastructure

Supported kernels

Capability                 Kernel requirement
CI-tested series           6.14 and 7.1, on x86_64 and aarch64, every push
Watchdog-timeout override  7.1+ via BTF (scx_sched.watchdog_timeout); older kernels via the static scx_watchdog_timeout symbol
sched_ext event counters   6.16+ (two BTF layouts); sampling is disabled when neither is present
Auto-repro probe trigger   kernels with the sched_ext_exit tracepoint

Outside the CI-tested series, the monitor degrades feature by feature rather than failing: tests still run, and unavailable capabilities are reported as absent.

Kernel management

cargo ktstr kernel build builds and caches kernel images from version numbers, local source paths, or git URLs; automatic discovery resolves cached images, host kernels, and CI-provided paths, and an optional GHA cache backend shares built kernels across CI runs. See cargo ktstr and CI.

Performance mode

Performance mode pins vCPUs to reserved host cores, pre-faults 2 MB hugepages, runs vCPU threads under SCHED_FIFO, and suppresses PAUSE/HLT exits — removing host scheduling noise so a latency spike in the guest points at the scheduler under test, not the host. Guest-side jitter from shared LLCs and memory bandwidth remains, so compare performance-mode runs only against the same host. When the host can’t physically satisfy a declared topology, no_perf_mode builds the VM anyway and skips the isolation. See Performance Mode.

Resource-budget coordination

--cpu-cap N confines kernel builds and no-perf-mode VMs to N host CPUs, reserved whole-LLC-at-a-time and coordinated between concurrent ktstr processes through per-LLC locks — so a kernel build and a performance run can share a box without trampling each other. ktstr locks lists every held lock with its holder. See Resource Budget and ktstr (standalone).

Change-scoped selection

cargo ktstr affected attributes a base..HEAD diff to the schedulers it touches and emits a JSON array ready for a GitHub Actions dynamic matrix — one job per affected scheduler instead of the whole fleet. --relevant applies the same attribution to the local working tree for a fast inner loop. Any uncertainty widens to “run all”: silently skipping an affected scheduler is the worst outcome. See CI.
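
The "widen on uncertainty" rule can be sketched as a small attribution function: every changed path must map to a known scheduler, and a single unmapped path selects everything. This is a simplified illustration, not the logic behind cargo ktstr affected; the prefix-based ownership table, the `Selection` type, and the paths used with it are all hypothetical.

```rust
use std::collections::BTreeSet;

#[derive(Debug, PartialEq)]
pub enum Selection {
    All,               // attribution was uncertain: run every scheduler
    Only(Vec<String>), // every changed path was attributed
}

/// `owners` maps a path prefix to the scheduler it belongs to.
pub fn affected(changed: &[&str], owners: &[(&str, &str)]) -> Selection {
    let mut hit = BTreeSet::new();
    for &path in changed {
        let mut matched = false;
        for &(prefix, sched) in owners {
            if path.starts_with(prefix) {
                hit.insert(sched.to_string());
                matched = true;
            }
        }
        if !matched {
            // An unattributed change could affect anything, so widen.
            return Selection::All;
        }
    }
    Selection::Only(hit.into_iter().collect())
}
```

A diff touching only one scheduler's directory yields a one-element selection; adding an unattributed file such as a lockfile to the same diff flips the result to `Selection::All`.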

Guest coverage

Profraw data is collected inside the VM over shared memory and merged with host coverage, so guest-side code paths count toward the same cargo llvm-cov report as host-side ones.


Ready to try it? Getting Started sets up your first test; the Recipes cover complete workflows.

Getting Started

Every #[ktstr_test] boots a real Linux kernel in a KVM microVM with the CPU topology the test declares, runs your workload inside it, and checks the scheduler’s behavior from the host. This page takes you from nothing to a green run.

Zero to green

cargo install --locked cargo-nextest
cargo install --locked ktstr             # installs `cargo-ktstr` + `ktstr`
cargo ktstr kernel build --kernel 7.0    # one-time full kernel build; cached after
$EDITOR tests/sched_test.rs              # write a #[ktstr_test] (below)
cargo ktstr test --kernel 7.0

Both installs are required: cargo ktstr test delegates to nextest, and the ktstr package installs cargo-ktstr (the cargo plugin behind every command in this guide) plus the standalone ktstr host CLI. The kernel build step runs a real make -j$(nproc), so plan for one full build; later runs reuse the cache. On a cached kernel, the run shown below took about 35 seconds end to end.

Prerequisites

Linux only (x86_64, aarch64). ktstr boots KVM virtual machines; it does not build or run on other platforms.

  • KVM access (/dev/kvm) — see Troubleshooting if it’s missing or unreadable
  • Rust ≥ 1.94.1 (the crate’s MSRV)
  • clang, pkg-config, make, gcc, autoconf, autopoint, flex, bison, and gawk: needed for the BPF skeletons and the vendored libbpf/libelf/zlib build
  • BTF (/sys/kernel/btf/vmlinux) — present by default on most distros
  • Internet access on first build (downloads busybox source; kernel builds download tarballs from kernel.org)

The host kernel only needs KVM. The guest kernel — the one your tests boot — needs sched_ext, which landed in 6.12; the next section builds one.

# Ubuntu/Debian
sudo apt install clang pkg-config make gcc autoconf autopoint flex bison gawk
# Fedora
sudo dnf install clang pkgconf make gcc autoconf gettext-devel flex bison gawk

Add the dependency

[dev-dependencies]
ktstr = "=0.23.0"

ktstr is pre-release: pin the exact patch version and keep the installed cargo-ktstr on the same one — minor bumps may break the test-facing API. To keep ktstr out of a scheduler crate’s normal builds, gate it behind a feature instead — see Test a New Scheduler.

Build a kernel

cargo ktstr kernel build downloads a kernel tarball from kernel.org, applies the embedded ktstr.kconfig fragment (sched_ext, BPF, kprobes, minimal boot), builds it, and caches the result:

cargo ktstr kernel build                    # latest stable series with >= 8 point releases
cargo ktstr kernel build --kernel 7.0       # highest 7.0.x release
cargo ktstr kernel build --kernel 6.14.2    # exact version
cargo ktstr kernel build --kernel ../linux  # local source tree

The bare form skips series with fewer than 8 maintenance releases — brand-new majors tend to hit build issues on older toolchains; name a version explicitly to override. cargo ktstr kernel list shows the cache and cargo ktstr kernel clean --keep 3 prunes it. You can also skip this step entirely — cargo ktstr test --kernel 7.0 builds and caches on first use.
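
The bare-form rule can be pictured as "newest series whose latest point release is at least 8". The sketch below applies that reading to a list of version triples. It is an approximation under stated assumptions: `default_series` is a hypothetical function, the highest patch number is used as the count of maintenance releases, and the 7.1.3 entry in the usage note is invented for illustration.

```rust
use std::collections::BTreeMap;

/// Pick the newest (major, minor) series whose highest patch number is at
/// least `min_point_releases`. Versions are (major, minor, patch) triples.
pub fn default_series(
    releases: &[(u32, u32, u32)],
    min_point_releases: u32,
) -> Option<(u32, u32)> {
    // Highest patch seen per series; BTreeMap keeps series in version order.
    let mut newest: BTreeMap<(u32, u32), u32> = BTreeMap::new();
    for &(major, minor, patch) in releases {
        let entry = newest.entry((major, minor)).or_insert(0);
        *entry = (*entry).max(patch);
    }
    newest
        .into_iter()
        .rev() // walk from the newest series down
        .find(|&(_, patch)| patch >= min_point_releases)
        .map(|(series, _)| series)
}
```

Given 6.14.11, 7.0.14, and a young 7.1.3, the sketch skips 7.1 and settles on 7.0, which is the behavior the paragraph above describes.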

Write a test

One mental model before the first example: your test function runs inside the VM, as the guest’s init process. execute_defs and friends create real cgroups and spawn real workers; ctx hands you the guest topology (ctx.topo) and cgroup management (ctx.cgroups).

Create a file in your crate’s tests/ directory (e.g. tests/sched_test.rs). The simplest test runs a canned scenario:

use ktstr::prelude::*;

#[ktstr_test(llcs = 1, cores = 2, threads = 1)]  // llcs = last-level caches
fn my_test(ctx: &Ctx) -> Result<AssertResult> {
    // Canned scenario: two cgroups of CPU spinners, default duration.
    scenarios::steady(ctx)
}

No scheduler attribute means the test runs under the kernel’s default EEVDF scheduler (see Overview) — a useful baseline before pointing at your own.

When the canned scenarios stop being enough, declare your own cgroups, workloads, and cpusets with CgroupDef — the Tutorial builds that up one step at a time, and Writing Tests is the reference.

Point it at a sched_ext scheduler

Declare your scheduler once and reference it from any test:

use ktstr::prelude::*;

declare_scheduler!(MY_SCHED, {
    name = "my_sched",
    binary = "scx_mysched",   // your scheduler's binary name
});

#[ktstr_test(scheduler = MY_SCHED, llcs = 2, cores = 2, threads = 1)]
fn my_sched_steady(ctx: &Ctx) -> Result<AssertResult> {
    scenarios::steady(ctx)
}

The binary is resolved on the host — target/{debug,release}/, the test binary’s directory, or a KTSTR_SCHEDULER=/path override — and packed into the VM’s initramfs. Full field reference: Scheduler Definitions; walkthrough: Test a New Scheduler.

Run it

cargo ktstr test --kernel 7.0                          # everything
cargo ktstr test --kernel 7.0 -- -E 'test(my_test)'    # one test (nextest filter)

cargo ktstr test resolves the kernel — an explicit --kernel version, path, or cache key, or, without the flag, a discovery chain through environment variables, the kernel cache, and host kernels — then wraps cargo nextest run. The full chain and flag grammar live in cargo ktstr.

Here is a real run, on a cached kernel (transcript captured from ktstr’s own suite — your run shows ktstr/my_test on the PASS line instead):

cargo ktstr test --kernel 7.0 -- -E 'test(=ktstr/failure_dump_renders_bss_fields)'
cargo ktstr: fetching latest 7.0.x kernel version
cargo ktstr: latest 7.0.x kernel: 7.0.14
cargo ktstr: resolved kernel "7.0"
...
────────────
 Nextest run ID 24c18577-cd34-43bd-9d14-b0197701c187 with nextest profile: default
    Starting 1 test across 121 binaries (12531 tests skipped)
        PASS [  34.451s] (1/1) ktstr::failure_dump_e2e ktstr/failure_dump_renders_bss_fields
────────────
     Summary [  34.490s] 1 test run: 1 passed, 12531 skipped

cargo ktstr: test outputs
...
    (1 stats sidecar(s), 0 wprof trace(s) written this run)

Reading it:

  • The first three lines are kernel resolution: --kernel 7.0 picked the newest 7.0.x release and found it already cached — no rebuild.
  • Test names have the shape crate::binary ktstr/test_name; the ktstr/ prefix marks the base variant, and the same test also generates gauntlet/ topology variants, skipped by default (see Running Tests). The 34 s covers everything: VM boot, scenario, teardown, evaluation.
  • Every run writes a stats sidecar per test under target/ktstr/{kernel}-{commit}/ — the raw material for regression gates (Runs and Regression Gates).

What gets checked

Warning

Nothing, by default. A bare #[ktstr_test] boots the VM, runs the scenario, and reports pass even if the scheduler stalled, starved workers, or never dispatched a task.

Every check is an opt-in attribute: not_starved = true enables the starvation/fairness/gap trio, while max_spread_pct, min_iteration_rate, and friends set explicit thresholds. Checking explains the model; Customize Checking shows the override flow.

When a check fails

A failing check prints the violated threshold with the observed value, then per-cgroup statistics. This excerpt is from a real run that set an impossible min_iteration_rate floor:

ktstr_test 'throughput_gate' [sched=scx-ktstr] [topo=1n1l2c1t] failed:
  worker 71 iteration rate 41903.3/s below floor 50000000.0/s
  worker 73 iteration rate 37834.5/s below floor 50000000.0/s

--- stats ---
2 workers, 4 cpus, 2 migrations, worst_spread=0.0%, worst_gap=21ms
  cg0: workers=1 cpus=2 spread=0.0% gap=10ms migrations=1 iter=209600
  cg1: workers=1 cpus=2 spread=0.0% gap=21ms migrations=1 iter=189252
...

The header names the test, scheduler, and topology variant; each detail line names the check, observed value, and threshold. The full output continues with timeline, scheduler-log, and monitor sections, plus failure-dump artifacts and a ready-to-paste cargo ktstr replay command — Reading Failure Output walks the whole anatomy.
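
The detail lines in the excerpt follow from a simple comparison: iterations divided by elapsed time, checked against the floor. The sketch below reproduces that arithmetic and message shape in plain Rust. It is not ktstr's checking code: `rate_violation` is a hypothetical helper, and the 5-second window in the usage note is an assumption (the real run's window was slightly longer, which is why its reported rate differs).

```rust
use std::time::Duration;

/// Returns a failure line when a worker's iteration rate is below the floor,
/// or None when the worker meets it.
pub fn rate_violation(
    worker: u32,
    iterations: u64,
    elapsed: Duration,
    floor_per_sec: f64,
) -> Option<String> {
    let rate = iterations as f64 / elapsed.as_secs_f64();
    if rate < floor_per_sec {
        Some(format!(
            "worker {worker} iteration rate {rate:.1}/s below floor {floor_per_sec:.1}/s"
        ))
    } else {
        None
    }
}
```

Plugging in cg0's 209600 iterations over an assumed 5 s gives 41920.0/s, far under the 50000000.0/s floor, so the helper returns a line naming the observed value and the threshold.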

Next steps

  • Tutorial: Zero to ktstr — build a complete test step by step, break it on purpose, and read the wreckage.
  • Test a New Scheduler — you have an scx_* binary and want it under test in five minutes.
  • Writing Tests — the authoring reference: attributes, scenarios, snapshots, assertions.

Zero to ktstr

This tutorial walks through writing a complete #[ktstr_test] from scratch. By the end you’ll have a scheduler test that runs two cgroups with different lifecycle patterns across a multi-LLC topology, asserts fairness, throughput parity, and cpuset isolation — and you’ll have broken it on purpose once, so real failures look familiar.

Already have a scheduler binary? This tutorial teaches ktstr from the ground up. If you have an existing scx_X you want to test, jump to one of the targeted recipes instead: test-new-scheduler.md (5 minutes, validates basic behavior), ab-compare.md (compare two scheduler builds), or diagnose-slow-scheduler.md (debug performance regressions).

What you’ll build

A test named mixed_workloads that:

  • Runs two cgroups on separate LLCs:
    • background_spinner — a persistent CPU-bound load that runs for the entire test duration.
    • phased_worker — a worker that loops through explicit Spin → Yield → Spin → Yield … phases via WorkType::Sequence.
  • Targets a 2-LLC, 4-core topology so the scheduler has a real cache boundary to respect.
  • Sets an explicit test duration.
  • Asserts fairness (per-cgroup spread), throughput parity (coefficient of variation across workers plus a minimum rate), and cpuset isolation (workers stay on their assigned CPUs).
  • Fails once, deliberately, so you learn the failure output.
  • Captures a snapshot of the scheduler’s BPF state after the workload.

The complete test is at the end of this page.

Prerequisites

Getting Started covers the toolchain, KVM access, the dev-dependency, and building a bootable kernel (Build a kernel). With those in place, create a file under your crate’s tests/ directory (e.g. tests/mixed_workloads.rs) and follow along.

Step 1: The skeleton

Every #[ktstr_test] is a Rust function that takes &Ctx and returns Result<AssertResult>. Start with an empty body that passes unconditionally:

use ktstr::prelude::*;

#[ktstr_test(llcs = 1, cores = 2, threads = 1)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    let _ = ctx;
    Ok(AssertResult::pass())
}

let _ = ctx; keeps the unused-variable lint quiet at the skeleton stage; Step 2 onward uses ctx.

Try it. Once this file compiles, run just this test with cargo ktstr test --kernel 7.0 -- -E 'test(mixed_workloads)'. A bare-skeleton test passes immediately — the rest of the tutorial adds the workload and assertions on top.

use ktstr::prelude::*; brings in everything the test body needs: Ctx, AssertResult, CgroupDef, WorkType, CpusetSpec, execute_defs, and the Result alias from anyhow. The #[ktstr_test] attribute registers the function so cargo ktstr test discovers it and boots a VM with the requested topology.

A test without a scheduler = … attribute runs under the kernel’s default EEVDF scheduler — a useful baseline (see Overview). Step 2 swaps in a sched_ext scheduler so the rest of the tutorial exercises that scheduler instead.

For the full attribute reference, see The #[ktstr_test] Attribute.

Step 2: Define your scheduler

To target a sched_ext scheduler, declare it with declare_scheduler! and reference the generated const from #[ktstr_test(scheduler = …)]. The example uses scx-ktstr, the test-fixture scheduler shipped in the ktstr workspace; substitute your own binary name to target a different scheduler.

use ktstr::prelude::*;

declare_scheduler!(KTSTR_SCHED, {
    name = "ktstr_sched",
    binary = "scx-ktstr",
});

#[ktstr_test(scheduler = KTSTR_SCHED, llcs = 1, cores = 2, threads = 1)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    let _ = ctx;
    Ok(AssertResult::pass())
}

declare_scheduler! emits a pub static KTSTR_SCHED: Scheduler and registers it so the verifier sweep discovers it automatically. The scheduler = slot expects the bare const name. The fields used here:

  • name — scheduler name for display and result files.
  • binary — binary name, resolved on the host: target/{debug,release}/, the directory containing the test binary, or a KTSTR_SCHEDULER override path. The resolved binary is packed into the VM’s initramfs.

Other commonly used fields: topology = (numa, llcs, cores, threads) sets a default VM topology that per-test attributes can override; sched_args = ["--flag"] prepends CLI args to every test using this scheduler; kernels = [...] lists kernel specs for the verifier sweep. For the full surface (sysctls, kargs, config_file, gauntlet constraints, scheduler-level assertion overrides) and the manual-builder path for programmatic composition, see Scheduler Definitions.

Step 3: Add workloads

A CgroupDef declares a cgroup along with the workers that will run inside it. The builder methods configure worker count, the work each worker performs, scheduling policy, and cpuset assignment.

Add two cgroups — both running tight CPU spinners for now. Step 5 will swap one of them for a phased workload:

use ktstr::prelude::*;

#[ktstr_test(scheduler = KTSTR_SCHED, llcs = 1, cores = 2, threads = 1)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        CgroupDef::named("background_spinner")
            .workers(2)
            .work_type(WorkType::SpinWait),
        CgroupDef::named("phased_worker")
            .workers(2)
            .work_type(WorkType::SpinWait),
    ])
}

Without .cpuset(...), a cgroup’s workers run on every CPU in the test’s topology — they share the VM’s full CPU set with all other cgroups. .cpuset(CpusetSpec::Llc(idx)) (introduced in Step 4) restricts a cgroup to one LLC’s CPUs.

WorkType::SpinWait runs a tight CPU spin loop; it is one of many work primitives, each targeting a different kernel scheduling path — see Work Types for the full set and how to choose one.

execute_defs runs each cgroup concurrently for the test’s full duration. Use execute_steps when you need to add cgroups mid-run or swap cpusets between phases — see Ops, Steps, and Backdrop.

Step 4: Set topology

The #[ktstr_test] attribute carries the VM’s CPU topology. Dimensions are big-to-little: numa_nodes (default 1), llcs (total across all NUMA nodes), cores per LLC, and threads per core. Total CPU count is llcs * cores * threads.
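
The CPU-count arithmetic is easy to check by hand, and a standalone sketch makes the "llcs is a total, not per node" point explicit. The `Topology` struct below is a local illustration, not a ktstr type, and the compact label format is inferred from the [topo=1n1l2c1t] tag in failure output, so treat it as an assumption.

```rust
/// Topology dimensions, big-to-little. `llcs` is the total across all NUMA
/// nodes, so `numa_nodes` does not multiply into the CPU count.
#[derive(Debug, Clone, Copy)]
pub struct Topology {
    pub numa_nodes: u32,
    pub llcs: u32,
    pub cores_per_llc: u32,
    pub threads_per_core: u32,
}

impl Topology {
    /// Total CPU count: llcs * cores * threads.
    pub fn total_cpus(&self) -> u32 {
        self.llcs * self.cores_per_llc * self.threads_per_core
    }

    /// Compact label in the assumed shape "<numa>n<llcs>l<cores>c<threads>t".
    pub fn label(&self) -> String {
        format!(
            "{}n{}l{}c{}t",
            self.numa_nodes, self.llcs, self.cores_per_llc, self.threads_per_core
        )
    }
}
```

For llcs = 2, cores = 2, threads = 1 the total is 4 CPUs, and raising numa_nodes to 2 while leaving llcs at 2 still gives 4, because the two LLCs are then split across the nodes instead of being duplicated.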

LLC count matters because the last-level cache is the primary scheduling boundary — tasks sharing an LLC benefit from shared cache lines, while cross-LLC migration carries a cold-cache penalty. A scheduler that ignores LLC topology will look fine on llcs = 1 and start failing as soon as there is a real cache boundary to respect.

Bump the topology to two LLCs with two cores each (4 CPUs total) so each cgroup can own its own LLC:

#[ktstr_test(scheduler = KTSTR_SCHED, llcs = 2, cores = 2, threads = 1)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        CgroupDef::named("background_spinner")
            .workers(2)
            .work_type(WorkType::SpinWait)
            .cpuset(CpusetSpec::Llc(0)),
        CgroupDef::named("phased_worker")
            .workers(2)
            .work_type(WorkType::SpinWait)
            .cpuset(CpusetSpec::Llc(1)),
    ])
}

CpusetSpec::Llc(idx) confines a cgroup to the CPUs that belong to LLC idx. Other variants (Numa, Range, Disjoint, Overlap, Exact) cover NUMA-node binding, fractional partitioning, and hand-built CPU sets — see Topology.
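To make the CPU arithmetic concrete, here is a minimal standalone sketch of how a topology turns into a CPU count and a per-LLC CPU range. `Topo` and `llc_cpus` are hypothetical names for illustration, not ktstr types, and the contiguous LLC-by-LLC CPU numbering is an assumption of the sketch rather than something the text above guarantees.

```rust
use std::ops::Range;

/// Hypothetical stand-in for the three dimensions that determine CPU count.
/// Not a ktstr type.
#[derive(Debug, Clone, Copy)]
struct Topo {
    llcs: u32,    // total LLCs across all NUMA nodes
    cores: u32,   // cores per LLC
    threads: u32, // threads per core
}

impl Topo {
    /// Total CPU count: llcs * cores * threads.
    fn total_cpus(&self) -> u32 {
        self.llcs * self.cores * self.threads
    }

    /// CPUs owned by LLC `idx`, ASSUMING CPUs are numbered contiguously,
    /// LLC by LLC. Returns None for an out-of-range index.
    fn llc_cpus(&self, idx: u32) -> Option<Range<u32>> {
        if idx >= self.llcs {
            return None;
        }
        let per_llc = self.cores * self.threads;
        Some(idx * per_llc..(idx + 1) * per_llc)
    }
}
```

Under that numbering assumption, the `llcs = 2, cores = 2, threads = 1` topology used in this step has four CPUs, with LLC 0 owning CPUs 0 and 1 and LLC 1 owning CPUs 2 and 3.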

Step 5: Compose phased work inside a cgroup

So far both cgroups run identical CPU spinners. The point of this test is to exercise a scheduler against different lifecycle patterns at once, so swap phased_worker for a worker that loops through explicit phases.

WorkType::Sequence runs each phase for its specified duration and then advances to the next; when the last phase ends the loop restarts. Phases: WorkPhase::Spin(Duration), WorkPhase::Sleep(Duration), WorkPhase::Yield(Duration), WorkPhase::Io(Duration), and WorkPhase::AluHot { .. }. Use the WorkType::sequence(first, rest) constructor. Only std::time::Duration needs an extra use line:

use std::time::Duration;
use ktstr::prelude::*;

#[ktstr_test(scheduler = KTSTR_SCHED, llcs = 2, cores = 2, threads = 1)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        // Persistent CPU pressure on LLC 0 for the whole run.
        CgroupDef::named("background_spinner")
            .workers(2)
            .work_type(WorkType::SpinWait)
            .cpuset(CpusetSpec::Llc(0)),
        // Phased worker on LLC 1: spin 100 ms, yield for 20 ms,
        // then loop. Stresses the scheduler's wake-after-yield
        // placement repeatedly while the LLC-0 spinner keeps
        // runqueue pressure constant.
        CgroupDef::named("phased_worker")
            .workers(2)
            .work_type(WorkType::sequence(
                WorkPhase::Spin(Duration::from_millis(100)),
                [WorkPhase::Yield(Duration::from_millis(20))],
            ))
            .cpuset(CpusetSpec::Llc(1)),
    ])
}

The two cgroups now exercise distinct paths concurrently: background_spinner keeps two CPUs continuously busy on LLC 0, while phased_worker alternates between burning CPU and yielding on LLC 1, exercising voluntary preemption and wakeup placement.

Both cgroups still run for the entire scenario duration: the phasing happens within each phased_worker worker’s loop. To express phasing across cgroups (e.g. add phased_worker only for the second half of the run), use execute_steps with multiple Step entries — see Ops, Steps, and Backdrop.
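For a rough sense of how often the phase loop repeats, the sketch below adds up the phase durations and divides the run length by the cycle length. `Phase`, `cycle_len`, and `full_cycles` are hypothetical helpers written for this illustration; they model an ideal schedule with no contention and do not use ktstr's own `WorkPhase`.

```rust
use std::time::Duration;

/// Hypothetical model of a timed phase; not ktstr's WorkPhase.
#[derive(Debug, Clone, Copy)]
enum Phase {
    Spin(Duration),
    Yield(Duration),
}

impl Phase {
    /// How long this phase runs before the loop advances.
    fn duration(self) -> Duration {
        match self {
            Phase::Spin(d) | Phase::Yield(d) => d,
        }
    }
}

/// Length of one pass through the phase list.
fn cycle_len(phases: &[Phase]) -> Duration {
    phases.iter().map(|p| p.duration()).sum()
}

/// Number of complete passes that fit in `run`.
/// Returns None for an empty or zero-length cycle.
fn full_cycles(phases: &[Phase], run: Duration) -> Option<u128> {
    let cycle = cycle_len(phases).as_nanos();
    if cycle == 0 {
        None
    } else {
        Some(run.as_nanos() / cycle)
    }
}
```

With 100 ms of spin and 20 ms of yield, one pass takes 120 ms, so an ideal worker completes 100 passes in a 12 s run and 166 full passes in a 20 s run. Real counts will be lower whenever the scheduler delays a worker.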

Step 6: Tune execution

Several #[ktstr_test] attributes control how the VM runs the scenario. The defaults are tuned for fast iteration:

| Attribute | Default | What it does |
| --- | --- | --- |
| duration_s | 12 | Per-scenario wall-clock seconds. Workers run for this long, then stop and report. |
| watchdog_timeout_s | 5 | sched_ext watchdog fire threshold. |
| memory_mib | 2048 | VM memory in MiB. |

watchdog_timeout_s is sched_ext’s per-task stall threshold — if a runnable task is not picked for that many seconds, the scheduler exits with SCX_EXIT_ERROR_STALL. The scenario duration and watchdog are independent; a 12 s scenario with a 5 s watchdog is normal. Tune the watchdog only when the scheduler under test is expected to legitimately leave a runnable task parked longer than the default 5 s.

For the run we’re building, set the duration to 20 s (so each phase iteration repeats many times):

#[ktstr_test(
    scheduler = KTSTR_SCHED,
    llcs = 2,
    cores = 2,
    threads = 1,
    duration_s = 20,
)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    // body unchanged from Step 5 — two cgroups via execute_defs
}

For the full attribute reference (auto-repro, performance mode, topology constraints, etc.), see The #[ktstr_test] Attribute.

Step 7: Add assertions

Every check is opt-in — no threshold is compared until you turn its check on, either at the scheduler level or on the per-test attribute (Checking explains the model, and Customize Checking the override chain). The first check to opt into is not_starved = true, which enables three related worker-level checks together:

  • Starvation — any worker with zero work units fails the test.
  • Fairness spread — per-cgroup max(off-CPU%) - min(off-CPU%) must stay under the spread threshold (release default 15%; debug default 35% — debug builds in small VMs show higher spread, so the threshold loosens automatically in debug builds).
  • Scheduling gaps — the longest wall-clock gap observed at work-unit checkpoints must stay under the gap threshold (release default 2000 ms; debug default 3000 ms).

Cpuset isolation is separate — enable it with isolation = true. Override the spread threshold and add throughput-parity gates:

use std::time::Duration;
use ktstr::prelude::*;

#[ktstr_test(
    scheduler = KTSTR_SCHED,
    llcs = 2,
    cores = 2,
    threads = 1,
    duration_s = 20,
    isolation = true,
    not_starved = true,
    max_spread_pct = 20.0,
    max_throughput_cv = 0.5,
    min_work_rate = 1.0,
)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        CgroupDef::named("background_spinner")
            .workers(2)
            .work_type(WorkType::SpinWait)
            .cpuset(CpusetSpec::Llc(0)),
        CgroupDef::named("phased_worker")
            .workers(2)
            .work_type(WorkType::sequence(
                WorkPhase::Spin(Duration::from_millis(100)),
                [WorkPhase::Yield(Duration::from_millis(20))],
            ))
            .cpuset(CpusetSpec::Llc(1)),
    ])
}

What each new attribute gates:

  • isolation = true — workers must only run on CPUs in their assigned cpuset; any execution on an unexpected CPU fails the test.
  • not_starved = true — enables the starvation/spread/gap trio described above, at the default thresholds.
  • max_spread_pct = 20.0 — custom fairness threshold. It replaces the default-threshold spread verdict from not_starved with your limit (and enables the spread check on its own even without not_starved). 20.0 loosens the release default of 15.0 slightly to absorb noise from the phased worker’s yield-driven re-placement.
  • max_throughput_cv = 0.5 — coefficient of variation of work_units / cpu_time across workers. Catches a scheduler that gives some workers disproportionately less effective CPU.
  • min_work_rate = 1.0 — minimum work units per CPU-second per worker. Catches the case where every worker is equally slow (CV passes but absolute throughput is too low).
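The two throughput gates complement each other, which a small standalone sketch can show. Everything here is hypothetical illustration code, not ktstr's implementation: `Worker`, `throughput_cv`, `slowest_rate`, and `gates_pass` are invented names, and using the population standard deviation is an assumption.

```rust
/// Hypothetical per-worker sample: work units done and CPU time consumed.
struct Worker {
    work_units: u64,
    cpu_secs: f64,
}

/// Work units per CPU-second for every worker that consumed CPU time.
fn rates(workers: &[Worker]) -> Vec<f64> {
    workers
        .iter()
        .filter(|w| w.cpu_secs > 0.0)
        .map(|w| w.work_units as f64 / w.cpu_secs)
        .collect()
}

/// Coefficient of variation of the work rates.
/// ASSUMPTION: population standard deviation divided by the mean.
fn throughput_cv(workers: &[Worker]) -> Option<f64> {
    let r = rates(workers);
    if r.is_empty() {
        return None;
    }
    let n = r.len() as f64;
    let mean = r.iter().sum::<f64>() / n;
    if mean == 0.0 {
        return None;
    }
    let var = r.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    Some(var.sqrt() / mean)
}

/// Lowest work rate among the workers, if any.
fn slowest_rate(workers: &[Worker]) -> Option<f64> {
    rates(workers).into_iter().reduce(f64::min)
}

/// Both gates together: CV at or under the cap, slowest worker at or above the floor.
fn gates_pass(workers: &[Worker], max_cv: f64, min_rate: f64) -> bool {
    throughput_cv(workers).map_or(true, |cv| cv <= max_cv)
        && slowest_rate(workers).map_or(true, |r| r >= min_rate)
}
```

Two workers that each manage 0.5 work units per CPU-second have a CV of zero, so the CV gate alone would pass them; the `min_work_rate = 1.0` floor is what fails that run.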

Host-side monitor checks (imbalance ratio, DSQ depth, stall detection, event rates) also run on every test, but they are report-only by default — Checking covers what they observe and how to make them enforce.

Step 8: Run it

Run the test with cargo ktstr test, scoped to this one test name:

cargo ktstr test --kernel 7.0 -- -E 'test(mixed_workloads)'

cargo ktstr test resolves the kernel image, boots a VM with the declared topology, runs the test as the guest’s init, and reports the result. A real passing run looks like this (transcript captured from ktstr’s own suite — your run shows ktstr/mixed_workloads on the PASS line instead):

cargo ktstr test --kernel 7.0 -- -E 'test(=ktstr/failure_dump_renders_bss_fields)'
cargo ktstr: fetching latest 7.0.x kernel version
cargo ktstr: latest 7.0.x kernel: 7.0.14
cargo ktstr: resolved kernel "7.0"
...
────────────
 Nextest run ID 98581174-246f-4824-a170-50992df166d7 with nextest profile: default
    Starting 1 test across 121 binaries (12531 tests skipped)
        PASS [  34.459s] (1/1) ktstr::failure_dump_e2e ktstr/failure_dump_renders_bss_fields
────────────
     Summary [  34.498s] 1 test run: 1 passed, 12531 skipped

cargo ktstr: test outputs
...
    (1 stats sidecar(s), 0 wprof trace(s) written this run)

That run took about 35 seconds end to end on a cached kernel — VM boot, scenario, teardown, and evaluation included. The ktstr/ prefix on the test name marks the base variant; see Running Tests for the name shapes and the sidecar files each run writes.

If something goes wrong instead:

  • “kernel not found” — the --kernel argument points at a directory without a built kernel, or at a version the cache cannot locate. Run cargo ktstr kernel build to populate the cache — see Getting Started: Build a kernel.
  • “scheduler binary not found” — the declared binary = "..." from Step 2 didn’t land where the discovery cascade looks. Set KTSTR_SCHEDULER=/path/to/binary to pin an explicit path, or rebuild the scheduler crate so the binary lands under target/{debug,release}/.
  • probe-related errors (“probe skeleton load failed”, “trigger attach failed”) — re-run with RUST_LOG=ktstr=debug to see the underlying libbpf reason; see Troubleshooting.

Step 9: Break it on purpose

A green run tells you the harness works; it doesn’t teach you to read a failure. Crank one threshold to an impossible value and watch what comes out. Add an iteration-rate floor that no VM of this size can meet:

#[ktstr_test(
    scheduler = KTSTR_SCHED,
    llcs = 2,
    cores = 2,
    threads = 1,
    duration_s = 20,
    isolation = true,
    not_starved = true,
    max_spread_pct = 20.0,
    max_throughput_cv = 0.5,
    min_work_rate = 1.0,
    min_iteration_rate = 50_000_000.0,   // deliberately impossible
)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    // body unchanged from Step 7
}

Below is a real capture of exactly this experiment — a demo test with the same impossible floor, on a 2-CPU topology:

ktstr_test 'throughput_gate' [sched=scx-ktstr] [topo=1n1l2c1t] failed:
  worker 71 iteration rate 41903.3/s below floor 50000000.0/s
  worker 73 iteration rate 37834.5/s below floor 50000000.0/s

--- stats ---
2 workers, 4 cpus, 2 migrations, worst_spread=0.0%, worst_gap=21ms
  cg0: workers=1 cpus=2 spread=0.0% gap=10ms migrations=1 iter=209600
  cg1: workers=1 cpus=2 spread=0.0% gap=21ms migrations=1 iter=189252
...
--- monitor ---
samples=41 max_imbalance=2.00 max_dsq_depth=0 stuck=0
avg: imbalance=1.32 nr_running/cpu=1.2 dsq/cpu=0.0
events: fallback=0 (0.0/s) keep_last=210 (52.5/s) offline=0
...
verdict: monitor OK

...
cargo ktstr: test outputs
...
    FAILED  throughput_gate  [my_sched 1n1l2c1t]
      ...
      replay        cargo ktstr replay --filter throughput_gate --exec

How to read it:

  • The header names the test, the scheduler, and the topology variant. Every detail line under it names the check that tripped, the observed value, and the threshold — here, workers managing ~40k iterations/s against a 50M floor.
  • --- stats --- gives the per-cgroup roll-up: worker counts, CPUs touched, fairness spread, worst scheduling gap, migrations, and iteration totals.
  • verdict: monitor OK is worth noticing: the host-side monitor saw nothing wrong. The scheduler behaved fine — the test’s own gate was impossible. When a real scheduler bug trips a check, the monitor and timeline sections are usually where the story is.
  • The footer hands you a ready-to-paste cargo ktstr replay line to re-run exactly the failing variant.
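The comparison behind those detail lines is simple enough to sketch. The helpers below are hypothetical and only imitate the wording of the capture above; they are not ktstr's formatter, and the 5 s window in the usage note is an assumed figure for illustration.

```rust
use std::time::Duration;

/// Wall-clock iteration rate in iterations per second.
fn iteration_rate(iterations: u64, elapsed: Duration) -> f64 {
    iterations as f64 / elapsed.as_secs_f64()
}

/// Returns a detail line when the rate is below the floor, None otherwise.
/// The wording imitates the capture above; it is not ktstr's own output code.
fn check_floor(worker: u32, rate: f64, floor: f64) -> Option<String> {
    (rate < floor).then(|| {
        format!("worker {worker} iteration rate {rate:.1}/s below floor {floor:.1}/s")
    })
}
```

A worker that completes 209600 iterations in an assumed 5 s window runs at 41920 iterations/s, three orders of magnitude short of a 50M floor, so `check_floor` returns a detail line. Against a floor of 1000 it returns `None`.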

The full failure anatomy — timeline, scheduler log, auto-repro, failure-dump artifacts — is covered in Reading Failure Output. Now delete the min_iteration_rate line and the test goes green again.

Step 10: Capture a snapshot

Threshold assertions tell you something is off; snapshots tell you what the scheduler’s state actually was. Op::capture_snapshot(name) freezes every vCPU long enough to read the scheduler’s BPF map state, vCPU registers, and per-CPU counters into a named report, then resumes the guest.

execute_defs (used so far) takes a flat list of cgroups. To inject a snapshot, switch to execute_steps, which takes a list of Steps — each with setup cgroups, an ops list, and a hold duration.

Warning

Within a step, ops fire before the setup cgroups are created. A single step with both the workload and a snapshot op named “after_workload” would capture an empty guest. Use two steps: a setup step that holds the workload, then a follow-up step whose op fires after the hold ends.

use std::time::Duration;
use ktstr::prelude::*;

#[ktstr_test(scheduler = KTSTR_SCHED, llcs = 2, cores = 2, threads = 1, duration_s = 20)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    execute_steps(ctx, vec![
        Step {
            setup: Setup::Defs(vec![
                CgroupDef::named("background_spinner")
                    .workers(2)
                    .work_type(WorkType::SpinWait)
                    .cpuset(CpusetSpec::Llc(0)),
                CgroupDef::named("phased_worker")
                    .workers(2)
                    .work_type(WorkType::sequence(
                        WorkPhase::Spin(Duration::from_millis(100)),
                        [WorkPhase::Yield(Duration::from_millis(20))],
                    ))
                    .cpuset(CpusetSpec::Llc(1)),
            ]),
            ops: vec![],
            hold: HoldSpec::FULL,
        },
        Step {
            setup: Setup::Defs(Vec::new()),
            ops: vec![Op::capture_snapshot("after_workload")],
            hold: HoldSpec::Fixed(Duration::ZERO),
        },
    ])
}

The first step creates the cgroups and holds them for the full scenario duration; the second step’s op runs after that hold finishes, so the snapshot reflects the post-workload guest state. Downstream code reads the captured report by name and walks fields with a dotted-path accessor — e.g. snap.var("nr_dispatched").as_u64()? reads a scheduler global. For the traversal API, error handling, and the write-driven Op::watch_snapshot variant, see Snapshots.
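The ordering rule from the warning is easier to see as an event log. The sketch below is a toy model with invented names (`StepModel`, `timeline`); it does no VM work and only encodes the documented order within a step: ops fire, then setup cgroups are created, then the hold elapses.

```rust
/// Hypothetical, simplified model of a step: names only, no real VM work.
struct StepModel {
    setup: Vec<&'static str>, // cgroups to create
    ops: Vec<&'static str>,   // ops to fire
    hold_ms: u64,             // hold length in milliseconds
}

/// Flattens steps into an event log using the ordering described above:
/// within a step, ops fire first, then setup cgroups are created,
/// then the hold elapses.
fn timeline(steps: &[StepModel]) -> Vec<String> {
    let mut log = Vec::new();
    for step in steps {
        for op in &step.ops {
            log.push(format!("op:{op}"));
        }
        for cg in &step.setup {
            log.push(format!("create:{cg}"));
        }
        log.push(format!("hold:{}ms", step.hold_ms));
    }
    log
}
```

Feeding the model a single step that carries both the workload and the capture op puts `op:capture` before `create:phased_worker`. Splitting it into two steps yields `create`, `hold`, then `op:capture`, which is the order the test above relies on.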

The complete test

The shape exercised by every step above, in one file — the Step 7 assertions plus the Step 10 snapshot steps:

use std::time::Duration;
use ktstr::prelude::*;

declare_scheduler!(KTSTR_SCHED, {
    name = "ktstr_sched",
    binary = "scx-ktstr",
});

#[ktstr_test(
    scheduler = KTSTR_SCHED,
    llcs = 2,
    cores = 2,
    threads = 1,
    duration_s = 20,
    isolation = true,
    not_starved = true,
    max_spread_pct = 20.0,
    max_throughput_cv = 0.5,
    min_work_rate = 1.0,
)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    execute_steps(ctx, vec![
        Step {
            setup: Setup::Defs(vec![
                CgroupDef::named("background_spinner")
                    .workers(2)
                    .work_type(WorkType::SpinWait)
                    .cpuset(CpusetSpec::Llc(0)),
                CgroupDef::named("phased_worker")
                    .workers(2)
                    .work_type(WorkType::sequence(
                        WorkPhase::Spin(Duration::from_millis(100)),
                        [WorkPhase::Yield(Duration::from_millis(20))],
                    ))
                    .cpuset(CpusetSpec::Llc(1)),
            ]),
            ops: vec![],
            hold: HoldSpec::FULL,
        },
        Step {
            setup: Setup::Defs(Vec::new()),
            ops: vec![Op::capture_snapshot("after_workload")],
            hold: HoldSpec::Fixed(Duration::ZERO),
        },
    ])
}

Run it:

cargo ktstr test --kernel 7.0 -- -E 'test(mixed_workloads)'

Going further

Each of these builds directly on the test you just wrote.

  • Gauntlet. #[ktstr_test] doesn’t emit just one test — it also generates variants that run the same body across every accepted topology preset (gauntlet/mixed_workloads/smt-2llc, …), catching the bugs only odd LLC counts, SMT siblings, or NUMA crossings expose. See Gauntlet.
  • Worker identity. .comm("name"), .nice(n), and .pcomm("name") on CgroupDef give workers realistic names and priorities for schedulers that key on task->comm or nice values. See Work Types.
  • Inline scheduler config. Schedulers like scx_layered take a JSON config file; config_file_def on the scheduler plus config = … on the test writes it into the guest. See The #[ktstr_test] Attribute.
  • Periodic capture and temporal assertions. num_snapshots = N captures BPF state at evenly spaced points across the run, and a post_vm callback asserts temporal patterns over the series (nondecreasing counters, bounded rates, convergence). See Periodic Capture and Temporal Assertions.
  • Performance mode. For benchmark-grade runs, ktstr pins vCPUs to reserved host cores and strips host scheduling noise; for topologies your host can’t mirror, no_perf_mode = true builds the virtual topology as declared. See Performance Mode.
  • Stats and regression gates. Every run writes machine-readable sidecars; cargo ktstr stats aggregates them and cargo ktstr perf-delta gates HEAD against a baseline. See Runs and Regression Gates.
  • Custom scenarios. When the declarative ops can’t express your scenario, the test body is arbitrary Rust — resize cpusets based on observed telemetry, assert on migrations directly. See Custom Scenarios and Ops, Steps, and Backdrop.

Writing Tests

Tests are Rust functions annotated with #[ktstr_test]. Each test boots a KVM VM, runs the scenario inside it, and evaluates results on the host.

use ktstr::prelude::*;

#[ktstr_test(llcs = 1, cores = 2, threads = 1)]
fn my_test(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        ctx.cgroup_def("cg_0"),
        ctx.cgroup_def("cg_1"),
    ])
}

ctx.cgroup_def("name") is shorthand for CgroupDef::named("name").workers(ctx.workers_per_cgroup) — the common case. Use CgroupDef::named(...).workers(N).work_type(...) directly when the test needs to customize worker count or work type.

Run with cargo ktstr test --kernel 7.0 (see Getting Started for setup). A passing run is one nextest line per test; the VM boot, scenario, and teardown all happen inside the reported duration:

cargo ktstr: resolved kernel "7.0"
...
 Nextest run ID 24c18577-... with nextest profile: default
    Starting 1 test across 121 binaries (12531 tests skipped)
        PASS [  34.451s] (1/1) ktstr::failure_dump_e2e ktstr/failure_dump_renders_bss_fields
...
     Summary [  34.490s] 1 test run: 1 passed, 12531 skipped

Every test gets the same machinery for free: a fresh VM per test (no state shared between tests), a failure dump with BTF-rendered scheduler BPF state if the scheduler crashes (see Reading Failure Output), and an automatic second-VM reproduction run with probes attached (Auto-Repro). Each test also expands into gauntlet variants across topology presets — see Gauntlet.

Warning

No worker checks run by default. The example above passes as long as nothing crashes — it does not assert fairness, starvation, or gaps. Opt in with not_starved = true and the threshold attributes; see Checking for the model.

Where to go next

  • The #[ktstr_test] Attribute — the full attribute reference: topology, timing, checking thresholds, execution knobs.
  • Scheduler Definitions — declare_scheduler!: how the scheduler under test is named, found, configured, and launched.
  • Payloads and Included Files — run benchmark binaries (schbench, fio, …) alongside workers and extract their metrics.
  • Custom Scenarios — scenario logic the ops system cannot express, written directly in the test body.
  • Snapshots — capture scheduler BPF state on demand mid-scenario and assert on it.
  • Watch Snapshots — capture at the exact instant the kernel writes a chosen symbol.
  • Periodic Capture — cadenced BPF-state sampling across the workload window, no scenario code required.
  • Temporal Assertions — assert on trajectories: counters that only advance, metrics that hold steady, systems that converge.

The #[ktstr_test] Attribute

#[ktstr_test] registers a function as an integration test that boots a VM with a declared topology and runs the function body inside it. This page is the attribute reference.

Most tests need only a handful of attributes: a scheduler, a topology dimension or two, a duration, and the checking thresholds the test is actually about.

use ktstr::prelude::*;

declare_scheduler!(MY_SCHED, {
    name = "my_sched",
    binary = "scx_my_sched",
    topology = (1, 2, 4, 1),
});

#[ktstr_test(
    scheduler = MY_SCHED,   // scheduler under test (default: kernel EEVDF)
    threads = 2,            // override one dimension; the rest inherit
    duration_s = 10,        // workload window (default 12 s)
    not_starved = true,     // enable starvation / spread / gap checks
    max_spread_pct = 20.0,  // override the fairness-spread threshold
)]
fn smt_fairness(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![ctx.cgroup_def("cg_a"), ctx.cgroup_def("cg_b")])
}

The function signature is fn(&Ctx) -> anyhow::Result<AssertResult>. scheduler = expects the bare const emitted by declare_scheduler! — see Scheduler Definitions.

Attribute forms

All attributes are optional, with defaults, and most take key = value. The sixteen bool attributes (auto_repro, expect_auto_repro, not_starved, isolation, performance_mode, pci, no_perf_mode, requires_smt, expect_err, survives_storm, allow_inconclusive, fail_on_stall, host_only, ignore, kaslr, wprof) also accept a bare form as shorthand for = true: #[ktstr_test(host_only)] equals #[ktstr_test(host_only = true)]. auto_repro and kaslr default to true, so their meaningful spelling is auto_repro = false / kaslr = false; the other fourteen default to false (or unset).

Each attribute key may appear at most once per invocation; a duplicate key fails at macro expansion rather than silently letting the later value win.

Topology

| Attribute | Default | Description |
| --- | --- | --- |
| numa_nodes | inherited | NUMA nodes |
| llcs | inherited | Total LLCs (not per node) |
| cores | inherited | Cores per LLC |
| threads | inherited | Threads per core |
| memory_mib | 2048 | VM memory floor in MiB (see below) |

Each dimension independently inherits from Scheduler.topology when a scheduler is specified and that dimension is not set. Without a scheduler, unset dimensions use the macro defaults (numa_nodes = 1, llcs = 1, cores = 2, threads = 1). See Topology for the notation and what the guest actually gets.
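Per-dimension inheritance can be sketched with `Option` fields. This is an illustrative model, not the macro's expansion: `Dims`, `Overrides`, and `resolve` are hypothetical names, and passing `None` for the scheduler topology stands for "no scheduler specified".

```rust
/// (numa_nodes, llcs, cores, threads), in the same order as `topology = (...)`.
type Dims = (u32, u32, u32, u32);

/// Macro defaults quoted in the text above.
const MACRO_DEFAULTS: Dims = (1, 1, 2, 1);

/// Hypothetical per-test overrides; None means "not set on the attribute".
#[derive(Default)]
struct Overrides {
    numa_nodes: Option<u32>,
    llcs: Option<u32>,
    cores: Option<u32>,
    threads: Option<u32>,
}

/// Resolves each dimension independently: a per-test value wins, otherwise
/// the scheduler's topology, otherwise the macro default.
fn resolve(o: &Overrides, scheduler_topology: Option<Dims>) -> Dims {
    let base = scheduler_topology.unwrap_or(MACRO_DEFAULTS);
    (
        o.numa_nodes.unwrap_or(base.0),
        o.llcs.unwrap_or(base.1),
        o.cores.unwrap_or(base.2),
        o.threads.unwrap_or(base.3),
    )
}
```

Applied to the example at the top of this page, a scheduler topology of `(1, 2, 4, 1)` with `threads = 2` on the test resolves to `(1, 2, 4, 2)`, which is 16 CPUs.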

Memory

memory_mib is one of three floors: the framework allocates max(total_cpus * 64, 256, memory_mib) MiB at VM launch. Above 32 vCPUs the CPU-based floor dominates the default 2048, so a 126-vCPU test gets 8064 MiB regardless. Raise memory_mib only when the test needs more headroom than the per-CPU budget provides.
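The three-floor rule is a one-liner; the sketch below restates it as a standalone function so the crossover at 32 vCPUs can be checked. `vm_memory_mib` is a hypothetical name, not a ktstr function.

```rust
/// MiB allocated at VM launch: the largest of the three floors
/// (64 MiB per CPU, 256 MiB, and the test's memory_mib).
fn vm_memory_mib(total_cpus: u64, memory_mib: u64) -> u64 {
    (total_cpus * 64).max(256).max(memory_mib)
}
```

With the default `memory_mib` of 2048, a 4-CPU test gets 2048 MiB, a 33-CPU test gets 2112 MiB, and a 126-CPU test gets 8064 MiB. Lowering `memory_mib` to 128 on a 2-CPU test still yields 256 MiB.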

Timing

| Attribute | Default | Description |
| --- | --- | --- |
| duration_s | 12 | Workload window in seconds (ctx.duration) |
| watchdog_timeout_s | 5 | sched_ext watchdog override in seconds |

The watchdog override is applied via scx_sched.watchdog_timeout on 7.1+ kernels and via the static scx_watchdog_timeout symbol on earlier kernels; when neither path is available the override logs a warning and the kernel default stands.

Checking thresholds

Checking attributes override the merged check set (library defaults → scheduler-level assert → per-test attributes). Checking explains the two evaluation channels; Customize Checking owns the merge rules and worked overrides. Everything here is inherited when unset.

Worker checks — evaluated from per-worker telemetry after the scenario:

| Attribute | Unit | Example | Fails when |
| --- | --- | --- | --- |
| not_starved | bool | not_starved = true | any worker finishes with zero work units; also enables the spread and gap checks |
| isolation | bool | isolation = true | a worker ran on a CPU outside its cgroup’s cpuset |
| max_gap_ms | ms | max_gap_ms = 500 | a worker’s longest scheduling gap exceeds the cap |
| max_spread_pct | percentage points | max_spread_pct = 20.0 | max−min worker off-CPU% exceeds the cap |
| max_throughput_cv | coefficient of variation | max_throughput_cv = 0.35 | per-worker throughput CV exceeds the cap |
| min_work_rate | work units / CPU-second | min_work_rate = 1000.0 | a worker’s work rate falls below the floor |
| min_iteration_rate | iterations / second | min_iteration_rate = 50000.0 | a worker’s wall-clock iteration rate falls below the floor |
| max_migration_ratio | migrations / iteration | max_migration_ratio = 0.5 | a cgroup’s migration ratio exceeds the cap |
| max_p99_wake_latency_ns | ns | max_p99_wake_latency_ns = 2000000 | p99 wake latency exceeds the cap |
| max_wake_latency_cv | coefficient of variation | max_wake_latency_cv = 1.0 | wake-latency CV exceeds the cap |
| min_page_locality | fraction 0.0–1.0 | min_page_locality = 0.9 | fraction of pages on expected NUMA nodes falls below the floor |
| max_cross_node_migration_ratio | fraction 0.0–1.0 | max_cross_node_migration_ratio = 0.1 | NUMA-migrated pages / total pages exceeds the cap |
| max_slow_tier_ratio | fraction 0.0–1.0 | max_slow_tier_ratio = 0.05 | pages on memory-only (CXL) nodes exceed the cap |

Monitor thresholds — evaluated from the host monitor’s samples of guest scheduler state:

| Attribute | Unit | Example | Fails when |
| --- | --- | --- | --- |
| max_imbalance_ratio | ratio | max_imbalance_ratio = 2.0 | observed run-queue imbalance exceeds the cap |
| max_local_dsq_depth | tasks | max_local_dsq_depth = 8 | a local DSQ grows deeper than the cap |
| fail_on_stall | bool | fail_on_stall | the monitor’s stall detection fails the test instead of reporting |
| sustained_samples | samples | sustained_samples = 3 | window size a violation must persist for before it counts |
| max_fallback_rate | events/s | max_fallback_rate = 5.0 | fallback-dispatch event rate exceeds the cap |
| max_keep_last_rate | events/s | max_keep_last_rate = 100.0 | keep-last event rate exceeds the cap |
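One way to read `sustained_samples` is as a debounce over the monitor's sample stream. The sketch below is a guess at that shape, not the monitor's code: `sustained_violation` is a hypothetical helper, and both "consecutive samples" and "strictly above the cap" are assumptions.

```rust
/// True when `samples` contains at least `sustained` consecutive values
/// above `cap`.
/// ASSUMPTIONS: the window means consecutive samples, a violation is a value
/// strictly above the cap, and a window of 0 is treated as 1.
fn sustained_violation(samples: &[f64], cap: f64, sustained: usize) -> bool {
    let needed = sustained.max(1);
    let mut run = 0usize;
    for &s in samples {
        if s > cap {
            run += 1;
            if run >= needed {
                return true;
            }
        } else {
            // A clean sample resets the streak.
            run = 0;
        }
    }
    false
}
```

With a cap of 2.0 and a window of 3, the series `1.0, 3.0, 3.0, 1.0, 3.0` never counts as a violation because the streak is broken, while `1.0, 3.0, 3.0, 3.0` does.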

What a failing gate looks like

A deliberately unreachable floor, to show the output shape — 50M iterations/s is far beyond what two workers deliver:

declare_scheduler!(MY_SCHED, { name = "my_sched", binary = "scx-ktstr" });

#[ktstr_test(scheduler = MY_SCHED, llcs = 1, cores = 2, threads = 1, duration_s = 5)]
fn throughput_gate(ctx: &Ctx) -> Result<AssertResult> {
    let checks = Assert::default_checks().min_iteration_rate(50_000_000.0);
    let steps = vec![Step {
        setup: vec![ctx.cgroup_def("cg_a"), ctx.cgroup_def("cg_b")].into(),
        ops: vec![],
        hold: HoldSpec::FULL,
    }];
    execute_steps_with(ctx, steps, Some(&checks))
}

cargo ktstr test --kernel 7.0 -- -E 'test(throughput_gate)'
  TRY 1 FAIL [  31.810s] (───) ktstr::docs_demo ktstr/throughput_gate
  stderr ───
...
    ktstr_test 'throughput_gate' [sched=scx-ktstr] [topo=1n1l2c1t] failed:
      worker 71 iteration rate 41903.3/s below floor 50000000.0/s
      worker 73 iteration rate 37834.5/s below floor 50000000.0/s

    --- stats ---
    2 workers, 4 cpus, 2 migrations, worst_spread=0.0%, worst_gap=21ms
      cg0: workers=1 cpus=2 spread=0.0% gap=10ms migrations=1 iter=209600
      cg1: workers=1 cpus=2 spread=0.0% gap=21ms migrations=1 iter=189252
...
    --- monitor ---
    samples=41 max_imbalance=2.00 max_dsq_depth=0 stuck=0
    avg: imbalance=1.32 nr_running/cpu=1.2 dsq/cpu=0.0
    events: fallback=0 (0.0/s) keep_last=210 (52.5/s) offline=0
...
    verdict: monitor OK

The failure names the worker, the measured value, and the threshold it crossed; the stats and monitor sections that follow are the context for deciding whether the threshold or the scheduler is wrong. Reading Failure Output walks the full transcript.

Expected-error matchers

Two attributes narrow which failure counts as the expected bug in an expect_err = true reproducer test (both require expect_err; both may be set, composing with AND semantics):

  • expect_scx_bpf_error_contains = "literal" — the captured scx_bpf_error text must contain the literal substring. Empty strings panic at construction.
  • expect_scx_bpf_error_matches = "regex" — the text must match the regex. Empty patterns, invalid syntax, and any pattern that matches the empty string (a?, .*, ^$) panic at construction, so a vacuous matcher can never silently pass. ^/$ anchor to the whole string by default (use (?m) for line anchors), and a bare \b slips the vacuity gate — prefer a substring.

See Investigate a Crash for the pin-an-error-as-a-regression-test workflow these serve.

Topology constraints

These filter which gauntlet presets a test expands into; the base ktstr/ variant is unaffected.

| Attribute | Default | Description |
| --- | --- | --- |
| min_llcs / max_llcs | 1 / 12 | LLC-count bounds |
| min_numa_nodes / max_numa_nodes | 1 / 1 | NUMA-node bounds; multi-NUMA presets are opt-in |
| min_cpus / max_cpus | 1 / 192 | Total-CPU bounds |
| requires_smt | false | Only SMT (threads > 1) presets; skips the test entirely on aarch64, which ships no SMT presets |

The gauntlet skips presets that fail any bound. See Gauntlet for the preset table, filtering rules, and a worked expansion.
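The bounds act as a filter over the preset list, which can be sketched as follows. The defaults come from the table above; everything else is hypothetical: `Preset`, `Constraints`, and `accepted_names` are invented for illustration, and the preset names and shapes in the usage note are made up rather than taken from ktstr's preset table.

```rust
/// Hypothetical preset shape; names and dimensions are invented.
struct Preset {
    name: &'static str,
    numa_nodes: u32,
    llcs: u32,
    cores: u32,
    threads: u32,
}

/// Topology constraints with the defaults from the table above.
struct Constraints {
    min_llcs: u32,
    max_llcs: u32,
    min_numa_nodes: u32,
    max_numa_nodes: u32,
    min_cpus: u32,
    max_cpus: u32,
    requires_smt: bool,
}

impl Default for Constraints {
    fn default() -> Self {
        Constraints {
            min_llcs: 1,
            max_llcs: 12,
            min_numa_nodes: 1,
            max_numa_nodes: 1,
            min_cpus: 1,
            max_cpus: 192,
            requires_smt: false,
        }
    }
}

/// A preset is kept only if it satisfies every bound.
fn accepts(c: &Constraints, p: &Preset) -> bool {
    let cpus = p.llcs * p.cores * p.threads;
    (c.min_llcs..=c.max_llcs).contains(&p.llcs)
        && (c.min_numa_nodes..=c.max_numa_nodes).contains(&p.numa_nodes)
        && (c.min_cpus..=c.max_cpus).contains(&cpus)
        && (!c.requires_smt || p.threads > 1)
}

/// Names of the presets that survive the filter.
fn accepted_names(c: &Constraints, presets: &[Preset]) -> Vec<&'static str> {
    presets.iter().filter(|p| accepts(c, p)).map(|p| p.name).collect()
}
```

Under the defaults, a made-up two-node preset is filtered out until the test raises `max_numa_nodes`, and `requires_smt` drops every preset with `threads = 1`.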

Execution attributes

auto_repro / expect_auto_repro

On scheduler crash, auto_repro (default true) boots a second VM with probes attached to capture the state on the way to the crash — see Auto-Repro. Set auto_repro = false for faster iteration; expect_err = true also disables it. expect_auto_repro = true inverts the assertion: the test fails unless the auto-repro path actually fired (used to pin the repro machinery itself). It requires a scheduler and wprof, and is rejected alongside auto_repro = false, expect_err, or host_only.

expect_err / survives_storm / allow_inconclusive

expect_err = true asserts the run returns Err — the negative test: a scheduler crash or scenario failure is the expected outcome, and a clean pass fails the test. survives_storm = true is the positive inverse: the scx scheduler must stay attached and alive through every hold; a death or ejection fails with a survival-specific explainer. It requires a scheduler and is mutually exclusive with expect_err and expect_auto_repro. allow_inconclusive = true lets an Inconclusive verdict pass instead of exiting 2 — see Checking for when Inconclusive arises.

performance_mode / no_perf_mode / cpu_budget

performance_mode = true pins vCPUs to reserved host cores with hugepages, NUMA binding, and RT scheduling, for runs whose numbers must be comparable — see Performance Mode. no_perf_mode = true goes the other way: build the VM with the declared topology even on a smaller host, skipping all pinning. The two are mutually exclusive (rejected at compile time). cpu_budget = N overrides the auto-derived host-CPU mask size in no-perf mode (must be > 0, requires no_perf_mode; an explicit --cpu-cap / KTSTR_CPU_CAP still wins) — see Resource Budget.

kaslr

Default true: the guest boots with KASLR enabled, so tests run against the memory layout real systems have. kaslr = false appends nokaslr to the guest command line for the rare workflow that needs stable kernel addresses.

host_only

Run the function directly on the host — no VM. For tests that need host tools (cargo, nested VMs) unavailable in the guest initramfs. Mutually exclusive with scheduler, num_snapshots > 0, auto_repro = true, disk, and networks — everything that requires a VM to exist.

ignore

Emits #[ignore] on the generated test, so it is skipped by default and runs only under nextest’s --run-ignored. Use for slow-by-design tests that should not gate every local run.

pci / disk / networks

disk = CONST attaches a virtio-blk device from a const DiskConfig (built via DiskConfig::DEFAULT chained setters); the framework owns the backing file. networks = [CONST, …] attaches one virtio-net device per const NetConfig (aarch64 supports at most one). On x86_64 either device auto-enables the virtio-PCI transport; pci = true forces the PCI host bridge on without a device (default: no bridge, pci=off on the guest command line).

bpf_map_write

bpf_map_write = CONST (or [A, B]) writes a u32 into a named scheduler BPF-map field from the host, once, after the scheduler loads — pre-seeding guest state before workers start. See Snapshots — composing reads with writes.

watch_bpf_maps

watch_bpf_maps = CONST (or [A, B]) samples named scheduler BPF-map fields during the run without observer effect and surfaces each as a run-level metric. Full semantics in Watching BPF-map fields below.

perf_delta_assertions

perf_delta_assertions = CONST (or [A, B]) declares per-test performance-regression gates. Inert in a normal cargo ktstr test run — enforced only under cargo ktstr perf-delta --noise-adjust. Requires performance_mode (rejected at compile time otherwise, since unpinned numbers would make the gate misfire). See Assertable Metrics.

num_snapshots

num_snapshots = N fires N periodic BPF-state captures inside the workload’s 10%–90% window. 0 (default) disables periodic capture. Validated against the 64-capture bridge cap, host_only, and a 100 ms minimum boundary spacing. See Periodic Capture.
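To make the validation rules concrete, here is a minimal sketch of how N capture boundaries could be laid out inside the 10%–90% window. The function name, the choice of N interior points (which puts a single capture at the midpoint), and the error strings are assumptions for illustration; the real placement formula is documented in Periodic Capture, not here.

```rust
use std::time::Duration;

/// Cap and spacing taken from the text above; the names are hypothetical.
const MAX_CAPTURES: u32 = 64;
const MIN_SPACING: Duration = Duration::from_millis(100);

/// Lays out `n` boundaries inside the 10%-90% window of `duration`.
/// Assumption: boundaries are the n interior points of n + 1 equal slices.
fn capture_boundaries(duration: Duration, n: u32) -> Result<Vec<Duration>, String> {
    if n == 0 {
        // 0 disables periodic capture.
        return Ok(Vec::new());
    }
    if n > MAX_CAPTURES {
        return Err(format!("{} captures exceed the cap of {}", n, MAX_CAPTURES));
    }
    let start = duration / 10;
    let window = duration * 8 / 10;
    let spacing = window / (n + 1);
    if spacing < MIN_SPACING {
        return Err(format!("boundary spacing {:?} is below {:?}", spacing, MIN_SPACING));
    }
    Ok((1..=n).map(|i| start + spacing * i).collect())
}
```

Under these assumptions a 10 s workload with num_snapshots = 3 yields boundaries at 3 s, 5 s, and 7 s, while a 1 s workload with num_snapshots = 10 is rejected because the boundaries would sit closer than 100 ms apart.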

post_vm / post_vm_unconditional

Host-side callbacks (fn(&VmResult) -> anyhow::Result<()>) invoked after the VM exits — the place to drain snapshot bridges, read run metrics, and run temporal assertions. post_vm is suppressed when the guest reported failure; post_vm_unconditional always runs (guard it with if !result.success { return Ok(()); } when it reads state a crash may not have produced, and note it never turns a guest failure into a pass). When num_snapshots > 0 and post_vm is omitted, the macro installs a default callback asserting at least one periodic capture landed with real BPF state.
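The interaction between the two callbacks and the guest verdict can be modeled in a few lines. This is a sketch under stated assumptions, not ktstr's dispatch code: RunOutcome is a hypothetical stand-in for VmResult, errors are plain strings instead of anyhow::Error, the order in which the hooks run is assumed, and a hook error is assumed to fail the test.

```rust
/// Hypothetical stand-in for the host-side result type.
struct RunOutcome {
    success: bool,
    captures: usize,
}

type Hook = fn(&RunOutcome) -> Result<(), String>;

/// A check that reads state a crashed guest may never have produced.
fn needs_captures(outcome: &RunOutcome) -> Result<(), String> {
    if outcome.captures == 0 {
        return Err("no captures landed".to_string());
    }
    Ok(())
}

/// The same check with the guard recommended for post_vm_unconditional.
fn guarded_needs_captures(outcome: &RunOutcome) -> Result<(), String> {
    if !outcome.success {
        return Ok(());
    }
    needs_captures(outcome)
}

/// post_vm runs only after a passing guest; post_vm_unconditional always
/// runs. Neither can turn a guest failure into a pass.
fn final_verdict(
    outcome: &RunOutcome,
    post_vm: Option<Hook>,
    post_vm_unconditional: Option<Hook>,
) -> bool {
    let mut pass = outcome.success;
    if outcome.success {
        if let Some(hook) = post_vm {
            pass &= hook(outcome).is_ok();
        }
    }
    if let Some(hook) = post_vm_unconditional {
        pass &= hook(outcome).is_ok();
    }
    pass
}
```

The guard matters only for diagnostics: without it, an unconditional hook on a crashed run reports a second, misleading error on top of the guest failure, but the verdict is a failure either way.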

cleanup_budget_ms

Caps host-side VM teardown wall time; exceeding the budget folds a failing detail into the test result. Unset disables the check; 0 is rejected.

staged_schedulers

staged_schedulers = [PATH, …] packs additional &'static Scheduler binaries into the guest at boot. Required for scenarios that invoke Op::ReplaceScheduler / Op::AttachScheduler — the swap target must already be on disk in the guest. See Ops.

workload_root_cgroup

workload_root_cgroup = "/path" places the per-test workload cgroups under a specific guest cgroup path, decoupled from the scheduler’s cgroup_parent (which roots scheduler-side cells).

wprof / wprof_args

wprof = true attaches the wprof BPF tracer to the workload VM; wprof_args = "..." passes space-separated CLI args. Both require the wprof cargo feature.

payload / workloads / extra_include_files / extra_sched_args

payload = CONST declares the test’s primary benchmark binary and workloads = [A, B] composes more alongside it; the include-file pipeline packs every referenced binary into the guest. See Payloads and Included Files. extra_include_files = ["path", …] adds test-level host files that belong to no particular payload. extra_sched_args = ["--flag", …] appends scheduler CLI args after the scheduler’s own sched_args.

config

Inline scheduler config content, paired with a scheduler that declares config_file_def — covered next.

Inline scheduler config

Some schedulers (e.g. scx_layered, scx_lavd) accept a JSON config file via a CLI argument like --config /path/to/config.json. Two pieces wire this into a test:

  1. Scheduler declaration — declares the arg template and the guest path via config_file_def:

    const LAYERED_SCHED: Scheduler = Scheduler::named("layered")
        .binary(SchedulerSpec::Discover("scx_layered"))
        .config_file_def("--config {file}", "/include-files/layered.json");

    {file} in the arg template is replaced with the guest path. The framework writes the config content to that path inside the guest before the scheduler binary starts.

  2. Test attribute — supplies the inline content:

    const LAYERED_CONFIG: &str = r#"{ "layers": [...] }"#;
    
    #[ktstr_test(scheduler = LAYERED_SCHED, config = LAYERED_CONFIG)]
    fn layered_test(ctx: &Ctx) -> Result<AssertResult> {
        Ok(AssertResult::pass())
    }

    Both a string literal and a path to a const &'static str are accepted.

The pairing gate is bidirectional and enforced at compile time (and again at runtime for programmatic entry construction): a scheduler with config_file_def requires config = … on every test, and a scheduler without it rejects config = … — the content would otherwise be silently dropped.
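The {file} substitution described in step 1 can be sketched as follows. The helper name is hypothetical, and splitting the template on whitespace to form separate argv entries is an assumption about how the framework tokenizes it.

```rust
/// Expands an arg template such as "--config {file}" into argv entries,
/// replacing the {file} placeholder with the guest path.
/// Assumption: the template is split on whitespace before substitution,
/// so a guest path is never split even if it contains spaces.
fn render_config_args(template: &str, guest_path: &str) -> Vec<String> {
    template
        .split_whitespace()
        .map(|token| token.replace("{file}", guest_path))
        .collect()
}
```

With the template "--config {file}" this produces two arguments, while "--config={file}" produces a single joined argument, which is why the template, not the framework, decides the flag style.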

For schedulers that take the same config file on every test, use Scheduler::config_file(host_path) instead — see Scheduler Definitions.

Watching BPF-map fields

watch_bpf_maps turns “the scheduler computed X” into a post-VM assertion. The free-running host monitor reads the named field from the running guest’s BPF-map memory via BTF — without freezing vCPUs — and folds the samples into a run-level metric.

Each declared const is one WatchBpfMap::new(map_name_suffix, field, agg, label):

  • map_name_suffix — matched against a loaded BPF map by ends_with (".bss" for a section global, or a named map like "cpu_ctx_stor").
  • field — a dot-path into the map’s value type ("sys_stat.avg_lat_cri", or a bare global like "lat_headroom").
  • agg — pick by the field’s semantic class:
    • BpfMapAgg::Scalar — a gauge; folded as the mean over the run’s samples.
    • BpfMapAgg::ScalarCounter — a monotonic counter; folded as the value at the last sample (the final total, not a mean of a rising series).
    • BpfMapAgg::PerCpu — a per-CPU gauge array; folded into a cross-CPU mean and max.
    • BpfMapAgg::PerCpuCounter — a per-CPU counter array; folded as the cross-CPU sum at the last sample. Watch at u64 width so no per-CPU slot truncates before the sum.
  • label — the metric-key leaf; must be unique within a test.

The metric key is <scheduler-obj>_<label> (per-CPU gauges get _avg / _max variants). The prefix is libbpf’s object name from the scheduler’s global-section map, which can differ from the ops name — scx-ktstr’s object is bpf_bpf, so its prefix is bpf_bpf. Read the metric back with VmResult::run_metric in a post_vm hook; an absent metric returns None, never a false 0.0.

const AVG_LAT_CRI: WatchBpfMap =
    WatchBpfMap::new(".bss", "sys_stat.avg_lat_cri", BpfMapAgg::Scalar, "avg_lat_cri");
const LAT_HEADROOM: WatchBpfMap =
    WatchBpfMap::new("cpu_ctx_stor", "lat_headroom", BpfMapAgg::PerCpu, "lat_headroom");

fn check(result: &VmResult) -> anyhow::Result<()> {
    let avg_lat_cri = result.run_metric("scx_lavd_avg_lat_cri")
        .ok_or_else(|| anyhow::anyhow!("avg_lat_cri absent"))?;
    let headroom_max = result.run_metric("scx_lavd_lat_headroom_max")
        .ok_or_else(|| anyhow::anyhow!("lat_headroom_max absent"))?;
    anyhow::ensure!(avg_lat_cri.is_finite() && headroom_max.is_finite());
    Ok(())
}

#[ktstr_test(
    scheduler = SCX_LAVD,
    watch_bpf_maps = [AVG_LAT_CRI, LAT_HEADROOM],
    post_vm = check,
)]
fn lat_metrics_surface(ctx: &Ctx) -> anyhow::Result<AssertResult> { /* workload */ }

Resolution is lazy: the maps appear only after the scheduler attaches, so the monitor retries until the named map is present, then caches the resolved offset and re-reads only the leaf bytes each tick.
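The four aggregation classes listed above differ only in how samples are folded. The sketch below models each fold over plain slices; the function names are hypothetical, and the per-CPU gauge fold works on a single sample because how per-tick values are combined over the whole run is not specified here.

```rust
/// Scalar gauge: mean over the run's samples. None when nothing was sampled,
/// mirroring "an absent metric returns None, never a false 0.0".
fn fold_gauge(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    Some(samples.iter().sum::<f64>() / samples.len() as f64)
}

/// Monotonic counter: the value at the last sample (the final total).
fn fold_counter(samples: &[u64]) -> Option<u64> {
    samples.last().copied()
}

/// Per-CPU gauge: cross-CPU (mean, max) for one sample.
fn fold_per_cpu_gauge(slots: &[u64]) -> Option<(f64, f64)> {
    let max = *slots.iter().max()?;
    let mean = slots.iter().map(|&v| v as f64).sum::<f64>() / slots.len() as f64;
    Some((mean, max as f64))
}

/// Per-CPU counter: cross-CPU sum at the last sample, kept at u64 width.
fn fold_per_cpu_counter(samples: &[Vec<u64>]) -> Option<u64> {
    samples.last().map(|slots| slots.iter().sum::<u64>())
}

/// Metric key: <scheduler-obj>_<label>.
fn metric_key(scheduler_obj: &str, label: &str) -> String {
    format!("{scheduler_obj}_{label}")
}
```

Picking the wrong class shows up in the number: folding a counter as a gauge reports the mean of a rising series, roughly half the final total, rather than the total itself.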

What the macro generates

The macro renames the function, registers it in the KTSTR_TESTS distributed slice, and emits a #[test] wrapper that boots the VM and dispatches. Details in the attribute rustdoc.

Scheduler Definitions

A Scheduler tells the framework how to find, configure, and launch the scheduler under test. declare_scheduler! builds one and registers it so both #[ktstr_test] and the verifier sweep can see it:

use ktstr::declare_scheduler;
use ktstr::prelude::*;

declare_scheduler!(MY_SCHED, {
    name = "my_sched",
    binary = "scx_my_sched",
    sched_args = ["--exit-dump-len", "1048576"],
    topology = (1, 2, 4, 1),
});

#[ktstr_test(scheduler = MY_SCHED)]
fn basic(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![ctx.cgroup_def("cg_0"), ctx.cgroup_def("cg_1")])
}

MY_SCHED is the Rust handle tests reference; name = "my_sched" is the user-visible label in nextest output, sidecars, and the CLI. Rename either independently. Once declared, the scheduler shows up in the verifier sweep’s cells with no further wiring:

 Nextest run ID 3522bea7-... with nextest profile: default
    Starting 4 tests across 1 binary (55 tests skipped)
        PASS [  12.406s] (1/4) ktstr::kaslr_axis_e2e verifier/ktstr_sched/kernel_7_0/odd-3llc
        PASS [  12.432s] (2/4) ktstr::kaslr_axis_e2e verifier/ktstr_sched/kernel_7_0/smt-2llc
...

Defining a scheduler

declare_scheduler! emits a pub static MY_SCHED: Scheduler and registers a reference to it in the KTSTR_SCHEDULERS distributed slice, which is what cargo ktstr verifier enumerates. #[ktstr_test(scheduler = ...)] expects the bare ident; the macro takes the reference internally. The ident can carry a visibility prefix (pub, pub(crate)).

Accepted fields

name plus exactly one binary-source key (binary, binary_path, or the kernel_builtin_enable/kernel_builtin_disable pair) are required; every other key is optional.

  • name = "..." — short human name (required).
  • binary = "scx_name" — discover a binary by name. Resolution happens entirely on the host, before the VM boots, and the resolved binary is packed into the guest initramfs — nothing is resolved inside the guest. The cascade: a per-name KTSTR_SCHEDULER_BIN_<NAME> env override, then the global KTSTR_SCHEDULER, then a fresh workspace build via cargo build -p <name> — a failed build refuses to serve a possibly-stale pre-built binary unless KTSTR_SCHEDULER_ALLOW_STALE_FALLBACK is set, which enables the pre-built fallbacks (a sibling of the test binary, then target/{release,debug}/). Test binaries run outside the cargo-ktstr pipeline (KTSTR_CARGO_TEST_MODE=1) skip the build and consult the host PATH and the pre-built fallbacks first instead.
  • binary_path = "/abs/path" — explicit pre-built binary; must exist on the host, packed into the initramfs as-is.
  • kernel_builtin_enable = [...] + kernel_builtin_disable = [...] — paired guest shell-command lists for a scheduler compiled into the kernel (no userspace binary). Both keys must appear together.
  • sched_args = ["--a", "--b"] — scheduler CLI args applied to every test; per-test extra_sched_args append after them.
  • kargs = ["nosmt"] — extra guest kernel command-line arguments (not the scheduler’s CLI — that’s sched_args). Do not override the kargs ktstr injects itself (console=, loglevel=, rdinit=); overriding them breaks guest init.
  • sysctls = [Sysctl::new("kernel.foo", "1")] — applied at guest boot, before the scheduler starts (see below).
  • topology = (numa_nodes, llcs, cores, threads) — default VM topology tests inherit dimension-by-dimension.
  • constraints = TopologyConstraints { ... } — gauntlet constraints tests inherit; see the macro reference.
  • cgroup_parent = "/path" — cgroup subtree the guest creates for the scheduler before it starts (see below).
  • config_file = "configs/my.toml" / config_file_def = (...) — the two config-file seams (see below).
  • assert = Assert::NO_OVERRIDES.max_imbalance_ratio(2.0) — scheduler-level checking overrides, merged between the library defaults and per-test attributes. See Customize Checking.
  • kernels = ["6.14", "6.15..=7.0"] — filters which kernel-list entries this scheduler verifies against in the verifier sweep. Entries use the same grammar as cargo ktstr verifier --kernel (exact versions, inclusive ranges, paths, git specs); empty means no filter. Match semantics live in BPF Verifier Sweep.

Manual definition

The const builder still works when the macro doesn’t fit — e.g. a programmatically composed scheduler, or a fixture that must stay out of the verifier sweep:

use ktstr::prelude::*;

const MITOSIS: Scheduler = Scheduler::named("scx_mitosis")
    .binary(SchedulerSpec::Discover("scx_mitosis"))
    .topology(1, 2, 4, 1)
    .sched_args(&["--exit-dump-len", "1048576"])
    .cgroup_parent("/ktstr")
    .assert(Assert::NO_OVERRIDES.max_imbalance_ratio(2.0));

Scheduler::named("foo").binary_discover("scx_foo") is shorthand for .binary(SchedulerSpec::Discover("scx_foo")) — the argument is the binary name to discover, not the scheduler name. A manual const is not registered in KTSTR_SCHEDULERS, so the verifier sweep does not see it; use declare_scheduler! for anything that should participate in cargo ktstr verifier.

SchedulerSpec

pub enum SchedulerSpec {
    Eevdf,                   // no sched_ext binary — kernel EEVDF
    Discover(&'static str),  // host-side discovery by name
    Path(&'static str),      // explicit host path
    KernelBuiltin {          // compiled into the kernel
        enable: &'static [&'static str],
        disable: &'static [&'static str],
    },
}

Scheduler::EEVDF (binary SchedulerSpec::Eevdf) runs tests under the kernel’s default scheduler and is what #[ktstr_test] uses when scheduler = is omitted. It is not reachable via declare_scheduler! — reference Scheduler::EEVDF directly. Eevdf and KernelBuiltin are excluded from the verifier sweep: neither has a userspace binary to load BPF programs from.

Kernel-builtin example

declare_scheduler!(MINLAT, {
    name = "minlat",
    kernel_builtin_enable = ["echo minlat > /sys/kernel/debug/sched/ext/root/ops"],
    kernel_builtin_disable = ["echo none > /sys/kernel/debug/sched/ext/root/ops"],
});

The enable commands run in the guest before scenarios start; disable runs after they complete.

Sysctls

sysctls takes Sysctl::new("key", "value") pairs (dot-separated keys; duplicates apply in order, last write wins). The framework injects each as sysctl.<key>=<value> on the guest kernel command line, so the kernel applies them at boot — each test gets a fresh VM, so there is no apply/revert step. Sysctl::new is const fn, so a shared tuning block can live in a const slice:

const RT_TUNING: &[Sysctl] = &[
    Sysctl::new("kernel.sched_rt_runtime_us", "950000"),
    Sysctl::new("kernel.numa_balancing", "0"),
];

declare_scheduler!(RT_TUNED, {
    name = "rt_tuned_scx",
    binary = "scx_rt_tuned",
    sysctls = RT_TUNING,
});
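The translation from Sysctl pairs to kernel command-line arguments is mechanical. In the sketch below, Sysctl is a local stand-in that only mirrors the shape of ktstr's type, and sysctl_kargs is a hypothetical helper name; the sysctl.<key>=<value> form is the one quoted above.

```rust
/// Local stand-in for ktstr's Sysctl; only the shape is mirrored.
#[derive(Clone, Copy)]
struct Sysctl {
    key: &'static str,
    value: &'static str,
}

impl Sysctl {
    const fn new(key: &'static str, value: &'static str) -> Self {
        Sysctl { key, value }
    }
}

const RT_TUNING: &[Sysctl] = &[
    Sysctl::new("kernel.sched_rt_runtime_us", "950000"),
    Sysctl::new("kernel.numa_balancing", "0"),
];

/// Renders each pair as a `sysctl.<key>=<value>` boot parameter, keeping
/// declaration order so a later duplicate is the last write.
fn sysctl_kargs(sysctls: &[Sysctl]) -> Vec<String> {
    sysctls
        .iter()
        .map(|s| format!("sysctl.{}={}", s.key, s.value))
        .collect()
}
```

For RT_TUNING this yields sysctl.kernel.sched_rt_runtime_us=950000 followed by sysctl.kernel.numa_balancing=0.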

Config files

Pick one of config_file / config_file_def — they are alternatives.

  • The config is the same file for every test → config_file = "configs/my_sched.toml". The framework packs the host file into the guest at /include-files/{filename} and prepends --config /include-files/{filename} to the scheduler args. The --config flag name is fixed; a scheduler that uses a different flag can still take the packed path via sched_args, but must tolerate the extra --config argument.
  • The config varies per test → config_file_def = ("--config={file}", "/include-files/my.json") declares the arg-template + guest-path pair, and each test supplies content via #[ktstr_test(config = …)]. The pairing is enforced both ways at compile time — see Inline scheduler config.

Both fields may technically coexist (the config_file path is always packed and its flag prepended; the inline config is written when a test supplies config = …), but a two-config launch is rarely what anyone wants — pick one.

Cgroup parent

cgroup_parent = "/ktstr" makes guest init create /sys/fs/cgroup/ktstr (enabling cpuset/cpu controllers on its ancestors) before the scheduler starts. It does not pass --cell-parent-cgroup to the scheduler — a cell-aware scheduler that needs the flag must carry it in sched_args or per-test extra_sched_args, and the guest then also creates the directory named by the flag. Paths are validated at compile time by CgroupPath: they must start with /, must not be / alone, and must not contain "..".

The same validation applies to any --cell-parent-cgroup value found in sched_args / extra_sched_args at test setup: empty values, bare /, relative paths, and a trailing flag with no value all panic with an actionable message instead of resolving to (or next to) the host cgroup root and corrupting host state.
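The path rules and the argument scan can be sketched as two small functions. Both names are hypothetical, the sketch returns errors where the text says the framework panics, and it only handles the flag and its value as separate arguments (whether a --flag=value form is accepted is not stated here).

```rust
/// Applies the three path rules quoted above.
fn validate_cgroup_path(path: &str) -> Result<(), String> {
    if !path.starts_with('/') {
        return Err(format!("cgroup path {path:?} must start with '/'"));
    }
    if path == "/" {
        return Err("cgroup path must not be '/' alone".to_string());
    }
    if path.contains("..") {
        return Err(format!("cgroup path {path:?} must not contain '..'"));
    }
    Ok(())
}

/// Finds a --cell-parent-cgroup value in scheduler args and validates it.
/// Returns Ok(None) when the flag is absent.
fn cell_parent_from_args(args: &[&str]) -> Result<Option<String>, String> {
    let pos = match args.iter().position(|arg| *arg == "--cell-parent-cgroup") {
        Some(pos) => pos,
        None => return Ok(None),
    };
    let value = match args.get(pos + 1) {
        Some(value) => *value,
        None => return Err("--cell-parent-cgroup has no value".to_string()),
    };
    validate_cgroup_path(value)?;
    Ok(Some(value.to_string()))
}
```

An empty value fails the first rule, since an empty string does not start with /, so all four rejected cases in the paragraph above fall out of the same three checks plus the missing-value check.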

Default topology

topology = (numa_nodes, llcs, cores_per_llc, threads_per_core) sets the VM topology tests inherit. Scheduler::named() defaults to (1, 1, 2, 1) — a minimal 2-CPU VM. Tests override individual dimensions; unset ones still inherit:

// Inherits llcs=2, cores=4 from MITOSIS; overrides threads to 2.
#[ktstr_test(scheduler = MITOSIS, threads = 2)]
fn smt_test(ctx: &Ctx) -> Result<AssertResult> { /* ... */ }
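Dimension-by-dimension inheritance amounts to an Option per dimension on the test side. The types below are hypothetical stand-ins, not ktstr's, and only model the resolution rule.

```rust
/// Hypothetical model of a resolved VM topology.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Topology {
    numa_nodes: u32,
    llcs: u32,
    cores: u32,
    threads: u32,
}

/// Per-test overrides: None means "inherit from the scheduler".
#[derive(Default)]
struct Overrides {
    numa_nodes: Option<u32>,
    llcs: Option<u32>,
    cores: Option<u32>,
    threads: Option<u32>,
}

/// Each dimension takes the test's value when set, else the scheduler's.
fn resolve(base: Topology, overrides: &Overrides) -> Topology {
    Topology {
        numa_nodes: overrides.numa_nodes.unwrap_or(base.numa_nodes),
        llcs: overrides.llcs.unwrap_or(base.llcs),
        cores: overrides.cores.unwrap_or(base.cores),
        threads: overrides.threads.unwrap_or(base.threads),
    }
}
```

Resolving the MITOSIS default (1, 2, 4, 1) against an override of threads = 2 gives (1, 2, 4, 2), matching the smt_test example.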

Two #[ktstr_test] attributes complement the scheduler definition: staged_schedulers = [PATH, …] packs extra scheduler binaries for runtime swaps via Op::ReplaceScheduler / Op::AttachScheduler, and workload_root_cgroup = "/path" roots workload cgroups independently of the scheduler’s cgroup_parent. Both are documented in the macro reference.

Payloads

Payload authoring — #[derive(Payload)], metric hints, include files — lives on its own page: Payloads and Included Files.

Payloads and Included Files

Scheduler tests often need a real benchmark running alongside the cgroup workers — schbench for wakeup latency, fio for IO pressure, stress-ng for raw contention. A Payload declares that binary once: its default args, how to parse its output, the metrics it emits, the checks that gate them, and the files it needs packed into the guest.

Declaring a payload

#[derive(Payload)] on a marker struct generates a const Payload. This is a real fixture from ktstr’s own test suite (tests/common/fixtures.rs) — schbench with a machine-parseable JSON summary on stdout:

use ktstr::Payload;

#[derive(Payload)]
#[payload(binary = "schbench", name = "schbench_json", output = Json)]
#[default_args("--runtime", "5", "--message-threads", "2", "--json", "-")]
#[default_check(exit_code_eq(0))]
#[metric(name = "int.rps_pct50.0", polarity = HigherBetter, unit = "rps")]
#[metric(name = "int.wakeup_latency_pct99.0", polarity = LowerBetter, unit = "us")]
#[metric(name = "int.request_latency_pct99.0", polarity = LowerBetter, unit = "us")]
pub struct SchbenchJsonPayload;

The derive emits pub const SCHBENCH_JSON: Payload — the const name is the struct name with a trailing Payload stripped and converted to SCREAMING_SNAKE_CASE (FioPayload → FIO; a suffixless BenchDriver → BENCH_DRIVER). The const’s visibility matches the struct’s.
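The naming rule can be approximated in a few lines. This is a sketch of the rule as described, not the derive's code: the function name is hypothetical, and the word-boundary rule (an uppercase letter following a lowercase letter or digit starts a new word) is an assumption, so names with runs of capitals may convert differently in the real macro.

```rust
/// Derives a const name from a struct name: strip a trailing "Payload",
/// then convert CamelCase to SCREAMING_SNAKE_CASE.
fn payload_const_name(struct_name: &str) -> String {
    let stem = struct_name.strip_suffix("Payload").unwrap_or(struct_name);
    let mut out = String::new();
    let mut prev_lower_or_digit = false;
    for ch in stem.chars() {
        // Assumed boundary rule: lowercase/digit followed by uppercase.
        if ch.is_ascii_uppercase() && prev_lower_or_digit {
            out.push('_');
        }
        out.push(ch.to_ascii_uppercase());
        prev_lower_or_digit = ch.is_ascii_lowercase() || ch.is_ascii_digit();
    }
    out
}
```

Applied to the names in this section, SchbenchJsonPayload becomes SCHBENCH_JSON, FioPayload becomes FIO, and BenchDriver becomes BENCH_DRIVER.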

The attributes:

  • #[payload(binary = "...", name = "...", output = ...)] — binary (required) names the executable the include-file pipeline resolves and packs; name (default: the binary name) is the display label in sidecars and logs, so two fixtures can share one binary; output is Json (parse numeric leaves from the output) or ExitCode (status code only, the default).
  • #[default_args("--a", "--b")] — CLI args prepended to every invocation; per-test .arg(...) calls append after them.
  • #[default_check(exit_code_eq(0))] — a MetricCheck constructor (min, max, range, exists, exit_code_eq); the MetricCheck:: prefix is optional. Repeat the attribute for several checks.
  • #[metric(name = "...", polarity = ..., unit = "...")] — declares a metric the payload emits. polarity is HigherBetter, LowerBetter, TargetValue(x), or Unknown; it drives list-metrics and comparison direction. Duplicate metric names are rejected at expansion.
  • #[include_files("helper", "config.json")] — extra files packed into the guest alongside the binary. The binary itself is auto-prepended, so it never needs listing.

Payload is #[non_exhaustive]: downstream crates cannot use struct-literal construction. For a binary with no declared metrics or args, Payload::binary(name, executable) is the one-line constructor; for anything richer, use the derive. Metrics extracted with no matching #[metric] hint still land in the sidecar with Polarity::Unknown — declare a hint for any metric a comparison verdict should classify.

Using a payload in a test

Reference the const from the test attribute, run it from the body:

#[ktstr_test(scheduler = MY_SCHED, payload = SCHBENCH_JSON, duration_s = 10)]
fn wakeup_latency_under_load(ctx: &Ctx) -> Result<AssertResult> {
    ctx.payload(&SCHBENCH_JSON)
        .arg("--runtime").arg("8")
        .run()
        .map(|(assert_result, _metrics)| assert_result)
}

payload = is the primary slot; workloads = [A, B] composes more payloads alongside it (each runnable via ctx.payload(&A)). The builder returned by ctx.payload(...) inherits the payload’s default args and checks; .arg(...) / .args(...) extend, .in_cgroup(...) places the child, .timeout(...) bounds it, and the terminal .run() blocks and returns Result<(AssertResult, PayloadMetrics)>. Only binary-kind payloads are runnable; the scheduler = slot is separate and takes a bare Scheduler, never a Payload.

Two dedup rules: the same const may not appear in both payload and workloads (or twice in workloads), but two distinct consts that share a binary — like FIO and FIO_JSON — are not deduped and will spawn the binary twice, each with its own argv. Pick one fixture per binary unless two instances are the point.

Metric extraction: stdout first, then stderr

OutputFormat::Json reads the payload’s stdout as the primary stream, then falls back to stderr if stdout is empty or yields no metrics. Some benchmarks emit their numbers only to stderr — schbench, for example, writes its Wakeup Latencies percentiles / Request Latencies percentiles blocks via fprintf(stderr, ...) and leaves stdout blank (pass --json - for a machine-parseable summary on stdout). The fallback keeps those benchmarks usable without a redirect.

Consequence: a payload that writes mixed output to both streams has metrics extracted from stdout only, because the fallback fires solely when the primary stream yields nothing parseable. If you care about stderr-side numbers for a stdout-emitting binary, redirect stderr into stdout at the payload layer.

stress-ng is the mirror trap: progress and per-stressor summaries go to stderr and stdout is blank, so the fallback sees prose and OutputFormat::Json returns zero metrics. Keep OutputFormat::ExitCode for stress-ng unless the payload is wired to emit JSON on stdout.
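The fallback rule is easiest to see with the parser factored out. In this sketch, extract_metrics is a hypothetical name and parse_kv is a toy name=value parser standing in for JSON numeric-leaf extraction; only the stream-selection logic reflects the behavior described above.

```rust
/// Parses stdout first; falls back to stderr only when stdout yields
/// no metrics at all.
fn extract_metrics(
    stdout: &str,
    stderr: &str,
    parse: fn(&str) -> Vec<(String, f64)>,
) -> Vec<(String, f64)> {
    let primary = parse(stdout);
    if !primary.is_empty() {
        return primary;
    }
    parse(stderr)
}

/// Toy stand-in parser: one `name=value` pair per line, non-numeric
/// lines skipped. The real extractor reads JSON.
fn parse_kv(text: &str) -> Vec<(String, f64)> {
    text.lines()
        .filter_map(|line| {
            let (name, value) = line.split_once('=')?;
            Some((name.trim().to_string(), value.trim().parse::<f64>().ok()?))
        })
        .collect()
}
```

The three cases from the text map directly onto this function: a stderr-only benchmark is picked up by the fallback, a mixed-output benchmark loses its stderr numbers because stdout already produced metrics, and prose on stderr yields an empty result.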

Included files

Payloads declare their guest-filesystem dependencies on the Payload itself via #[include_files(...)], instead of relying on the CLI -i / --include-files flag at every invocation. Specs are resolved at test time through the same pipeline the CLI flag uses (see ktstr shell).

Spec shapes

Which branch fires is decided by the shape of the string:

  • Bare name (single component, no /) — looked up in the current working directory first, then the host PATH. Packed as include-files/<filename>. "fio" → host /usr/bin/fio → guest /include-files/fio.
  • Relative or absolute path — used verbatim and must exist; relative paths resolve against the harness’s working directory at test time. Packed as include-files/<filename>. "./test-fixtures/workload.json" → guest /include-files/workload.json.
  • Directory — walked recursively (symlinks followed, non-regular files skipped); the basename becomes the root. "./helpers" containing a.sh and sub/b.sh → guest /include-files/helpers/a.sh and /include-files/helpers/sub/b.sh.

Strings in the test-level extra_include_files attribute follow the same three shapes. They are not anchored to CARGO_MANIFEST_DIR — they resolve against the working directory at test time, plus PATH for bare names, and the attribute accepts plain string literals only (no concat!(env!(...))). For fixtures shipped alongside test source, the reliable options are a bare name placed on PATH by a setup step, or a relative path rooted where the test is invoked.
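The mapping from spec string to guest path can be sketched without touching the filesystem. The names below are hypothetical; the sketch classifies by string shape only, so telling a directory from a file (which needs a filesystem check) is left to the caller.

```rust
use std::path::Path;

#[derive(Debug, PartialEq)]
enum SpecShape {
    /// Single component, no '/': looked up in the cwd, then PATH.
    BareName,
    /// Relative or absolute path: used verbatim.
    PathLike,
}

fn classify(spec: &str) -> SpecShape {
    if spec.contains('/') {
        SpecShape::PathLike
    } else {
        SpecShape::BareName
    }
}

/// A single file lands at /include-files/<filename>.
fn guest_path_for_file(host_path: &str) -> Option<String> {
    let name = Path::new(host_path).file_name()?.to_str()?;
    Some(format!("/include-files/{name}"))
}

/// A file found under a directory spec keeps its path relative to the
/// directory, rooted at the directory's basename.
fn guest_path_in_dir(dir_spec: &str, relative_file: &str) -> Option<String> {
    let root = Path::new(dir_spec).file_name()?.to_str()?;
    Some(format!("/include-files/{root}/{relative_file}"))
}
```

Because single files are keyed by filename alone, two specs in different host directories with the same filename target the same guest slot.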

A fully declarative test

use ktstr::declare_scheduler;
use ktstr::prelude::*;

declare_scheduler!(MY_SCHED, {
    name = "my_sched",
    binary = "scx_my_sched",
    topology = (1, 1, 2, 1),
});

#[derive(Payload)]
#[payload(binary = "bench-driver")]
#[include_files("bench-helper")]
#[metric(name = "ops_per_sec", polarity = HigherBetter, unit = "ops/s")]
struct BenchDriver;

#[ktstr_test(
    scheduler = MY_SCHED,
    payload = BENCH_DRIVER,
    extra_include_files = ["test-fixtures/workload.json"],
    duration_s = 5,
)]
fn bench_driver_runs_with_declared_helpers(ctx: &Ctx) -> Result<AssertResult> {
    // bench-driver, bench-helper, and workload.json all land in the
    // guest at /include-files/ and are on the worker's PATH; no -i
    // flag on any host-side invocation.
    ctx.payload(&BENCH_DRIVER)
        .run()
        .map(|(assert_result, _metrics)| assert_result)
}

The declarative set — the payload’s include_files, each workload’s, and extra_include_files — is aggregated at test time and deduped on identical (archive path, host path) pairs. Two declarations that resolve to the same archive slot with different host paths are a hard error naming both host paths, rather than one silently overwriting the other.
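The dedup and conflict rule can be expressed as a single pass over (archive path, host path) pairs. The function name and error wording are hypothetical; the behavior follows the paragraph above.

```rust
use std::collections::BTreeMap;

/// Aggregates (archive path, host path) declarations. Identical pairs
/// collapse to one entry; the same archive slot claimed by two different
/// host paths is an error naming both.
fn aggregate_includes(entries: &[(&str, &str)]) -> Result<BTreeMap<String, String>, String> {
    let mut packed: BTreeMap<String, String> = BTreeMap::new();
    for &(archive, host) in entries {
        if let Some(existing) = packed.get(archive) {
            if existing.as_str() != host {
                return Err(format!(
                    "archive path {archive} claimed by both {existing} and {host}"
                ));
            }
            continue; // exact duplicate, already packed
        }
        packed.insert(archive.to_string(), host.to_string());
    }
    Ok(packed)
}
```

Declaring the same helper on a payload and again in extra_include_files is therefore harmless, while two different host files that share a filename fail loudly instead of one shadowing the other.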

Probe-wiring environment variables

Two variables pack the jemalloc allocator probe pair into the guest: KTSTR_JEMALLOC_PROBE_BINARY and KTSTR_JEMALLOC_ALLOC_WORKER_BINARY (absolute host paths; unset means no probe is packed). They must be populated before ktstr’s nextest pre-dispatch runs — plain test-body code is too late — so tests that need them set both from a #[ctor] constructor, using the re-export at ktstr::__private::ctor to avoid a second ctor crate in the dependency tree. See Environment Variables for the reference rows.

Custom Scenarios

The body of a #[ktstr_test] function is the scenario — there is no separate registration step. Most bodies hand control to a canned scenario or to execute_defs / execute_steps; a custom scenario is the same function keeping control and driving cgroups, workers, and assertions itself.

For dynamic scenarios (cgroup creation/removal, cpuset changes), prefer the ops/steps system over a hand-written scenario. Reach for custom code only when ops cannot express the logic:

  • Cgroups created, removed, or resized at fixed points in the run — ops cover it.
  • Different work types, worker counts, or phase-scoped checks per step — ops cover it.
  • Snapshot captures at chosen points — ops cover it (Snapshots).
  • Branching on state observed mid-run, computing cpusets from runtime conditions, or asserting directly on raw WorkerReports — ops cannot; write a custom scenario.

A worked custom scenario

Shrink one cgroup’s cpuset mid-run — a decision ops cannot make, because the second half’s cpuset and the assertion both depend on runtime state — then assert the scheduler actually moved the workers:

use ktstr::prelude::*;
use ktstr::scenario::*;

#[ktstr_test(llcs = 2, cores = 4, threads = 1, duration_s = 10)]
fn workers_follow_cpuset_shrink(ctx: &Ctx) -> Result<AssertResult> {
    let wl = dfl_wl(ctx);
    // Creates cg_0 and cg_1, spawns and starts workers in each.
    let (mut handles, _guard) = setup_cgroups(ctx, 2, &wl)?;

    // First half: full topology.
    std::thread::sleep(ctx.duration / 2);

    // Mid-run: pin cg_0 to LLC 1 only.
    let llc1 = ctx.topo.llc_aligned_cpuset(1);
    ctx.cgroups.set_cpuset("cg_0", &llc1)?;

    // Second half: workers must migrate off LLC 0.
    std::thread::sleep(ctx.duration / 2);

    let cg0_reports = handles.remove(0).stop_and_collect();
    let migrations: u64 = cg0_reports.iter().map(|r| r.migration_count).sum();
    anyhow::ensure!(
        migrations > 0,
        "cpuset shrink forced no migrations — cg_0 workers never moved"
    );

    let mut result = ctx.assert.assert_cgroup(&cg0_reports, None);
    result.merge(collect_all(handles, &ctx.assert));
    Ok(result)
}

Bind the CgroupGroup to a named variable (_guard, not a bare _, which would drop it immediately) so the cgroups live until end of scope — see CgroupGroup for drop semantics. Sleeping for fractions of ctx.duration (rather than a hard-coded period) keeps the scenario composable with duration_s = N overrides and the gauntlet budget controller.

Imports: setup_cgroups, dfl_wl, collect_all, and spawn_diverse live in ktstr::scenario, not in the prelude. The use ktstr::scenario::*; line is required — use ktstr::prelude::*; alone does not bring them into scope.

Helper functions

setup_cgroups(ctx, n, wl) — creates cgroups cg_0..cg_{n-1}, spawns and starts workers in each, and returns (Vec<WorkloadHandle>, CgroupGroup) with handles in cgroup order.

collect_all(handles, checks) — stops all workers and collects reports. Per-cgroup telemetry is always produced; only the checks the caller enabled record assertion outcomes, and with no checks enabled the result stays pass (there is no implicit starvation fallback).

dfl_wl(ctx) — a WorkloadConfig with ctx.workers_per_cgroup workers and default settings (WorkType::SpinWait).

spawn_diverse(ctx, cgroup_names) — spawns rotating work types across cgroups (SpinWait, Bursty, IoSyncWrite, Mixed, YieldHeavy); IoSyncWrite cgroups always get 2 workers so blocking IO does not drown the scenario.

Custom work functions

When the built-in work types don’t generate the load pattern you need, WorkType::Custom runs a user-supplied work function inside each worker. The framework handles fork, cgroup placement, affinity, and signal setup; the function owns the work loop and all WorkerReport population — framework telemetry (migration tracking, gap detection, schedstat deltas) is not provided.

use std::sync::atomic::Ordering;
use ktstr::workload::{WorkType, WorkerCtx, WorkerReport};

fn my_workload(ctx: &WorkerCtx) -> WorkerReport {
    let tid: i32 = std::process::id() as i32; // one worker = one process
    let start = std::time::Instant::now();
    let mut work_units = 0u64;
    while !ctx.stop().load(Ordering::Relaxed) {
        // ... custom work ...
        work_units += 1;
    }
    // Start from default() so unpopulated fields stay zero/empty.
    WorkerReport {
        tid,
        work_units,
        iterations: work_units,
        wall_time_ns: start.elapsed().as_nanos() as u64,
        ..WorkerReport::default()
    }
}

let wt = WorkType::custom("my_workload", my_workload);

WorkerCtx exposes the stop flag (ctx.stop()), ctx.cpus(), ctx.sibling_pids(), ctx.cgroup_dir(), and ctx.cfg(). Only plain function pointers are accepted — they carry no captured state across the fork boundary; closures are not supported. To pass per-worker configuration, build the work type with WorkType::custom_with(name, run, cfg): CustomCfg is a Copy POD payload inherited byte-faithfully across fork. For genuinely shared state, allocate a MAP_SHARED region and pass its address through a u64 slot.

Warning

Every worker calls setpgid(0, 0) after fork, and teardown SIGKILLs the worker’s whole process group — twice (at collect and at handle drop). Any child a custom function spawns inherits that pgid and dies with it. A child that must outlive the worker needs setpgid(child_pid, 0) after fork, or an explicit wait before the function returns.
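For children launched through std::process::Command, the standard library offers a way to request a fresh process group at spawn time. The sketch below is one possible approach on Unix, not something the ktstr docs prescribe: spawn_in_own_group is a hypothetical helper, and process_group(0) asks the child to become the leader of a new group whose id equals its own pid, which is the effect the setpgid(child_pid, 0) advice above is after.

```rust
use std::os::unix::process::CommandExt;
use std::process::{Child, Command};

/// Spawns `program` in its own process group (Unix only), so a signal
/// sent to the parent's process group does not reach it.
fn spawn_in_own_group(program: &str, args: &[&str]) -> std::io::Result<Child> {
    Command::new(program)
        .args(args)
        .process_group(0) // 0 = new group led by the child itself
        .spawn()
}
```

A child that only needs to finish before teardown does not need this; waiting on the Child before the work function returns is enough.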

The Ctx fields scenario authors use

  • ctx.cgroups — create/remove cgroups, set cpusets, move tasks. A &dyn CgroupOps trait object; CgroupManager is the production implementation.
  • ctx.topo — CPU/LLC/NUMA queries and cpuset generation. See Topology.
  • ctx.duration — the workload wall-clock budget; sleep against this, not a literal.
  • ctx.settle — time to wait after cgroup creation for the scheduler to stabilize.
  • ctx.workers_per_cgroup — default per-cgroup worker count (dfl_wl reads it; there is no workers test attribute — set counts via CgroupDef::named("x").workers(n)).
  • ctx.sched_pid — scheduler PID for liveness checks; None when running under kernel-default EEVDF.
  • ctx.assert — the merged check set (defaults → scheduler → per-test). Pass to collect_all / assert_cgroup so attribute overrides actually apply.
  • ctx.work_type_override — gauntlet-supplied work type applied to CgroupDefs marked swappable; it does not affect dfl_wl.
  • ctx.current_step — live phase counter (0 = baseline, 1..=N = step ordinal), readable via ctx.current_step.load(Ordering::Acquire) to gate behavior on phase; periodic captures are stamped with the same value.

The remaining fields are framework wiring; see the Ctx rustdoc.
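Gating behavior on the phase counter is an ordinary atomic load. In the sketch below the counter is modeled as an AtomicU32, which is an assumption about its width, and phase_label is a hypothetical helper; the Acquire ordering and the 0 = baseline convention come from the list above.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Maps the live phase counter to a label: 0 is the baseline phase,
/// 1..=N are step ordinals.
fn phase_label(current_step: &AtomicU32) -> String {
    match current_step.load(Ordering::Acquire) {
        0 => "baseline".to_string(),
        n => format!("step-{n}"),
    }
}
```

A custom scenario could call a helper like this inside its loop to skip an expensive check during baseline and enable it once the first step begins.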

Snapshots

Was the scheduler’s per-task state right in the middle of the run? A snapshot answers that: the freeze coordinator pauses every vCPU long enough to walk the kernel’s BPF maps, BTF-render every captured value, and store the result under a name you choose. Test code reads it back through a typed accessor whose errors carry the available alternatives — a typo’d map or field name tells you what was actually there.

Three capture triggers share this machinery:

| Capture | Trigger | The question it answers |
| --- | --- | --- |
| Op::capture_snapshot (this page) | a chosen point in the scenario | what does state look like right now? |
| Watch Snapshots | a kernel write to a named symbol | what was state at the instant the kernel touched X? |
| Periodic Capture | evenly spaced boundaries | how does state evolve across the run? |

In a #[ktstr_test] scenario the pipeline is wired automatically: the op sends a request from the guest to the host coordinator, which freezes, captures, and stores the report on the host-side SnapshotBridge. The test reads captures after the VM exits, in a post_vm callback. No bridge setup is needed — manual wiring exists only for host-side unit tests.

Capturing and reading

use ktstr::prelude::*;

fn inspect_after_spawn(result: &VmResult) -> anyhow::Result<()> {
    let drained = result.snapshot_bridge.drain_ordered_with_stats();
    let entry = drained
        .iter()
        .find(|e| e.tag == "after_spawn")
        .ok_or_else(|| anyhow::anyhow!("snapshot 'after_spawn' missing"))?;
    let snap = Snapshot::new(&entry.report);

    let nr_dispatched = snap.var("nr_dispatched").as_u64()?;
    anyhow::ensure!(nr_dispatched > 0, "scheduler never dispatched");
    Ok(())
}

#[ktstr_test(scheduler = MY_SCHED, post_vm = inspect_after_spawn)]
fn snapshot_then_inspect(ctx: &Ctx) -> Result<AssertResult> {
    let steps = vec![Step {
        setup: vec![ctx.cgroup_def("workers")].into(),
        ops: vec![Op::capture_snapshot("after_spawn")],
        hold: HoldSpec::FULL,
    }];
    execute_steps(ctx, steps)
}

A scenario may issue any number of Op::capture_snapshot ops with distinct names; reusing a name overwrites the prior capture (with a warning). If the capture pipeline is unavailable, the op fails loudly — a snapshot that silently didn’t happen would let assertions that depend on it pass vacuously.

The accessor surface

Snapshot::new(report) builds a borrowed view; accessors walk the report in place.

Maps and globals

let map = snap.map("scx_per_task")?;         // a captured map by name
let nr = snap.var("nr_cpus_onln").as_u64()?; // a top-level global

var(name) searches every *.bss / *.data / *.rodata global-section map for a top-level member. When several schedulers’ sections carry the same name, var first tries to resolve the active scheduler’s copy automatically; live_var(name) opts into that active-scheduler filter explicitly, and map(name) addresses one scheduler’s section directly. Note var does not split dotted paths — to walk into a struct global, chain: snap.var("ctx").get("weight").

Entries inside a map

let first   = map.at(0);                                               // by index
let busy    = map.find(|e| e.get("tid").as_i64().unwrap_or(-1) == 1234);
let busiest = map.max_by(|e| e.get("runtime_ns").as_u64().unwrap_or(0));
let active  = map.filter(|e| e.get("runtime_ns").as_u64().unwrap_or(0) > 0);

Per-CPU maps (BPF_MAP_TYPE_PERCPU_*) need narrowing before reading: map.cpu(1).at(0). Calling get on a per-CPU entry without .cpu(N) first is an error, not a silent first-slot read.

Dotted paths and terminal reads

get(path) walks struct members along a dotted path (entry.get("ctx.weight") is equivalent to entry.get("ctx").get("weight")), transparently following pointer dereferences up to 16 hops: you write the path the BTF suggests, and indirection is invisible. get("") returns the current value, for terminal reads on scalar per-CPU slots.
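The equivalence between a dotted path and chained calls can be sketched on a toy value tree. This is a simplified model, not ktstr's implementation: the `Value` enum and the free function `get` are hypothetical, and pointer chasing is not modeled.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a rendered value: a scalar or a struct.
#[derive(Debug)]
enum Value {
    Scalar(u64),
    Struct(BTreeMap<String, Value>),
}

/// Walk `path` one dotted component at a time; "" returns `root` itself.
fn get<'a>(root: &'a Value, path: &str) -> Result<&'a Value, String> {
    if path.is_empty() {
        return Ok(root);
    }
    let mut cur = root;
    for component in path.split('.') {
        match cur {
            Value::Struct(members) => {
                // On a miss, report what was actually available at this depth.
                cur = members.get(component).ok_or_else(|| {
                    format!(
                        "component '{component}' not found (members: {:?})",
                        members.keys().collect::<Vec<_>>()
                    )
                })?;
            }
            Value::Scalar(_) => {
                return Err(format!("component '{component}' requested on a scalar"));
            }
        }
    }
    Ok(cur)
}
```

In this model `get(&root, "ctx.weight")` and `get(get(&root, "ctx")?, "weight")` reach the same node, and a misspelled component yields an error that lists the members present at that depth.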

| Method | Returns | Accepts |
|---|---|---|
| as_u64() | u64 | Uint, non-negative Int/Enum, Bool, Char, Ptr (raw pointer value) |
| as_i64() | i64 | Int, Uint ≤ i64::MAX, Bool, Char, Enum |
| as_bool() | bool | Bool; non-zero scalar is true |
| as_f64() | f64 | Float, Int, Uint, Enum |
| as_str() | &str | Enum with a resolved variant name |
| raw() | Option<&RenderedValue> | the underlying rendered value |
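The range rules in the as_u64() and as_i64() rows can be sketched with a toy value type. This is an illustration only: the `Rendered` enum and both free functions are hypothetical, only the Uint, Int, and Bool cases are modeled, and the Uint entry for as_i64() is read as bounded by i64::MAX.

```rust
use std::convert::TryFrom;

/// Hypothetical stand-in for a rendered BTF scalar (not ktstr's real type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Rendered {
    Uint(u64),
    Int(i64),
    Bool(bool),
}

/// Models the as_u64() row: Uint, non-negative Int, Bool.
fn as_u64(v: Rendered) -> Result<u64, String> {
    match v {
        Rendered::Uint(u) => Ok(u),
        Rendered::Int(i) => {
            u64::try_from(i).map_err(|_| format!("negative Int {i} cannot read as u64"))
        }
        Rendered::Bool(b) => Ok(b as u64),
    }
}

/// Models the as_i64() row: Int, Uint that fits in i64, Bool.
fn as_i64(v: Rendered) -> Result<i64, String> {
    match v {
        Rendered::Int(i) => Ok(i),
        Rendered::Uint(u) => {
            i64::try_from(u).map_err(|_| format!("Uint {u} exceeds i64::MAX"))
        }
        Rendered::Bool(b) => Ok(b as i64),
    }
}
```

The point of the asymmetry is that neither conversion silently wraps: a negative Int read as u64, or an oversized Uint read as i64, surfaces as an error instead of a reinterpreted bit pattern.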

Errors carry the fix

Every accessor returns Result<_, SnapshotError>, and each variant carries what you need to correct the call site without re-running the test. The rendered messages (quoted from the Display impl):

  • Snapshot::map miss — snapshot has no map '{requested}' (captured maps: {available:?})
  • Snapshot::var miss — snapshot has no global variable '{requested}' in any *.bss/*.data/*.rodata map (available globals: {available:?})
  • ambiguous global — snapshot global '{requested}' is ambiguous (found in {found_in:?}); use Snapshot::active().var(name) (or the shorthand Snapshot::live_var(name)) to pick the active scheduler's copy automatically, or Snapshot::map(name) to address a specific scheduler's bss explicitly
  • path-walk miss — path '{requested}': component '{component}' (after walking '{walked}') not found (members at this depth: {available:?})
  • wrong terminal type — path '{requested}': cannot read as {expected} — actual rendered variant is {actual}
  • predicate miss (find / max_by) — map '{map}': {op} matched none of {len} entries (first {sampled}: {available_keys:?}); an empty map instead renders map '{map}': {op} matched no entries (map is empty), distinguishing it from a populated map whose every entry the predicate rejected. When every sampled key renders as raw hex (no BTF for the key type at capture time), the message appends a hint naming CONFIG_DEBUG_INFO_BTF=y as the fix.

Two variants matter for series-based assertions and are routed specially by the temporal patterns: PlaceholderSample (the freeze rendezvous timed out, so the report carries no real data — skipped, never counted as zero progress) and MissingStats (the per-sample scx_stats request failed or no stats client was wired — distinct from an in-JSON path miss so the assertion site can branch on the cause).

SnapshotError implements std::error::Error, so it composes with ? and anyhow.

Cast-recovered pointers

Schedulers stash kernel and arena pointers in fields whose BTF says u64, because BTF cannot express a pointer to a per-allocation type. The host-side cast analyzer recovers the real target type from the scheduler’s instruction stream, and the renderer chases the pointer into the right address space. For the test author:

  • as_u64() still returns the raw pointer value — existing tests keep working.
  • Dotted-path walks follow the recovered chase transparently; nested fields appear under the same path a natively-typed pointer would give.
  • Rendered dumps annotate recovered pointers so you can tell them from BTF-typed ones — no extra calls needed to consume them.

This is what the annotations look like in a real failure dump (scx-ktstr’s .bss, from the run on the macro reference page):

map bpf_bpf.bss (type=array, value_size=448, max_entries=1)
.bss:
  scx_arena_verify_once=true   ktstr_alloc_count=76   nr_dispatched=907
  nr_enqueued=495              nr_select_cpu=372      stats_magic=6004496034161779060
...
  scx_task_allocator scx_allocator:
...
    root 0x100000006000 → sdt_desc:
      nr_free=512
      chunk 0x100000007000 (sdt_alloc) → ktstr_arena_ctx{}
  ktstr_bss_arena_holder ktstr_bss_arena_holder:
    bss_plain_counter=76
    arena_target 0x10000000aa80 (cast→arena) [chase: arena chase: STX-flow path tagged slot as Arena with deferred resolve; bridge had no entry for 0x10000000aa80]

(cast→arena) / (cast→kernel) mark analyzer-recovered pointers; (sdt_alloc) marks a forward-declared arena type resolved through the allocator bridge. The full annotation taxonomy lives in Monitor.

Composing reads with writes

Snapshots are the read half of host↔guest interaction. The write half is the #[ktstr_test] attribute bpf_map_write = CONST — a one-shot host-side poke at scheduler-load time:

use ktstr::prelude::*;

const TRIGGER_FAULT: BpfMapWrite = BpfMapWrite::new(".bss", "crash", 1);
// (map_name_suffix, BPF global variable name, u32 value). The
// variable's byte offset is resolved from the map's program BTF at
// write time.

#[ktstr_test(scheduler = MY_SCHED, bpf_map_write = TRIGGER_FAULT, expect_err = true)]
fn fault_then_inspect(ctx: &Ctx) -> Result<AssertResult> {
    // The host writes 1 into the scheduler's `crash` global before
    // workers start; the scheduler reads the flag and reacts.
    /* Op::capture_snapshot + post_vm read as above */
    Ok(AssertResult::pass())
}

The write waits for the scheduler’s map to appear, resolves the named variable to an offset via BTF, writes the value, and signals completion to the guest before workers spawn. Only BPF_MAP_TYPE_ARRAY maps are supported. A read+write test then composes naturally: seed a flag with bpf_map_write, run the scenario, capture with Op::capture_snapshot, assert on the scheduler’s reaction through the Snapshot accessors.

There is no op for runtime writes — mid-scenario mutation belongs to interfaces the scheduler itself exports (sysfs, debugfs, a BPF map command interface) driven from a workload process.

Harness internals: manual bridge wiring

Warning

Do not install a thread-local bridge inside a #[ktstr_test] scenario that boots a VM — the host coordinator owns the bridge there, and a scenario-local one would shadow it. Read captures in post_vm from VmResult::snapshot_bridge instead.

Host-side unit tests that exercise the executor without booting a guest install a fixture bridge:

let cb: CaptureCallback = std::sync::Arc::new(|_name: &str| {
    Some(FailureDumpReport::default())   // hand-crafted report
});
let bridge = SnapshotBridge::new(cb);
let handle = bridge.clone();
let _guard = bridge.set_thread_local();
// ... execute_steps(...) ... then handle.drain() ...

set_thread_local returns a guard that restores the prior bridge on drop; bind it to _guard, not let _ = — the latter drops the guard immediately and clears the bridge before any op runs. tests/snapshot_e2e.rs exercises this pattern end-to-end.
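The binding rule is plain Rust drop semantics and can be demonstrated without ktstr. In this sketch the `Guard` type, `install`, and the `INSTALLED` flag are hypothetical stand-ins for the bridge guard; the real guard restores the prior bridge, while this one just clears a flag.

```rust
use std::cell::Cell;

thread_local! {
    // Stand-in for the thread-local bridge slot: true while "installed".
    static INSTALLED: Cell<bool> = Cell::new(false);
}

/// Hypothetical guard: installs on creation, uninstalls on drop.
struct Guard;

fn install() -> Guard {
    INSTALLED.with(|c| c.set(true));
    Guard
}

impl Drop for Guard {
    fn drop(&mut self) {
        INSTALLED.with(|c| c.set(false));
    }
}

fn is_installed() -> bool {
    INSTALLED.with(|c| c.get())
}

/// `let _ =` does not bind the value, so the guard drops at the end of
/// the statement and the slot is already cleared on the next line.
fn with_underscore() -> bool {
    let _ = install();
    is_installed()
}

/// A named binding keeps the guard alive until the end of the scope.
fn with_named_guard() -> bool {
    let _guard = install();
    is_installed()
}
```

Here `with_underscore()` returns false and `with_named_guard()` returns true, which is the difference between a bridge that is gone before the first op and one that lives for the whole test body.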

Watch Snapshots

What did the scheduler look like at the exact instant the kernel wrote a specific variable? Op::watch_snapshot("symbol") arms a hardware data-write watchpoint on a named kernel symbol; every guest write to it triggers a full snapshot capture, tagged with the symbol name. Where Op::capture_snapshot answers “what does state look like at this point in my scenario”, Op::watch_snapshot answers “what was state when the kernel did X”.

Watch snapshots are supported on x86_64 and aarch64 KVM hosts; each architecture’s KVM plumbing maps the slots onto its native hardware-watchpoint facility.

Issuing a watch

use ktstr::prelude::*;

fn read_watch_fires(result: &VmResult) -> anyhow::Result<()> {
    let drained = result.snapshot_bridge.drain_ordered_with_stats();
    // Each fire is stored under the symbol name as its tag.
    let fires = drained.iter().filter(|e| e.tag == "scx_watchdog_timestamp");
    anyhow::ensure!(fires.count() > 0, "watchpoint never fired");
    Ok(())
}

#[ktstr_test(scheduler = MY_SCHED, post_vm = read_watch_fires)]
fn watch_watchdog_writes(ctx: &Ctx) -> Result<AssertResult> {
    let steps = vec![
        Step::with_defs(vec![ctx.cgroup_def("workers")], HoldSpec::FULL)
            .set_ops(vec![Op::watch_snapshot("scx_watchdog_timestamp")]),
    ];
    execute_steps(ctx, steps)
}

In a VM-booting #[ktstr_test], the wiring is automatic: the op registers the symbol with the host coordinator, which resolves the address from the vmlinux ELF, arms a free hardware watchpoint slot via KVM_SET_GUEST_DEBUG, and stores one capture per fire on the host-side bridge. Read the captures in post_vm through the same Snapshot accessors every capture kind shares. When a sidecar dump path is configured for the run, each fire’s report is also mirrored to a tagged JSON file for post-hoc inspection.

Choosing a symbol

Production resolution is a verbatim, byte-for-byte match against the vmlinux ELF symbol table — no prefix stripping, no BTF lookup, no kallsyms walk. Use exactly the name nm prints:

nm vmlinux | grep -w scx_watchdog_timestamp

A string that matches nothing fails the step with symbol '<name>' not found in vmlinux symtab (typo, symbol stripped from the build, or a non-ELF kernel image).

Warning

High-frequency symbols soft-lock the guest. Watching a symbol the kernel writes every jiffy (e.g. jiffies_64 at HZ=1000) fires 1000+ captures per second, and each capture freezes all vCPUs for the full dump pipeline. The guest spends almost all of its wall time paused — schedulers stall, watchdogs fire, and the test wedges before any meaningful work runs. Pick symbols the kernel writes at scenario-relevant cadence: a state field, a per-event counter.
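A rough back-of-the-envelope model shows why this wedges. It assumes, as a simplification, that fires accrue only while the guest is running and that every fire costs a fixed freeze; the function name is hypothetical, and the 10 ms figure is the low end of the steady-state freeze quoted under Periodic Capture.

```rust
/// Fraction of wall time the guest actually runs, assuming each second
/// of guest run time triggers `fires_per_guest_sec` captures and each
/// capture pauses all vCPUs for `freeze_s` seconds.
fn guest_run_fraction(fires_per_guest_sec: f64, freeze_s: f64) -> f64 {
    // One second of guest progress costs 1 + fires * freeze wall seconds.
    1.0 / (1.0 + fires_per_guest_sec * freeze_s)
}
```

Under these assumptions, 1000 fires per second at 10 ms each leaves the guest running about 1/11 of the time, and at 100 ms each about 1/101. A symbol written a few times per second keeps the fraction close to 1.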

Three watches per scenario

The cap is 3, tied to the hardware watchpoint slots KVM exposes: slot 0 is permanently reserved for the *scx_root->exit_kind trigger that drives the failure-dump pipeline on SCX_EXIT_ERROR (it always runs, whether or not a scenario declares watches), and the remaining three user slots are yours. A fourth Op::watch_snapshot fails the step with the pinned message:

Op::WatchSnapshot cap exceeded: scenario already registered 3
watchpoints (3 user watchpoint slots occupied; slot 0 reserved for
the error-class exit_kind trigger). Drop a watch or use
Op::CaptureSnapshot for a time-driven capture instead.

A failed registration — cap exceeded, resolution failure, callback error — does not consume a slot; the bridge rolls the count back so the scenario can retry with a different symbol.
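The cap-plus-rollback behavior can be sketched as a small registry. This is a model of the rule described above, not ktstr's bridge: `WatchSlots`, `register`, and the `arm` closure (standing in for symbol resolution and the KVM arm call) are all hypothetical.

```rust
/// Number of user watchpoint slots (slot 0 is reserved, per the text above).
const USER_SLOTS: usize = 3;

/// Hypothetical registry of armed user watches.
struct WatchSlots {
    symbols: Vec<String>,
}

impl WatchSlots {
    fn new() -> Self {
        Self { symbols: Vec::new() }
    }

    /// Try to register `symbol`; `arm` stands in for resolve + arm.
    fn register<F>(&mut self, symbol: &str, arm: F) -> Result<(), String>
    where
        F: FnOnce(&str) -> Result<(), String>,
    {
        if self.symbols.len() >= USER_SLOTS {
            return Err(format!("cap exceeded: {USER_SLOTS} user slots occupied"));
        }
        // Reserve the slot tentatively, then roll back if arming fails,
        // so a failed registration never consumes a slot.
        self.symbols.push(symbol.to_string());
        if let Err(e) = arm(symbol) {
            self.symbols.pop();
            return Err(e);
        }
        Ok(())
    }

    fn used(&self) -> usize {
        self.symbols.len()
    }
}
```

After one successful and one failed registration, `used()` is 1, so a retry with a different symbol still has two slots available.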

Failure modes

Registration is the single point where the production pipeline can fail. The callback returns an error when:

  • The symbol does not match any vmlinux ELF symtab entry.
  • The resolved address is not 4-byte aligned (the 4-byte watch length requires addr & 0x3 == 0 on every supported architecture).
  • All three user watchpoint slots are already allocated.
  • KVM_SET_GUEST_DEBUG rejected the arm (host kernel limitation).

When registration fails, the executor bails the step immediately with the symbol and the reason. Silent degradation is deliberately avoided — a watch that never fires would look identical to a healthy passing run, and the test author would never notice the captures were missing.
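The alignment rule from the list above is a one-line mask test. A minimal sketch, with a hypothetical function name and error text:

```rust
/// Width of the watched region in bytes, per the alignment rule above.
const WATCH_LEN: u64 = 4;

/// Reject addresses that cannot host a 4-byte watch (addr & 0x3 != 0).
fn check_watch_alignment(symbol: &str, addr: u64) -> Result<(), String> {
    if (addr & (WATCH_LEN - 1)) != 0 {
        return Err(format!(
            "symbol '{symbol}' at {addr:#x} is not {WATCH_LEN}-byte aligned"
        ));
    }
    Ok(())
}
```

Returning an error that names both the symbol and the address follows the same principle as the rest of the registration path: the failure is visible at the step that caused it.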

Host-side unit tests

Outside a VM, a watch-capable fixture bridge needs both callbacks — a bridge built with only SnapshotBridge::new(cb) rejects every Op::watch_snapshot with an error naming the missing wiring:

let cb: CaptureCallback = std::sync::Arc::new(|_name| {
    Some(FailureDumpReport::default())
});
let reg: WatchRegisterCallback = std::sync::Arc::new(|symbol: &str| {
    println!("would arm watchpoint on {symbol}");
    Ok(())
});
let bridge = SnapshotBridge::new(cb).with_watch_register(reg);
let _guard = bridge.set_thread_local();

Do not install a thread-local bridge in a VM-booting scenario — see the warning in Snapshots.

Periodic Capture

A single snapshot proves state was right once; scheduler bugs are usually about how state evolves — a counter that stops advancing, utilization that drifts after warmup. Periodic capture samples guest BPF state on a cadence across the workload window, driven entirely by the host: no scenario-code changes, no capture calls in the test body. The result is a time-ordered series of samples that feeds the temporal assertion patterns.

Enabling it

Set num_snapshots = N on the test; 0 (the default) disables periodic capture entirely.

use ktstr::prelude::*;

#[ktstr_test(num_snapshots = 3, duration_s = 10)]
fn paced_capture(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        CgroupDef::named("workers").workers(2).work_type(WorkType::SpinWait),
    ])
}

When boundaries fire

The window is the 10%–90% slice of the workload duration, anchored at the moment the scenario actually starts — VM boot and BPF verifier time do not eat the budget. The 10% buffers at each end keep samples off ramp-up and ramp-down transients. The remaining 80% divides into N + 1 equal intervals, yielding N interior boundaries at 0.1·d + (i+1)·0.8·d/(N+1). For a 10 s workload, num_snapshots = 3 captures at scenario start + {3 s, 5 s, 7 s}.
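The boundary formula can be checked with a few lines of integer arithmetic. The function name is hypothetical, and the sketch works in whole milliseconds (integer division truncates), which may differ from how ktstr rounds internally.

```rust
/// Offsets, in ms from scenario start, of the N interior boundaries:
/// 0.1*d + (i+1) * 0.8*d / (N+1) for i in 0..N.
fn boundary_offsets_ms(duration_ms: u64, n: u64) -> Vec<u64> {
    (0..n)
        .map(|i| duration_ms / 10 + (i + 1) * 8 * duration_ms / (10 * (n + 1)))
        .collect()
}
```

For `boundary_offsets_ms(10_000, 3)` this yields 3000, 5000, and 7000 ms, matching the 3 s, 5 s, 7 s example; `n = 0` yields no boundaries.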

The boundary clock is workload time, not wall-clock: a scenario pause shifts every un-fired boundary by the pause duration.

Two validation rules, enforced when the entry is built:

  • Minimum spacing: 0.8 · duration / (N + 1) >= 100 ms. Boundaries closer than that would fire back-to-back with no workload progress between them. Reduce num_snapshots or extend duration_s.
  • Bridge cap: num_snapshots cannot exceed 64 (MAX_STORED_SNAPSHOTS). Validation rejects higher values rather than silently evicting the earliest samples.
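Both rules reduce to two comparisons. The sketch below is an assumption-laden model: the function name, the error strings, and the order in which the checks run are hypothetical; only the 100 ms floor and the cap of 64 come from the text.

```rust
const MIN_SPACING_MS: u64 = 100;
const MAX_STORED_SNAPSHOTS: u64 = 64;

/// Validate a (duration, num_snapshots) pair against the two rules above.
fn validate_periodic(duration_ms: u64, n: u64) -> Result<(), String> {
    if n == 0 {
        return Ok(()); // periodic capture disabled
    }
    if n > MAX_STORED_SNAPSHOTS {
        return Err(format!("num_snapshots {n} exceeds {MAX_STORED_SNAPSHOTS}"));
    }
    // 0.8 * d / (N + 1) >= 100 ms, rearranged to avoid division:
    // 8 * d >= 10 * 100 * (N + 1).
    if 8 * duration_ms < 10 * MIN_SPACING_MS * (n + 1) {
        return Err(format!(
            "spacing below {MIN_SPACING_MS} ms for {n} snapshots in {duration_ms} ms"
        ));
    }
    Ok(())
}
```

With a 1 s workload, 7 snapshots sit exactly on the floor (800 ms / 8 = 100 ms) and pass, while 8 snapshots fall below it and are rejected.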

What a capture costs

Each boundary runs the same pipeline as an on-demand Op::capture_snapshot: every vCPU is parked, the BPF maps are walked, the report is stored. On a healthy guest the freeze is tens of milliseconds (10–100 ms steady state; cold-cache or large guest-memory walks push higher). The host watchdog deadline is extended by each freeze’s duration, so periodic captures do not eat the workload’s wall-clock budget — but they do briefly stop the guest, which is why the spacing floor exists.

Tags and best-effort delivery

Each capture lands on the host SnapshotBridge under periodic_NNN (periodic_000, periodic_001, …), coexisting with on-demand and watchpoint tags on the same bridge — filter with SampleSeries::periodic_only() before asserting.

Delivery is best-effort: an early VM exit, rendezvous timeout, or watchdog deadline can cut the sequence short, and the run loop abandons the remainder after 2 consecutive rendezvous timeouts so a sustained host overload does not pile up placeholder samples. Under KASLR (the default), a boundary that would fire before the guest’s address slide is published is deferred, not dropped — it fires on the next loop iteration. Assert a lower bound on coverage, not equality:

fn check_coverage(result: &VmResult) -> Result<()> {
    anyhow::ensure!(result.periodic_target == 3);
    anyhow::ensure!(
        result.periodic_fired >= 2,
        "too few periodic samples ({}/{})",
        result.periodic_fired,
        result.periodic_target,
    );
    Ok(())
}

periodic_target mirrors the configured num_snapshots; periodic_fired counts boundaries actually serviced (including rendezvous-timeout placeholders). When post_vm is omitted on a periodic-configured test, the macro installs a default callback asserting at least one boundary fired with real BPF state.

Draining the bridge

The assertion pipeline runs on the host after vm.run() returns — inside a post_vm callback. The recommended path is drain_ordered_with_stats fed into SampleSeries::from_drained_typed, which preserves insertion order, per-sample stats results, and timestamps:

use ktstr::prelude::*;

fn post_vm(result: &VmResult) -> Result<()> {
    let series = SampleSeries::from_drained_typed(
        result.snapshot_bridge.drain_ordered_with_stats(),
        result.monitor.clone(),
    )
    .periodic_only();

    anyhow::ensure!(
        !series.is_empty(),
        "no periodic samples — coordinator never fired",
    );

    // ... project a field and feed a temporal pattern ...
    Ok(())
}

#[ktstr_test(num_snapshots = 3, duration_s = 10, post_vm = post_vm)]
fn my_test(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        CgroupDef::named("workers").workers(2).work_type(WorkType::SpinWait),
    ])
}

Each drained entry carries the tag, the captured report, the typed per-sample stats result (Err(MissingStatsReason) when the stats request failed or no scheduler stats client was wired), a pause-adjusted elapsed_ms timestamp, the scheduled boundary_offset_ms, and the scenario phase stamp (step_index). The other drain variants drop metadata the temporal pipeline needs — see the SnapshotBridge rustdoc if you need them.

Temporal Assertions owns the sample anatomy and projection surface; Snapshots owns the per-sample error routing (PlaceholderSample, MissingStats).

What to assert

Two stages: compose the series (drain, periodic_only()), then project a column and pick a pattern. For monotonic counters, nondecreasing is the canonical choice; for utilization-style metrics that should hold once warmup ends, steady_within; for “stabilizes near a target by a deadline”, converges_to. The full pattern surface, projection helpers, and failure rendering live in Temporal Assertions.

Temporal Assertions

Periodic snapshots produce a series of samples over time. Temporal assertions answer questions about the trajectory — does a counter only ever advance? Does a utilization metric stay near its mean once warmup ends? Does a load average converge before a deadline?

The shape is two-stage: build a SampleSeries from the drained periodic captures, then project a SeriesField<T> — one column of T-typed values across every sample — and feed it through a pattern. Pick by the question:

| Pattern | Question it answers | Type | On a projection error |
|---|---|---|---|
| nondecreasing | does this counter only go up? | any ordered | skip the pair, note the gap |
| strictly_increasing | does it advance every period? | any ordered | skip the pair, note the gap |
| rate_within(lo, hi) | does it advance at the right speed? | f64 | gap: no rate across it, note |
| steady_within(warmup, tol) | does it hold near its mean after warmup? | f64 | skip the sample, note |
| converges_to(target, tol, deadline) | does it stabilize near a target in time? | f64 | interrupts the witness run, note |
| always_true | does this invariant hold at every sample? | bool | fail (strict) |
| ratio_within(other, lo, hi) | do two series stay in proportion? | f64 | skip the index, note |
| each(...) | is every sample inside a scalar bound? | any | fail (strict) |

Every pattern takes &mut Verdict and returns it, so assertions chain onto one accumulator; each failure records a DetailKind::Temporal detail, and coverage gaps record Notes. For enabling capture and draining the bridge, see Periodic Capture — this page covers projection and assertion.

SampleSeries

SampleSeries is the ordered sample sequence drained from the bridge after the VM exits:

use ktstr::prelude::*;

let drained = vm_result.snapshot_bridge.drain_ordered_with_stats();
let series = SampleSeries::from_drained_typed(drained, monitor).periodic_only();

periodic_only() filters to tags beginning with "periodic_", stripping on-demand captures and watchpoint fires that share the bridge; periodic_ref() is the borrowed-iterator equivalent when one test needs both views.

SampleSeries exposes:

  • len(), is_empty() — sample count.
  • iter_samples() — borrowed Sample<'_> views. Each sample carries tag, a pause-adjusted elapsed_ms, a Snapshot<'_> over the captured BPF state, a step_index: Option<u16> phase stamp, and stats: Result<&Value, &MissingStatsReason> — the per-sample scx_stats JSON, or the typed reason the stats request failed.
  • bpf(label, |snap| …) / stats(label, |sv| …) — closure projection along the BPF or stats axis.
  • bpf_live_u64(name) / bpf_live_i64 / bpf_live_f64 — terse BPF-axis shorthand that resolves name via the auto-disambiguating Snapshot::live_var accessor (no closure); mirrored on the stats axis as stats_live_u64(path) / _i64 / _f64.
  • bpf_map(map_name) / stats_path(path) — typed auto-projection (see Auto-projection).
  • by_stamped_phase() — group samples by the bridge-stamped scenario phase (BTreeMap<u16, Vec<Sample>>; 0 = baseline, 1..=N = step ordinals). Prefer by_stimulus_phase(stimulus_events) when a stimulus timeline is available — it re-derives the phase from each sample’s boundary_offset_ms and is immune to deferred-fire bursts that collapse stamped phases.

SeriesField

A SeriesField<T> is one per-sample column. Each slot is a SnapshotResult<T>, so a missing field, type mismatch, or placeholder report on one sample does not abort the projection — it surfaces at the assertion site as a per-sample error the pattern decides how to handle (see the table above). The field carries each sample’s tag and timestamp alongside the value, so failure messages name the offending sample without re-threading the series.

Projecting from BPF state

The bpf closure receives each sample’s Snapshot<'_>; the body is a normal Snapshot accessor expression:

let nr_dispatched: SeriesField<u64> = series.bpf(
    "nr_dispatched",
    |snap| snap.var("nr_dispatched").as_u64(),
);

Projecting from scx_stats JSON

The stats closure receives a StatsValue<'_> wrapper over the per-sample stats JSON:

let busy: SeriesField<f64> = series.stats(
    "busy",
    |sv| sv.get("busy").as_f64(),
);

A sample whose stats slot is Err (the stats request failed, or no scheduler stats client was wired) yields a SnapshotError::MissingStats { tag, reason } slot — distinct from an in-JSON path miss (FieldNotFound / TypeMismatch) so coverage gaps and data errors stay distinguishable.

Auto-projection

The typed auto-projectors emit ready-to-feed SeriesFields without a closure:

// Top-level scalar member of a BPF map's first entry.
let dispatched = series
    .bpf_map("scx_obj.bss")
    .at(0)
    .field_u64("nr_dispatched");

// Stats path drilling into nested layer/cgroup keys.
let layer_util = series
    .stats_path("layers")
    .key("batch")
    .field_f64("util");

Bulk discovery: member_names() / u64_fields() / f64_fields() on the BPF projector (key_names() on the stats projector) project every member that yields at least one Ok across the series — useful for blanket “every counter must be nondecreasing” sweeps. The typed field_* helpers reach top-level scalars only; nested members ("ctx.weight") need the closure path. Per-CPU maps use the projector’s cross-CPU reductions (field_cpu_sum_* / field_cpu_max_* / field_cpu_min_*) or .cpu(n).field_*.

The patterns

nondecreasing / strictly_increasing

Pass when every consecutive pair satisfies values[i] <= values[i+1] (or < for the strict variant). The shape for kernel counters whose only legal direction is up.

let mut v = Verdict::new();
nr_dispatched.nondecreasing(&mut v);
nr_dispatched.strictly_increasing(&mut v); // require advance every period

Projection errors are skipped — the affected pair is dropped, the skip is logged as a Note, and the verdict is not flipped on missing data; adjacent samples on either side of a gap are still checked. Fewer than 2 samples records a “vacuously holds” Note and passes.
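The pairwise check with skipped gaps can be modeled on a plain slice of results. This is a sketch under stated assumptions: `Outcome` and the free function are hypothetical, and only index-adjacent pairs are compared; whether ktstr also compares the two samples flanking a gap is not modeled here.

```rust
#[derive(Debug, PartialEq)]
struct Outcome {
    /// Index i of each pair where values[i] > values[i + 1].
    regressions: Vec<usize>,
    /// Pairs dropped because at least one side failed to project.
    skipped_pairs: usize,
}

fn nondecreasing(slots: &[Result<u64, String>]) -> Outcome {
    let mut out = Outcome { regressions: Vec::new(), skipped_pairs: 0 };
    // windows(2) yields nothing for fewer than 2 samples: vacuous pass.
    for (i, pair) in slots.windows(2).enumerate() {
        match (&pair[0], &pair[1]) {
            (Ok(a), Ok(b)) => {
                if a > b {
                    out.regressions.push(i);
                }
            }
            // Missing data is a coverage gap, not a regression.
            _ => out.skipped_pairs += 1,
        }
    }
    out
}
```

For the series `[1, 5, <error>, 4, 3]` this reports one regression (the 4 → 3 pair) and two skipped pairs around the errored sample.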

rate_within(lo, hi) (f64 only)

Pass when every consecutive (delta_value / delta_ms) lies in [lo, hi], computed from the per-sample timestamps — a counter that should advance at ~1 unit/ms reads as rate_within(0.5, 2.0).

let ticks: SeriesField<f64> = series.bpf("ticks",
    |snap| snap.var("ticks").as_f64());
ticks.rate_within(&mut v, 0.5, 2.0);

A zero-time delta records an inconclusive detail (zero denominator) naming the pair; a non-finite rate records its own detail rather than slipping past the band; lo > hi is a single caller-error detail. Projection errors are gaps — no rate is computed across them.
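The per-pair classification can be sketched as follows. The enum and function are hypothetical; samples are modeled as `(elapsed_ms, value)` tuples assumed to be in time order.

```rust
#[derive(Debug, PartialEq)]
enum RateCheck {
    InBand,
    OutOfBand,
    ZeroDelta,
    NonFinite,
}

/// Classify one consecutive pair of (elapsed_ms, value) samples.
fn check_rate(prev: (u64, f64), next: (u64, f64), lo: f64, hi: f64) -> RateCheck {
    let delta_ms = next.0.saturating_sub(prev.0);
    if delta_ms == 0 {
        return RateCheck::ZeroDelta; // inconclusive: zero denominator
    }
    let rate = (next.1 - prev.1) / delta_ms as f64;
    if !rate.is_finite() {
        // NaN and infinities get their own outcome instead of falling
        // through a band comparison.
        RateCheck::NonFinite
    } else if rate >= lo && rate <= hi {
        RateCheck::InBand
    } else {
        RateCheck::OutOfBand
    }
}
```

A counter that moves 1000 units in 1000 ms has rate 1.0 and lands inside `[0.5, 2.0]`; one that moves 100 units in the same time has rate 0.1 and falls out of band.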

steady_within(warmup_ms, tolerance) (f64 only)

Pass when every post-warmup sample (elapsed_ms >= warmup_ms) lies inside [mean·(1-tolerance), mean·(1+tolerance)]. The mean is computed over post-warmup samples only, so ramp-up does not bias the baseline. tolerance is a fraction (0.10 = ±10%).

let util: SeriesField<f64> = series.stats("busy",
    |sv| sv.get("busy").as_f64());
util.steady_within(&mut v, /*warmup_ms=*/ 1000, /*tolerance=*/ 0.10);

Projection errors are skipped with a Note. When warmup absorbs every sample, the pattern notes “no samples beyond warmup” and passes vacuously.
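The post-warmup mean and band can be sketched on `(elapsed_ms, value)` tuples. The function name and return shape are hypothetical, projection errors are not modeled, and the band arithmetic assumes a positive mean.

```rust
/// Indices of post-warmup samples outside mean * (1 ± tolerance).
/// Returns None when warmup absorbs every sample (vacuous pass).
fn steady_outliers(
    samples: &[(u64, f64)],
    warmup_ms: u64,
    tolerance: f64,
) -> Option<Vec<usize>> {
    // Keep only samples at or past the warmup cutoff, with their indices.
    let post: Vec<(usize, f64)> = samples
        .iter()
        .enumerate()
        .filter(|(_, (t, _))| *t >= warmup_ms)
        .map(|(i, (_, v))| (i, *v))
        .collect();
    if post.is_empty() {
        return None;
    }
    // The mean excludes warmup samples, so ramp-up does not bias it.
    let mean = post.iter().map(|(_, v)| v).sum::<f64>() / post.len() as f64;
    let (lo, hi) = (mean * (1.0 - tolerance), mean * (1.0 + tolerance));
    Some(
        post.into_iter()
            .filter(|(_, v)| *v < lo || *v > hi)
            .map(|(i, _)| i)
            .collect(),
    )
}
```

With a 1000 ms warmup and ±10%, the post-warmup values 50, 52, 48 have mean 50 and all sit inside `[45, 55]`, so a low first sample taken at 500 ms does not affect the result.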

converges_to(target, tolerance, deadline_ms) (f64 only)

Pass when three consecutive samples land inside [target - tolerance, target + tolerance] at or before deadline_ms — the convergence-witness shape for “the system stabilizes near target by the deadline”.

load.converges_to(&mut v, /*target=*/ 1.0, /*tol=*/ 0.5, /*deadline_ms=*/ 5_000);

Distinct outcomes: witness found — pass. No witness before the deadline — a temporal failure naming the sample count (and any errored samples that interrupted in-progress runs). Fewer than 3 successfully-projected samples in the window — a Note, not a failure: absence of data is a coverage gap, not a negative finding, and the note distinguishes “did not collect enough” from “collected enough but never converged”.
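The three outcomes can be modeled with a run counter. This is a sketch, not ktstr's implementation: the enum and function are hypothetical, samples are `(elapsed_ms, projected value)` pairs assumed to be in time order, and the deadline is treated as inclusive.

```rust
#[derive(Debug, PartialEq)]
enum Convergence {
    /// Index of the sample that completed the three-in-a-row witness.
    Witness(usize),
    NoWitness,
    TooFewSamples,
}

fn converges_to(
    samples: &[(u64, Result<f64, String>)],
    target: f64,
    tol: f64,
    deadline_ms: u64,
) -> Convergence {
    let mut run = 0usize; // current streak of in-band samples
    let mut ok_in_window = 0usize; // projected samples before the deadline
    for (i, (t, slot)) in samples.iter().enumerate() {
        if *t > deadline_ms {
            break;
        }
        match slot {
            Ok(v) => {
                ok_in_window += 1;
                if (*v - target).abs() <= tol {
                    run += 1;
                    if run == 3 {
                        return Convergence::Witness(i);
                    }
                } else {
                    run = 0;
                }
            }
            // An errored sample interrupts an in-progress run.
            Err(_) => run = 0,
        }
    }
    if ok_in_window < 3 {
        Convergence::TooFewSamples // coverage gap, not a negative finding
    } else {
        Convergence::NoWitness
    }
}
```

Separating `TooFewSamples` from `NoWitness` is what lets a caller treat missing data as a note while still failing a series that had enough data and never settled.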

always_true (bool only)

Pass when every sample’s value is true. Projection errors fail the assertion — this is a strict pattern; a missing boolean is a coverage gap that must surface.

let alive: SeriesField<bool> = series.bpf("scheduler_alive",
    |snap| snap.var("scheduler_alive").as_bool());
alive.always_true(&mut v);

ratio_within(other, lo, hi) (f64 only)

Pass when every per-index self[i] / other[i] lies in [lo, hi] — two same-length series walked in lock-step.

util.ratio_within(&mut v, &runtime, 0.4, 0.6);

A length mismatch fires one caller-error detail and aborts. A zero denominator records an inconclusive detail naming the sample; out-of-band ratios record the lhs/rhs values. Projection errors on either side are skipped with a Note naming each gap and which side errored.

Per-sample scalar checks: each

For per-sample bounds, bypass the trajectory patterns via SeriesField::each:

nr_dispatched.each(&mut v).at_least(1u64);
util.each(&mut v).between(0.0_f64, 100.0_f64);
ticks.each(&mut v).at_most(10_000.0_f64);

each runs the comparator on every successfully-projected sample; the first failure records a detail and subsequent failures pile on, so the timeline shows every offending sample. Projection errors flip the verdict (each is strict, matching always_true). NaN samples report an incomparable failure by name — without that branch, IEEE-754 comparisons against NaN are always false, and a NaN would silently pass value < floor checks.
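The NaN hazard is easy to reproduce in plain Rust. In the sketch below both function names are hypothetical; the first shows the naive comparison that lets NaN through, the second the explicit branch the text describes.

```rust
/// Naive floor check: "passes unless value < floor".
/// Every comparison against NaN is false, so NaN passes.
fn naive_at_least(value: f64, floor: f64) -> bool {
    !(value < floor)
}

/// Floor check with an explicit incomparable branch.
fn at_least(value: f64, floor: f64) -> Result<(), String> {
    if value.is_nan() || floor.is_nan() {
        return Err("incomparable: NaN".to_string());
    }
    if value < floor {
        return Err(format!("{value} is below floor {floor}"));
    }
    Ok(())
}
```

`naive_at_least(f64::NAN, 1.0)` returns true, while `at_least(f64::NAN, 1.0)` returns an error, which is the difference between a silent pass and a named failure.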

Phase-bucketed comparisons

Steps stamp each capture with a scenario phase (Phase::BASELINE, then Phase::step(0), Phase::step(1), … in step order). Per-phase reducers on a projected field — counter_delta_per_phase(), first_per_phase(), last_per_phase(), value_at_phase(phase) — reduce the series to one value per phase, and ratio_across_phases pins a later phase against an earlier one. The swap-A/B shape (step 0 runs scheduler A, step 1 swaps in B via Op::ReplaceScheduler):

use ktstr::assert::Phase;

let dispatched = series.bpf_live_f64("nr_dispatched");
dispatched
    .ratio_across_phases(&mut v, Phase::step(0), Phase::step(1))
    .at_most(1.5); // B may cost at most 1.5x A on this counter

at_most records the computed ratio and both phase values in the verdict — pass or fail — so the margin is visible without extra printing. A phase with no Ok-samples or a zero baseline records an inconclusive detail rather than a fake ratio. PhaseMapExt::ratio_across_phases does the same on a pre-reduced BTreeMap<Phase, _> for caller-derived per-phase values.
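On pre-reduced per-phase values, the ratio and its inconclusive cases can be sketched as below. This is a hypothetical model: phases are plain `u16` keys rather than ktstr's `Phase` type, the ratio is taken as later over baseline (matching the "B may cost at most 1.5x A" reading of the example), and inconclusive cases collapse to `None`.

```rust
use std::collections::BTreeMap;

/// Ratio of the later phase's value to the baseline phase's value.
/// None when either phase has no value or the baseline is zero.
fn ratio_across_phases(
    per_phase: &BTreeMap<u16, f64>,
    baseline: u16,
    later: u16,
) -> Option<f64> {
    let base = *per_phase.get(&baseline)?;
    let next = *per_phase.get(&later)?;
    if base == 0.0 {
        return None; // no meaningful ratio against a zero baseline
    }
    Some(next / base)
}
```

With 200 dispatches in the baseline phase and 300 in the later one, the ratio is 1.5, which sits exactly on an `at_most(1.5)` bound.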

Failure rendering

Every temporal failure carries the field’s label, the pattern name, and the offending sample’s tag and timestamp. A nondecreasing regression renders as (shape pinned by the library’s format strings):

nr_dispatched (nondecreasing): regression at sample periodic_004 (+850ms): \
    value 100 after prior value 200 at sample periodic_003 (+700ms)

Coverage Notes render with the per-sample error variant, so PlaceholderSample (rendezvous timeout), MissingStats (stats request failed), FieldNotFound (typo / wrong map), and TypeMismatch are distinguishable without a debugger:

nr_dispatched (nondecreasing): skipped 1 sample(s) with projection errors: \
    periodic_002(+500ms): snapshot has no global variable 'nrdispatch' \
    in any *.bss/*.data/*.rodata map (available globals: ["nr_dispatched", \
    "stall"])

Worked example

The pipeline runs on the host: post_vm receives the VmResult after vm.run() returns, drains the bridge, and walks the series:

use ktstr::prelude::*;

fn assert_temporal_patterns(result: &VmResult) -> Result<()> {
    let series = SampleSeries::from_drained_typed(
        result.snapshot_bridge.drain_ordered_with_stats(),
        result.monitor.clone(),
    )
    .periodic_only();

    let mut v = Verdict::new();

    // BPF axis: counter must never regress.
    let nr_dispatched: SeriesField<u64> = series.bpf(
        "nr_dispatched",
        |snap| snap.var("nr_dispatched").as_u64(),
    );
    nr_dispatched.nondecreasing(&mut v);

    // Stats axis: stay under a generous ceiling.
    let stats_dispatched: SeriesField<u64> = series.stats(
        "nr_dispatched",
        |sv| sv.get("nr_dispatched").as_u64(),
    );
    stats_dispatched.each(&mut v).at_most(1_000_000_000u64);

    v.into_anyhow_or_log()
}

#[ktstr_test(num_snapshots = 3, duration_s = 10, post_vm = assert_temporal_patterns)]
fn dispatch_counter_advances(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        CgroupDef::named("workers").workers(2).work_type(WorkType::SpinWait),
    ])
}

For capture wiring and num_snapshots semantics, see Periodic Capture; for the Snapshot accessors the projection closures call into, see Snapshots.

Running Tests

Every #[ktstr_test] boots a fresh KVM microVM with the topology the test declares, on the exact kernel you target. cargo ktstr test resolves that kernel (building and caching it when needed) and wraps cargo nextest run, so nextest’s filtering, retries, and parallelism all apply.

Quick reference

# Run all tests
cargo ktstr test --kernel ../linux

# Run a specific test
cargo ktstr test --kernel ../linux -- -E 'test(sched_basic_proportional)'

# Run all ktstr-managed tests, skipping non-ktstr tests in the same crate
cargo ktstr test --kernel ../linux -- -E 'test(/^ktstr/)'

# Run ignored gauntlet variants
cargo ktstr test --kernel ../linux -- --run-ignored ignored-only -E 'test(gauntlet/)'

What’s in this chapter

  • cargo ktstr — the host-side command: kernel resolution, test dispatch, replay, coverage, export.
  • ktstr (standalone) — the debugging companion: interactive VM shells, topo, ctprof, locks.
  • Gauntlet — run every test across a matrix of topology presets.
  • BPF Verifier Sweep — verify, attach, and dispatch every declared scheduler across topologies.
  • Reading Failure Output — what a failed test prints, section by section, and how to investigate.
  • Auto-Repro — the second VM that replays a scheduler crash with probes attached.
  • Runs and Regression Gates — result sidecars, stats, and perf-delta.

Test names and variants

Tests registered through #[ktstr_test] show up in nextest output under one of four name shapes, spread across two prefixes (ktstr/ and gauntlet/):

  • ktstr/{name} — single-kernel run (or any host_only test, which never boots a VM and so never multiplies across kernels).
  • ktstr/{name}/{kernel} — one case per (test × kernel) when --kernel resolves to two or more kernels.
  • gauntlet/{name}/{preset} — one case per topology preset (see Gauntlet).
  • gauntlet/{name}/{preset}/{kernel} — the full (test × preset × kernel) expansion under a multi-kernel run.

This is what those names look like in a real run:

 Nextest run ID 98581174-246f-4824-a170-50992df166d7 with nextest profile: default
    Starting 1 test across 121 binaries (12531 tests skipped)
        PASS [  34.459s] (1/1) ktstr::failure_dump_e2e ktstr/failure_dump_renders_bss_fields
...
ktstr::worktype_coverage_fork_gauntlet_e2e gauntlet/worktype_fork_gauntlet_covers_all_arms/smt-3llc
ktstr::worktype_coverage_fork_gauntlet_e2e gauntlet/worktype_fork_gauntlet_covers_all_arms/tiny-1llc
ktstr::worktype_coverage_fork_gauntlet_e2e gauntlet/worktype_fork_gauntlet_covers_all_arms/tiny-2llc

Filter by prefix with -E 'test(/^ktstr/)' or -E 'test(/^gauntlet/)'.
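
As a rough sketch of how the four shapes compose, assuming a hypothetical helper `nextest_name` (not a ktstr function), the name depends only on whether a preset and a kernel label are present:

```rust
/// Compose a nextest case name from its parts.
/// Hypothetical helper for illustration; ktstr generates these names itself.
fn nextest_name(test: &str, preset: Option<&str>, kernel_label: Option<&str>) -> String {
    // A preset selects the gauntlet/ prefix; otherwise the base ktstr/ prefix applies.
    let mut name = match preset {
        Some(p) => format!("gauntlet/{test}/{p}"),
        None => format!("ktstr/{test}"),
    };
    // The kernel label is appended only under a multi-kernel run.
    if let Some(k) = kernel_label {
        name.push('/');
        name.push_str(k);
    }
    name
}
```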

Tip

test(NAME) is a substring match; the exact-match form test(=NAME) matches the full nextest name, prefix included. Use test(=ktstr/sched_basic_proportional), not the bare function name — test(=sched_basic_proportional) matches nothing.

The {kernel} suffix is a sanitized kernel label: kernel_ prefix, lowercase, non-alphanumeric characters collapsed to _. For example, 6.16.1 becomes kernel_6_16_1, and a path spec becomes kernel_path_{basename}_{hash6} (with _dirty appended when the source tree has uncommitted changes). The 6-character hash disambiguates two source paths that share a basename.
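
The version case of that rule can be sketched as follows. `sanitize_kernel_label` is a hypothetical name, and the handling of leading separators is an assumption; the path form (basename plus hash) is not covered because the hash function is not specified here.

```rust
/// Build a kernel label from a version spec: "kernel_" prefix, lowercase,
/// and each run of non-alphanumeric characters collapsed to a single '_'.
/// Hypothetical sketch; the real sanitizer may treat edge cases differently.
fn sanitize_kernel_label(spec: &str) -> String {
    let mut label = String::from("kernel_");
    // The prefix already ends in '_', so a leading separator adds nothing (assumption).
    let mut last_was_sep = true;
    for ch in spec.chars() {
        if ch.is_ascii_alphanumeric() {
            label.push(ch.to_ascii_lowercase());
            last_was_sep = false;
        } else if !last_was_sep {
            label.push('_');
            last_was_sep = true;
        }
    }
    label
}
```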

RUST_BACKTRACE=1 controls panic backtraces and verbose failure output, not guest console streaming — see Reading Failure Output for the investigation knobs.

Budget-based test selection

Set KTSTR_BUDGET_SECS to select the subset of tests that maximizes configuration coverage within a time budget — useful for CI pipelines and quick smoke tests:

KTSTR_BUDGET_SECS=300 cargo ktstr test --kernel ../linux

The selector encodes each test as a bitset of properties (scheduler, topology class, SMT, workload characteristics) and greedily picks the tests with the highest marginal coverage per estimated second, with duration estimates accounting for VM boot overhead by vCPU count. A summary is printed to stderr during budget-mode listing:

ktstr budget: 42/1200 tests, 295/300s used, 38/38 configurations covered
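
The greedy step can be illustrated with a small standalone sketch. The `Candidate` struct, its field names, and the tie-breaking behaviour are assumptions for illustration; the real selector's property encoding and duration model are not reproduced here.

```rust
/// A test as the selector might see it (hypothetical shape).
struct Candidate {
    name: &'static str,
    props: u64,    // bitset: one bit per configuration property the test covers
    est_secs: u64, // estimated wall time, VM boot included
}

/// Repeatedly pick the test with the most newly covered bits per estimated
/// second, until nothing that fits the budget adds coverage.
fn select_within_budget(candidates: &[Candidate], budget_secs: u64) -> Vec<&'static str> {
    let mut covered: u64 = 0;
    let mut used: u64 = 0;
    let mut picked = Vec::new();
    let mut remaining: Vec<&Candidate> = candidates.iter().collect();

    loop {
        let best = remaining
            .iter()
            .enumerate()
            .filter(|(_, c)| used + c.est_secs <= budget_secs)
            .map(|(i, c)| {
                let new_bits = (c.props & !covered).count_ones();
                (i, f64::from(new_bits) / c.est_secs.max(1) as f64)
            })
            .filter(|(_, score)| *score > 0.0)
            .max_by(|a, b| a.1.total_cmp(&b.1));

        let Some((idx, _)) = best else { break };
        let chosen = remaining.swap_remove(idx);
        covered |= chosen.props;
        used += chosen.est_secs;
        picked.push(chosen.name);
    }
    picked
}
```

Scoring by marginal coverage rather than total coverage is what keeps a second test with the same properties from being selected.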

Testing your own scheduler

Declare it with declare_scheduler! and reference it from #[ktstr_test(scheduler = ...)] — see Scheduler Definitions and the Test a New Scheduler recipe.

cargo ktstr

cargo ktstr is the host-side command for the whole workflow: it resolves (and if needed builds and caches) the kernel, then drives cargo nextest run so every #[ktstr_test] boots its VM against exactly the kernel you asked for.

Install

cargo install --locked ktstr   # installs both `ktstr` and `cargo-ktstr`

The test-fixture binaries (jemalloc probes, schbench/taobench validators) are behind the non-default integration feature and are not installed by default. To build from a workspace checkout instead: cargo build --bin cargo-ktstr.

Task map

| I want to… | Command | Depth |
| --- | --- | --- |
| Run tests on a kernel | cargo ktstr test | this page |
| Re-run last session’s failures | cargo ktstr replay | this page |
| Manage cached kernels | cargo ktstr kernel | this page |
| Sweep schedulers through the BPF verifier | cargo ktstr verifier | Verifier |
| Analyze results, gate regressions | cargo ktstr stats / perf-delta | Runs |
| Narrow CI to affected schedulers | cargo ktstr affected | CI |
| Reproduce a test on bare metal | cargo ktstr export | this page |
| Debug interactively in a VM | cargo ktstr shell | ktstr shell |

Common flags

These flags mean the same thing on every subcommand that takes them.

--kernel ID — one grammar everywhere:

cargo ktstr test --kernel ../linux                    # local source tree (builds + caches)
cargo ktstr test --kernel 6.14.2                      # version (auto-downloads on miss)
cargo ktstr test --kernel 6.14                        # major.minor prefix → latest patch release
cargo ktstr test --kernel 6.14.2-tarball-x86_64-kc... # cache key (from `kernel list`)
cargo ktstr test --kernel 6.12..6.14                  # range: every stable+longterm release inside
cargo ktstr test --kernel git+https://example.com/r.git#tag=v6.14   # git tag (#branch= / #sha= too)
cargo ktstr test --kernel 6.14.2 --kernel 7.0         # repeatable → multi-kernel matrix

Ranges expand against kernel.org’s releases.json; both endpoints are series-inclusive (6.11..6.14 covers every 6.14.N; spell 6.14.2 for an exact bound), and EOL series silently drop out unless you pass --include-eol. Git sources are fetched at the ref, built, and cached; a moved branch tip rebuilds.
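
The endpoint rule can be made concrete with a short membership check. This is a sketch under stated assumptions: `parse_bound` and `range_contains` are hypothetical names, and the check ignores releases.json and EOL filtering entirely; it only decides whether one version falls inside a range string.

```rust
/// Parse "6.14" or "6.14.2" into (major, minor, optional patch).
fn parse_bound(s: &str) -> Option<(u32, u32, Option<u32>)> {
    let mut parts = s.split('.');
    let major = parts.next()?.parse().ok()?;
    let minor = parts.next()?.parse().ok()?;
    let patch = match parts.next() {
        Some(p) => Some(p.parse().ok()?),
        None => None,
    };
    if parts.next().is_some() {
        return None;
    }
    Some((major, minor, patch))
}

/// Does `version` fall inside `range` ("START..END")?
/// A bound without a patch number covers its whole series.
fn range_contains(range: &str, version: (u32, u32, u32)) -> Option<bool> {
    let (lo, hi) = range.split_once("..")?;
    let (lo, hi) = (parse_bound(lo)?, parse_bound(hi)?);
    let low = (lo.0, lo.1, lo.2.unwrap_or(0));
    let high = (hi.0, hi.1, hi.2.unwrap_or(u32::MAX));
    Some(low <= version && version <= high)
}
```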

When --kernel resolves to two or more kernels, the kernel becomes another gauntlet dimension: each (test × preset × kernel) tuple is a distinct nextest case with a kernel-label suffix (see test name shapes), and result sidecars partition per kernel under target/ktstr/{kernel}-{project_commit}/. Kernel resolution finishes for every requested kernel before any test runs — a missing kernel aborts up front rather than mid-matrix. host_only tests run once regardless of kernel count.

--no-perf-mode — disable all performance mode features (flock, pinning, RT scheduling, hugepages, NUMA mbind, KVM exit suppression). Also via KTSTR_NO_PERF_MODE.

--no-skip-mode — convert resource-contention and host-topology-insufficient skips into hard failures (exit 1 instead of 0). The default skips so a contended runner does not fail tests that simply could not start.

The three profiles — independent of each other:

  • --release — the harness build profile. Release mode applies stricter assertion thresholds (gap_threshold_ms 2000 vs debug’s 3000, spread_threshold_pct 15% vs 35%), so tests that barely pass in debug may fail under --release.
  • --profile NAME — the cargo build profile for the scheduler-under-test (a discovered scheduler package). Defaults to release; pass --profile dev for a fast unoptimized scheduler build.
  • --nextest-profile NAME — the nextest test profile from .config/nextest.toml (retries, timeouts, output settings).

--relevant / --base / --base-ref / --default-branch — narrow the run to only the tests whose scheduler your working-tree change touches, against a baseline commit (merge-base with main by default). A broad or unattributable change fails safe and runs everything; a docs-only change runs nothing. The CI-matrix counterpart is affected; both are documented in CI.

test

Build the kernel (if needed) and run tests via cargo nextest run. Also available as cargo ktstr nextest (a clap alias). Arguments after -- are passed through to nextest:

cargo ktstr test --kernel ../linux                         # everything
cargo ktstr test --kernel 7.0 -- -E 'test(my_test)'        # nextest filter
cargo ktstr test --kernel 7.0 -- --retries 2               # nextest retries
cargo ktstr test --kernel 7.0 -- --features integration    # cargo features
cargo ktstr test --relevant                                # only tests my edits affect

A real single-test run (the 7.0 series was already built and cached, so resolution maps 7.0 to the latest patch release and reuses the cached image):

$ cargo ktstr test --kernel 7.0 -- --features integration -E 'test(=ktstr/failure_dump_renders_bss_fields)'
cargo ktstr: fetching latest 7.0.x kernel version
cargo ktstr: latest 7.0.x kernel: 7.0.14
cargo ktstr: resolved kernel "7.0"
...
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.22s
────────────
 Nextest run ID 98581174-246f-4824-a170-50992df166d7 with nextest profile: default
    Starting 1 test across 121 binaries (12531 tests skipped)
        PASS [  34.459s] (1/1) ktstr::failure_dump_e2e ktstr/failure_dump_renders_bss_fields
────────────
     Summary [  34.498s] 1 test run: 1 passed, 12531 skipped

cargo ktstr: test outputs
  ~/ktstr/target/ktstr/7.0.14-73730e0-dirty
    (1 stats sidecar(s), 0 wprof trace(s) written this run)

An immediate re-run of the same command took the same ~34 s — with the kernel cached, wall time is the test itself, not infrastructure. For a --kernel <path> source tree, a cache hit is announced on stderr as cargo ktstr: cache hit for {path} ({cache_key}, built {age} ago) and skips the build entirely.

Per-test exit codes

Each #[ktstr_test] process exits with one of three codes, so CI gates and dashboards can triage runs:

| Code | Verdict | Meaning |
| --- | --- | --- |
| 0 | Pass / Skip | Assertions passed, or the test never ran (host too small, resource contention, perf mode unavailable). Skips degrade to pass unless --no-skip-mode. |
| 1 | Fail | An assertion failed; an operator --cpu-cap the host cannot satisfy; a skip under --no-skip-mode; or expect_err = true and the test passed. |
| 2 | Inconclusive | A zero-denominator ratio gate could not evaluate: the workload produced no signal to ratio against. |

Exit code 2 is the silent-pass guard. Without it, a 0 / 0 ratio synthesized to 0.0 would pass a ≤ threshold gate and ship a false-green CI run. The harness records Inconclusive instead (see Checking) and projects it to a distinct exit code so external tooling can route the run separately from real regressions. The values are exported as EXIT_PASS / EXIT_FAIL / EXIT_INCONCLUSIVE in the prelude.
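
A minimal sketch of that projection, assuming a hypothetical `Outcome` enum and local constants that mirror the documented values (the prelude's actual constant types are not shown in this chapter, so `i32` is an assumption):

```rust
// Local stand-ins mirroring the documented exit code values.
const EXIT_PASS: i32 = 0;
const EXIT_FAIL: i32 = 1;
const EXIT_INCONCLUSIVE: i32 = 2;

/// Hypothetical verdict type for illustration.
#[derive(Debug, PartialEq)]
enum Outcome {
    Pass,
    Skip,
    Fail,
    Inconclusive,
}

/// A "ratio must be at most max_ratio" gate that refuses to evaluate x / 0.
fn ratio_gate(numerator: u64, denominator: u64, max_ratio: f64) -> Outcome {
    if denominator == 0 {
        // No signal to ratio against: neither a pass nor a fail.
        return Outcome::Inconclusive;
    }
    if numerator as f64 / denominator as f64 <= max_ratio {
        Outcome::Pass
    } else {
        Outcome::Fail
    }
}

/// Project an outcome to a process exit code.
fn exit_code(outcome: &Outcome, no_skip_mode: bool) -> i32 {
    match outcome {
        Outcome::Pass => EXIT_PASS,
        Outcome::Skip if no_skip_mode => EXIT_FAIL,
        Outcome::Skip => EXIT_PASS,
        Outcome::Fail => EXIT_FAIL,
        Outcome::Inconclusive => EXIT_INCONCLUSIVE,
    }
}
```

A CI wrapper can then treat 2 as "rerun or investigate the workload" instead of counting it as a regression.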

replay

Re-run only the tests that failed in the last session, from the result sidecars under target/ktstr/:

cargo ktstr replay              # print the nextest filter (dry-run)
cargo ktstr replay --exec       # actually run it
cargo ktstr replay -E starve    # narrow by test-name substring
cargo ktstr replay --dir PATH   # source sidecars from an archived tree

Dry-run is the default: the filter prints to stdout so you can inspect it (or paste it into CI) before committing to the re-run. --profile / --nextest-profile apply with --exec. Distinct from auto-repro, which fires inside the failing test process; replay is post-hoc, across a whole session.

coverage

Run tests with coverage via cargo llvm-cov nextest. Same kernel resolution, multi-kernel semantics, and common flags as test; arguments after -- pass through to cargo llvm-cov nextest:

cargo ktstr coverage --kernel ../linux
cargo ktstr coverage -- --workspace --lcov --output-path lcov.info

Requires cargo-llvm-cov and the llvm-tools-preview rustup component. Guest-side coverage is flushed over shared memory at VM exit and merged into the report automatically; multi-kernel runs merge every variant’s profraw into a single report.

Profraw files accumulate across runs — including host-side files that plain cargo ktstr test writes next to the cargo-ktstr binary (an LLVM_PROFILE_FILE injection that keeps default.profraw out of your kernel source tree; export LLVM_PROFILE_FILE yourself to opt out). To clean up:

cargo ktstr llvm-cov clean --profraw-only          # only *.profraw under target/llvm-cov-target/
rm -f target/debug/llvm-cov-target/default-*.profraw
rm -f target/release/llvm-cov-target/default-*.profraw

Avoid bare cargo ktstr llvm-cov clean (wipes reports too) and --workspace (also runs cargo clean).

llvm-cov

Raw passthrough to cargo llvm-cov for subcommands that don’t fit the coverage flow (report, clean, show-env), with the same kernel-resolution plumbing:

cargo ktstr llvm-cov report --lcov --output-path lcov.info

Always pass a subcommand: a bare cargo ktstr llvm-cov falls through to cargo test, which skips gauntlet variants and verifier cells entirely (they exist only under the nextest harness).

kernel

Manage cached kernel images: list, build, clean. The standalone ktstr kernel subcommands are identical.

How kernels are discovered

There are two flows, and they intentionally differ:

Explicit (--kernel ...): the full pipeline. For a source-tree path:

  1. Validate that it is a kernel tree.
  2. Look up the cache (clean git trees only; a cache hit short-circuits straight to tests).
  3. Auto-configure: make defconfig if needed, append ktstr’s kconfig fragment, then make olddefconfig.
  4. Build with make -j$(nproc) KCFLAGS=-Wno-error.
  5. Validate that the critical config options survived (CONFIG_SCHED_CLASS_EXT, CONFIG_DEBUG_INFO_BTF, CONFIG_BPF_SYSCALL, tracing options). The kernel build system silently drops options with unmet dependencies, and each failure gets a remediation hint.
  6. Generate compile_commands.json.
  7. Store the result in the cache.
  8. Run nextest with KTSTR_KERNEL (and KTSTR_KERNEL_LIST for multi-kernel) exported.

Dirty or non-git trees build in place, get a _dirty kernel label, and never cache. Version, range, and git identifiers download/fetch + build + cache on miss.

Implicit (no --kernel) — discovery only, no builds. The test framework checks KTSTR_TEST_KERNEL, then KTSTR_KERNEL, then the newest valid cache entry, then local build trees (./linux, ../linux, /lib/modules/$(uname -r)/build), then host images (/boot/vmlinuz*). Whatever pre-built image it finds is used as-is — nothing is compiled and nothing new lands in the cache.
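
The implicit lookup order reads naturally as a first-hit-wins chain. The sketch below is an assumption-laden illustration: `discover_kernel` is a hypothetical function, the environment, cache, and filesystem are passed in as parameters so the order is visible, and the /boot/vmlinuz* glob is represented by a pre-collected list.

```rust
/// Return the first kernel source found, in the documented priority order.
/// Hypothetical sketch; the real discovery also validates what it finds.
fn discover_kernel(
    env: impl Fn(&str) -> Option<String>,
    newest_cache_entry: Option<String>,
    exists: impl Fn(&str) -> bool,
    running_release: &str,
    boot_images: &[String],
) -> Option<String> {
    env("KTSTR_TEST_KERNEL")
        .or_else(|| env("KTSTR_KERNEL"))
        .or(newest_cache_entry)
        .or_else(|| {
            // Local build trees, checked in order.
            [
                "./linux".to_string(),
                "../linux".to_string(),
                format!("/lib/modules/{running_release}/build"),
            ]
            .into_iter()
            .find(|p| exists(p.as_str()))
        })
        // Host images last; the caller supplies the /boot/vmlinuz* matches.
        .or_else(|| boot_images.first().cloned())
}
```

Nothing in this chain builds or writes to the cache, which matches the discovery-only contract of the implicit flow.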

Tip

If you iterate on a local kernel tree, pass --kernel ../linux (or run cargo ktstr kernel build --kernel ../linux once) so the build is cached and reused. Implicit discovery will find the tree but never populate the cache for it.

kernel list

List cached kernels, newest first:

$ cargo ktstr kernel list
cache: ~/.cache/ktstr/kernels
  KEY                                              VERSION      SOURCE   ARCH    BUILT
  7.0.14-tarball-x86_64-kcabd40422                 7.0.14       tarball  x86_64  2026-07-04T23:06:12Z
  local-b4dc42d-x86_64-kcabd40422                  7.1.0        local    x86_64  2026-07-04T23:00:04Z
  local-5123e5a-x86_64-cfg1982ad42-kcabd40422      7.1.0        local    x86_64  2026-07-04T22:20:13Z
...
  local-c5d2724-x86_64-kcabd40422                  7.0.0        local    x86_64  2026-07-04T22:05:35Z
...
  7.0.13-tarball-x86_64-kc5f2631f0                 7.0.13       tarball  x86_64  2026-06-22T21:54:56Z (stale kconfig)
  7.1.1-tarball-x86_64-kc5f2631f0                  7.1.1        tarball  x86_64  2026-06-22T21:46:19Z (stale kconfig)
warning: entries marked (stale kconfig) were built against a different ktstr.kconfig. Rebuild with: kernel build --force --kernel <entry version> (add --extra-kconfig PATH if the entry also carries the (extra kconfig) tag).

(stale kconfig) marks entries built against a different ktstr kconfig fragment (they are rebuilt automatically when requested); (EOL) marks entries whose series has left kernel.org’s active releases list. --json emits the same data for scripting. With --kernel START..END, list switches to preview mode: it prints the versions the range expands to without downloading or building anything — the cheap answer to “what does 6.12..6.16 actually cover?”.

kernel build

Download (or use a source tree / git ref), build, and cache one kernel or a range:

cargo ktstr kernel build                        # latest stable from kernel.org
cargo ktstr kernel build --kernel 6.14.2        # specific version
cargo ktstr kernel build --kernel 6.12          # latest 6.12.x patch release
cargo ktstr kernel build --kernel 6.11..6.14    # every release in the range
cargo ktstr kernel build --kernel ../linux      # local source tree
cargo ktstr kernel build --force --kernel 6.14.2

With no --kernel, it builds the latest stable series that has had at least 8 maintenance releases — keeping CI off brand-new majors whose early builds are more likely to break. Already-cached entries are skipped unless --force.
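
One way to picture that default, assuming "at least 8 maintenance releases" means the series' latest patch number is 8 or higher (an interpretation, not something this page confirms), and using a hypothetical `default_build_target` helper:

```rust
/// Given the latest (major, minor, patch) of each stable series, pick the
/// newest series that has matured past `min_patch`.
/// Hypothetical sketch; the patch-number reading of the rule is an assumption.
fn default_build_target(
    latest_per_series: &[(u32, u32, u32)],
    min_patch: u32,
) -> Option<(u32, u32, u32)> {
    latest_per_series
        .iter()
        .copied()
        .filter(|&(_, _, patch)| patch >= min_patch)
        .max() // tuples compare lexicographically: major, then minor, then patch
}
```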

| Flag | Description |
| --- | --- |
| --force | Rebuild even if cached. |
| --clean | make mrproper first (source-tree builds only). |
| --cpu-cap N | Reserve exactly N host CPUs for the build (build parallelism and a cgroup sandbox follow). See Resource Budget. |
| --extra-kconfig PATH | Extra kconfig fragment merged over ktstr’s (user values win); lands in its own cache slot. |
| --skip-sha256 | Skip tarball checksum verification (emits a warning). |
| --include-eol | With a range, also build EOL series from the linux-stable mirror. |

A bare relative name is read as a cache key — prefix a relative source directory with ./.

kernel clean

cargo ktstr kernel clean --keep 3                 # keep the 3 most recent valid entries
cargo ktstr kernel clean --corrupt-only --force   # drop only broken entries

Corrupt entries (missing metadata or image) never consume a --keep slot. --force skips the confirmation prompt (required non-interactively).

verifier

Sweep every declare_scheduler!-registered scheduler through the real kernel’s BPF verifier across topologies, checking verify + attach + dispatch per cell:

cargo ktstr verifier --kernel 7.0
cargo ktstr verifier --scheduler scx-ktstr     # one scheduler
cargo ktstr verifier --raw                     # no cycle collapse

See BPF Verifier Sweep for the cell model, real output, and the kernels-filter contract.

shell

Shares the VM boot flow and flags with ktstr shell, plus two additions: --kernel also accepts raw image files (bzImage, Image), and --test NAME derives topology, memory, and include files from a registered #[ktstr_test] (mutually exclusive with --topology / --memory-mib; -i is additive):

cargo ktstr shell --test my_failing_test
cargo ktstr shell --kernel ./arch/x86/boot/bzImage

export

Export a registered test as a self-extracting .run file that reproduces the scenario on bare metal, no VM: the ktstr binary, the scheduler binary, and every declared include file, embedded in a bash preamble that validates root, sched_ext support, cgroup2, the no-other-scheduler-attached invariant, and topology compatibility before launching.

cargo ktstr export sched_basic_proportional -o /tmp/sched_basic_proportional.run
wrote /tmp/sched_basic_proportional.run (90074903 bytes archive, 0 include files)

The generated script opens with the frozen test specification and the preflight checks:

#!/bin/bash
# Generated by `cargo ktstr export`. Do not edit; regenerate to update.
set -euo pipefail

# --- frozen test specification ---
KTSTR_TEST_NAME=sched_basic_proportional
KTSTR_SCHED_NAME=ktstr_sched
KTSTR_GIT_HASH=73730e0
NEED_LLCS=1
NEED_CORES_PER_LLC=2
NEED_THREADS_PER_CORE=1
NEED_NUMA_NODES=1
TEST_DURATION_SECS=12
TEST_WATCHDOG_SECS=15

Scheduler choice, scheduler args, and topology are frozen at export time; --duration, --watchdog-timeout, and --quiet can be overridden at .run invocation. --package NAME scopes the workspace search (and disambiguates duplicate test names — otherwise the first matching binary in path order wins); --release builds the embedded binaries with the release profile to match how you will run them.

Not exportable (rejected with actionable errors): host_only tests, tests using bpf_map_write (they need the host-side probe surface), and KernelBuiltin schedulers.

completions

cargo ktstr completions bash >> ~/.local/share/bash-completion/completions/cargo
cargo ktstr completions zsh > ~/.zfunc/_cargo-ktstr
cargo ktstr completions fish > ~/.config/fish/completions/cargo-ktstr.fish

SHELL is one of bash, zsh, fish, elvish, powershell; --binary overrides the completion target (default cargo).

stats

Sidecar analysis for past runs: stats (analysis of the newest run), stats list (run table), stats list-metrics (the regression metric registry), stats list-values (distinct filter values in the pool), stats show-host --run ID (archived host context), and stats explain-sidecar --run ID (why optional fields are absent). cargo ktstr show-host prints the same host context live, and cargo ktstr show-thresholds TEST prints the merged assertion thresholds a test will run with. See Runs and Regression Gates.

perf-delta

Compare performance_mode test metrics between HEAD and a baseline commit, exiting non-zero when enough metrics regress to trip the gate:

cargo ktstr perf-delta --noise-adjust 5 --kernel 6.14

Baseline resolution, the --noise-adjust statistics, and the full flag set are documented in Runs and Regression Gates; the worked walkthrough is A/B Compare Branches.

affected

Emit the scheduler packages a base..HEAD diff affects, as a JSON array for a GitHub Actions dynamic matrix (["scx_lavd","scx_rusty"]). Attribution is fail-safe: any uncertainty widens to the full set, and only a strictly docs-only change emits []. --relevant is the same engine applied locally to narrow test / coverage / perf-delta runs. Both are documented with the complete CI workflow in CI.

locks

Enumerate every ktstr flock held on this host, read-only, naming holder PIDs and cmdlines — the troubleshooting companion when a run is stalled behind a peer’s reservation. Identical to ktstr locks, where the lock roots and real output are documented.

Gauntlet

Some scheduler bugs only exist on topologies you don’t develop on: a per-LLC work-splitting heuristic that breaks on an odd LLC count, an idle-core picker that lands both SMT siblings, a migration policy that never crosses a NUMA boundary. The gauntlet expands every #[ktstr_test] into one variant per topology preset — up to 24 presets (14 on aarch64) — so those bugs surface as a named, re-runnable test case instead of a production report.

Gauntlet variants are prefixed gauntlet/ and ignored by default:

# Run only base tests (default)
cargo ktstr test --kernel ../linux

# Run only gauntlet variants
cargo ktstr test --kernel ../linux -- --run-ignored ignored-only -E 'test(gauntlet/)'

# Run everything
cargo ktstr test --kernel ../linux -- --run-ignored all

# Run a single variant
cargo ktstr test --kernel ../linux -- --run-ignored ignored-only \
  -E 'test(=gauntlet/my_test/smt-2llc)'

This is what the expansion looks like when nextest lists a test with min_llcs = 1 and default constraints on this host:

ktstr::worktype_coverage_fork_gauntlet_e2e gauntlet/worktype_fork_gauntlet_covers_all_arms/medium-4llc
ktstr::worktype_coverage_fork_gauntlet_e2e gauntlet/worktype_fork_gauntlet_covers_all_arms/medium-4llc-nosmt
ktstr::worktype_coverage_fork_gauntlet_e2e gauntlet/worktype_fork_gauntlet_covers_all_arms/odd-3llc
ktstr::worktype_coverage_fork_gauntlet_e2e gauntlet/worktype_fork_gauntlet_covers_all_arms/smt-2llc
ktstr::worktype_coverage_fork_gauntlet_e2e gauntlet/worktype_fork_gauntlet_covers_all_arms/smt-3llc
ktstr::worktype_coverage_fork_gauntlet_e2e gauntlet/worktype_fork_gauntlet_covers_all_arms/tiny-1llc
ktstr::worktype_coverage_fork_gauntlet_e2e gauntlet/worktype_fork_gauntlet_covers_all_arms/tiny-2llc

Under a multi-kernel run a {kernel_label} segment is appended — gauntlet/{name}/{preset}/{kernel}. See Test names and variants for the label format.

Topology presets

Note

Multi-NUMA and scale-boundary presets are opt-in. The default constraints (max_numa_nodes = 1, max_llcs = 12, max_cpus = 192) exclude the five numa* presets plus near-max-llc, max-cpu, and their -nosmt variants — 15 of the 24 presets are active by default. Raise max_numa_nodes, max_llcs, or max_cpus on the test to opt in.

| Preset | Topology | CPUs | LLCs | NUMA | Description |
| --- | --- | --- | --- | --- | --- |
| tiny-1llc | 1n1l4c1t | 4 | 1 | 1 | Single LLC |
| tiny-2llc | 1n2l2c1t | 4 | 2 | 1 | Minimal multi-LLC |
| odd-3llc | 1n3l3c1t | 9 | 3 | 1 | Odd CPU count |
| odd-5llc | 1n5l3c1t | 15 | 5 | 1 | Prime LLC count |
| odd-7llc | 1n7l2c1t | 14 | 7 | 1 | Prime LLC count |
| smt-2llc | 1n2l2c2t | 8 | 2 | 1 | SMT enabled |
| smt-3llc | 1n3l2c2t | 12 | 3 | 1 | SMT, 3 LLCs |
| medium-4llc | 1n4l4c2t | 32 | 4 | 1 | Medium topology |
| medium-8llc | 1n8l4c2t | 64 | 8 | 1 | Medium, many LLCs |
| large-4llc | 1n4l16c2t | 128 | 4 | 1 | Large, few LLCs |
| large-8llc | 1n8l8c2t | 128 | 8 | 1 | Large, many LLCs |
| near-max-llc | 1n15l8c2t | 240 | 15 | 1 | Near maximum |
| max-cpu | 1n14l9c2t | 252 | 14 | 1 | Near KVM vCPU limit |
| medium-4llc-nosmt | 1n4l8c1t | 32 | 4 | 1 | Medium, no SMT |
| medium-8llc-nosmt | 1n8l8c1t | 64 | 8 | 1 | Medium, many LLCs, no SMT |
| large-4llc-nosmt | 1n4l32c1t | 128 | 4 | 1 | Large, no SMT |
| large-8llc-nosmt | 1n8l16c1t | 128 | 8 | 1 | Large, many LLCs, no SMT |
| near-max-llc-nosmt | 1n15l16c1t | 240 | 15 | 1 | Near maximum, no SMT |
| max-cpu-nosmt | 1n14l18c1t | 252 | 14 | 1 | Near KVM vCPU limit, no SMT |
| numa2-4llc | 2n4l4c1t | 16 | 4 | 2 | Multi-NUMA, 2 nodes |
| numa2-8llc | 2n8l8c2t | 128 | 8 | 2 | Multi-NUMA, 2 nodes, SMT |
| numa2-8llc-nosmt | 2n8l16c1t | 128 | 8 | 2 | Multi-NUMA, 2 nodes, no SMT |
| numa4-8llc | 4n8l4c1t | 32 | 8 | 4 | Multi-NUMA, 4 nodes |
| numa4-12llc | 4n12l8c2t | 192 | 12 | 4 | Multi-NUMA, 4 nodes, SMT |

Topology format: {numa_nodes}n{llcs}l{cores_per_llc}c{threads_per_core}t. For example, 1n2l4c2t is 1 NUMA node, 2 LLCs, 4 cores per LLC, 2 threads per core = 16 CPUs. Note that llcs is the total across the machine, not per node.
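
A small parser makes the format and the CPU arithmetic concrete. The `Topology` struct below is a hypothetical stand-in written for this example, not the crate's own type:

```rust
/// Parsed form of a topology string such as "1n2l4c2t" (illustrative type).
#[derive(Debug, PartialEq)]
struct Topology {
    numa_nodes: u32,
    llcs: u32, // total across the machine, not per node
    cores_per_llc: u32,
    threads_per_core: u32,
}

impl Topology {
    /// Split on the n / l / c / t markers in order; any missing marker or
    /// non-numeric field yields None.
    fn parse(s: &str) -> Option<Topology> {
        let (numa, rest) = s.split_once('n')?;
        let (llcs, rest) = rest.split_once('l')?;
        let (cores, rest) = rest.split_once('c')?;
        let threads = rest.strip_suffix('t')?;
        Some(Topology {
            numa_nodes: numa.parse().ok()?,
            llcs: llcs.parse().ok()?,
            cores_per_llc: cores.parse().ok()?,
            threads_per_core: threads.parse().ok()?,
        })
    }

    /// NUMA nodes do not multiply the count because llcs is already machine-wide.
    fn total_cpus(&self) -> u32 {
        self.llcs * self.cores_per_llc * self.threads_per_core
    }
}
```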

aarch64: ARM64 CPUs do not have SMT. Presets with threads_per_core > 1 are excluded on aarch64, leaving 14 presets (the 5 small presets, 6 -nosmt variants, and 3 non-SMT NUMA presets).

Constraint filtering

#[ktstr_test] topology constraints filter which presets a test runs on. A preset is skipped when any constraint is not met:

  • num_numa_nodes() < min_numa_nodes
  • max_numa_nodes is set and num_numa_nodes() > max_numa_nodes
  • num_llcs() < min_llcs
  • max_llcs is set and num_llcs() > max_llcs
  • requires_smt and threads_per_core < 2
  • total_cpus() < min_cpus
  • max_cpus is set and total_cpus() > max_cpus

See The #[ktstr_test] Attribute for the attribute table.
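
The skip conditions listed above can be sketched as one predicate. The `Preset` and `Constraints` structs are hypothetical shapes for illustration; only the field names that appear in the list are taken from this page.

```rust
/// A topology preset reduced to the values the constraints inspect (illustrative).
struct Preset {
    name: &'static str,
    numa_nodes: u32,
    llcs: u32,
    cpus: u32,
    threads_per_core: u32,
}

/// Topology constraints as a test might declare them (illustrative shape).
struct Constraints {
    min_numa_nodes: u32,
    max_numa_nodes: Option<u32>,
    min_llcs: u32,
    max_llcs: Option<u32>,
    requires_smt: bool,
    min_cpus: u32,
    max_cpus: Option<u32>,
}

impl Constraints {
    /// A preset runs only when none of the skip conditions holds.
    fn accepts(&self, p: &Preset) -> bool {
        // An unset maximum never excludes anything.
        let over = |max: Option<u32>, v: u32| max.map_or(false, |m| v > m);
        p.numa_nodes >= self.min_numa_nodes
            && !over(self.max_numa_nodes, p.numa_nodes)
            && p.llcs >= self.min_llcs
            && !over(self.max_llcs, p.llcs)
            && !(self.requires_smt && p.threads_per_core < 2)
            && p.cpus >= self.min_cpus
            && !over(self.max_cpus, p.cpus)
    }
}
```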

Authoring gauntlet-ready tests

Worked example

A test with min_llcs = 2, requires_smt = true, and default max_numa_nodes = 1 against the preset table above:

  • tiny-1llc (1 LLC): excluded — below min_llcs
  • All non-SMT presets (tiny-2llc, odd-*, *-nosmt): excluded — requires_smt
  • near-max-llc (15 LLCs): excluded — above default max_llcs = 12
  • max-cpu (252 CPUs, 14 LLCs): excluded — above default max_cpus = 192 (also above default max_llcs = 12)
  • All numa* presets: excluded — above default max_numa_nodes = 1

Result: 6 of 24 presets survive (smt-2llc, smt-3llc, medium-4llc, medium-8llc, large-4llc, large-8llc). On aarch64, none survive — all aarch64 presets lack SMT.

Variant count

The total number of gauntlet variants for a test is valid_presets × resolved_kernels: the 6 surviving presets above produce 6 variants under a single kernel and 12 under --kernel A --kernel B.

Tests that skip gauntlet

Entries with host_only = true never produce gauntlet variants — they run on the host without booting a VM, so topology variation carries no signal. Tests whose names start with demo_ are ignored by default, gauntlet variants included.

Operator notes

  • Wall time. Each variant boots its own VM and runs the full scenario, so a sweep costs roughly (surviving presets × the per-run wall time you observe for the base test). nextest runs variants in parallel within your host’s budget. For a coverage-per-second subset under a deadline, use budget-based selection.
  • Memory. Each gauntlet VM gets max(cpus × 64 MiB, 256 MiB, entry.memory_mib) of guest RAM (plus an initramfs-derived floor). For the 252-CPU max-cpu presets that is at least 16128 MiB — the host needs that much free memory to run the variant.
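
The memory rule is a three-way maximum, sketched here with a hypothetical `guest_ram_mib` helper. The initramfs-derived floor is left out because its value is not given on this page.

```rust
/// Guest RAM in MiB for a gauntlet VM: the largest of 64 MiB per CPU,
/// a 256 MiB floor, and the test's own memory_mib request.
/// Illustrative helper; the initramfs-derived floor is not modelled.
fn guest_ram_mib(cpus: u64, entry_memory_mib: u64) -> u64 {
    (cpus * 64).max(256).max(entry_memory_mib)
}
```

For the 252-CPU presets the per-CPU term dominates: 252 × 64 = 16128 MiB.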

BPF Verifier Sweep

A scheduler that loads on your machine can still be rejected — or attach and then wedge — on a topology you never booted. verified_insns varies with topology whenever topology-derived config like nr_cpus is baked into .rodata: the verifier sees different known constants, walks different branches, and can reach a different verdict. The verifier sweep boots every declared scheduler in a KVM VM across a range of topologies and checks three things against the real kernel: the BPF programs verify, the scheduler attaches as the active sched_ext scheduler, and it dispatches an injected workload.

The verifier that runs is the real verifier in the real target kernel — no host-side BPF loading, no version skew. And there is no subprocess to bpftool or veristat: the host reads per-program verified_insns directly from guest memory via bpf_prog_aux introspection, and applies cycle collapse to verifier logs instead of truncating them.

Quick start

# Every declared scheduler, kernel discovered via KTSTR_KERNEL / cache
cargo ktstr verifier

# Pin the kernel
cargo ktstr verifier --kernel ../linux

# Sweep across kernels (each cell runs against its own)
cargo ktstr verifier --kernel 6.14.2 --kernel 7.0

# One scheduler across topologies
cargo ktstr verifier --scheduler scx-ktstr

# Raw verifier log, no cycle collapse
cargo ktstr verifier --raw

See cargo-ktstr verifier for the flag list.

A healthy sweep

Four small cells, one scheduler, one kernel — each cell boots its own VM, loads the scheduler, and confirms attach + dispatch:

cargo ktstr verifier --kernel 7.0 --scheduler ktstr_sched --test kaslr_axis_e2e tiny-1llc tiny-2llc odd-3llc smt-2llc
cargo ktstr: resolved kernel "7.0"
cargo ktstr verifier: dispatching to nextest (verifier/ cells only) on 1 resolved kernel(s) forwarding to nextest: --test kaslr_axis_e2e tiny-1llc tiny-2llc odd-3llc smt-2llc
...
    Starting 4 tests across 1 binary (55 tests skipped)
        PASS [  12.406s] (1/4) ktstr::kaslr_axis_e2e verifier/ktstr_sched/kernel_7_0/odd-3llc
        PASS [  12.432s] (2/4) ktstr::kaslr_axis_e2e verifier/ktstr_sched/kernel_7_0/smt-2llc
        PASS [  12.656s] (3/4) ktstr::kaslr_axis_e2e verifier/ktstr_sched/kernel_7_0/tiny-1llc
        PASS [  12.929s] (4/4) ktstr::kaslr_axis_e2e verifier/ktstr_sched/kernel_7_0/tiny-2llc
────────────
     Summary [  12.929s] 4 tests run: 4 passed, 55 skipped

verifier verified_insns (per scheduler; rows: kernel, cols: BPF program, cell: range across topologies):

ktstr_sched:
 kernel      ktstr_dispatch  ktstr_dump  ktstr_dump_cpu  ktstr_dump_task  ktstr_enqueue  ktstr_exit  ktstr_exit_task  ktstr_init  ktstr_init_task  ktstr_select_cp  ktstr_yield
 kernel_7_0  102             81          13              70               74             25          419              2296        29077            39               8

verifier summary: 4 ✅  0 ❌  0 🇽
 topology   ktstr_sched
 odd-3llc   ✅
 smt-2llc   ✅
 tiny-1llc  ✅
 tiny-2llc  ✅

A cell in the verified_insns table shows a single number when the count is flat across topologies, lo..hi when it varies, and - when that program reported no stats on that kernel. In the grid, ✅ means the scheduler verified, attached, and dispatched on every kernel that ran the cell; ❌ means it failed on every kernel; 🇽 means mixed results across kernels (the 🇽 glyph renders inconsistently in some terminal fonts — the failing-combinations list below the grid is the authoritative record). This 4-cell sweep ran its VMs in parallel and finished in about 13 seconds of test time.
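
Both rendering rules are small enough to sketch directly. The function names are hypothetical and this is not the tool's rendering code; it only restates the rules above in executable form.

```rust
/// Render one verified_insns cell from the counts seen across topologies:
/// a single number when flat, lo..hi when it varies, "-" when there are no stats.
fn render_insns_cell(counts: &[u64]) -> String {
    match (counts.iter().min(), counts.iter().max()) {
        (Some(lo), Some(hi)) if lo == hi => lo.to_string(),
        (Some(lo), Some(hi)) => format!("{lo}..{hi}"),
        _ => "-".to_string(),
    }
}

/// Pick the grid glyph from per-kernel pass/fail results for one cell.
/// Assumes at least one kernel ran the cell.
fn grid_glyph(passed_per_kernel: &[bool]) -> &'static str {
    let passed = passed_per_kernel.iter().filter(|&&p| p).count();
    if passed == passed_per_kernel.len() {
        "✅"
    } else if passed == 0 {
        "❌"
    } else {
        "🇽"
    }
}
```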

What a cell checks

  1. Verify — inside the VM the scheduler loads its BPF programs; the target kernel’s verifier runs against them. The host reads per-program verified_insns from bpf_prog_aux via guest memory introspection. On load failure, libbpf’s verifier log is forwarded to the host.
  2. Attach (positive confirmation) — the guest confirms the scheduler process survived load and /sys/kernel/sched_ext/state reached enabled. The kernel sets enabled only after ops.init, per-task init, and switching eligible tasks to the sched_ext class, so this proves the scheduler is scheduling, not merely that its BPF loaded. Attach is confirmed only when the guest reaches its post-attach dispatch phase — a guest that vanishes early (e.g. a panic before any frame is emitted) fails rather than passing by default.
  3. Dispatch probe — the verifier VM has no #[ktstr_test] body, so it injects a SpinWait workload sized to the guest’s online CPUs, running as SCHED_EXT. A cell passes only when a worker makes forward progress after attach: a scheduler that attaches but never dispatches a runnable task is a distinct, worse failure the attach gate alone cannot catch.

Every cell boots with performance mode disabled (no_perf_mode) — verified_insns is perf-mode-independent, so cells share LLC reservations instead of serializing on them.

A real rejection

The fixture scheduler ships rejection knobs (see fixture knobs) precisely so this path stays exercised. Here --verify-loop plants an unrolled loop ending in a store through a null pointer: the verifier walks the loop, then rejects the store. Note the collapse markers. The loop body appears only for its first and last iterations instead of all eight:

cargo ktstr verifier --kernel 7.0 --scheduler ktstr_broken --test verifier_pipeline tiny-1llc
=== ktstr_broken | kernel kernel_7_0 | topology tiny-1llc ===

verifier
  scheduler: NOT ATTACHED — scheduler process exited during BPF load/startup

verifier --- verifier stats ---
  processed=186  states=7/7

verifier --- scheduler log ---
Global function ktstr_dispatch() doesn't return scalar. Only those are supported.
0: R1=ctx() R10=fp0
; if (crash) @ main.bpf.c:423
0: (18) r1 = 0xff5d3bb3000f60dc       ; R1=map_value(map=bpf_bpf.bss,ks=4,vs=280,off=220)
...
; volatile u32 acc = 0; @ main.bpf.c:450
37: (63) *(u32 *)(r10 -8) = r1        ; R1=0 R10=fp0 fp-8=mmmm0
--- 8x of the following 25 lines ---
; u64 t = bpf_ktime_get_ns(); @ main.bpf.c:453
38: (85) call bpf_ktime_get_ns#5      ; R0=scalar()
; acc += (u32)t; @ main.bpf.c:454
39: (61) r1 = *(u32 *)(r10 -8)        ; R1=0 R10=fp0 fp-8=mmmm0
...
--- 6 identical iterations omitted ---
; u64 t = bpf_ktime_get_ns(); @ main.bpf.c:453
171: (85) call bpf_ktime_get_ns#5     ; R0=scalar()
...
--- end repeat ---
190: (b7) r1 = 0                      ; R1=0
; *p = (int)acc; @ main.bpf.c:464
...
192: (63) *(u32 *)(r1 +0) = r2
R1 invalid mem access 'scalar'
processed 186 insns (limit 1000000) max_states_per_insn 0 total_states 7 peak_states 7 mark_read 0
...
verifier summary: 0 ✅  1 ❌  0 🇽
 topology   ktstr_broken
 tiny-1llc  ❌

failing combinations (scheduler / kernel / topology):
  ktstr_broken / kernel_7_0 / tiny-1llc
error: cargo nextest run exited with 100

The interleaved ; source line @ file:line comments name the C statement each instruction group came from — the offending store is *p = (int)acc; at main.bpf.c:464.

Cycle collapse

The kernel verifier unrolls loops, re-verifying each instruction with updated register state. A bounded 8-instruction loop verified 100 times produces 800 near-identical lines that differ only in register-state annotations; naive truncation loses the context you came for. Cycle collapse keeps the structure: first iteration (what the loop does), an omission count, last iteration (final state).

The algorithm normalizes lines by stripping register-state annotations (source comments are preserved as anchors), finds the most frequent normalized line to establish the cycle period (minimum period 5 lines, minimum 3 repetitions), verifies consecutive blocks match, and collapses — iterating up to 5 passes for nested loops. --raw skips all of this and prints the full log.
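A simplified, single-pass sketch of that idea follows. It is not ktstr's code: the function names are hypothetical, it assumes annotations follow a `;` on instruction lines, it assumes repeated iterations carry identical instruction indices, and it collapses only one cycle (no nested passes). The marker strings mirror the ones in the capture above.

```rust
use std::collections::HashMap;

const MIN_PERIOD: usize = 5; // minimum cycle length in lines (from the text)
const MIN_REPS: usize = 3; // minimum repetitions (from the text)

/// Drop the trailing register-state annotation from an instruction line.
/// Source comments (lines starting with ';') are kept whole as anchors.
fn normalize(line: &str) -> &str {
    let t = line.trim();
    if t.starts_with(';') {
        return t;
    }
    t.split(';').next().unwrap_or(t).trim_end()
}

/// Collapse the first repeating block found; return the log unchanged otherwise.
fn collapse(lines: &[&str]) -> Vec<String> {
    let keep = || lines.iter().map(|l| l.to_string()).collect::<Vec<_>>();
    let norm: Vec<&str> = lines.iter().copied().map(normalize).collect();

    // The most frequent normalized line anchors the period estimate.
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for n in &norm {
        *counts.entry(*n).or_insert(0) += 1;
    }
    let Some(anchor) = norm.iter().copied().max_by_key(|n| counts[n]) else {
        return keep();
    };
    let pos: Vec<usize> = (0..norm.len()).filter(|&i| norm[i] == anchor).collect();
    if pos.len() < MIN_REPS || pos[1] - pos[0] < MIN_PERIOD {
        return keep();
    }
    let period = pos[1] - pos[0];

    // Slide the block start left while the line still repeats one period later.
    let mut start = pos[0];
    while start > 0 && norm[start - 1] == norm[start - 1 + period] {
        start -= 1;
    }
    // Count consecutive blocks that match the first one.
    let block = &norm[start..start + period];
    let mut reps = 1;
    while start + (reps + 1) * period <= norm.len()
        && &norm[start + reps * period..start + (reps + 1) * period] == block
    {
        reps += 1;
    }
    if reps < MIN_REPS {
        return keep();
    }

    // Emit: prefix, header, first iteration, omission count, last iteration, suffix.
    let last = start + (reps - 1) * period;
    let mut out: Vec<String> = lines[..start].iter().map(|l| l.to_string()).collect();
    out.push(format!("--- {reps}x of the following {period} lines ---"));
    out.extend(lines[start..start + period].iter().map(|l| l.to_string()));
    out.push(format!("--- {} identical iterations omitted ---", reps - 2));
    out.extend(lines[last..last + period].iter().map(|l| l.to_string()));
    out.push("--- end repeat ---".to_string());
    out.extend(lines[last + period..].iter().map(|l| l.to_string()));
    out
}
```

The first and last iterations are emitted from the original lines, not the normalized ones, so the register-state annotations that differ between them survive the collapse.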

Matrix dimensions and filters

The sweep matrix is (declared scheduler × kernel × topology preset). Schedulers come from the declare_scheduler! registry (--scheduler NAME narrows to one; EEVDF and kernel-builtin declarations are skipped — no userspace binary to verify). Kernels come from the operator’s --kernel set; with no flag, one auto-discovered kernel is used. The topology axis is the set of gauntlet presets each scheduler’s constraints accept.

Each scheduler’s kernels = [...] declaration filters the operator-supplied kernel set:

  • kernels = [] (or omitted) — accepts every kernel-list entry.
  • Version specs ("6.14.2") — match entries whose label equals the version (raw or sanitized form).
  • Range specs ("6.14..6.16", "6.14..=6.16") — match entries whose version falls in the inclusive range.
  • Path / cache-key / git specs — match by sanitized-label equality.

# Scheduler declares kernels = ["6.14..6.16"]
# Operator passes 6.14.2, 6.15.0, 6.17.0 — the third is filtered out.
# Cells emitted per accepted preset:
#   verifier/<sched>/kernel_6_14_2/<preset>
#   verifier/<sched>/kernel_6_15_0/<preset>
cargo ktstr verifier --kernel 6.14.2 --kernel 6.15.0 --kernel 6.17.0

A cell whose kernel label matches nothing in the resolved set errors with a diagnostic naming the present labels — no silent fallback to an unrelated kernel.

Runtime: total cost is one VM boot per cell — schedulers × kernels × accepted presets. Cells run in parallel under nextest; the 4-cell example above cost ~13 s.
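To make the cost formula concrete, here is a small sketch that enumerates cell names for a sweep. It is an illustration under assumptions: `kernel_label` and `cell_names` are hypothetical helpers, the sanitization rule (non-alphanumerics become `_`) is inferred from labels such as kernel_6_14_2, and every scheduler is assumed to accept the same presets and kernels, which a real sweep need not do.

```rust
/// Assumed sanitization: every non-alphanumeric character becomes '_',
/// prefixed with "kernel_" (consistent with the kernel_6_14_2 labels above).
fn kernel_label(version: &str) -> String {
    let body: String = version
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() { c } else { '_' })
        .collect();
    format!("kernel_{body}")
}

/// One cell, and therefore one VM boot, per (scheduler, kernel, preset).
fn cell_names(schedulers: &[&str], kernels: &[&str], presets: &[&str]) -> Vec<String> {
    let mut cells = Vec::new();
    for s in schedulers {
        for k in kernels {
            for p in presets {
                cells.push(format!("verifier/{s}/{}/{p}", kernel_label(k)));
            }
        }
    }
    cells
}
```

One scheduler, two kernels, and two presets give four cells, so four VM boots.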

Fixture knobs

The scx-ktstr fixture scheduler ships two flags that make the rejection path testable on demand:

  • --fail-verify — sets a .rodata variable before scx_ops_load! that enables a store through a null pointer in ktstr_dispatch — the invalid access the verifier rejects.
  • --verify-loop — same rejection, preceded by an unrolled 8-iteration loop so the log exercises cycle collapse. It is deliberately not a while(1): the verifier’s infinite-loop analysis could keep scx_ops_load from returning within the host’s scheduler-attach poll.

Pass them via sched_args on a scratch declare_scheduler! — that is exactly how the rejection capture above was produced.

Reading Failure Output

When a test fails, everything ktstr knows lands in the test’s stderr as one bundle: the violated thresholds, the workload statistics, a phase timeline, the scheduler’s own log, the monitor’s summary, and — for scheduler crashes — the kernel’s sched_ext dump and an auto-repro trail. This page walks the sections in the order they appear, using two real failures.

A check-gate failure

This test set an iteration-rate floor its (deliberately slowed) scheduler could not meet. The first line names the test, scheduler, and topology; the indented lines under it are the violated checks:

cargo ktstr test --kernel 7.0 -- --features integration -E 'test(=ktstr/throughput_gate)'
  TRY 1 FAIL [  31.810s] (───) ktstr::docs_demo ktstr/throughput_gate
  stderr ───
...
    ktstr_test 'throughput_gate' [sched=scx-ktstr] [topo=1n1l2c1t] failed:
      worker 71 iteration rate 41903.3/s below floor 50000000.0/s
      worker 73 iteration rate 37834.5/s below floor 50000000.0/s

    --- stats ---
    2 workers, 4 cpus, 2 migrations, worst_spread=0.0%, worst_gap=21ms
      cg0: workers=1 cpus=2 spread=0.0% gap=10ms migrations=1 iter=209600
      cg1: workers=1 cpus=2 spread=0.0% gap=21ms migrations=1 iter=189252

    --- timeline ---
    topology: 1n1l2c1t (2 cpus)  scheduler: my_sched  scenario: throughput_gate  duration: 15.0s

    Phase 1: StepStart[0] ops=0 (4960ms, 0 samples):
      imbalance: avg=1.2 max=5.0 | dsq: avg=0 max=0 | nr_run: avg=1.0 | fallback: 0/s | keep_last: 38/s | throughput: 79697 iter/s (stimulus-derived)
      per-cgroup:
        cg_a: off-cpu avg=0.3% min=0.3% max=0.3% spread=0.0% | run-delay mean=915µs worst=915µs | iters=209600 migrations=1 | gap=10ms@cpu0
        cg_b: off-cpu avg=9.0% min=9.0% max=9.0% spread=0.0% | run-delay mean=5654µs worst=5654µs | iters=189252 migrations=1 | gap=21ms@cpu0
      >>> StepStart[0]: ops=0 (2 cgroups, 2 workers)

Reading it:

  • Header + details — the checks that failed, one line each, with the observed value and the threshold in the same line. The detail line is the verdict; everything below is context.
  • --- stats --- — the per-run roll-up: worker count, distinct CPUs touched, migrations, worst per-cgroup spread and off-CPU gap. cg{i} is the positional index of the cgroup in this roll-up, not its name — it lines up with the per-cgroup rows underneath.
  • --- timeline --- — one block per scenario phase with monitor averages and per-cgroup detail (named cgroups here). In this run, cg_b’s 9% off-CPU time and 5.7 ms mean run-delay against cg_a’s 0.3% and 0.9 ms is the asymmetry the failed rate gate traces back to.

Two more sections follow every failure when they have content:

    --- scheduler log ---
    libbpf: struct_ops ktstr_ops: member sub_attach not found in kernel, skipping it as it's set to zero
...

    --- monitor ---
    samples=41 max_imbalance=2.00 max_dsq_depth=0 stuck=0
    avg: imbalance=1.32 nr_running/cpu=1.2 dsq/cpu=0.0
    events: fallback=0 (0.0/s) keep_last=210 (52.5/s) offline=0
    events+: refill_slice_dfl=210
    schedstat: csw=586 (146/s) run_delay=381246314ns/s ttwu=204 goidle=1
    bpf: ktstr_select_cp cnt=189 145ns/call
    bpf: ktstr_enqueue cnt=373 34ns/call
    bpf: ktstr_dispatch cnt=584 237ns/call
    verdict: monitor OK

The scheduler log is whatever the scheduler binary printed (libbpf noise included). The monitor block is the host-side observer’s summary — see Monitor for what each line means. verdict: monitor OK here says the monitor’s checks passed; the test still failed on the worker-side rate gate. The two channels are independent.

A scheduler crash

When the scheduler itself dies, the trail grows: a BUG SUMMARY line, a --- diagnostics --- section for the run stage, the kernel’s sched_ext debug dump, and an auto-repro section. From a real scx_bpf_error crash (triggered on purpose by a demo test):

BUG SUMMARY: scx_bpf_error (src/bpf/main.bpf.c:424: ktstr: host-triggered crash)
ktstr_test 'bpf_crash_auto_repro_e2e' [sched=scx-ktstr] [topo=1n1l4c1t] failed:
  scheduler process died unexpectedly during workload (2.2s into test)

--- stats ---
4 workers, 0 cpus, 0 migrations, worst_spread=0.0%, worst_gap=0ms
  cg0: workers=4 cpus=0 spread=n/a gap=0ms migrations=0 iter=0

--- diagnostics ---
stage: payload started but produced no test result
exit_code=1

BUG SUMMARY is the one-line cause, extracted from the kernel’s “triggered exit kind” line or from the scheduler log. The --- diagnostics --- stage line tells you how far the run got before dying: here the workload started but never reported, because the scheduler died under it.

The scheduler-log section carries the kernel’s full debug dump for scheduler exits — exit kind, backtrace, and (if the scheduler implements ops.dump) its own state:

--- scheduler log ---
...
DEBUG DUMP
================================================================================

swapper/3[0] triggered exit kind 1025:
  scx_bpf_error (src/bpf/main.bpf.c:424: ktstr: host-triggered crash)

Backtrace:
  scx_exit+0x50/0x70
  scx_bpf_error_bstr+0x78/0x90
  bpf_prog_1fed99378f3a8055_ktstr_dispatch+0x4d/0x1cb
  bpf__sched_ext_ops_dispatch+0x4b/0xa7
  do_pick_task_scx+0x379/0x770
  __schedule+0x5ca/0xfc0
...
ktstr scheduler state:
  stall=0 crash=1 degrade_rt=0

A --- sched_ext dump --- section repeats the same dump as captured from the kernel trace channel, and --- auto-repro --- reports the second VM’s replay of the crash:

--- auto-repro ---
--- probe pipeline ---
  extracted:   10 functions from crash backtrace
  traceable:   7 passed, 3 dropped: bpf_prog_1fed99378f3a8055_ktstr_dispatch, bpf__sched_ext_ops_dispatch, ret_from_fork_asm
...
repro VM duration: 16.9s

See Auto-Repro for how to read the probe pipeline and its output.

Detail-line catalog

The worker-side checks emit a fixed set of detail-line shapes (each format string is pinned by a unit test, so these stay accurate):

  • worker {N} iteration rate {R}/s below floor {F}/s — a benchmark rate gate failed.
  • tid {N} starved (0 work units) — a worker made no progress at all (not_starved).
  • tid {N} stuck {X}ms on cpu{C} at +{T}ms (threshold {N}ms) — a worker’s longest off-CPU gap crossed max_gap_ms.
  • unfair cgroup: spread={P}% ({lo}-{hi}%) {N} workers on {N} cpus (threshold {P}%) — per-cgroup fairness exceeded max_spread_pct.

See Checking for the model behind these and the monitor-side violations.
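Because the shapes are fixed, tooling that scrapes stderr can match them token by token. The sketch below parses only the rate-gate line. `RateFloor` and `parse_rate_floor` are hypothetical names for illustration and are not part of ktstr.

```rust
/// Parsed form of a "worker {N} iteration rate {R}/s below floor {F}/s" line.
/// (Hypothetical type; not part of ktstr.)
#[derive(Debug, PartialEq)]
struct RateFloor {
    worker: u32,
    rate: f64,
    floor: f64,
}

/// Return Some(..) only when the line has exactly the rate-gate shape.
fn parse_rate_floor(line: &str) -> Option<RateFloor> {
    let tok: Vec<&str> = line.split_whitespace().collect();
    match tok.as_slice() {
        ["worker", n, "iteration", "rate", r, "below", "floor", f] => Some(RateFloor {
            worker: n.parse().ok()?,
            rate: r.strip_suffix("/s")?.parse().ok()?,
            floor: f.strip_suffix("/s")?.parse().ok()?,
        }),
        _ => None,
    }
}
```

Leading indentation is ignored, so the indented detail lines from the failure header parse as they appear in stderr.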

Artifacts on disk

After the run, cargo ktstr prints where everything landed:

cargo ktstr: test outputs
  ~/ktstr/target/ktstr/7.0.14-73730e0-dirty
    FAILED  throughput_gate  [my_sched 1n1l2c1t]
      failure dump  ~/ktstr/target/ktstr/7.0.14-73730e0-dirty/throughput_gate-2ecd2624f3df7276.failure-dump.json
      stats         ~/ktstr/target/ktstr/7.0.14-73730e0-dirty/throughput_gate-2ecd2624f3df7276.ktstr.json
      replay        cargo ktstr replay --filter throughput_gate --exec

Every failed test writes a {test_name}-{variant_hash:016x}.failure-dump.json next to its result sidecar in the run directory (see Runs for the directory semantics). Auto-repro runs write a sibling .repro.failure-dump.json with the repro VM’s own snapshot. The path is pre-cleared at each dispatch, so a passing rerun never leaves a stale dump behind.
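The naming scheme can be reproduced with a single format string. In this sketch `failure_dump_name` is a hypothetical helper, the hash is assumed to be a 64-bit value (16 hex digits), and the repro sibling is assumed to share the same stem.

```rust
/// Build the failure-dump file name for a test variant.
/// Assumptions: variant_hash is a u64, and the repro dump reuses the stem.
fn failure_dump_name(test_name: &str, variant_hash: u64, repro: bool) -> String {
    let suffix = if repro {
        ".repro.failure-dump.json"
    } else {
        ".failure-dump.json"
    };
    // {:016x} zero-pads the hash to 16 lowercase hex digits.
    format!("{test_name}-{variant_hash:016x}{suffix}")
}
```

The fixed-width padding keeps file names aligned and unambiguous even when the hash has leading zeros.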

The dump comes in two shapes. When the scheduler attached and its exit path triggered, it is a full post-mortem: BPF map contents with BTF-typed field names, per-vCPU registers, per-program runtime stats — the JSON form of what Snapshots renders. When the failure happened before the BPF probe attached, a placeholder is written instead, and says so:

{
    "schema": "single",
    "maps": [],
    "sdt_alloc_unavailable": "test failed at stage `payload started but produced no test result`; no BPF state captured (probe did not attach before failure)",
    ...
    "is_placeholder": true
}

The JSON is for tooling that walks the run directory; humans should read the stderr — the actionable diagnostics are the BUG SUMMARY line and the --- sched_ext dump --- section.

Investigation workflow

  1. Read the header and detail lines — they name the check and the margin by which it failed.

  2. For check failures, correlate against --- stats --- and --- timeline ---: which cgroup, which phase, migrations or gaps?

  3. For crashes, start from BUG SUMMARY and the backtrace in the debug dump, then read the auto-repro trail for the state on the way to the exit.

  4. Re-run exactly the failing variant — including the gauntlet preset segment if it was a gauntlet case (see test name shapes):

    cargo ktstr test --kernel 7.0 -- -E 'test(=gauntlet/my_test/smt-2llc)'
    

    or re-run everything that failed last session with cargo ktstr replay.

  5. Poke at the same environment interactively: cargo ktstr shell --test my_test boots a VM with the test’s topology, memory, and include files (see ktstr shell).

Verbosity knobs

  • RUST_BACKTRACE=1 — Rust panic backtraces, plus ktstr’s verbose mode: the full guest kernel console is appended to --- diagnostics --- on failure, the auto-repro VM’s console is forwarded live, and the guest boots with loglevel=7.
  • RUST_LOG=ktstr=debug — host-side tracing (probe attach reasons, libbpf errors).
  • --dmesg — streams the guest kernel console in real time; it is a flag on cargo ktstr shell / ktstr shell, not on test.

Environment errors (kernel not found, cgroup controllers missing, flock timeouts) are cataloged in Troubleshooting.

Auto-Repro

The crash log tells you where the scheduler died; auto-repro tells you what the state was on the way there. When a test fails because the scheduler crashed or exited, ktstr boots a second VM, reruns the scenario with BPF probes attached to the functions from the crash backtrace, and prints each probed call with decoded arguments and struct fields — entry and exit values side by side. The trail appears in the --- auto-repro --- section of the failure output (see Reading Failure Output); the end-to-end debugging story is the Investigate a Crash recipe.

Example output

The probe dump shows each function with decoded fields and source locations (DWARF for kernel functions, BPF line info for callbacks). Where fexit captured post-mutation state, changed fields show an arrow between entry and exit values:

cargo ktstr test — auto-repro trail after a scheduler crash
ktstr_test 'bpf_crash_auto_repro_e2e' [sched=scx-ktstr] [topo=1n1l4c1t] failed:
  scheduler process died unexpectedly during workload (2.0s into test)

--- auto-repro ---
=== AUTO-PROBE: scx_exit fired ===

  ktstr_enqueue                                                   main.bpf.c:21
    task_struct *p
      pid         97
      cpus_ptr    0xf(0-3)
      dsq_id      SCX_DSQ_INVALID
      enq_flags   NONE
      slice       0
      vtime       0
      weight      100
      sticky_cpu  -1
      scx_flags   QUEUED|ENABLED
  do_enqueue_task                                               kernel/sched/ext.c
    rq *rq
      cpu         1
    task_struct *p
      pid         97
      cpus_ptr    0xf(0-3)
      dsq_id      SCX_DSQ_INVALID          →  SCX_DSQ_LOCAL
      enq_flags   NONE
      slice       20000000
      vtime       0
      weight      100
      sticky_cpu  -1
      scx_flags   QUEUED|DEQD_FOR_SLEEP    →  QUEUED

Reading it: the task entered the scheduler’s enqueue callback with dsq_id = SCX_DSQ_INVALID (on no dispatch queue) and an expired slice. By the time do_enqueue_task returned, the task sat on the local DSQ (SCX_DSQ_INVALID → SCX_DSQ_LOCAL) with a refilled default slice (20000000 ns), and the DEQD_FOR_SLEEP flag had been cleared. That is a healthy enqueue path — captured at the moment scx_exit fired, so you can see exactly what the scheduler did with its last tasks before the error.

After the probe data, the section appends the repro VM’s wall time and, when non-empty, the last lines of its scheduler log, sched_ext dump, failure-dump JSON, and dmesg.

Enabling it — and what it costs

Auto-repro is on by default for every #[ktstr_test] with a scheduler. Opt out per test:

#[ktstr_test(scheduler = MY_SCHED, auto_repro = false)]
fn my_test(ctx: &Ctx) -> Result<AssertResult> { ... }

It fires only when the primary run fails, and it is disabled automatically when expect_err = true (no point probing a deliberately failing test). The cost is a second VM boot plus a full scenario rerun — in the captured demo below, the repro VM added about 17 seconds.

How it works

  1. Stack extraction — function names are parsed from the crash trace in the scheduler log or kernel console. BPF program symbols (bpf_prog_*) are recognized and their short names extracted; generic frames (spinlocks, syscall entry, sched_ext exit machinery, trampolines) are filtered out.
  2. BPF discovery — in the repro VM, loaded struct_ops programs are discovered and added to the probe list along with their kernel-side callers (e.g. enqueue → do_enqueue_task), so the pipeline still probes something when the crash produced no extractable stack.
  3. BTF resolution — signatures come from vmlinux BTF and program BTF; known structs (task_struct, rq, dispatch queues) have curated fields resolved to offsets, and other struct pointers get scalar/enum/cpumask fields auto-discovered.
  4. Probed rerun — the second VM reruns the scenario with kprobes on kernel entry, fentry/fexit on BPF callbacks and kernel exits, and a one-shot trigger on the sched_ext_exit tracepoint that fires at the moment the exit is claimed.
  5. Stitching — events are filtered to the task that triggered the exit, sorted by timestamp, and rendered with decoded values.
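The stitching step in particular reduces to a filter and a sort. The sketch below shows that shape only. `ProbeEvent` and its fields are hypothetical stand-ins for whatever the real pipeline records.

```rust
/// One captured probe hit. (Hypothetical shape for illustration.)
#[derive(Debug, Clone, PartialEq)]
struct ProbeEvent {
    ts_ns: u64,
    pid: u32,
    func: &'static str,
}

/// Keep only events from the task that triggered the exit, in time order.
fn stitch(mut events: Vec<ProbeEvent>, exit_pid: u32) -> Vec<ProbeEvent> {
    events.retain(|e| e.pid == exit_pid);
    events.sort_by_key(|e| e.ts_ns);
    events
}
```

Filtering before sorting keeps the sort proportional to the one task's events instead of everything the probes captured.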

If the primary VM failed before the scheduler ever attached or the workload ever ran, the repro has nothing to reproduce. The framework prepends a PRIMARY DID NOT REACH WORKLOAD label to the repro verdict so you chase the primary’s startup failure (see its --- diagnostics --- and --- timeline --- sections) instead of reading the repro as evidence.

Kernel requirement

The probe trigger needs the sched_ext_exit tracepoint, which is currently only in the sched_ext for-7.2 development branch — no released stable kernel has it. On a kernel without it, the rest of the pipeline still runs — the crash call chain is extracted and probes are prepared — but the trigger cannot attach and no events are captured. The --- probe pipeline --- block says exactly that; this is the shape to recognize:

--- auto-repro ---
--- probe pipeline ---
  extracted:   10 functions from crash backtrace
  traceable:   7 passed, 3 dropped: bpf_prog_1fed99378f3a8055_ktstr_dispatch, bpf__sched_ext_ops_dispatch, ret_from_fork_asm
  bpf_discover: 0 programs found
  after_expand: 7 total probe targets
  kprobes:     0 attached
  trigger:     attach failed (skeleton load (retry): No such process (os error 3); original error before retry: No such process (os error 3))
  probe_data:  0 keys, 0 unmatched IPs
  events:      0 captured, 0 after stitch

repro VM duration: 16.9s

The diagnostic tails (repro VM sched_ext dump, dmesg) are still appended, so the repro run remains useful as a crash-reproduction check even without probe events.

Example test

bpf_crash_auto_repro_e2e in ktstr’s tests/scenario_coverage.rs drives the path end to end: a host-side BPF map write sets the fixture scheduler’s crash global, the scheduler calls scx_bpf_error, and the auto-repro VM replays it.

Runs and Regression Gates

Every test run writes machine-readable results — one JSON sidecar per test, grouped into a run directory keyed by (kernel, project commit). That makes “did my change regress anything?” a one-command question: cargo ktstr perf-delta pairs two commits’ sidecars and fails the build when metrics regress past their gates.

The workflow

  1. Run tests — each invocation writes sidecars into target/ktstr/{kernel}-{project_commit}/:

    cargo ktstr test --kernel 6.14
    
  2. List runs:

    $ cargo ktstr stats list
     RUN                   TESTS  DATE                  ARCH
     7.0.14-73730e0-dirty  1      2026-07-04T23:28:34Z  x86_64
     7.1.0-73730e0-dirty   0      -                     -
     7.1.1-73730e0-dirty   0      -                     -
     7.0.0-73730e0-dirty   1      2026-07-04T22:16:24Z  x86_64
     7.1.0-73730e0         6      2026-07-04T21:41:43Z  x86_64
    

    Rows sort by directory mtime, most recent first. DATE is the run’s first sidecar timestamp; ARCH comes from the first sidecar with host context (- when none has one).

  3. Gate a change against a baseline commit:

    cargo ktstr perf-delta --noise-adjust 5 --kernel 6.14         # HEAD vs merge-base(HEAD, main)
    cargo ktstr perf-delta --base abc1234                         # vs an explicit commit, cached sidecars
    cargo ktstr perf-delta --noise-adjust 5 --kernel 6.14 -E cgroup_steady   # narrow the perf set
    

    The canonical WIP-vs-baseline pattern: run perf-delta --base abc1234 from a -dirty working tree against the clean commit you edited from.

  4. Print analysis of the most recent run (gauntlet outliers, BPF verifier stats, callback profile, KVM stats):

    cargo ktstr stats
    
  5. Inspect a run’s archived host context (the fingerprint the host-delta comparison uses — CPU identity, THP policy, sched sysctls):

    cargo ktstr stats show-host --run 6.14-abc1234
    

perf-delta

perf-delta compares performance_mode test metrics between HEAD and a baseline commit, per scenario, using the metric registry’s polarity and thresholds (enumerate it with cargo ktstr stats list-metrics; see Assertable Metrics). Output is one row per compared metric with the baseline and HEAD values, colored red for regressions and green for improvements. The command exits non-zero when the failure gate trips: either the number of regressed metrics reaches --fail-threshold (5 by default, so a lone noisy regression does not flip CI red), or any metric named in --must-fail regresses. If the baseline produced no performance_mode sidecars at all, it prints a notice and exits 0: an empty perf set is “nothing to compare”, not a failure.
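The exit decision described above is a two-part predicate. A minimal sketch, assuming regressed metrics are available as a list of names (`gate_trips` and the metric names used with it are hypothetical):

```rust
/// True when perf-delta should exit non-zero.
/// `regressed` lists the metrics that regressed past their gates;
/// `fail_threshold` mirrors --fail-threshold and `must_fail` mirrors --must-fail.
fn gate_trips(regressed: &[&str], fail_threshold: usize, must_fail: &[&str]) -> bool {
    regressed.len() >= fail_threshold
        || regressed.iter().any(|m| must_fail.contains(m))
}
```

With the default threshold of 5, two regressed metrics pass the gate unless one of them is also listed in --must-fail.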

Baseline resolution (highest precedence first):

  1. --base <commit> — compare HEAD directly against this commit-ish, no merge-base.
  2. --base-ref <ref> — compare against merge-base(HEAD, <ref>).
  3. $GITHUB_BASE_REF (set on pull_request events) — compare against merge-base(HEAD, origin/<ref>).
  4. Otherwise merge-base(HEAD, main) (override the branch with --default-branch).

The resolved baseline is shortened to the 7-hex form sidecars record, and the command bails if it resolves to HEAD (nothing to compare).
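The precedence order reads naturally as a chain of early returns. The sketch below models only the decision, not the git calls. `Baseline`, `resolve_baseline`, and `short7` are hypothetical names, and the caller is assumed to pass `None` for unset flags and environment variables.

```rust
/// What perf-delta would compare HEAD against. (Hypothetical type.)
#[derive(Debug, PartialEq)]
enum Baseline {
    /// Compare directly against this commit-ish, no merge-base.
    Commit(String),
    /// Compare against merge-base(HEAD, <ref>).
    MergeBaseWith(String),
}

/// Apply the precedence: --base, --base-ref, $GITHUB_BASE_REF, default branch.
fn resolve_baseline(
    base: Option<&str>,
    base_ref: Option<&str>,
    github_base_ref: Option<&str>,
    default_branch: &str,
) -> Baseline {
    if let Some(c) = base {
        return Baseline::Commit(c.to_string());
    }
    if let Some(r) = base_ref {
        return Baseline::MergeBaseWith(r.to_string());
    }
    if let Some(r) = github_base_ref {
        return Baseline::MergeBaseWith(format!("origin/{r}"));
    }
    Baseline::MergeBaseWith(default_branch.to_string())
}

/// Shorten a resolved hex commit to the 7-hex form sidecars record.
fn short7(hex: &str) -> &str {
    &hex[..hex.len().min(7)]
}
```

An explicit --base wins even when --base-ref is also given, which is why it is checked first.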

Two ways to get the baseline’s numbers:

  • Cached (default) — both sides’ sidecars must already be in the pool (a prior run, or a CI artifact you downloaded). perf-delta only resolves the pair and compares, applying --threshold PCT (uniform gate) or --policy PATH (per-metric JSON) over the registry defaults.
  • --noise-adjust N (requires --kernel, N ≥ 2) — produces both sides fresh: it checks the baseline and HEAD out into scratch checkouts, runs each side’s performance_mode tests N times, and gates on the observed spread. A regression counts only when the two sides are separated (a two-sided Welch t-test at α = 0.05, or fully disjoint min–max bands) and material (past the registry’s absolute + relative dual gate). This is the mode to trust on a noisy machine; budget N × per-side wall time for it.
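For readers who want to see what "separated" means numerically, the sketch below computes the Welch t statistic, its Welch-Satterthwaite degrees of freedom, and the disjoint-band check for two sample sets. It is a generic statistics sketch, not ktstr's code: the function names are hypothetical, and the final comparison of |t| against the critical value for α = 0.05 is left out because it needs a t-distribution table or library.

```rust
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Unbiased sample variance; needs at least two samples (hence N >= 2).
fn sample_var(xs: &[f64]) -> f64 {
    let m = mean(xs);
    xs.iter().map(|x| (x - m).powi(2)).sum::<f64>() / (xs.len() as f64 - 1.0)
}

/// Welch t statistic and Welch-Satterthwaite degrees of freedom.
fn welch(a: &[f64], b: &[f64]) -> (f64, f64) {
    let va = sample_var(a) / a.len() as f64;
    let vb = sample_var(b) / b.len() as f64;
    let t = (mean(b) - mean(a)) / (va + vb).sqrt();
    let df = (va + vb).powi(2)
        / (va.powi(2) / (a.len() as f64 - 1.0) + vb.powi(2) / (b.len() as f64 - 1.0));
    (t, df)
}

/// True when the min-max bands of the two sides do not overlap at all.
fn bands_disjoint(a: &[f64], b: &[f64]) -> bool {
    let min = |xs: &[f64]| xs.iter().copied().fold(f64::INFINITY, f64::min);
    let max = |xs: &[f64]| xs.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    max(a) < min(b) || max(b) < min(a)
}
```

For baseline samples 10, 12, 14 and HEAD samples 20, 22, 24 this gives t ≈ 6.12 with 4 degrees of freedom, well past the two-sided 0.05 critical value of about 2.78, and the bands are disjoint as well.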

perf-delta compares on the commit axis. A cross-config question — scheduler A vs scheduler B at the same commit — is answered in-test (see Compare a Scheduler vs EEVDF); the worked A/B walkthrough with real gates is A/B Compare Branches, and the CI perf-gate job lives in CI.

Run directories

target/
└── ktstr/
    ├── 6.14-abc1234/        # kernel 6.14, project commit abc1234 (clean)
    │   ├── test_a.ktstr.json
    │   └── test_b.ktstr.json
    └── 7.0-def5678-dirty/   # kernel 7.0, commit def5678 + uncommitted changes
        ├── test_a.ktstr.json
        └── test_b.ktstr.json

The key is {kernel}-{project_commit}: the resolved kernel version, plus the project tree’s HEAD short hex, suffixed -dirty when the worktree differs from HEAD.
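As a sketch, the key composition looks like this. `run_key` is a hypothetical helper, and the `unknown` slot for trees outside git follows the behavior described under Environment notes.

```rust
/// Compose the run-directory key from the resolved kernel version,
/// the project's short commit (None outside a git repository),
/// and whether the worktree differs from HEAD.
fn run_key(kernel: &str, commit: Option<&str>, dirty: bool) -> String {
    match commit {
        Some(c) if dirty => format!("{kernel}-{c}-dirty"),
        Some(c) => format!("{kernel}-{c}"),
        None => format!("{kernel}-unknown"),
    }
}
```

Two runs share a directory exactly when this key is equal, which is what makes same-key reruns overwrite each other.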

The commit is discovered from the test process’s working directory — for a scheduler crate using ktstr as a dev-dependency, that is the scheduler crate’s commit, not ktstr’s. Run from whichever clone you want the run keyed on.

Warning

A run directory is a last-writer-wins snapshot, not an archive. Re-running the suite at the same kernel and project commit pre-clears the prior sidecars at the new run’s first write. To preserve a run, move the directory out of the runs root (mv target/ktstr/6.14-abc1234 ~/ktstr-archives/...) — a sibling inside target/ktstr/ would still be walked by stats list — or commit your changes so the next run lands under a new key.

Pre-clear is shallow: only *.ktstr.json files at the top level are removed. Subdirectories created by external orchestrators (per-job gauntlet layouts) are left alone but still read by stats, so clean those yourself when reusing them.

Inspecting sidecars

Each sidecar records the test name, topology, scheduler, work type, verdict, per-cgroup stats, monitor summary, verifier and KVM stats, kernel version, host context, and timestamps. Discovery tooling:

  • cargo ktstr stats list-values — the distinct values per filterable dimension (kernel, commit, scheduler, topology, work type, …) across the pool. Run it first to answer “what have I got?” before narrowing a perf-delta.

  • cargo ktstr stats list-metrics — the regression metric registry (names, polarity, default gates, units).

  • cargo ktstr stats explain-sidecar --run ID — why optional fields are absent, per sidecar, with a fix when one exists:

    walked 1 sidecar file(s), parsed 1 valid
    
    test: throughput_gate
      topology: 1n1l2c1t
      scheduler: my_sched
      ...
      populated optional fields (8): resolve_source, project_commit, monitor, kvm_stats, kernel_version, host, cleanup_duration_ms, run_source
      none fields (3):
        scheduler_commit [expected]
          - no SchedulerSpec variant currently exposes a reliable commit source — reserved on the schema for future enrichment (e.g. --version probe or ELF-note read on the resolved scheduler binary)
        payload [expected]
          - test declared no binary payload (scheduler-only test or pure-scenario test that never invokes ctx.payload(...))
        kernel_commit [actionable]
          - KTSTR_KERNEL is unset or empty
          ...
          fix: set KTSTR_KERNEL to a local kernel source tree that is a git repository (e.g. a git clone of the kernel)
    

    expected means None is the steady state; actionable means a different environment would populate the field. --json emits an aggregate object for dashboards.

stats list-values, show-host, and explain-sidecar all take --dir DIR to point at an archived sidecar tree copied off a CI host.

Environment notes

  • Local filesystem required. The runs root must live on ext4 / xfs / btrfs / tmpfs — the advisory lock that serializes concurrent sidecar writes rejects NFS and other remote filesystems.
  • Non-git runs collide. When the test process is not in a git repository, the commit slot is the literal unknown, and every such run shares {kernel}-unknown (with pre-clear between them). Set KTSTR_SIDECAR_DIR or put the tree under git to disambiguate. The sidecar’s own project_commit field stays null for these runs — the dirname sentinel and the JSON field intentionally diverge.
  • KTSTR_SIDECAR_DIR overrides the sidecar directory itself (used as-is, no key suffix) for writes and for bare cargo ktstr stats reads. Pre-clear is skipped under the override — you chose the directory, you own its contents. The stats list / list-values / show-host subcommands do not consult it; use --dir.

Failure artifacts

Failed tests additionally write a failure-dump JSON next to their sidecar — see Reading Failure Output for the path scheme, the placeholder-vs-full distinction, and the investigation workflow.

Core Concepts

ktstr tests compose from three layers:

  1. Scenarios — the scheduling condition the test creates: cgroup layout, CPU partitioning, workloads, mid-run changes.
  2. Work types — what each worker process does, each variant targeting a specific kernel scheduling path.
  3. Checking — how results are judged: starvation, fairness, gaps, monitor thresholds, temporal patterns.

One test, all three layers visible:

#[ktstr_test(
    scheduler = MY_SCHED,             // scheduler under test
    llcs = 2, cores = 4, threads = 1, // topology the VM boots with
    not_starved = true,               // checking: every worker progressed
    max_spread_pct = 20.0,            // checking: fairness bound
)]
fn steady_two_cells(ctx: &Ctx) -> Result<AssertResult> {
    scenarios::steady(ctx)  // scenario: 2 cgroups of CPU-spin workers
                            // (work type: the default SpinWait)
}

The layers compose orthogonally: the same scenario body runs across every topology a gauntlet sweep declares, and the checks apply uniformly to every variant.

Five more concepts round out the picture:

  • Ops, Steps, and Backdrop — the API scenarios are built from. Most tests declare cgroups with CgroupDef; tests that change state mid-run compose Ops into Steps.
  • Topology — the NUMA/LLC/core/thread layout a test declares and the VM actually boots with.
  • MemPolicy — per-worker NUMA memory placement, for tests that measure memory locality.
  • Performance Mode — host-side isolation for noise-sensitive measurements.
  • Resource Budget — how concurrent VMs and kernel builds share host CPUs safely.

Read Scenarios, Work types, and Checking first — every test touches all three. Ops matters once a canned scenario stops being enough, and Topology once placement behavior is under test. Performance Mode and Resource Budget are operational: read them when measurements get noisy or hosts get shared.

Scenarios

A scenario is the scheduling condition a test creates — which cgroups exist, which CPUs they may use, what their workers do, and what changes mid-run. The canned scenarios in scenarios::* exist so those conditions have names: scenarios::steady(ctx) produces the same reproducible condition against every scheduler you point it at, which is what makes results comparable across schedulers and commits.

use ktstr::prelude::*;

#[ktstr_test(llcs = 1, cores = 2, threads = 1)]
fn my_test(ctx: &Ctx) -> Result<AssertResult> {
    scenarios::steady(ctx)
}

Canned scenarios (scenarios::*)

| Function | Condition tested | Setup |
| --- | --- | --- |
| steady | Baseline fairness | 2 cgroups, no cpusets, equal CPU-spin load |
| steady_llc | LLC-boundary scheduling | 2 cgroups on different LLCs (skips on 1-LLC topologies) |
| oversubscribed | Dispatch under oversubscription | 2 cgroups, 32 mixed workers each |
| cpuset_apply | Cpuset assignment on running tasks | Disjoint cpusets applied mid-run |
| cpuset_clear | Cpuset removal on confined tasks | Cpusets cleared mid-run |
| cpuset_resize | Cpuset resizing adaptation | Cpusets shrink then grow |
| cgroup_add | Scheduler reaction to a new cgroup | Cgroups created while others run |
| cgroup_remove | Scheduler reaction to cgroup removal | Cgroups torn down while others run |
| affinity_change | Affinity mask changes | Worker affinities randomized mid-run |
| affinity_pinned | Narrow-affinity contention | Workers pinned to a 2-CPU subset |
| host_contention | Cgroup vs host-task fairness | Root-cgroup workers beside managed cgroups |
| mixed_workloads | Mixed workload fairness | Heavy + bursty + IO cgroups |
| nested_steady | Nested cgroup hierarchy | Workers in nested sub-cgroups |
| nested_task_move | Cross-level task migration | Tasks moved between nested cgroups |

More specialized custom_* functions live in the ktstr::scenario::{affinity, basic, cpuset, dynamic, interaction, nested, performance, stress} modules — see the API docs.

Start here

Against a new scheduler, run steady first — it is the smallest condition that can fail (two cgroups, spin load, nothing dynamic). Then steady_llc on a 2-LLC topology to see cache-boundary placement, then mixed_workloads and oversubscribed for load diversity. The dynamic scenarios (cpuset_*, cgroup_*, affinity_*) each isolate one reconfiguration path; reach for the one matching the code you changed.

Run parameters

A scenario body does not pick its own duration or topology — the #[ktstr_test] attribute does. The workload runs for the test’s duration_s (see the macro reference), on the topology the attribute declares, and a gauntlet run re-executes the same body across a whole topology matrix. Worker counts and cpusets come from the scenario’s own CgroupDefs.

Every scenario ends the same way: worker reports are collected and the opted-in checks run against them. A run’s stats roll-up looks like:

--- stats ---
2 workers, 4 cpus, 2 migrations, worst_spread=0.0%, worst_gap=21ms
  cg0: workers=1 cpus=2 spread=0.0% gap=10ms migrations=1 iter=209600
  cg1: workers=1 cpus=2 spread=0.0% gap=21ms migrations=1 iter=189252

Reading Failure Output walks the full anatomy.

From canned to custom

Scenarios graduate in three stages; move down only when the stage above can’t express the condition:

  1. Canned — call a scenarios::* function. Zero setup, named, comparable.
  2. Your own cgroup layout — execute_defs(ctx, vec![...]) with CgroupDefs you declare: your worker counts, work types, cpusets, still one static phase.
  3. Steps — execute_steps / execute_scenario with Steps and Ops for anything that changes mid-run (cpuset swaps, scheduler replacement, snapshots, kernel-memory reads), plus a Backdrop for state that must outlive the steps.

Ops, Steps, and Backdrop documents stages 2 and 3 — the CgroupDef builder, every Op, and all the execute_* entry points. A custom scenario is just the #[ktstr_test] function body itself; Custom Scenarios covers writing bodies that go beyond Steps entirely.

Ops, Steps, and Backdrop

Dynamic scenarios used to mean hand-written scenario bodies that create cgroups, sleep, poke sysfs, sleep again, and collect — with every ordering rule and error path yours to get right:

// By hand: manual setup, manual timing, manual teardown.
fn shrink_midrun(ctx: &Ctx) -> Result<AssertResult> {
    let (mgr, cgroups, handles) = setup_cgroups(ctx, /* defs */)?;
    std::thread::sleep(first_half);
    mgr.set_cpuset("cg_hot", &ctx.topo.llc_aligned_cpuset(0))?;
    std::thread::sleep(second_half);
    // ...collect every handle, run checks, tear down in order...
}

The ops system expresses the same scenario declaratively — the framework owns timing, teardown, liveness checking, and report collection:

execute_steps(ctx, vec![
    Step::with_defs(
        vec![CgroupDef::named("cg_hot").workers(4)],
        HoldSpec::frac(0.5),
    ),
    Step::with_op(
        Op::set_cpuset("cg_hot", CpusetSpec::llc(0)),
        HoldSpec::frac(0.5),
    ),
])

Op

An Op is one atomic operation on the running scenario. The enum is #[non_exhaustive], so a match on it from outside the crate must end with a wildcard arm (_ => ...).

| Op | Effect |
| --- | --- |
| AddCgroup | Create an empty cgroup |
| AddCgroupDef | Create cgroup + cpuset + workers from a CgroupDef, mid-step |
| RemoveCgroup | Stop workers and remove a cgroup |
| StopCgroup | Stop a cgroup’s workers, keep the cgroup |
| SetCpuset / ClearCpuset / SwapCpusets | Set, clear, or swap cgroup cpusets |
| Spawn | Spawn workers into a named cgroup or the runner’s own cgroup |
| SetAffinity | Set worker affinity via AffinityIntent |
| MoveAllTasks | Move all tasks from one cgroup to another |
| FreezeCgroup / UnfreezeCgroup | Kernel-side cgroup freeze (not SIGSTOP); teardown auto-unfreezes |
| SteerIrq | Re-steer a hardware IRQ to one CPU (system-wide, not cpuset-scoped) |
| CaptureSnapshot | On-demand host-side snapshot of BPF maps, vCPU registers, per-CPU counters |
| WatchSnapshot | Snapshot every time the guest writes a named kernel symbol |
| CaptureCgroupProcs | Record a cgroup’s cgroup.procs PIDs under a tag |
| ReadKernelHot / ReadKernelCold | Read kernel memory (symbol, KVA, per-CPU or task field) |
| WriteKernelHot / WriteKernelCold | Write kernel memory; adjacent cold writes are batched |
| RunPayload / WaitPayload / KillPayload | Launch, await, or kill a binary payload |
| AttachScheduler / DetachScheduler / RestartScheduler / ReplaceScheduler | Manage the live scheduler mid-scenario |
| PinBpfMap | Hold a BPF map fd open across a scheduler swap |

Constructors take string literals directly (no .into()):

Op::add_cgroup("cg_0")
Op::add_cgroup_def(CgroupDef::named("cg_1").workers(4))
Op::set_cpuset("cg_0", CpusetSpec::disjoint(0, 2))
Op::spawn_workers("cg_0", WorkSpec::default().workers(4))
Op::spawn_host(WorkSpec::default().workers(4))
Op::set_affinity("cg_0", AffinityIntent::random_subset([0, 1, 2, 3], 2))
Op::capture_snapshot("after_spawn")
Op::freeze_cgroup("cg_0")

Op::spawn_host puts workers in the test runner’s own cgroup — typically the guest root — to simulate host-level contention beside managed cgroups.

Snapshot and watch ops

CaptureSnapshot pauses every vCPU through the freeze coordinator, reads BPF map state, vCPU registers, and per-CPU counters, then resumes; the report is keyed by the op’s name. With no snapshot bridge installed it fails loudly rather than dropping the capture. WatchSnapshot fires one capture per guest write to the named symbol; the name must match the guest kernel’s vmlinux symbol table verbatim, and at most 3 watch ops fit in a scenario (hardware debug slots; one is reserved for the error-exit trigger). Details and failure modes: Snapshots and Watch Snapshots.

Kernel-memory ops

Hot variants read/write against the running vCPU; Cold variants take a freeze rendezvous first. Targets are KernelTargets: a symbol, a kernel virtual address, a per-CPU field, or a task field.

Payload ops

RunPayload spawns a binary-kind Payload in the background; WaitPayload blocks until it exits naturally, then evaluates its checks and records its metrics; KillPayload does the same after SIGKILL. Payloads are addressed by (name, cgroup); cgroup: None resolves to the unique live copy. WaitPayload has no timeout — pair it with a bounded hold or the payload’s own runtime flag. Scheduler-kind payloads are rejected: the scheduler slot is the #[ktstr_test(scheduler = ...)] attribute.

Scheduler ops

ReplaceScheduler swaps to a different staged scheduler binary (declared via #[ktstr_test(staged_schedulers = [...])]). PinBpfMap keeps a map fd alive so a same-binary swap window’s .bss survives the replacement.

Permissive removal — a footgun

Op::RemoveCgroup and Op::StopCgroup are permitted against any cgroup, including Backdrop-owned ones, and removing a nonexistent cgroup silently succeeds (an rmdir that hits a missing path is treated as a no-op). A typo’d name therefore surfaces later, as the kernel’s No such file or directory on the next op that references the real name. If a later step fails with a missing-cgroup error, grep the test for Op::remove_cgroup calls naming a similar identifier first. Op::MoveAllTasks is the exception: it rejects moves that would strand Backdrop workers in a step-local cgroup.
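
Why a typo'd name passes silently can be seen in a minimal sketch of idempotent directory removal. This is not ktstr's code: remove_dir_permissive is a hypothetical helper, and the assumption is only that the removal path maps a missing directory to success, as the paragraph above describes.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper: remove a directory, treating "already gone" as success.
fn remove_dir_permissive(path: &Path) -> io::Result<()> {
    match fs::remove_dir(path) {
        Ok(()) => Ok(()),
        // A missing path looks the same as a typo'd name at this point,
        // so the caller gets no signal that nothing was removed.
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(()),
        Err(e) => Err(e),
    }
}
```

With this shape, removing "cg_hto" succeeds while "cg_hot" stays in place, and the mistake only shows up when a later operation expects "cg_hot" to be gone or recreated.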

CpusetSpec

CpusetSpec computes a cpuset from the topology at runtime. Build via constructors — the enum is #[non_exhaustive]:

pub enum CpusetSpec {
    Llc(usize),                               // all CPUs in one LLC
    Numa(usize),                              // all CPUs in one NUMA node
    Range { start_frac: f64, end_frac: f64 }, // fraction of usable CPUs
    Disjoint { index: usize, of: usize },     // equal disjoint partitions
    Overlap { index: usize, of: usize, frac: f64 }, // overlapping partitions
    Exact(BTreeSet<usize>),                   // caller-supplied set
}

CpusetSpec::llc(0), CpusetSpec::numa(0), CpusetSpec::range(0.0, 0.5), CpusetSpec::disjoint(0, 2), CpusetSpec::overlap(0, 2, 0.5), CpusetSpec::exact([0, 1, 2]). Fractional and partition variants operate on usable_cpus(); Llc and Numa cover their full domain.
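
To make the partition and fraction variants concrete, here is a rough sketch of how Disjoint and Range could resolve against a list of usable CPUs. The free functions disjoint and range are hypothetical stand-ins, and the rounding rules (remainder CPUs go to the last partition, fractions round down) are assumptions of this sketch rather than documented ktstr behavior.

```rust
/// Split `usable` into `of` contiguous partitions and return partition `index`.
/// Assumption: any remainder CPUs are given to the last partition.
fn disjoint(usable: &[usize], index: usize, of: usize) -> Vec<usize> {
    assert!(of > 0 && index < of, "index must be smaller than of");
    let chunk = usable.len() / of;
    let start = index * chunk;
    let end = if index + 1 == of { usable.len() } else { start + chunk };
    usable[start..end].to_vec()
}

/// Select the CPUs between two fractions of the usable list.
/// Assumption: both bounds round down to a list position.
fn range(usable: &[usize], start_frac: f64, end_frac: f64) -> Vec<usize> {
    let n = usable.len() as f64;
    let start = (n * start_frac).floor() as usize;
    let end = ((n * end_frac).floor() as usize).min(usable.len());
    // Clamp so an inverted range yields an empty set instead of panicking.
    usable[start.min(end)..end].to_vec()
}
```

On an 8-CPU list, disjoint(.., 0, 2) and range(.., 0.0, 0.5) in this sketch both select CPUs 0 through 3, which is why the two forms are interchangeable for a simple half split.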

CgroupDef

CgroupDef bundles the three ops that always travel together — create cgroup, set cpuset, spawn workers — and is the primary way to declare cgroups:

let def = CgroupDef::named("cg_0")
    .cpuset(CpusetSpec::disjoint(0, 2))
    .workers(4)
    .work_type(WorkType::SpinWait);

Builder methods:

  • .cpuset(CpusetSpec) / .cpuset_mems(set) — CPU set, and an explicit cpuset.mems override (default derives from the cpuset’s NUMA nodes).
  • .workers(n) / .workers_pct(p) — worker count, absolute or as a fraction of the resolved cpuset (see below). Setting both is rejected with a diagnostic.
  • .work_type(WorkType) — what workers do (default SpinWait); see Work Types.
  • .work(WorkSpec) — add another worker group; call repeatedly for concurrent groups.
  • .workload(&'static Payload) — run a binary payload inside the cgroup alongside the workers. Panics on a scheduler-kind payload (there is no scenario-level recovery at build time; the step-level Op::RunPayload returns an error instead).
  • .sched_policy(SchedPolicy) — Linux scheduling policy (default Normal); see Scheduling policies.
  • .affinity(AffinityIntent) — per-worker affinity (default Inherit).
  • .mem_policy(MemPolicy) / .mpol_flags(MpolFlags) — NUMA memory placement; see MemPolicy.
  • .nice(n), .comm(name), .pcomm(name), .uid(u) / .gid(g), .numa_node(node) — per-worker identity defaults, merged into every WorkSpec that doesn’t set its own.
  • .swappable(bool) — opt into gauntlet work-type overrides (see below).

Cgroup-v2 controller knobs (default unconstrained): .cpu_quota_pct(pct) / .cpu_quota(quota, period) / .cpu_unlimited(), .cpu_weight(w), .memory_max(b) / .memory_high(b) / .memory_low(b) / .memory_unlimited(), .memory_swap_max(b) / .memory_swap_unlimited(), .io_weight(w), .pids_max(n) / .pids_unlimited().

Cpuset-scaled worker counts

Tests that span topologies need worker counts that scale with the cpuset. Hand-computing couples the test to a manual resolution step:

// Before: hand-computed via Ctx::cpuset_cpus.
let n = (ctx.cpuset_cpus(&CpusetSpec::Llc(0)) as f64 * 0.9).ceil() as usize;
let def = CgroupDef::named("cg_hot").cpuset(CpusetSpec::Llc(0)).workers(n);

// After: resolved from the cgroup's own cpuset at apply time.
let def = CgroupDef::named("cg_hot")
    .cpuset(CpusetSpec::Llc(0))
    .workers_pct(0.9);  // ceil(cpuset_cpus * 0.9)

Fractions above 1.0 are accepted as deliberate oversubscription.
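
The arithmetic behind workers_pct is small enough to show on its own. The sketch below follows the ceil(cpuset_cpus * frac) rule from the comment above; resolve_workers_pct is a hypothetical name, and the floor of one worker is an assumption of the sketch.

```rust
/// Resolve a fractional worker count against a cpuset size.
/// Follows ceil(cpuset_cpus * frac); the minimum of one worker is an
/// assumption of this sketch.
fn resolve_workers_pct(cpuset_cpus: usize, frac: f64) -> usize {
    let n = (cpuset_cpus as f64 * frac).ceil() as usize;
    n.max(1)
}
```

An 8-CPU LLC at 0.9 resolves to 8 workers (7.2 rounded up), and a 4-CPU cpuset at 1.5 resolves to 6, which is the oversubscribed case.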

Work-type overrides and swappable

A gauntlet run can sweep work types (--ktstr-work-type=NAME, surfaced as Ctx.work_type_override). The override replaces a def’s work type only when that CgroupDef is marked .swappable(true) (default false), and is skipped when the override is a grouped work type whose group size does not divide the resolved worker count. Non-swappable defs keep their declared type; Op::Spawn always uses the type as given. This is the single override mechanism for both #[ktstr_test] and ops-based scenarios.
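
The override rule above reduces to two conditions, sketched here as a standalone predicate. override_applies is a hypothetical function, not part of ktstr; it only restates the swappable flag and the group-size divisibility check.

```rust
/// Decide whether a work-type override applies to one cgroup definition.
/// `override_group_size` is None when the override is an ungrouped work type.
fn override_applies(
    swappable: bool,
    workers: usize,
    override_group_size: Option<usize>,
) -> bool {
    if !swappable {
        return false; // non-swappable defs keep their declared work type
    }
    match override_group_size {
        None => true,
        Some(0) => false, // defensive: a zero group size never divides
        Some(g) => workers % g == 0,
    }
}
```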

Step

A Step is a list of ops plus a hold period:

pub struct Step {
    pub setup: Setup,   // CgroupDefs to create (after ops run)
    pub ops: Vec<Op>,   // operations to apply
    pub hold: HoldSpec, // how long to hold afterward
}

Setup is Defs(Vec<CgroupDef>) or a topology-dependent Setup::with_factory(fn(&Ctx) -> Vec<CgroupDef>).

Constructors:

  • Step::with_defs(defs, hold) — the primary constructor: create cgroups with workers, hold.
  • Step::new(ops, hold) — ops only, no cgroup setup.
  • Step::hold(hold) — hold only; the canonical phase-A shape in an A/B scenario (Step::hold(HoldSpec::frac(0.3)) then Step::with_op(Op::replace_scheduler(&ALT), HoldSpec::frac(0.7))).
  • Step::with_op(op, hold) — one op, then hold.
  • Step::with_payload(payload, hold) — run one binary payload for the hold; it is drained at step teardown, nothing blocks on it.

Builder methods Step::set_ops and Step::set_hold replace their field. The verb prefixes are consistent across the API: set_X replaces, push_X appends one, extend_X appends many — so Step::new(ops, hold).set_ops(more) drops the original ops, while Backdrop::new().extend_ops(a).extend_ops(b) accumulates both.
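
The difference between the three prefixes is easiest to see on a toy builder. OpList below is a hypothetical stand-in that uses string labels in place of real ops; it illustrates the naming convention only and is not a ktstr type.

```rust
/// Toy builder illustrating the set_/push_/extend_ naming convention.
#[derive(Debug, Default, PartialEq)]
struct OpList {
    ops: Vec<&'static str>,
}

impl OpList {
    /// set_X: replaces whatever was there.
    fn set_ops(mut self, ops: Vec<&'static str>) -> Self {
        self.ops = ops;
        self
    }

    /// push_X: appends one item.
    fn push_op(mut self, op: &'static str) -> Self {
        self.ops.push(op);
        self
    }

    /// extend_X: appends many items.
    fn extend_ops(mut self, ops: impl IntoIterator<Item = &'static str>) -> Self {
        self.ops.extend(ops);
        self
    }
}
```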

HoldSpec

| Variant | Meaning |
| --- | --- |
| Frac(f64) | Fraction of the scenario duration |
| Fixed(Duration) | Fixed time |
| Loop { interval } | Re-apply the step’s ops at interval until time runs out |

Sugar: HoldSpec::frac(0.5), HoldSpec::fixed(d), HoldSpec::loop_at(d), and HoldSpec::FULL for Frac(1.0).
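
As a rough illustration of how a fractional hold turns into wall-clock time, the sketch below resolves the Frac and Fixed cases against a scenario duration. Hold and resolve_hold are hypothetical stand-ins (Loop is left out), and scaling by the whole scenario duration is taken from the variant descriptions above.

```rust
use std::time::Duration;

/// Stand-in for two of the hold variants (Loop omitted for brevity).
enum Hold {
    Frac(f64),
    Fixed(Duration),
}

/// Resolve a hold to wall-clock time given the scenario duration.
fn resolve_hold(hold: &Hold, scenario: Duration) -> Duration {
    match hold {
        // Scale the scenario duration by the fraction.
        Hold::Frac(f) => scenario.mul_f64(*f),
        // A fixed hold ignores the scenario duration.
        Hold::Fixed(d) => *d,
    }
}
```

Under this reading, two steps at 0.5 each split a 20 s scenario into two 10 s holds.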

A Loop step is the natural shape for “every N seconds, do X”:

Step::new(
    vec![Op::capture_snapshot("periodic")],
    HoldSpec::loop_at(Duration::from_secs(2)),
)

The step’s setup runs once at step entry; only the ops repeat. Prior steps’ step-local state was already torn down at their own boundaries, so each loop iteration sees only Backdrop-owned state and this step’s own setup.

Backdrop

Steps tear their state down at each step boundary. A Backdrop is the scenario-wide layer for state that must persist across steps — long-lived cgroups every step references, background payloads that run for the whole scenario, setup ops that seed state once:

let backdrop = Backdrop::new()
    .push_cgroup(CgroupDef::named("bg_cell").cpuset(CpusetSpec::disjoint(0, 2)))
    .push_op(Op::add_cgroup("bg_overflow"))   // empty move-target cgroup
    .push_payload(&BG_LOAD);

execute_scenario(ctx, backdrop, steps)

  • Backdrop::from_cgroups([...]) builds one from CgroupDefs; push_cgroup / extend_cgroups, push_op / extend_ops, and push_payload / extend_payloads compose incrementally.
  • Backdrop cgroups are created in declaration order, before the first step; every CgroupDef spawns at least one worker. Declare empty cgroups (move targets) via push_op(Op::add_cgroup(...)).
  • Backdrop ops run after the cgroups, before the payloads, with full authority — they may remove or stop Backdrop cgroups where step-local ops are restricted.
  • Payloads are spawned once and drained (killed, metrics preserved) at scenario teardown.

Any step can reference Backdrop cgroups by name (Op::MoveAllTasks, Op::SetCpuset, …). The Backdrop tears down after the last step.

Executors

All in the prelude; each returns Result<AssertResult>:

  • execute_defs(ctx, defs) — the one-shot path: create cgroups, run for the full duration, collect. Equivalent to execute_steps(ctx, vec![Step::with_defs(defs, HoldSpec::FULL)]).
  • execute_steps(ctx, steps) — run a step sequence: for each step, apply ops, then setup, then hold (Loop steps run setup once, then repeat ops); check scheduler liveness between steps; collect worker reports and run checks at the end.
  • execute_steps_with(ctx, steps, Some(&assert)) — same, with an explicit Assert overriding ctx.assert for worker checks. None falls back to ctx.assert (the merged scheduler + per-test config).
  • execute_scenario(ctx, backdrop, steps) / execute_scenario_with(ctx, backdrop, steps, checks) — the full composition: Backdrop setup, step sequence with per-step teardown, Backdrop teardown.

Phases

Steps give a scenario its timeline. The framework publishes the active phase as steps progress: captures (periodic samples, watch trips, on-demand snapshots) stamp with the phase active at capture time, and every assertion detail constructed during a step’s hold auto-stamps with that step’s label. Labels render as BASELINE (the settle window before step 0) and Step[k] everywhere — sidecar JSON, the timeline diagnostic, and per-assertion phase fields.

Phase-bucketed metrics are queryable from the result:

let baseline = r.stats.phase(Phase::BASELINE).expect("always populated");
let step_0   = r.stats.phase(Phase::step(0)).expect("Step 0 ran");
let thr      = r.stats.phase_metric(Phase::step(0), "throughput");

Gate on r.stats.has_steps() before assuming step buckets exist — a scenario that bailed in setup returns None from every phase(Phase::step(k)) lookup. PhaseBucket::expect_metric panics with the bucket’s label, sample count, and the metric keys actually present, so a typo’d name and an empty phase are distinguishable at a glance. Temporal Assertions builds per-phase pattern checks on top of this.
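
The value of a lookup that reports what the bucket does contain can be shown with a small model. Bucket below is a hypothetical type, not ktstr's PhaseBucket, and it returns an error string where the real method is described as panicking; the message layout is an assumption.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-phase bucket: a label, a sample count, and named metrics.
struct Bucket {
    label: String,
    samples: usize,
    metrics: BTreeMap<String, f64>,
}

impl Bucket {
    /// Look up a metric; on a miss, report what the bucket does contain.
    fn metric(&self, name: &str) -> Result<f64, String> {
        self.metrics.get(name).copied().ok_or_else(|| {
            let keys: Vec<&str> = self.metrics.keys().map(String::as_str).collect();
            format!(
                "{}: no metric '{}' ({} samples, keys present: {:?})",
                self.label, name, self.samples, keys
            )
        })
    }
}
```

A miss on a misspelled name lists the real key next to it, while a miss on an empty phase shows zero samples and no keys, so the two failure causes read differently.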

The per-phase timeline also renders in every failure report:

--- timeline ---
topology: 1n1l2c1t (2 cpus)  scheduler: my_sched  scenario: throughput_gate  duration: 15.0s

Phase 1: StepStart[0] ops=0 (4960ms, 0 samples):
  imbalance: avg=1.2 max=5.0 | dsq: avg=0 max=0 | nr_run: avg=1.0 | fallback: 0/s | keep_last: 38/s | throughput: 79697 iter/s (stimulus-derived)
  per-cgroup:
    cg_a: off-cpu avg=0.3% min=0.3% max=0.3% spread=0.0% | run-delay mean=915µs worst=915µs | iters=209600 migrations=1 | gap=10ms@cpu0
    cg_b: off-cpu avg=9.0% min=9.0% max=9.0% spread=0.0% | run-delay mean=5654µs worst=5654µs | iters=189252 migrations=1 | gap=21ms@cpu0
  >>> StepStart[0]: ops=0 (2 cgroups, 2 workers)

Work Types

WorkType decides what each worker process does — and each variant targets a specific kernel scheduling path, so a test pins down the code path a regression lives in. ForkExit hammers wake_up_new_task and the exit path (do_group_exit / wait_task_zombie); AffinityChurn drives affine_move_task and migration_cpu_stop. If you know which path your scheduler change touches, there is usually a work type aimed at it.

Choosing a work type

| Scheduler behavior to test | Work type |
| --- | --- |
| Basic load balancing / fairness | SpinWait (default) |
| Wake placement / sleep-wake cycles | YieldHeavy, FutexPingPong |
| CPU borrowing / idle balance | Bursty, IdleChurn |
| Cross-CPU wake latency | PipeIo, CachePipe, WakeChain |
| Cache-aware scheduling | CachePressure, CacheYield |
| Fan-out wake storms | FutexFanOut, FanOutCompute |
| Broadcast wakeups (thundering herd) | ThunderingHerd |
| epoll exclusive-wake paths | EpollStorm |
| Timer (hrtimer) wake-to-run latency | TimerLatency |
| IRQ/softirq wake paths | IrqWake, NetTraffic |
| Wakeup + request latency (schbench parity) | Schbench |
| KV object-cache request mix (taobench parity) | Taobench |
| Task creation/destruction pressure | ForkExit |
| Priority reweighting / nice dynamics | NiceSweep |
| Affinity churn / forced migration | AffinityChurn, CrossAffinityChurn, NumaMigrationChurn |
| Scheduling-class transitions | PolicyChurn |
| Cgroup migration paths | CgroupChurn, CgroupAttachStorm |
| Page fault / TLB pressure | PageFaultChurn |
| NUMA locality under migration | NumaWorkingSetSweep |
| Lock contention / convoy effect | MutexContention |
| Priority inversion | PriorityInversion |
| RT starving or preempting CFS | RtStarvation, PreemptStorm |
| Signal delivery pressure | SignalStorm |
| Producer/consumer imbalance | ProducerConsumerImbalance |
| Block-I/O D-state cycles | IoSyncWrite, IoRandRead, IoConvoy |
| Mixed / phased real-world patterns | Mixed, Sequence |
| High-IPC compute, SMT interference | AluHot, SmtSiblingSpin, IpcVariance |
| Arbitrary user-defined workload | Custom |

Variants by intent

The WorkType enum in ktstr::workload is the source of truth — run cargo doc for full per-variant semantics, parameters, and kernel-path citations. The shape of each family:

CPU primitives. SpinWait — tight spin loop, pure CPU. YieldHeavy — sched_yield every iteration, exercising wake/sleep paths. Mixed — spin burst then yield. AluHot { width } — parallel multiply chains at high IPC, optionally SIMD. SmtSiblingSpin — paired PAUSE-spin on two SMT siblings. IpcVariance { hot_iters, cold_iters, period_iters } — alternating high-IPC and cache-miss phases.

Block I/O (against /dev/vda; per-worker tempfile fallback when absent). IoSyncWrite — striped O_SYNC pwrites + fdatasync, fsync-heavy D-state cycles. IoRandRead — 4 KB O_DIRECT preads at random offsets, high-IOPS short D-states. IoConvoy — interleaved sequential writes and random reads with periodic fdatasync.

Burst-and-sleep. Bursty { burst_duration, sleep_duration } — CPU burst then sleep, freeing CPUs for borrowing. IdleChurn — burst then nanosleep, exercising hrtimer + idle-class paths.

Cache pressure. CachePressure { size_kib, stride } — strided read-modify-write sized to pressure L1. CacheYield — the same plus sched_yield, testing re-placement with a cache-hot working set.

Wake placement and cross-CPU paths. PipeIo — CPU burst then 1-byte pipe exchange with a partner. FutexPingPong { spin_iters } — paired futex wait/wake (non-WF_SYNC path). CachePipe — cache-hot working set + pipe wake. FutexFanOut { fan_out, spin_iters } — one messenger wakes N receivers, which measure wake-to-run latency. FanOutCompute — fan-out plus matrix-multiply think time per receiver. WakeChain { depth, wake, work_per_hop } — a ring of waker-wakee hops via pipe (WF_SYNC) or futex. AsymmetricWaker — paired workers in mismatched scheduling classes sharing a futex. EpollStorm { producers, consumers, events_per_burst } — eventfd producers + epoll_wait consumers (exclusive autoremove wake). ThunderingHerd { waiters, batches, inter_batch_ms } — N waiters on one futex word, broadcast-woken.

Timer and IRQ wakes (the AF_PACKET variants need #[ktstr_test(network = ...)]). TimerLatency { interval_us } — cyclictest-style absolute-deadline hrtimer wake. NetTraffic — AF_PACKET self-traffic driving virtio-net RX hardirq + NAPI softirq. IrqWake — paired sender/receiver; the receiver blocked in recvfrom is woken from NET_RX softirq context.

Lifecycle and class churn. ForkExit — rapid fork + _exit + waitpid cycles. NiceSweep — nice level cycled -20..19 (reweight_task; negative values skipped without CAP_SYS_NICE). AffinityChurn { spin_iters } — self-directed sched_setaffinity to random CPUs. CrossAffinityChurn — workers rewrite their cgroup siblings’ affinity; needs a dedicated cgroup. PolicyChurn — cycles through SCHED_OTHER, BATCH, and IDLE (→ FIFO/RR with CAP_SYS_NICE) via __sched_setscheduler. NumaMigrationChurn { period_ms } — affinity rotated across NUMA nodes. CgroupChurn { groups, cycle_ms } — membership cycled between sibling cgroups. CgroupAttachStorm — transient children migrated into a sibling cgroup mid-exit (attach-path leader race).

Memory pressure / NUMA. PageFaultChurn { region_kib, touches_per_cycle, spin_iters } — mmap, fault random 4 KiB pages through do_anonymous_page, MADV_DONTNEED, repeat. NumaWorkingSetSweep — the working set rotated across NUMA nodes via mbind.

Lock contention. MutexContention { contenders, hold_iters, work_iters } — N-way futex mutex contention (convoy effect, lock-holder preemption). PriorityInversion — three priority tiers contending for one lock, PI or plain futex mode.

Signal / preemption pressure. SignalStorm — paired workers fire tkill(partner, SIGUSR2) between bursts. PreemptStorm — one SCHED_FIFO worker preempts CFS spinners at ~kHz. RtStarvation — SCHED_FIFO workers monopolize CPUs while CFS workers starve.

Compound. Sequence { first, rest } — ordered WorkPhases (Spin / Sleep / Yield / Io / AluHot, each with a Duration) looped for the run:

WorkType::Sequence {
    first: WorkPhase::Spin(Duration::from_millis(100)),
    rest: vec![
        WorkPhase::Sleep(Duration::from_millis(50)),
        WorkPhase::Yield(Duration::from_millis(20)),
    ],
}

User-supplied. Custom — your own work function, with a fork-safe config payload. The how-to lives in Custom Scenarios.

use ktstr::prelude::*; brings in WorkType, WorkSpec, WorkPhase, SchedPolicy, and the parameter enums (SchbenchConfig, TaobenchConfig, FutexLockMode, WakeMechanism, SchedClass, ReapMode, AluWidth). Note the prelude also exports an unrelated Phase from the assertion layer — WorkType::Sequence uses WorkPhase.

Constructors and defaults

Every parameterized variant has a snake-case constructor — WorkType::bursty(burst, sleep), WorkType::mutex_contention(4, 256, 1024), WorkType::wake_chain(depth, wake, work_per_hop), and so on — with parameter validation where zero values are meaningless. Duration-typed parameters (Bursty, IdleChurn, WakeChain) take std::time::Duration, not raw integers.

WorkType::from_name("FutexPingPong") resolves a PascalCase name to a default-parameterized instance; the per-variant defaults are the constants in the ktstr::workload::defaults module. Sequence and Custom require explicit construction and return None from name lookup. WorkType::ALL_NAMES lists every name; WorkType::name() maps back.

Grouped work types

PipeIo, FutexPingPong, and CachePipe pair workers and require even num_workers. FutexFanOut and FanOutCompute require num_workers divisible by fan_out + 1 (one messenger + N receivers per group). MutexContention requires divisibility by contenders. WorkType::worker_group_size() returns the group size for these variants, None for ungrouped types.
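
The divisibility rules above can be restated as a small model. Work is a hypothetical four-variant stand-in for the real WorkType enum, and accepts is an invented helper; only the group sizes (pairs, fan_out + 1, contenders) come from the text.

```rust
/// Minimal stand-in covering one ungrouped and three grouped variants.
enum Work {
    SpinWait,
    PipeIo,
    FutexFanOut { fan_out: usize },
    MutexContention { contenders: usize },
}

impl Work {
    /// Group size for grouped variants, None for ungrouped ones.
    fn worker_group_size(&self) -> Option<usize> {
        match self {
            Work::SpinWait => None,
            Work::PipeIo => Some(2), // paired workers
            Work::FutexFanOut { fan_out } => Some(fan_out + 1), // one messenger + N receivers
            Work::MutexContention { contenders } => Some(*contenders),
        }
    }

    /// True when `num_workers` splits evenly into groups.
    fn accepts(&self, num_workers: usize) -> bool {
        match self.worker_group_size() {
            None => true,
            Some(0) => false, // defensive: a zero group size never divides
            Some(g) => num_workers % g == 0,
        }
    }
}
```

A fan-out of 3 therefore needs worker counts in multiples of 4, so 8 workers form two groups and 6 workers are rejected.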

Schbench

Schbench re-expresses schbench’s default mode natively: message threads batch-wake worker threads (wakeup latency), each worker think-sleeps then does matrix work under a per-CPU lock (request latency). SchbenchConfig’s fields map schbench’s -m/-t/-F/-n/-s/-L/-R/-A/-p flags — the rustdoc has the full CLI-parity table, including which knobs ktstr’s topology sets for you. Use a single ktstr worker (workers(1)): the message/worker parallelism is this variant’s internal thread topology, not ktstr worker processes.

Worker teardown and process groups

Every worker calls setpgid(0, 0) after fork, and teardown SIGKILLs the worker’s whole process group — on graceful stop, on escalation, and again at handle drop. Anything a worker spawns that inherits its pgid (a helper binary, a subshell) dies with it. A child that must outlive the worker needs its own process group (setpgid(child_pid, 0)) or an explicit wait before the worker returns.

Clone mode and pcomm

Workers fork by default (CloneMode::Fork: one process per worker); CloneMode::Thread runs them as threads sharing the parent’s thread group. Setting pcomm on a WorkSpec or CgroupDef routes workers through a fork-then-thread path: one forked leader whose comm is the pcomm value hosts the matching workers as its threads — the per-process-leader shape schedulers expect from real applications. See the WorkloadConfig and WorkSpec rustdoc for the mechanics.

WorkloadConfig

WorkloadConfig is the low-level spawn spec CgroupDef builds internally; use it directly only when calling WorkloadHandle::spawn from a custom scenario. Its default is one SpinWait worker with inherited affinity and policy. The composed field carries secondary WorkSpec groups spawned alongside the primary; reports identify them by group_idx. Topology-aware AffinityIntent variants (SingleCpu, LlcAligned, CrossCgroup, SmtSiblingPair) need scenario context and are rejected at the direct-spawn gate. See Workers and Workloads.

Scheduling policies

pub enum SchedPolicy {
    Normal,
    Batch,
    Idle,
    Fifo(u32),       // priority 1-99
    RoundRobin(u32), // priority 1-99
    Deadline { runtime: Duration, deadline: Duration, period: Duration },
    Ext,             // SCHED_EXT — route through the loaded BPF scheduler
}

Fifo, RoundRobin, and Deadline require CAP_SYS_NICE. A malformed Deadline (runtime <= deadline <= period violated) fails with a diagnostic before the syscall. Ext is SCHED_EXT: it routes the worker through the loaded sched_ext scheduler even under a SCX_OPS_SWITCH_PARTIAL scheduler that leaves other tasks in fair, and requires CONFIG_SCHED_CLASS_EXT in the guest kernel.
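
The Deadline ordering check is simple to express before any syscall is made. The sketch below is a hypothetical validate_deadline function with invented diagnostic wording; only the runtime <= deadline <= period rule comes from the text.

```rust
use std::time::Duration;

/// Check the ordering runtime <= deadline <= period before calling the kernel.
fn validate_deadline(
    runtime: Duration,
    deadline: Duration,
    period: Duration,
) -> Result<(), String> {
    if runtime > deadline {
        return Err(format!("runtime {:?} exceeds deadline {:?}", runtime, deadline));
    }
    if deadline > period {
        return Err(format!("deadline {:?} exceeds period {:?}", deadline, period));
    }
    Ok(())
}
```

Checking up front gives a readable diagnostic in place of a bare EINVAL from the scheduler syscall.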

Checking

ktstr judges scheduler behavior through two channels: worker-side telemetry (every worker process reports what happened to it) and host-side monitoring (the monitor reads guest kernel state from outside). Both channels always measure; nothing asserts until the test opts in — a test with no checking attributes passes as long as the VM boots and the scenario completes.

Which API to reach for:

  • #[ktstr_test] attributes — cover most tests: not_starved, max_gap_ms, max_spread_pct, min_iteration_rate, and every other threshold below has an attribute (see the macro reference).
  • Verdict + claim! — labeled assertions on values you compute inside a custom scenario body.
  • AbsoluteThresholds — a one-call multi-field bound check against collected reports, bypassing the config merge.
  • assert_scx_events_clean — bounds on SCX event counters (“no fallbacks fired”).

Worker checks

After each scenario, ktstr collects a WorkerReport from every worker and runs the opted-in checks against them:

  • Starvation (not_starved) — any worker with zero work units fails: tid N starved (0 work units).
  • Scheduling gaps (max_gap_ms) — the longest wall-clock gap observed at work-unit checkpoints. A violation renders as tid N stuck Xms on cpuY at +Zms (threshold Nms).
  • Fairness (max_spread_pct) — workers in one cgroup should get similar CPU time; the spread (max off-CPU% − min off-CPU%) must stay below the bound.
  • Cpuset isolation (isolation) — workers may only run on CPUs in their assigned cpuset; any excursion fails.
  • Throughput — max_throughput_cv bounds the coefficient of variation of per-worker work rate (some workers quietly slower); min_work_rate sets an absolute floor (all workers equally slow).
  • Benchmarking — max_p99_wake_latency_ns and max_wake_latency_cv bound wake-to-run latency for work types that block and measure it (see Work Types for which do); min_iteration_rate floors outer-loop iterations per second per worker.
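
Two of the quantities in the list above are plain arithmetic over per-worker numbers. The sketch below computes the fairness spread and the coefficient of variation; spread_pct and rate_cv are hypothetical names, and using the population standard deviation for the CV is an assumption of this sketch.

```rust
/// Fairness spread within one cgroup: max off-CPU% minus min off-CPU%.
fn spread_pct(off_cpu_pct: &[f64]) -> f64 {
    if off_cpu_pct.is_empty() {
        return 0.0;
    }
    let max = off_cpu_pct.iter().cloned().fold(f64::MIN, f64::max);
    let min = off_cpu_pct.iter().cloned().fold(f64::MAX, f64::min);
    max - min
}

/// Coefficient of variation of per-worker rates
/// (population standard deviation divided by the mean).
fn rate_cv(rates: &[f64]) -> f64 {
    if rates.is_empty() {
        return 0.0;
    }
    let n = rates.len() as f64;
    let mean = rates.iter().sum::<f64>() / n;
    if mean == 0.0 {
        return 0.0;
    }
    let var = rates.iter().map(|r| (r - mean).powi(2)).sum::<f64>() / n;
    var.sqrt() / mean
}
```

The CV catches one worker that is quietly slower than its peers, while a floor such as min_work_rate catches the case where every worker is equally slow and the CV stays near zero.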

The loop, end to end

A test sets a threshold, the run violates it, the failure output names the check, the value, and the bound:

#[ktstr_test(
    scheduler = MY_SCHED,
    llcs = 1, cores = 2, threads = 1,
    min_iteration_rate = 50_000_000.0,  // deliberately unreachable floor
)]
fn throughput_gate(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        CgroupDef::named("cg_a").workers(1).cpuset(CpusetSpec::disjoint(0, 2)),
        CgroupDef::named("cg_b").workers(1).cpuset(CpusetSpec::disjoint(1, 2)),
    ])
}

cargo ktstr test --kernel 7.0
ktstr_test 'throughput_gate' [sched=scx-ktstr] [topo=1n1l2c1t] failed:
  worker 71 iteration rate 41903.3/s below floor 50000000.0/s
  worker 73 iteration rate 37834.5/s below floor 50000000.0/s

--- stats ---
2 workers, 4 cpus, 2 migrations, worst_spread=0.0%, worst_gap=21ms
  cg0: workers=1 cpus=2 spread=0.0% gap=10ms migrations=1 iter=209600
  cg1: workers=1 cpus=2 spread=0.0% gap=21ms migrations=1 iter=189252
...
--- monitor ---
samples=41 max_imbalance=2.00 max_dsq_depth=0 stuck=0
avg: imbalance=1.32 nr_running/cpu=1.2 dsq/cpu=0.0
events: fallback=0 (0.0/s) keep_last=210 (52.5/s) offline=0
...
verdict: monitor OK

Both channels report: the worker check that tripped, and the monitor verdict that did not. The full failure anatomy — timeline, scheduler log, dump sections — is in Reading Failure Output.

Monitor checks

The host-side monitor samples guest per-CPU runqueue state (via BTF offsets, no guest instrumentation) roughly every 100ms and evaluates:

  • Imbalance ratio — max(nr_running) / max(1, min(nr_running)) across CPUs.
  • Local DSQ depth — per-CPU dispatch queue depth.
  • Stall detection — rq_clock not advancing on a CPU with runnable tasks; idle CPUs and preempted vCPUs are exempt.
  • Event rates — select_cpu_fallback and dispatch_keep_last counters per second.
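
The imbalance ratio in the list above can be computed per sample from the per-CPU nr_running values. imbalance_ratio below is a hypothetical function that follows the stated formula, including the clamped denominator.

```rust
/// Imbalance ratio for one sample: max(nr_running) / max(1, min(nr_running)).
fn imbalance_ratio(nr_running: &[u32]) -> f64 {
    let max = nr_running.iter().copied().max().unwrap_or(0);
    let min = nr_running.iter().copied().min().unwrap_or(0);
    // Clamp the denominator so an all-idle sample does not divide by zero.
    max as f64 / min.max(1) as f64
}
```

An all-idle sample yields 0.0, and a sample with one idle CPU next to a CPU running three tasks yields 3.0.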

Monitor violations always land in the failure report’s --- monitor --- section, but they flip the test result only when the test enforces them — set the corresponding attributes, call .with_monitor_defaults() on an Assert, or set enforce_monitor_thresholds. A monitor that produced no usable signal (empty samples, uninitialized guest memory) reports inconclusive, never a silent pass — a CI gate can always tell “verified OK” from “never measured”.

The defaults with_monitor_defaults() applies:

| Threshold | Default | Rationale |
| --- | --- | --- |
| max_imbalance_ratio | 4.0 | max(nr_running) / max(1, min(nr_running)) across CPUs (denominator clamped so an all-idle sample does not divide by zero). Lower values (2-3) false-positive during cpuset transitions. |
| max_local_dsq_depth | 50 | Per-CPU dispatch queue overflow. Sustained depth above this means the scheduler is not consuming dispatched tasks. |
| fail_on_stall | true | Fail when rq_clock does not advance on a CPU with runnable tasks. Idle CPUs (NOHZ) and preempted vCPUs are exempt. |
| sustained_samples | 5 | At ~100ms sample interval, requires ~500ms of sustained violation. Filters transient spikes from cpuset reconfiguration. |
| max_fallback_rate | 200.0/s | select_cpu_fallback events per second across all CPUs. Sustained rate indicates systematic select_cpu failure. |
| max_keep_last_rate | 100.0/s | dispatch_keep_last events per second across all CPUs. Sustained rate indicates dispatch starvation. |

Every monitor threshold uses the sustained_samples window — a violation must persist for N consecutive samples before it counts.
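
A minimal sketch of that sustained-window filter, assuming each sample has already been reduced to a violated/not-violated flag. The function name is hypothetical, and treating a window of 0 as 1 is a choice made for this example, not documented ktstr behavior.

```rust
/// Returns true when `violations` contains at least `sustained_samples`
/// consecutive `true` entries. Isolated spikes shorter than the window
/// reset the run counter and never count.
fn sustained_violation(violations: &[bool], sustained_samples: usize) -> bool {
    // Assumption for this sketch: a window of 0 behaves like a window of 1.
    let needed = sustained_samples.max(1);
    let mut run = 0usize;
    for &violated in violations {
        run = if violated { run + 1 } else { 0 };
        if run >= needed {
            return true;
        }
    }
    false
}
```

At the default of 5 samples and a ~100ms interval, five consecutive violating samples (about 500ms) trip the check, while two violations separated by a clean sample do not.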

NUMA checks

For workers with a MemPolicy, three thresholds gate page placement:

  • min_page_locality — minimum fraction of pages on the expected NUMA nodes (the cgroup’s cpuset nodes, derived at evaluation time). Zero observed pages counts as zero locality, not a vacuous pass.
  • max_cross_node_migration_ratio — bound on migrated pages relative to allocated pages (from /proc/vmstat deltas).
  • max_slow_tier_ratio — bound on the fraction of pages landing on memory-only (CXL-tier) nodes.
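
The zero-pages rule for min_page_locality is easy to get wrong, so here is a small sketch of it. The function names and the raw page-count inputs are assumptions for illustration; ktstr's own check lives in its assertion source.

```rust
/// Fraction of pages on the expected NUMA nodes.
/// Zero observed pages counts as zero locality, not a vacuous pass.
fn page_locality(local_pages: u64, total_pages: u64) -> f64 {
    if total_pages == 0 {
        0.0
    } else {
        local_pages as f64 / total_pages as f64
    }
}

/// True when the observed locality meets the `min_page_locality` bound.
fn locality_ok(local_pages: u64, total_pages: u64, min_page_locality: f64) -> bool {
    page_locality(local_pages, total_pages) >= min_page_locality
}
```

With a bound of 0.8, 90 of 100 local pages passes, 79 of 100 fails, and 0 of 0 fails as well.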

Default thresholds

not_starved = true also enables the built-in fairness and gap checks at these defaults:

| Check | Release | Debug |
| --- | --- | --- |
| Scheduling gap | 2000 ms | 3000 ms |
| Fairness spread | 15% | 35% |

Debug builds run with higher scheduling overhead, so thresholds are relaxed.

How configuration merges

Assert is the threshold-config struct; every field is an Option where None means “inherit”. Three layers merge, last-Some wins: the baseline (all None), then the scheduler’s assert, then the per-test attributes — so a scheduler-wide bound applies to every test and any single test can override or disable it. enforce_monitor_thresholds is the one sticky field: once any layer sets it, it stays set. Worked override recipes live in Customize Checking.
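
The layering rule can be modeled with a toy struct. This is a sketch of last-Some-wins merging under assumed names: `Thresholds`, its three fields, and `merge` are stand-ins chosen for the example, not the real Assert type.

```rust
/// Toy stand-in for a threshold config: `None` means "inherit".
#[derive(Clone, Copy, Debug, Default, PartialEq)]
struct Thresholds {
    max_gap_ms: Option<u64>,
    max_imbalance_ratio: Option<f64>,
    /// Sticky flag: once any layer sets it, it stays set.
    enforce_monitor: bool,
}

impl Thresholds {
    /// Layer `over` on top of `self`: a `Some` in `over` wins,
    /// a `None` inherits whatever the lower layer had.
    fn merge(self, over: Thresholds) -> Thresholds {
        Thresholds {
            max_gap_ms: over.max_gap_ms.or(self.max_gap_ms),
            max_imbalance_ratio: over.max_imbalance_ratio.or(self.max_imbalance_ratio),
            enforce_monitor: self.enforce_monitor || over.enforce_monitor,
        }
    }
}
```

Merging baseline, then scheduler, then test gives the test's own bound where it set one, the scheduler's bound everywhere else, and the sticky flag from whichever layer raised it.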

execute_steps_with(ctx, steps, Some(&assert)) bypasses the merged config with an explicit Assert for that scenario’s worker checks.

Verdicts and outcomes

Every assertion produces one of four outcomes, and a result’s terminal verdict is the fold over all of them, most severe first: Fail > Inconclusive > Pass > Skip.

| Outcome | Meaning |
| --- | --- |
| Pass | the assertion ran and the value satisfied the bound |
| Fail | the assertion ran and the value violated the bound |
| Inconclusive | the assertion ran but had no signal to evaluate |
| Skip | the scenario couldn’t run (unmet precondition) |

Inconclusive exists for instrument-derived denominators — a ratio whose denominator (iterations, samples, wall-clock interval) legitimately reached zero because the workload produced no signal. Policy-derived denominators stay Fail on zero: under MemPolicy::Bind the policy says pages will exist, so their absence is a defect, not “couldn’t measure”.

CI gates read the verdict through four accessors:

if r.is_pass() { /* ship */ }
if r.is_fail() { /* block; surface r.failure_details() */ }
if r.is_skip() || r.is_inconclusive() { /* no verdict — triage */ }

is_pass() is deliberately strict: inconclusive and all-skip both read false.
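
The severity fold (Fail > Inconclusive > Pass > Skip) can be sketched with an ordered enum. `Outcome` and `terminal_verdict` are hypothetical names for this example, and returning Skip for an empty list is an assumption, chosen so that nothing measured never reads as a pass.

```rust
/// Outcomes declared in ascending severity, so the derived `Ord`
/// makes `max()` pick the most severe one.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum Outcome {
    Skip,
    Pass,
    Inconclusive,
    Fail,
}

/// Fold a set of assertion outcomes into one terminal verdict.
fn terminal_verdict(outcomes: &[Outcome]) -> Outcome {
    // Assumption for this sketch: no outcomes at all folds to Skip.
    outcomes.iter().copied().max().unwrap_or(Outcome::Skip)
}
```

A single Inconclusive among passes makes the whole result Inconclusive, which is why a strict is_pass() reads false for it.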

Beyond attributes

  • Verdict + claim! is the claim accumulator for custom scenario bodies. Labels come from the code itself (stringify!-derived), so they cannot drift from the value they describe:

    let mut v = Assert::default_checks().verdict();
    stats.claim_max_gap_ms(&mut v).at_most(100);
    claim!(v, iter_delta).at_least(1000);
    let result = v.into_result();

  • AbsoluteThresholds: flat per-run bounds (max_p99_wake_latency_ns, max_iteration_cost_p99_ns, max_migrations, min_work_units) checked in one call, assert_thresholds(&reports, &AbsoluteThresholds::strict()). Empty report slices return a skip rather than a vacuous pass.

  • assert_scx_events_clean(events, bound): SCX event counters under a cap (None = exactly zero); negative counts always fail.

  • Composition: AssertResult::merge accumulates results in a loop; all_of / any_of fold sibling results as AND / OR.

Signatures, comparators, and construction details are in the ktstr::assert rustdoc. For phase-scoped checks over a stepped scenario, see Phases and Temporal Assertions.

Topology

Schedulers make placement decisions across LLC and NUMA boundaries — where to wake a task, when a migration is worth the cache cost. Each ktstr test declares the topology those decisions should be tested against, and the VM it runs in actually has it: the declared NUMA nodes, cache domains, and SMT siblings are what the guest kernel sees.

The notation

Topologies render as {n}n{l}l{c}c{t}t — NUMA nodes, LLCs, cores per LLC, threads per core. One quirk to internalize:

Note

The l count is the total LLC count across the VM, not per-node. 2n4l4c2t is 2 NUMA nodes and 4 LLCs total (2 per node), 4 cores per LLC, 2 threads per core = 4 × 4 × 2 = 32 vCPUs.
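
To make the arithmetic concrete, here is a small parser for the notation. It is a sketch only: the `Topo` struct and its methods are hypothetical and unrelated to ktstr's TestTopology.

```rust
/// Parsed form of the `{n}n{l}l{c}c{t}t` notation (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
struct Topo {
    numa_nodes: u32,
    llcs: u32, // total across the VM, not per node
    cores_per_llc: u32,
    threads_per_core: u32,
}

impl Topo {
    /// Parse a string such as "2n4l4c2t"; returns None on malformed input.
    fn parse(s: &str) -> Option<Topo> {
        let (n, rest) = s.split_once('n')?;
        let (l, rest) = rest.split_once('l')?;
        let (c, rest) = rest.split_once('c')?;
        let t = rest.strip_suffix('t')?;
        Some(Topo {
            numa_nodes: n.parse().ok()?,
            llcs: l.parse().ok()?,
            cores_per_llc: c.parse().ok()?,
            threads_per_core: t.parse().ok()?,
        })
    }

    /// vCPU count: the NUMA node count does not multiply in,
    /// because `llcs` is already the VM-wide total.
    fn total_cpus(&self) -> u32 {
        self.llcs * self.cores_per_llc * self.threads_per_core
    }
}
```

Parsing "2n4l4c2t" and calling total_cpus() gives 32, matching the note above.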

Containment is strict — threads in a core, cores in an LLC, LLCs in a NUMA node — and guest CPUs are numbered sequentially through it. 1n2l4c2t (16 vCPUs) lays out as:

node 0
├─ LLC 0                      ├─ LLC 1
│  ├─ core 0: cpu 0, 1        │  ├─ core 4: cpu 8,  9
│  ├─ core 1: cpu 2, 3        │  ├─ core 5: cpu 10, 11
│  ├─ core 2: cpu 4, 5        │  ├─ core 6: cpu 12, 13
│  └─ core 3: cpu 6, 7        │  └─ core 7: cpu 14, 15
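
Sequential numbering means a guest CPU id decomposes with plain integer division. A sketch, with a hypothetical function name; core indices are global across the VM, as in the diagram above.

```rust
/// Map a guest CPU id to (llc, core, thread) under sequential numbering:
/// threads fill a core, cores fill an LLC. `core` is the VM-wide core index.
fn locate_cpu(cpu: u32, cores_per_llc: u32, threads_per_core: u32) -> (u32, u32, u32) {
    let core = cpu / threads_per_core;
    let thread = cpu % threads_per_core;
    let llc = core / cores_per_llc;
    (llc, core, thread)
}
```

On 1n2l4c2t, locate_cpu(9, 4, 2) returns (1, 4, 1): LLC 1, core 4, second SMT thread.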

Most tests use one NUMA node; multi-NUMA topologies matter when the scheduler weighs memory locality. The gauntlet sweeps a test across a whole preset matrix of these shapes.

What a test declares — and what it gets

The #[ktstr_test] attributes numa_nodes, llcs, cores, threads declare the shape (see the macro reference for defaults and inheritance). The run output echoes the topology the guest booted with — the [topo=...] tag in failure headers and the timeline header:

ktstr_test 'throughput_gate' [sched=scx-ktstr] [topo=1n1l2c1t] failed:
...
topology: 1n1l2c1t (2 cpus)  scheduler: my_sched  scenario: throughput_gate  duration: 15.0s

To see a host’s physical layout in the same vocabulary, run ktstr topo:

CPUs:       64
LLCs:       4
NUMA nodes: 1
  LLC 0 (node 0): [0, 1, 2, 3, 4, 5, 6, 7, 32, 33, 34, 35, 36, 37, 38, 39]
  LLC 1 (node 0): [8, 9, 10, 11, 12, 13, 14, 15, 40, 41, 42, 43, 44, 45, 46, 47]
  LLC 2 (node 0): [16, 17, 18, 19, 20, 21, 22, 23, 48, 49, 50, 51, 52, 53, 54, 55]
  LLC 3 (node 0): [24, 25, 26, 27, 28, 29, 30, 31, 56, 57, 58, 59, 60, 61, 62, 63]

(Host CPU numbering differs from the guest’s sequential scheme — here SMT siblings sit 32 apart — which is exactly why tests declare a topology instead of inheriting the host’s.)

Cpusets from topology

Scenarios don’t hard-code CPU lists; a CpusetSpec resolves against the test’s topology at runtime. On 1n2l4c2t, CpusetSpec::Llc(0) resolves to CPUs 0-7, so the cgroup’s cpuset.cpus is written as 0-7; Llc and Numa cover their full domain, while the fractional and partition variants (Range, Disjoint, Overlap) slice the usable-CPU pool.

Querying topology from a scenario

Ctx.topo is a TestTopology. The queries scenario authors actually use:

  • total_cpus(), num_llcs(), num_numa_nodes() — sizes, e.g. for skip guards (if ctx.topo.num_llcs() < 2 { return Ok(AssertResult::skip(...)) }).
  • usable_cpus() / usable_cpuset() — CPUs available for workload placement. On topologies with more than 2 CPUs the last CPU is reserved for the root cgroup (on 8 CPUs: usable = 0-6). Built-in scenarios and fractional CpusetSpecs use this pool automatically.
  • llc_aligned_cpuset(idx) / numa_aligned_cpuset(node) — the CPU set of one LLC or one node’s LLCs.
  • numa_nodes_for_cpuset(cpus) — which nodes a CPU set touches; this derives the expected-node set for NUMA checks.
  • numa_distance(from, to) — kernel conventions: 10 local, higher is farther, 255 unreachable/unknown. VM topologies without explicit distances report 10 local / 20 remote.
  • node_meminfo(node) / is_memory_only(node) — per-node memory and CXL-style memory-only node detection.
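
The usable-CPU rule from the list above, as a standalone sketch. The function is a hypothetical model of the documented behavior (last CPU reserved only when there are more than 2 CPUs), not the usable_cpus() implementation.

```rust
use std::ops::Range;

/// CPUs available for workload placement on a topology with
/// `total_cpus` sequentially numbered CPUs. With more than 2 CPUs,
/// the last one is held back for the root cgroup.
fn usable_cpu_range(total_cpus: u32) -> Range<u32> {
    if total_cpus > 2 {
        0..total_cpus - 1
    } else {
        0..total_cpus
    }
}
```

On 8 CPUs the range is 0..7, which is CPUs 0-6; on a 2-CPU topology both CPUs stay usable.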

Ctx::cpuset_cpus(&spec) returns the CPU count a spec resolves to — useful for sizing worker counts by hand. Its denominator is the topology-level cpuset, not any cgroup’s currently-effective one; for cgroup-aware sizing prefer CgroupDef::workers_pct, which resolves against the cgroup’s own cpuset at apply time.

The full method catalog (construction, LlcInfo, CPU-list parsing) is in the TestTopology rustdoc.

  • Gauntlet — preset topology matrices and the constraints that filter them.
  • MemPolicy — NUMA memory placement to pair with multi-node topologies.
  • Resource Budget — how the host’s topology is carved up when tests run concurrently.

MemPolicy

Testing whether a scheduler keeps tasks near their memory requires a measurable locality signal — workers whose pages verifiably live on specific NUMA nodes, so that placement decisions show up as page counts instead of guesswork. MemPolicy creates that signal: it wraps set_mempolicy(2) per worker (applied after fork, before the work loop), and the NUMA checks then gate on where the pages actually landed. Pair it with multi-NUMA gauntlet presets to sweep the same test across node counts.

pub enum MemPolicy {
    Default,
    Bind(BTreeSet<usize>),
    Preferred(usize),
    Interleave(BTreeSet<usize>),
    Local,
    PreferredMany(BTreeSet<usize>),
    WeightedInterleave(BTreeSet<usize>),
}

  • Default: inherit the parent’s policy; no syscall made.
  • Bind(nodes) (MemPolicy::bind([0, 1])): allocate only from these nodes (MPOL_BIND); allocation fails with ENOMEM when they are exhausted.
  • Preferred(node) (::preferred(0)): prefer one node, fall back silently when it is full (MPOL_PREFERRED).
  • Interleave(nodes) (::interleave([0, 1])): round-robin allocations across the nodes (MPOL_INTERLEAVE).
  • Local: nearest node to the allocating CPU (MPOL_LOCAL).
  • PreferredMany(nodes) (::preferred_many([0, 1])): prefer any of the nodes, fall back when all are full (MPOL_PREFERRED_MANY, kernel 5.15+).
  • WeightedInterleave(nodes) (::weighted_interleave([0, 1])): interleave proportional to the per-node weights in /sys/kernel/mm/mempolicy/weighted_interleave/ (MPOL_WEIGHTED_INTERLEAVE, kernel 6.9+).

Node-set constructors accept any IntoIterator<Item = usize>. MemPolicy::node_set() returns the referenced nodes (empty for Default / Local).

MpolFlags

Optional mode flags OR’d into the set_mempolicy mode:

| Flag | Meaning |
| --- | --- |
| NONE | No flags |
| STATIC_NODES | Nodemask is absolute; it is not remapped when the task’s cpuset changes |
| RELATIVE_NODES | Nodemask is relative to the task’s current cpuset |
| NUMA_BALANCING | Enable NUMA-balancing optimization for this policy |

Flags combine with |. STATIC_NODES | RELATIVE_NODES is rejected at setup time (the kernel would return EINVAL), as is any unknown bit. The kernel accepts NUMA_BALANCING only alongside MPOL_BIND or MPOL_PREFERRED_MANY — ktstr does not pre-validate that pairing, so other combinations surface as EINVAL from the worker’s set_mempolicy call.
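
The setup-time rejection rules can be sketched over raw bits. The bit values below are the MPOL_F_* constants as I understand them from the kernel's linux/mempolicy.h, so verify them against your headers; the function name and error strings are hypothetical, not ktstr's.

```rust
// Mode-flag bits (assumed to match the kernel uapi header).
const MPOL_F_NUMA_BALANCING: u32 = 1 << 13;
const MPOL_F_RELATIVE_NODES: u32 = 1 << 14;
const MPOL_F_STATIC_NODES: u32 = 1 << 15;
const KNOWN_FLAGS: u32 = MPOL_F_NUMA_BALANCING | MPOL_F_RELATIVE_NODES | MPOL_F_STATIC_NODES;

/// Reject unknown bits and the STATIC_NODES | RELATIVE_NODES pairing.
/// The NUMA_BALANCING/mode pairing is deliberately not checked here,
/// mirroring the text: it is left to surface as EINVAL from the kernel.
fn validate_mode_flags(flags: u32) -> Result<u32, String> {
    let unknown = flags & !KNOWN_FLAGS;
    if unknown != 0 {
        return Err(format!("unknown mode flag bits {unknown:#x}"));
    }
    if flags & MPOL_F_STATIC_NODES != 0 && flags & MPOL_F_RELATIVE_NODES != 0 {
        return Err("STATIC_NODES and RELATIVE_NODES are mutually exclusive".to_string());
    }
    Ok(flags)
}
```

Checking before the fork keeps the failure at setup time, where it can name the offending flags instead of reporting a bare EINVAL from a worker.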

Usage

WorkSpec and CgroupDef both take .mem_policy() and .mpol_flags():

let def = CgroupDef::named("cg_0")
    .cpuset(CpusetSpec::numa(0))
    .workers(4)
    .mem_policy(MemPolicy::bind([0]));

Cpuset validation

When a cgroup has a cpuset and no remapping flag is set, ktstr validates at setup time that the policy’s nodes are reachable from that cpuset — MemPolicy::Bind([1]) on a cgroup confined to node 0 fails before the run starts, not as a mystery ENOMEM mid-run.

The check is flag-aware: STATIC_NODES swaps it for a node-exists-on-host check (the nodemask is absolute and deliberately allowed outside the cpuset), and RELATIVE_NODES bypasses it (the kernel remaps the ordinals internally). Policies without a node set (Default, Local) skip validation.
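
A sketch of that flag-aware check, under two assumptions: "reachable from the cpuset" is modeled as a subset test against the cpuset's NUMA nodes, and the three flag states are collapsed into a hypothetical `Remap` enum. None of these names are ktstr's.

```rust
use std::collections::BTreeSet;

/// Which remapping flag, if any, accompanies the policy (hypothetical type).
#[derive(Clone, Copy, Debug)]
enum Remap {
    Unflagged,
    StaticNodes,
    RelativeNodes,
}

/// Validate a policy's node set before the run starts.
fn validate_policy_nodes(
    policy_nodes: &BTreeSet<usize>,
    cpuset_nodes: &BTreeSet<usize>,
    host_nodes: &BTreeSet<usize>,
    remap: Remap,
) -> Result<(), String> {
    // Policies without a node set (Default, Local) skip validation.
    if policy_nodes.is_empty() {
        return Ok(());
    }
    let (allowed, scope) = match remap {
        // The kernel remaps relative ordinals itself; nothing to check.
        Remap::RelativeNodes => return Ok(()),
        // Absolute nodemask: nodes only need to exist on the host.
        Remap::StaticNodes => (host_nodes, "host"),
        // No flag: nodes must fall inside the cgroup's cpuset nodes.
        Remap::Unflagged => (cpuset_nodes, "cgroup cpuset"),
    };
    if policy_nodes.is_subset(allowed) {
        Ok(())
    } else {
        let missing: Vec<&usize> = policy_nodes.difference(allowed).collect();
        Err(format!("policy nodes {missing:?} are outside the {scope} node set"))
    }
}
```

Binding to node 1 from a cgroup confined to node 0 fails without a flag, passes with StaticNodes on a two-node host, and is never checked with RelativeNodes.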

What gets checked

Locality results feed the NUMA checking thresholds: min_page_locality, max_cross_node_migration_ratio, max_slow_tier_ratio. The expected node set is derived from the cgroup’s cpuset at evaluation time, not from the worker’s MemPolicy; in the common case where memory is bound to the same nodes the cpuset pins, the two coincide. A locality violation renders with the observed fraction, the threshold, and the page counts (format from the assertion source):

page locality <observed> (<pct>%) below threshold <min> (<pct>%) (<local>/<total> pages local)

Example: NUMA-aware locality test

use ktstr::prelude::*;

#[ktstr_test(
    numa_nodes = 2, llcs = 4, cores = 4, threads = 1,
    min_numa_nodes = 2, max_numa_nodes = 2,
    min_page_locality = 0.8,
)]
fn numa_locality(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        CgroupDef::named("node0")
            .cpuset(CpusetSpec::numa(0))
            .workers(4)
            .mem_policy(MemPolicy::bind([0])),
        CgroupDef::named("node1")
            .cpuset(CpusetSpec::numa(1))
            .workers(4)
            .mem_policy(MemPolicy::bind([1])),
    ])
}

Each cgroup’s workers are pinned to one NUMA node’s CPUs via CpusetSpec::numa() and their allocations bound to the same node via MemPolicy::bind(); the test fails if less than 80% of pages land where they were bound.

Node-set policies only mean something on multi-NUMA topologies. The constraint pair min_numa_nodes = 2, max_numa_nodes = 2 keeps gauntlet expansion on two-node presets — single-node presets are filtered out rather than failing. Both bounds are needed: the default constraints cap at one NUMA node, and an inverted pair (min above max) is rejected at validation time. See Gauntlet for the preset matrix.

Performance Mode

Without performance mode, a 50ms scheduling gap in a measurement could be host noise; with it, the same gap indicates a scheduler problem. Performance mode removes host-side variance — vCPU threads pinned to dedicated cores, hugepage-backed guest memory, NUMA-local allocation, real-time scheduling — so timing thresholds measure the scheduler under test, not the host it happens to share.

Usage

#[ktstr_test(
    llcs = 2,
    cores = 4,
    threads = 2,
    performance_mode = true,
)]
fn my_perf_test(ctx: &Ctx) -> Result<AssertResult> {
    scenarios::steady(ctx)
}

The VM builder API takes the same switch: KtstrVm::builder().performance_mode(true).

When to use

Performance mode is for tests where host-side scheduling noise affects results — fairness spread measurements, scheduling gap detection, imbalance ratio checks. It is not needed for correctness tests (cpuset isolation, starvation detection) where pass/fail is binary.

The gauntlet runs many VMs in parallel, and performance mode on parallel VMs can oversubscribe the host if scheduled naively. Avoid performance_mode on gauntlet-swept tests unless the host has enough CPUs for the topology matrix.

With stable measurements, tests can set tight thresholds (max_gap_ms, min_iteration_rate, max_p99_wake_latency_ns) to catch regressions against a fixed bar; cargo ktstr perf-delta builds on the same tests to catch regressions against a previous commit. Perf-mode results are comparable only against runs on the same host — guest-side jitter from shared caches and memory bandwidth remains.

What it does

On x86_64:

  • vCPU pinning — each virtual LLC maps to a physical LLC group and vCPU threads are pinned to cores within it, so the host scheduler cannot migrate them across cache domains mid-measurement.
  • Hugepages — guest memory is allocated from 2MB hugepages when enough are free, eliminating host-side TLB pressure.
  • NUMA mbind — guest memory is bound (MPOL_BIND, strict — no silent fallback to remote nodes) to the NUMA nodes of the pinned vCPUs.
  • RT scheduling — vCPU threads run SCHED_FIFO priority 1; the monitor and watchdog run at priority 2 on a dedicated host CPU no vCPU shares, so sampling and timeout enforcement can always preempt a vCPU thread.
  • PAUSE and HLT exit suppression — guest spinlock PAUSE loops and idle HLT normally trap to the hypervisor so it can schedule other vCPUs; with dedicated cores that reschedule is pure overhead, so both exits are disabled. (HLT disable is skipped when the host’s SMT-RSB mitigation forbids it; PAUSE alone is still disabled.)
  • KVM_HINTS_REALTIME — a CPUID hint telling the guest kernel its vCPUs own dedicated cores; the guest drops paravirt yield paths and polls briefly before halting instead of paying wakeup latency.

On aarch64, the four host-side items apply (pinning, hugepages, NUMA mbind, RT scheduling); the x86-specific exit suppression and CPUID hint do not exist there.

Prerequisites

Sufficient host CPUs — at least (llcs * cores * threads) + 1 online CPUs; the extra CPU hosts the monitor and watchdog threads. The host also needs at least as many physical LLC groups as the test declares virtual LLCs.

2MB hugepages (optional) — check /sys/kernel/mm/hugepages/hugepages-2048kB/free_hugepages. Without them guest memory uses regular pages and a warning is printed.

CAP_SYS_NICE or an rtprio limit (optional) — SCHED_FIFO requires root or RLIMIT_RTPRIO at or above the requested priority. For non-root use:

# /etc/security/limits.conf
username  -  rtprio  99

Log out and back in for the limit to take effect. Without it, RT scheduling is skipped with a warning and results may be noisier.

Sizing the host

A single perf-mode test needs (llcs * cores * threads) + 1 online CPUs and llcs free physical LLC groups — the test holds an exclusive lock on one host LLC group per virtual LLC for the run’s duration. To run K perf-mode tests concurrently without contention skips, the host needs K * llcs free LLC groups; with fewer, the excess tests skip with ResourceContention and nextest retries them after a holder releases. The vm-perf test group in .config/nextest.toml caps how many run at once.
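
The sizing arithmetic, written out as a quick capacity check. The helper names are hypothetical; the formulas are the ones stated above.

```rust
/// Online host CPUs one perf-mode test needs: every vCPU gets a
/// dedicated CPU, plus one for the monitor and watchdog threads.
fn required_host_cpus(llcs: u32, cores: u32, threads: u32) -> u32 {
    llcs * cores * threads + 1
}

/// Free physical LLC groups needed to run `concurrent_tests` perf-mode
/// tests of the same shape without contention skips.
fn required_llc_groups(concurrent_tests: u32, llcs: u32) -> u32 {
    concurrent_tests * llcs
}
```

A 2-LLC, 4-core, 2-thread test needs 17 online CPUs; running three of them side by side needs 6 free LLC groups.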

Failure modes

Performance mode never runs unisolated: if the host cannot honor the guarantee, the build fails before boot and the test skips visibly rather than shipping a measurement that does not match what was asked for.

  • PerfModeUnavailable — permanent host insufficiency: too few CPUs or LLC groups for the topology, no satisfiable pinning plan, or no free CPU left for service threads. Skips by default with a visible banner (ktstr: SKIP: <reason> on stderr, exit 0, skip recorded in the run sidecar); promoted to a hard fail under KTSTR_NO_SKIP_MODE for runs that demand execution.
  • ResourceContention — transient: another run holds a lock on a needed LLC or CPU (the reason names it, e.g. LLC 3 busy). Skips with the same SKIP: banner; a retry after the holder finishes succeeds.
  • Warnings (non-fatal) — insufficient free hugepages (regular pages used); high host load (procs_running above half the vCPU count — results may be noisy); unstable TSC (x86_64, common in nested virtualization — timing variance is higher).

The full skip-vs-fail model — which requester gets a skip, which gets a hard error, and what the default path does instead — is in Resource Budget.

Disabling performance mode

--no-perf-mode (or KTSTR_NO_PERF_MODE=1) forces performance_mode = false and routes the run through the budgeted coordination path: a shared LLC reservation sized to a CPU budget, enforced by a cgroup cpuset instead of pinning — none of the isolation features above apply. The mode comparison, the CPU budget, and the --cpu-cap flag live in Resource Budget.

Resource Budget

ktstr boots KVM VMs and builds kernels on hosts that are usually doing other things at the same time — more tests, kernel builds, a developer session. The resource budget is how concurrent ktstr processes share host CPUs without silently corrupting each other’s measurements: every run reserves host LLCs through advisory file locks, and budgeted runs are additionally confined to an exact CPU count by a cgroup v2 cpuset sandbox.

When to use it

  • Multi-tenant CI hosts where unbounded parallelism starves concurrent jobs but the full performance-mode contract (RT scheduling, hugepages, NUMA mbind) is too heavy.
  • Kernel builds beside perf-mode tests — the build’s shared lock coordinates with the perf-mode exclusive lock, so make never stomps a measurement in progress.
  • Concurrent no-perf-mode VMs — a cap of N CPUs bounds how much capacity each run reserves; peers wait instead of racing for CPU.

The three coordination modes

Every VM run takes one of three coordination paths, selected by two switches: performance_mode on the test or builder, and --no-perf-mode / KTSTR_NO_PERF_MODE (any non-empty value).

| Mode | Selected by | LLC lockfiles | Per-CPU lockfiles | Enforcement |
| --- | --- | --- | --- | --- |
| Performance mode | performance_mode = true | exclusive (LOCK_EX), one per virtual LLC | none (the exclusive LLC lock covers its CPUs) | vCPU pinning, RT scheduling, hugepages, NUMA mbind |
| Budgeted (no-perf-mode) | --no-perf-mode / KTSTR_NO_PERF_MODE | shared (LOCK_SH) on the planned LLC set | none (the cgroup cpuset is the enforcement layer) | cgroup v2 cpuset sandbox + soft affinity mask |
| Default | neither | shared (LOCK_SH) on the 1:1 plan’s LLCs | exclusive (LOCK_EX), one per assigned host CPU | none (reservation only) |

Lockfiles live at {KTSTR_LOCK_DIR or /tmp}/ktstr-llc-{N}.lock and {KTSTR_LOCK_DIR or /tmp}/ktstr-cpu-{C}.lock. The modes compose: shared holders coexist with each other, while an exclusive holder blocks every shared acquirer and is itself blocked by any shared holder. So any number of budgeted and default runs share LLCs among themselves, a perf-mode run waits for all of them to release, and while a perf-mode run holds its LLCs nobody else touches those CPUs. Default runs additionally exclude each other per CPU, so two default VMs never time-slice the same host CPU. Kernel builds take the budgeted path.
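
The composition rule is ordinary reader/writer compatibility. A toy model, with a hypothetical `Lock` enum and function; real runs get this behavior from flock(2) on the lockfiles.

```rust
/// Lock kind held on one lockfile (toy model of LOCK_SH / LOCK_EX).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Lock {
    Shared,
    Exclusive,
}

/// Would a non-blocking acquire of `want` succeed given current holders?
fn can_acquire(held: &[Lock], want: Lock) -> bool {
    match want {
        // Exclusive needs the file entirely free.
        Lock::Exclusive => held.is_empty(),
        // Shared coexists with other shared holders only.
        Lock::Shared => !held.contains(&Lock::Exclusive),
    }
}
```

Two budgeted runs (shared) coexist on an LLC, a perf-mode run (exclusive) cannot join them, and nothing can join a perf-mode run.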

When the default path cannot map its topology 1:1 onto the host it does not fail: if a plan exists but every slot is busy, the run skips with ResourceContention and nextest retries; if no plan can exist (host too small), the run proceeds overcommitted — every vCPU thread masked to the allowed CPUs — and warns when that means oversubscription (see below).

Too-small hosts: who asked determines the verdict

The outcome of an unsatisfiable request depends on where the request came from — an explicit guarantee must never silently degrade, and an operator typo must not look like a host limitation:

| Request | Error | Outcome |
| --- | --- | --- |
| performance_mode = true, host can’t honor isolation | PerfModeUnavailable | skip (fail under KTSTR_NO_SKIP_MODE) |
| Per-test cpu_budget above the allowed CPUs | TopologyInsufficient | skip (fail under KTSTR_NO_SKIP_MODE) |
| Operator --cpu-cap / KTSTR_CPU_CAP above the allowed CPUs | CpuBudgetUnsatisfiable | hard fail |
| Default mode, no 1:1 placement possible | none | runs overcommitted, warns |

A test attribute is a capability requirement a bigger host would satisfy, so it skips. An operator-typed number that does not exist on this host is a misconfiguration, so it fails. The over-cap error names both numbers:

--cpu-cap N = 96 exceeds the 64 CPUs this process is allowed on (from
sched_getaffinity / Cpus_allowed_list). Pick a value ≤ 64, release the
cgroup/taskset constraint restricting this process, or omit --cpu-cap
to use the auto-sized default (30% of the allowed set for kernel
builds; the vCPU count, floored at 30%, for VMs).

The default-mode overcommit warning fires only when the allowed CPU set is genuinely smaller than the vCPU count (a CI runner or systemd slice can be narrower than the online host):

ktstr: WARNING: only 8 host CPUs available for 16 vCPUs (2.0x
oversubscription) — the process cpuset is smaller than the guest, so
the auto-sized CPU budget collapsed to it. NOTHING opted into this.
The host time-slices the vCPU threads, confounding guest-scheduler
measurement (absolute work scales ~1/2; timing metrics are host
artifacts). Widen the process cpuset, or shrink the guest topology.

The stamped cpu_budget in the run’s sidecar also drops below the vCPU count, so an A/B comparison against an overcommitted run is flagged rather than silently confounded.

The CPU budget

The budget is resolved in precedence order:

  1. --cpu-cap N on the command line.
  2. KTSTR_CPU_CAP=N when the flag is absent (empty string = unset).
  3. Neither: kernel builds get 30% of the allowed CPUs (rounded up, minimum 1); no-perf-mode VMs get max(30%, min(vcpus, allowed)) so a wide VM’s vCPU threads are not host-oversubscribed by the 30% mask — an oversubscribed guest measures host contention, not its own scheduler. An explicit cap below the vCPU count is the deliberate opt-in to oversubscription for contention testing.

0 is rejected with --cpu-cap must be ≥ 1 CPU (got 0) — zero is a scripting sentinel, not a silent “no cap”.

The reference set is the calling process’s allowed CPUs (sched_getaffinity, with a /proc/self/status fallback), not the host’s online count — so the reservation stays valid under cgroup-restricted CI runners. An empty allowed set is a hard error: guessing on a misconfigured host is worse than failing visibly.

A per-test cpu_budget attribute on #[ktstr_test] overrides the auto-size for that test; an operator --cpu-cap / KTSTR_CPU_CAP wins over both.
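
Putting the precedence and the auto-size formulas together gives the following sketch. Everything here is a model under assumptions: `Requester`, `resolve_budget`, and the error strings are hypothetical, and the real code distinguishes skip from hard fail where this sketch returns a plain Err.

```rust
/// Who is asking for a budget (hypothetical type for this sketch).
#[derive(Clone, Copy, Debug)]
enum Requester {
    KernelBuild,
    Vm { vcpus: usize },
}

/// 30% of the allowed CPUs, rounded up, minimum 1.
fn thirty_percent(allowed: usize) -> usize {
    ((allowed * 30 + 99) / 100).max(1)
}

/// Resolve the CPU budget: --cpu-cap, then KTSTR_CPU_CAP, then the
/// per-test cpu_budget, then the auto-sized default.
fn resolve_budget(
    cli_cap: Option<usize>,
    env_cap: Option<usize>,
    test_budget: Option<usize>,
    requester: Requester,
    allowed: usize,
) -> Result<usize, String> {
    if allowed == 0 {
        return Err("empty allowed CPU set".to_string());
    }
    // Operator input wins over everything else.
    if let Some(cap) = cli_cap.or(env_cap) {
        if cap == 0 {
            return Err("cap must be at least 1 CPU (got 0)".to_string());
        }
        if cap > allowed {
            return Err(format!("cap {cap} exceeds the {allowed} allowed CPUs"));
        }
        return Ok(cap);
    }
    if let Some(budget) = test_budget {
        if budget > allowed {
            return Err(format!("cpu_budget {budget} exceeds the {allowed} allowed CPUs"));
        }
        return Ok(budget);
    }
    Ok(match requester {
        Requester::KernelBuild => thirty_percent(allowed),
        // max(30%, min(vcpus, allowed)): never mask a wide VM below its vCPUs.
        Requester::Vm { vcpus } => thirty_percent(allowed).max(vcpus.min(allowed)),
    })
}
```

On 64 allowed CPUs a kernel build auto-sizes to 20 and a 32-vCPU VM to 32; on 8 allowed CPUs a 16-vCPU VM collapses to 8, which is the case the overcommit warning describes.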

Flag availability

  • --no-perf-mode: cargo ktstr test / coverage / llvm-cov / shell, and ktstr shell. KTSTR_NO_PERF_MODE (any non-empty value) works everywhere.
  • --cpu-cap N: ktstr shell, ktstr kernel build, cargo ktstr shell, cargo ktstr kernel build — and it requires --no-perf-mode (perf mode already holds whole LLCs exclusively, so a cap would double-reserve). For cargo ktstr test / coverage / llvm-cov set KTSTR_CPU_CAP=N instead.

How a reservation is planned

Budgeted acquisition runs three phases:

  1. Discover — stat every LLC lockfile and read /proc/locks once to snapshot current holders. No locks taken.
  2. Plan — rank LLCs: prefer LLCs that already have holders (consolidation packs shared runs together), seed on the best-ranked LLC’s NUMA node, and greedily fill that node before spilling to nearest-by-distance neighbors. Accumulate LLCs until their allowed CPUs cover the budget.
  3. Acquire — non-blocking shared locks on every selected LLC, all-or-nothing. If any lock is busy, every held lock is dropped and the whole cycle retries a few times with short ascending backoff; after the final attempt it bails with a ResourceContention error naming the winning holders.

The lock granularity is per-LLC, but the reserved CPU list holds exactly the budget — the last selected LLC typically contributes only a prefix of its CPUs. When the plan spans more than one NUMA node, stderr warns:

ktstr: reserving LLCs [0 (node 0), 2 (node 1)] across 2 NUMA nodes
(preferred single-node contiguous unavailable). Build will run;
memory-access latency may be higher.
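
The accumulate step of the Plan phase, including the prefix taken from the last LLC, can be sketched as below. The ranking itself (holders first, NUMA-node seeding, distance-ordered spill) is assumed to have happened already; the function name and the input shape are hypothetical.

```rust
/// `ranked` lists (llc_id, allowed CPUs in that LLC) in plan order.
/// Returns the LLCs to lock and exactly `budget` CPUs to reserve,
/// or None when the ranked LLCs cannot cover the budget.
fn plan_reservation(
    ranked: &[(usize, Vec<usize>)],
    budget: usize,
) -> Option<(Vec<usize>, Vec<usize>)> {
    let mut llcs = Vec::new();
    let mut cpus = Vec::new();
    for (llc_id, llc_cpus) in ranked {
        if cpus.len() >= budget {
            break;
        }
        llcs.push(*llc_id);
        // The last selected LLC contributes only the prefix still needed.
        let need = budget - cpus.len();
        cpus.extend(llc_cpus.iter().copied().take(need));
    }
    (cpus.len() == budget).then_some((llcs, cpus))
}
```

With LLC 2 ranked ahead of LLC 0 and a budget of 6, both LLCs are locked but only two of LLC 0's four CPUs enter the reserved list.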

Cgroup v2 cpuset sandbox

Budgeted runs write the reserved CPUs and their NUMA nodes into a child cgroup — cpuset.cpus, then cpuset.mems, then the pid into cgroup.procs, in that order because the kernel may kill a task migrated into a cgroup whose cpuset.mems is still empty. After each write the effective value is read back: narrowing by a parent cgroup (a systemd slice, a container limit) is a fatal error under an explicit --cpu-cap and a warning otherwise. Kernel builds inside the sandbox also get their make -j width set to the reserved CPU count — without that, make -j$(nproc) fans gcc children out to a width the cpuset then has to time-slice, silently defeating the budget in scheduling terms.

Observing locks

ktstr locks (or cargo ktstr locks) prints every ktstr lock currently held on the host — LLC, per-CPU, kernel-cache, and run-dir locks — with each holder’s PID and command line. It is read-only and takes no locks itself. Use it when an acquire fails with ResourceContention: the error names the busy LLCs, the snapshot shows every contending peer at once. The full output and flags are in ktstr (standalone).

KTSTR_BYPASS_LLC_LOCKS — escape hatch

Setting KTSTR_BYPASS_LLC_LOCKS=1 skips lock acquisition entirely: the VM boots or the build starts immediately, with no coordination against concurrent runs. Use it only when measurement noise is acceptable — an isolated workstation, or a CI queue that already serializes jobs at a higher layer. It is mutually exclusive with --cpu-cap / KTSTR_CPU_CAP at every entry point; the rejection message always contains "resource contract" so it is greppable.

Filesystem requirement

Every lockfile path must live on a local filesystem — tmpfs, ext4, xfs, btrfs, f2fs, and bcachefs are the accepted set. NFS, CIFS/SMB, CephFS, AFS, and FUSE mounts are rejected at open time: flock(2) coordination or /proc/locks holder enumeration is unreliable on these configurations, and ktstr refuses to run on a lock it cannot trust. The error names the offending filesystem and the fix: move the lockfile path (KTSTR_LOCK_DIR, the cache root, or the runs root) to a local filesystem. Unknown-but-local filesystems (zfs, erofs, …) pass through.

Recipes

Task-oriented walkthroughs. Each recipe is self-contained: pick the one that matches your problem and follow it top to bottom. For the model behind the commands, read Core Concepts; for flag-by-flag detail, the Running Tests chapters.

Note

Two binaries appear below. cargo ktstr <subcommand> is the host-side cargo wrapper for test workflows; bare ktstr is the guest-init binary that doubles as a host CLI for a few tools (ctprof, topo, locks). Both install with cargo install ktstr. See cargo ktstr and ktstr (standalone).

Which recipe do I want?

| Symptom | Recipe |
| --- | --- |
| I have a scheduler binary and no tests | Test a New Scheduler |
| A test failed and the scheduler died | Investigate a Crash |
| Default checks don’t fit my scheduler, or nothing is checked at all | Customize Checking |
| I want gates that catch performance regressions, and proof they fire | Benchmark Gates and Negative Tests |
| Is my scheduler at least as good as the kernel default? | Compare a Scheduler vs EEVDF |

Three recipes compare two runs. They answer different questions:

| Two runs differ because… | Recipe |
| --- | --- |
| …the scheduler source changed (branch vs baseline commit) | A/B Compare Branches |
| …a workload got slower even though tests still pass | Diagnose a Slow Scheduler with ctprof |
| …the host changed (machine, reboot, sysctl drift) | Capture and Compare Host State |

All recipes

In rough lifecycle order:

Test a New Scheduler

End-to-end workflow: define a scheduler, write tests, run them, sweep the BPF verifier. At the end you have a scheduler that boots under real kernels on declared topologies, a test suite that fails when its behavior regresses, and (optionally) all of it hosted in your own crate.

1. Define the scheduler

declare_scheduler! generates a pub static MY_SCHED: Scheduler and registers it so cargo ktstr verifier discovers it automatically. Tests reference the bare MY_SCHED ident via #[ktstr_test(scheduler = MY_SCHED)].

use ktstr::declare_scheduler;
use ktstr::prelude::*;

declare_scheduler!(MY_SCHED, {
    name = "my_sched",
    binary = "scx_my_sched",
    topology = (1, 2, 4, 1),
    kernels = ["6.14", "6.15..=7.0"],
    sched_args = ["--exit-dump-len", "1048576"],
});

topology = (1, 2, 4, 1) is 1 NUMA node, 2 LLCs, 4 cores per LLC, 1 thread per core — written 1n2l4c1t in test names and output. See Topology for the notation and Scheduler Definitions for every supported field.

2. Write integration tests

Tests inherit the scheduler’s topology. Override with explicit llcs, cores, or threads when needed.

use ktstr::prelude::*;

#[ktstr_test(scheduler = MY_SCHED)]
fn basic_steady(ctx: &Ctx) -> Result<AssertResult> {
    // Inherits 1n2l4c1t from MY_SCHED
    scenarios::steady(ctx)
}

#[ktstr_test(scheduler = MY_SCHED, threads = 2)]
fn smt_steady(ctx: &Ctx) -> Result<AssertResult> {
    // Inherits llcs=2, cores=4; overrides threads to exercise SMT
    scenarios::steady(ctx)
}
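
The inheritance rule and the compact label can be sketched in a few lines of plain Rust. This is an illustration only: `Topo`, `label`, and `with_overrides` are hypothetical names invented here, not ktstr types; the sketch just mirrors the `(numa, llcs, cores, threads)` tuple order and the `1n2l4c1t` notation described above.

```rust
/// Hypothetical stand-in for the (numa, llcs, cores, threads) tuple.
/// Not ktstr's real type; it exists only to illustrate the notation.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Topo {
    pub numa: u32,
    pub llcs: u32,
    pub cores: u32,   // cores per LLC
    pub threads: u32, // threads per core
}

impl Topo {
    /// Render the compact label used in test names, e.g. "1n2l4c1t".
    pub fn label(&self) -> String {
        format!("{}n{}l{}c{}t", self.numa, self.llcs, self.cores, self.threads)
    }

    /// Apply per-test overrides; `None` keeps the inherited value.
    pub fn with_overrides(
        self,
        llcs: Option<u32>,
        cores: Option<u32>,
        threads: Option<u32>,
    ) -> Topo {
        Topo {
            numa: self.numa,
            llcs: llcs.unwrap_or(self.llcs),
            cores: cores.unwrap_or(self.cores),
            threads: threads.unwrap_or(self.threads),
        }
    }
}
```

Under this model, `basic_steady` keeps the scheduler's `1n2l4c1t`, while `smt_steady` (which sets only `threads = 2`) would be labeled `1n2l4c2t`.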

While iterating on a single test, mark the others with #[ktstr_test(scheduler = MY_SCHED, ignore = true)]: nextest skips them by default, but they stay registered so the verifier sweep still sees them. Clear the attribute when the test is ready to land — leaving it on permanently silently drops coverage.

3. Build a kernel

Build a kernel with sched_ext support:

cargo ktstr kernel build

See Getting Started for version selection and local source builds.

4. Run

cargo ktstr test resolves the kernel from KTSTR_KERNEL, the cache, or an explicit --kernel <spec> — a version like 7.0, a cache key from cargo ktstr kernel list, or a path to a kernel source tree. (It does not accept a prebuilt bzImage/Image; only cargo ktstr shell does.)

cargo ktstr test                    # auto-discover from cache / KTSTR_KERNEL
cargo ktstr test --kernel 7.0       # pin to a version (latest 7.0.x)
cargo ktstr test --kernel ../linux  # pin to a local source checkout

A run looks like this — each PASS line is a fresh VM that booted, ran the scenario, and shut down:

cargo ktstr: fetching latest 7.0.x kernel version
cargo ktstr: latest 7.0.x kernel: 7.0.14
cargo ktstr: resolved kernel "7.0"
...
 Nextest run ID 24c18577-cd34-43bd-9d14-b0197701c187 with nextest profile: default
    Starting 1 test across 121 binaries (12531 tests skipped)
        PASS [  34.451s] (1/1) ktstr::failure_dump_e2e ktstr/failure_dump_renders_bss_fields
────────────
     Summary [  34.490s] 1 test run: 1 passed, 12531 skipped
...

The run footer names the output directory (target/ktstr/{kernel}-{project_commit}) where per-test stats sidecars land — see Runs and Regression Gates.

5. Sweep the BPF verifier

The verifier sweep loads your scheduler’s BPF programs under the real kernel verifier, on every accepted topology preset, and reports per-program verified-instruction counts. Instruction counts vary with topology when nr_cpus bakes into .rodata — a scheduler can attach on one topology and wedge on another, which is exactly what the sweep catches.

cargo ktstr verifier                        # kernel from KTSTR_KERNEL / cache
cargo ktstr verifier --kernel ../linux      # pin one kernel
cargo ktstr verifier --kernel 6.14 --kernel 7.0   # sweep several

Each scheduler’s kernels = [...] declaration filters the operator-supplied set; an empty or omitted kernels field runs against every kernel in the sweep.

cargo ktstr verifier --kernel 7.0 --scheduler ktstr_sched
cargo ktstr: resolved kernel "7.0"
...
    Starting 4 tests across 1 binary (55 tests skipped)
        PASS [  12.406s] (1/4) ktstr::kaslr_axis_e2e verifier/ktstr_sched/kernel_7_0/odd-3llc
        PASS [  12.432s] (2/4) ktstr::kaslr_axis_e2e verifier/ktstr_sched/kernel_7_0/smt-2llc
        PASS [  12.656s] (3/4) ktstr::kaslr_axis_e2e verifier/ktstr_sched/kernel_7_0/tiny-1llc
        PASS [  12.929s] (4/4) ktstr::kaslr_axis_e2e verifier/ktstr_sched/kernel_7_0/tiny-2llc
────────────
     Summary [  12.929s] 4 tests run: 4 passed, 55 skipped

verifier verified_insns (per scheduler; rows: kernel, cols: BPF program, cell: range across topologies):

ktstr_sched:
 kernel      ktstr_dispatch  ktstr_dump  ktstr_dump_cpu  ktstr_dump_task  ktstr_enqueue  ktstr_exit  ktstr_exit_task  ktstr_init  ktstr_init_task  ktstr_select_cp  ktstr_yield
 kernel_7_0  102             81          13              70               74             25          419              2296        29077            39               8

verifier summary: 4 ✅  0 ❌  0 🇽
 topology   ktstr_sched
 odd-3llc   ✅
 smt-2llc   ✅
 tiny-1llc  ✅
 tiny-2llc  ✅

One glance shows where the complexity lives (ktstr_init_task at ~29k verified instructions dwarfs every other program) and that all four topologies attach cleanly. See BPF Verifier Sweep for the output format, cycle collapse on rejections, and the kernel-matching contract. Kernel cache hygiene (kernel list / kernel clean) lives in the cargo ktstr reference.

6. Debug failures

Boot an interactive shell with the scheduler binary packed into the guest. -i (--include-files) adds host-side files to the guest’s /include-files/ directory:

cargo ktstr shell -i ./target/debug/scx_my_sched

Inside the guest, run /include-files/scx_my_sched manually to inspect behavior. Use --exec CMD to run a single command non-interactively instead. See ktstr (standalone) and the cargo ktstr reference for all flags, and Investigate a Crash for the crash-report workflow.

7. Write a crash test

Schedulers ship their own failure-handling paths; a negative test pins them. The pattern: define a BpfMapWrite constant naming a .bss global in your scheduler, have ktstr write a trigger value into it after the scheduler loads, and have the scheduler’s error path read the global and call scx_bpf_error(...) with a known message. The test passes only when that exact error fires — a wrong or missing error fails, so silent regressions in the error path become visible.

use ktstr::prelude::*;

// ".bss" names the libbpf-named .bss map; "crash" is the global the
// host writes; 1 is the trigger value. The field offset is resolved
// from the map's program BTF at write time.
static BPF_CRASH: BpfMapWrite = BpfMapWrite::new(".bss", "crash", 1);

#[ktstr_test(
    scheduler = MY_SCHED,
    bpf_map_write = BPF_CRASH,
    expect_err = true,
    expect_scx_bpf_error_contains = "my_sched: host-triggered crash",
)]
fn crash_path_emits_expected_error(ctx: &Ctx) -> Result<AssertResult> {
    ktstr::scenario::basic::custom_sched_mixed(ctx)
}

The substring contract is yours to define — the framework only enforces that what you declare matches what the scheduler emits. Use expect_scx_bpf_error_matches = r"…" for regex matching; the full matcher semantics live in the #[ktstr_test] reference.

8. Host a ktstr test in an external scheduler crate

A scheduler that lives in its own crate (outside the ktstr workspace) can host ktstr tests directly. Gate ktstr behind a feature so it never enters a normal build — ktstr pulls in Linux-only, heavyweight dependencies (KVM, libbpf, a kernel loader, guest memory) that would break non-Linux development and bloat ordinary CI.

In the scheduler crate’s Cargo.toml:

[dependencies]
# Pin the exact installed cargo-ktstr version — ktstr is pre-1.0 and
# a minor bump can break the test-facing API (see the README's
# version-compatibility note). Must be an optional [dependencies]
# entry, not a dev-dependency: `dep:ktstr` in [features] only
# resolves an optional normal dep (Cargo has no optional
# dev-dependencies).
ktstr = { version = "=X.Y.Z", optional = true }

[features]
ktstr-tests = ["dep:ktstr"]

[dev-dependencies]
# Only if a test body uses raw libc (e.g. fork / _exit); the
# prelude does not re-export libc.
libc = "0.2"

The test file gates its whole contents on the feature, so the crate compiles to nothing extra when the feature is off:

#![cfg(feature = "ktstr-tests")]

use ktstr::prelude::*;

// `MY_SCHED` is your `declare_scheduler!(MY_SCHED, { ... })` constant
// (section 1). Drop the `scheduler =` attribute to run under the
// kernel's default scheduler instead.
#[ktstr_test(scheduler = MY_SCHED)]
fn my_sched_runs(ctx: &Ctx) -> Result<AssertResult> {
    ktstr::scenario::basic::custom_sched_mixed(ctx)
}

Build a kernel once (section 3), then run the gated tests. The feature flag rides the nextest passthrough after --:

cargo ktstr kernel build --kernel /path/to/linux
cargo ktstr test --kernel /path/to/linux -- --features ktstr-tests

cargo ktstr test forwards everything after -- to cargo nextest run, which routes --features ktstr-tests to the test compile.

Investigate a Crash

When the scheduler under test dies mid-run, the test fails with a structured crash report: the scx exit reason, a kernel-side backtrace, per-CPU scheduler state, and — by default — the results of a second VM that ktstr boots automatically to replay the crash with probes attached. This recipe walks a real crash from report to regression test.

First step: rerun with full diagnostics

RUST_BACKTRACE=1 cargo ktstr test --kernel 7.0 -- -E 'test(my_test)'

RUST_BACKTRACE=1 (or full) does two things: it appends the --- diagnostics --- section (init stage, VM exit code, kernel console tail) to every failure — not only scheduler deaths — and it boots the guest with a verbose console (the same switch KTSTR_VERBOSE=1 flips).

The crash report

A failure prints as a header line plus sections, each present only when relevant: --- stats --- (per-cgroup worker results), --- diagnostics ---, --- timeline --- (phases with monitor samples), --- scheduler log --- (scheduler stdout+stderr, including the kernel’s DEBUG DUMP when the scheduler died), --- monitor --- (host-side observations and verdict), --- sched_ext dump --- (the same dump as traced by the guest kernel), and --- auto-repro ---. See Reading Failure Output for the full anatomy; this recipe focuses on the crash workflow.

A real crash, top to bottom

The crash below is real: ktstr’s fixture scheduler was told to call scx_bpf_error() via a host-written .bss trigger (the same mechanism step 7 of Test a New Scheduler uses). Trimmed to the load-bearing sections:

cargo ktstr test --kernel 7.0 -- -E 'test(=ktstr/bpf_crash_auto_repro_e2e)'
BUG SUMMARY: scx_bpf_error (src/bpf/main.bpf.c:424: ktstr: host-triggered crash)
ktstr_test 'bpf_crash_auto_repro_e2e' [sched=scx-ktstr] [topo=1n1l4c1t] failed:
  scheduler process died unexpectedly during workload (2.2s into test)
...
--- scheduler log ---
...
DEBUG DUMP
================================================================================

swapper/3[0] triggered exit kind 1025:
  scx_bpf_error (src/bpf/main.bpf.c:424: ktstr: host-triggered crash)

Backtrace:
  scx_exit+0x50/0x70
  scx_bpf_error_bstr+0x78/0x90
  bpf_prog_1fed99378f3a8055_ktstr_dispatch+0x4d/0x1cb
  bpf__sched_ext_ops_dispatch+0x4b/0xa7
  do_pick_task_scx+0x379/0x770
  __schedule+0x5ca/0xfc0
  schedule+0x44/0x1b0
  worker_thread+0xa2/0x2d0
  kthread+0xf3/0x130
  ret_from_fork+0x19b/0x260
  ret_from_fork_asm+0x1a/0x30

ktstr scheduler state:
  stall=0 crash=1 degrade_rt=0
  rodata: degrade=0 slow=0 scattershot=0 verify_loop=0 fail_verify=0
  ktstr_alloc_count=87 degrade_cnt=0 slow_cnt=0
...
Error: EXIT: scx_bpf_error (src/bpf/main.bpf.c:424: ktstr: host-triggered crash)

Three parts of the report identify the cause:

  • The BUG SUMMARY / exit line carries the message and C source line your scheduler passed to scx_bpf_error().
  • The backtrace names the BPF program that raised it (…ktstr_dispatch) and shows it fired from the kernel’s pick path (do_pick_task_scx inside __schedule).
  • The ktstr scheduler state block is the scheduler’s own dump callback output — whatever your scheduler prints in its .dump() op appears here. In this case crash=1 confirms the host-written trigger was read.

Auto-repro

auto_repro defaults to true in #[ktstr_test]. When the scheduler crashes, ktstr automatically:

  1. Captures the crash stack trace from the scenario output.
  2. Boots a second VM with kprobes (kernel functions) and fentry probes (BPF callbacks) on each function in the crash chain, plus a tp_btf/sched_ext_exit tracepoint trigger.
  3. Reruns the scenario to capture function arguments at each crash point.

The --- auto-repro --- section starts with the probe pipeline’s own accounting, so you can always see how much of the chain attached. From the run above:

--- auto-repro ---
--- probe pipeline ---
  extracted:   10 functions from crash backtrace
  traceable:   7 passed, 3 dropped: bpf_prog_1fed99378f3a8055_ktstr_dispatch, bpf__sched_ext_ops_dispatch, ret_from_fork_asm
  bpf_discover: 0 programs found
  after_expand: 7 total probe targets
  kprobes:     0 attached
  trigger:     attach failed (skeleton load (retry): No such process (os error 3); original error before retry: No such process (os error 3))
  probe_data:  0 keys, 0 unmatched IPs
  events:      0 captured, 0 after stitch

repro VM duration: 16.9s

Take the trigger: line at face value: the probe trigger needs the tp_btf/sched_ext_exit tracepoint, which not every kernel carries. On this 7.0.14 guest the attach failed, so no per-function argument events were captured. The section still delivers value: after the pipeline accounting it appends the repro VM’s diagnostics (in this run, its sched_ext dump and dmesg tail; a scheduler log and the freeze-coordinator failure dump appear when the repro run produces them), and here those reproduced the same crash. On kernels with the tracepoint, the pipeline instead lists each probed function with the argument values captured on the way to the crash.

Cost: one extra VM boot plus a scenario replay (16.9 s here). Auto-repro is skipped when expect_err = true, because an expected failure is not worth a repro VM, and can be turned off with auto_repro = false. See Auto-Repro for how the two-VM cycle works and its kernel requirements.

Pin the bug as a regression test

Once the crash is understood, pin its signature so that any change to it (a drifted message, or the crash no longer firing) fails the next CI run instead of going unnoticed. Two #[ktstr_test] attributes attach matchers to the captured scx_bpf_error text (the combined scheduler log and --- sched_ext dump --- corpus):

  • expect_scx_bpf_error_contains = "literal": substring match. Use for the common case of pinning an exact error fragment without escaping regex metacharacters.
  • expect_scx_bpf_error_matches = "regex": regular-expression match using the regex crate’s full syntax. Use for anchored patterns, character classes, and wildcards.

Both require expect_err = true, and both compose with AND semantics — set both to require both to match. The test body must actually set up the trigger; here the host-side BpfMapWrite from the crash above does it:

use ktstr::prelude::*;

// Host writes 1 into the scheduler's `crash` .bss global after
// load; the scheduler's dispatch path reads it and calls
// scx_bpf_error(...) — the crash this test pins.
static BPF_CRASH: BpfMapWrite = BpfMapWrite::new(".bss", "crash", 1);

#[ktstr_test(
    scheduler = MY_SCHED,
    bpf_map_write = BPF_CRASH,
    expect_err = true,
    expect_scx_bpf_error_contains = "host-triggered crash",
    expect_scx_bpf_error_matches = r"src/bpf/main\.bpf\.c:\d+",
)]
fn crash_regression(ctx: &Ctx) -> Result<AssertResult> {
    ktstr::scenario::basic::custom_crash_light(ctx)
}

The test fails if either matcher misses. A passing run means the scheduler still hits the pinned bug; a failure means the error text drifted (update the matcher) or the bug was fixed (delete the regression test).
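
The pass/fail rule for the example above can be modeled in plain Rust. This is a sketch of the AND semantics only, not ktstr's matcher code: `crash_test_passes` and `has_source_line` are hypothetical names, and the hand-written `has_source_line` check stands in for the regex `src/bpf/main\.bpf\.c:\d+` so the example needs no external crate.

```rust
/// Stand-in for the regex r"src/bpf/main\.bpf\.c:\d+": true when the
/// corpus contains the file name followed by ':' and at least one digit.
fn has_source_line(corpus: &str) -> bool {
    let needle = "src/bpf/main.bpf.c:";
    corpus.match_indices(needle).any(|(i, _)| {
        corpus[i + needle.len()..]
            .chars()
            .next()
            .map_or(false, |c| c.is_ascii_digit())
    })
}

/// Hypothetical model of the verdict: the scheduler must have died
/// (expect_err = true) AND both matchers must hit the captured corpus.
pub fn crash_test_passes(scheduler_died: bool, corpus: &str) -> bool {
    scheduler_died
        && corpus.contains("host-triggered crash") // substring matcher
        && has_source_line(corpus) // regex matcher stand-in
}
```

With the exit line from the crash report as the corpus, the model passes; change the message text or let the scheduler survive, and it fails.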

Regex anchors use string-boundary semantics against the whole captured corpus: ^/$ match its start and end, and . does not cross \n. Opt in to line-level anchoring with (?m) and to dotall with (?s). Pattern-validity edge cases (empty patterns, trivially-matching patterns, builder vs attribute validation) are covered in the #[ktstr_test] reference.

A/B Compare Branches

One command answers “did my branch make the scheduler slower?”: cargo ktstr perf-delta runs the same performance_mode scenarios against your branch and its baseline commit, diffs every metric, and exits non-zero when enough metrics regress to trip the failure gate. (For host-context diffs or per-thread profiling instead, see the compare picker.)

Automated: perf-delta --noise-adjust

cd ~/src/my-sched                  # the scheduler crate under test

cargo ktstr perf-delta --noise-adjust 5 --kernel ../linux                    # HEAD vs merge-base(HEAD, main)
cargo ktstr perf-delta --noise-adjust 5 --kernel ../linux --base-ref release # vs merge-base(HEAD, release)
cargo ktstr perf-delta --noise-adjust 5 --kernel ../linux -E cgroup_steady   # narrow the perf set

perf-delta resolves the baseline as merge-base(HEAD, <ref>) (or a $GITHUB_BASE_REF PR target), then --noise-adjust N checks both commits out into their own plain checkouts, runs each side’s performance_mode tests N times, and compares from the observed spread — no manual worktree bookkeeping.

A single run per side cannot tell a real regression from run-to-run noise, so --noise-adjust gates a confident regression on two conditions: the sides must be separated (a Welch two-sample t-test, or fully disjoint [min, max] bands) and the delta must be material (each metric’s registry significance gate). N must be at least 2 — variance needs two samples — and 5 or more is recommended for a well-powered test. Budget wall time accordingly: the command produces 2×N full runs of your performance_mode set, so at N=5 a one-minute suite costs about ten minutes.
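
For intuition, the separation check and the wall-time budget can be sketched in standalone Rust. This is a textbook Welch computation under stated assumptions, not perf-delta's code: the function names are invented here, the sketch stops at the t statistic and the Welch-Satterthwaite degrees of freedom (it does not turn them into a p-value), and it does not model the per-metric significance gate.

```rust
/// Sample mean and unbiased sample variance (n - 1 denominator).
fn mean_var(xs: &[f64]) -> (f64, f64) {
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    (mean, var)
}

/// Welch's t statistic and Welch-Satterthwaite degrees of freedom.
/// Returns None when a side has fewer than two samples (variance is
/// undefined) or when both sides have zero variance.
pub fn welch_t(a: &[f64], b: &[f64]) -> Option<(f64, f64)> {
    if a.len() < 2 || b.len() < 2 {
        return None;
    }
    let (ma, va) = mean_var(a);
    let (mb, vb) = mean_var(b);
    let (sa, sb) = (va / a.len() as f64, vb / b.len() as f64);
    if sa + sb == 0.0 {
        return None;
    }
    let t = (mb - ma) / (sa + sb).sqrt();
    let df = (sa + sb).powi(2)
        / (sa.powi(2) / (a.len() as f64 - 1.0) + sb.powi(2) / (b.len() as f64 - 1.0));
    Some((t, df))
}

/// True when the two [min, max] bands do not overlap at all.
pub fn bands_disjoint(a: &[f64], b: &[f64]) -> bool {
    let lo = |xs: &[f64]| xs.iter().cloned().fold(f64::INFINITY, f64::min);
    let hi = |xs: &[f64]| xs.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    hi(a) < lo(b) || hi(b) < lo(a)
}

/// Wall-time budget: 2 x N full runs of a suite taking `suite_minutes`.
pub fn wall_minutes(n: u32, suite_minutes: f64) -> f64 {
    2.0 * n as f64 * suite_minutes
}
```

The `None` for fewer than two samples is the same reason N must be at least 2: with one sample per side there is no variance to compare against.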

The command exits non-zero once enough metrics regress to trip the failure gate — 5 or more by default, so a lone noisy regression does not flip CI red. --fail-threshold tunes the count; --must-fail M1,M2 fails on the named metrics regardless of count. This drops straight into a CI perf-gate on a pull request — see CI for the workflow.
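
The gate described above reduces to a small predicate. The sketch below is a model of that description, not perf-delta's source: `gate_trips` is a hypothetical name, and the metric names a caller passes in are whatever the comparison reported as regressed.

```rust
/// Hypothetical model of the failure gate: trip when the number of
/// regressed metrics reaches the threshold, or when any metric named
/// in `must_fail` regressed, regardless of the count.
pub fn gate_trips(regressed: &[&str], fail_threshold: usize, must_fail: &[&str]) -> bool {
    let named_hit = regressed.iter().any(|m| must_fail.contains(m));
    regressed.len() >= fail_threshold || named_hit
}
```

With the default threshold of 5, one regressed metric leaves the gate closed unless that metric is also listed in `--must-fail`.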

Manual: compare already-pooled runs

Every cargo ktstr test run writes one stats sidecar per test into target/ktstr/{kernel}-{project_commit}/; the accumulated sidecars are the pool that perf-delta compares. When you want control over the worktrees or test selection — or you already have both runs’ sidecars from CI artifacts — run the two branches yourself and point perf-delta --base at the baseline commit. It compares the cached pool without producing new runs (so it needs no --kernel).

cd ~/src/my-sched

# Baseline: check out and run the baseline branch's suite.
git worktree add ~/src/my-sched-main upstream/main
cd ~/src/my-sched-main
cargo ktstr test --kernel ../linux -- -E 'test(/performance_mode/)'

# Experimental: run HEAD's suite.
cd ~/src/my-sched
cargo ktstr test --kernel ../linux -- -E 'test(/performance_mode/)'

# Compare the pooled sidecars: HEAD vs the baseline commit.
cargo ktstr perf-delta --base <baseline-short-hex>

The {project_commit} half of the sidecar directory is the project tree’s HEAD short hex captured at first sidecar write (suffixed -dirty when the worktree differs from HEAD), so two branches with distinct HEADs land in distinct directories and coexist under one runs root. perf-delta --base <hex> partitions that pool by project_commit: the baseline commit’s sidecars are side A, HEAD’s are side B.

Warning

The two runs must be at distinct commits. If both checkouts share the same HEAD they land in the same directory and the second run’s pre-clear overwrites the first, so the comparison degenerates to an identical pool. Before the second run, confirm the commits differ by comparing git -C ~/src/my-sched rev-parse HEAD with git -C ~/src/my-sched-main rev-parse HEAD.

The project commit is discovered by walking up from the test process’s current working directory to the enclosing .git, so the cd steps are load-bearing: without them the probe records the wrong commit. Use cargo ktstr stats list-values to see the project_commit values a pool actually carries before choosing --base.

Comparing configurations (not commits)

perf-delta compares on the commit axis (HEAD vs a baseline). A cross-config question — scheduler A vs scheduler B, or two tunings, at the same commit — is answered in-test: run both configurations as phases of one scenario and assert the relationship directly (e.g. VmResult::better_across_phases), so the verdict travels with the test rather than a separate compare invocation. Compare a Scheduler vs EEVDF is the worked example of that pattern.

Cleanup

git worktree remove ~/src/my-sched-main

Capture and Compare Host State

When a gauntlet run passes on one machine and fails on another — or passes on Monday and fails on Wednesday — the first thing to check is whether the host itself changed. cargo ktstr show-host captures a snapshot of the kernel, CPU, memory, scheduler tunables, and kernel cmdline; cargo ktstr perf-delta surfaces the changes between two runs in a host-delta section so you can see what moved. (For per-thread profiling see ctprof; for scheduler-behavior diffs between commits see A/B Compare Branches.)

Live vs archived

Two subcommands print host context; pick the one whose target matches your question:

  • cargo ktstr show-host reads the live host (/proc, /sys, uname()) at invocation time. Use it to inspect the current machine — before a benchmark, after a sysctl change, or to confirm what the next run here would record.
  • cargo ktstr stats show-host --run RUN_ID prints the archived host context captured at sidecar-write time for a past run (run keys from cargo ktstr stats list). Use it when investigating a regression in a past run — what looked like a code change might trace back to a host change.

Both render through the same formatter, so the two outputs are byte-for-byte comparable when the host is unchanged.

Capture: show-host

cargo ktstr show-host

Prints a key: value report, one field per line, in a fixed order (the order and field set are pinned by unit tests):

  • kernel_name, kernel_release, arch: the uname() triple
  • cpu_model, cpu_vendor: first /proc/cpuinfo entry
  • total_memory_kib, hugepages_total, hugepages_free, hugepages_size_kib: from /proc/meminfo
  • online_cpus, numa_nodes: numa_nodes is the node count from the CPU→node mapping (memory-only nodes are not counted)
  • thp_enabled, thp_defrag: transparent hugepage policy with the bracketed selection preserved verbatim
  • kernel_cmdline: /proc/cmdline verbatim
  • task_delayacct: delay-accounting state (on, runtime-off, or config-off); gates which taskstats delay fields populate
  • config_task_xacct: CONFIG_TASK_XACCT build state; gates the taskstats memory-watermark fields
  • cpufreq_governor: one line per CPU
  • sched_tunables: every /proc/sys/kernel/sched_* sysctl, one entry per line
  • heap_state: the process’s own jemalloc allocator state (rarely relevant to host comparison)

Missing-value rendering is consistent everywhere: a field that failed to populate prints (unknown); a map that was captured but empty prints (empty); in per-key diffs, a key present on only one side prints (absent). The distinction matters — (empty) means the dimension was inspected and had nothing, (unknown) means the capture itself failed.

The output is human-oriented. The same data, same schema, is attached to every gauntlet-run sidecar under its host field:

jq '.host' path/to/sidecar.ktstr.json

Compare: perf-delta’s host-delta section

cargo ktstr perf-delta --noise-adjust 5 --kernel 6.14

perf-delta picks the first sidecar with a populated host field from each side and prints one of: nothing (neither side carried host context), host: captured in 'A' only, delta unavailable (a one-sided capture failure — your capture broke, not the host), host: identical between 'A' and 'B' (arch: x86_64), or a diff. The diff suppresses fields that match — it is a diff, not a snapshot — and renders one row per changed field, in this shape:

host delta ('<baseline>' → '<candidate>'):
  <field>: <baseline value> → <candidate value>
  ...

An unchanged host is the precondition for a clean A/B of scheduler behavior. A CI perf-gate that runs perf-delta on a pull request surfaces this section automatically whenever any host field differs — treat its appearance as a signal the comparison may not hold, and fail or annotate the PR.
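
The "diff, not a snapshot" behavior can be sketched over two flat key/value maps. This is an illustration under assumptions, not perf-delta's formatter: `host_delta` is a hypothetical function, the real host context is a richer structure than a flat map of strings, and the `(absent)` placeholder is applied here to any key present on only one side.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical model of the host-delta rows: one row per field whose
/// value differs, fields that match are suppressed, and a key present
/// on only one side renders as "(absent)" on the other.
pub fn host_delta(a: &BTreeMap<String, String>, b: &BTreeMap<String, String>) -> Vec<String> {
    // Union of keys from both sides, in sorted order.
    let keys: BTreeSet<&String> = a.keys().chain(b.keys()).collect();
    keys.into_iter()
        .filter(|key| a.get(*key) != b.get(*key))
        .map(|key| {
            let left = a.get(key).map_or("(absent)", String::as_str);
            let right = b.get(key).map_or("(absent)", String::as_str);
            format!("  {key}: {left} → {right}")
        })
        .collect()
}
```

An empty result corresponds to the "identical" case: nothing moved, so the A/B comparison of scheduler behavior stands on an unchanged host.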

Typical hits

Each row names the show-host field carrying the signal, so you can cargo ktstr show-host | grep <field> or jq '.host.<field>' sidecar.ktstr.json directly.

| Field | Symptom | Fix / interpretation |
|---|---|---|
| thp_enabled / thp_defrag | Latency-sensitive regressions that come and go between runs | Compare the bracket position (the active setting), not the whole string; pin via transparent_hugepage= on the kernel cmdline |
| sched_tunables.* | Idle-steal pressure shifted on scx_* schedulers that read these sysctls | Restore the changed sysctl; the captured set is whatever /proc/sys/kernel/sched_* lists at capture time |
| kernel_cmdline | Whole scheduling surface changed (isolcpus=, nohz_full=, mitigations=, numa_balancing= are all boot-time) | Reboot the host to match; it is the only remediation that makes the comparison hold |
| kernel_release (with kernel_name, arch) | Everything is suspect | Cross-kernel comparison; rebuild the baseline on the same kernel |
| hugepages_total / hugepages_free / hugepages_size_kib | performance_mode throughput flips when the 2 MiB pool shrinks | Restore the hugepage reservation |
| numa_nodes | Cross-node migration and locality signals mean different things across the runs | Hardware/firmware or topology reconfiguration; note memory-only nodes are not counted |
| cpu_model / cpu_vendor | Cache-sensitive benchmarks moved | Different machine; inspect alongside kernel_cmdline, which will usually differ too |
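
For the thp_enabled / thp_defrag row, "compare the bracket position" means extracting the bracketed token before comparing. A minimal sketch, assuming the usual sysfs rendering such as `always [madvise] never`; `active_setting` is a hypothetical helper, not part of ktstr:

```rust
/// Return the bracketed (active) token from a sysfs-style policy
/// string, or None when no complete [..] pair is present.
pub fn active_setting(raw: &str) -> Option<&str> {
    let start = raw.find('[')? + 1;
    let end = start + raw[start..].find(']')?;
    Some(&raw[start..end])
}
```

Two hosts then agree on THP policy when `active_setting` returns the same token for both captured strings.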

Two disambiguations worth knowing:

  • The field is named kernel_cmdline (not cmdline) in both the printed output and the sidecar JSON, to distinguish it from SidecarResult.kargs — the extra kargs the ktstr VMM appended when booting the guest, not the running host’s boot line.
  • The CFS/EEVDF tuning knobs (base_slice_ns, migration_cost_ns, latency_ns, …) live in debugfs (/sys/kernel/debug/sched/), where they moved in Linux 5.13, not in /proc/sys/kernel. show-host reads only /proc/sys/kernel, so sched_tunables does not capture them on any kernel from 5.13 onward.

The three-line investigation

The shape this recipe usually takes: a suite regresses with no scheduler change in the diff. Compare cargo ktstr stats show-host --run <old-run> against live cargo ktstr show-host; one delta appears — say, the thp_enabled bracket moved after a distro update. Pin the setting on the kernel cmdline, rerun, baseline restored: it was never a scheduler bug.

Diagnose a Slow Scheduler with ctprof

When a scheduler change makes the workload slower but the test suite still passes, the regression is usually buried in per-thread off-CPU time. ktstr ctprof capture snapshots every live thread’s scheduling, memory, I/O, and taskstats delay counters; ktstr ctprof compare diffs two snapshots and surfaces the buckets where time went. This recipe walks through a typical A/B comparison.

See the ctprof reference for the full metric registry, aggregation rules, derived-metric formulas, and taskstats kconfig gating.

Capture before and after

Before capturing on a fresh boot, run the taskstats pre-flight below — it saves a capture round-trip.

# Baseline: scheduler A loaded, workload running.
ktstr ctprof capture --output baseline.ctprof.zst

# Switch schedulers, restart workload, wait for steady state.
# ...

# Candidate: scheduler B, same workload.
ktstr ctprof capture --output candidate.ctprof.zst

capture walks /proc once and writes the snapshot. The scheduling, I/O, and taskstats delay data are read from procfs and netlink (genetlink) with no kernel tracing. The jemalloc memory counters require a ptrace(PTRACE_SEIZE) attach that briefly stops each probed thread — an observer effect bounded per thread; every counter is cumulative-from-birth, so the recorded values are unbiased by attach timing. The default capture covers every live tgid; on a busy host this is hundreds of threads. The snapshot is zstd-compressed JSON, typically a few MB, and capturing is fast: in the session below (≈800 processes, ≈1,200 threads, no jemalloc-probed targets), each snapshot completed in under a second.

Reading the compare table

Each compared metric renders as a baseline → candidate arrow cell plus delta, %, and %uptime (what fraction of the snapshot interval the group was alive) — the default arrow layout; --display-format full splits baseline and candidate into separate columns. Rows group by process family and sort so the biggest movers land on top. This excerpt brackets a compile job, and the table points straight at it — the mm_percpu_wq kworkers did ~10× more waiting:

ktstr ctprof compare baseline.ctprof.zst candidate.ctprof.zst --limit 24
## Primary metrics
 comm                              threads  metric             value                delta      %         %uptime 
 kworker/{N}:{N}-mm_percpu_wq
     kworker/{N}:{N}-mm_percpu_wq  11→37    voluntary_csw      8.697K → 101.154K    +92.457K   +1063.1%  93%
     kworker/{N}:{N}-mm_percpu_wq  11→37    timeslices         8.699K → 101.166K    +92.467K   +1063.0%  93%
     kworker/{N}:{N}-mm_percpu_wq  11→37    wait_time_ns       2.684s → 27.653s     +24.969s   +930.2%   93%
     kworker/{N}:{N}-mm_percpu_wq  11→37    stime_clock_ticks  22ticks → 217ticks   +195ticks  +886.4%   93%
     kworker/{N}:{N}-mm_percpu_wq  11→37    run_time_ns        243.378ms → 2.320s   +2.077s    +853.4%   93%
...
 kworker/{N}:{N}-events
     kworker/{N}:{N}-events        87→60    nonvoluntary_csw   22 → 11              -11        -50.0%    95%
     kworker/{N}:{N}-events        87→60    timeslices         222.140K → 127.813K  -94.327K   -42.5%    95%
     kworker/{N}:{N}-events        87→60    voluntary_csw      222.118K → 127.802K  -94.316K   -42.5%    95%
     kworker/{N}:{N}-events        87→60    wait_time_ns       64.861s → 39.243s    -25.618s   -39.5%    95%
...

Large positive deltas on a process that should not have moved are the suspects — here wait_time_ns grew 930% (runqueue wait, not work) while run_time_ns grew less, the signature of queueing pressure. The taskstats-delay lens below renders rows in exactly the same shape.
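
The delta and % columns follow from the two cell values by ordinary arithmetic. The sketch below assumes that relationship (candidate minus baseline, expressed relative to baseline); `delta_and_pct` is a hypothetical helper, and returning no percentage for a zero baseline is a choice made here, not documented ctprof behavior.

```rust
/// Absolute delta and percentage change relative to the baseline.
/// The percentage is None when the baseline is zero, since the ratio
/// is undefined there (an assumption of this sketch).
pub fn delta_and_pct(baseline: f64, candidate: f64) -> (f64, Option<f64>) {
    let delta = candidate - baseline;
    let pct = if baseline == 0.0 {
        None
    } else {
        Some(delta / baseline * 100.0)
    };
    (delta, pct)
}
```

Feeding in the voluntary_csw row (8,697 → 101,154) gives a delta of 92,457 and roughly +1063.1%, matching the table. Rows printed with rounded units, such as the wait_time_ns seconds, can differ in the last digit when recomputed from the displayed values.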

Compare with the taskstats lens

The taskstats-delay section bundles the eight kernel delay-accounting buckets (CPU, blkio, swapin, freepages, thrashing, compact, wpcopy, irq) plus their nine derived metrics (avg_*_delay_ns per bucket, and the total_offcpu_delay_ns rollup). Filter the output down to just the off-CPU view:

ktstr ctprof compare baseline.ctprof.zst candidate.ctprof.zst \
    --sections taskstats-delay \
    --sort-by total_offcpu_delay_ns:desc

The sort puts the processes with the largest absolute off-CPU growth at the top. The total_offcpu_delay_ns derivation is:

cpu + blkio + freepages + compact + wpcopy + irq + max(swapin, thrashing)

max(swapin, thrashing) rather than swapin + thrashing because every thrashing event is also a swapin event from the syscall perspective; summing both would double-count.
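
Written out as code, the rollup is a sum with one `max`. The struct below is a hypothetical container for the eight per-bucket delay totals in nanoseconds; only the formula itself comes from the text above.

```rust
/// Hypothetical container for the eight delay-accounting totals (ns).
#[derive(Clone, Copy, Debug, Default)]
pub struct DelayTotals {
    pub cpu: u64,
    pub blkio: u64,
    pub swapin: u64,
    pub freepages: u64,
    pub thrashing: u64,
    pub compact: u64,
    pub wpcopy: u64,
    pub irq: u64,
}

/// cpu + blkio + freepages + compact + wpcopy + irq + max(swapin, thrashing).
/// swapin and thrashing overlap, so only the larger one is counted.
pub fn total_offcpu_delay_ns(d: &DelayTotals) -> u64 {
    d.cpu + d.blkio + d.freepages + d.compact + d.wpcopy + d.irq + d.swapin.max(d.thrashing)
}
```

A plain `+` is used for readability; code that must tolerate extreme counter values could use `saturating_add` instead.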

Drill into the per-bucket averages

If total_offcpu_delay_ns jumped on a process, the per-bucket avg_*_delay_ns derivations identify which off-CPU phase grew (the --sections taskstats-delay filter keeps the raw counters and all nine derivations together):

| Bucket | Average derivation | Meaning |
|---|---|---|
| CPU runqueue wait | avg_cpu_delay_ns | Time waiting for the scheduler to pick the task |
| Block I/O wait | avg_blkio_delay_ns | Synchronous block-device wait; the canonical delay-accounting reading, distinct from schedstat iowait_sum |
| Swap-in / Thrashing | avg_swapin_delay_ns / avg_thrashing_delay_ns | Memory pressure; the two overlap (a thrashing event is also a swapin) |
| Direct memory reclaim | avg_freepages_delay_ns | Allocator hit the __alloc_pages slowpath |
| Memory compaction | avg_compact_delay_ns | High-order allocation stalled on compaction |
| CoW page-fault | avg_wpcopy_delay_ns | Write-protect-copy fault, e.g. fork-then-write |
| IRQ handling | avg_irq_delay_ns | Time charged to the task by IRQ accounting |

One caveat on avg_cpu_delay_ns: the kernel updates its count and total locklessly, so a reader can catch one ahead of the other — the quotient is approximate at the sub-event scale and stable at the integrated scale.

A growing avg_cpu_delay_ns with flat blkio/swap/freepages suggests the new scheduler is making poor placement choices — the task is queueing more often or for longer, but no other subsystem is to blame. A growing avg_blkio_delay_ns with flat avg_cpu_delay_ns points away from the scheduler entirely (disk, network filesystem, or a userspace lock pattern).

Cross-reference the primary table

Once a bucket is identified, look at the underlying counters without the section filter:

ktstr ctprof compare baseline.ctprof.zst candidate.ctprof.zst \
    --metrics nr_wakeups,nr_migrations,wait_sum,wait_count,run_time_ns,timeslices

Useful pairings when the suspect bucket is CPU runqueue wait:

  • wait_sum / wait_count — schedstat’s average wait per scheduling event (the avg_wait_ns derivation). If this confirms avg_cpu_delay_ns, both delay-accounting paths agree.
  • nr_migrations — the new scheduler may be moving the task more aggressively; cross-CPU migrations cost wall-clock time even when run_time_ns is identical.
  • nr_wakeups_affine / nr_wakeups_affine_attempts — the affine_success_ratio derivation (CFS-only). A large drop with growing avg_cpu_delay_ns is a strong signal for cache-unfriendly placement.

Confirm taskstats data is actually populated

Pre-flight (save a capture round-trip): verify the host has the delayacct runtime toggle enabled before capturing on a fresh boot:

sysctl kernel.task_delayacct       # must report `kernel.task_delayacct = 1`
zcat /proc/config.gz | grep -E 'CONFIG_TASKSTATS|CONFIG_TASK_DELAY_ACCT'
                                   # both must read `=y` for delayacct to fire

If task_delayacct is 0, set it with sysctl -w kernel.task_delayacct=1 (or persist via /etc/sysctl.d/) before capture. If the kconfig lines are absent, the running kernel was built without delayacct support and re-capturing won’t help. capture also logs read-failure tallies with a hint when a whole class of files is unreadable — a real example from a kernel missing two kconfigs:

2026-07-04T22:00:48.765819Z  INFO ktstr::ctprof: ctprof parse: 1215 tids walked, 1646 read failures (dominant: io); hint: schedstat / io read failures dominate — kernel may be built without CONFIG_SCHED_INFO and/or CONFIG_TASK_IO_ACCOUNTING

If every taskstats column reads zero after the pre-flight, the snapshot likely hit a gating problem rather than a real “no delay” reading. The snapshot records a structured per-capture tally; no subcommand prints it, but the snapshot is zstd-compressed JSON, so read it directly:

zstd -dc candidate.ctprof.zst | \
    jq '{taskstats: .taskstats_summary, tids_walked: .parse_summary.tids_walked}'

Then match the tally against these cases:

  • eperm_count > 0 — the capturing process lacked CAP_NET_ADMIN. Re-run as root, or grant cap_net_admin+eip via setcap.
  • esrch_count near tids_walked — every tid raced exit before the per-tid query landed. Lengthen the workload’s steady-state window and re-capture.
  • ok_count == 0 and eperm_count == 0 — the netlink open failed, almost always meaning the kernel was built without CONFIG_TASKSTATS. Rebuild with the kconfig.
  • ok_count > 0 but every delay column reads zero — kernel built with the kconfigs but launched without the runtime toggle. Add delayacct to the kernel cmdline, or set sysctl kernel.task_delayacct=1 and re-capture.
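The triage above reads as a decision ladder. A minimal sketch, assuming the tally has been parsed into a plain struct: the counter names come from the list, while the struct, the enum, and the 90% cutoff standing in for "near tids_walked" are assumptions of this example.

```rust
/// Counters named in the list above. The struct itself is illustrative.
pub struct TaskstatsTally {
    pub ok_count: u64,
    pub eperm_count: u64,
    pub esrch_count: u64,
    pub tids_walked: u64,
}

#[derive(Debug, PartialEq)]
pub enum Diagnosis {
    MissingCapNetAdmin,
    TidsRacedExit,
    NoTaskstatsSupport,
    RuntimeToggleOff,
    Populated,
}

/// `all_delays_zero` says whether every delay column reads zero.
/// "Near tids_walked" is taken as >= 90% here, an assumed cutoff.
pub fn diagnose(t: &TaskstatsTally, all_delays_zero: bool) -> Diagnosis {
    if t.eperm_count > 0 {
        Diagnosis::MissingCapNetAdmin
    } else if t.tids_walked > 0 && t.esrch_count * 10 >= t.tids_walked * 9 {
        Diagnosis::TidsRacedExit
    } else if t.ok_count == 0 {
        Diagnosis::NoTaskstatsSupport
    } else if all_delays_zero {
        Diagnosis::RuntimeToggleOff
    } else {
        Diagnosis::Populated
    }
}
```

The order matters: a permission failure is checked first because it also leaves ok_count at zero.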

What next

  • If the candidate scheduler is a branch of the baseline, gate the regression with cargo ktstr perf-delta so CI catches the next one.
  • If the question is “is this scheduler worth its overhead at all”, run the scheduler-vs-EEVDF comparison — one test, both schedulers, per-phase deltas.
  • ctprof reference — full metric registry and gating documentation.

Customize Checking

Override checking thresholds for schedulers that tolerate higher imbalance, different gap thresholds, or relaxed event rates — and opt in to the checks that are off by default.

Warning

Assert::default_checks() is Assert::NO_OVERRIDES — every field None. Until a scheduler-level or per-test override sets a threshold, no worker assertions run. A green suite with no overrides proves only that the VM booted and the scheduler didn’t crash.

What a tripped gate looks like

Here a test set min_iteration_rate to a floor the workload could never meet (deliberately, to force the failure). The report names each worker that missed the gate, with the measured rate and the floor it was compared against:

    ktstr_test 'throughput_gate' [sched=scx-ktstr] [topo=1n1l2c1t] failed:
      worker 71 iteration rate 41903.3/s below floor 50000000.0/s
      worker 73 iteration rate 37834.5/s below floor 50000000.0/s

    --- stats ---
    2 workers, 4 cpus, 2 migrations, worst_spread=0.0%, worst_gap=21ms
      cg0: workers=1 cpus=2 spread=0.0% gap=10ms migrations=1 iter=209600
      cg1: workers=1 cpus=2 spread=0.0% gap=21ms migrations=1 iter=189252
...
    --- monitor ---
    samples=41 max_imbalance=2.00 max_dsq_depth=0 stuck=0
    avg: imbalance=1.32 nr_running/cpu=1.2 dsq/cpu=0.0
    events: fallback=0 (0.0/s) keep_last=210 (52.5/s) offline=0
    events+: refill_slice_dfl=210
    schedstat: csw=586 (146/s) run_delay=381246314ns/s ttwu=204 goidle=1
    bpf: ktstr_select_cp cnt=189 145ns/call
    bpf: ktstr_enqueue cnt=373 34ns/call
    bpf: ktstr_dispatch cnt=584 237ns/call
    verdict: monitor OK

Note the two channels: the worker gate tripped (the two below floor lines) while the monitor verdict is OK — worker checks and host-side monitor checks are evaluated independently. See Checking for the model. The fix is whichever of these matches the intent: set a floor the scheduler can actually meet, or fix the scheduler until it meets the floor you wrote.

Scheduler-level overrides

Declare a scheduler with assertion overrides that apply to every test using it:

use ktstr::declare_scheduler;
use ktstr::prelude::*;

declare_scheduler!(RELAXED, {
    name = "relaxed",
    binary = "scx_relaxed",
    assert = Assert::NO_OVERRIDES
        .max_imbalance_ratio(5.0)    // tolerate 5:1 imbalance
        .max_fallback_rate(500.0)    // higher fallback rate ok
        .fail_on_stall(false),       // don't fail on stall
});

These are the first layer that can carry an actual check — without them (or a per-test override), nothing asserts.

Per-test overrides

Attributes on #[ktstr_test] merge last and win:

#[ktstr_test(
    scheduler = RELAXED,
    not_starved = true,
    max_gap_ms = 5000,
    max_imbalance_ratio = 10.0,
    sustained_samples = 10,
)]
fn high_imbalance_test(ctx: &Ctx) -> Result<AssertResult> {
    // Inherits topology from RELAXED
    Ok(AssertResult::pass())
}

not_starved = true enables the starvation, fairness-spread, and scheduling-gap checks as a group; each threshold can still be overridden independently. The full attribute list and default thresholds live in the #[ktstr_test] reference.

Merge order

The runtime evaluates Assert::default_checks().merge(&scheduler.assert).merge(&test.assert) — three layers, last-Some-wins per field. Worked example:

  • scheduler layer: max_imbalance_ratio(5.0)
  • test layer: max_imbalance_ratio = 10.0
  • effective: 10.0 — the test’s Some wins; every field the test leaves None falls through to the scheduler layer, then to the (all-None) defaults.

To see the merged result for a registered test without reading source:

cargo ktstr show-thresholds high_imbalance_test

It prints every threshold field of the exact Assert the runtime will evaluate, with none for unset fields.
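The last-Some-wins rule is easy to model. The sketch below is a stand-in with three fields rather than the real Assert (which has more fields and whose internals are not shown here); it reproduces the worked example:

```rust
/// Three threshold fields modeled as Options. A stand-in for illustration,
/// not the real Assert type.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
pub struct Thresholds {
    pub max_imbalance_ratio: Option<f64>,
    pub max_gap_ms: Option<u64>,
    pub fail_on_stall: Option<bool>,
}

impl Thresholds {
    /// Per field: take `other`'s value when it is Some, else keep ours.
    pub fn merge(self, other: &Thresholds) -> Thresholds {
        Thresholds {
            max_imbalance_ratio: other.max_imbalance_ratio.or(self.max_imbalance_ratio),
            max_gap_ms: other.max_gap_ms.or(self.max_gap_ms),
            fail_on_stall: other.fail_on_stall.or(self.fail_on_stall),
        }
    }
}
```

Merging all-None defaults, then a scheduler layer with `max_imbalance_ratio: Some(5.0)`, then a test layer with `Some(10.0)` yields 10.0, while a field only the scheduler layer set survives untouched.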

Using Assert directly in ops scenarios

fn my_scenario(ctx: &Ctx) -> Result<AssertResult> {
    let checks = Assert::NO_OVERRIDES
        .check_not_starved()
        .max_gap_ms(3000);

    let steps = vec![/* ... */];
    execute_steps_with(ctx, steps, Some(&checks))
}

execute_steps_with applies the given Assert for worker checks, overriding the merged config. execute_steps (without _with) passes None and falls back to ctx.assert — the merged three-layer config above. Reaching for _with when you meant to add to the merged config is a classic trap: the explicit Assert replaces ctx.assert, it does not compose with it.

See Ops, Steps, and Backdrop for the step execution model.

Benchmark Gates and Negative Tests

A performance gate you have never seen fail proves nothing. This recipe builds gates in pairs: a positive test that holds the scheduler to a floor, and a negative twin that degrades the scheduler on purpose and asserts the same gate trips. (Extracting metrics from benchmark payloads like schbench or fio is covered in Payloads and Included Files; cross-commit regression gates are A/B Compare Branches.)

Positive: gate a scenario

use ktstr::declare_scheduler;
use ktstr::prelude::*;

declare_scheduler!(MY_SCHED, {
    name = "my_sched",
    binary = "scx_my_sched",
    topology = (1, 1, 2, 1),
});

#[ktstr_test(
    scheduler = MY_SCHED,
    performance_mode = true,
    duration_s = 5,
    sustained_samples = 15,
)]
fn perf_positive(ctx: &Ctx) -> Result<AssertResult> {
    let checks = Assert::default_checks()
        .min_iteration_rate(5000.0)
        .max_gap_ms(500);
    let steps = vec![Step::with_defs(
        vec![CgroupDef::named("cg_0").workers(2)],
        HoldSpec::FULL,
    )];
    execute_steps_with(ctx, steps, Some(&checks))
}

Key points:

  • performance_mode = true pins vCPUs to reserved host cores so the measurement isn’t host noise — see Performance Mode.
  • Assert::default_checks() is all-None: every gate is opt-in, and nothing fails until you chain a setter. Threshold layering and merge order live in Customize Checking.
  • execute_steps_with applies the Assert during worker checks.

A passing gated run looks like any passing run:

cargo ktstr: resolved kernel "7.0"
...
    Starting 1 test across 121 binaries (12531 tests skipped)
        PASS [  22.158s] (1/1) ktstr::ktstr_sched_tests ktstr/sched_basic_proportional
────────────
     Summary [  22.197s] 1 test run: 1 passed, 12531 skipped

Choosing the floor

Don’t guess thresholds. Run the scenario once with no gates, read the observed per-cgroup iteration counts from the --- stats --- block (or the run’s stats sidecar), and set the floor at a healthy margin below the observed rate — far enough down that run-to-run noise never trips it, close enough up that a real regression does. Tighten later once perf-delta --noise-adjust has shown you the actual run-to-run spread.
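One way to turn observed rates into a floor, as a sketch: take the slowest rate seen across ungated runs and subtract a margin. The helper name and the margin-as-fraction convention are assumptions of this example, not part of ktstr.

```rust
/// Suggest a min_iteration_rate floor from ungated runs: the slowest
/// observed per-worker rate, scaled down by `margin` (0.2 = 20% headroom).
/// Returns None for empty input or a margin outside [0, 1).
pub fn suggest_floor(observed_rates: &[f64], margin: f64) -> Option<f64> {
    if observed_rates.is_empty() || !(0.0..1.0).contains(&margin) {
        return None;
    }
    let slowest = observed_rates
        .iter()
        .copied()
        .fold(f64::INFINITY, f64::min);
    Some(slowest * (1.0 - margin))
}
```

Anchoring on the slowest worker rather than the mean keeps the floor below every honest observation; the margin is the knob to tighten once the real run-to-run spread is known.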

Negative: prove the gate fires

expect_err = true inverts the harness: the test must fail. An Ok return panics with expected test to fail but it passed, so a gate that silently stopped firing turns the negative test red. Skips are handled separately: a run that could not boot a kernel or lost the host-resources race emits a SKIP banner and does not count as the expected failure — an environment problem is never mistaken for proof that the gate fired (the skip-vs-fail taxonomy lives in Troubleshooting). Auto-repro is disabled automatically for expected-error tests.

#[ktstr_test(
    scheduler = MY_SCHED,
    performance_mode = true,
    duration_s = 5,
    extra_sched_args = ["--degrade"],
    expect_err = true,
)]
fn perf_negative(ctx: &Ctx) -> Result<AssertResult> {
    let checks = Assert::default_checks()
        .min_iteration_rate(5000.0)
        .max_gap_ms(500);
    let steps = vec![Step::with_defs(
        vec![CgroupDef::named("cg_0").workers(2)],
        HoldSpec::FULL,
    )];
    execute_steps_with(ctx, steps, Some(&checks))
}

extra_sched_args passes CLI args to the scheduler binary. --degrade is a real knob on ktstr’s fixture scheduler (scx-ktstr) that deliberately worsens its scheduling; the fixture also exposes --slow, --stall-after, and --fail-verify for other failure classes. Substitute your own scheduler’s equivalent — a degradation flag your scheduler ships for exactly this purpose is a feature, not a wart.

When the degraded run trips the gate, the failure the harness expects looks like the real thing, because it is:

    ktstr_test 'throughput_gate' [sched=scx-ktstr] [topo=1n1l2c1t] failed:
      worker 71 iteration rate 41903.3/s below floor 50000000.0/s
      worker 73 iteration rate 37834.5/s below floor 50000000.0/s

The in-repo pattern to copy is tests/assert_gate_matrix.rs: a macro stamps out a positive and a negative variant for each of ten worker gates (max_p99_wake_latency_ns, max_wake_latency_cv, min_iteration_rate, max_gap_ms, …), with the negative variant passing --degrade and setting expect_err. Each gate in the matrix is proven to fire by its own negative test — hold your gates to the same standard.

Compare a Scheduler vs EEVDF

A standard regression guard for a sched_ext scheduler: does it match (or beat) the kernel default (EEVDF) on the same workload — not just for throughput, but for latency and CPU overhead too? Run the workload under the scheduler in one phase, detach the scheduler mid-run so the kernel default takes over for a second phase, then compare the two phases metric by metric.

The workload must persist across the detach — a Backdrop population, not per-step workers — so its cumulative counters span both phases. That shared, continuous measurement is what makes a per-phase delta meaningful (per-step workers reset each phase and read ~0).

Two readers cover the comparison, both on the &VmResult a post_vm callback receives (the host-side hook that runs after the VM exits):

  • VmResult::throughput_ratio(a, b) — iterations/sec from the stimulus timeline. The timeline carries per-step boundaries independent of the periodic-capture pipeline, so throughput works even for --cell-parent-cgroup schedulers.
  • VmResult::phase_metric(phase, name) — any other per-phase metric by its registry name (see Checking): CPU overhead (system_time_ns, user_time_ns) and scheduling quality (avg_imbalance_ratio, avg_dsq_depth). Wake-latency and run-delay distributions are run-level — pooled across cgroups into one whole-run value — so they cannot be split into the scheduler phase vs the EEVDF phase; to compare them, run the scheduler and EEVDF as two separate tests and read each run’s run-level metric. Everything else flows through the one per-phase bucket pipeline, so a new metric becomes comparable here the moment it lands in that pipeline.

Put together, the comparison test looks like this:

use anyhow::{ensure, Result};
use ktstr::assert::{AssertResult, Phase};
use ktstr::ktstr_test;
use ktstr::prelude::{Backdrop, VmResult};
use ktstr::scenario::Ctx;
use ktstr::scenario::ops::{execute_scenario, CgroupDef, HoldSpec, Op, Step};
use ktstr::test_support::{Scheduler, SchedulerSpec};

// Built directly rather than via declare_scheduler! so this comparison
// harness stays out of the verifier sweep (manual consts are not
// registered for sweeping). Use declare_scheduler! for the scheduler
// definition you ship.
const MY_SCHED: Scheduler =
    Scheduler::named("my_sched").binary(SchedulerSpec::Discover("scx_my_sched"));

// Runs on the host after the VM exits; the &VmResult carries the stimulus
// timeline and the per-phase metric buckets the comparison reads.
fn compare_vs_eevdf(result: &VmResult) -> Result<()> {
    let sched = Phase::step(0); // first Step ran under the scheduler under test
    let eevdf = Phase::step(1); // second Step ran under EEVDF, after the detach

    // Throughput: > 1.0 means the scheduler out-throughputs EEVDF; < 1.0
    // is a regression.
    let throughput = result
        .throughput_ratio(sched, eevdf)
        .ok_or_else(|| anyhow::anyhow!("no per-phase throughput — did both phases run?"))?;
    ensure!(
        throughput >= 0.8,
        "my_sched throughput is {throughput:.2}x EEVDF (below the 0.8x floor)"
    );

    // Scheduling quality: any per-phase metric compares the same way via
    // phase_metric. Skip the gate when a phase has no reading (None)
    // rather than failing. (Wake-latency / run-delay distributions are
    // run-level and not readable here — see the reader list above.)
    if let (Some(s), Some(e)) = (
        result.phase_metric(sched, "avg_imbalance_ratio"),
        result.phase_metric(eevdf, "avg_imbalance_ratio"),
    ) {
        ensure!(s <= e * 1.5, "my_sched imbalance {s:.2} is >1.5x EEVDF {e:.2}");
    }

    // CPU overhead: per-phase kernel (system) CPU time.
    if let (Some(s), Some(e)) = (
        result.phase_metric(sched, "system_time_ns"),
        result.phase_metric(eevdf, "system_time_ns"),
    ) {
        ensure!(s <= e * 2.0, "my_sched system time {s:.0}ns is >2x EEVDF {e:.0}ns");
    }

    Ok(())
}

#[ktstr_test(
    scheduler = MY_SCHED,
    duration_s = 10,
    watchdog_timeout_s = 10,
    post_vm = compare_vs_eevdf,
)]
fn scheduler_vs_eevdf(ctx: &Ctx) -> Result<AssertResult> {
    // Persistent Backdrop population: runs across both phases so its
    // cumulative counters span the detach.
    let backdrop = Backdrop::new().push_cgroup(CgroupDef::named("cg").workers(4));
    let steps = vec![
        // Phase A: workload under the scheduler under test.
        Step::new(vec![], HoldSpec::frac(0.5)),
        // Phase B: detach -> the kernel default (EEVDF) takes over.
        Step::new(vec![Op::detach_scheduler()], HoldSpec::frac(0.5)),
    ];
    execute_scenario(ctx, backdrop, steps)
}

The 0.8x / 1.5x / 2.0x bounds above are illustrative, not recommendations. Calibrate yours: run the test a few times with generous bounds, note the observed ratios (each ensure! message prints them; a run’s failure output leads with whichever message tripped), and set each floor just outside the observed noise band. A gate inside the noise band fails honest runs; one far outside it never fails at all.

Notes:

  • Op::detach_scheduler() cleanly hands the workload to the kernel default. Each step emits its own boundary, so no trailing closer step is needed, and the intentional detach is not promoted to a scheduler-died failure.
  • Phases are keyed by Phase: Phase::step(0) is the first scenario Step, Phase::step(1) the second. Phase::BASELINE is the pre-Step settle window. Use Phase rather than the raw stimulus step_index.
  • phase_metric returns None when a phase has no reading for a metric, so gate inside if let (Some(..), Some(..)) rather than unwrapping — a metric that did not populate skips its gate instead of failing the run.
  • For cross-cell balance rather than a phase-vs-phase comparison, read result.stats.cgroup_balance_ratio() in the test body (the test body’s AssertResult carries stats).

This test gates scheduler-vs-EEVDF within one run. To gate your scheduler against its own past self across commits, use cargo ktstr perf-delta — the two nets catch different regressions, and CI wants both.

Architecture Overview

A scheduler bug rarely returns an error code — it wedges a CPU, strands a runqueue, or panics the kernel. ktstr’s architecture follows from that: every test boots its own KVM microVM, so a crash takes down a disposable guest instead of your machine; the CPU topology is whatever the test declared; and the kernel is the exact build you targeted.

ktstr has three execution domains:

  1. Host process — the test binary running on the host. Manages VM lifecycle, monitors guest memory, evaluates results.

  2. Guest process — the same test binary running inside the VM as PID 1. Mounts filesystems, starts the scheduler, creates cgroups, forks workers, runs scenarios, and writes results back to the host.

  3. Monitor thread — runs on the host while the guest executes. Reads guest VM memory directly to observe scheduler state without instrumenting it.

Execution flow

Host                          Guest
----                          -----
test binary                   
  |                           
  +-- build initramfs         
  |   (test binary as /init   
  |    + optional scheduler)  
  |                           
  +-- boot KVM VM             
  |                           test binary (PID 1 init)
  |                             |
  +-- start monitor thread      +-- mount filesystems
  |   (reads guest memory)      +-- start scheduler (if any)
  |                             +-- create cgroups
  |                             +-- fork workers
  |                             +-- move workers to cgroups
  |                             +-- signal workers to start
  |                             +-- poll scheduler liveness
  |                             +-- stop workers, collect reports
  |                             +-- evaluate results
  |                             +-- write result to virtio-console port 1
  |                           
  +-- read result from virtio-console port 1
  +-- evaluate monitor data   
  +-- report pass/fail        

Results travel on virtio-console port 1; panics, crashes, and other non-blockable diagnostics fall back to the COM2 serial port (see VMM — guest–host transports).

From the host, a passing run looks like this:

cargo ktstr: fetching latest 7.0.x kernel version
cargo ktstr: latest 7.0.x kernel: 7.0.14
cargo ktstr: resolved kernel "7.0"
...
    Starting 1 test across 121 binaries (12531 tests skipped)
        PASS [  34.451s] (1/1) ktstr::failure_dump_e2e ktstr/failure_dump_renders_bss_fields
...
     Summary [  34.490s] 1 test run: 1 passed, 12531 skipped

Key design decisions

Same binary, two roles. The test binary serves as both host controller and guest test runner. The initramfs embeds the binary as /init; when the binary finds itself running as PID 1, it executes the guest lifecycle (mounts, scheduler start, test dispatch, reboot) instead of the host one. One cargo build produces everything needed for both sides — there is no separate guest agent to version or ship.

Forked workers (default), threads optional. The default Fork clone mode spawns each worker as its own process so cgroup placement via cgroup.procs is tgid-granular. The Thread clone mode shares the harness’s tgid and routes placement through cgroup.threads instead — useful when workers need a shared address space or when measuring thread-only scheduler paths. See Workers and Workloads.

Host-side monitoring. The monitor reads guest memory via KVM, avoiding BPF instrumentation of the scheduler under test. This eliminates observer effects on scheduling decisions.

Where to go next

  • VMM — how VMs boot, topology modeling, guest–host transports.
  • Monitor — what is observed from the host and how violations become verdicts.
  • Workers and Workloads — worker lifecycle and the telemetry each worker reports.
  • CgroupManager / CgroupGroup — cgroup plumbing and RAII cleanup inside the guest.

VMM

ktstr includes a purpose-built VMM (virtual machine monitor) that boots Linux kernels in KVM for testing.

Why a purpose-built VMM

Three requirements rule out reusing a general-purpose VMM:

  • Direct guest-memory access. The monitor reads scheduler state straight out of guest DRAM through a host-side pointer into the VM’s memory mapping. Owning the VMM means owning that mapping — no guest agent, no hypercall surface, no negotiation with someone else’s memory model.
  • Topology is the product. Tests declare NUMA nodes, LLCs, cores, and SMT threads, and the guest must actually have that shape — down to asymmetric node sizes, inter-node distances, and CXL memory-only nodes. The VMM builds the ACPI tables to the declared shape rather than approximating it with generic knobs.
  • Boot cost is paid per test. Every #[ktstr_test] boots a fresh VM, so setup has to be cheap. From a real run (2-vCPU guest, warm caches):
  initramfs spawn: 55.583µs
  kvm+kernel: 867.005µs
  setup_memory (joins initramfs): 1.409360963s
  setup_vcpus: 1.409565321s
VM setup total: 1.409619773s

Creating the KVM VM and loading the kernel costs under a millisecond; the dominant cost is populating guest memory, which joins the cached initramfs build (below). After setup, the guest still has to boot the kernel — total wall-clock per test is dominated by the scenario’s own duration.

KtstrVm builder

let result = vmm::KtstrVm::builder()
    .kernel(&kernel_path)
    .init_binary(&ktstr_binary)
    .topology(Topology::new(numa_nodes, llcs, cores_per_llc, threads_per_core))
    .memory_mib(4096)
    .run_args(&["run".into(), "--ktstr-test-fn".into(), "my_test".into()])
    .build()?
    .run()?;

Test authors do not touch this directly — #[ktstr_test] drives it — but every attribute on the macro (topology dims, memory, kargs) lands here.

Topology

The VM topology is specified as (numa_nodes, llcs, cores_per_llc, threads_per_core). On x86_64, the VMM creates ACPI tables (MADT, SRAT, SLIT, and HMAT when numa_nodes > 1) and MP tables. On aarch64, topology is expressed via FDT cpu nodes with MPIDR-derived reg properties.

pub struct Topology {
    pub llcs: u32,
    pub cores_per_llc: u32,
    pub threads_per_core: u32,
    pub numa_nodes: u32,
    pub nodes: Option<&'static [NumaNode]>,
    pub distances: Option<&'static NumaDistance>,
}

total_cpus() = llcs × cores_per_llc × threads_per_core.
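A worked instance of the formula, using a stripped-down stand-in for the struct (the `Dims` type is hypothetical; only the arithmetic is taken from the text):

```rust
/// Minimal stand-in carrying only the four dimensions.
pub struct Dims {
    pub numa_nodes: u32,
    pub llcs: u32,
    pub cores_per_llc: u32,
    pub threads_per_core: u32,
}

impl Dims {
    /// llcs x cores_per_llc x threads_per_core. numa_nodes does not enter
    /// the product, which implies llcs counts LLCs across all nodes.
    pub fn total_cpus(&self) -> u32 {
        self.llcs * self.cores_per_llc * self.threads_per_core
    }
}
```

So (1, 1, 2, 1) is a 2-CPU guest, and (2, 4, 4, 2) is 4 × 4 × 2 = 32 CPUs spread over two nodes.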

When nodes is None (the default), memory and LLCs are distributed uniformly across NUMA nodes with default 10/20 distances. When Some, each NumaNode specifies its LLC count, memory size, and optional HMAT attributes (latency_ns, bandwidth_mbs, mem_side_cache). A NumaNode with llcs = 0 models a CXL memory-only node.

NumaDistance is an NxN inter-node distance matrix. Diagonal entries must be 10 and off-diagonal > 10 (ACPI SLIT requirements); ktstr additionally requires the matrix to be symmetric.
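Those three rules can be checked mechanically. The function below is a sketch over a plain nested vector, not the NumaDistance type itself, and the `u8` entry width is an assumption of this example:

```rust
/// Check an NxN distance matrix against the rules above: square, diagonal
/// exactly 10, off-diagonal greater than 10, and symmetric.
pub fn validate_distances(m: &[Vec<u8>]) -> Result<(), String> {
    let n = m.len();
    // Squareness first, so the index pairs below cannot go out of bounds.
    if let Some(i) = m.iter().position(|row| row.len() != n) {
        return Err(format!("row {i} has {} entries, expected {n}", m[i].len()));
    }
    for i in 0..n {
        for j in 0..n {
            let d = m[i][j];
            if i == j && d != 10 {
                return Err(format!("diagonal [{i}][{j}] is {d}, must be 10"));
            }
            if i != j && d <= 10 {
                return Err(format!("off-diagonal [{i}][{j}] is {d}, must be > 10"));
            }
            if d != m[j][i] {
                return Err(format!("asymmetric: [{i}][{j}] = {d}, [{j}][{i}] = {}", m[j][i]));
            }
        }
    }
    Ok(())
}
```

The default two-node case, `[[10, 20], [20, 10]]`, passes; `[[10, 20], [21, 10]]` satisfies the SLIT rules but fails the additional symmetry requirement.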

Use Topology::new(numa_nodes, llcs, cores, threads) for uniform topologies, or Topology::with_nodes(cores, threads, &nodes) for explicit per-node configuration. The test-author view of all this is Topology.

initramfs

The VMM builds a cpio initramfs containing:

  • The test binary (as /init)
  • Optional scheduler binary (as /scheduler)
  • Shared library dependencies (resolved via ELF DT_NEEDED parsing)

The initramfs is split into a cached base plus a per-run suffix. The base cache key is derived from the payload’s shared-library set and the content hashes of the packed scheduler/probe/worker binaries and include files — not the test binary’s own bytes, which ride the per-run suffix. So recompiling your tests keeps the base cache warm, while recompiling the scheduler invalidates it. The cached base lives in a shared-memory segment that concurrent VMs map zero-copy, sharing physical pages across parallel tests.

Guest–host transports

| Transport | Carries |
|---|---|
| COM1 (serial) | Guest kernel console. Forwarded to stderr with --dmesg. |
| COM2 (serial) | Crash diagnostics only: the guest panic hook writes PANIC: <info> plus a backtrace here. |
| /dev/hvc0 (console port 0) | Interactive console for ktstr shell. |
| Console port 1 | The primary guest-to-host data channel: test results, exit codes, scenario markers, payload metrics, coverage data, scheduler-exit notifications. |
| Console port 2 | Transparent byte relay for scx_stats requests/responses between the host and the in-guest scheduler. |

Two details worth internalizing:

  • COM2 is crash-only. Ordinary guest stdout/stderr does not use COM2 — it travels over the port-1 stream as framed messages. COM2 exists for diagnostics that must get out even when the framed transport can’t be trusted (panics, fatal signals). The host parses the PANIC: header and surfaces the backtrace in test failure output.
  • Port 1 frames are integrity-checked. Each frame on the port-1 stream carries a CRC32, so a corrupted result is detected rather than mis-parsed.
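To make the integrity check concrete, here is a sketch of a CRC-guarded frame. Everything about the layout is an assumption for illustration: the text above says only that each frame carries a CRC32, so the payload-then-little-endian-checksum format, the IEEE polynomial, and the function names are all hypothetical.

```rust
/// Bitwise CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320).
pub fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // All-ones when the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Hypothetical layout: payload followed by its CRC-32, little-endian.
pub fn encode_frame(payload: &[u8]) -> Vec<u8> {
    let mut frame = payload.to_vec();
    frame.extend_from_slice(&crc32(payload).to_le_bytes());
    frame
}

/// Returns the payload only if the trailing checksum matches it.
pub fn decode_frame(frame: &[u8]) -> Option<&[u8]> {
    let split = frame.len().checked_sub(4)?;
    let (payload, tail) = frame.split_at(split);
    let stored = u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]);
    if crc32(payload) == stored {
        Some(payload)
    } else {
        None
    }
}
```

A single flipped bit in the payload changes the computed checksum, so `decode_frame` returns None instead of handing a corrupted result to the parser.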

Performance mode

When performance mode is enabled, the VMM applies host-side isolation (vCPU pinning, hugepages, NUMA mbind, RT scheduling), guest-visible hints (KVM_HINTS_REALTIME CPUID), and KVM exit suppression. Non-performance-mode VMs set the KVM halt-poll interval to 200µs; overcommitted topologies set it to 0. See Performance Mode.

Dual-role dispatch

The same test binary is the host controller and the guest /init; Architecture Overview tells the story. The mechanics: a constructor function runs before main() in every ktstr-linked binary. Running as PID 1, it executes the guest init path (mounts, scheduler start, test dispatch, reboot); given --ktstr-test-fn plus a topology argument, it boots a VM as the host side; given only --ktstr-test-fn, it runs the test function directly because it is already inside a VM.
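The three-way decision can be sketched as a pure function. This is a model of the rule as described, not the actual constructor: `--ktstr-topo` is a placeholder because the topology argument's real name is not given here, and the fourth "do nothing" arm is an assumption about what happens when no ktstr flag is present.

```rust
#[derive(Debug, PartialEq)]
pub enum Role {
    /// Running as PID 1 inside the guest: run the init path.
    GuestInit,
    /// Test function plus topology requested: boot a VM from the host.
    HostBootVm,
    /// Test function only: already inside a VM, run it directly.
    RunTestDirectly,
    /// Assumed fallback: not a ktstr invocation, continue to main().
    Passthrough,
}

/// `--ktstr-topo` is a hypothetical stand-in for the topology argument.
pub fn decide_role(pid: u32, args: &[&str]) -> Role {
    let has = |flag: &str| args.iter().any(|a| *a == flag);
    if pid == 1 {
        Role::GuestInit
    } else if has("--ktstr-test-fn") && has("--ktstr-topo") {
        Role::HostBootVm
    } else if has("--ktstr-test-fn") {
        Role::RunTestDirectly
    } else {
        Role::Passthrough
    }
}
```

The PID check comes first because the guest init process may be started with test arguments too; being PID 1 decides the role regardless of flags.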

Boot process

  1. Load the kernel (bzImage on x86_64, Image on aarch64).
  2. Create KVM vCPUs matching the declared topology. High vCPU counts add measurable boot latency — see Performance Mode for sizing.
  3. Build and load the initramfs.
  4. Set up serial devices (COM1 kernel console, COM2 crash diagnostics), the virtio console, and virtio block/net devices for disk- and network-shaped workloads.
  5. Boot the kernel.
  6. The kernel starts /init (the test binary); PID 1 detection routes into the guest lifecycle: mount filesystems, start the scheduler, dispatch the test function, reboot.

Monitor

When your scheduler leaves a CPU starving, the monitor is what notices. It runs on the host while the guest executes and reads scheduler state directly out of guest memory — the scheduler under test never executes an extra instruction to be observed, and no BPF probe perturbs the decisions being measured.

Every test report ends with the monitor’s summary. From a real run:

--- monitor ---
samples=41 max_imbalance=2.00 max_dsq_depth=0 stuck=0
avg: imbalance=1.32 nr_running/cpu=1.2 dsq/cpu=0.0
events: fallback=0 (0.0/s) keep_last=210 (52.5/s) offline=0
events+: refill_slice_dfl=210
schedstat: csw=586 (146/s) run_delay=381246314ns/s ttwu=204 goidle=1
bpf: ktstr_select_cp cnt=189 145ns/call
bpf: ktstr_enqueue cnt=373 34ns/call
bpf: ktstr_dispatch cnt=584 237ns/call
verdict: monitor OK

Reading it: samples is how many point-in-time snapshots the monitor took; max_imbalance/max_dsq_depth/stuck are the peaks the threshold checks evaluate; events are sched_ext event-counter rates (select-cpu fallbacks, keep-last dispatches); the bpf: lines are per-callback invocation counts and mean cost from the guest kernel’s BPF program-runtime stats; verdict is what folds into the test result.

What it reads

The monitor resolves kernel structure offsets from the guest kernel’s BTF — nothing is hardcoded per kernel version. Per CPU, it reads the runqueue’s nr_running, scx_nr_running, rq_clock, local_dsq_depth, and scx_flags, plus the sched_ext event counters (select-cpu fallback, dispatch keep-last, bypass activity, and the rest of the family). When the guest kernel has CONFIG_SCHEDSTATS, it also reads per-CPU struct rq schedstat fields (run_delay, pcount, ttwu_count, …).

It also walks the struct sched_domain tree — from rq->sd up the sd->parent chain — whenever the BTF exposes it, capturing per-level topology metadata (level, name, flags, span_weight) and runtime fields (balance_interval, nr_balance_failed, max_newidle_lb_cost), plus load-balancing stats when CONFIG_SCHEDSTATS is enabled. Fields that newer kernels added (the proportional-newidle counters newidle_call / newidle_success / newidle_ratio, new in 7.0 with some stable backports) resolve as optional: on kernels whose BTF lacks them they are simply absent, and the rest of the walk proceeds.

Sampling

The monitor takes periodic snapshots (MonitorSample) of all per-CPU state; each sample is a point-in-time view of every CPU. MonitorSummary aggregates samples into peak values (max imbalance ratio, max DSQ depth, stall detection), per-sample averages, and event-counter deltas. Averages are computed over valid samples only (excluding uninitialized guest memory — see below).

Threshold evaluation

MonitorThresholds defines the pass/fail conditions:

| Threshold | Default | Trips when |
|---|---|---|
| max_imbalance_ratio | 4.0 | max/min per-CPU nr_running exceeds the ratio |
| max_local_dsq_depth | 50 | any CPU’s local DSQ exceeds the depth |
| fail_on_stall | true | a CPU’s rq_clock stops advancing (exemptions below) |
| max_fallback_rate | 200.0/s | sustained select-cpu-fallback event rate |
| max_keep_last_rate | 100.0/s | sustained dispatch-keep-last event rate |
| sustained_samples | 5 | not a trip condition; sets the window: a violation must persist this many consecutive samples |

A violation must persist for sustained_samples consecutive samples before it counts — at the ~100ms sample interval, the default 5 means roughly 500ms of sustained violation. This filters transient spikes from cpuset transitions and cgroup creation/destruction. The reasoning behind each default value lives in Checking.
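The sustained window amounts to a consecutive-run counter. A minimal sketch, with a hypothetical type name; it models the rule as described rather than the monitor's actual code:

```rust
/// Tracks consecutive violating samples for one check.
pub struct SustainedWindow {
    needed: u32,
    consecutive: u32,
}

impl SustainedWindow {
    pub fn new(sustained_samples: u32) -> Self {
        SustainedWindow { needed: sustained_samples, consecutive: 0 }
    }

    /// Feed one sample. Returns true once the violation has persisted for
    /// `needed` consecutive samples; any clean sample resets the run.
    pub fn observe(&mut self, violating: bool) -> bool {
        self.consecutive = if violating { self.consecutive + 1 } else { 0 };
        self.consecutive >= self.needed
    }
}
```

With the default of 5, four violating samples followed by one clean sample never trips; five in a row trips on the fifth.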

Test authors do not construct MonitorThresholds directly: the #[ktstr_test] threshold attributes (max_imbalance_ratio, fail_on_stall, …) and Assert::with_monitor_defaults() feed it — see the macro reference.

enforce is the on/off gate for the threshold-violation path. The default is report-only: monitor evaluations record every violation in the verdict’s details, but the verdict passes. Opting in — via Assert::with_monitor_defaults(), which fills unset threshold fields and sets enforce — promotes recorded violations to failures. Setting a field like fail_on_stall without enforcement is a no-op for the violation path: the violation appears in the monitor report, the verdict still passes, and the summary carries a report-only advisory flagging the missing enforcement.

The no-signal arms bypass enforce entirely. An empty sample buffer, or data that fails the plausibility check below, always produces a verdict with passed: false, inconclusive: true, which folds into the test’s AssertResult as Inconclusive (exit code 2). “Couldn’t evaluate” is not the same as “evaluated and OK,” so the no-signal path always surfaces distinct from Pass. Only threshold violations are gated by enforce.
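
The interaction between enforce and the no-signal arms can be summarized as a small decision function. The sketch below is an assumption-laden illustration: Outcome and fold_verdict are hypothetical names, and ktstr's real verdict types carry more detail than a three-way enum.

```rust
/// Hypothetical outcome type for this sketch; ktstr's real types differ.
#[derive(Debug, PartialEq, Eq)]
enum Outcome {
    Pass,
    Fail,
    Inconclusive,
}

/// `has_signal` is false for an empty sample buffer or implausible data.
/// `violations` is the number of sustained threshold violations recorded.
fn fold_verdict(has_signal: bool, violations: usize, enforce: bool) -> Outcome {
    if !has_signal {
        // No-signal arms ignore `enforce`: "couldn't evaluate" is never Pass.
        return Outcome::Inconclusive;
    }
    if enforce && violations > 0 {
        Outcome::Fail
    } else {
        // Report-only: violations stay in the details, the verdict passes.
        Outcome::Pass
    }
}
```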

Stall detection

A stall is detected when a CPU’s rq_clock does not advance between consecutive samples. Three exemptions prevent false positives:

  • Idle CPUs: when nr_running == 0 in both the current and previous sample, the CPU has no runnable tasks. The kernel stops the tick (NOHZ) on idle CPUs, so rq_clock legitimately does not advance.
  • Preempted vCPUs: when the vCPU thread’s CPU time did not advance past the preemption threshold between samples, the host preempted the vCPU — the guest never got a chance to run, which is not the scheduler’s fault.
  • Sustained window: stall detection uses per-CPU consecutive counters and the sustained_samples threshold, matching the other checks. A single stuck sample does not trigger failure.
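
The three exemptions compose as follows. This is a sketch under stated assumptions: CpuSample, is_stuck, and stalled are hypothetical names, the field layout is invented for illustration, and the preemption threshold is passed in as a parameter because its actual value is not given here.

```rust
/// Hypothetical per-CPU view of one monitor sample (illustrative fields).
#[derive(Clone, Copy)]
struct CpuSample {
    rq_clock: u64,
    nr_running: u32,
    /// Cumulative host CPU time of the vCPU thread, in nanoseconds.
    vcpu_cpu_time_ns: u64,
}

/// True when this pair of samples counts as one stuck sample: the clock
/// did not advance and neither the idle nor the preemption exemption applies.
fn is_stuck(prev: CpuSample, cur: CpuSample, preempt_threshold_ns: u64) -> bool {
    if cur.rq_clock > prev.rq_clock {
        return false; // clock advanced
    }
    if prev.nr_running == 0 && cur.nr_running == 0 {
        return false; // idle CPU: NOHZ stops the tick
    }
    let ran_ns = cur.vcpu_cpu_time_ns.saturating_sub(prev.vcpu_cpu_time_ns);
    if ran_ns <= preempt_threshold_ns {
        return false; // the host preempted the vCPU
    }
    true
}

/// Applies the sustained window on top of `is_stuck` for one CPU's samples.
fn stalled(samples: &[CpuSample], preempt_threshold_ns: u64, sustained: u32) -> bool {
    let mut run = 0u32;
    for pair in samples.windows(2) {
        if is_stuck(pair[0], pair[1], preempt_threshold_ns) {
            run += 1;
            if run >= sustained {
                return true;
            }
        } else {
            run = 0;
        }
    }
    false
}
```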

Uninitialized memory detection

Before the guest kernel initializes per-CPU structures, monitor reads return garbage. Two layers handle this: summary computation skips individual samples where any CPU’s local_dsq_depth exceeds a plausibility ceiling (10,000), and threshold evaluation checks the whole report — if all rq_clock values are identical across every CPU and sample, or any sample exceeds the ceiling, the report is classified “not yet initialized” and no per-threshold checks run (this is one of the Inconclusive arms above).
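
The two layers can be sketched as a per-sample check and a whole-report check. The names below (CpuState, sample_is_plausible, report_is_initialized) are hypothetical, and treating an empty report as uninitialized is this sketch's choice, made to line up with the empty-buffer Inconclusive arm.

```rust
/// Plausibility ceiling for a local DSQ depth, as stated in the text.
const DSQ_DEPTH_CEILING: u32 = 10_000;

/// Hypothetical per-CPU slice of one sample (illustrative fields only).
struct CpuState {
    rq_clock: u64,
    local_dsq_depth: u32,
}

/// Layer 1: summary computation skips samples that fail this check.
fn sample_is_plausible(sample: &[CpuState]) -> bool {
    sample.iter().all(|cpu| cpu.local_dsq_depth <= DSQ_DEPTH_CEILING)
}

/// Layer 2: the report is usable only if every sample is plausible and
/// the rq_clock values are not all identical across CPUs and samples.
fn report_is_initialized(samples: &[Vec<CpuState>]) -> bool {
    if samples.iter().any(|s| !sample_is_plausible(s)) {
        return false;
    }
    let mut clocks = samples.iter().flatten().map(|cpu| cpu.rq_clock);
    match clocks.next() {
        None => false, // no data at all
        Some(first) => clocks.any(|clock| clock != first),
    }
}
```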

The monitor never instruments the guest

Everything above is passive memory reading. The one ktstr feature that does load BPF probes into a guest is auto-repro — and it runs them in a separate, disposable repro VM booted after the original test failed, never in the VM whose behavior is being measured. The run your verdict is based on is unperturbed.

How guest memory is read

Three address-translation modes cover the kernel’s address spaces:

  • Text/data/bss: linear offset from the kernel’s static map, for statically-linked kernel variables.
  • Direct mapping: kva - PAGE_OFFSET, for SLAB allocations and per-CPU data.
  • Vmalloc/vmap: a real page-table walk through the guest’s CR3 (4- and 5-level paging on x86_64; 4/16/64 KB granules on aarch64), for BPF maps and vmalloc’d memory.

All reads are bounds-checked and volatile (the guest modifies memory concurrently), and the runtime KASLR offset is recovered at startup so ELF symbols from the matching vmlinux resolve to live guest addresses. vmlinux must match the guest kernel — it supplies both the symbol table and the BTF. The implementing types (GuestMem, GuestKernel, GuestMemMapAccessor) are documented in the monitor module’s rustdoc (cargo doc --document-private-items).

BPF map access

The monitor also discovers and reads/writes the scheduler’s BPF maps directly through guest physical memory — no guest cooperation, no BPF syscalls. Maps are found by walking the kernel’s map_idr and matched by name suffix (".bss" matches "mitosis.bss"); values are read at BTF-resolved offsets, including per-CPU array maps. When a map carries program BTF, the dump renderer uses it to render the value struct field by field — which is why failure dumps show your BPF globals by name.

Tests use this through BpfMapWrite: a host-side write to a BPF map during VM execution. The test runner waits for the scheduler to load (the map becomes discoverable), writes the value, then signals the guest to start the scenario:

const BPF_CRASH: BpfMapWrite = BpfMapWrite::new(".bss", "crash", 1);

#[ktstr_test(bpf_map_write = BPF_CRASH, expect_err = true)]
fn crash_test(ctx: &Ctx) -> Result<AssertResult> {
    Ok(AssertResult::pass())
}

The field’s byte offset and width are resolved from the map’s program BTF at write time (which also disambiguates same-suffix maps by picking the one whose BTF names the field). Only array maps and 4-byte scalar fields are supported. For reading map state from test code, see Snapshots.

What a dump shows: cast analysis

When a test fails, the failure dump renders the scheduler’s BPF map state — with BTF names, not raw bytes. From a real failure:

map bpf_bpf.bss (type=array, value_size=448, max_entries=1)
.bss:
  scx_arena_verify_once=true   ktstr_alloc_count=76   nr_dispatched=907
  nr_enqueued=495              nr_select_cpu=372      stats_magic=6004496034161779060
...
    root 0x100000006000 → sdt_desc:
      nr_free=512
      chunk 0x100000007000 (sdt_alloc) → ktstr_arena_ctx{}
  ktstr_bss_arena_holder ktstr_bss_arena_holder:
    bss_plain_counter=76
    arena_target 0x10000000aa80 (cast→arena) [chase: arena chase: STX-flow path tagged slot as Arena with deferred resolve; bridge had no entry for 0x10000000aa80]

BPF schedulers frequently store kernel and arena pointers in u64 fields, because BTF cannot express a pointer to a per-allocation type. Without help, the renderer would print those as meaningless integers. The cast analyzer closes that gap by analyzing the scheduler binary’s BPF bytecode to learn which u64 fields actually hold pointers, so the renderer can chase them. The annotations tell you what happened:

  • (cast→arena) / (cast→kernel) — the pointer was recovered by cast analysis and chased into arena or kernel memory; the rendered fields after it are real dereferenced state, visually distinct from natively BTF-typed pointers.
  • (sdt_alloc) — the chase resolved the pointee’s type through a live arena-allocator slot (the common pattern for scx_task_data()-style per-task state).
  • [chase: …] — the chase stopped, and this is why. A stopped chase falls back to showing the raw value; the analyzer is deliberately conservative, preferring a raw u64 over chasing garbage.

The analysis is unconditional — no opt-in, no test-author configuration — and applies to every snapshot, periodic capture, and failure dump. Failure dumps are also written machine-readable as a JSON artifact next to the test outputs, carrying the same BTF-resolved field names and cast annotations:

{
  "name": "bpf_bpf.bss",
  "map_kva": 18400526959283003096,
  "map_type": 2,
  "value_size": 448,
  "max_entries": 1,
  "value": {
    "kind": "struct",
    "type_name": ".bss",
    "members": [
      {
        "name": "scx_arena_verify_once",
        "value": {
          "kind": "bool",
          "value": true
        }
      },
...
      {
        "name": "nr_dispatched",
        "value": {
          "kind": "uint",
          "bits": 64,
          "value": 849
        }
      },
...

Every field name here was resolved on the host from the guest’s BTF — the guest wrote nothing but its normal map state. See Reading Failure Output for the full anatomy of a failure report.

Workers and Workloads

Workers are the processes that generate load for scenarios. They run inside the VM, each placed in a cgroup, and each one reports detailed telemetry (WorkerReport) when the workload stops. WorkloadHandle is the RAII handle that owns their whole lifecycle: spawn → place → start → stop and collect → drop.

Spawning

let config = WorkloadConfig {
    num_workers: 4,
    work_type: WorkType::Mixed,
    ..Default::default()
};
let mut handle = WorkloadHandle::spawn(&config)?;

Set only the fields that matter for the test and let ..Default::default() fill in the rest — WorkloadConfig’s default is a known-good single-worker SpinWait baseline, and the spread form keeps examples pinned to intent as fields are added. Consult the WorkloadConfig rustdoc for the current field list. (Do not extrapolate this to every ktstr type: CgroupDef deliberately has no Default because a derived empty name would silently produce an invalid cgroup — use CgroupDef::named(...).)

The worker-creation primitive is selected by CloneMode:

  • CloneMode::Fork (default): forks N child processes; each child installs a SIGUSR1 handler, then blocks on a pipe waiting for the start signal. Each worker has its own tgid, so cgroup.procs placement is per-worker.
  • CloneMode::Thread: spawns N threads inside the harness; each blocks on a rendezvous channel until start(). Workers share the harness’s tgid, so cgroup placement must go through cgroup.threads (see Placement below).

For grouped work types (PipeIo, FutexPingPong, FutexFanOut, MutexContention, ThunderingHerd, and the rest of the communicating families), spawn() validates that num_workers is divisible by the variant’s group size and sets up the inter-worker plumbing the variant requires (pipes, shared futex pages). See Work Types for choosing a variant.
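
The divisibility check amounts to the following. This is an illustrative sketch only: check_group_size is a hypothetical name, and the group size of 2 used in the usage note is an assumption, not a documented value for any variant.

```rust
/// Hypothetical validation helper: returns the number of groups when
/// `num_workers` divides evenly into groups of `group_size`.
fn check_group_size(num_workers: usize, group_size: usize) -> Result<usize, String> {
    if group_size == 0 {
        return Err("group size must be non-zero".to_string());
    }
    if num_workers % group_size != 0 {
        return Err(format!(
            "num_workers ({num_workers}) is not divisible by the group size ({group_size})"
        ));
    }
    Ok(num_workers / group_size)
}
```

For a variant that pairs workers (an assumed group size of 2), four workers form two groups, while five workers are rejected before any process is spawned.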

pcomm containers are not created by spawn() — it bails when a composed WorkSpec::pcomm is set, pointing at WorkloadHandle::spawn_pcomm_cgroup (or the CgroupDef::pcomm path), which spawns one thread-group-leader process hosting N worker threads so the group’s comm matches the pcomm name.

Two-phase start

Workers wait for a “start” signal after spawn:

  1. Parent spawns the worker (fork or thread), which blocks.
  2. Parent moves the worker to its target cgroup.
  3. Parent calls start(), releasing all workers at once.

This ensures workers run inside their target cgroup from the first instruction of their workload — there is no window where load runs in the wrong cgroup and pollutes the measurement.
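
The two-phase idea can be demonstrated with plain std threads: spawn, block, and release. This sketch does not use ktstr's types; Blocked, spawn_blocked, and start_and_join are hypothetical names, and cgroup placement is represented only by the gap between spawning and releasing.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::{channel, Sender};
use std::sync::Arc;
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for a set of spawned-but-blocked workers.
struct Blocked {
    gates: Vec<Sender<()>>,
    joins: Vec<JoinHandle<()>>,
    started: Arc<AtomicUsize>,
}

fn spawn_blocked(n: usize) -> Blocked {
    let started = Arc::new(AtomicUsize::new(0));
    let mut gates = Vec::new();
    let mut joins = Vec::new();
    for _ in 0..n {
        let (tx, rx) = channel::<()>();
        let started = Arc::clone(&started);
        joins.push(thread::spawn(move || {
            // Phase 1: block until the parent releases this worker.
            if rx.recv().is_ok() {
                // Phase 2: the workload body would run here.
                started.fetch_add(1, Ordering::SeqCst);
            }
        }));
        gates.push(tx);
    }
    Blocked { gates, joins, started }
}

impl Blocked {
    /// Release every worker, wait for them, and return how many ran.
    fn start_and_join(self) -> usize {
        for gate in &self.gates {
            let _ = gate.send(());
        }
        for join in self.joins {
            let _ = join.join();
        }
        self.started.load(Ordering::SeqCst)
    }
}
```

Between spawn_blocked and start_and_join no worker has executed any workload code, which is the window in which placement happens.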

Placement

// 1. Spawn workers (blocked, waiting for start signal)
let mut handle = WorkloadHandle::spawn(&config)?;

// 2. Move workers into their target cgroup. `cgroup.procs` is
//    tgid-scoped, so use `worker_pids_for_cgroup_procs()` — it
//    bails for Thread-mode workers (whose pids share the harness's
//    tgid) and points at `cgroup.threads` instead. Plain
//    `worker_pids()` returns the raw pid set without that check.
ctx.cgroups.move_tasks("cg_0", &handle.worker_pids_for_cgroup_procs()?)?;

// 3. Signal workers to start
handle.start();

// 4. Wait for the workload duration
std::thread::sleep(ctx.duration);

// 5. Stop workers and collect telemetry
let reports: Vec<WorkerReport> = handle.stop_and_collect();

Step 2’s Thread-mode bail exists because the kernel resolves any pid written to cgroup.procs to its thread-group leader — writing a Thread-mode worker’s pid there would migrate the entire harness into the test cgroup.

Which placement tool, when:

| You want to | Use |
|---|---|
| Pin one worker to CPUs | handle.set_affinity(idx, cpus) |
| Pin a whole cgroup of workers | CgroupGroup::add_cgroup (writes cpuset.cpus once, RAII-removes on drop) |
| A cgroup that outlives the current scope | CgroupManager directly |

Start and observing progress

start() signals all workers to begin (a start-pipe byte for fork children, a channel send for threads). It is idempotent: a second call is a no-op. Call it after cgroup placement.

snapshot_iterations() reads every worker’s current iteration count from a shared-memory region without stopping anything. Call it periodically during the run window to detect stalls or compute instantaneous rates; final totals come from stop_and_collect().

Stop and collect

stop_and_collect(self) signals workers to stop (SIGUSR1 flips a stop flag in fork children; a per-thread flag for thread workers), then collects each worker’s WorkerReport — read from a report pipe under a shared 5-second deadline for fork children, returned from the thread join for thread workers. It auto-starts workers if start() was never called, and consumes the handle — workers cannot be restarted.

A worker that fails to produce a report (died, timed out, wrote corrupt data) gets a zeroed sentinel report: completed: false, work_units: 0, and exit_info: Some(_) preserving how it ended (Exited(code) / Signaled(sig) / TimedOut / WaitFailed / Panicked). Live-worker reports always carry exit_info: None, so consumers can distinguish “ran to completion and did nothing” from “died before reporting” — and the starvation gate counts dead workers as starved instead of silently passing.
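
A consumer can tell the three cases apart from exit_info and work_units alone. The sketch below uses simplified stand-ins: Report, ExitInfo, WorkerState, classify, and starved_count are hypothetical, the payload types of Exited and Signaled are assumed to be i32, and counting an idle live worker as starved is this sketch's reading of the starvation gate.

```rust
/// Simplified stand-in for the exit information described above.
/// Payload types are assumptions made for this sketch.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq, Eq)]
enum ExitInfo {
    Exited(i32),
    Signaled(i32),
    TimedOut,
    WaitFailed,
    Panicked,
}

/// Simplified stand-in for WorkerReport with only the relevant fields.
#[allow(dead_code)]
struct Report {
    completed: bool,
    work_units: u64,
    exit_info: Option<ExitInfo>,
}

#[derive(Debug, PartialEq, Eq)]
enum WorkerState {
    Progressed,
    RanButIdle,
    DiedBeforeReporting,
}

/// Builds the zeroed sentinel for a worker that produced no report.
fn sentinel(how: ExitInfo) -> Report {
    Report { completed: false, work_units: 0, exit_info: Some(how) }
}

fn classify(report: &Report) -> WorkerState {
    match (&report.exit_info, report.work_units) {
        (Some(_), _) => WorkerState::DiedBeforeReporting,
        (None, 0) => WorkerState::RanButIdle,
        (None, _) => WorkerState::Progressed,
    }
}

/// Dead workers count as starved rather than silently passing.
fn starved_count(reports: &[Report]) -> usize {
    reports
        .iter()
        .filter(|r| classify(r) != WorkerState::Progressed)
        .count()
}
```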

After collection, SIGKILL is delivered to each fork worker’s process group unconditionally to reap stragglers.

Warning

The teardown SIGKILL is a process-group sweep. Every worker calls setpgid(0, 0) after fork, so any child a Custom work function spawns (a helper via execv, a subshell) inherits the worker’s pgid and is SIGKILLed at teardown. A child that must outlive the worker needs setpgid(child_pid, 0) after fork, or an explicit wait before the worker returns its report. Details in Work Types — Custom.

Drop behavior

Dropping a WorkloadHandle without calling stop_and_collect() sends SIGKILL to all child processes (the same process-group sweep) and waits for them, so error paths never leak orphaned workers. Shared mmap regions (futex pages, iteration counters) are unmapped on drop. The type is #[must_use] — an accidentally dropped handle tears its workload down immediately.

Telemetry: WorkerReport

Each worker produces one WorkerReport. The fields you will actually assert on:

| Field | Meaning | Populated by |
|---|---|---|
| work_units | Cumulative work counter; feeds the starvation gate | Every framework work type |
| iterations | Outer-loop count; feeds throughput rates | Every framework work type |
| cpu_time_ns / wall_time_ns / off_cpu_ns | On-CPU vs total vs off-CPU time | Every framework work type |
| migration_count, migrations, cpus_used | Cross-CPU movement | Checked every 1024 work units |
| max_gap_ms (+ _cpu, _at_ms) | Longest wall-clock gap between checkpoints; the starvation/preemption tell | Every framework work type |
| wake_latencies_ns + wake_sample_total | Per-wakeup latency samples | Blocking work types only (futex, pipe, I/O, yield, sleep) |
| iteration_costs_ns + iteration_cost_sample_total | Per-iteration wall-clock cost | Pure-compute variants (AluHot, SmtSiblingSpin, IpcVariance) |
| timer_latencies_ns + timer_sample_total | Timer-wake jitter vs absolute deadline | TimerLatency only |
| schedstat_run_delay_ns / schedstat_run_count / schedstat_cpu_time_ns | /proc/self/schedstat deltas over the work loop | Every framework work type |
| numa_pages, vmstat_numa_pages_migrated | Per-node residency and migration counters | Every framework work type; feed the NUMA checks |
| completed, exit_info | Natural end vs sentinel (see above) | Framework |
| affinity_error, sched_policy_error | Setup calls that failed; worker ran anyway | Framework |

Consult the WorkerReport rustdoc for the full field list and per-field semantics — the table above summarizes, the rustdoc is authoritative.

Semantics worth knowing before asserting:

  • Sampling caps. wake_latencies_ns is reservoir-sampled and capped at 100,000 entries; wake_sample_total keeps counting past the cap. Report “total wakeups” from the total; compute percentiles from the vector. (The cap is pinned by a unit test — max_wake_samples_pins_doc_value — so this paragraph cannot silently rot.)
  • schedstat_run_count is pcount, not context switches. It increments each time the scheduler picks the task to run; a task that keeps running on one CPU does not advance it. For true context-switch counts read /proc/<pid>/status.
  • Checkpoint cadence. Migration and gap checks run when work_units is a multiple of 1024, so a variant contributing N units per outer iteration checks every 1024 / gcd(N, 1024) iterations. Per-variant unit contributions live in the worker source and its rustdoc; the key defaults are pinned by unit tests.
  • Custom populates nothing. The framework fills no telemetry for WorkType::Custom — migration tracking, gap detection, schedstat deltas, and iteration counts exist only if the user’s run function fills them.
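
The checkpoint-cadence formula is easy to check numerically. The sketch below is a standalone calculation, not ktstr code; checkpoint_period and CHECKPOINT_UNITS are hypothetical names for the 1024 / gcd(N, 1024) rule stated above.

```rust
/// Work-unit interval at which migration and gap checks run.
const CHECKPOINT_UNITS: u64 = 1024;

/// Euclid's algorithm.
fn gcd(mut a: u64, mut b: u64) -> u64 {
    while b != 0 {
        let r = a % b;
        a = b;
        b = r;
    }
    a
}

/// Outer iterations between checkpoints for a variant that contributes
/// `units_per_iter` work units per iteration. None for zero units, since
/// such a variant never advances the counter.
fn checkpoint_period(units_per_iter: u64) -> Option<u64> {
    if units_per_iter == 0 {
        return None;
    }
    Some(CHECKPOINT_UNITS / gcd(units_per_iter, CHECKPOINT_UNITS))
}
```

A variant contributing 1 unit per iteration checks every 1024 iterations, one contributing 256 checks every 4, and an odd contribution such as 3 still checks only every 1024 iterations because it shares no factor with 1024.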

What the reports become

Test output rolls WorkerReports up per cgroup. From a real failing run:

--- stats ---
2 workers, 4 cpus, 2 migrations, worst_spread=0.0%, worst_gap=21ms
  cg0: workers=1 cpus=2 spread=0.0% gap=10ms migrations=1 iter=209600
  cg1: workers=1 cpus=2 spread=0.0% gap=21ms migrations=1 iter=189252

iter sums iterations, gap is the worst max_gap_ms, migrations sums migration_count, and cpus counts distinct cpus_used entries. Reading this one: both cgroups made steady progress with sub-25ms worst gaps — the workers were scheduled fine; this failure came from a throughput floor, not starvation. A report showing migrations=0 plus a growing gap on a multi-CPU cpuset would tell the opposite story: the scheduler is not spreading.
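
The roll-up rules in the previous paragraph can be written out directly. This sketch uses simplified stand-ins (Report, CgroupStats, roll_up are hypothetical, and the field types are assumptions); spread is omitted because its formula is not given here.

```rust
use std::collections::BTreeSet;

/// Simplified stand-in for WorkerReport with only the rolled-up fields.
struct Report {
    iterations: u64,
    max_gap_ms: u64,
    migration_count: u64,
    cpus_used: BTreeSet<u32>,
}

/// One line of the per-cgroup stats block.
#[derive(Debug, Default, PartialEq, Eq)]
struct CgroupStats {
    workers: usize,
    cpus: usize,
    gap_ms: u64,
    migrations: u64,
    iter: u64,
}

fn roll_up(reports: &[Report]) -> CgroupStats {
    let mut distinct_cpus = BTreeSet::new();
    let mut stats = CgroupStats::default();
    for r in reports {
        stats.workers += 1;
        stats.iter += r.iterations; // iter sums iterations
        stats.migrations += r.migration_count; // migrations sums migration_count
        stats.gap_ms = stats.gap_ms.max(r.max_gap_ms); // gap is the worst max_gap_ms
        distinct_cpus.extend(r.cpus_used.iter().copied());
    }
    stats.cpus = distinct_cpus.len(); // cpus counts distinct cpus_used entries
    stats
}
```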

How reports become verdicts — thresholds, defaults, and the merge rules — is Checking’s territory.

CgroupManager

sched_ext schedulers see cgroups — weights, cpusets, hierarchy — so ktstr scenarios create, mutate, and destroy cgroups mid-test, and the cleanup has to survive kernel-side hangs a buggy scheduler can cause. CgroupManager is that layer: cgroup v2 filesystem operations under a parent directory, with timeouts and failure caps where the kernel can wedge.

Scenarios reach it through Ctx.cgroups. The typical pattern pairs it with the RAII guard:

fn custom_scenario(ctx: &Ctx) -> Result<AssertResult> {
    // `cpuset` and `config` are assumed to be built earlier in the scenario.
    let mut guard = CgroupGroup::new(ctx.cgroups);
    guard.add_cgroup("cg_0", &cpuset)?;

    let mut h = WorkloadHandle::spawn(&config)?;
    ctx.cgroups.move_tasks("cg_0", &h.worker_pids_for_cgroup_procs()?)?;
    h.start(); // workers block until start() is called

    // ... run workload ...

    // `result` stands for the AssertResult computed from the elided workload.
    // `guard` drops at end of scope and removes cg_0 even on error.
    Ok(result)
}

Bypass CgroupGroup only when the cgroup’s lifetime must outlive the current scope; the RAII wrapper removes the cgroup on every error path, not just the happy one.

Construction

use std::collections::BTreeSet;
use ktstr::cgroup::Controller;

let cgroups = CgroupManager::new("/sys/fs/cgroup/ktstr");
let mut controllers = BTreeSet::new();
controllers.insert(Controller::Cpuset);
controllers.insert(Controller::Cpu);
cgroups.setup(&controllers)?; // create parent dir, enable cpuset + cpu

setup() creates the parent directory, checks each requested controller (Cpuset, Cpu, Memory, Pids, Io) against /sys/fs/cgroup/cgroup.controllers, and enables the requested set on every ancestor down to and including the parent. A missing controller fails early with a diagnostic of this shape rather than a later ENOENT:

cgroup controller 'memory' not available at /sys/fs/cgroup/cgroup.controllers;
cgroup.controllers reports {...}. CONFIG_MEMORY_CONTROLLER may be unset, or
the controller is masked at this level of the hierarchy

Walk root. By default the ancestor walk and task-drain destination is /sys/fs/cgroup (a root-owned tree). with_walk_root(root) retargets both for cgroup-v2 user delegation (systemd Delegate=yes, container nsdelegate): the walk stops at the delegated subtree, and the constructor validates that parent sits at or below it.

Routine operations

| Method | Effect |
|---|---|
| create_cgroup(name) | Create a child directory; idempotent; supports nested paths |
| set_cpuset(name, cpus) / clear_cpuset(name) | Write cpuset.cpus as a compact range string ("0-3,5"); clear inherits the parent |
| set_cpuset_mems / clear_cpuset_mems | NUMA-node analogue (cpuset.mems) |
| move_task(name, pid) | Write one PID to the child’s cgroup.procs |
| set_cpu_max / set_cpu_weight | cpu controller knobs |
| set_memory_max / set_memory_high / set_memory_low / set_memory_swap_max | memory controller knobs |
| set_io_weight / set_pids_max / set_freeze | io / pids / freezer knobs |
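
The compact range string written to cpuset.cpus follows the kernel's list format. As an illustration of that format only (compact_ranges and fmt_range are hypothetical helpers, not ktstr functions), a sorted CPU set can be rendered like this:

```rust
use std::collections::BTreeSet;

/// Render one run of consecutive CPUs as "N" or "N-M".
fn fmt_range(start: u32, end: u32) -> String {
    if start == end {
        start.to_string()
    } else {
        format!("{start}-{end}")
    }
}

/// Render a CPU set in the kernel's list format, merging consecutive
/// CPUs into ranges. An empty set renders as an empty string.
fn compact_ranges(cpus: &BTreeSet<u32>) -> String {
    let mut iter = cpus.iter().copied();
    let Some(mut start) = iter.next() else {
        return String::new();
    };
    let mut end = start;
    let mut parts: Vec<String> = Vec::new();
    for cpu in iter {
        if cpu == end + 1 {
            end = cpu; // extend the current run
        } else {
            parts.push(fmt_range(start, end));
            start = cpu;
            end = cpu;
        }
    }
    parts.push(fmt_range(start, end));
    parts.join(",")
}
```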

The CgroupDef builder routes its per-controller setters through these, and the CgroupOps trait abstracts the surface so scenarios consume &dyn CgroupOps (test doubles substitute cleanly). Cgroup names are validated at every entry point: empty names, leading slashes, NUL bytes, and .. or . path components are rejected.
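
The name rules can be sketched as a single validation function. This is an illustrative reading of the rejection list above, not ktstr's validator: validate_cgroup_name is a hypothetical name, the error strings are invented, and the sketch treats the last rule as rejecting any path component equal to ".." or ".".

```rust
/// Hypothetical validator mirroring the rejection rules listed above.
fn validate_cgroup_name(name: &str) -> Result<(), String> {
    if name.is_empty() {
        return Err("cgroup name is empty".to_string());
    }
    if name.starts_with('/') {
        return Err(format!("cgroup name '{name}' has a leading slash"));
    }
    if name.contains('\0') {
        return Err("cgroup name contains a NUL byte".to_string());
    }
    // Nested paths are allowed, but no component may escape or alias
    // the parent directory.
    for component in name.split('/') {
        if component == ".." || component == "." {
            return Err(format!("cgroup name '{name}' contains a '{component}' component"));
        }
    }
    Ok(())
}
```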

Warning

For nested paths ("nested/leaf"), only +cpuset is propagated to intermediate cgroups’ subtree_control; +cpu, +memory, +pids, and +io are not. A nested leaf exposes cpuset.* knobs, but driving a memory/pids/io knob on it (e.g. CgroupDef::named("nested/leaf").memory_max(N)) fails with ENOENT at apply-setup time. See Troubleshooting for the operator-facing diagnostic.

Operations with a story

move_tasks(name, pids) — moves a batch of PIDs into a child cgroup. Tolerates ESRCH (a task exited between listing and migration) with a warning, but bails when every supplied pid vanished — silence there would mask a dead-worker cascade. Retries transient EBUSY from sched_ext cgroup_prep_move callbacks up to 3 attempts with 100ms backoff, then propagates. And it refuses to write cgroup.procs at all when the destination has cpuset.cpus set but cpuset.mems.effective reads empty — a half-configured cgroup whose kernel behavior is path-dependent; the refusal names the fix (set_cpuset_mems or widen an ancestor).
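
The EBUSY handling is a bounded retry with a fixed backoff. The sketch below shows that pattern in isolation; retry_ebusy is a hypothetical helper, the operation is an arbitrary closure rather than a real cgroup.procs write, and the errno value is the Linux one.

```rust
use std::io;
use std::thread::sleep;
use std::time::Duration;

/// Linux errno value for EBUSY.
const EBUSY: i32 = 16;

/// Run `op` up to `max_attempts` times, sleeping `backoff` between
/// attempts, retrying only on EBUSY. Any other error, or EBUSY on the
/// final attempt, is returned to the caller.
fn retry_ebusy<T>(
    max_attempts: u32,
    backoff: Duration,
    mut op: impl FnMut() -> io::Result<T>,
) -> io::Result<T> {
    let mut attempt = 1;
    loop {
        match op() {
            Err(e) if e.raw_os_error() == Some(EBUSY) && attempt < max_attempts => {
                attempt += 1;
                sleep(backoff);
            }
            other => return other,
        }
    }
}
```

With the values from the text this would be called as retry_ebusy(3, Duration::from_millis(100), ...), giving at most three writes and two 100ms pauses before the error propagates.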

remove_cgroup(name) — auto-unfreezes frozen tasks (a frozen task cannot be reparented), drains tasks to the walk root, waits for cgroup.events to report populated 0 (inotify-driven, 1s deadline), then removes the directory. Draining targets the walk root because the parent has subtree_control set, and the kernel’s no-internal-process constraint rejects task writes to a cgroup with active controllers. Removing a cgroup that does not exist is Ok.

drain_tasks(name) / cleanup_all() — the pieces of the above: drain one cgroup’s tasks to the walk root; or recursively remove every child under the parent, depth-first, draining at each level.

Failure modes

Write timeout. Every cgroup filesystem write runs under a 2-second timeout in a helper thread. A write the kernel never completes (scheduler bug, wedged freezer) errors with

cgroup write to <path> timed out after 2000ms

instead of hanging the test forever.
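
The helper-thread timeout pattern looks roughly like this. It is a generic sketch, not ktstr's code: run_with_timeout is a hypothetical name, the operation is any closure, and a timed-out helper thread is simply left detached.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Run `op` on a helper thread and wait at most `timeout` for its result.
/// If `op` never returns, the helper thread leaks but the caller does not hang.
fn run_with_timeout<T, F>(timeout: Duration, what: &str, op: F) -> Result<T, String>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may be gone if the caller already timed out.
        let _ = tx.send(op());
    });
    match rx.recv_timeout(timeout) {
        Ok(value) => Ok(value),
        Err(mpsc::RecvTimeoutError::Timeout) => {
            Err(format!("{what} timed out after {}ms", timeout.as_millis()))
        }
        Err(mpsc::RecvTimeoutError::Disconnected) => {
            Err(format!("{what} helper thread exited without a result"))
        }
    }
}
```

The cost of this design is that each timed-out write leaves one blocked thread behind, so a caller needs some cap on how many such failures it tolerates.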

Stuck-cgroup cap. Each failed remove increments an outstanding-removes counter (successful removes decrement it). Past 10 outstanding, further remove_cgroup calls fail fast with a message of this shape:

remove_cgroup 'cg_42' refused: 11 cgroups outstanding (cap 10); cgroup.procs
draining wedged or churn loop outpacing the kernel's RCU grace period —
bailing to avoid unbounded cgroupfs accumulation

This bounds the leak from a churn scenario outrunning the kernel’s cleanup instead of accumulating writer threads without limit. outstanding_removes() exposes the count for diagnostics.

See also: CgroupGroup for RAII cleanup, Workers and Workloads for worker lifecycle, Topology for cpuset generation.

CgroupGroup

CgroupGroup is an RAII guard that removes cgroups on drop. It prevents cgroup leaks when workload spawning or any other operation fails between cgroup creation and cleanup.

#[must_use = "dropping a CgroupGroup immediately destroys the cgroups it manages"]
pub struct CgroupGroup<'a> { /* ... */ }

The #[must_use] is deliberate: binding the guard to _ (rather than _guard) drops it immediately and destroys the cgroups before the workload runs.
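
The difference between _ and _guard is a Rust language rule, which a small standalone example makes visible. Guard, guard, and demo are hypothetical names for this demonstration; the guard only records when it is dropped.

```rust
use std::cell::RefCell;
use std::rc::Rc;

type Log = Rc<RefCell<Vec<String>>>;

/// Hypothetical stand-in for a cleanup guard; it records its own drop.
#[must_use = "dropping the guard immediately runs its cleanup"]
struct Guard {
    name: &'static str,
    log: Log,
}

impl Drop for Guard {
    fn drop(&mut self) {
        self.log.borrow_mut().push(format!("removed {}", self.name));
    }
}

fn guard(name: &'static str, log: &Log) -> Guard {
    Guard { name, log: Rc::clone(log) }
}

/// Returns the order of events for both binding styles.
fn demo() -> Vec<String> {
    let log: Log = Rc::new(RefCell::new(Vec::new()));
    {
        let _ = guard("cg_a", &log); // `_` binds nothing: dropped right here
        log.borrow_mut().push("workload a runs".to_string());
    }
    {
        let _guard = guard("cg_b", &log); // named binding: lives to end of scope
        log.borrow_mut().push("workload b runs".to_string());
    }
    let events = log.borrow().clone();
    events
}
```

In the first scope the cleanup runs before the workload line; in the second it runs after, which is the behavior a cgroup guard needs.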

Methods

new(cgroups: &dyn CgroupOps) — creates an empty group bound to any CgroupOps implementor (CgroupManager in production, an in-memory fake in tests).

add_cgroup(name, cpuset) — creates a cgroup and sets its cpuset. Auto-enables the Cpuset controller on the parent’s cgroup.subtree_control first — the difference that matters vs add_cgroup_no_cpuset, which creates the cgroup without a cpuset and without touching controllers. Both track the cgroup for removal on drop.

names() — the names of all tracked cgroups.

Drop behavior

On drop, the group calls remove_cgroup() on each tracked cgroup in reverse insertion order, so nested children are removed before their parents (a parent still holding child directories fails with ENOTEMPTY).
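
Reverse-order teardown is a short Drop implementation. The sketch below is a generic illustration, not CgroupGroup itself: Tracked and removal_order are hypothetical, and the removal action is a closure that only records names.

```rust
/// Hypothetical guard that remembers names in insertion order and
/// invokes a removal callback for each one, newest first, on drop.
struct Tracked<F: FnMut(&str)> {
    names: Vec<String>,
    remove: F,
}

impl<F: FnMut(&str)> Tracked<F> {
    fn add(&mut self, name: &str) {
        self.names.push(name.to_string());
    }
}

impl<F: FnMut(&str)> Drop for Tracked<F> {
    fn drop(&mut self) {
        // Reverse insertion order: children go before their parents.
        for name in self.names.iter().rev() {
            (self.remove)(name);
        }
    }
}

/// Adds `names` to a guard, drops it, and returns the removal order.
fn removal_order(names: &[&str]) -> Vec<String> {
    let mut removed = Vec::new();
    {
        let mut guard = Tracked {
            names: Vec::new(),
            remove: |n: &str| removed.push(n.to_string()),
        };
        for n in names {
            guard.add(n);
        }
    } // guard drops here and runs the removals
    removed
}
```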

ENOENT is the one errno the drop swallows silently: it means the directory is already gone, so the post-condition already holds and no cleanup is owed. (It can legitimately appear via a narrow race between the existence check and remove_dir.) Every other error surfaces as a tracing::warn! record carrying the cgroup name and the full error chain — the drop never panics, but teardown failures are visible in logs rather than silently swallowed. The record’s shape:

CgroupGroup::drop: remove_cgroup returned non-ENOENT error
  cgroup=<name> err=<error chain>
  hint=EBUSY: cgroup still has live tasks — workloads were not drained before teardown

EBUSY at drop means exactly what the hint says: something is still running in the cgroup — typically a WorkloadHandle that outlives the guard, so its workers were never stopped before teardown. Drop (or stop_and_collect) the handle before the guard goes out of scope. EACCES gets its own hint pointing at cgroup ownership and delegation.

Usage

CgroupGroup is the standard cgroup-lifecycle pattern for custom scenarios — CgroupManager shows the full worked example. The shape in brief:

let mut guard = CgroupGroup::new(ctx.cgroups);
guard.add_cgroup("cg_0", &cpuset_a)?;
guard.add_cgroup("cg_1", &cpuset_b)?;
// If anything below fails, `guard` drops and removes both cgroups.

The helper setup_cgroups(ctx, n, &wl) bundles the pattern: it creates n cgroups, spawns workers in each, and returns the handles alongside the guard.

See also: CgroupManager for filesystem operations, Workers and Workloads for worker lifecycle.

CI

ktstr boots KVM microVMs and builds Linux kernels, so a CI job has two unusual needs: runners that expose /dev/kvm, and aggressive caching — the first run on a fresh runner downloads and compiles a full Linux kernel, by far the slowest step in any workflow. Once the kernel cache is warm, the kernel resolves in under a second and wall-clock is dominated by the tests themselves:

cargo ktstr: fetching latest 7.0.x kernel version
cargo ktstr: latest 7.0.x kernel: 7.0.14
cargo ktstr: resolved kernel "7.0"
...
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.23s
────────────
 Nextest run ID 24c18577-cd34-43bd-9d14-b0197701c187 with nextest profile: default
    Starting 1 test across 121 binaries (12531 tests skipped)
        PASS [  34.451s] (1/1) ktstr::failure_dump_e2e ktstr/failure_dump_renders_bss_fields
────────────
     Summary [  34.490s] 1 test run: 1 passed, 12531 skipped

Everything below is a variation on: get KVM, cache the kernel, run the tests, keep the stats. This repo’s own CI is the living reference: .github/workflows/ci.yml.

Runner requirements

GitHub-hosted ubuntu-latest runners do not expose /dev/kvm. Use self-hosted runners with project-specific labels (this repo uses ktstr-x64 and ktstr-arm64; substitute your own pool’s labels):

runs-on: [ktstr-x64]    # x86_64 self-hosted KVM runner
runs-on: [ktstr-arm64]  # aarch64 self-hosted KVM runner

See Troubleshooting: /dev/kvm not accessible for diagnosing KVM on runners, including cloud-VM nested virtualization setup (GCP, AWS, Azure). Runners also need the build dependencies from Getting Started and at least 5 GB of free disk for kernel sources, build artifacts, and cached images. Gauntlet topology presets go up to 252 vCPUs; tests whose preset exceeds the runner’s capacity skip cleanly (or fail under --no-skip-mode), so small runners run a subset rather than breaking.

A minimal workflow

This workflow builds a kernel, caches it, and runs the tests:

name: CI

on:
  push:
    branches: [main]
  pull_request:

jobs:
  test:
    runs-on: [ktstr-x64]
    env:
      KTSTR_GHA_CACHE: "1"
    steps:
      - uses: actions/checkout@v5
      - uses: dtolnay/rust-toolchain@stable
      - uses: taiki-e/install-action@v2
        with:
          tool: cargo-nextest
      - name: Install ktstr
        run: cargo install --path . --locked --features remote-cache
      - name: Cache kernel images
        uses: actions/cache@v4
        with:
          path: ~/.cache/ktstr/kernels
          key: ktstr-kernels-x64-${{ hashFiles('ktstr.kconfig') }}
          restore-keys: ktstr-kernels-x64-
      - name: Build test kernel
        run: cargo ktstr kernel build
      - run: cargo ktstr test -- --profile ci --features integration

The load-bearing lines:

  • KTSTR_GHA_CACHE: "1" enables a remote kernel-cache layer on top of the local one (Caching).
  • The actions/cache key hashes ktstr.kconfig, so a kconfig change invalidates cached kernels.
  • --profile ci selects the nextest profile tuned for contended runners (Nextest CI profile).
  • --features integration enables ktstr’s full end-to-end suite when testing ktstr itself; in a scheduler repo, pass your own crate’s feature flags or drop it.

The test harness auto-discovers the built kernel; to pin versions, use the matrix below.

Kernel pinning

Pin kernel versions via the matrix strategy (this repo’s CI tests 6.14 and 7.1 this way):

strategy:
  fail-fast: false
  matrix:
    kernel-version: ['6.14', '7.1']
# then, in steps:
  - run: cargo ktstr kernel build --kernel ${{ matrix.kernel-version }}
  - run: cargo ktstr test --kernel ${{ matrix.kernel-version }} -- --profile ci --features integration

--kernel tells cargo ktstr test which cached kernel to use at runtime. A major.minor prefix (e.g. 6.14) resolves to the highest patch release in that series; see cargo ktstr kernel for the full resolution chain.

The cache-key footgun: when testing multiple kernel versions, add ${{ matrix.kernel-version }} to the cache key and restore-keys — the minimal workflow’s version-less key would make matrix cells evict each other’s kernels.

Caching

actions/cache persists ~/.cache/ktstr/kernels across runs. KTSTR_GHA_CACHE=1 adds a remote layer that shares kernels across jobs and workflow runs; remote failures are non-fatal and the local cache is authoritative. The remote layer is compiled in only with --features remote-cache (off by default) — without it the variable is a no-op, which is why the install steps above pass the feature.

If you set a global RUSTC_WRAPPER: sccache for compile caching (as this repo’s CI does), sccache must be on $PATH on every targeted runner — x64 and arm64 alike — or the first cargo invocation fails.

Dynamic matrix: cargo ktstr affected

On a fleet repo with many schedulers, cargo ktstr affected emits the scheduler packages a base..HEAD diff touches, as a flat JSON array for a GitHub Actions dynamic matrix — one job per affected scheduler instead of building and testing everything on every push:

cargo ktstr affected                    # vs merge-base(HEAD, main)
# -> e.g. ["scx_lavd","scx_rusty"]

Attribution is the union of the cargo dependency closure (shared Rust library changes) and per-scheduler dep-info parsing of the compiled BPF sources (shared .bpf.c / header includes). The design is fail-safe: a false negative — silently skipping an affected scheduler — is the worst outcome, so every uncertainty (unresolvable base, diff failure, build-graph or Cargo.lock change, unattributable non-docs path) widens to the full testable set, never to a skip. Only a strictly docs-only change (or base == HEAD) emits [].
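
The fail-safe rule can be expressed as a fold in which any uncertainty short-circuits to the full set. This is a conceptual sketch only: Change and affected are hypothetical, and the real command derives its attribution from cargo metadata and dep-info rather than from a pre-classified list.

```rust
use std::collections::BTreeSet;

/// Hypothetical classification of one changed path in the diff.
enum Change {
    /// Strictly documentation; affects no scheduler.
    DocsOnly,
    /// Attributed to these scheduler packages.
    Attributed(Vec<String>),
    /// Could not be attributed with confidence.
    Uncertain,
}

/// Union of attributed schedulers, widening to `all` on any uncertainty.
fn affected(changes: &[Change], all: &BTreeSet<String>) -> BTreeSet<String> {
    let mut out = BTreeSet::new();
    for change in changes {
        match change {
            Change::DocsOnly => {}
            Change::Attributed(names) => out.extend(names.iter().cloned()),
            // Fail-safe: never skip, widen to everything testable.
            Change::Uncertain => return all.clone(),
        }
    }
    out
}
```

An empty change list (base == HEAD) or a docs-only list yields the empty set, matching the [] output described above.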

Only Discover (cargo-package) schedulers appear in the array — package-less schedulers (EEVDF, kernel-builtin) have no package to key a matrix cell on and need a separate unconditional CI leg. On a pull_request event the baseline defaults to merge-base(HEAD, origin/$GITHUB_BASE_REF); check out with full history so the merge-base exists.

jobs:
  matrix:
    runs-on: [ktstr-x64]
    outputs:
      schedulers: ${{ steps.affected.outputs.schedulers }}
    steps:
      - uses: actions/checkout@v5
        with:
          fetch-depth: 0          # merge-base needs history
      - name: Install ktstr
        run: cargo install ktstr --locked
      - id: affected
        run: echo "schedulers=$(cargo ktstr affected)" >> "$GITHUB_OUTPUT"

  test:
    needs: matrix
    if: needs.matrix.outputs.schedulers != '[]'
    runs-on: [ktstr-x64]
    strategy:
      fail-fast: false
      matrix:
        scheduler: ${{ fromJSON(needs.matrix.outputs.schedulers) }}
    steps:
      - uses: actions/checkout@v5
      # ... install ktstr + nextest, restore the kernel cache, and
      #     `cargo ktstr kernel build` as in the minimal workflow ...
      # Adjust the filter to how your repo organizes per-scheduler tests.
      - run: cargo ktstr test -- --profile ci -E 'package(${{ matrix.scheduler }})'

The local counterpart is cargo ktstr test --relevant, which runs the same attribution against your working tree — see cargo-ktstr.

Perf gate on pull requests

cargo ktstr perf-delta --noise-adjust runs the performance_mode tests at HEAD and at the PR’s merge-base, then exits non-zero when metrics regress with statistical confidence — a performance gate in one step. On pull_request events the baseline resolves from $GITHUB_BASE_REF automatically:

perf-gate:
  if: github.event_name == 'pull_request'
  runs-on: [ktstr-x64]
  steps:
    - uses: actions/checkout@v5
      with:
        fetch-depth: 0          # merge-base needs history
    # ... install ktstr + nextest as in the minimal workflow ...
    - run: cargo ktstr perf-delta --noise-adjust 5 --kernel 7.0

Budget for it: --noise-adjust 5 runs every performance_mode test ten times (five per side). Narrow with -E or --relevant, and add --must-fail <metric> for metrics that must never regress. See Runs and Regression Gates for how the verdict is computed and A/B Compare Branches for the local equivalent.
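To size the job before enabling the gate, the arithmetic above can be sketched as follows. The function names and the per-run average are assumptions for illustration; measure a real run to get your own numbers.

```rust
/// VM test executions for one perf-delta invocation: every selected
/// test runs `noise_adjust` times on each of the two sides.
fn perf_delta_runs(tests: u64, noise_adjust: u64) -> u64 {
    tests * noise_adjust * 2
}

/// Rough serial wall-clock estimate, given an assumed average
/// duration per run in seconds.
fn estimated_secs(tests: u64, noise_adjust: u64, avg_secs_per_run: u64) -> u64 {
    perf_delta_runs(tests, noise_adjust) * avg_secs_per_run
}
```

For example, 12 performance_mode tests with --noise-adjust 5 come to 120 runs under this model.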

Budget-based test selection

Set KTSTR_BUDGET_SECS (e.g. "300") on the test step to bound a smoke-test job: the selector greedily picks the tests that maximize feature coverage within the time budget. See Running Tests for the selection model.
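One way to picture a greedy, coverage-maximizing selector is the sketch below: on each round it picks the test that adds the most not-yet-covered features per second and still fits in the remaining budget. The `Candidate` struct, its fields, and the scoring rule are assumptions for illustration; the actual selection model is described in Running Tests.

```rust
use std::collections::HashSet;

/// A candidate test: estimated runtime plus the features it exercises.
/// Field names are illustrative only.
struct Candidate {
    name: &'static str,
    secs: u64,
    features: &'static [&'static str],
}

/// Greedily pick tests by new features per second until nothing
/// useful fits in `budget_secs` any more.
fn select(candidates: &[Candidate], budget_secs: u64) -> Vec<&'static str> {
    let mut covered: HashSet<&str> = HashSet::new();
    let mut taken = vec![false; candidates.len()];
    let mut picked = Vec::new();
    let mut remaining = budget_secs;
    loop {
        let mut best: Option<(usize, f64)> = None;
        for (i, c) in candidates.iter().enumerate() {
            if taken[i] || c.secs > remaining {
                continue;
            }
            // Count only features no earlier pick has covered.
            let gain = c.features.iter().filter(|f| !covered.contains(**f)).count();
            if gain == 0 {
                continue;
            }
            let score = gain as f64 / c.secs.max(1) as f64;
            if best.map_or(true, |(_, s)| score > s) {
                best = Some((i, score));
            }
        }
        let Some((i, _)) = best else { break };
        taken[i] = true;
        remaining -= candidates[i].secs;
        covered.extend(candidates[i].features.iter().copied());
        picked.push(candidates[i].name);
    }
    picked
}
```

A greedy pass like this is not guaranteed to be optimal, but it is cheap and tends to put broad, short tests first, which suits a smoke-test job.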

Coverage

Same job shape as the minimal workflow, with the llvm-tools-preview rustup component and cargo-llvm-cov added, and the test step swapped for:

- run: cargo ktstr coverage -- --profile ci --lcov --output-path lcov.info --features integration --exclude-from-report scx-ktstr

--exclude-from-report <crate> keeps scheduler crates out of the coverage report — the example excludes scx-ktstr, ktstr’s own fixture scheduler.

Test statistics

- name: Test statistics
  if: ${{ !cancelled() }}
  run: cargo ktstr stats

stats reads the sidecar JSON files under target/ktstr/ and prints gauntlet analysis, BPF verifier stats, callback profile, and KVM stats (Runs and Regression Gates). The if: ${{ !cancelled() }} condition collects stats even when the test step failed, which is exactly when you want them. Keep the ${{ }} wrapper: a bare leading ! is parsed as a YAML tag.

aarch64

aarch64 runners use the same workflows with two substitutions: runner labels ([ktstr-arm64] or your pool’s) and the cache-key prefix (arm64 instead of x64). The guest image name differs (Image instead of bzImage) but ktstr handles that internally.

Performance mode

CI runners often lack CAP_SYS_NICE, rtprio limits, or enough host CPUs for exclusive LLC reservation. Set KTSTR_NO_PERF_MODE: "1" on the test step to disable performance mode; tests with performance_mode=true are then skipped entirely. See Performance Mode, and Tests pass locally but fail in CI for the wider skip/fail triage.

Nextest CI profile

The workspace ships a ci profile in .config/nextest.toml. VM boots on a contended runner run slower and flake differently than on a dev box, so the CI profile trades latency for stability — longer slow-timeouts, one more retry, deferred failure output, and no fail-fast:

[profile.ci]
slow-timeout = { period = "90s", terminate-after = 3 }
retries = { backoff = "exponential", count = 6, delay = "1s", jitter = true, max-delay = "3s" }
failure-output = "final"
fail-fast = false

# Heavier test classes get their own budgets, e.g.:
[[profile.ci.overrides]]
filter = "test(verifier_)"
slow-timeout = { period = "180s", terminate-after = 3 }

Use it with --profile ci. If a test in your repo drives an unusually slow boot (huge topology, nested VM), give it its own override rather than raising the profile-wide timeout.
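To get a feel for how much wall-clock the retry policy can add to a flaky test, the sketch below lists the delays between attempts. It assumes the exponential backoff doubles the delay on every retry and clamps it to max-delay, and it ignores jitter; check the nextest documentation for the exact behavior.

```rust
/// Delays (in seconds) before each retry for an exponential policy
/// without jitter: `delay`, doubling each time, clamped to `max_delay`.
fn retry_delays_secs(count: u32, delay: u64, max_delay: u64) -> Vec<u64> {
    (0..count)
        .map(|i| {
            // 2^i, saturating instead of overflowing for large counts.
            let factor = 1u64.checked_shl(i).unwrap_or(u64::MAX);
            delay.saturating_mul(factor).min(max_delay)
        })
        .collect()
}
```

Under these assumptions the ci profile's retries setting (count = 6, delay = 1 s, max-delay = 3 s) waits 1, 2, 3, 3, 3, 3 seconds, so the waiting itself is small next to the cost of re-running a VM test.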

The CI-relevant environment variables are KTSTR_GHA_CACHE, KTSTR_BUDGET_SECS, KTSTR_NO_PERF_MODE, KTSTR_KERNEL, KTSTR_CI (tags sidecars as CI-produced), and KTSTR_CACHE_DIR — see the full reference.

Troubleshooting

Find your error message, jump to its section:

| You see | Go to |
| --- | --- |
| clang: No such file or directory | Build errors |
| pkg-config: command not found | Build errors |
| autoreconf: command not found | Build errors |
| busybox build requires 'make' | Build errors |
| no BTF source found | BTF errors |
| failed to obtain busybox source | busybox download failure |
| /dev/kvm not found / permission denied | /dev/kvm not accessible |
| no kernel found | No kernel found |
| scheduler 'NAME' not found | Scheduler not found |
| scheduler process died unexpectedly | Scheduler died |
| scheduler did not turn on + verifier log | Scheduler fails the BPF verifier |
| libbpf: … func_proto … incompatible with vmlinux | Scheduler cannot load: kfunc BTF mismatch |
| send_sys_rdy failed within boot budget | send_sys_rdy timeout |
| no 2MB hugepages available | Insufficient hugepages |
| tid N stuck … / unfair cgroup: spread=… | Worker assertion failures |
| cgroup-state-snapshot: … | Cgroup name typos |
| requires +cpu in parent cgroup.subtree_control | Cgroup controller not enabled |
| CpusetSpec validation failed | CpusetSpec errors |
| requires num_workers divisible by | Worker count mismatches |
| (corrupt: metadata.json malformed…) | Cache corruption |
| HOME is unset; cannot resolve cache directory | Cache directory not found |
| entries marked (stale kconfig) | Stale kconfig |
| fetch https://www.kernel.org/releases.json: … | Kernel auto-download failures |
| version X not found / RC tarball not found | Kernel download failures |
| stdin must be a terminal / -i NAME: not found | Shell mode issues |
| flock LOCK_EX … timed out / filesystem NFS is not supported | Flock timeout / NFS rejection |
| test marked SLOW, then killed by nextest | Test hangs / nextest timeout |
| green locally, red in CI | Tests pass locally but fail in CI |

Build errors

clang not found

error: failed to run custom build command for `ktstr`
  ...
  clang: No such file or directory

The BPF skeleton build (libbpf-cargo) invokes clang to compile .bpf.c sources. Install clang:

  • Debian/Ubuntu: sudo apt install clang
  • Fedora: sudo dnf install clang

pkg-config not found

error: failed to run custom build command for `libbpf-sys`
  ...
  pkg-config: command not found

libbpf-sys uses pkg-config during its vendored build. Install it:

  • Debian/Ubuntu: sudo apt install pkg-config
  • Fedora: sudo dnf install pkgconf

autotools errors (autoconf, autopoint, aclocal)

autoreconf: command not found
aclocal: command not found
autopoint: command not found

The vendored libbpf-sys build compiles bundled libelf and zlib from source using autotools. These libraries are not system dependencies – they ship with libbpf-sys – but the autotools toolchain is needed to build them. Install:

  • Debian/Ubuntu: sudo apt install autoconf autopoint flex bison gawk
  • Fedora: sudo dnf install autoconf gettext-devel flex bison gawk

make or gcc not found

busybox build requires 'make' — install build-essential (Debian/Ubuntu) or base-devel (Fedora/Arch)
busybox build requires 'gcc' — install build-essential (Debian/Ubuntu) or base-devel (Fedora/Arch)

The build script compiles busybox from source for guest shell mode.

  • Debian/Ubuntu: sudo apt install make gcc
  • Fedora: sudo dnf install make gcc

BTF errors

no BTF source found. Set KTSTR_KERNEL to a kernel build directory,
or ensure /sys/kernel/btf/vmlinux exists.

build.rs generates vmlinux.h from kernel BTF data. It searches the kernel discovery chain (KTSTR_KERNEL, ./linux, ../linux, installed kernel) for a vmlinux file, falling back to /sys/kernel/btf/vmlinux. Most distro kernels are built with CONFIG_DEBUG_INFO_BTF and therefore expose /sys/kernel/btf/vmlinux.

Fixes:

  • Verify BTF is available: ls /sys/kernel/btf/vmlinux
  • If missing, set KTSTR_KERNEL to a kernel build directory that contains a vmlinux with BTF: export KTSTR_KERNEL=/path/to/linux
  • Build a kernel with CONFIG_DEBUG_INFO_BTF=y.
  • Some minimal/cloud kernels strip BTF. Use a distro kernel or build your own.

busybox download failure

failed to obtain busybox source after 4 attempts.
  tarball (https://github.com/mirror/busybox/archive/refs/tags/1_36_1.tar.gz): ...
  Remediation:
    • Check network connectivity (the build script needs HTTPS access to github.com to fetch the upstream tarball).
    • If behind a proxy, ensure HTTP_PROXY/HTTPS_PROXY environment variables are set.
    • Or set KTSTR_BUSYBOX_TARBALL=<path> to point at a pre-fetched local copy.
    • Or set KTSTR_SKIP_BUSYBOX_BUILD=1 to skip the busybox compile entirely (shell mode will be unavailable).

build.rs downloads the busybox tarball on first build (4 attempts with backoff); subsequent builds use the cached binary. Follow the remediation lines in the error itself — after one successful build, no network access is needed unless cargo clean removes the cached binary.

/dev/kvm not accessible

The host-side pre-flight emits one of the following, depending on whether the device node is missing or merely unreadable:

/dev/kvm not found. KVM requires:
  - Linux kernel with KVM support (CONFIG_KVM)
  - Access to /dev/kvm (check permissions or add user to 'kvm' group)
  - Hardware virtualization enabled in BIOS (VT-x/AMD-V)

/dev/kvm: permission denied. Add your user to the 'kvm' group:
  sudo usermod -aG kvm $USER
  then log out and back in.

ktstr boots Linux kernels in KVM virtual machines. The host must have KVM enabled and the user must have read+write access to /dev/kvm.

Diagnose:

  • ls -l /dev/kvm — typical output: crw-rw---- 1 root kvm 10, 232 ....
  • getent group kvm — confirm the group exists and see its members.

Fixes:

  • Load the KVM module as root: sudo modprobe kvm_intel (Intel hosts) or sudo modprobe kvm_amd (AMD hosts).
  • Follow the group-membership hint in the error text (log out and back in afterward).
  • On cloud VMs (GCP, AWS, Azure) or nested hypervisors, nested virtualization is typically off by default. Enable it per the provider’s instructions (e.g. GCP --enable-nested-virtualization, AWS .metal instance types, Azure Dv3/Ev3+ with nested virt).
  • In CI, ensure the runner has KVM access — see CI.

No kernel found

no kernel found — the test harness was likely invoked outside `cargo ktstr test` (which builds and injects a kernel automatically).
  hint: run `cargo ktstr test --kernel <path-or-version>` to drive this test, or set KTSTR_TEST_KERNEL=/path/to/{bzImage|Image} to point at a pre-built bootable image directly.
  hint: set KTSTR_KERNEL to one of: exact version (`6.14`), inclusive range (`6.14..7.0` or `6.14..=7.0`), git source (`git+URL#tag=NAME`, `git+URL#branch=NAME`, or `git+URL#sha=<40-hex>`), absolute or `~`-prefixed path, or cache key. List cached keys with `cargo ktstr kernel list`; build new ones with `cargo ktstr kernel build`

On aarch64 the first hint’s image filename is Image instead of bzImage. ktstr needs a bootable kernel image; see cargo ktstr kernel for the discovery chain. ktstr shell and cargo ktstr shell auto-download the latest stable kernel when nothing is found — see Kernel auto-download failures for download-specific errors.

Fixes:

  • Download and cache a kernel: cargo ktstr kernel build
  • Build from a local tree: cargo ktstr kernel build --kernel ../linux
  • Set KTSTR_TEST_KERNEL to an explicit image path.
  • The host’s installed kernel is the last entry in the discovery chain and works for basic testing.

Scheduler not found

scheduler 'scx_mitosis' not found. Set KTSTR_SCHEDULER or
place it next to the test binary or in target/{debug,release}/

SchedulerSpec::Discover resolves the scheduler binary entirely on the host. The order depends on how the test was launched:

Under cargo ktstr test (the normal path):

  1. KTSTR_SCHEDULER_BIN_<NAME>, then KTSTR_SCHEDULER env overrides.
  2. cargo build -p <scheduler> — the build runs up front, so an edited scheduler is never validated against a stale pre-built binary. If that build fails, the test hard-fails rather than falling back; set KTSTR_SCHEDULER_ALLOW_STALE_FALLBACK=1 to re-enable the sibling / target/{debug,release}/ pre-built fallback while the workspace build is broken.

Under bare cargo test / cargo nextest run (marked with KTSTR_CARGO_TEST_MODE=1):

  1. The env overrides, with $PATH also consulted — so an installed scheduler binary resolves without an in-tree build.
  2. Sibling of the test binary, then the target/release/ and target/debug/ build dirs — the scheduler’s build profile (release by default) is probed first.
  3. The on-demand cargo build -p <scheduler> runs last, only after the pre-built probes miss.

Fixes:

  • cargo build -p scx_mitosis — on the orchestrated path this only primes the cache; on the bare path it makes the probe hit.
  • Set KTSTR_SCHEDULER=/path/to/binary (or the per-name KTSTR_SCHEDULER_BIN_<NAME> variant).
  • Use SchedulerSpec::Path for an explicit path.

Scheduler died

scheduler process died unexpectedly after completing step 2 of 5 (12.3s into test)

The scheduler process died while the scenario was running — usually a crash. The exact message varies by when the crash was detected. The failure output contains diagnostic sections (each present only when relevant): --- scheduler log --- (the scheduler’s own output, cycle-collapsed), --- diagnostics --- (init stage, VM exit code, kernel console tail), and --- sched_ext dump --- (when a SysRq-D dump fired). Set RUST_BACKTRACE=1 to force --- diagnostics --- on all failures.

Next steps:

  • Read the --- scheduler log --- for the crash reason; see Reading Failure Output for the full section-by-section anatomy.
  • A second VM automatically reproduces the crash with BPF probes attached — see Auto-Repro.
  • Follow Investigate a Crash for the crash-to-pin workflow.

Scheduler fails the BPF verifier

verifier
  scheduler: NOT ATTACHED — scheduler process exited during BPF load/startup

verifier --- verifier stats ---
  processed=186  states=7/7

verifier --- scheduler log ---
Global function ktstr_dispatch() doesn't return scalar. Only those are supported.
0: R1=ctx() R10=fp0
; if (crash) @ main.bpf.c:423
0: (18) r1 = 0xff5d3bb3000f60dc       ; R1=map_value(map=bpf_bpf.bss,ks=4,vs=280,off=220)
...
; *p = (int)acc; @ main.bpf.c:464
191: (61) r2 = *(u32 *)(r10 -8)       ; R2=scalar(id=53,smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) R10=fp0 fp-8=mmmmscalar(id=53,smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))
192: (63) *(u32 *)(r1 +0) = r2
R1 invalid mem access 'scalar'
processed 186 insns (limit 1000000) max_states_per_insn 0 total_states 7 peak_states 7 mark_read 0

The in-guest BPF verifier rejected the program, so the scheduler never attached. Read the log bottom-up: the lines just above the final processed … summary name the rejection (R1 invalid mem access 'scalar') and the failing instruction (192), and the nearest source comment above it (@ main.bpf.c:464) maps that instruction to a C line. The first line is the verifier’s summary of the top-level complaint.

Verifier acceptance depends on kernel version and topology — values like nr_cpus bake into .rodata, so a program that verifies on one CPU count can blow up on another. Sweep your scheduler across kernels and topologies with cargo ktstr verifier, which also collapses repeated loop iterations (--- N identical iterations omitted ---) so real rejections stay readable.

Scheduler cannot load: kfunc BTF mismatch

--- scheduler log ---
libbpf: extern (func ksym) 'scx_bpf_create_dsq': func_proto [755] incompatible with vmlinux [54769]
libbpf: failed to load BPF skeleton 'bpf_bpf': -EINVAL
Error: Failed to load BPF program

ktstr surfaces this as scheduler did not turn on — scheduler process exited during BPF load/startup in verifier cells, or as a scheduler death / no test result received from guest in test runs — with the libbpf lines above in the scheduler log.

The cause is the kernel image, not your scheduler. Newer kernels (first released in v7.1) give scx kfuncs an implicit trailing struct bpf_prog_aux *aux argument; kernel build tooling (resolve_btfids, driven by pahole’s decl_tag_kfuncs BTF feature) is supposed to publish a BPF-facing twin of each kfunc with the trimmed prototype so schedulers built against released scx headers and libbpf still match. When the toolchain drops that tag for a kfunc — observed with some pahole builds, and varying by config — the plain-name prototype keeps the extra argument and no released scheduler can load on that kernel.

Check any kernel in one command:

bpftool btf dump file <vmlinux> format raw | grep -E "FUNC 'scx_bpf_(create_dsq|error_bstr)"
# loadable: the plain name points at a trimmed proto (no 'aux' param)
# broken:   a single 4-arg entry — released libbpf/scx headers cannot match it

Warning

expect_err = true tests invert this load failure into a pass, and post_vm assertions skip when the scheduler never attached — so a suite can look green with zero schedulers ever loading. If a kernel’s expect_err tests all “pass” while everything else reports the scheduler never turned on, check the kernel’s BTF before trusting the run.

Fixes:

  • Test against a kernel whose BTF passes the check above (kernels before the implicit-args change, e.g. --kernel 7.0 or --kernel 6.14, are unaffected).
  • Rebuild the kernel with a pahole/toolchain combination that preserves the kfunc tags, and re-run the check.

send_sys_rdy timeout

WARN ktstr::vmm::rust_init: ktstr-init: send_sys_rdy failed within boot budget; see https://ktstr.dev/guide/troubleshooting.html#send_sys_rdy-timeout budget_ms=11200 vcpus=8 elapsed_ms=11342 port_exists=false kern_addrs_sent=false

The guest init could not send its “ready” signal to the host within the boot budget (10 s plus 150 ms per vCPU, capped at 90 s). The WARN itself is non-fatal — the guest continues and the host starts sampling anyway — but the test usually then fails through the normal VM-teardown path (see Scheduler died); the authoritative deadline is the host watchdog, which scales with host overcommit.
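The budget formula above is small enough to check by hand. The sketch below reproduces it with the constants from the text; the function name is made up for this example.

```rust
/// Guest boot budget in milliseconds: 10 s base plus 150 ms per vCPU,
/// capped at 90 s (constants taken from the text above).
fn boot_budget_ms(vcpus: u64) -> u64 {
    const BASE_MS: u64 = 10_000;
    const PER_VCPU_MS: u64 = 150;
    const CAP_MS: u64 = 90_000;
    BASE_MS
        .saturating_add(PER_VCPU_MS.saturating_mul(vcpus))
        .min(CAP_MS)
}
```

For the sample WARN line, vcpus=8 gives 11200 ms, which matches its budget_ms=11200 field, and elapsed_ms=11342 is just past it.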

The diagnostic fields split the cause in two:

  • port_exists=false — the virtio-console port device never appeared in the guest. Almost always a slow or starved boot (or an early guest panic — check the --- diagnostics --- console tail).
  • port_exists=true — the port exists but writes did not complete. This is a host-side virtio-console issue, not guest CPU contention; file a bug with the failure dump.

Fixes (for the port_exists=false case):

  • Pass --no-perf-mode (or KTSTR_NO_PERF_MODE=1) to reduce host-side contention starving the guest’s vCPU threads.
  • Reduce the test’s topology — fewer vCPUs boot faster.
  • KASAN / KCSAN / lockdep kernels add substantial boot overhead; re-run on a non-instrumented kernel to separate instrumentation cost from a real stall.

Insufficient hugepages

performance_mode: WARNING: no 2MB hugepages available, guest memory will use regular pages
performance_mode: WARNING: need N 2MB hugepages, only K free — falling back to regular pages

Performance mode requests 2MB hugepages for guest memory. The first form fires when none are reserved on the host; the second when fewer than the run needs. In both cases the VM falls back to regular pages and continues to boot.

Fix:

echo 2048 | sudo tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

Worker assertion failures

tid 2 stuck 4500ms on cpu2 at +3200ms (threshold 3000ms)
unfair cgroup: spread=42% (8-50%) 4 workers on 4 cpus (threshold 35%)

The Assert checks (max_gap_ms, max_spread_pct, etc.) detected a worker metric outside the configured thresholds. The tid N prefix names the thread so you can cross-reference the --- timeline --- and --- stats --- sections, which key per-thread metrics by tid; unfair cgroup is per-cgroup and cross-references the per-cgroup spread / workers / cpus columns in --- stats --- instead.
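The numbers in the unfair cgroup line can be read as a simple range check. The sketch below assumes spread is the gap between the highest and lowest per-worker percentage; that definition is inferred from the sample message (50% - 8% = 42%) and is not taken from ktstr's source. See Checking for the authoritative definition.

```rust
/// Spread in percentage points between the best- and worst-served
/// worker. Returns None for an empty slice.
/// Definition inferred from the sample message, not from ktstr itself.
fn spread_pct(per_worker_pct: &[f64]) -> Option<f64> {
    if per_worker_pct.is_empty() {
        return None;
    }
    let min = per_worker_pct.iter().copied().fold(f64::INFINITY, f64::min);
    let max = per_worker_pct.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    Some(max - min)
}

/// True when the spread exceeds the configured threshold.
fn is_unfair(per_worker_pct: &[f64], threshold_pct: f64) -> bool {
    spread_pct(per_worker_pct).map_or(false, |s| s > threshold_pct)
}
```

With four workers at 8%, 30%, 45%, and 50%, the spread under this reading is 42 points, above the 35% threshold shown in the message.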

Fixes:

  • Check whether the topology has enough CPUs for the scenario — small topologies produce higher contention, larger gaps, and more spread.
  • Override thresholds for scenarios that need relaxed limits — see Customize Checking.
  • Check the scheduler’s behavior under the specific flag profile that triggered the failure.

Cgroup name typos

A typo’d cgroup name surfaces only when an op tries to write to a non-existent cgroup directory; names are not pre-validated. The diagnostic depends on which op references the typo:

  • Op::RemoveCgroup / Op::StopCgroup against a typo silently succeed (rmdir / kill against a non-existent path are no-ops); the failure surfaces on the next op that touches the name.

  • Op::SetCpuset falls through to the kernel’s ENOENT, wrapped with a one-line cgroup-state snapshot:

    cgroup-state-snapshot: parent=/sys/fs/cgroup/ktstr name=nonexistent parent.cgroup.controllers="cpuset cpu memory io pids" parent.cgroup.subtree_control="cpuset cpu memory" child.cgroup.controllers="<read failed: No such file or directory (os error 2)>" child.cpuset.cpus.exists=false child.listing=<read_dir failed: No such file or directory (os error 2)>: No such file or directory (os error 2)
    

    The child.listing=<read_dir failed: ...> segment is the tell: a typo’d name has no directory to list, distinguishing this from “cgroup exists but the write was rejected” (where the listing would enumerate the cgroupfs knobs).

  • Other setters (cpu.max, memory.max, cpuset.mems, …) against a typo produce the same wrapped form as Cgroup controller not enabled — distinguish by checking whether the directory exists.

  • Op::AddCgroup colliding with an already-tracked name bails:

    Op::AddCgroup 'cg_0' collides with a cgroup already tracked (by a prior Backdrop or step-local CgroupDef) — declare it in exactly one place; use a fresh name for the step-local cgroup
    

Fixes: verify the name matches its Op::AddCgroup / CgroupDef::named() / Backdrop.cgroups declaration, and that dynamically formatted names (format!("cg_{i}")) use the same formatting everywhere.
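If you triage these failures in a script, the child.listing tell described above can be checked mechanically. This is a minimal sketch that treats the snapshot as plain text; the helper names are hypothetical, and the snapshot layout is only what the sample line shows.

```rust
/// Return the text following `key=` in a one-line snapshot, up to the
/// end of the line. Values may contain spaces, so no further splitting
/// is attempted here.
fn field<'a>(snapshot: &'a str, key: &str) -> Option<&'a str> {
    let needle = format!("{key}=");
    snapshot.find(&needle).map(|i| &snapshot[i + needle.len()..])
}

/// True when the snapshot says the child directory could not be
/// listed, which points at a typo'd (non-existent) cgroup name.
fn is_missing_cgroup(snapshot: &str) -> bool {
    field(snapshot, "child.listing")
        .map_or(false, |rest| rest.starts_with("<read_dir failed"))
}
```

A snapshot whose child.listing holds anything other than a read_dir failure means the directory exists, so look at Cgroup controller not enabled instead.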

Cgroup controller not enabled

cgroup 'cg_0': set cpu.max='100000 100000' (requires +cpu in parent cgroup.subtree_control): No such file or directory (os error 2)
cgroup 'cg_0': set memory.max='4294967296' (requires +memory in parent cgroup.subtree_control): No such file or directory (os error 2)
cgroup 'cg_0': set memory.swap.max='1073741824' (requires +memory in parent cgroup.subtree_control; file absent on CONFIG_SWAP=n kernels): No such file or directory (os error 2)
cgroup 'cg_0': set cpuset.mems='0-1' (requires +cpuset in parent cgroup.subtree_control): No such file or directory (os error 2)

The cgroup exists but the controller knob is missing from its directory. ktstr’s setup auto-enables the controllers it detects on the scenario’s CgroupDef / Op set, so a missing controller means either: the framework’s detection did not see a declared knob (file a bug); an outer parent (systemd user.slice, container runtime) stripped controllers from the subtree before ktstr ran; or the kernel was built without CONFIG_SWAP (the memory.swap.max wrap spells this out).

Diagnostic command:

cat /sys/fs/cgroup/<parent>/cgroup.subtree_control

A controller named in the wrapped error must appear in this list; if it does not, fix the parent first (echo '+memory' > .../cgroup.subtree_control from a sufficiently-privileged shell) or remove the knob from the scenario.
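The same check can be done programmatically on the file's contents. The sketch below assumes you have already read cgroup.subtree_control into a string; the function name is made up for this example.

```rust
/// True when `controller` (e.g. "memory") is listed in the contents of
/// a cgroup.subtree_control file, which is a whitespace-separated list.
fn controller_enabled(subtree_control: &str, controller: &str) -> bool {
    subtree_control.split_whitespace().any(|c| c == controller)
}
```

Matching whole tokens matters here: a substring search for cpu would also hit cpuset.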

CpusetSpec errors

cgroup 'cg_0': CpusetSpec validation failed: not enough usable CPUs (4) for 8 partitions
cgroup 'cg_1': CpusetSpec validation failed: index 3 >= partition count 3
cgroup 'cg_2': CpusetSpec validation failed: Range fracs must lie in [0.0, 1.0]: start_frac=-1, end_frac=0.5

A CpusetSpec cannot produce a valid cpuset for the test topology; the step aborts as a hard error before any downstream slicing runs.

Fixes:

  • Guard with a topology check before creating the step: if ctx.topo.usable_cpus().len() < needed { return Ok(AssertResult::skip(...)); }
  • Call CpusetSpec::validate(&ctx) in your scenario builder so failures surface before execute_steps runs.
  • Reduce the partition count, or use CpusetSpec::Llc instead of Disjoint on topologies with fewer CPUs than partitions.
  • For Range/Overlap, keep fractions finite and inside [0.0, 1.0]; Range additionally requires start_frac < end_frac.
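The Range rules in the last bullet can be pre-checked in a scenario builder before any step runs. The helper below is a hypothetical stand-alone sketch of those rules, with its first error text modeled on the sample message; in a real test, prefer CpusetSpec::validate.

```rust
/// Check a Range-style fraction pair: both values finite and inside
/// [0.0, 1.0], and start strictly below end.
/// Hypothetical helper, not part of ktstr's API.
fn check_range(start_frac: f64, end_frac: f64) -> Result<(), String> {
    let in_unit = |f: f64| f.is_finite() && (0.0..=1.0).contains(&f);
    if !in_unit(start_frac) || !in_unit(end_frac) {
        return Err(format!(
            "Range fracs must lie in [0.0, 1.0]: start_frac={start_frac}, end_frac={end_frac}"
        ));
    }
    if start_frac >= end_frac {
        return Err(format!(
            "start_frac ({start_frac}) must be < end_frac ({end_frac})"
        ));
    }
    Ok(())
}
```

NaN and infinities fail the first check because is_finite rejects them before the range test runs.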

Worker count mismatches

PipeIo (group 0) requires num_workers divisible by 2, got 3

Grouped work types (PipeIo, FutexPingPong, CachePipe, FutexFanOut, FanOutCompute, and the contention / waker families — see Workers and Workloads) require num_workers divisible by their group size. The (group N) segment names the composed entry the violation belongs to, so multi-group scenarios point at the entry to fix.

Fixes:

  • Set CgroupDef::workers(n) to a multiple of the work type’s group size (2 for pipe/futex pairs, fan_out + 1 for FutexFanOut and FanOutCompute).
  • Use an ungrouped work type (SpinWait, Mixed, Bursty, IoSyncWrite, IoRandRead, IoConvoy, YieldHeavy) if worker count flexibility is needed.
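When the worker count is derived from the topology, rounding it to the group size up front avoids the error. A minimal sketch, using hypothetical helper names and the group sizes listed above:

```rust
/// Group size for the fan-out work types (the `fan_out + 1` rule above).
fn fan_out_group_size(fan_out: usize) -> usize {
    fan_out + 1
}

/// Round `wanted` up to the next multiple of `group_size`.
fn round_up_workers(wanted: usize, group_size: usize) -> usize {
    assert!(group_size > 0, "group size must be non-zero");
    wanted.div_ceil(group_size) * group_size
}
```

For the sample error, round_up_workers(3, 2) yields 4, which is the smallest count PipeIo accepts at or above 3.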

Cache corruption

  6.14.2-tarball-x86_64-kc...                 (corrupt: metadata.json malformed: ...)
warning: entries marked (corrupt) cannot be used — cached metadata is missing, malformed, or references a missing image. Inspect the entry directory under ~/.cache/ktstr/kernels to remove it manually, or run `kernel clean --corrupt-only --force` which removes ONLY corrupt entries and leaves valid ones intact. ...

A cached kernel entry has missing, unparseable, or schema-drifted metadata.json, or references an image that is no longer present — typically after a partial write (disk full, killed process) or a ktstr upgrade that changed the metadata schema. Corrupt entries are never used; runs fall through to a rebuild. The JSON listing (kernel list --json) carries a stable error_kind token per corrupt entry for CI scripts — see cargo ktstr kernel.

Fixes:

  • Remove only corrupt entries: cargo ktstr kernel clean --corrupt-only --force
  • Rebuild a specific version after cleanup: cargo ktstr kernel build --force --kernel 6.14.2
  • Move the cache with KTSTR_CACHE_DIR if the default location is on a problematic filesystem.

Stale vmlinux.btf or default.profraw in kernel source tree

Older ktstr versions could leave two files in a kernel source directory: <source>/vmlinux.btf (a BTF sidecar, now written only inside the cache root) and <source>/default.profraw (an LLVM coverage artifact, now redirected next to the cargo-ktstr binary). Both are leftover state and safe to remove:

rm -f /path/to/linux/vmlinux.btf /path/to/linux/default.profraw

If they keep reappearing, you are running an old ktstr binary — rebuild or reinstall, then delete again. See profraw layout for where coverage artifacts land now.

Cache directory not found

HOME is unset; cannot resolve cache directory. The container init or login shell did not assign HOME — set it to an absolute path, or set KTSTR_CACHE_DIR to an absolute path (e.g. /tmp/ktstr-cache) or XDG_CACHE_HOME to specify a cache location explicitly.
HOME is set to the empty string; cannot resolve cache directory. An empty HOME usually means a Dockerfile or shell rc has `export HOME=` or `ENV HOME=` with no value. Either set HOME to a real absolute path, or set KTSTR_CACHE_DIR to an absolute path (e.g. /tmp/ktstr-cache) or XDG_CACHE_HOME to specify a cache location explicitly.

The kernel image cache requires a writable directory, resolved as KTSTR_CACHE_DIR > $XDG_CACHE_HOME/ktstr/ > $HOME/.cache/ktstr/. The first form fires when HOME is absent (bare container inits, systemd units without Environment=HOME=…); the second when HOME is set to the empty string.

Fix: Set KTSTR_CACHE_DIR to an explicit path, or ensure HOME is a real absolute path.
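The precedence can be sketched as a pure function over already-read variable values, which makes it easy to reason about container environments. The function names are hypothetical, and treating an empty KTSTR_CACHE_DIR or XDG_CACHE_HOME as unset is an assumption of this sketch.

```rust
use std::path::PathBuf;

/// Treat an empty value as unset (assumption of this sketch).
fn non_empty(v: Option<&str>) -> Option<&str> {
    v.filter(|s| !s.is_empty())
}

/// Resolve the cache root: KTSTR_CACHE_DIR, then
/// $XDG_CACHE_HOME/ktstr, then $HOME/.cache/ktstr.
fn resolve_cache_dir(
    ktstr_cache_dir: Option<&str>,
    xdg_cache_home: Option<&str>,
    home: Option<&str>,
) -> Result<PathBuf, String> {
    if let Some(dir) = non_empty(ktstr_cache_dir) {
        return Ok(PathBuf::from(dir));
    }
    if let Some(xdg) = non_empty(xdg_cache_home) {
        return Ok(PathBuf::from(xdg).join("ktstr"));
    }
    match home {
        None => Err("HOME is unset; cannot resolve cache directory".to_string()),
        Some("") => {
            Err("HOME is set to the empty string; cannot resolve cache directory".to_string())
        }
        Some(h) => Ok(PathBuf::from(h).join(".cache").join("ktstr")),
    }
}
```

Note that HOME is consulted only after both overrides miss, so setting KTSTR_CACHE_DIR is enough on its own in a container without HOME.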

Stale kconfig

warning: entries marked (stale kconfig) were built against a different ktstr.kconfig. Rebuild with: kernel build --force --kernel <entry version> (add --extra-kconfig PATH if the entry also carries the (extra kconfig) tag).

cargo ktstr kernel list marks entries whose stored kconfig hash differs from the current embedded ktstr.kconfig fragment — typical after updating ktstr. Stale entries rebuild automatically on the next cargo ktstr kernel build; --force overrides the cache for other reasons.

Kernel auto-download failures

ktstr: no kernel found, downloading latest stable
fetch https://www.kernel.org/releases.json: <error>

ktstr auto-downloads a kernel when no --kernel is specified and the discovery chain finds nothing; the same path runs when --kernel names a version not in the cache. The <error> is the underlying network error (DNS, connection refused, timeout, TLS). Variants:

fetch https://www.kernel.org/releases.json: HTTP 503

kernel.org returned a non-success status.

no stable kernel with patch >= 8 found in releases.json

ktstr requires a stable or longterm release with patch version >= 8, which avoids freshly released series that may still have build issues; releases.json contained no qualifying version.
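A rough sketch of the patch >= 8 filter follows. It assumes plain MAJOR.MINOR.PATCH version strings and picks the newest qualifying one; both the helper names and the "newest wins" tie-break are assumptions, and anything that does not parse (such as an -rc tag) is skipped.

```rust
/// Parse "MAJOR.MINOR[.PATCH]" into a sortable tuple; a missing patch
/// component counts as 0. Returns None for anything else.
fn parse(v: &str) -> Option<(u32, u32, u32)> {
    let mut it = v.split('.');
    let major: u32 = it.next()?.parse().ok()?;
    let minor: u32 = it.next()?.parse().ok()?;
    let patch: u32 = match it.next() {
        Some(p) => p.parse().ok()?,
        None => 0,
    };
    Some((major, minor, patch))
}

/// Newest version whose patch is >= `min_patch`, or None when nothing
/// qualifies.
fn pick_stable<'a>(versions: &[&'a str], min_patch: u32) -> Option<&'a str> {
    versions
        .iter()
        .copied()
        .filter_map(|v| parse(v).map(|key| (key, v)))
        .filter(|(key, _)| key.2 >= min_patch)
        .max_by_key(|(key, _)| *key)
        .map(|(_, v)| v)
}
```

Under this sketch a list holding 7.1.2 and 7.0.14 resolves to 7.0.14, because 7.1.2 has not yet reached patch 8.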

extract tarball: <error>

Disk full, bad permissions on the temp directory, or a truncated download.

Fixes:

  • Verify connectivity: curl -sI https://www.kernel.org/releases.json
  • If behind a proxy, set HTTP_PROXY / HTTPS_PROXY / NO_PROXY.
  • Check disk space; override the cache location with KTSTR_CACHE_DIR if needed.
  • Pre-download explicitly — cargo ktstr kernel build --kernel 6.14.10 isolates version resolution from download failures.

Kernel download failures

These fire when an explicit version is requested:

version 6.14.22 not found. latest 6.14.x: 6.14.10

The requested version does not exist; when a sibling in the same series is available, the error suggests it. An EOL series gets only the bare “not found”.

RC tarball not found: https://git.kernel.org/torvalds/t/linux-6.15-rc3.tar.gz
  RC releases are removed from git.kernel.org after the stable version ships.

Use --kernel git+URL#tag=NAME with a git.kernel.org URL to clone the tag instead.

download ...: server returned HTML instead of tarball (URL may be invalid)

Some CDN error pages return HTTP 200 with HTML; the download rejects these responses. Check the URL / version against https://www.kernel.org/releases.json.

Shell mode issues

stdin must be a terminal

stdin must be a terminal for interactive shell mode

cargo ktstr shell requires a terminal for bidirectional I/O forwarding; piped or redirected stdin is rejected.

include file not found

-i strace: not found in filesystem or PATH

Bare names (without /, ., or ..) are searched in PATH; if the binary is not there, use an explicit path.

--include-files path not found: ./missing-file

Explicit paths must exist on disk.

include directory contains no files

warning: -i ./empty-dir: directory contains no regular files

The directory was walked recursively but contained no regular files (FIFOs, device nodes, and sockets are skipped).

Flock timeout / NFS rejection

flock LOCK_EX on run-dir target/ktstr/6.14-abc1234 timed out after
30s (lockfile target/ktstr/.locks/6.14-abc1234.lock, holders:
  pid=12345 cmd=cargo-ktstr test --kernel 6.14). A peer cargo
ktstr test process is writing sidecars to the same
{kernel}-{project_commit} directory; wait for it to finish or kill
it, then retry.

A peer process is holding the per-run-key advisory flock(2) that serializes sidecar writes; the helper polled for 30 s and gave up. Run-dir locks live at {runs_root}/.locks/{kernel}-{project_commit}.lock and serialize the pre-clear + write cycle so two concurrent runs sharing a key cannot tear each other’s sidecars.

target/ktstr/.locks/6.14-abc1234.lock: filesystem NFS is not
supported for ktstr lockfiles (NFSv3 is advisory-only without
an NLM peer; NFSv4 byte-range locking does not cover flock(2)).
Move the lockfile path to a local filesystem (tmpfs, ext4, xfs,
btrfs, f2fs, bcachefs).

ktstr rejects NFS, CIFS, SMB2, CephFS, AFS, and FUSE mounts for lockfiles because flock(2) semantics there are unreliable — see Resource Budget for the rationale.

Diagnose:

  • cargo ktstr locks (or ktstr locks --watch 1s) prints every ktstr flock currently held on the host with PID + cmdline — see ktstr (standalone).
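  • grep ":$(stat -c %i <lockfile-path-from-error>) " /proc/locks falls back to the kernel’s own lock enumeration when the holder is outside ktstr. /proc/locks identifies files by device and inode number, not by path, so match on the lockfile’s inode.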
  • stat -f -c '%T' <runs-root> reports the filesystem type.

Fix:

  • Peer-holder timeout: wait for the peer, kill it (kill <pid> from the holder list), or retry.
  • NFS / remote-fs rejection: relocate the runs root to a local filesystem via KTSTR_SIDECAR_DIR — noting that the override path also skips the cross-process flock, so give each concurrent run its own path. The kernel cache’s lockfiles face the same constraint — override KTSTR_CACHE_DIR if the default resolves to NFS.

Test hangs / nextest timeout

A VM test that stops making progress is eventually flagged SLOW by nextest and then terminated when it exceeds the profile’s slow-timeout budget: 60 s × 2 periods on the default profile, 90 s × 3 on the ci profile, with larger per-test overrides for heavy classes (verifier sweeps 180 s, the wide-SMP boots up to 960 s — see .config/nextest.toml).

ktstr’s own per-VM watchdog is sized to fire before nextest’s kill so you get a failure dump instead of a blunt termination. If nextest kills first, you lose the dump — so:

  • Re-run just the failing test with its exact variant name and read the dump — see Reading Failure Output.
  • Check for peers holding CPU locks (cargo ktstr locks) — a contended host makes VM boots slow enough to blow timeouts.
  • On a busy or small machine, pass --no-perf-mode and use --profile ci locally for the bigger budgets.
  • If one test legitimately needs longer (huge topology), give it a per-test override in .config/nextest.toml rather than raising the profile-wide timeout.

Tests pass locally but fail in CI

Common causes:

  • No KVM: CI runners need hardware virtualization. Check for /dev/kvm access.
  • Fewer CPUs: gauntlet topology presets up to 252 CPUs may exceed the runner’s capacity. Use smaller topologies.
  • No kernel: set KTSTR_TEST_KERNEL in the CI environment, or build and cache a kernel as part of the CI job.
  • No CAP_SYS_NICE or rtprio: performance-mode tests require CAP_SYS_NICE or an rtprio limit for RT scheduling, and enough host CPUs for exclusive LLC reservation. Pass --no-perf-mode (or set KTSTR_NO_PERF_MODE=1) to disable all performance mode features; tests with performance_mode=true are then skipped entirely.
  • Debug thresholds: CI often runs debug builds. Debug builds use relaxed thresholds (3000ms gap, 35% spread) but may still hit limits on slow runners. See Checking.
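The first cause is cheap to rule out before a run starts. The sketch below classifies what happens when a process tries to open the KVM device read-write. It is a hypothetical preflight helper, not something ktstr ships, and a successful open only shows that the device node is reachable, not that every KVM feature a test needs is available.

```rust
use std::fs::OpenOptions;
use std::io::ErrorKind;

#[derive(Debug, PartialEq)]
enum KvmStatus {
    /// The device node opened read-write.
    Usable,
    /// No such device node (no KVM module, or not passed into the container).
    Missing,
    /// The node exists but this user may not open it (often a group issue).
    PermissionDenied,
    /// Anything else, reported by error kind.
    Other(ErrorKind),
}

/// Try to open the given device path (normally "/dev/kvm") read-write.
fn kvm_status(path: &str) -> KvmStatus {
    match OpenOptions::new().read(true).write(true).open(path) {
        Ok(_) => KvmStatus::Usable,
        Err(e) => match e.kind() {
            ErrorKind::NotFound => KvmStatus::Missing,
            ErrorKind::PermissionDenied => KvmStatus::PermissionDenied,
            kind => KvmStatus::Other(kind),
        },
    }
}
```

Calling kvm_status("/dev/kvm") at the top of a CI job turns a confusing boot failure into a one-line diagnosis.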

Environment Variables

Every environment variable ktstr reads, grouped by task. Unless a row says otherwise, an empty value is treated the same as unset — the exceptions are called out per row.

Most of these have a CLI flag equivalent; prefer the flag in scripts and the variable in CI job-level env: blocks.
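The "Accepted values" column in the tables that follow uses a few recurring phrases: "Any non-empty value", "Exactly "1"", "Anything except empty or "0"", and "Presence". As a reading aid, the sketch below spells out one plausible interpretation of each as a predicate over an optional value. The function names are hypothetical and this is not ktstr's parsing code.

```rust
/// Read a variable; `None` when it is unset (or not valid Unicode).
fn env_value(name: &str) -> Option<String> {
    std::env::var(name).ok()
}

/// "Any non-empty value": set, and not the empty string.
fn non_empty(v: Option<&str>) -> bool {
    v.is_some_and(|s| !s.is_empty())
}

/// "Exactly "1"": only the literal string 1 activates the knob.
fn exactly_one(v: Option<&str>) -> bool {
    v == Some("1")
}

/// "Anything except empty or "0"".
fn not_empty_or_zero(v: Option<&str>) -> bool {
    v.is_some_and(|s| !s.is_empty() && s != "0")
}

/// "Presence": being set at all counts, even to the empty string.
fn present(v: Option<&str>) -> bool {
    v.is_some()
}
```

For example, non_empty(env_value("KTSTR_CI").as_deref()) mirrors the "Any non-empty value" rule for KTSTR_CI, while present(...) mirrors the presence-only rule that KTSTR_NO_SKIP_MODE calls out.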

Daily knobs

| Variable | Effect | Accepted values | Default |
| --- | --- | --- | --- |
| KTSTR_KERNEL | Selects the kernel for every entry point (build-time BTF resolution and runtime image discovery). Set automatically by cargo ktstr test --kernel. | Exact version (6.14), range (6.14..7.0), git+URL#tag=…/#branch=…/#sha=…, path, or cache key | Auto-discovered |
| KTSTR_TEST_KERNEL | Points the test harness directly at a bootable image (bzImage on x86_64, Image on aarch64). Set-but-empty is a hard error, not a fallback. | Image path | Auto-discovered |
| KTSTR_SCHEDULER | Global binary override for every SchedulerSpec::Discover scheduler. See Troubleshooting for the full resolution order. | Binary path | Build/discover cascade |
| KTSTR_SCHEDULER_BIN_<NAME> | Per-scheduler binary override, checked before the global one. <NAME> is the discover name uppercased with non-alphanumerics mapped to _ (scx-ktstr → KTSTR_SCHEDULER_BIN_SCX_KTSTR). | Binary path | Unset |
| KTSTR_SCHEDULER_PROFILE | Cargo build profile for the scheduler-under-test (independent of the harness profile). Set by cargo ktstr … --profile. | Profile name | release |
| KTSTR_SCHEDULER_ALLOW_STALE_FALLBACK | After a failed orchestrated cargo build -p <sched>, fall back to a pre-built binary instead of failing the test. | Any non-empty value | Refuse stale fallback |
| KTSTR_NO_PERF_MODE | Disable performance mode (pinning, RT scheduling, hugepages, KVM exit suppression). A budget-sized CPU reservation is still taken; see Resource Budget. Flag: --no-perf-mode. | Any non-empty value | Perf mode available |
| KTSTR_NO_SKIP_MODE | Turn resource-contention / insufficient-topology skips into hard failures. Flag: --no-skip-mode. Presence-only: even an empty value activates it. | Presence | Skip on contention |
| KTSTR_CARGO_TEST_MODE | Marks a direct cargo test / cargo nextest run without the cargo ktstr wrapper: no gauntlet expansion, no host CPU flocks, per-process initramfs builds, and $PATH-first scheduler discovery. | Any non-empty value | Full orchestration |
| KTSTR_VERBOSE | Verbose guest console output (loglevel=7, plus earlyprintk=serial on x86_64). | Exactly "1" | Quiet console |
| KTSTR_LOG_PASSES | Log every Verdict pass detail, not just failures; answers “the test passed but what did the assertion see?”. | Anything except empty or "0" | Failures only |
| KTSTR_BUDGET_SECS | Time budget for greedy coverage-maximizing test selection at list time. See Running Tests. | Positive number (fractional ok); invalid values warn and are ignored | All tests listed |
| RUST_BACKTRACE | Verbose diagnostics on failure; "1" or "full" also enables the verbose guest console. Propagated to the guest. | 1, full | Off |
| RUST_LOG | Tracing filter, host-side and guest-side (forwarded on the guest kernel command line). Example: RUST_LOG=ktstr::flock=debug surfaces flock-contention heartbeats. | tracing filter syntax | Off |
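The KTSTR_SCHEDULER_BIN_<NAME> naming rule and the "per-scheduler before global" order can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, "alphanumeric" is taken to mean ASCII letters and digits, an empty value is treated as unset, and the rest of the build/discover cascade is not modeled.

```rust
/// Map a discover name to its per-scheduler override variable:
/// uppercase it and replace every non-alphanumeric character with `_`.
fn scheduler_bin_env(discover_name: &str) -> String {
    let suffix: String = discover_name
        .chars()
        .map(|c| {
            if c.is_ascii_alphanumeric() {
                c.to_ascii_uppercase()
            } else {
                '_'
            }
        })
        .collect();
    format!("KTSTR_SCHEDULER_BIN_{suffix}")
}

/// Resolve a binary override: the per-scheduler variable first, then the
/// global KTSTR_SCHEDULER. `lookup` stands in for the process environment.
fn scheduler_override(
    discover_name: &str,
    lookup: impl Fn(&str) -> Option<String>,
) -> Option<String> {
    // Assumption: an empty value behaves like an unset variable.
    let set = |key: &str| lookup(key).filter(|v| !v.is_empty());
    set(&scheduler_bin_env(discover_name)).or_else(|| set("KTSTR_SCHEDULER"))
}
```

In a real process, lookup would be a closure over std::env::var, for example |k| std::env::var(k).ok().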

Kernel builds and caches

| Variable | Effect | Accepted values | Default |
| --- | --- | --- | --- |
| KTSTR_CACHE_DIR | Override the cache root (kernel images, BTF anchors, blobs). The value is used verbatim: no per-type subdirectory is appended. | Absolute path | $XDG_CACHE_HOME/ktstr/ or ~/.cache/ktstr/ (per-type subdir appended) |
| KTSTR_GHA_CACHE | Enable the GitHub Actions remote kernel cache. Needs ACTIONS_CACHE_URL (set by the runner) and a ktstr built with --features remote-cache. Local cache stays authoritative; remote failures are non-fatal. | Exactly "1" | Disabled |
| KTSTR_KERNEL_PARALLELISM | Width of the download/resolve fan-out for multi-kernel --kernel specs. Affects downloads only; builds serialize on the host CPU locks. | Positive integer; 0 / unparseable falls back to default | Host logical CPU count |
| KTSTR_CACHE_STORE_LOCK_TIMEOUT | Timeout for the exclusive lock taken while storing a built kernel into the cache. Raise on CI runners with slow shared disks. | humantime duration (30s, 2m) | Compile-time default |
| KTSTR_BUSYBOX_TARBALL | Build-time (build.rs): read the busybox source tarball from a local path instead of downloading it. | Tarball path | Download |
| KTSTR_SKIP_BUSYBOX_BUILD | Build-time (build.rs): skip the busybox compile entirely; shell mode becomes unavailable. | Any non-empty value | Build busybox |
| KTSTR_SKIP_WPROF_BUILD | Build-time (build.rs, wprof feature only): skip fetching and compiling the bundled wprof tooling. | Any non-empty value | Build wprof |

Sidecars and stats

| Variable | Effect | Accepted values | Default |
| --- | --- | --- | --- |
| KTSTR_SIDECAR_DIR | Override the per-test sidecar output directory. Skips both the pre-clear and the cross-process flock: the operator owns the directory, so concurrent runs pointing at the same path are unserialized. stats subcommands read the default pool; pass --dir to point them elsewhere. See Runs and Regression Gates. | Directory path | {runs root}/{kernel}-{project_commit}/ |
| KTSTR_CI | Stamp every sidecar’s run_source as "ci" instead of "local", so CI-produced runs are filterable in the stats pool. | Any non-empty value | "local" |

Resource coordination and escape hatches

See Resource Budget for how these interact; the first two are mutually exclusive at every entry point.

| Variable | Effect | Accepted values | Default |
| --- | --- | --- | --- |
| KTSTR_CPU_CAP | Cap the host CPUs reserved by a no-perf-mode VM or kernel build. Flag --cpu-cap N takes precedence. | Integer ≥ 1; 0 / non-numeric rejected | Kernel build: 30% of allowed CPUs (min 1). No-perf VM: the vCPU count, floored at 30%. |
| KTSTR_BYPASS_LLC_LOCKS | Skip host-side LLC flock acquisition entirely, with no coordination against concurrent runs. | Any non-empty value | Coordinate |
| KTSTR_LOCK_DIR | Directory for the per-LLC / per-CPU flock files. Use when /tmp is constrained on a runner. | Directory path | /tmp |
| KTSTR_CONTENTION_BYPASS | Make transient KVM errnos hard failures instead of ResourceContention skips (only when the host is not near its limits). Stricter; intended for catching kernel-side regressions. | Exactly "1" | Skip on contention |
| KTSTR_HOST_CGROUP_PARENT | cgroup-v2 parent under which host_only tests create per-test cgroups. Must be a non-root subdirectory of /sys/fs/cgroup. | Path under /sys/fs/cgroup | /sys/fs/cgroup/ktstr |
| KTSTR_CGROUP_WALK_ROOT | Where the setup-time controller-enable walk starts, for delegated cgroup subtrees (systemd Delegate=yes, container nsdelegate). Must be a prefix of the configured parent. | Path prefix of the parent | /sys/fs/cgroup |
| KTSTR_STALL_POLL_MS | Host-mode stall-monitor poll cadence. | Milliseconds; empty / 0 / unparseable falls back | 500 ms |
| KTSTR_WORKER_READY_MARKER_OVERRIDE | Path where the jemalloc alloc worker writes its ready marker, for noexec or quota-constrained temp filesystems. | Absolute path | /tmp/ktstr-worker-ready-<pid> |

Set by ktstr itself

cargo ktstr stamps these across the orchestrator → nextest → test-binary boundary. Listed so you can recognize them in ps output and CI logs — do not set them by hand.

| Variable | Carries |
| --- | --- |
| KTSTR_KERNEL_LIST | Multi-kernel fan-out list (label=path;…) when a run resolves 2+ kernels; each test expands to one variant per kernel. Takes precedence over KTSTR_KERNEL during variant expansion. |
| KTSTR_KERNEL_COMMIT | dir=commit map of each source kernel’s HEAD, so per-test processes skip re-walking the kernel tree. |
| KTSTR_PROJECT_COMMIT | The project commit label perf-delta children must record in their sidecars. |
| KTSTR_ORCHESTRATED | Orchestration marker; VM-booting integration tests skip when it is absent (raw cargo nextest run would starve their resource budgets). |
| KTSTR_RUN_EPOCH | Per-invocation session token that keeps parallel test processes from pre-clearing each other’s freshly written sidecars. |
| KTSTR_RUNS_ROOT | Absolute runs root, stamped once so sidecar writers and post-run readers resolve the same directory regardless of CWD. |
| KTSTR_PERF_ONLY | Set by perf-delta runs: skip every test without performance_mode. Exporting it manually restricts any run the same way. |
| KTSTR_VERIFIER_RAW | Set by cargo ktstr verifier --raw: emit verifier logs verbatim, no cycle collapsing. |
| KTSTR_VERIFIER_RESULT_DIR | Directory where verifier cells write per-cell PASS/FAIL records for the summary grid. |
| KTSTR_VERIFIER_SCHEDULER | The verifier --scheduler NAME filter, forwarded to cell emission. |
| KTSTR_BUSYBOX_PATH | Path to the busybox blob cargo ktstr extracts at startup for shell-mode VMs and disk-template builds. |
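When one of these shows up in a CI log, it can help to split it back into its parts. The sketch below decodes a label=path;… list of the kind KTSTR_KERNEL_LIST carries. It is meant for reading logs, not for setting the variable; the function name is hypothetical, and skipping empty or malformed entries is an assumption about how to treat bad input.

```rust
/// Split a `label=path;label=path` list into (label, path) pairs.
/// Empty segments and segments without `=` are skipped (assumed behavior).
fn parse_kernel_list(value: &str) -> Vec<(String, String)> {
    value
        .split(';')
        .filter(|entry| !entry.is_empty())
        .filter_map(|entry| {
            // Split on the first `=` only, so paths may contain `=`.
            let (label, path) = entry.split_once('=')?;
            Some((label.to_string(), path.to_string()))
        })
        .collect()
}
```

The same shape works for the dir=commit map in KTSTR_KERNEL_COMMIT if it uses the same separators, which is an assumption rather than something stated above.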

Probe wiring

Consulted by integration tests that boot a jemalloc-linked allocator worker and attach the jemalloc probe to it. Both must be populated before ktstr’s early nextest dispatch runs, so tests set them from a #[ctor] — see Payloads and Included Files for the wiring pattern. Leaving them unset is the normal case: no probe is packed into the initramfs.

| Variable | Effect | Default |
| --- | --- | --- |
| KTSTR_JEMALLOC_PROBE_BINARY | Absolute host path to ktstr-jemalloc-probe; packed into every VM’s initramfs at /bin/ktstr-jemalloc-probe when set. | No probe packed |
| KTSTR_JEMALLOC_ALLOC_WORKER_BINARY | Absolute host path to the paired ktstr-jemalloc-alloc-worker, packed alongside the probe. | No worker packed |

LLVM coverage

| Variable | Effect | Default |
| --- | --- | --- |
| LLVM_COV_TARGET_DIR | Directory for extracted profraw files. | Parent of LLVM_PROFILE_FILE, or <exe-dir>/llvm-cov-target/ |
| LLVM_PROFILE_FILE | Standard LLVM profiling output path; ktstr reads its parent as a fallback profraw directory. | None |

Nextest protocol

| Variable | Effect | Default |
| --- | --- | --- |
| NEXTEST | Set by nextest when it invokes the test binary. ktstr’s early dispatch inspects it to decide whether to intercept --list / --exact for gauntlet expansion and budget selection. | None |

VM-internal

Mostly set by the host on the guest kernel command line and read by the guest init (via /proc/cmdline); a few (noted below) are process-internal markers set inside the guest. Not intended for user configuration; listed here for debugging.

| Variable | Description |
| --- | --- |
| KTSTR_MODE | Guest execution mode. shell requests the interactive shell; disk_template requests a one-shot mkfs template-build VM. Absent means the default test-dispatch path. |
| KTSTR_TOPO | Topology string (numa_nodes,llcs,cores,threads) for guest-side scenario resolution. |
| KTSTR_TERM | Terminal type forwarded from the host (sets guest TERM). |
| KTSTR_COLORTERM | Color capability forwarded from the host (sets guest COLORTERM). |
| KTSTR_COLS / KTSTR_ROWS | Host terminal size, used to size the guest pty when available. |
| KTSTR_GUEST_INIT | Process-internal marker set by the guest init, not a host-emitted cmdline token. Used to detect re-entrant worker spawns under PID-1 init. |
| KTSTR_DISK0_FS / KTSTR_DISK0_MOUNT / KTSTR_DISK0_RO | Disk-attach metadata (fs type, mount point, ro flag) for #[ktstr_test(disk = ...)], consumed by the guest to mount the virtio-blk backing. |

ctprof

The ctprof profiler captures a host-wide per-thread snapshot of scheduling counters, memory / I/O accounting, CPU affinity, cgroup state, and thread identity, then compares two snapshots to surface what changed. It is a manually-invoked CLI companion to the automated scheduler tests — useful when a run passes on one machine and fails on another, or for A/B comparing host behavior across kernel / sysctl / workload changes.

This is a different tool from cargo ktstr show-host, which captures the host context (kernel, CPU model, sched_* tunables, NUMA layout, kernel cmdline) — aggregate state that does not change between scenarios. The profiler captures per-thread cumulative counters that do change, and its comparison surface is designed for the thread-level diff.

When to use it

  • Workload investigation — you observe a regression and want to know which process / thread pool moved in run time, context-switch rate, or migration count.
  • Kernel / sysctl A/B — capture before and after flipping a sched_* tunable on an otherwise-identical workload; the compare output surfaces every counter that responded.
  • Host baselining — capture on a known-good host, capture on a failing host, compare to isolate what differs at the thread-behavior level.

The profiler is not invoked automatically by scenarios or the gauntlet. It is opt-in and operator-driven via the ktstr ctprof subcommand.

Capture, then compare

The whole workflow is three commands wrapped around whatever you are changing: capture a baseline, make the change, capture a candidate, then compare the two.

ktstr ctprof capture --output base.ctprof.zst
# ... run a workload, flip a tunable, swap a scheduler ...
ktstr ctprof capture --output cand.ctprof.zst
ktstr ctprof compare base.ctprof.zst cand.ctprof.zst

capture walks /proc for every live thread group, enumerates each thread, and reads a handful of procfs sources for each one. The output is a zstd-compressed JSON snapshot (conventional extension: .ctprof.zst). On a workstation with ~1,200 live threads, each snapshot in the run below took about a second.

Here is a real compare — two captures taken a couple of seconds apart on a busy workstation. Rows sort by largest absolute percent delta, so the biggest movers are the first thing you see:

ktstr ctprof compare base.ctprof.zst cand.ctprof.zst --sections primary --limit 20
## Primary metrics
 comm                              threads  metric             value                delta      %         %uptime
 kworker/{N}:{N}-mm_percpu_wq
     kworker/{N}:{N}-mm_percpu_wq  11→37    voluntary_csw      8.697K → 101.154K    +92.457K   +1063.1%  93%
     kworker/{N}:{N}-mm_percpu_wq  11→37    timeslices         8.699K → 101.166K    +92.467K   +1063.0%  93%
     kworker/{N}:{N}-mm_percpu_wq  11→37    wait_time_ns       2.684s → 27.653s     +24.969s   +930.2%   93%
     kworker/{N}:{N}-mm_percpu_wq  11→37    stime_clock_ticks  22ticks → 217ticks   +195ticks  +886.4%   93%
     kworker/{N}:{N}-mm_percpu_wq  11→37    run_time_ns        243.378ms → 2.320s   +2.077s    +853.4%   93%
     kworker/{N}:{N}-mm_percpu_wq  11→37    nonvoluntary_csw   2 → 12               +10        +500.0%   93%
     kworker/{N}:{N}-mm_percpu_wq  11→37    thread_count       11 → 37              +26        +236.4%   93%
     kworker/{N}:{N}-mm_percpu_wq  11→37    nr_migrations      11 → 34              +23        +209.1%   93%
 kworker/{N}:{N}-events
     kworker/{N}:{N}-events        87→60    nonvoluntary_csw   22 → 11              -11        -50.0%    95%
     kworker/{N}:{N}-events        87→60    timeslices         222.140K → 127.813K  -94.327K   -42.5%    95%
 user.slice
   user-{N}.slice
     session-{H}.scope
       ktstr
           ktstr                   1        processor          9 → 43               +34        +377.8%   0%
           ktstr                   1        wait_time_ns       6.850µs → 22.693µs   +15.843µs  +231.3%   0%
           ktstr                   1        nonvoluntary_csw   2 → 6                +4         +200.0%   0%
... 22 more lines truncated (use --limit 0 for unlimited)

Reading it:

  • value is baseline → candidate for the group’s aggregated reading; delta and % carry the signed move. The mm_percpu_wq pool grew from 11 to 37 threads and its voluntary context switches rose roughly 11.6× (8.697K → 101.154K, +1063.1%), all inside the capture window. The eye lands there first because the sort put it first.
  • threads shows the group population on each side (11→37). A population change is often the story by itself.
  • %uptime is the group’s average thread lifetime relative to the longest-lived group in the snapshot — low values flag young threads whose counters had little time to accumulate.
  • {N} / {H} placeholders come from name-pattern normalization: kworker/3:1 and kworker/7:0 are the same logical pool, so they land in one kworker/{N}:{N} bucket.
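The delta and % columns, and the biggest-movers ordering, follow from simple arithmetic. The sketch below recomputes them for three rows of the excerpt. It is an illustration, not the compare implementation: the Row type and function names are hypothetical, and placing rows with a zero baseline (no defined percent) last is an assumption.

```rust
#[derive(Debug, Clone)]
struct Row {
    metric: &'static str,
    baseline: f64,
    candidate: f64,
}

impl Row {
    /// Signed move from baseline to candidate.
    fn delta(&self) -> f64 {
        self.candidate - self.baseline
    }

    /// Percent change relative to the baseline; undefined for a zero baseline.
    fn pct(&self) -> Option<f64> {
        (self.baseline != 0.0).then(|| 100.0 * self.delta() / self.baseline)
    }
}

/// Sort rows so the largest absolute percent delta comes first.
fn sort_biggest_movers(rows: &mut [Row]) {
    // Rows without a defined percent sort last (assumption).
    let key = |r: &Row| r.pct().map_or(-1.0, f64::abs);
    rows.sort_by(|a, b| key(b).total_cmp(&key(a)));
}
```

Feeding in voluntary_csw (8697 → 101154), thread_count (11 → 37), and nr_migrations (11 → 34) reproduces both the percentages and the row order shown in the excerpt.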

Groups present on only one side surface as unmatched — a row is missing because the process did not exist, not because it did zero work. A full (unfiltered) compare lists them in a trailer:

1 group(s) only in baseline (/tmp/ktstr-docs-base.ctprof.zst):
  kworker/u{N}:{N}-flush-btrfs-{N} kworker/u{N}:{N}-flush-btrfs-{N}

2 group(s) only in candidate (/tmp/ktstr-docs-cand.ctprof.zst):
  kworker/u{N}:{N}-writeback kworker/u{N}:{N}-writeback
  kworker/{N}:{N}-events_freezable kworker/{N}:{N}-events_freezable

What is captured per thread

  • Identity — tid, tgid, process and thread name, cgroup v2 path, start time, scheduling policy, nice, CPU affinity mask.
  • Scheduling counters (cumulative, from /proc/<tid>/sched and /proc/<tid>/schedstat) — run / wait / sleep / block / iowait time, context switches, wakeups with locality splits, migrations, plus lifetime peaks (wait_max, slice_max, …).
  • Memory — page faults; jemalloc per-thread allocated/deallocated counters read via ptrace + process_vm_readv (jemalloc-linked processes only — other allocators read zero rather than failing capture); per-process smaps_rollup.
  • I/O: rchar / wchar, syscall counts, and block-level byte counters from /proc/<tid>/io (requires CONFIG_TASK_IO_ACCOUNTING).
  • Taskstats delay accounting + watermarks — eight delay categories plus peak-RSS/VM watermarks via the TASKSTATS genetlink family; see Taskstats delay accounting for gating and semantics.
  • PSI and cgroup aggregates — host-level and per-cgroup pressure (CONFIG_PSI), cpu.stat / memory.* / pids.* per cgroup that hosted a sampled thread — read from cgroup files directly, not derived from per-thread data.
  • sched_ext sysfs: state, switch_all, nr_rejected, and the hotplug/enable sequence counters, when CONFIG_SCHED_CLASS_EXT is built.

Three timing families matter when interpreting a diff:

  • Cumulative counters (the majority) only increase, so probe attachment time does not bias the reading — a diff between two captures measures exactly the activity in the window.
  • Lifetime extrema (*_max, hiwater_*, *_delay_min_ns) are per-event extremes (the largest or smallest single observation) kept by the kernel, not sums over the window.
  • Instantaneous gauges (nr_threads, fair_slice_ns, state, affinity, processor) are sampled at capture time and can legitimately differ between two probes of the same thread.

Metrics that reset on attachment (perf_event_open counters, BPF tracing samples) are intentionally absent — they require long-lived instrumentation the capture layer cannot install without disturbing the system it is measuring.

Capture is best-effort

Each internal reader returns Option; a kernel missing a config gate (no CONFIG_SCHED_DEBUG, no CONFIG_SCHEDSTATS) yields None from that reader without failing the rest of the thread. Counters collapse to 0, identity strings collapse to empty. A missing reading is indistinguishable from a genuine zero in the output — the contract is “never fail the snapshot.” The capture summary lines on stderr tally read failures and hint at the likely missing kconfig.
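The "reader returns Option, output collapses to zero" contract looks roughly like this for one procfs source. The sketch parses the three fields of /proc/<tid>/schedstat (on-CPU time, runqueue wait time, timeslice count). The struct and function names are hypothetical and this is not the capture layer's code.

```rust
#[derive(Debug, Default, PartialEq)]
struct SchedStat {
    run_time_ns: u64,
    wait_time_ns: u64,
    timeslices: u64,
}

/// Parse the three whitespace-separated schedstat fields.
/// Any missing or non-numeric field yields `None`.
fn parse_schedstat(text: &str) -> Option<SchedStat> {
    let mut fields = text.split_whitespace().map(|t| t.parse::<u64>().ok());
    Some(SchedStat {
        run_time_ns: fields.next()??,
        wait_time_ns: fields.next()??,
        timeslices: fields.next()??,
    })
}

/// One reader: `None` when the file is absent, unreadable, or malformed.
fn read_schedstat(tid: u32) -> Option<SchedStat> {
    let text = std::fs::read_to_string(format!("/proc/{tid}/schedstat")).ok()?;
    parse_schedstat(&text)
}

/// Best-effort view: a failed read collapses to all-zero counters,
/// which is why a missing reading looks like a genuine zero downstream.
fn schedstat_or_zero(tid: u32) -> SchedStat {
    read_schedstat(tid).unwrap_or_default()
}
```

The trade-off is visible in schedstat_or_zero: the snapshot never fails, at the cost of losing the distinction between "zero" and "not available" unless a separate tally records the failure.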

Pulling the jemalloc counters briefly stops each probed thread via ptrace(PTRACE_SEIZE), which needs root, CAP_SYS_PTRACE, or kernel.yama.ptrace_scope=0; without the privilege those fields fall through to zero and the rest of the snapshot still populates.

Cgroup namespace caveat

The per-thread cgroup path is read verbatim from /proc/<tid>/cgroup — it is relative to the cgroup namespace root the capturing process sees, not the system-global v2 mount root. A process inside a nested cgroup namespace sees a truncated path. Cross-namespace comparison requires external canonicalization; the capture layer deliberately does not attempt it.
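For reference, the cgroup v2 membership in /proc/<tid>/cgroup is the line that begins with 0:: and the path is everything after that prefix. A minimal sketch of extracting it (the function name is hypothetical) shows that the value is taken as-is, with no attempt to resolve it against a different namespace root:

```rust
/// Extract the cgroup v2 path from the text of /proc/<tid>/cgroup.
/// The v2 entry has hierarchy ID 0 and an empty controller list: `0::<path>`.
/// The path is returned verbatim, relative to the reader's cgroup namespace.
fn cgroup_v2_path(proc_cgroup: &str) -> Option<&str> {
    proc_cgroup
        .lines()
        .find_map(|line| line.strip_prefix("0::"))
}
```

A thread inside a nested cgroup namespace may simply report /, which is the truncation the paragraph above warns about.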

Compare options

Grouping

compare defaults to --group-by all: all three pattern-aware axes (cgroup, pcomm, comm) contribute to one view — cgroup-grouped rows render as an indented path tree, name-pattern buckets render flat, as in the excerpt above — and renamed-but-identical cgroups are joined for diffing (a [fudged: <leaf>] marker) instead of surfacing as orphans.

  • --group-by pcomm — aggregate every thread of the same process together (the show default).
  • --group-by cgroup — aggregate by cgroup path; enables the per-cgroup sections. Use --cgroup-flatten '<glob>' to collapse dynamic segments (pod UUIDs, session scopes) so the same logical workload lands on the same row across runs.
  • --group-by comm — aggregate by thread-name pattern across every process (tokio-worker-{0..N} → one bucket). Choose it when a thread-pool name spans many binaries.
  • --group-by comm-exact — literal thread names, no pattern collapse, for when distinct token values carry meaning (each kworker/u8:N tracked independently).
  • --no-thread-normalize disables the pattern collapse on the name axes.
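To make the pattern collapse concrete, here is a deliberately simplified normalizer that replaces every run of decimal digits with {N}. It is an assumption-laden sketch, not ctprof's rule set: the function name is hypothetical, and the {H} placeholder and range forms such as {0..N} are not modeled.

```rust
/// Replace each maximal run of ASCII digits with the `{N}` placeholder.
fn normalize_comm(name: &str) -> String {
    let mut out = String::with_capacity(name.len());
    let mut in_digits = false;
    for c in name.chars() {
        if c.is_ascii_digit() {
            // Emit the placeholder once per run of digits.
            if !in_digits {
                out.push_str("{N}");
                in_digits = true;
            }
        } else {
            out.push(c);
            in_digits = false;
        }
    }
    out
}
```

Under this rule kworker/3:1 and kworker/7:0 both map to kworker/{N}:{N}, so their counters land in one bucket; with comm-exact or --no-thread-normalize they would stay separate.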

Rule of thumb: start with the default all to find which axis moved, then re-run with that single axis plus --sections / --metrics filters for a narrow, pasteable table.

Filtering: --sections vs --metrics

  • --sections picks which sub-tables render: primary, taskstats-delay, derived, cgroup-stats, cgroup-limits, memory-stat, memory-events, pressure, host-pressure, smaps-rollup, sched-ext. The five cgroup sections require --group-by cgroup. The taskstats rows render inside the primary/derived tables but match the taskstats-delay section name, so you can scope to them alone or exclude them.
  • --metrics picks which rows render inside the primary and derived tables, by metric name from the metric-list vocabulary. Secondary sub-tables have fixed shapes and ignore it.

They compose: --sections primary --metrics run_time_ns shows a single row and nothing else. --sort-by 'wait_sum:desc,run_time_ns:desc' re-ranks rows by your own key instead of the default biggest-|Δ%| ordering; --limit N caps lines per section.
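The --sort-by value is a comma-separated list of key:direction pairs. The sketch below parses that shape so the multi-key ordering is explicit: earlier keys take priority and later keys break ties. It is illustrative only. The type and function names are hypothetical, and both the default direction for a bare key and the rejection of unknown directions are assumptions, not documented behavior.

```rust
#[derive(Debug, PartialEq)]
enum Direction {
    Asc,
    Desc,
}

/// Parse "key:dir,key:dir" into an ordered list of sort keys.
fn parse_sort_by(spec: &str) -> Result<Vec<(String, Direction)>, String> {
    spec.split(',')
        .map(|item| {
            let item = item.trim();
            let (key, dir) = match item.split_once(':') {
                Some((k, "asc")) => (k, Direction::Asc),
                Some((k, "desc")) => (k, Direction::Desc),
                Some((_, other)) => return Err(format!("unknown direction: {other}")),
                // Assumption: a bare key sorts descending.
                None => (item, Direction::Desc),
            };
            if key.is_empty() {
                return Err("empty sort key".to_string());
            }
            Ok((key.to_string(), dir))
        })
        .collect()
}
```

The valid key names themselves come from metric-list; this sketch does not check them.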

How groups aggregate

Every metric declares how per-thread values reduce into a group row; the registry binds each metric to exactly one reduction so a nonsensical fold (summing peaks) cannot be expressed.

| Metric class | Group reduction | Why |
| --- | --- | --- |
| Cumulative counters (csw, wakeups, migrations, run/wait time, io bytes, delay totals) | sum | totals compose; deltas stay meaningful |
| Lifetime peaks (wait_max, *_delay_max_ns, hiwater_*) | max | summing peaks conflates one 1 s spike with 1000 × 1 ms spikes |
| Instantaneous gauges (nr_threads, fair_slice_ns) | max | summing a sampled instant has no physical meaning |
| Bounded ordinals (nice, priority, processor) | [min, max] range | a shift on either end stays visible |
| Categorical (policy, state) | mode + count/total | no arithmetic on categories; delta is same / differs |
| CPU affinity | min/max CPU count + uniform flag | heterogeneous groups render N-M cpus (mixed) |
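The reductions in the table are ordinary folds. The sketch below implements four of them as free functions to show why the choice matters, in particular why peaks reduce with max rather than sum. The function names are hypothetical, and unlike the registry described above this sketch does nothing to stop a caller from picking the wrong fold at compile time.

```rust
use std::collections::BTreeMap;

/// Cumulative counters: totals compose.
fn reduce_sum(values: &[u64]) -> u64 {
    values.iter().sum()
}

/// Lifetime peaks and sampled gauges: keep the largest reading.
fn reduce_max(values: &[u64]) -> Option<u64> {
    values.iter().copied().max()
}

/// Bounded ordinals (for example nice): keep both ends of the range.
fn reduce_range(values: &[i64]) -> Option<(i64, i64)> {
    Some((*values.iter().min()?, *values.iter().max()?))
}

/// Categorical values: most common value, its count, and the group size.
/// Ties resolve to the lexicographically last value (artifact of this sketch).
fn reduce_mode<'a>(values: &[&'a str]) -> Option<(&'a str, usize, usize)> {
    let mut counts: BTreeMap<&'a str, usize> = BTreeMap::new();
    for v in values {
        *counts.entry(*v).or_insert(0) += 1;
    }
    counts
        .into_iter()
        .max_by_key(|&(_, n)| n)
        .map(|(v, n)| (v, n, values.len()))
}
```

One 1 s peak and a thousand 1 ms peaks have the same sum but very different maxima, which is the conflation the "Why" column describes.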

Three cross-cutting caveats, stated once:

Swapin / thrashing overlap

Every thrashing event is also a swapin event at the syscall layer. Never sum the two families; rollups OR them with max().

The *_delay_min_ns sentinel

The kernel keeps the smallest non-zero observation, so 0 means “no events observed”, not “saw a zero-ns event”. Disambiguate against the matching *_count.

Shared-mm watermarks

The kernel reads hiwater_rss_bytes / hiwater_vm_bytes from the shared mm_struct, so sibling threads of one process all report the same value; kernel threads read zero by design.

Derived metrics

Derived metrics combine already-aggregated inputs into a scalar with its own scale — they render in a separate ## Derived metrics table on both compare and show. A missing input or zero denominator yields - (not computable), distinct from a computed zero. Representative entries:

| Metric | Formula | Reading it |
| --- | --- | --- |
| cpu_efficiency | run / (run + wait) | fraction of scheduler-tracked time on-CPU; lower = more runqueue waiting |
| avg_slice_ns | run_time_ns / timeslices | average on-CPU slice; catches timeslice-tuning regressions |
| involuntary_csw_ratio | nonvol / (vol + nonvol) | preemption pressure vs cooperative blocking |
| avg_cpu_delay_ns | cpu_delay_total_ns / count | runqueue wait per event, from the delayacct path |
| live_heap_estimate | allocated - deallocated | jemalloc-only live heap; zero is genuine for other allocators |
| total_offcpu_delay_ns | sum of delay buckets, swapin/thrashing OR’d | one off-CPU number; - when delayacct is off entirely |
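Several of these formulas are simple enough to write out, including the "zero denominator renders as -" rule. The sketch below is a hedged illustration: the function names and signatures are hypothetical, missing inputs are not modeled (only zero denominators), the three-decimal rendering is arbitrary, and total_offcpu_delay_ns takes the non-overlapping buckets as a slice because the exact bucket list is not spelled out here.

```rust
/// Divide, returning `None` for a zero denominator (rendered as "-").
fn ratio(num: u64, den: u64) -> Option<f64> {
    (den != 0).then(|| num as f64 / den as f64)
}

/// run / (run + wait)
fn cpu_efficiency(run_ns: u64, wait_ns: u64) -> Option<f64> {
    ratio(run_ns, run_ns + wait_ns)
}

/// run_time_ns / timeslices
fn avg_slice_ns(run_time_ns: u64, timeslices: u64) -> Option<f64> {
    ratio(run_time_ns, timeslices)
}

/// nonvol / (vol + nonvol)
fn involuntary_csw_ratio(vol: u64, nonvol: u64) -> Option<f64> {
    ratio(nonvol, vol + nonvol)
}

/// Sum of delay buckets, with swapin and thrashing combined via max()
/// instead of being added, since the two families overlap.
fn total_offcpu_delay_ns(other_buckets: &[u64], swapin_ns: u64, thrashing_ns: u64) -> u64 {
    other_buckets.iter().sum::<u64>() + swapin_ns.max(thrashing_ns)
}

/// Render a derived value; "not computable" is distinct from a computed zero.
fn render(value: Option<f64>) -> String {
    value.map_or_else(|| "-".to_string(), |x| format!("{x:.3}"))
}
```

Keeping the result as Option until render time is what preserves the difference between a computed zero and "not computable".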

The full registry (17 derived + every primary metric) is enumerated by metric-list; all names are valid --sort-by and --metrics keys.

Output and interpretation

The comparison prints raw numbers and percent delta. There are no judgment labels (regression vs. improvement) — whether “run_time went up 15%” is good depends on whether you measured a CPU-bound workload (more work done) or a spin-wait pathology (more time wasted). The interpretation is scheduler-specific and left to the operator. Ratio-valued rows suppress the % column: the absolute delta of a [0, 1] quantity already carries percentage-point semantics.

show

show renders a single snapshot as a per-group table — same grouping, filtering, and sorting surface as compare, minus the diff columns. Default grouping is pcomm.

## Primary metrics
 pcomm                                    threads  metric         value
...
 kworker/u{N}:{N}-btrfs-endio-write       23       run_time_ns    41.186s
 kworker/u{N}:{N}-btrfs-endio-write       23       voluntary_csw  1.249M
 kworker/u{N}:{N}-btrfs-endio-write       23       nr_migrations  26.759K
 kworker/u{N}:{N}-btrfs-endio             47       run_time_ns    40.940s
... 464 more lines truncated (use --limit 0 for unlimited)

--columns controls the rendered column set; show accepts group, threads, metric, value, tags, uptime, while compare accepts group, threads, metric, baseline, candidate, delta, %, arrow, tags, uptime — each side rejects the other’s diff/value columns.
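The two column vocabularies and the mutual rejection can be modeled as a membership check. In the sketch below the column names are the ones listed above; the comma-separated syntax for the --columns value, the constant and function names, and the error text are assumptions made for illustration.

```rust
/// Columns accepted by `show` (from the list above).
const SHOW_COLUMNS: &[&str] = &["group", "threads", "metric", "value", "tags", "uptime"];

/// Columns accepted by `compare` (from the list above).
const COMPARE_COLUMNS: &[&str] = &[
    "group", "threads", "metric", "baseline", "candidate", "delta", "%", "arrow", "tags",
    "uptime",
];

/// Validate a comma-separated column list against one subcommand's vocabulary.
fn validate_columns<'a>(spec: &'a str, allowed: &[&str]) -> Result<Vec<&'a str>, String> {
    spec.split(',')
        .map(str::trim)
        .map(|col| {
            if allowed.contains(&col) {
                Ok(col)
            } else {
                Err(format!("unknown column: {col}"))
            }
        })
        .collect()
}
```

Because value exists only on the show side and baseline, candidate, delta, %, and arrow exist only on the compare side, a column list written for one subcommand may need adjusting before it is reused with the other.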

metric-list

metric-list prints every registered metric with its tags and a one-line description — the authoritative vocabulary for --metrics and --sort-by, plus a tag legend explaining which kernels populate each counter:

## Metrics

 metric                        tags                           description
...
 run_time_ns                   [SCHED_INFO]                   Cumulative on-CPU time, ns; /proc/<tid>/schedstat field 1.
 wait_time_ns                  [SCHED_INFO]                   Cumulative time waiting on the runqueue, ns; schedstat field 2.
 timeslices                    [SCHED_INFO]                   Number of times the task was run on a CPU; schedstat field 3.
 voluntary_csw                                                Voluntary context switches (task gave up the CPU itself).
 nonvoluntary_csw                                             Involuntary context switches (task was preempted).
 nr_wakeups                    [SCHEDSTATS]                   Total wakeups via try_to_wake_up().
...
 nr_wakeups_affine             [cfs-only] [SCHEDSTATS]        Wakeups that succeeded under the wake_affine() heuristic.

The bracketed tags mark scheduler-class gates ([cfs-only] counters stay zero under sched_ext) and kconfig gates ([SCHEDSTATS], [TASK_DELAY_ACCT], …) so you know whether a zero means “idle” or “not compiled in”.

Taskstats delay accounting

The kernel’s TASKSTATS genetlink family delivers per-task delay-accounting and memory-watermark fields that are not exposed via /proc/<tid>/sched or /proc/<tid>/stat. The 34 captured fields (8 delay categories × 4 bucket fields + 2 watermarks) all tag the taskstats-delay section so they can be filtered as a unit.

Capability and kconfig gating

Querying the netlink family requires CAP_NET_ADMIN on the capturing process. A non-root operator running ktstr ctprof capture hits EPERM on the first query and every taskstats field collapses to zero per the best-effort contract.

  • Delay-accounting fields require CONFIG_TASKSTATS=y and CONFIG_TASK_DELAY_ACCT=y and the runtime delayacct=on toggle (boot param or kernel.task_delayacct=1). A kernel built with both configs but launched without the toggle produces all-zero delay readings. ktstr’s standard kernel build includes both kconfigs, and the test harness adds delayacct to the guest cmdline.
  • Memory-watermark fields (hiwater_rss_bytes, hiwater_vm_bytes) require CONFIG_TASKSTATS=y and CONFIG_TASK_XACCT=y, and do not respond to the runtime toggle. See the shared-mm caveat.

The structured tally on CtprofSnapshot::taskstats_summary (ok_count / eperm_count / esrch_count / other_err_count) distinguishes “kernel doesn’t expose this” (netlink open failed, all counters zero) from “every tid raced exit” (high esrch_count) from “CAP_NET_ADMIN missing” (high eperm_count). There is no CLI lens for it yet — read it from the snapshot JSON (zstd -d < snap.ctprof.zst | jq .taskstats_summary).
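Once the four counters are in hand, the three situations can be told apart mechanically. The sketch below is one hypothetical way to do it. The struct mirrors the field names above, but the u64 field types, the "largest counter wins" rule, and the returned strings are all assumptions made for illustration.

```rust
/// Mirror of the four tally fields named above (types assumed).
struct TaskstatsSummary {
    ok_count: u64,
    eperm_count: u64,
    esrch_count: u64,
    other_err_count: u64,
}

/// Classify a tally by its dominant counter.
fn diagnose(s: &TaskstatsSummary) -> &'static str {
    let total = s.ok_count + s.eperm_count + s.esrch_count + s.other_err_count;
    if total == 0 {
        // No query was even attempted: the netlink family could not be opened.
        return "taskstats not exposed (netlink open failed)";
    }
    let worst = s.eperm_count.max(s.esrch_count).max(s.other_err_count);
    if s.ok_count >= worst {
        "mostly ok"
    } else if worst == s.eperm_count {
        "CAP_NET_ADMIN missing"
    } else if worst == s.esrch_count {
        "threads raced exit"
    } else {
        "other errors dominate"
    }
}
```

A real check would likely want thresholds rather than a strict maximum, since a few ESRCH results are normal on any busy host.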

The eight delay categories

| Category | Kernel source | Notes |
| --- | --- | --- |
| cpu_delay_* | tsk->sched_info via delayacct_add_tsk (kernel/delayacct.c) | Runqueue wait. Count and total update locklessly, so a reader can transiently observe one ahead of the other: averages are approximate at sub-event scale, but stable once integrated over many events. Same bucket as schedstat wait_* via a different path. |
| blkio_delay_* | delayacct_blkio_start/_end | Synchronous block-I/O wait; updates serialize through the task’s delay lock. The canonical delayacct block-I/O reading, distinct from schedstat iowait_sum. |
| swapin_delay_* | delayacct_swapin_start/_end | Swap-in wait. Overlaps thrashing. |
| freepages_delay_* | called from mm/vmscan.c | Direct-reclaim wait. |
| thrashing_delay_* | called from mm/filemap.c, mm/page_io.c | Thrashing wait; refines swapin tracking. |
| compact_delay_* | called from mm/page_alloc.c | Memory-compaction wait. |
| wpcopy_delay_* | called from mm/memory.c, mm/hugetlb.c | Write-protect-copy (CoW) fault wait. Taskstats v13+. |
| irq_delay_* | delayacct_irq | IRQ-handler windows charged to the task. Taskstats v14+. On kernels predating a bucket, the missing fields read zero from the truncated payload. |

Each category carries four fields: *_count (windows observed), *_delay_total_ns (cumulative), *_delay_max_ns (longest single window), and *_delay_min_ns (shortest non-zero window — mind the sentinel).

File format

.ctprof.zst is zstd-compressed JSON of CtprofSnapshot. The schema is #[non_exhaustive] so field additions do not break existing snapshots. The top level carries threads, cgroup_stats, host-level psi, optional sched_ext sysfs state, the capture summaries (probe_summary, parse_summary, taskstats_summary), and an embedded HostContext — the same structure show-host prints — for round-trip tooling. Thread start times are recorded in USER_HZ (100 on x86_64 and aarch64), so cross-host comparison between differently-configured kernels on those architectures is meaningful.

Extending ctprof

Adding a metric to the registry is a typed three-step change (field newtype → capture wiring → registry entry) designed so a mismatched aggregation fails to compile. See the module documentation for ktstr::ctprof_compare in the rustdoc.

ktstr (standalone)

ktstr is the standalone debugging companion to the #[ktstr_test] test harness. It owns interactive VM shells, host topology inspection, host-wide per-thread profiling, and lock introspection — the operations a scheduler author reaches for when investigating a test failure.

To run the test suite, use cargo ktstr test; to reproduce a test as a self-contained script without a VM, use cargo ktstr export.

A typical failure investigation chains the two binaries: when a test fails, read its output first, then boot the same environment interactively with cargo ktstr shell --test NAME. If the question is “what else changed on this host?”, bracket the workload with ktstr ctprof capture and diff the snapshots.

Build from the workspace:

cargo build --bin ktstr

Subcommands

topo

Show the host CPU topology — the same view the resource planner and performance mode use:

$ ktstr topo
CPUs:       64
LLCs:       4
NUMA nodes: 1
  LLC 0 (node 0): [0, 1, 2, 3, 4, 5, 6, 7, 32, 33, 34, 35, 36, 37, 38, 39]
  LLC 1 (node 0): [8, 9, 10, 11, 12, 13, 14, 15, 40, 41, 42, 43, 44, 45, 46, 47]
  LLC 2 (node 0): [16, 17, 18, 19, 20, 21, 22, 23, 48, 49, 50, 51, 52, 53, 54, 55]
  LLC 3 (node 0): [24, 25, 26, 27, 28, 29, 30, 31, 56, 57, 58, 59, 60, 61, 62, 63]

This box is 1n4l8c2t in ktstr’s topology notation: 1 NUMA node, 4 LLCs, 8 cores per LLC, 2 threads per core — note the SMT siblings (CPU 0 pairs with CPU 32).
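
The notation can be decoded mechanically. The following is a minimal sketch, not ktstr's own parser: the `Topo` type and its methods are hypothetical names, and the CPU total assumes that the `l` dimension counts LLCs across the whole host rather than per NUMA node (the two readings agree on this single-node box).

```rust
/// Topology dimensions in the order the notation lists them.
/// Hypothetical type for illustration; not part of ktstr's API.
#[derive(Debug, PartialEq, Eq)]
struct Topo {
    numa_nodes: u32,
    llcs: u32,
    cores_per_llc: u32,
    threads_per_core: u32,
}

impl Topo {
    /// Parse a string such as "1n4l8c2t". Returns None on malformed
    /// input or when any dimension is zero.
    fn parse(s: &str) -> Option<Topo> {
        let mut dims = [0u32; 4];
        let mut rest = s;
        for (slot, suffix) in dims.iter_mut().zip(['n', 'l', 'c', 't']) {
            // Each dimension is a decimal number followed by its suffix letter.
            let end = rest.find(suffix)?;
            let value: u32 = rest[..end].parse().ok()?;
            if value == 0 {
                return None;
            }
            *slot = value;
            rest = &rest[end + 1..];
        }
        if !rest.is_empty() {
            return None; // trailing characters after the 't' dimension
        }
        Some(Topo {
            numa_nodes: dims[0],
            llcs: dims[1],
            cores_per_llc: dims[2],
            threads_per_core: dims[3],
        })
    }

    /// Logical CPU count, assuming `llcs` is the host-wide LLC total.
    fn total_cpus(&self) -> u32 {
        self.llcs * self.cores_per_llc * self.threads_per_core
    }
}

fn main() {
    let topo = Topo::parse("1n4l8c2t").expect("well-formed notation");
    println!("{:?} has {} logical CPUs", topo, topo.total_cpus());
}
```

For the host shown above, the sketch yields 4 × 8 × 2 = 64 logical CPUs, matching the CPUs line of the ktstr topo output.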

kernel

Manage cached kernel images: list, build, clean. Identical to the cargo-ktstr kernel subcommands — see there for full documentation.

shell

Boot an interactive shell in a KVM VM. The guest is a busybox userland with your files mounted at /include-files/:

ktstr shell
ktstr shell --kernel ../linux
ktstr shell --kernel 6.14.2 --topology 1,2,4,1
ktstr shell -i ./my-binary -i strace
ktstr shell --exec 'cat /proc/schedstat'

Files and directories passed via -i land at /include-files/<name> inside the guest. Directories are walked recursively; bare names (no path separator) are resolved via PATH; dynamically-linked ELF binaries get automatic shared-library resolution, and non-ELF files are copied as-is.

Stdin must be a terminal: the host terminal enters raw mode for bidirectional forwarding, and the saved terminal state is restored on exit paths ktstr controls (normal exit, errors, catchable fatal signals). A SIGKILL cannot be intercepted — run reset if the terminal is left raw.

| Flag | Default | Description |
| --- | --- | --- |
| --kernel ID | auto | Same kernel grammar as cargo ktstr test --kernel (path, version, cache key, range, git source), resolving to a single kernel; raw image files are rejected here. When absent, resolves via cache then filesystem, falling back to downloading the latest stable kernel. |
| --topology N,L,C,T | 1,1,1,1 | Virtual topology as numa_nodes,llcs,cores,threads. All values must be >= 1. |
| -i, --include-files PATH | | Files or directories to include in the guest. Repeatable. |
| --memory-mib MiB | auto | Guest memory in MiB (minimum 128). When absent, estimated from payload and include-file sizes. |
| --dmesg | off | Forward the guest kernel console (COM1/dmesg) to stderr in real time; sets loglevel=7. |
| --exec CMD | | Run a command instead of an interactive shell; the VM exits when it completes. |
| --exec-timeout DURATION | 120s | Max wall-clock for a --exec payload before the VM is killed (30s, 5m, 1h). |
| --no-perf-mode | off | Disable all performance mode features. Also via KTSTR_NO_PERF_MODE. |
| --cpu-cap N | unset | Reserve only N host CPUs for the shell VM; requires --no-perf-mode. See Resource Budget. |
| --disk SIZE | unset | Attach a raw virtio-blk disk at /dev/vda, backed by a fresh sparse tempfile. IEC sizes only (256mib, 1gib). |
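
To make the "IEC sizes only" rule for --disk concrete, here is a rough sketch of a parser that accepts binary suffixes and rejects decimal ones. It is an illustration under assumptions, not ktstr's implementation: the function name `parse_iec_size`, the accepted suffix set (kib, mib, gib, tib), and the case-insensitive matching are all guesses beyond the two examples given in the table.

```rust
/// Parse an IEC size such as "256mib" or "1gib" into bytes.
/// Hypothetical helper: the suffix set and case handling are assumptions.
fn parse_iec_size(s: &str) -> Option<u64> {
    let lower = s.trim().to_ascii_lowercase();
    // Split at the first non-digit; a bare number has no suffix and is rejected.
    let split = lower.find(|c: char| !c.is_ascii_digit())?;
    let (digits, suffix) = lower.split_at(split);
    let value: u64 = digits.parse().ok()?;
    let shift = match suffix {
        "kib" => 10,
        "mib" => 20,
        "gib" => 30,
        "tib" => 40,
        _ => return None, // decimal suffixes such as "mb" are not IEC
    };
    // checked_mul guards against overflow on absurdly large inputs.
    value.checked_mul(1u64 << shift)
}

fn main() {
    for input in ["256mib", "1gib", "256mb"] {
        println!("{} -> {:?}", input, parse_iec_size(input));
    }
}
```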

cargo ktstr shell runs the same boot flow with two additions: it accepts raw kernel image files for --kernel, and it has --test NAME to derive topology, memory, and include files from a registered #[ktstr_test].

ctprof

Capture or compare a host-wide per-thread snapshot — for “the scheduler looks fine but something on the host is still behaving oddly”. Every visible thread’s scheduling, memory, and I/O counters are snapshotted as zstd-compressed JSON:

ktstr ctprof capture --output baseline.ctprof.zst
# ... run the workload of interest ...
ktstr ctprof capture --output candidate.ctprof.zst
ktstr ctprof compare baseline.ctprof.zst candidate.ctprof.zst

Cumulative counters and lifetime peaks are probe-timing-invariant — sampled twice, a value either increased monotonically or stayed at its high-water mark — so a diff between two snapshots measures exactly the activity over the window. Capture uses no kprobes or kernel tracing and does not modify thread state; the only exception is the jemalloc-only memory fields, read by briefly ptrace-attaching jemalloc-linked processes (needs root, CAP_SYS_PTRACE, or ptrace_scope=0; recorded as zero when denied).

compare joins two snapshots on a grouping axis and renders per-metric baseline/candidate/delta rows, sorted by largest relative change. Real output (a cargo build ran between the snapshots):

## Primary metrics
 comm                              threads  metric             value                delta      %         %uptime
 kworker/{N}:{N}-mm_percpu_wq
     kworker/{N}:{N}-mm_percpu_wq  11→37    voluntary_csw      8.697K → 101.154K    +92.457K   +1063.1%  93%
     kworker/{N}:{N}-mm_percpu_wq  11→37    timeslices         8.699K → 101.166K    +92.467K   +1063.0%  93%
     kworker/{N}:{N}-mm_percpu_wq  11→37    wait_time_ns       2.684s → 27.653s     +24.969s   +930.2%   93%
     kworker/{N}:{N}-mm_percpu_wq  11→37    stime_clock_ticks  22ticks → 217ticks   +195ticks  +886.4%   93%
     kworker/{N}:{N}-mm_percpu_wq  11→37    run_time_ns        243.378ms → 2.320s   +2.077s    +853.4%   93%
...
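
The delta and % columns in the output above follow from plain arithmetic on the two snapshot values. The sketch below reproduces two of the rows and the default "largest relative change first" ordering. The `Row` type is hypothetical, the inputs are the rounded values as printed, and placing zero-baseline rows first is an assumption rather than documented ktstr behavior.

```rust
/// One metric row of a two-snapshot comparison (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
struct Row {
    metric: &'static str,
    baseline: f64,
    candidate: f64,
}

impl Row {
    /// Absolute change between the snapshots.
    fn delta(&self) -> f64 {
        self.candidate - self.baseline
    }

    /// Relative change in percent; None when the baseline is zero.
    fn delta_pct(&self) -> Option<f64> {
        if self.baseline == 0.0 {
            None
        } else {
            Some(self.delta() / self.baseline * 100.0)
        }
    }
}

/// Sort rows by descending |delta_pct|. Rows without a defined
/// percentage sort first here; that placement is an assumption.
fn sort_by_relative_change(rows: &mut [Row]) {
    rows.sort_by(|a, b| {
        let ka = a.delta_pct().map(f64::abs).unwrap_or(f64::INFINITY);
        let kb = b.delta_pct().map(f64::abs).unwrap_or(f64::INFINITY);
        kb.total_cmp(&ka)
    });
}

fn main() {
    let mut rows = vec![
        Row { metric: "stime_clock_ticks", baseline: 22.0, candidate: 217.0 },
        Row { metric: "voluntary_csw", baseline: 8697.0, candidate: 101154.0 },
    ];
    sort_by_relative_change(&mut rows);
    for r in &rows {
        let pct = r.delta_pct().unwrap_or(f64::NAN);
        println!("{:<20} {:+} {:+.1}%", r.metric, r.delta(), pct);
    }
}
```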

Thread names are token-normalized (kworker/3:1 and kworker/7:0 fold into kworker/{N}:{N}), so the join key survives across process restarts and even across hosts — deltas reflect the named workload, not a specific pid.
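
As a rough illustration of the folding described above, the sketch below replaces every run of ASCII digits with the {N} token. This is an assumed rule that happens to reproduce the kworker example. ktstr's actual tokenizer may treat some digits differently, and the function name `normalize_comm` is hypothetical.

```rust
/// Replace each run of ASCII digits in a thread name with "{N}".
/// Assumed rule for illustration; not ktstr's actual normalizer.
fn normalize_comm(comm: &str) -> String {
    let mut out = String::with_capacity(comm.len());
    let mut in_digits = false;
    for ch in comm.chars() {
        if ch.is_ascii_digit() {
            // Emit the token once per digit run, then skip the rest of the run.
            if !in_digits {
                out.push_str("{N}");
                in_digits = true;
            }
        } else {
            in_digits = false;
            out.push(ch);
        }
    }
    out
}

fn main() {
    for name in ["kworker/3:1", "kworker/7:0", "kworker/12:3-mm_percpu_wq"] {
        println!("{} -> {}", name, normalize_comm(name));
    }
}
```

Because both kworker names map to the same string, they land in the same group when the comparison joins the two snapshots.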

Choosing --group-by: start with the default all (cgroup, then process, then thread pattern — it folds renamed-but-identical cgroups together); use pcomm when you think in processes, cgroup when comparing services or containers, and comm / comm-exact when a single thread pool is the suspect. Most-used compare flags:

| Flag | Default | Description |
| --- | --- | --- |
| --group-by AXIS | all | all, pcomm, cgroup, comm, or comm-exact (literal thread names). |
| --sections NAMES | every | Sub-tables to render, e.g. primary, taskstats-delay, derived, pressure, smaps-rollup. |
| --metrics NAMES | every | Metric allowlist (names from ktstr ctprof metric-list). |
| --sort-by SPEC | largest \|delta_pct\| | Multi-key sort: metric[:asc\|desc],.... |
| --limit N | 500 | Max rendered lines per section; 0 disables truncation. |

show renders a single snapshot without diff math, and metric-list prints the metric vocabulary — see the ctprof reference for those, the full flag tables, aggregation rules, and taskstats kconfig gating.

locks

Enumerate every ktstr flock held on this host — read-only, never acquires anything. When a build or test stalls behind a peer’s reservation, ktstr locks names the peer without disturbing it:

$ ktstr locks
LLC locks:
 LLC  NODE  LOCKFILE               HOLDERS
 0    0     /tmp/ktstr-llc-0.lock  <none recorded>
 1    0     /tmp/ktstr-llc-1.lock  <none recorded>
...

Run-dir locks:
 RUN KEY               LOCKFILE                                       HOLDERS
 7.0.14-73730e0-dirty  target/ktstr/.locks/7.0.14-73730e0-dirty.lock  <none recorded>

An idle host shows <none recorded>; while a lock is held, the HOLDERS column names the holder’s PID and cmdline (cross-referenced against /proc/locks). Four lock-file roots are scanned:

  • {KTSTR_LOCK_DIR}/ktstr-llc-*.lock (default /tmp) — per-LLC reservations held by perf-mode test runs and --cpu-cap-bounded builds.
  • {KTSTR_LOCK_DIR}/ktstr-cpu-*.lock — per-CPU reservations from the same flow.
  • {cache_root}/.locks/*.lock — kernel-cache entry locks held during kernel build writes, plus per-source-tree locks held while building from a path.
  • {runs_root}/.locks/{kernel}-{project_commit}.lock — sidecar write locks serializing concurrent runs targeting the same run directory.

| Flag | Default | Description |
| --- | --- | --- |
| --json | off | JSON snapshot (pretty in one-shot mode; ndjson under --watch). |
| --watch DURATION | unset | Redraw at the interval until SIGINT (100ms, 1s, 5m). |
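
The cross-reference against /proc/locks mentioned above can be pictured as matching a lock file's inode against the kernel's lock table. The sketch below parses lines in the standard /proc/locks layout (ordinal, class, mode, access, PID, major:minor:inode, start, end) and returns the PIDs holding a flock on a given inode. It is an approximation of the idea, not ktstr's code: the type and function names are hypothetical and the sample text is made up for the example.

```rust
/// The fields of one /proc/locks entry that matter for holder lookup.
#[derive(Debug, PartialEq, Eq)]
struct LockEntry {
    class: String, // "FLOCK", "POSIX", ...
    pid: i32,      // may be -1 when the lock has no owning process
    inode: u64,
}

/// Parse one /proc/locks line, e.g.
/// "1: FLOCK  ADVISORY  WRITE 4242 00:2b:131073 0 EOF".
fn parse_proc_locks_line(line: &str) -> Option<LockEntry> {
    let mut fields = line.split_whitespace().peekable();
    fields.next()?; // ordinal, e.g. "1:"
    if fields.peek() == Some(&"->") {
        fields.next(); // marker on entries that are blocked waiters
    }
    let class = fields.next()?.to_string();
    fields.next()?; // ADVISORY or MANDATORY
    fields.next()?; // READ or WRITE
    let pid: i32 = fields.next()?.parse().ok()?;
    // Device and inode are printed as major:minor:inode; inode is decimal.
    let inode: u64 = fields.next()?.rsplit(':').next()?.parse().ok()?;
    Some(LockEntry { class, pid, inode })
}

/// PIDs of every flock entry on `inode` in the given /proc/locks text.
fn flock_holders(proc_locks: &str, inode: u64) -> Vec<i32> {
    proc_locks
        .lines()
        .filter_map(parse_proc_locks_line)
        .filter(|e| e.class == "FLOCK" && e.inode == inode)
        .map(|e| e.pid)
        .collect()
}

fn main() {
    // Made-up sample; on a real host, read /proc/locks instead.
    let sample = "1: FLOCK  ADVISORY  WRITE 4242 00:2b:131073 0 EOF\n\
                  2: POSIX  ADVISORY  READ 977 08:02:5521 128 255\n";
    println!("holders of inode 131073: {:?}", flock_holders(sample, 131073));
}
```

On a real host the inode would come from the lock file's metadata (for example via `std::os::unix::fs::MetadataExt::ino`), and the PID could then be resolved to a command line through /proc/PID/cmdline.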

Available identically as cargo ktstr locks. The reservation model behind these locks is documented in Resource Budget.

completions

Generate shell completions (bash, zsh, fish, elvish, powershell):

ktstr completions bash

--binary NAME overrides the registered name when invoking ktstr through a differently-named symlink. The same subcommand exists as cargo ktstr completions.

Assertable Metrics

Every regression comparison — cargo ktstr perf-delta and the per-test PerfDeltaAssertion gate — is driven by the metric registry: the static ktstr::stats::METRICS table. Each entry carries a metric’s name, its regression polarity, its aggregation kind, the dual-gate significance thresholds, and a display unit. This chapter explains those fields, how to enumerate the live catalog, which workloads emit which metric families, and how to pin a per-test regression gate.

The catalog: stats list-metrics

The authoritative, always-current catalog is the command output — it enumerates the registry directly, so it never drifts from the code:

cargo ktstr stats list-metrics          # text table
cargo ktstr stats list-metrics --json   # machine-readable (includes kind + every field)

 NAME                                    POLARITY       DEFAULT_ABS  DEFAULT_REL  UNIT
 worst_spread                            lower          5            0.25         %
 worst_gap_ms                            lower          500          0.5          ms
 total_migrations                        lower          2            0.3
 worst_migration_ratio                   lower          0.05         0.2
 max_imbalance_ratio                     lower          1            0.25         x
...
 worst_p99_wake_latency_us               lower          50           0.25         µs
 worst_median_wake_latency_us            lower          20           0.25         µs
...
 iteration_rate                          higher         1            0.3          iter/s
 total_iterations                        higher         2            0.1

list-metrics reads only the static registry; it needs no sidecar pool. Which of these metrics a particular run actually carries depends on the emitting workload — see Workload → emitted metrics. (cargo ktstr stats list-values enumerates the pool’s filter dimensions — kernels, commits, schedulers, topologies, work types — not its metric keys, so it cannot answer which metrics are present.)

Registry fields

  • name — the metric key (e.g. worst_spread, worst_gap_ms, sched_count_per_sec). This is the string a PerfDeltaAssertion names and the key perf-delta reports on.

  • polarity — the regression direction:

    • LowerBetter — an increase is a regression (latency, spread).
    • HigherBetter — a decrease is a regression (throughput, iterations).
    • Informational — directionless: a change is shown but never counted as a regression or improvement and never gates the exit.
    • TargetValue(t) / Unknown — also exist (rendered target(t) / unknown by list-metrics) but no registered metric uses them today.
  • kind — how per-sample readings fold into the run-level value: Counter (sum), Peak (max-of-max), Gauge (average, last, or max, per metric), Rate (re-derived ratio), plus phase-aware kinds such as DeltaSum that fold pre-deltaed per-phase readings. The kind decides whether the cross-run fold is a mean, a max, or a re-derived ratio.

  • default_abs / default_rel — the dual gate. A move counts as a confident regression only when it clears both the absolute floor (default_abs, in the metric’s units) and the relative threshold (default_rel, a fraction). The absolute floor’s role depends on the metric’s dynamic range:

    • Scale-bounded metrics (fractions, ratios, % spread, ms/µs latencies) use default_abs as a fixed unit-scale noise floor — a sub-unit move is immaterial regardless of its relative size.
    • Scale-varying metrics (*_per_sec rates, ops/s, req/s, raw counts) can span orders of magnitude across workloads, so a fixed floor would mask a large relative regression on a low-throughput workload. For these, default_abs is only a near-idle activity guard and default_rel carries materiality: a 40% drop is flagged whether the baseline is 50/s or 50000/s.

    perf-delta --threshold PCT / --policy FILE override the relative gate; the absolute gate is per-metric.

  • display_unit — the unit rendered in tables (ms, /s, ns, …).
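
The interplay of polarity and the dual gate described in the list above can be sketched as a single predicate. This is a simplified model, not the registry's code: the enum mirrors only three of the polarities, the function name is hypothetical, and two details are assumptions: treating "clears" as greater-than-or-equal, and treating a zero baseline as never gating.

```rust
/// Regression direction, mirroring three of the polarities listed above.
#[derive(Clone, Copy, Debug)]
enum Polarity {
    LowerBetter,
    HigherBetter,
    Informational,
}

/// True when the move from `baseline` to `candidate` worsens the metric
/// and clears BOTH the absolute floor and the relative threshold.
fn is_confident_regression(
    baseline: f64,
    candidate: f64,
    polarity: Polarity,
    default_abs: f64,
    default_rel: f64,
) -> bool {
    // Size of the move in the "worse" direction for this polarity.
    let worsening = match polarity {
        Polarity::LowerBetter => candidate - baseline,
        Polarity::HigherBetter => baseline - candidate,
        Polarity::Informational => return false, // never gates
    };
    if worsening <= 0.0 || baseline == 0.0 {
        return false; // improved, unchanged, or no meaningful baseline
    }
    worsening >= default_abs && worsening / baseline.abs() >= default_rel
}

fn main() {
    // worst_spread: lower is better, default_abs = 5, default_rel = 0.25.
    let spread = |b, c| is_confident_regression(b, c, Polarity::LowerBetter, 5.0, 0.25);
    println!("20 -> 26:   {}", spread(20.0, 26.0));
    println!("20 -> 24:   {}", spread(20.0, 24.0));
    println!("100 -> 110: {}", spread(100.0, 110.0));
    // iteration_rate: higher is better, default_abs = 1, default_rel = 0.3.
    let rate = |b, c| is_confident_regression(b, c, Polarity::HigherBetter, 1.0, 0.3);
    println!("50 -> 30:   {}", rate(50.0, 30.0));
    let info = is_confident_regression(1.0, 9.0, Polarity::Informational, 0.0, 0.0);
    println!("informational: {}", info);
}
```

With the worst_spread defaults, a move from 20 to 24 fails the absolute floor and a move from 100 to 110 fails the relative threshold, so only 20 to 26 counts. With the iteration_rate defaults, a 40% drop clears the gate at both 50/s and 50000/s because the absolute floor is tiny relative to either baseline.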

Workload → emitted metrics

A metric only appears in a comparison if the run actually emitted it.

| Family | Example metrics | Emitted by | Present when |
| --- | --- | --- | --- |
| Spread / gap | worst_spread, worst_gap_ms | every scenario (scheduling-latency capture) | always |
| Iteration throughput | total_iterations, worst_iterations_per_cpu_sec | compute / spin workloads | the workload iterates; the *_per_cpu_sec form is overcommit-invariant |
| schedstat counters / rates | total_run_delay_ns_per_sched, total_ttwu_count, sched_count_per_sec | schedstat sampling over the run | schedstat capture enabled |
| IRQ / pressure | avg_irq_util, total_irq_pressure_us, max_cgroup_psi_irq_avg10 | IRQ-heavy scenarios, periodic host-pressure capture | those captures ran |
| NUMA locality | worst_page_locality, worst_cross_node_migration_ratio | NUMA-aware scenarios | multi-node topology |
| Payload metrics | sched_delay_msg_us, taobench_total_qps, schbench_loop_count | schbench / taobench payloads | the payload ran and reported |

Not every registry name can back a gate. perf-delta --must-fail rejects the following up front, rather than accepting a gate that would silently never fire: unknown names, internal rate components, per-phase-only metrics, and (without --noise-adjust) whole-run distribution metrics and informational metrics.

PerfDeltaAssertion how-to

A PerfDeltaAssertion is a per-test performance-regression gate. It is inert during a normal cargo ktstr test run (the in-VM verdict never consults it) and active only under cargo ktstr perf-delta --noise-adjust, which serializes the declaration into the sidecar and enforces it host-side. Plain (scalar) perf-delta does not evaluate declared gates: gating on a single run would flip CI on noise, so only the multi-run --noise-adjust path (Welch / disjoint-band separation) is a sound basis. Declaring a gate requires performance_mode (checked by the macro at compile time and by test discovery at run time).

A declaration names a registry metric and overrides, for this test, the gate that decides a confident regression on it. It layers on top of the --noise-adjust all-metrics regression net (which still runs to catch unknown-unknown regressions) — it is an explicit contract check, not a whitelist.

Bind each gate to a const and list it on the macro:

use ktstr::prelude::*;

// Name any metric from `cargo ktstr stats list-metrics`.
const SPREAD_GATE: PerfDeltaAssertion =
    PerfDeltaAssertion::new("worst_spread").with_max_regression_pct(5.0);

#[ktstr_test(performance_mode = true, perf_delta_assertions = [SPREAD_GATE])]
fn schbench_steady() -> Scenario {
    // ... a degenerate / steady-state scenario whose worst_spread
    // must not regress more than 5% against the baseline commit.
}

Builders (all const fn, chainable):

  • .with_max_regression_pct(pct) — relative gate: a worsening move larger than pct% of the baseline gates. Unset → registry default_rel.
  • .with_min_abs(min) — absolute-materiality floor: a move smaller than min (in the metric’s units) never gates. Unset → registry default_abs.
  • .with_direction(polarity) — pin the regression direction instead of inheriting the registry polarity (e.g. treat an Informational metric as LowerBetter for this test).
  • .with_phase(step_index) — scope the gate to one phase (0 = BASELINE, 1..=N = scenario Step ordinals) instead of the whole-run value.
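
The "Unset → registry default" fallback in the list above can be modeled in a few lines. The types below are hypothetical stand-ins, not ktstr's PerfDeltaAssertion or registry types. The sketch only shows how a per-test override might resolve against the registry entry, including the unit conversion from a percentage (the builder argument) to a fraction (default_rel).

```rust
/// Per-test overrides; None means "inherit the registry default".
/// Hypothetical model, not ktstr's PerfDeltaAssertion.
#[derive(Clone, Copy, Debug, Default)]
struct GateOverride {
    max_regression_pct: Option<f64>, // percent, e.g. 5.0
    min_abs: Option<f64>,            // in the metric's units
}

/// The two registry thresholds for one metric.
#[derive(Clone, Copy, Debug)]
struct RegistryDefaults {
    default_abs: f64,
    default_rel: f64, // fraction, e.g. 0.25
}

/// The thresholds that actually apply after merging.
#[derive(Clone, Copy, Debug, PartialEq)]
struct EffectiveGate {
    min_abs: f64,
    max_rel: f64, // fraction
}

fn resolve(over: GateOverride, reg: RegistryDefaults) -> EffectiveGate {
    EffectiveGate {
        min_abs: over.min_abs.unwrap_or(reg.default_abs),
        // The override is a percentage; the registry stores a fraction.
        max_rel: over
            .max_regression_pct
            .map(|pct| pct / 100.0)
            .unwrap_or(reg.default_rel),
    }
}

fn main() {
    // Registry values for worst_spread as printed by list-metrics.
    let worst_spread = RegistryDefaults { default_abs: 5.0, default_rel: 0.25 };
    let declared = GateOverride { max_regression_pct: Some(5.0), min_abs: None };
    println!("{:?}", resolve(declared, worst_spread));
    println!("{:?}", resolve(GateOverride::default(), worst_spread));
}
```

Under this model, the SPREAD_GATE declaration above tightens the relative gate from 25% to 5% while the absolute floor stays at the registry's value of 5.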

Then gate CI with the noise-adjusted compare:

cargo ktstr perf-delta --noise-adjust 5 --kernel 7.0 \
    -E 'test(schbench_steady)'

This runs schbench_steady five times at HEAD and five at the baseline, and fails when worst_spread regresses past the declared 5% gate with statistical confidence. See Runs and Regression Gates for the full perf-delta workflow and the CI chapter for wiring the gate into a pull-request job.
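
For intuition about what "statistical confidence" over five-plus-five runs involves, here is a minimal sketch of Welch's t statistic, the unequal-variance two-sample statistic that the Welch mention refers to. It is a textbook computation on made-up sample values. How ktstr turns the statistic into a verdict, and how the disjoint-band check combines with it, is not specified here and is not modeled.

```rust
/// Arithmetic mean of a non-empty sample.
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Unbiased sample variance (divides by n - 1); needs at least two values.
fn sample_variance(xs: &[f64]) -> f64 {
    let m = mean(xs);
    xs.iter().map(|x| (x - m).powi(2)).sum::<f64>() / (xs.len() as f64 - 1.0)
}

/// Welch's t statistic for `candidate` versus `baseline`:
/// the difference of means over the combined standard error,
/// without assuming the two samples share a variance.
fn welch_t(baseline: &[f64], candidate: &[f64]) -> f64 {
    let se2 = sample_variance(baseline) / baseline.len() as f64
        + sample_variance(candidate) / candidate.len() as f64;
    (mean(candidate) - mean(baseline)) / se2.sqrt()
}

fn main() {
    // Made-up worst_spread readings from five runs on each side.
    let baseline = [1.0, 2.0, 3.0, 4.0, 5.0];
    let head = [6.0, 7.0, 8.0, 9.0, 10.0];
    println!("welch t = {:.3}", welch_t(&baseline, &head));
}
```

A large positive statistic on a lower-is-better metric indicates the candidate runs sit well above the baseline runs relative to their run-to-run noise, which is the kind of separation a single-run comparison cannot establish.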

API Reference

The guide and the rustdoc split the work: the guide explains why and when to reach for an API; the rustdoc carries signatures, field semantics, and edge cases. The complete rustdoc for every ktstr workspace crate is published at ktstr.dev/rustdoc/ktstr, and ktstr::prelude re-exports everything a test author needs.

| You want to | Reach for | Rustdoc | Guide chapter |
| --- | --- | --- | --- |
| Declare a test | #[ktstr_test] | attr.ktstr_test | The #[ktstr_test] Attribute |
| Declare a scheduler | declare_scheduler!, Scheduler, SchedulerSpec | macro.declare_scheduler, test_support | Scheduler Definitions |
| Drive a scenario | Ctx, scenarios::* | scenario | Scenarios, Custom Scenarios |
| Compose steps and ops | Step, Op, HoldSpec, Backdrop | scenario::ops | Ops, Steps, and Backdrop |
| Shape cgroups and cpusets | CgroupDef, CpusetSpec | scenario::ops | Topology |
| Check results | Assert, AssertResult, Verdict, claim! | assert | Checking, Customize Checking |
| Gate performance regressions | PerfDeltaAssertion | test_support | Assertable Metrics |
| Generate load | WorkType, WorkloadConfig, WorkloadHandle | workload | Work Types, Workers and Workloads |
| Run guest binaries | #[derive(Payload)], Payload | derive.Payload | Payloads and Included Files |
| Capture guest state | Snapshot, SnapshotBridge, Sample, SampleSeries | scenario::snapshot, scenario::sample | Snapshots, Temporal Assertions |

The #[ktstr_test] attribute’s arguments (topology dimensions, thresholds, execution flags) are documented in the macro reference chapter and on the attribute’s own rustdoc page; the macro itself lives in the ktstr-macros crate and is re-exported at the crate root.