Overview
ktstr
Test Linux schedulers like code. Every test boots a real kernel in a KVM micro-VM with the topology it declares — and ktstr watches what your scheduler does from the host, without touching the guest.
Scheduler bugs hide in topology: the fairness regression that only shows
up on an odd LLC count, the starvation that needs SMT siblings, the crash
that wants a NUMA crossing. Testing a
sched_ext scheduler against those
shapes has meant scrounging hardware and hand-running repro scripts.
ktstr turns it into cargo test: declare the topology on the test, and
the VM actually has it.
Quick taste
use ktstr::prelude::*;

declare_scheduler!(MY_SCHED, {
    name = "my_sched",
    binary = "scx_mine",
});

#[ktstr_test(scheduler = MY_SCHED, llcs = 1, cores = 2, threads = 1)]
fn steady_under_my_sched(ctx: &Ctx) -> Result<AssertResult> {
    scenarios::steady(ctx)
}
Run it against any kernel — a released version, a local source tree, or a git URL:
cargo ktstr test --kernel 7.0
cargo ktstr: fetching latest 7.0.x kernel version
cargo ktstr: latest 7.0.x kernel: 7.0.14
cargo ktstr: resolved kernel "7.0"
...
Nextest run ID 98581174-… with nextest profile: default
Starting 1 test across 121 binaries (12531 tests skipped)
PASS [ 34.459s] (1/1) ktstr::failure_dump_e2e ktstr/failure_dump_renders_bss_fields
...
Summary [ 34.498s] 1 test run: 1 passed, 12531 skipped
Without a scheduler attribute, tests run under the kernel’s default
scheduler (EEVDF) — useful for baselines and A/B comparisons.
When it breaks, you see why
A crash log tells you where the scheduler died. ktstr also tells you what
the state looked like on the way there: on a crash it boots a second VM,
attaches BPF probes along the crash path, and reruns the scenario. Each
probed function prints decoded struct fields; → marks fields that
changed between entry and exit:
=== AUTO-PROBE: scx_exit fired ===
ktstr_enqueue main.bpf.c:21
task_struct *p
pid 97
cpus_ptr 0xf(0-3)
dsq_id SCX_DSQ_INVALID
...
scx_flags QUEUED|ENABLED
do_enqueue_task kernel/sched/ext.c
rq *rq
cpu 1
task_struct *p
pid 97
cpus_ptr 0xf(0-3)
dsq_id SCX_DSQ_INVALID → SCX_DSQ_LOCAL
...
scx_flags QUEUED|DEQD_FOR_SLEEP → QUEUED
Auto-repro is on by default and needs a kernel with the sched_ext_exit
tracepoint — see Auto-Repro. For the
anatomy of ordinary failures (stats, timeline, monitor verdict), see
Reading Failure Output.
Real kernels under KVM
Each test gets a fresh micro-VM booting the exact kernel you target. Real cgroups, real BPF, no shared state.
How it works →
Topology as code
NUMA nodes, LLCs, cores, SMT — declared on the test attribute, realized in the guest down to the ACPI tables.
Topology →
Gauntlet
One test declaration fans out across a matrix of topology presets — odd LLC counts, SMT, NUMA crossings — with budget-aware selection for CI.
Gauntlet →
Auto-repro
Crashes rerun themselves in a probe VM that captures function arguments and struct state along the crash path.
Auto-Repro →
Design
Fidelity without overhead. Every test boots a real Linux kernel in a KVM VM with real cgroups and real BPF programs — no mocking, no containers, no state carried between tests. The VMM is purpose-built for this job; see VMM.
Direct access over tooling layers. The host-side monitor reads guest memory through BTF-resolved struct offsets — runqueues, DSQ depths, schedstat counters — loading nothing into the guest, so observation does not perturb the scheduler under test. See Monitor.
What it tests
- Fair scheduling — workers get CPU time without starvation or excessive scheduling gaps.
- Cpuset isolation — workers stay on assigned CPUs.
- Dynamic operations — cgroups created, destroyed, and resized mid-run.
- Affinity — the scheduler respects thread affinity constraints.
- Stress — many cgroups, many workers, rapid topology changes.
- Stall detection — the scheduler doesn’t drop tasks.
Note
ktstr is pre-release. 0.x APIs change between releases, so pin the exact version — Getting Started shows how.
Next steps
- ktstr in Action — the full feature tour, with real output.
- Getting Started — install, build a kernel, first green test.
- Tutorial: Zero to ktstr — build up a real test suite step by step.
- Test a New Scheduler — already have a scheduler? Start here.
ktstr in Action
The feature tour: each capability in a few sentences, with real output where output is the point — captured from actual runs (ktstr 0.23.0, kernel 7.0.14 unless noted), trimmed but never edited.
Testing
Real kernel, clean slate, x86/arm parity
Every VM test boots its own Linux kernel in KVM — fresh state each run, no shared daemons, no containers. Topology is configurable per test: NUMA nodes, LLCs, cores per LLC, threads per core, with real ACPI SRAT/SLIT tables on x86_64 and FDT cpu nodes on aarch64 — 24 topology presets on x86_64, 14 on aarch64. The framework drives the whole lifecycle: boot, attach, scenario, collect, teardown. See Topology.
Fast boot
The initramfs base (test binary + busybox + shared libraries) is LZ4-compressed and cached in shared memory; concurrent VMs COW-map the cached base instead of rebuilding it, so boot is dominated by kernel init, not initramfs preparation. Measured on a 64-CPU host:
initramfs spawn: 55.583µs
kvm+kernel: 867.005µs
setup_memory (joins initramfs): 1.409360963s
setup_vcpus: 1.409565321s
VM setup total: 1.409619773s
Declarative scheduler registration
One macro declares a scheduler — binary, default topology, kernel filter for the verifier sweep, assertion overrides, always-on CLI args — and tests reference the const it emits:
use ktstr::declare_scheduler;

declare_scheduler!(MITOSIS, {
    name = "mitosis",
    binary = "scx_mitosis",
    topology = (1, 2, 4, 1),
    sched_args = ["--exit-dump-len", "1048576"],
});
Schedulers that take a --config JSON file declare the arg template
once; each test supplies its config inline via config = …. See
Scheduler Definitions and
The #[ktstr_test] Attribute.
Data-driven scenarios
You declare intent as data — cgroups, cpusets, workloads, mid-run ops —
and the framework creates cgroups, spawns workers, applies policies and
affinity, and tears it all down. Canned scenarios grouped by scheduling
concern — affinity, cpusets, dynamic cgroups, nesting, contention,
stress — cover the common patterns; the ops DSL underneath (Step,
Op, Backdrop) expresses the rest. See
Scenarios and
Ops, Steps, and Backdrop.
45 work types
Each work type targets a specific scheduling pressure, so a test can pin the kernel path a regression lives in. A sample of the 45:
| Pressure | Examples |
|---|---|
| CPU and IPC | SpinWait, AluHot, IpcVariance |
| Wakeup placement | FutexPingPong, WakeChain, PipeIo |
| Task churn | ForkExit, CgroupAttachStorm, AffinityChurn |
| Priority and starvation | PreemptStorm, RtStarvation, PriorityInversion |
| Memory and NUMA | CachePressure, PageFaultChurn, NumaWorkingSetSweep |
| IRQ and I/O | IrqWake, NetTraffic, IoConvoy |
| Benchmarks | Schbench, Taobench |
Workers can also set comm, nice level, and thread-group-leader name
(pcomm) to model real applications, and Custom takes a user-supplied
work function. Full catalog: Work Types.
Gauntlet
A single #[ktstr_test] auto-expands across topology presets, and
multi-kernel runs (--kernel A --kernel B) add the kernel as another
matrix dimension. Budget-based selection maximizes coverage within a CI
time limit; multi-NUMA and very large presets are opt-in via constraint
attributes. See Gauntlet.
Real-kernel BPF verifier analysis
The verifier sweep boots a VM per (scheduler × kernel × topology
preset), loads the scheduler through struct_ops — the same path
production uses — and reads actual verified instruction counts from
guest memory. Topology is a real verification axis: values baked into
.rodata (like CPU counts) change what the verifier explores, so a
scheduler can attach on one topology and be rejected on another.
PASS [ 12.406s] (1/4) ktstr::kaslr_axis_e2e verifier/ktstr_sched/kernel_7_0/odd-3llc
...
verifier verified_insns (per scheduler; rows: kernel, cols: BPF program, cell: range across topologies):
ktstr_sched:
kernel ktstr_dispatch ktstr_dump ktstr_dump_cpu ktstr_dump_task ktstr_enqueue ktstr_exit ktstr_exit_task ktstr_init ktstr_init_task ktstr_select_cp ktstr_yield
kernel_7_0 102 81 13 70 74 25 419 2296 29077 39 8
verifier summary: 4 ✅ 0 ❌ 0 🇽
topology ktstr_sched
odd-3llc ✅
smt-2llc ✅
tiny-1llc ✅
tiny-2llc ✅
On rejection, the log is cycle-collapsed — repeated loop-unrolling iterations are deduplicated so the offending access is readable:
Global function ktstr_dispatch() doesn't return scalar. Only those are supported.
...
--- 8x of the following 25 lines ---
; u64 t = bpf_ktime_get_ns(); @ main.bpf.c:453
38: (85) call bpf_ktime_get_ns#5 ; R0=scalar()
...
--- 6 identical iterations omitted ---
...
--- end repeat ---
192: (63) *(u32 *)(r1 +0) = r2
R1 invalid mem access 'scalar'
processed 186 insns (limit 1000000) max_states_per_insn 0 total_states 7 peak_states 7 mark_read 0
See BPF Verifier Sweep.
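To make the idea of cycle-collapsing concrete, here is a minimal sketch that folds consecutive exact repeats of a block of lines into one copy plus a count. It is not ktstr's implementation: the function name `collapse`, the `max_period` parameter, and the exact-match rule are assumptions for illustration, and the markers only imitate the excerpt above.

```rust
/// Collapse consecutive exact repeats of a block of lines.
/// Hypothetical helper: only byte-identical repeats are folded, and
/// blocks longer than `max_period` lines are not considered.
fn collapse(lines: &[&str], max_period: usize) -> Vec<String> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < lines.len() {
        // Best (period, repetitions) found at position i, by lines covered.
        let mut best: Option<(usize, usize)> = None;
        for p in 1..=max_period.min((lines.len() - i) / 2) {
            let mut reps = 1;
            while i + (reps + 1) * p <= lines.len()
                && lines[i..i + p] == lines[i + reps * p..i + (reps + 1) * p]
            {
                reps += 1;
            }
            if reps >= 2 && best.map_or(true, |(bp, br)| p * reps > bp * br) {
                best = Some((p, reps));
            }
        }
        match best {
            Some((p, reps)) => {
                out.push(format!("--- {reps}x of the following {p} lines ---"));
                out.extend(lines[i..i + p].iter().map(|s| s.to_string()));
                out.push("--- end repeat ---".to_string());
                i += p * reps; // skip every repetition of the block
            }
            None => {
                out.push(lines[i].to_string());
                i += 1;
            }
        }
    }
    out
}

fn main() {
    // Made-up log lines standing in for an unrolled loop body.
    let log = [
        "r1 = 0", "call f", "r2 += 1", "call f", "r2 += 1", "call f", "r2 += 1", "exit",
    ];
    for line in collapse(&log, 4) {
        println!("{line}");
    }
}
```

A real verifier log repeats with small differences in register state, so a production collapser needs a looser notion of "same iteration" than the exact comparison used here.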
Bare-metal export
cargo ktstr export packages a registered test as a self-extracting
.run script that reproduces the scenario on real hardware, no VM. The
script freezes the scheduler binary, its args and config files, and the
required topology; it validates the host and refuses to displace an
already-attached sched_ext scheduler.
wrote /tmp/sched_basic_proportional.run (90074903 bytes archive, 0 include files)
----- head -40 of the generated script -----
#!/bin/bash
# Generated by `cargo ktstr export`. Do not edit; regenerate to update.
...
# --- frozen test specification ---
KTSTR_TEST_NAME=sched_basic_proportional
KTSTR_SCHED_NAME=ktstr_sched
KTSTR_GIT_HASH=73730e0
NEED_LLCS=1
NEED_CORES_PER_LLC=2
NEED_THREADS_PER_CORE=1
NEED_NUMA_NODES=1
...
See cargo ktstr.
Observability
Zero-perturbation introspection
Everything is built on direct reads of guest physical memory from the host via the KVM memory mapping. Kernel state — per-CPU runqueues, sched_domain trees, schedstat and sched_ext event counters — is read through BTF-resolved struct offsets; BPF maps get typed field access via program BTF. No guest-side instrumentation, no BPF syscalls: the observer does not perturb the scheduler under test. See Monitor.
Cast analysis
Schedulers stash kernel and arena pointers in BPF map fields declared as
u64, because BTF cannot express those pointer types. The cast analyzer
walks the scheduler’s instruction stream and proves which fields are
really pointers and to what — so dumps chase through them and print
typed structs annotated (cast→arena) or (cast→kernel) instead of raw
hex. It runs on every scheduler load, no configuration (the failure-dump
excerpt below shows it in action). See
Monitor.
Periodic capture and temporal assertions
#[ktstr_test(num_snapshots = N)] samples BPF map fields and scheduler
stats at N points across the workload window, from outside the guest.
Temporal patterns — nondecreasing, rate_within, steady_within,
converges_to, and friends — assert over the whole series; on-demand
and write-triggered snapshots share the same machinery. See
Periodic Capture,
Temporal Assertions, and
Snapshots.
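As a rough picture of what asserting "over the whole series" means, the sketch below models two of the named patterns as plain functions over sampled values. The names are borrowed from the text, but the signatures and the exact semantics (especially the tolerance rule in `steady_within`) are assumptions, not ktstr's API.

```rust
/// True when no sample is smaller than the one before it.
fn nondecreasing(series: &[u64]) -> bool {
    series.windows(2).all(|w| w[0] <= w[1])
}

/// True when every sample lies within `tolerance` (a fraction of the
/// mean) of the series mean. This is one plausible reading of
/// "steady within", chosen for illustration only.
fn steady_within(series: &[f64], tolerance: f64) -> bool {
    if series.is_empty() {
        return true;
    }
    let mean = series.iter().sum::<f64>() / series.len() as f64;
    series.iter().all(|v| (v - mean).abs() <= tolerance * mean.abs())
}

fn main() {
    // Made-up samples of a monotonic counter taken at four snapshots.
    let dispatched = [120, 340, 340, 907];
    println!("counter nondecreasing: {}", nondecreasing(&dispatched));

    // Made-up samples of a rate that should stay flat.
    let rate = [100.0, 102.0, 98.0];
    println!("rate steady within 5%: {}", steady_within(&rate, 0.05));
}
```

The point of a series assertion is that a single bad sample in the middle of the run fails the test even when the final value looks fine.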
Statistical regression detection
Every run writes machine-readable results; cross-run comparison with
dual-gate significance thresholds (absolute and relative) catches
regressions single-run assertions miss. Gated metrics include
worst_spread, worst_gap_ms, worst_p99_wake_latency_us, and the
duration-invariant total_run_delay_ns_per_sched; run
cargo ktstr stats list-metrics for the registry. See
Runs and Regression Gates and
Assertable Metrics.
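A minimal sketch of a dual-gate comparison, assuming a change is flagged only when it clears both the absolute and the relative threshold and that higher values are worse. The function name, parameters, and thresholds are hypothetical; the text does not spell out how ktstr combines the two gates.

```rust
/// Flag a regression only if the increase clears BOTH gates.
/// Assumptions: higher is worse, and a zero baseline is treated as
/// not comparable on the relative gate.
fn is_regression(baseline: f64, current: f64, abs_gate: f64, rel_gate: f64) -> bool {
    let delta = current - baseline;
    if delta <= 0.0 {
        return false; // improved or unchanged
    }
    let abs_hit = delta >= abs_gate;
    let rel_hit = baseline > 0.0 && delta / baseline >= rel_gate;
    abs_hit && rel_hit
}

fn main() {
    // Illustrative numbers for a gap metric in milliseconds.
    // Gates: at least 5 ms worse AND at least 25% worse.
    println!("10 -> 21:     {}", is_regression(10.0, 21.0, 5.0, 0.25));
    println!("1000 -> 1010: {}", is_regression(1000.0, 1010.0, 5.0, 0.25));
    println!("0.2 -> 0.4:   {}", is_regression(0.2, 0.4, 5.0, 0.25));
}
```

Requiring both gates filters out two kinds of noise: large percentage swings on tiny values, and small percentage drift on large ones.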
Debugging
Failure dumps
A failing test’s stderr carries the whole story: the tripped check,
per-cgroup stats, a phase timeline, the scheduler log, the monitor’s
verdict. On scheduler crash, ktstr also snapshots every BPF map with
fields rendered by name through BTF — .bss globals, arena allocator
state, typed pointer chases — plus vCPU registers at the instant of
death:
--- repro VM failure dump ---
DualFailureDumpReport: early=absent (max_age never crossed threshold (peak=66j, threshold=2500j)), late=(12 maps, 2 vcpu_regs)
...
map bpf_bpf.bss (type=array, value_size=448, max_entries=1)
.bss: scx_arena_verify_once=true ktstr_alloc_count=76 nr_dispatched=907 nr_enqueued=495 nr_select_cpu=372 stats_magic=6004496034161779060
...
scx_task_allocator scx_allocator:
...
root 0x100000006000 → sdt_desc: nr_free=512
chunk 0x100000007000 (sdt_alloc) → ktstr_arena_ctx{}
...
vcpu_regs:
vcpu 0: ip=0xffffffff96347fbf sp=0xffffffff97203e78 ptroot=0x0000000001e85003
vcpu 1: ip=0xffffffff9560bdc5 sp=0xff3b18cb8000f778 ptroot=0x0000000001e85003
The same report is written as a JSON artifact next to the run’s stats sidecar. See Reading Failure Output and Snapshots.
Auto-repro
On a scheduler crash, ktstr extracts the crash stack, discovers the
struct_ops callbacks, and reruns the scenario in a second VM with BPF
probes attached along the crash path — decoded function arguments and
struct state at each call site, → arrows marking entry-to-exit
changes:
do_enqueue_task kernel/sched/ext.c
rq *rq
cpu 1
task_struct *p
pid 97
cpus_ptr 0xf(0-3)
dsq_id SCX_DSQ_INVALID → SCX_DSQ_LOCAL
...
scx_flags QUEUED|DEQD_FOR_SLEEP → QUEUED
On by default; requires a kernel with the sched_ext_exit tracepoint.
See Auto-Repro.
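The arrow convention is easy to model: print the value once when entry and exit agree, and `entry → exit` when they differ. The helper below is a hypothetical sketch of that rule, not ktstr's renderer, and it ignores column alignment.

```rust
/// Render one probed field. Unchanged fields show a single value;
/// changed fields show `entry → exit`. Hypothetical helper.
fn render_field(name: &str, entry: &str, exit: &str) -> String {
    if entry == exit {
        format!("{name} {entry}")
    } else {
        format!("{name} {entry} → {exit}")
    }
}

fn main() {
    // Field values taken from the do_enqueue_task excerpt above.
    let fields = [
        ("pid", "97", "97"),
        ("dsq_id", "SCX_DSQ_INVALID", "SCX_DSQ_LOCAL"),
        ("scx_flags", "QUEUED|DEQD_FOR_SLEEP", "QUEUED"),
    ];
    for (name, entry, exit) in fields {
        println!("{}", render_field(name, entry, exit));
    }
}
```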
Interactive shell
ktstr shell boots a busybox VM and drops you into it.
--include-files injects host binaries with their shared-library
closure resolved automatically (recursive DT_NEEDED discovery);
--exec "cmd" runs one command non-interactively. For debugging, not
tests. See ktstr (standalone).
ctprof
ktstr ctprof capture snapshots per-task and per-cgroup scheduler
telemetry on the host — no VM involved. Capture before and after a
change, then compare to see which processes scheduled differently:
## Primary metrics
comm threads metric value delta % %uptime
kworker/{N}:{N}-mm_percpu_wq
kworker/{N}:{N}-mm_percpu_wq 11→37 voluntary_csw 8.697K → 101.154K +92.457K +1063.1% 93%
kworker/{N}:{N}-mm_percpu_wq 11→37 timeslices 8.699K → 101.166K +92.467K +1063.0% 93%
kworker/{N}:{N}-mm_percpu_wq 11→37 wait_time_ns 2.684s → 27.653s +24.969s +930.2% 93%
...
See ctprof.
Infrastructure
Supported kernels
| Capability | Kernel requirement |
|---|---|
| CI-tested series | 6.14 and 7.1, on x86_64 and aarch64, every push |
| Watchdog-timeout override | 7.1+ via BTF (scx_sched.watchdog_timeout); older kernels via the static scx_watchdog_timeout symbol |
| sched_ext event counters | 6.16+ (two BTF layouts); sampling is disabled when neither is present |
| Auto-repro probe trigger | kernels with the sched_ext_exit tracepoint |
Outside the CI-tested series, the monitor degrades feature by feature rather than failing: tests still run, and unavailable capabilities are reported as absent.
Kernel management
cargo ktstr kernel build builds and caches kernel images from version
numbers, local source paths, or git URLs; automatic discovery resolves
cached images, host kernels, and CI-provided paths, and an optional GHA
cache backend shares built kernels across CI runs. See
cargo ktstr and CI.
Performance mode
Performance mode pins vCPUs to reserved host cores, pre-faults 2 MB
hugepages, runs vCPU threads under SCHED_FIFO, and suppresses PAUSE/HLT
exits — removing host scheduling noise so a latency spike in the guest
points at the scheduler under test, not the host. Guest-side jitter from
shared LLCs and memory bandwidth remains, so compare performance-mode
runs only against the same host. When the host can’t physically satisfy
a declared topology, no_perf_mode builds the VM anyway and skips the
isolation. See Performance Mode.
Resource-budget coordination
--cpu-cap N confines kernel builds and no-perf-mode VMs to N host
CPUs, reserved whole-LLC-at-a-time and coordinated between concurrent
ktstr processes through per-LLC locks — so a kernel build and a
performance run can share a box without trampling each other.
ktstr locks lists every held lock with its holder. See
Resource Budget and
ktstr (standalone).
Change-scoped selection
cargo ktstr affected attributes a base..HEAD diff to the schedulers
it touches and emits a JSON array ready for a GitHub Actions dynamic
matrix — one job per affected scheduler instead of the whole fleet.
--relevant applies the same attribution to the local working tree for
a fast inner loop. Any uncertainty widens to “run all”: silently
skipping an affected scheduler is the worst outcome. See CI.
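The "widen on uncertainty" rule can be sketched as follows. Everything concrete here is hypothetical: the path prefixes, the scheduler names, and the prefix-matching strategy are invented for illustration and say nothing about how `cargo ktstr affected` actually attributes a diff.

```rust
use std::collections::BTreeSet;

/// Hypothetical attribution table: path prefix -> scheduler name.
const OWNERS: &[(&str, &str)] = &[
    ("scheds/scx_alpha/", "scx_alpha"),
    ("scheds/scx_beta/", "scx_beta"),
];

/// Hypothetical full fleet, used when attribution is uncertain.
const ALL: &[&str] = &["scx_alpha", "scx_beta"];

/// Map changed paths to the schedulers they touch. Any path that
/// cannot be attributed widens the result to every scheduler.
fn affected(changed: &[&str]) -> BTreeSet<&'static str> {
    let mut out = BTreeSet::new();
    for path in changed {
        match OWNERS.iter().find(|(prefix, _)| path.starts_with(*prefix)) {
            Some((_, sched)) => {
                out.insert(*sched);
            }
            // Unknown owner: run everything rather than risk a silent skip.
            None => return ALL.iter().copied().collect(),
        }
    }
    out
}

fn main() {
    println!("{:?}", affected(&["scheds/scx_alpha/main.bpf.c"]));
    println!("{:?}", affected(&["scheds/scx_alpha/main.bpf.c", "Cargo.lock"]));
}
```

The asymmetry is deliberate: running too many jobs costs CI minutes, while skipping an affected scheduler costs a missed regression.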
Guest coverage
Profraw data is collected inside the VM over shared memory and merged
with host coverage, so guest-side code paths count toward the same
cargo llvm-cov report as host-side ones.
Ready to try it? Getting Started sets up your first test; the Recipes cover complete workflows.
Getting Started
Every #[ktstr_test] boots a real Linux kernel in a KVM microVM with
the CPU topology the test declares, runs your workload inside it, and
checks the scheduler’s behavior from the host. This page takes you
from nothing to a green run.
Zero to green
cargo install --locked cargo-nextest
cargo install --locked ktstr # installs `cargo-ktstr` + `ktstr`
cargo ktstr kernel build --kernel 7.0 # one-time full kernel build; cached after
$EDITOR tests/sched_test.rs # write a #[ktstr_test] (below)
cargo ktstr test --kernel 7.0
Both installs are required: cargo ktstr test delegates to nextest,
and the ktstr package installs cargo-ktstr (the cargo plugin
behind every command in this guide) plus the standalone ktstr host
CLI. The kernel build is a real make -j$(nproc) kernel build —
plan for that once; later runs reuse the cache. On a cached kernel,
the run shown below took about 35 seconds end to end.
Prerequisites
Linux only (x86_64, aarch64). ktstr boots KVM virtual machines; it does not build or run on other platforms.
- KVM access (/dev/kvm); see Troubleshooting if it’s missing or unreadable
- Rust ≥ 1.94.1 (the crate’s MSRV)
- clang, pkg-config, make, gcc, and autotools (autoconf, autopoint, flex, bison, gawk), needed for the BPF skeletons and the vendored libbpf/libelf/zlib build
- BTF (/sys/kernel/btf/vmlinux), present by default on most distros
- Internet access on first build (downloads busybox source; kernel builds download tarballs from kernel.org)
The host kernel only needs KVM. The guest kernel — the one your tests boot — needs sched_ext, which landed in 6.12; the next section builds one.
# Ubuntu/Debian
sudo apt install clang pkg-config make gcc autoconf autopoint flex bison gawk
# Fedora
sudo dnf install clang pkgconf make gcc autoconf gettext-devel flex bison gawk
Add the dependency
[dev-dependencies]
ktstr = "=0.23.0"
ktstr is pre-release: pin the exact patch version and keep the
installed cargo-ktstr on the same one — minor bumps may break the
test-facing API. To keep ktstr out of a scheduler crate’s normal
builds, gate it behind a feature instead — see
Test a New Scheduler.
Build a kernel
cargo ktstr kernel build downloads a kernel tarball from
kernel.org, applies the embedded ktstr.kconfig fragment (sched_ext,
BPF, kprobes, minimal boot), builds it, and caches the result:
cargo ktstr kernel build # latest stable series with >= 8 point releases
cargo ktstr kernel build --kernel 7.0 # highest 7.0.x release
cargo ktstr kernel build --kernel 6.14.2 # exact version
cargo ktstr kernel build --kernel ../linux # local source tree
The bare form skips any series with fewer than 8 maintenance releases,
because brand-new series tend to hit build issues on older toolchains; name
a version explicitly to override. cargo ktstr kernel list shows
the cache and cargo ktstr kernel clean --keep 3 prunes it. You can
also skip this step entirely — cargo ktstr test --kernel 7.0
builds and caches on first use.
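The two selection rules (highest release in a named series, and the bare form's "at least 8 point releases" filter) can be sketched like this. It assumes the number of point releases can be read off the highest patch number, which is an approximation, and the release list in `main` is made up apart from 7.0.14; the function names are hypothetical.

```rust
/// Version as (major, minor, patch); tuples compare lexicographically.
type Version = (u32, u32, u32);

/// Highest release in one series, e.g. what `--kernel 7.0` resolves to.
fn latest_in_series(releases: &[Version], major: u32, minor: u32) -> Option<Version> {
    releases
        .iter()
        .copied()
        .filter(|r| (r.0, r.1) == (major, minor))
        .max()
}

/// Bare form: newest series whose highest patch number is at least
/// `min_point` (an assumed proxy for "has enough point releases").
fn pick_default(releases: &[Version], min_point: u32) -> Option<Version> {
    releases
        .iter()
        .filter_map(|r| latest_in_series(releases, r.0, r.1))
        .filter(|latest| latest.2 >= min_point)
        .max()
}

fn main() {
    // Illustrative data, not a real kernel.org listing.
    let releases = [(6, 14, 2), (6, 14, 11), (7, 0, 14), (7, 1, 3)];
    println!("--kernel 7.0 -> {:?}", latest_in_series(&releases, 7, 0));
    println!("bare form    -> {:?}", pick_default(&releases, 8));
}
```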
Write a test
One mental model before the first example: your test function runs
inside the VM, as the guest’s init process. execute_defs and
friends create real cgroups and spawn real workers; ctx hands you
the guest topology (ctx.topo) and cgroup management
(ctx.cgroups).
Create a file in your crate’s tests/ directory (e.g.
tests/sched_test.rs). The simplest test runs a canned scenario:
use ktstr::prelude::*;

#[ktstr_test(llcs = 1, cores = 2, threads = 1)] // llcs = last-level caches
fn my_test(ctx: &Ctx) -> Result<AssertResult> {
    // Canned scenario: two cgroups of CPU spinners, default duration.
    scenarios::steady(ctx)
}
No scheduler attribute means the test runs under the kernel’s
default EEVDF scheduler (see Overview) — a useful
baseline before pointing at your own.
When the canned scenarios stop being enough, declare your own
cgroups, workloads, and cpusets with CgroupDef — the
Tutorial builds that up one step at a time, and
Writing Tests is the reference.
Point it at a sched_ext scheduler
Declare your scheduler once and reference it from any test:
use ktstr::prelude::*;

declare_scheduler!(MY_SCHED, {
    name = "my_sched",
    binary = "scx_mysched", // your scheduler's binary name
});

#[ktstr_test(scheduler = MY_SCHED, llcs = 2, cores = 2, threads = 1)]
fn my_sched_steady(ctx: &Ctx) -> Result<AssertResult> {
    scenarios::steady(ctx)
}
The binary is resolved on the host — target/{debug,release}/, the
test binary’s directory, or a KTSTR_SCHEDULER=/path override — and
packed into the VM’s initramfs. Full field reference:
Scheduler Definitions;
walkthrough: Test a New Scheduler.
Run it
cargo ktstr test --kernel 7.0 # everything
cargo ktstr test --kernel 7.0 -- -E 'test(my_test)' # one test (nextest filter)
cargo ktstr test resolves the kernel — an explicit --kernel
version, path, or cache key, or, without the flag, a discovery chain
through environment variables, the kernel cache, and host kernels —
then wraps cargo nextest run. The full chain and flag grammar live
in cargo ktstr.
Here is a real run, on a cached kernel (transcript captured from
ktstr’s own suite — your run shows ktstr/my_test on the PASS line
instead):
cargo ktstr: fetching latest 7.0.x kernel version
cargo ktstr: latest 7.0.x kernel: 7.0.14
cargo ktstr: resolved kernel "7.0"
...
────────────
Nextest run ID 24c18577-cd34-43bd-9d14-b0197701c187 with nextest profile: default
Starting 1 test across 121 binaries (12531 tests skipped)
PASS [ 34.451s] (1/1) ktstr::failure_dump_e2e ktstr/failure_dump_renders_bss_fields
────────────
Summary [ 34.490s] 1 test run: 1 passed, 12531 skipped
cargo ktstr: test outputs ... (1 stats sidecar(s), 0 wprof trace(s) written this run)
Reading it:
- The first three lines are kernel resolution: --kernel 7.0 picked the newest 7.0.x release and found it already cached, so no rebuild.
- Test names have the shape crate::binary ktstr/test_name; the ktstr/ prefix marks the base variant, and the same test also generates gauntlet/ topology variants, skipped by default (see Running Tests). The 34 s covers everything: VM boot, scenario, teardown, evaluation.
- Every run writes a stats sidecar per test under target/ktstr/{kernel}-{commit}/, the raw material for regression gates (Runs and Regression Gates).
What gets checked
Warning
Nothing, by default. A bare #[ktstr_test] boots the VM, runs the scenario, and reports pass even if the scheduler stalled, starved workers, or never dispatched a task.
Every check is an opt-in attribute: not_starved = true enables the
starvation/fairness/gap trio, max_spread_pct, min_iteration_rate,
and friends set explicit thresholds. Checking
explains the model;
Customize Checking shows the override
flow.
When a check fails
A failing check prints the violated threshold with the observed
value, then per-cgroup statistics. This excerpt is from a real run
that set an impossible min_iteration_rate floor:
ktstr_test 'throughput_gate' [sched=scx-ktstr] [topo=1n1l2c1t] failed:
worker 71 iteration rate 41903.3/s below floor 50000000.0/s
worker 73 iteration rate 37834.5/s below floor 50000000.0/s
--- stats ---
2 workers, 4 cpus, 2 migrations, worst_spread=0.0%, worst_gap=21ms
cg0: workers=1 cpus=2 spread=0.0% gap=10ms migrations=1 iter=209600
cg1: workers=1 cpus=2 spread=0.0% gap=21ms migrations=1 iter=189252
...
The header names the test, scheduler, and topology variant; each
detail line names the check, observed value, and threshold. The full
output continues with timeline, scheduler-log, and monitor sections,
plus failure-dump artifacts and a ready-to-paste cargo ktstr replay command — Reading Failure
Output walks the whole anatomy.
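To show where a detail line like the ones above comes from, here is a sketch of a per-worker rate floor: iterations divided by elapsed time, compared against the threshold. The function is hypothetical and the 5-second window is an assumption chosen for the example; the message format simply mirrors the excerpt.

```rust
use std::time::Duration;

/// Compare a worker's iteration rate against a floor.
/// Hypothetical helper; ktstr's own check may measure the window differently.
fn check_min_rate(
    worker: u32,
    iterations: u64,
    elapsed: Duration,
    floor: f64,
) -> Result<(), String> {
    if elapsed.is_zero() {
        return Err(format!("worker {worker} has no elapsed time to measure"));
    }
    let rate = iterations as f64 / elapsed.as_secs_f64();
    if rate < floor {
        Err(format!(
            "worker {worker} iteration rate {rate:.1}/s below floor {floor:.1}/s"
        ))
    } else {
        Ok(())
    }
}

fn main() {
    // iter=209600 comes from the cg0 line above; the 5 s window is assumed.
    match check_min_rate(71, 209_600, Duration::from_secs(5), 50_000_000.0) {
        Ok(()) => println!("ok"),
        Err(msg) => println!("{msg}"),
    }
}
```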
Next steps
- Tutorial: Zero to ktstr. Build a complete test step by step, break it on purpose, and read the wreckage.
- Test a New Scheduler. You have an scx_* binary and want it under test in five minutes.
- Writing Tests. The authoring reference: attributes, scenarios, snapshots, assertions.
Zero to ktstr
This tutorial walks through writing a complete #[ktstr_test] from
scratch. By the end you’ll have a scheduler test that runs two
cgroups with different lifecycle patterns across a multi-LLC
topology, asserts fairness, throughput parity, and cpuset isolation
— and you’ll have broken it on purpose once, so real failures look
familiar.
Already have a scheduler binary? This tutorial teaches ktstr from the ground up. If you have an existing scx_X you want to test, jump to one of the targeted recipes instead: test-new-scheduler.md (5 minutes, validates basic behavior), ab-compare.md (compare two scheduler builds), or diagnose-slow-scheduler.md (debug performance regressions).
What you’ll build
A test named mixed_workloads that:
- Runs two cgroups on separate LLCs:
  - background_spinner: a persistent CPU-bound load that runs for the entire test duration.
  - phased_worker: a worker that loops through explicit Spin → Yield → Spin → Yield … phases via WorkType::Sequence.
- Targets a 2-LLC, 4-core topology so the scheduler has a real cache boundary to respect.
- Sets an explicit test duration.
- Asserts fairness (per-cgroup spread), throughput parity (CV across workers + minimum rate), and cpuset isolation (workers stay on their assigned CPUs).
- Fails once, deliberately, so you learn the failure output.
- Captures a snapshot of the scheduler’s BPF state after the workload.
The complete test is at the end of this page.
Prerequisites
Getting Started covers the toolchain, KVM
access, the dev-dependency, and building a bootable kernel
(Build a kernel). With those in
place, create a file under your crate’s tests/ directory (e.g.
tests/mixed_workloads.rs) and follow along.
Step 1: The skeleton
Every #[ktstr_test] is a Rust function that takes &Ctx and
returns Result<AssertResult>. Start with an empty body that passes
unconditionally:
use ktstr::prelude::*;

#[ktstr_test(llcs = 1, cores = 2, threads = 1)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    let _ = ctx;
    Ok(AssertResult::pass())
}
let _ = ctx; keeps the unused-variable lint quiet at the skeleton
stage; Step 2 onward uses ctx.
Try it. Once this file compiles, run just this test with
cargo ktstr test --kernel 7.0 -- -E 'test(mixed_workloads)'. A bare-skeleton test passes immediately — the rest of the tutorial adds the workload and assertions on top.
use ktstr::prelude::*; brings in every type the test body needs —
Ctx, AssertResult, CgroupDef, WorkType, CpusetSpec,
execute_defs, and the Result alias from anyhow. The
#[ktstr_test] attribute registers the function so cargo ktstr test discovers it and boots a VM with the requested topology.
A test without a scheduler = … attribute runs under the kernel’s
default EEVDF scheduler — a useful baseline (see
Overview). Step 2 swaps in a sched_ext scheduler so
the rest of the tutorial exercises that scheduler instead.
For the full attribute reference, see The #[ktstr_test] Attribute.
Step 2: Define your scheduler
To target a sched_ext scheduler, declare it with
declare_scheduler! and reference the generated const from
#[ktstr_test(scheduler = …)]. The example uses scx-ktstr, the
test-fixture scheduler shipped in the ktstr workspace; substitute
your own binary name to target a different scheduler.
use ktstr::prelude::*;

declare_scheduler!(KTSTR_SCHED, {
    name = "ktstr_sched",
    binary = "scx-ktstr",
});

#[ktstr_test(scheduler = KTSTR_SCHED, llcs = 1, cores = 2, threads = 1)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    let _ = ctx;
    Ok(AssertResult::pass())
}
declare_scheduler! emits a pub static KTSTR_SCHED: Scheduler and
registers it so the verifier sweep
discovers it automatically. The scheduler = slot expects the bare
const name. The fields used here:
- name: scheduler name for display and result files.
- binary: binary name, resolved on the host from target/{debug,release}/, the directory containing the test binary, or a KTSTR_SCHEDULER override path. The resolved binary is packed into the VM’s initramfs.
Other commonly used fields: topology = (numa, llcs, cores, threads) sets a default VM topology that per-test attributes can
override; sched_args = ["--flag"] prepends CLI args to every test
using this scheduler; kernels = [...] lists kernel specs for the
verifier sweep. For the full surface (sysctls, kargs,
config_file, gauntlet constraints, scheduler-level assertion
overrides) and the manual-builder path for programmatic composition,
see Scheduler Definitions.
Step 3: Add workloads
A CgroupDef declares a cgroup along with the workers that will run
inside it. The builder methods configure worker count, the work each
worker performs, scheduling policy, and cpuset assignment.
Add two cgroups — both running tight CPU spinners for now. Step 5 will swap one of them for a phased workload:
use ktstr::prelude::*;

#[ktstr_test(scheduler = KTSTR_SCHED, llcs = 1, cores = 2, threads = 1)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        CgroupDef::named("background_spinner")
            .workers(2)
            .work_type(WorkType::SpinWait),
        CgroupDef::named("phased_worker")
            .workers(2)
            .work_type(WorkType::SpinWait),
    ])
}
Without .cpuset(...), a cgroup’s workers run on every CPU in the
test’s topology — they share the VM’s full CPU set with all other
cgroups. .cpuset(CpusetSpec::Llc(idx)) (introduced in Step 4)
restricts a cgroup to one LLC’s CPUs.
WorkType::SpinWait runs a tight CPU spin loop; it is one of many
work primitives, each targeting a different kernel scheduling path —
see Work Types for the full set and how to
choose one.
execute_defs runs each cgroup concurrently for the test’s full
duration. Use execute_steps when you need to add cgroups mid-run
or swap cpusets between phases — see
Ops, Steps, and Backdrop.
Step 4: Set topology
The #[ktstr_test] attribute carries the VM’s CPU topology.
Dimensions are big-to-little: numa_nodes (default 1), llcs
(total across all NUMA nodes), cores per LLC, and threads per
core. Total CPU count is llcs * cores * threads.
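The CPU-count formula is worth seeing with numbers. The struct below is a hypothetical stand-in for the attribute's dimensions, not a ktstr type; it only demonstrates that `llcs` is already a total, so `numa_nodes` does not multiply in.

```rust
/// Hypothetical mirror of the topology dimensions on #[ktstr_test].
#[derive(Debug, Clone, Copy)]
struct Topo {
    numa_nodes: u32, // defaults to 1
    llcs: u32,       // total across all NUMA nodes
    cores: u32,      // cores per LLC
    threads: u32,    // threads per core
}

impl Topo {
    /// Total CPUs = llcs * cores * threads.
    fn cpu_count(&self) -> u32 {
        self.llcs * self.cores * self.threads
    }
}

fn main() {
    let one_llc = Topo { numa_nodes: 1, llcs: 1, cores: 2, threads: 1 };
    let two_llc = Topo { numa_nodes: 1, llcs: 2, cores: 2, threads: 1 };
    for t in [one_llc, two_llc] {
        println!(
            "numa_nodes={} llcs={} cores={} threads={} -> {} CPUs",
            t.numa_nodes, t.llcs, t.cores, t.threads, t.cpu_count()
        );
    }
}
```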
LLC count matters because the last-level cache is the primary
scheduling boundary — tasks sharing an LLC benefit from shared
cache lines, while cross-LLC migration carries a cold-cache penalty.
A scheduler that ignores LLC topology will look fine on llcs = 1
and start failing as soon as there is a real cache boundary to
respect.
Bump the topology to two LLCs with two cores each (4 CPUs total) so each cgroup can own its own LLC:
#[ktstr_test(scheduler = KTSTR_SCHED, llcs = 2, cores = 2, threads = 1)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        CgroupDef::named("background_spinner")
            .workers(2)
            .work_type(WorkType::SpinWait)
            .cpuset(CpusetSpec::Llc(0)),
        CgroupDef::named("phased_worker")
            .workers(2)
            .work_type(WorkType::SpinWait)
            .cpuset(CpusetSpec::Llc(1)),
    ])
}
CpusetSpec::Llc(idx) confines a cgroup to the CPUs that belong to
LLC idx. Other variants (Numa, Range, Disjoint, Overlap,
Exact) cover NUMA-node binding, fractional partitioning, and
hand-built CPU sets — see Topology.
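One way to picture what an LLC-scoped cpuset resolves to is the sketch below. It assumes CPUs are numbered contiguously, LLC by LLC, which is an illustrative assumption rather than a documented guarantee; `llc_cpus` is a hypothetical helper, not part of ktstr.

```rust
use std::ops::Range;

/// CPU ids belonging to LLC `idx`, ASSUMING contiguous numbering
/// (LLC 0 gets the first cores * threads CPUs, LLC 1 the next, ...).
/// Returns None when `idx` is out of range for the topology.
fn llc_cpus(idx: u32, llcs: u32, cores: u32, threads: u32) -> Option<Range<u32>> {
    if idx >= llcs {
        return None;
    }
    let per_llc = cores * threads;
    Some(idx * per_llc..(idx + 1) * per_llc)
}

fn main() {
    // llcs = 2, cores = 2, threads = 1, as in the test above.
    for idx in 0..3 {
        println!("LLC {idx}: {:?}", llc_cpus(idx, 2, 2, 1));
    }
}
```

Under that assumption the two cgroups above end up on disjoint CPU sets, which is what makes cross-LLC migration a detectable violation.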
Step 5: Compose phased work inside a cgroup
So far both cgroups run identical CPU spinners. The point of this
test is to exercise a scheduler against different lifecycle
patterns at once, so swap phased_worker for a worker that loops
through explicit phases.
WorkType::Sequence runs each phase for its specified duration and
then advances to the next; when the last phase ends the loop
restarts. Phases: WorkPhase::Spin(Duration),
WorkPhase::Sleep(Duration), WorkPhase::Yield(Duration),
WorkPhase::Io(Duration), and WorkPhase::AluHot { .. }. Use the
WorkType::sequence(first, rest) constructor. Only
std::time::Duration needs an extra use line:
use std::time::Duration;
use ktstr::prelude::*;
#[ktstr_test(scheduler = KTSTR_SCHED, llcs = 2, cores = 2, threads = 1)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
execute_defs(ctx, vec![
// Persistent CPU pressure on LLC 0 for the whole run.
CgroupDef::named("background_spinner")
.workers(2)
.work_type(WorkType::SpinWait)
.cpuset(CpusetSpec::Llc(0)),
// Phased worker on LLC 1: spin 100 ms, yield for 20 ms,
// then loop. Stresses the scheduler's wake-after-yield
// placement repeatedly while the LLC-0 spinner keeps
// runqueue pressure constant.
CgroupDef::named("phased_worker")
.workers(2)
.work_type(WorkType::sequence(
WorkPhase::Spin(Duration::from_millis(100)),
[WorkPhase::Yield(Duration::from_millis(20))],
))
.cpuset(CpusetSpec::Llc(1)),
])
}
The two cgroups now exercise distinct paths concurrently:
background_spinner keeps two CPUs continuously busy on LLC 0,
while phased_worker alternates between burning CPU and yielding on
LLC 1, exercising voluntary preemption and wakeup placement.
Both cgroups still run for the entire scenario duration: the phasing
happens within each phased_worker worker’s loop. To express
phasing across cgroups (e.g. add phased_worker only for the
second half of the run), use execute_steps with multiple Step
entries — see Ops, Steps, and Backdrop.
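The looping behaviour of a phase sequence can be modelled in a few lines. This is a toy model only: `Phase`, `phase_at`, and `full_cycles` are hypothetical names, and the real worker loop may differ in how it measures phase boundaries.

```rust
use std::time::Duration;

/// Toy stand-in for two of the phase kinds; not ktstr's `WorkPhase`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Spin(Duration),
    Yield(Duration),
}

impl Phase {
    /// How long this phase lasts before the loop advances.
    pub fn duration(&self) -> Duration {
        match *self {
            Phase::Spin(d) | Phase::Yield(d) => d,
        }
    }
}

/// Index and value of the phase active `elapsed` after the worker started,
/// wrapping around to the first phase once the last one ends.
pub fn phase_at(phases: &[Phase], elapsed: Duration) -> Option<(usize, Phase)> {
    let cycle: Duration = phases.iter().map(Phase::duration).sum();
    if cycle.is_zero() {
        return None; // no phases, or only zero-length ones
    }
    let mut offset = elapsed.as_nanos() % cycle.as_nanos();
    for (i, p) in phases.iter().enumerate() {
        let len = p.duration().as_nanos();
        if offset < len {
            return Some((i, *p));
        }
        offset -= len;
    }
    None
}

/// Number of complete passes through the sequence that fit in `window`.
pub fn full_cycles(phases: &[Phase], window: Duration) -> u128 {
    let cycle: Duration = phases.iter().map(Phase::duration).sum();
    if cycle.is_zero() {
        0
    } else {
        window.as_nanos() / cycle.as_nanos()
    }
}
```

For the 100 ms spin plus 20 ms yield sequence, one pass takes 120 ms, so a 12 s window holds 100 complete passes in this model.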
Step 6: Tune execution
Several #[ktstr_test] attributes control how the VM runs the
scenario. The defaults are tuned for fast iteration:
| Attribute | Default | What it does |
|---|---|---|
| duration_s | 12 | Per-scenario wall-clock seconds. Workers run for this long, then stop and report. |
| watchdog_timeout_s | 5 | sched_ext watchdog fire threshold. |
| memory_mib | 2048 | VM memory in MiB. |
watchdog_timeout_s is sched_ext’s per-task stall threshold — if a
runnable task is not picked for that many seconds, the scheduler
exits with SCX_EXIT_ERROR_STALL. The scenario duration and
watchdog are independent; a 12 s scenario with a 5 s watchdog is
normal. Tune the watchdog only when the scheduler under test is
expected to legitimately leave a runnable task parked longer than
the default 5 s.
For the run we’re building, set the duration to 20 s (so each phase iteration repeats many times):
#[ktstr_test(
scheduler = KTSTR_SCHED,
llcs = 2,
cores = 2,
threads = 1,
duration_s = 20,
)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
// body unchanged from Step 5 — two cgroups via execute_defs
}
For the full attribute reference (auto-repro, performance mode, topology constraints, etc.), see The #[ktstr_test] Attribute.
Step 7: Add assertions
Every check is opt-in — no threshold is compared until you turn its
check on, either at the scheduler level or on the per-test attribute
(Checking explains the model, and
Customize Checking the override
chain). The first check to opt into is not_starved = true, which
enables three related worker-level checks together:
- Starvation: any worker with zero work units fails the test.
- Fairness spread: per-cgroup max(off-CPU%) - min(off-CPU%) must stay under the spread threshold (release default 15%; debug default 35%, because debug builds in small VMs show higher spread and the threshold loosens automatically).
- Scheduling gaps: the longest wall-clock gap observed at work-unit checkpoints must stay under the gap threshold (release default 2000 ms; debug default 3000 ms).
Cpuset isolation is separate — enable it with isolation = true.
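The three worker-level checks can be pictured as one pass over per-worker telemetry for a cgroup. A minimal sketch, assuming a hypothetical `Worker` record and treating "must stay under" as "fails when the value exceeds the threshold"; the real evaluation code and field names are not shown in this guide.

```rust
/// Hypothetical per-worker telemetry; field names are illustrative only.
#[derive(Debug, Clone, Copy)]
pub struct Worker {
    pub work_units: u64,
    pub off_cpu_pct: f64, // share of wall time spent off-CPU, 0..=100
    pub max_gap_ms: u64,  // longest gap between work-unit checkpoints
}

/// Returns one message per failed check for a single cgroup.
pub fn check_not_starved(workers: &[Worker], max_spread_pct: f64, max_gap_ms: u64) -> Vec<String> {
    let mut failures = Vec::new();

    // Starvation: any worker with zero work units fails.
    for (i, w) in workers.iter().enumerate() {
        if w.work_units == 0 {
            failures.push(format!("worker {} starved", i));
        }
    }

    // Fairness spread: max(off-CPU%) - min(off-CPU%) within the cgroup.
    let hi = workers.iter().map(|w| w.off_cpu_pct).fold(f64::NEG_INFINITY, f64::max);
    let lo = workers.iter().map(|w| w.off_cpu_pct).fold(f64::INFINITY, f64::min);
    if !workers.is_empty() && hi - lo > max_spread_pct {
        failures.push(format!("spread {:.1}% over {:.1}%", hi - lo, max_spread_pct));
    }

    // Scheduling gaps: the worst gap across the cgroup's workers.
    if let Some(gap) = workers.iter().map(|w| w.max_gap_ms).max() {
        if gap > max_gap_ms {
            failures.push(format!("gap {} ms over {} ms", gap, max_gap_ms));
        }
    }
    failures
}
```

Two workers at 10% and 30% off-CPU have a spread of 20 points: that fails the release default of 15 but passes the debug default of 35.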
Override the spread threshold and add throughput-parity gates:
use std::time::Duration;
use ktstr::prelude::*;
#[ktstr_test(
scheduler = KTSTR_SCHED,
llcs = 2,
cores = 2,
threads = 1,
duration_s = 20,
isolation = true,
not_starved = true,
max_spread_pct = 20.0,
max_throughput_cv = 0.5,
min_work_rate = 1.0,
)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
execute_defs(ctx, vec![
CgroupDef::named("background_spinner")
.workers(2)
.work_type(WorkType::SpinWait)
.cpuset(CpusetSpec::Llc(0)),
CgroupDef::named("phased_worker")
.workers(2)
.work_type(WorkType::sequence(
WorkPhase::Spin(Duration::from_millis(100)),
[WorkPhase::Yield(Duration::from_millis(20))],
))
.cpuset(CpusetSpec::Llc(1)),
])
}
What each new attribute gates:
- isolation = true: workers must only run on CPUs in their assigned cpuset; any execution on an unexpected CPU fails the test.
- not_starved = true: enables the starvation/spread/gap trio described above, at the default thresholds.
- max_spread_pct = 20.0: custom fairness threshold. It replaces the default-threshold spread verdict from not_starved with your limit (and enables the spread check on its own even without not_starved). 20.0 loosens the release default of 15.0 slightly to absorb noise from the phased worker’s yield-driven re-placement.
- max_throughput_cv = 0.5: coefficient of variation of work_units / cpu_time across workers. Catches a scheduler that gives some workers disproportionately less effective CPU.
- min_work_rate = 1.0: minimum work units per CPU-second per worker. Catches the case where every worker is equally slow (CV passes but absolute throughput is too low).
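The two throughput gates measure different things, which a small sketch makes concrete. It assumes the coefficient of variation uses the population standard deviation; the text does not say which variant ktstr uses, and both helper names are hypothetical.

```rust
/// Per-worker rate: work units per CPU-second.
pub fn work_rate(work_units: u64, cpu_time_s: f64) -> f64 {
    work_units as f64 / cpu_time_s
}

/// Coefficient of variation (stddev / mean) across workers.
/// ASSUMPTION: population standard deviation (divide by n).
pub fn coefficient_of_variation(rates: &[f64]) -> Option<f64> {
    if rates.is_empty() {
        return None;
    }
    let n = rates.len() as f64;
    let mean = rates.iter().sum::<f64>() / n;
    if mean == 0.0 {
        return None; // CV is undefined when nobody made progress
    }
    let variance = rates.iter().map(|r| (r - mean).powi(2)).sum::<f64>() / n;
    Some(variance.sqrt() / mean)
}
```

Two workers at 0.5 units per CPU-second have a CV of 0, so a CV cap alone would pass them; only the `min_work_rate = 1.0` floor flags that both are slow.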
Host-side monitor checks (imbalance ratio, DSQ depth, stall detection, event rates) also run on every test, but they are report-only by default — Checking covers what they observe and how to make them enforce.
Step 8: Run it
Run the test with cargo ktstr test, scoped to this one test name:
cargo ktstr test --kernel 7.0 -- -E 'test(mixed_workloads)'
cargo ktstr test resolves the kernel image, boots a VM with the
declared topology, runs the test as the guest’s init, and reports
the result. A real passing run looks like this (transcript captured
from ktstr’s own suite — your run shows ktstr/mixed_workloads on
the PASS line instead):
cargo ktstr: fetching latest 7.0.x kernel version
cargo ktstr: latest 7.0.x kernel: 7.0.14
cargo ktstr: resolved kernel "7.0"
...
────────────
Nextest run ID 98581174-246f-4824-a170-50992df166d7 with nextest profile: default
Starting 1 test across 121 binaries (12531 tests skipped)
PASS [ 34.459s] (1/1) ktstr::failure_dump_e2e ktstr/failure_dump_renders_bss_fields
────────────
Summary [ 34.498s] 1 test run: 1 passed, 12531 skipped
cargo ktstr: test outputs
...
(1 stats sidecar(s), 0 wprof trace(s) written this run)
That run took about 35 seconds end to end on a cached kernel — VM
boot, scenario, teardown, and evaluation included. The ktstr/
prefix on the test name marks the base variant; see
Running Tests for the name shapes and the
sidecar files each run writes.
If something goes wrong instead:
- “kernel not found”: the --kernel argument points at a directory without a built kernel, or at a version the cache cannot locate. Run cargo ktstr kernel build to populate the cache (see Getting Started: Build a kernel).
- “scheduler binary not found”: the declared binary = "..." from Step 2 didn’t land where the discovery cascade looks. Set KTSTR_SCHEDULER=/path/to/binary to pin an explicit path, or rebuild the scheduler crate so the binary lands under target/{debug,release}/.
- probe-related errors (“probe skeleton load failed”, “trigger attach failed”): re-run with RUST_LOG=ktstr=debug to see the underlying libbpf reason; see Troubleshooting.
Step 9: Break it on purpose
A green run tells you the harness works; it doesn’t teach you to read a failure. Crank one threshold to an impossible value and watch what comes out. Add an iteration-rate floor that no VM of this size can meet:
#[ktstr_test(
scheduler = KTSTR_SCHED,
llcs = 2,
cores = 2,
threads = 1,
duration_s = 20,
isolation = true,
not_starved = true,
max_spread_pct = 20.0,
max_throughput_cv = 0.5,
min_work_rate = 1.0,
min_iteration_rate = 50_000_000.0, // deliberately impossible
)]
Below is a real capture of exactly this experiment — a demo test with the same impossible floor, on a 2-CPU topology:
ktstr_test 'throughput_gate' [sched=scx-ktstr] [topo=1n1l2c1t] failed:
worker 71 iteration rate 41903.3/s below floor 50000000.0/s
worker 73 iteration rate 37834.5/s below floor 50000000.0/s
--- stats ---
2 workers, 4 cpus, 2 migrations, worst_spread=0.0%, worst_gap=21ms
cg0: workers=1 cpus=2 spread=0.0% gap=10ms migrations=1 iter=209600
cg1: workers=1 cpus=2 spread=0.0% gap=21ms migrations=1 iter=189252
...
--- monitor ---
samples=41 max_imbalance=2.00 max_dsq_depth=0 stuck=0
avg: imbalance=1.32 nr_running/cpu=1.2 dsq/cpu=0.0
events: fallback=0 (0.0/s) keep_last=210 (52.5/s) offline=0
...
verdict: monitor OK
...
cargo ktstr: test outputs
...
FAILED throughput_gate [my_sched 1n1l2c1t]
...
replay cargo ktstr replay --filter throughput_gate --exec
How to read it:
- The header names the test, the scheduler, and the topology variant. Every detail line under it names the check that tripped, the observed value, and the threshold; here, workers managing ~40k iterations/s against a 50M floor.
- --- stats --- gives the per-cgroup roll-up: worker counts, CPUs touched, fairness spread, worst scheduling gap, migrations, and iteration totals.
- verdict: monitor OK is worth noticing: the host-side monitor saw nothing wrong. The scheduler behaved fine; the test’s own gate was impossible. When a real scheduler bug trips a check, the monitor and timeline sections are usually where the story is.
- The footer hands you a ready-to-paste cargo ktstr replay line to re-run exactly the failing variant.
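The `topo=` label in the header packs the four topology dimensions into one token. The sketch below decodes it, assuming the format is `<numa>n<llcs>l<cores>c<threads>t`; that reading is inferred from the sample output (`1n1l2c1t` for one NUMA node, one LLC, two cores, one thread) rather than from a documented grammar, and `parse_topo_label` is a hypothetical helper.

```rust
/// Parses a label such as "1n1l2c1t" into (numa_nodes, llcs, cores, threads).
/// ASSUMPTION: the label format is inferred from the sample failure output.
pub fn parse_topo_label(label: &str) -> Option<(u32, u32, u32, u32)> {
    let mut dims = [0u32; 4];
    let mut rest = label;
    for (slot, suffix) in dims.iter_mut().zip(['n', 'l', 'c', 't']) {
        // Each dimension is a decimal number followed by its one-letter suffix.
        let end = rest.find(suffix)?;
        *slot = rest[..end].parse::<u32>().ok()?;
        rest = &rest[end + 1..];
    }
    if rest.is_empty() {
        Some((dims[0], dims[1], dims[2], dims[3]))
    } else {
        None // trailing characters after the thread count
    }
}
```

Decoding the label is a quick way to confirm which variant failed before pasting the replay line.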
The full failure anatomy — timeline, scheduler log, auto-repro,
failure-dump artifacts — is covered in
Reading Failure Output. Now delete the
min_iteration_rate line and the test goes green again.
Step 10: Capture a snapshot
Threshold assertions tell you something is off; snapshots tell you
what the scheduler’s state actually was.
Op::capture_snapshot(name) freezes every vCPU long enough to read
the scheduler’s BPF map state, vCPU registers, and per-CPU counters
into a named report, then resumes the guest.
execute_defs (used so far) takes a flat list of cgroups. To inject
a snapshot, switch to execute_steps, which takes a list of Steps
— each with setup cgroups, an ops list, and a hold duration.
Warning
Within a step, ops fire before the setup cgroups are created. A single step with both the workload and a snapshot op named “after_workload” would capture an empty guest. Use two steps: a setup step that holds the workload, then a follow-up step whose op fires after the hold ends.
use std::time::Duration;
use ktstr::prelude::*;
#[ktstr_test(scheduler = KTSTR_SCHED, llcs = 2, cores = 2, threads = 1, duration_s = 20)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
execute_steps(ctx, vec![
Step {
setup: Setup::Defs(vec![
CgroupDef::named("background_spinner")
.workers(2)
.work_type(WorkType::SpinWait)
.cpuset(CpusetSpec::Llc(0)),
CgroupDef::named("phased_worker")
.workers(2)
.work_type(WorkType::sequence(
WorkPhase::Spin(Duration::from_millis(100)),
[WorkPhase::Yield(Duration::from_millis(20))],
))
.cpuset(CpusetSpec::Llc(1)),
]),
ops: vec![],
hold: HoldSpec::FULL,
},
Step {
setup: Setup::Defs(Vec::new()),
ops: vec![Op::capture_snapshot("after_workload")],
hold: HoldSpec::Fixed(Duration::ZERO),
},
])
}
The first step creates the cgroups and holds them for the full
scenario duration; the second step’s op runs after that hold
finishes, so the snapshot reflects the post-workload guest state.
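The ordering rule is easier to see on a toy timeline. This sketch replays steps in the order the warning describes (ops, then setup, then hold); `ToyStep` and `timeline` are hypothetical and far simpler than ktstr's real `Step`.

```rust
/// Toy model of a step; names are hypothetical, not ktstr's `Step`.
pub struct ToyStep {
    pub setup: Vec<&'static str>, // cgroups created by this step
    pub ops: Vec<&'static str>,   // ops fired by this step
    pub hold_ms: u64,             // how long the step holds afterwards
}

/// Replays steps in the order the warning describes: ops first, then setup,
/// then the hold. Returns (time_ms, event) pairs.
pub fn timeline(steps: &[ToyStep]) -> Vec<(u64, String)> {
    let mut now: u64 = 0;
    let mut events = Vec::new();
    for step in steps {
        for op in &step.ops {
            events.push((now, format!("op {}", op)));
        }
        for cgroup in &step.setup {
            events.push((now, format!("create {}", cgroup)));
        }
        now += step.hold_ms;
    }
    events
}
```

In this model a single step yields the snapshot op before the cgroup exists, while the two-step layout creates the cgroup at 0 ms and fires the op at 20000 ms.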
Downstream code reads the captured report by name and walks fields
with a dotted-path accessor — e.g.
snap.var("nr_dispatched").as_u64()? reads a scheduler global. For
the traversal API, error handling, and the write-driven
Op::watch_snapshot variant, see
Snapshots.
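For intuition about dotted-path access, here is a toy value tree with a path walker. It is not the snapshot API: `Value`, `get`, and this `as_u64` are hypothetical stand-ins that only show how a path like `sys_stat.avg_lat_cri` descends one field per segment.

```rust
use std::collections::BTreeMap;

/// Toy value tree standing in for a captured report (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    U64(u64),
    Struct(BTreeMap<String, Value>),
}

impl Value {
    /// Walks a dotted path, descending one struct field per segment.
    pub fn get(&self, path: &str) -> Option<&Value> {
        let mut cur = self;
        for key in path.split('.') {
            match cur {
                Value::Struct(fields) => cur = fields.get(key)?,
                Value::U64(_) => return None, // cannot descend into a scalar
            }
        }
        Some(cur)
    }

    /// Reads the value as an unsigned integer, if it is one.
    pub fn as_u64(&self) -> Option<u64> {
        match self {
            Value::U64(v) => Some(*v),
            _ => None,
        }
    }
}
```

A missing field and a path that tries to descend through a scalar both return `None` here, which is the kind of case the real accessor's error handling has to cover.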
The complete test
The shape exercised by every step above, in one file — the Step 7 assertions plus the Step 10 snapshot steps:
use std::time::Duration;
use ktstr::prelude::*;
declare_scheduler!(KTSTR_SCHED, {
name = "ktstr_sched",
binary = "scx-ktstr",
});
#[ktstr_test(
scheduler = KTSTR_SCHED,
llcs = 2,
cores = 2,
threads = 1,
duration_s = 20,
isolation = true,
not_starved = true,
max_spread_pct = 20.0,
max_throughput_cv = 0.5,
min_work_rate = 1.0,
)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
execute_steps(ctx, vec![
Step {
setup: Setup::Defs(vec![
CgroupDef::named("background_spinner")
.workers(2)
.work_type(WorkType::SpinWait)
.cpuset(CpusetSpec::Llc(0)),
CgroupDef::named("phased_worker")
.workers(2)
.work_type(WorkType::sequence(
WorkPhase::Spin(Duration::from_millis(100)),
[WorkPhase::Yield(Duration::from_millis(20))],
))
.cpuset(CpusetSpec::Llc(1)),
]),
ops: vec![],
hold: HoldSpec::FULL,
},
Step {
setup: Setup::Defs(Vec::new()),
ops: vec![Op::capture_snapshot("after_workload")],
hold: HoldSpec::Fixed(Duration::ZERO),
},
])
}
Run it:
cargo ktstr test --kernel 7.0 -- -E 'test(mixed_workloads)'
Going further
Each of these builds directly on the test you just wrote.
- Gauntlet. #[ktstr_test] doesn’t emit just one test; it also generates variants that run the same body across every accepted topology preset (gauntlet/mixed_workloads/smt-2llc, …), catching the bugs only odd LLC counts, SMT siblings, or NUMA crossings expose. See Gauntlet.
- Worker identity. .comm("name"), .nice(n), and .pcomm("name") on CgroupDef give workers realistic names and priorities for schedulers that key on task->comm or nice values. See Work Types.
- Inline scheduler config. Schedulers like scx_layered take a JSON config file; config_file_def on the scheduler plus config = … on the test writes it into the guest. See The #[ktstr_test] Attribute.
- Periodic capture and temporal assertions. num_snapshots = N captures BPF state at evenly spaced points across the run, and a post_vm callback asserts temporal patterns over the series (nondecreasing counters, bounded rates, convergence). See Periodic Capture and Temporal Assertions.
- Performance mode. For benchmark-grade runs, ktstr pins vCPUs to reserved host cores and strips host scheduling noise; for topologies your host can’t mirror, no_perf_mode = true builds the virtual topology as declared. See Performance Mode.
- Stats and regression gates. Every run writes machine-readable sidecars; cargo ktstr stats aggregates them and cargo ktstr perf-delta gates HEAD against a baseline. See Runs and Regression Gates.
- Custom scenarios. When the declarative ops can’t express your scenario, the test body is arbitrary Rust: resize cpusets based on observed telemetry, assert on migrations directly. See Custom Scenarios and Ops, Steps, and Backdrop.
Writing Tests
Tests are Rust functions annotated with #[ktstr_test]. Each test
boots a KVM VM, runs the scenario inside it, and evaluates results
on the host.
use ktstr::prelude::*;
#[ktstr_test(llcs = 1, cores = 2, threads = 1)]
fn my_test(ctx: &Ctx) -> Result<AssertResult> {
execute_defs(ctx, vec![
ctx.cgroup_def("cg_0"),
ctx.cgroup_def("cg_1"),
])
}
ctx.cgroup_def("name") is shorthand for
CgroupDef::named("name").workers(ctx.workers_per_cgroup) — the
common case. Use CgroupDef::named(...).workers(N).work_type(...)
directly when the test needs to customize worker count or work type.
Run with cargo ktstr test --kernel 7.0 (see
Getting Started for setup). A passing run is
one nextest line per test; the VM boot, scenario, and teardown all
happen inside the reported duration:
cargo ktstr: resolved kernel "7.0"
...
Nextest run ID 24c18577-... with nextest profile: default
Starting 1 test across 121 binaries (12531 tests skipped)
PASS [ 34.451s] (1/1) ktstr::failure_dump_e2e ktstr/failure_dump_renders_bss_fields
...
Summary [ 34.490s] 1 test run: 1 passed, 12531 skipped
Every test gets the same machinery for free: a fresh VM per test (no state shared between tests), a failure dump with BTF-rendered scheduler BPF state if the scheduler crashes (see Reading Failure Output), and an automatic second-VM reproduction run with probes attached (Auto-Repro). Each test also expands into gauntlet variants across topology presets — see Gauntlet.
Warning
No worker checks run by default. The example above passes as long as nothing crashes; it does not assert fairness, starvation, or gaps. Opt in with not_starved = true and the threshold attributes; see Checking for the model.
Where to go next
- The #[ktstr_test] Attribute: the full attribute reference, covering topology, timing, checking thresholds, and execution knobs.
- Scheduler Definitions: declare_scheduler! and how the scheduler under test is named, found, configured, and launched.
- Payloads and Included Files: run benchmark binaries (schbench, fio, …) alongside workers and extract their metrics.
- Custom Scenarios: scenario logic the ops system cannot express, written directly in the test body.
- Snapshots: capture scheduler BPF state on demand mid-scenario and assert on it.
- Watch Snapshots: capture at the exact instant the kernel writes a chosen symbol.
- Periodic Capture: cadenced BPF-state sampling across the workload window, no scenario code required.
- Temporal Assertions: assert on trajectories, such as counters that only advance, metrics that hold steady, and systems that converge.
The #[ktstr_test] Attribute
#[ktstr_test] registers a function as an integration test that
boots a VM with a declared topology and runs the function body inside
it. This page is the attribute reference.
Most tests need only a handful of attributes: a scheduler, a topology dimension or two, a duration, and the checking thresholds the test is actually about.
use ktstr::prelude::*;
declare_scheduler!(MY_SCHED, {
name = "my_sched",
binary = "scx_my_sched",
topology = (1, 2, 4, 1),
});
#[ktstr_test(
scheduler = MY_SCHED, // scheduler under test (default: kernel EEVDF)
threads = 2, // override one dimension; the rest inherit
duration_s = 10, // workload window (default 12 s)
not_starved = true, // enable starvation / spread / gap checks
max_spread_pct = 20.0, // override the fairness-spread threshold (release default 15%)
)]
fn smt_fairness(ctx: &Ctx) -> Result<AssertResult> {
execute_defs(ctx, vec![ctx.cgroup_def("cg_a"), ctx.cgroup_def("cg_b")])
}
The function signature is
fn(&Ctx) -> anyhow::Result<AssertResult>. scheduler = expects the
bare const emitted by declare_scheduler! — see
Scheduler Definitions.
Attribute forms
All attributes are optional, with defaults, and most take
key = value. The sixteen bool attributes (auto_repro,
expect_auto_repro, not_starved, isolation, performance_mode,
pci, no_perf_mode, requires_smt, expect_err,
survives_storm, allow_inconclusive, fail_on_stall, host_only,
ignore, kaslr, wprof) also accept a bare form as shorthand for
= true — #[ktstr_test(host_only)] equals
#[ktstr_test(host_only = true)]. auto_repro and kaslr default
to true, so their meaningful spelling is auto_repro = false /
kaslr = false; the other fourteen default to false (or unset).
Each attribute key may appear at most once per invocation; a duplicate key fails at macro expansion rather than silently letting the later value win.
Topology
| Attribute | Default | Description |
|---|---|---|
| numa_nodes | inherited | NUMA nodes |
| llcs | inherited | Total LLCs (not per node) |
| cores | inherited | Cores per LLC |
| threads | inherited | Threads per core |
| memory_mib | 2048 | VM memory floor in MiB (see below) |
Each dimension independently inherits from Scheduler.topology when
a scheduler is specified and that dimension is not set. Without a
scheduler, unset dimensions use the macro defaults (numa_nodes = 1,
llcs = 1, cores = 2, threads = 1). See
Topology for the notation and what the
guest actually gets.
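Per-dimension inheritance amounts to filling each unset attribute from a base. A minimal sketch under two assumptions: the `topology = (1, 2, 4, 1)` tuple is ordered (numa_nodes, llcs, cores, threads) like the big-to-little attribute list, and a scheduler is modelled as either supplying a full topology or being absent. `Dims`, `Overrides`, and `resolve` are hypothetical names.

```rust
/// Fully resolved topology (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Dims {
    pub numa_nodes: u32,
    pub llcs: u32,
    pub cores: u32,
    pub threads: u32,
}

/// Dimensions set on the test attribute; `None` means "not set".
#[derive(Debug, Clone, Copy, Default)]
pub struct Overrides {
    pub numa_nodes: Option<u32>,
    pub llcs: Option<u32>,
    pub cores: Option<u32>,
    pub threads: Option<u32>,
}

/// Macro defaults quoted in the text, used when no scheduler is specified.
pub const MACRO_DEFAULTS: Dims = Dims { numa_nodes: 1, llcs: 1, cores: 2, threads: 1 };

/// Each dimension falls back independently to the scheduler topology,
/// or to the macro defaults when there is no scheduler.
pub fn resolve(scheduler_topology: Option<Dims>, attrs: Overrides) -> Dims {
    let base = scheduler_topology.unwrap_or(MACRO_DEFAULTS);
    Dims {
        numa_nodes: attrs.numa_nodes.unwrap_or(base.numa_nodes),
        llcs: attrs.llcs.unwrap_or(base.llcs),
        cores: attrs.cores.unwrap_or(base.cores),
        threads: attrs.threads.unwrap_or(base.threads),
    }
}
```

Under those assumptions, the `smt_fairness` example above (scheduler topology (1, 2, 4, 1), attribute `threads = 2`) resolves to 2 LLCs, 4 cores per LLC, and 2 threads per core, or 16 CPUs.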
Memory
memory_mib is one of three floors: the framework allocates
max(total_cpus * 64, 256, memory_mib) MiB at VM launch. Above 32
vCPUs the CPU-based floor dominates the default 2048, so a 126-vCPU
test gets 8064 MiB regardless. Raise memory_mib only when the test
needs more headroom than the per-CPU budget provides.
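The three-floor rule is plain arithmetic, so a worked version is short. `vm_memory_mib` is a hypothetical helper that mirrors the formula quoted above; it is not a ktstr function.

```rust
/// VM memory in MiB per the formula in the text:
/// max(total_cpus * 64, 256, memory_mib).
pub fn vm_memory_mib(total_cpus: u64, memory_mib: u64) -> u64 {
    (total_cpus * 64).max(256).max(memory_mib)
}
```

At the default `memory_mib = 2048`, 32 vCPUs sit exactly at the crossover (32 * 64 = 2048), 33 vCPUs get 2112 MiB, and 126 vCPUs get 8064 MiB.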
Timing
| Attribute | Default | Description |
|---|---|---|
| duration_s | 12 | Workload window in seconds (ctx.duration) |
| watchdog_timeout_s | 5 | sched_ext watchdog override in seconds |
The watchdog override is applied via scx_sched.watchdog_timeout on
7.1+ kernels and via the static scx_watchdog_timeout symbol on
earlier kernels; when neither path is available the override logs a
warning and the kernel default stands.
Checking thresholds
Checking attributes override the merged check set
(library defaults → scheduler-level assert → per-test attributes).
Checking explains the two evaluation
channels; Customize Checking owns
the merge rules and worked overrides. Everything here is inherited
when unset.
Worker checks — evaluated from per-worker telemetry after the scenario:
| Attribute | Unit | Example | Fails when |
|---|---|---|---|
| not_starved | bool | not_starved = true | any worker finishes with zero work units; also enables the spread and gap checks |
| isolation | bool | isolation = true | a worker ran on a CPU outside its cgroup’s cpuset |
| max_gap_ms | ms | max_gap_ms = 500 | a worker’s longest scheduling gap exceeds the cap |
| max_spread_pct | percentage points | max_spread_pct = 20.0 | max−min worker off-CPU% exceeds the cap |
| max_throughput_cv | coefficient of variation | max_throughput_cv = 0.35 | per-worker throughput CV exceeds the cap |
| min_work_rate | work units / CPU-second | min_work_rate = 1000.0 | a worker’s work rate falls below the floor |
| min_iteration_rate | iterations / second | min_iteration_rate = 50000.0 | a worker’s wall-clock iteration rate falls below the floor |
| max_migration_ratio | migrations / iteration | max_migration_ratio = 0.5 | a cgroup’s migration ratio exceeds the cap |
| max_p99_wake_latency_ns | ns | max_p99_wake_latency_ns = 2000000 | p99 wake latency exceeds the cap |
| max_wake_latency_cv | coefficient of variation | max_wake_latency_cv = 1.0 | wake-latency CV exceeds the cap |
| min_page_locality | fraction 0.0–1.0 | min_page_locality = 0.9 | fraction of pages on expected NUMA nodes falls below the floor |
| max_cross_node_migration_ratio | fraction 0.0–1.0 | max_cross_node_migration_ratio = 0.1 | NUMA-migrated pages / total pages exceeds the cap |
| max_slow_tier_ratio | fraction 0.0–1.0 | max_slow_tier_ratio = 0.05 | pages on memory-only (CXL) nodes exceed the cap |
Monitor thresholds — evaluated from the host monitor’s samples of guest scheduler state:
| Attribute | Unit | Example | Fails when |
|---|---|---|---|
| max_imbalance_ratio | ratio | max_imbalance_ratio = 2.0 | observed run-queue imbalance exceeds the cap |
| max_local_dsq_depth | tasks | max_local_dsq_depth = 8 | a local DSQ grows deeper than the cap |
| fail_on_stall | bool | fail_on_stall | the monitor’s stall detection fails the test instead of reporting |
| sustained_samples | samples | sustained_samples = 3 | window size a violation must persist for before it counts |
| max_fallback_rate | events/s | max_fallback_rate = 5.0 | fallback-dispatch event rate exceeds the cap |
| max_keep_last_rate | events/s | max_keep_last_rate = 100.0 | keep-last event rate exceeds the cap |
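The interaction between a cap and `sustained_samples` can be sketched as a run-length test. This assumes "persist" means that many consecutive samples over the cap, which is one plausible reading of the table rather than a confirmed rule; `sustained_violation` is a hypothetical helper.

```rust
/// True if `samples` contains at least `sustained` consecutive values above `cap`.
/// ASSUMPTION: a violation must hold for consecutive samples to count.
pub fn sustained_violation(samples: &[f64], cap: f64, sustained: usize) -> bool {
    if sustained == 0 {
        return false; // a zero-length window is treated as "never fires" here
    }
    let mut run = 0usize;
    for &sample in samples {
        if sample > cap {
            run += 1;
            if run >= sustained {
                return true;
            }
        } else {
            run = 0; // any sample back under the cap resets the window
        }
    }
    false
}
```

With `max_imbalance_ratio = 2.0` and `sustained_samples = 3`, isolated spikes to 2.5 are ignored in this model, while three in a row trip the check.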
What a failing gate looks like
A deliberately unreachable floor, to show the output shape — 50M iterations/s is far beyond what two workers deliver:
declare_scheduler!(MY_SCHED, { name = "my_sched", binary = "scx-ktstr" });
#[ktstr_test(scheduler = MY_SCHED, llcs = 1, cores = 2, threads = 1, duration_s = 5)]
fn throughput_gate(ctx: &Ctx) -> Result<AssertResult> {
let checks = Assert::default_checks().min_iteration_rate(50_000_000.0);
let steps = vec![Step {
setup: vec![ctx.cgroup_def("cg_a"), ctx.cgroup_def("cg_b")].into(),
ops: vec![],
hold: HoldSpec::FULL,
}];
execute_steps_with(ctx, steps, Some(&checks))
}
TRY 1 FAIL [ 31.810s] (───) ktstr::docs_demo ktstr/throughput_gate
stderr ───
...
ktstr_test 'throughput_gate' [sched=scx-ktstr] [topo=1n1l2c1t] failed:
worker 71 iteration rate 41903.3/s below floor 50000000.0/s
worker 73 iteration rate 37834.5/s below floor 50000000.0/s
--- stats ---
2 workers, 4 cpus, 2 migrations, worst_spread=0.0%, worst_gap=21ms
cg0: workers=1 cpus=2 spread=0.0% gap=10ms migrations=1 iter=209600
cg1: workers=1 cpus=2 spread=0.0% gap=21ms migrations=1 iter=189252
...
--- monitor ---
samples=41 max_imbalance=2.00 max_dsq_depth=0 stuck=0
avg: imbalance=1.32 nr_running/cpu=1.2 dsq/cpu=0.0
events: fallback=0 (0.0/s) keep_last=210 (52.5/s) offline=0
...
verdict: monitor OK
The failure names the worker, the measured value, and the threshold it crossed; the stats and monitor sections that follow are the context for deciding whether the threshold or the scheduler is wrong. Reading Failure Output walks the full transcript.
Expected-error matchers
Two attributes narrow which failure counts as the expected bug in an
expect_err = true reproducer test (both require expect_err; both
may be set, composing with AND semantics):
- expect_scx_bpf_error_contains = "literal": the captured scx_bpf_error text must contain the literal substring. Empty strings panic at construction.
- expect_scx_bpf_error_matches = "regex": the text must match the regex. Empty patterns, invalid syntax, and any pattern that matches the empty string (a?, .*, ^$) panic at construction, so a vacuous matcher can never silently pass. ^/$ anchor to the whole string by default (use (?m) for line anchors), and a bare \b slips the vacuity gate, so prefer a substring.
See Investigate a Crash for the pin-an-error-as-a-regression-test workflow these serve.
Topology constraints
These filter which gauntlet presets a test expands into; the base
ktstr/ variant is unaffected.
| Attribute | Default | Description |
|---|---|---|
| min_llcs / max_llcs | 1 / 12 | LLC-count bounds |
| min_numa_nodes / max_numa_nodes | 1 / 1 | NUMA-node bounds; multi-NUMA presets are opt-in |
| min_cpus / max_cpus | 1 / 192 | Total-CPU bounds |
| requires_smt | false | Only SMT (threads > 1) presets; skips the test entirely on aarch64, which ships no SMT presets |
The gauntlet skips presets that fail any bound. See Gauntlet for the preset table, filtering rules, and a worked expansion.
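The filtering rule reduces to a handful of range checks. In this sketch the default bounds come from the table above, while `Preset`, `Constraints`, and the preset shapes used in the usage note are hypothetical and do not reflect the real preset table.

```rust
/// Hypothetical topology preset.
#[derive(Debug, Clone, Copy)]
pub struct Preset {
    pub numa_nodes: u32,
    pub llcs: u32,
    pub cores: u32,
    pub threads: u32,
}

/// Topology constraints, with the defaults listed in the table.
#[derive(Debug, Clone, Copy)]
pub struct Constraints {
    pub min_llcs: u32,
    pub max_llcs: u32,
    pub min_numa_nodes: u32,
    pub max_numa_nodes: u32,
    pub min_cpus: u32,
    pub max_cpus: u32,
    pub requires_smt: bool,
}

impl Default for Constraints {
    fn default() -> Self {
        Constraints {
            min_llcs: 1,
            max_llcs: 12,
            min_numa_nodes: 1,
            max_numa_nodes: 1,
            min_cpus: 1,
            max_cpus: 192,
            requires_smt: false,
        }
    }
}

impl Constraints {
    /// A preset is kept only if it satisfies every bound.
    pub fn accepts(&self, p: &Preset) -> bool {
        let cpus = p.llcs * p.cores * p.threads;
        (self.min_llcs..=self.max_llcs).contains(&p.llcs)
            && (self.min_numa_nodes..=self.max_numa_nodes).contains(&p.numa_nodes)
            && (self.min_cpus..=self.max_cpus).contains(&cpus)
            && (!self.requires_smt || p.threads > 1)
    }
}
```

With the defaults, a two-node preset is skipped because `max_numa_nodes` is 1; raising it to 2 opts the test into that preset.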
Execution attributes
auto_repro / expect_auto_repro
On scheduler crash, auto_repro (default true) boots a second VM
with probes attached to capture the state on the way to the crash —
see Auto-Repro. Set
auto_repro = false for faster iteration; expect_err = true also
disables it. expect_auto_repro = true inverts the assertion: the
test fails unless the auto-repro path actually fired (used to pin
the repro machinery itself). It requires a scheduler and wprof,
and is rejected alongside auto_repro = false, expect_err, or
host_only.
expect_err / survives_storm / allow_inconclusive
expect_err = true asserts the run returns Err — the negative
test: a scheduler crash or scenario failure is the expected outcome,
and a clean pass fails the test. survives_storm = true is the
positive inverse: the scx scheduler must stay attached and alive
through every hold; a death or ejection fails with a
survival-specific explainer. It requires a scheduler and is mutually
exclusive with expect_err and expect_auto_repro.
allow_inconclusive = true lets an Inconclusive verdict pass
instead of exiting 2 — see
Checking for when Inconclusive arises.
performance_mode / no_perf_mode / cpu_budget
performance_mode = true pins vCPUs to reserved host cores with
hugepages, NUMA binding, and RT scheduling, for runs whose numbers
must be comparable — see
Performance Mode.
no_perf_mode = true goes the other way: build the VM with the
declared topology even on a smaller host, skipping all pinning.
The two are mutually exclusive (rejected at compile time).
cpu_budget = N overrides the auto-derived host-CPU mask size in
no-perf mode (must be > 0, requires no_perf_mode; an explicit
--cpu-cap / KTSTR_CPU_CAP still wins) — see
Resource Budget.
kaslr
Default true: the guest boots with KASLR enabled, so tests run
against the memory layout real systems have. kaslr = false
appends nokaslr to the guest command line for the rare workflow
that needs stable kernel addresses.
host_only
Run the function directly on the host — no VM. For tests that need
host tools (cargo, nested VMs) unavailable in the guest initramfs.
Mutually exclusive with scheduler, num_snapshots > 0,
auto_repro = true, disk, and networks — everything that
requires a VM to exist.
ignore
Emits #[ignore] on the generated test, so it is skipped by default
and runs only under nextest’s --run-ignored. Use for
slow-by-design tests that should not gate every local run.
pci / disk / networks
disk = CONST attaches a virtio-blk device from a const DiskConfig (built from DiskConfig::DEFAULT with chained setters); the
framework owns the backing file. networks = [CONST, …] attaches
one virtio-net device per const NetConfig (aarch64 supports at
most one). On x86_64 either device auto-enables the virtio-PCI
transport; pci = true forces the PCI host bridge on without a
device (default: no bridge, pci=off on the guest command line).
bpf_map_write
bpf_map_write = CONST (or [A, B]) writes a u32 into a named
scheduler BPF-map field from the host, once, after the scheduler
loads — pre-seeding guest state before workers start. See
Snapshots — composing reads with writes.
watch_bpf_maps
watch_bpf_maps = CONST (or [A, B]) samples named scheduler
BPF-map fields observer-effect-free during the run and surfaces each
as a run-level metric. Full semantics in
Watching BPF-map fields below.
perf_delta_assertions
perf_delta_assertions = CONST (or [A, B]) declares per-test
performance-regression gates. Inert in a normal cargo ktstr test
run — enforced only under cargo ktstr perf-delta --noise-adjust.
Requires performance_mode (rejected at compile time otherwise,
since unpinned numbers would make the gate misfire). See
Assertable Metrics.
num_snapshots
num_snapshots = N fires N periodic BPF-state captures inside the
workload’s 10%–90% window. 0 (default) disables periodic capture.
Validated against the 64-capture bridge cap, host_only, and a
100 ms minimum boundary spacing. See
Periodic Capture.
post_vm / post_vm_unconditional
Host-side callbacks (fn(&VmResult) -> anyhow::Result<()>) invoked
after the VM exits — the place to drain snapshot bridges, read run
metrics, and run temporal assertions.
post_vm is suppressed when the guest reported failure;
post_vm_unconditional always runs (guard it with
if !result.success { return Ok(()); } when it reads state a crash
may not have produced, and note it never turns a guest failure into
a pass). When num_snapshots > 0 and post_vm is omitted, the
macro installs a default callback asserting at least one periodic
capture landed with real BPF state.
cleanup_budget_ms
Caps host-side VM teardown wall time; exceeding the budget folds a
failing detail into the test result. Unset disables the check; 0
is rejected.
staged_schedulers
staged_schedulers = [PATH, …] packs additional &'static Scheduler binaries into the guest at boot. Required for scenarios
that invoke Op::ReplaceScheduler / Op::AttachScheduler — the
swap target must already be on disk in the guest. See
Ops.
workload_root_cgroup
workload_root_cgroup = "/path" places the per-test workload
cgroups under a specific guest cgroup path, decoupled from the
scheduler’s cgroup_parent (which roots scheduler-side cells).
wprof / wprof_args
wprof = true attaches the wprof BPF tracer to the workload VM;
wprof_args = "..." passes space-separated CLI args. Both require
the wprof cargo feature.
payload / workloads / extra_include_files / extra_sched_args
payload = CONST declares the test’s primary benchmark binary and
workloads = [A, B] composes more alongside it; the include-file
pipeline packs every referenced binary into the guest. See
Payloads and Included Files.
extra_include_files = ["path", …] adds test-level host files that
belong to no particular payload. extra_sched_args = ["--flag", …]
appends scheduler CLI args after the scheduler’s own sched_args.
config
Inline scheduler config content, paired with a scheduler that
declares config_file_def — covered next.
Inline scheduler config
Some schedulers (e.g. scx_layered, scx_lavd) accept a JSON
config file via a CLI argument like --config /path/to/config.json.
Two pieces wire this into a test:
- Scheduler declaration: declares the arg template and the guest path via config_file_def:
const LAYERED_SCHED: Scheduler = Scheduler::named("layered")
.binary(SchedulerSpec::Discover("scx_layered"))
.config_file_def("--config {file}", "/include-files/layered.json");
{file} in the arg template is replaced with the guest path. The framework writes the config content to that path inside the guest before the scheduler binary starts.
- Test attribute: supplies the inline content:
const LAYERED_CONFIG: &str = r#"{ "layers": [...] }"#;
#[ktstr_test(scheduler = LAYERED_SCHED, config = LAYERED_CONFIG)]
fn layered_test(ctx: &Ctx) -> Result<AssertResult> {
Ok(AssertResult::pass())
}
Both a string literal and a path to a const &'static str are accepted.
The pairing gate is bidirectional and enforced at compile time (and
again at runtime for programmatic entry construction): a scheduler
with config_file_def requires config = … on every test, and a
scheduler without it rejects config = … — the content would
otherwise be silently dropped.
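The two rules above, the {file} substitution and the bidirectional pairing gate, can be modeled in a few lines of plain Rust. This is a minimal sketch and not ktstr's implementation: render_config_args and check_pairing are hypothetical names, the whitespace split of the rendered template is an assumption, and the real gate runs at compile time rather than as a runtime function.

```rust
/// Hypothetical helper: substitute the guest path for `{file}` in the
/// arg template, then split the result on whitespace into argv entries.
/// (The whitespace split is an assumption of this sketch.)
fn render_config_args(template: &str, guest_path: &str) -> Vec<String> {
    template
        .replace("{file}", guest_path)
        .split_whitespace()
        .map(str::to_owned)
        .collect()
}

/// Hypothetical model of the pairing gate: a scheduler with
/// `config_file_def` needs `config = ...` on the test, and a scheduler
/// without it must not receive one.
fn check_pairing(sched_has_config_file_def: bool, test_has_config: bool) -> Result<(), String> {
    match (sched_has_config_file_def, test_has_config) {
        (true, false) => {
            Err("scheduler declares config_file_def but the test supplies no config".to_owned())
        }
        (false, true) => {
            Err("test supplies config but the scheduler has no config_file_def".to_owned())
        }
        _ => Ok(()),
    }
}
```

Both template spellings used on these pages ("--config {file}" and "--config={file}") go through the same substitution; only the resulting argv shape differs.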
For schedulers that take the same config file on every test, use
Scheduler::config_file(host_path) instead — see
Scheduler Definitions.
Watching BPF-map fields
watch_bpf_maps turns “the scheduler computed X” into a post-VM
assertion. The free-running host monitor reads the named field from
the running guest’s BPF-map memory via BTF — without freezing vCPUs
— and folds the samples into a run-level metric.
Each declared const is one WatchBpfMap::new(map_name_suffix, field, agg, label):
- map_name_suffix: matched against a loaded BPF map by ends_with (".bss" for a section global, or a named map like "cpu_ctx_stor").
- field: a dot-path into the map’s value type ("sys_stat.avg_lat_cri", or a bare global like "lat_headroom").
- agg: pick by the field’s semantic class:
  - BpfMapAgg::Scalar: a gauge; folded as the mean over the run’s samples.
  - BpfMapAgg::ScalarCounter: a monotonic counter; folded as the value at the last sample (the final total, not a mean of a rising series).
  - BpfMapAgg::PerCpu: a per-CPU gauge array; folded into a cross-CPU mean and max.
  - BpfMapAgg::PerCpuCounter: a per-CPU counter array; folded as the cross-CPU sum at the last sample. Watch at u64 width so no per-CPU slot truncates before the sum.
- label: the metric-key leaf; must be unique within a test.
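To make the four fold rules concrete, here is a small sketch of how each aggregation class could reduce a series of samples. The function names are hypothetical and this is not ktstr's code; in particular, pooling every per-CPU slot of every sample for the PerCpu mean and max is an assumption about what "cross-CPU mean and max" means.

```rust
/// Scalar gauge: mean over the run's samples; None when nothing was sampled.
fn fold_scalar(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    Some(samples.iter().sum::<f64>() / samples.len() as f64)
}

/// Monotonic counter: the value at the last sample (the final total).
fn fold_scalar_counter(samples: &[u64]) -> Option<u64> {
    samples.last().copied()
}

/// Per-CPU gauge: (mean, max) over every per-CPU slot of every sample.
/// Pooling all slots is an assumption of this sketch.
fn fold_per_cpu(samples: &[Vec<f64>]) -> Option<(f64, f64)> {
    let slots: Vec<f64> = samples.iter().flatten().copied().collect();
    if slots.is_empty() {
        return None;
    }
    let mean = slots.iter().sum::<f64>() / slots.len() as f64;
    let max = slots.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    Some((mean, max))
}

/// Per-CPU counter: cross-CPU sum at the last sample, kept at u64 width.
fn fold_per_cpu_counter(samples: &[Vec<u64>]) -> Option<u64> {
    samples.last().map(|cpus| cpus.iter().sum::<u64>())
}
```

Every fold returns None for an empty series, so a metric that was never sampled reads as absent rather than as zero.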
The metric key is <scheduler-obj>_<label> (per-CPU gauges get
_avg / _max variants). The prefix is libbpf’s object name from
the scheduler’s global-section map, which can differ from the ops
name — scx-ktstr’s object is bpf_bpf, so its prefix is bpf_bpf.
Read the metric back with VmResult::run_metric in a post_vm
hook; an absent metric returns None, never a false 0.0.
const AVG_LAT_CRI: WatchBpfMap =
    WatchBpfMap::new(".bss", "sys_stat.avg_lat_cri", BpfMapAgg::Scalar, "avg_lat_cri");
const LAT_HEADROOM: WatchBpfMap =
    WatchBpfMap::new("cpu_ctx_stor", "lat_headroom", BpfMapAgg::PerCpu, "lat_headroom");
fn check(result: &VmResult) -> anyhow::Result<()> {
    let avg_lat_cri = result.run_metric("scx_lavd_avg_lat_cri")
        .ok_or_else(|| anyhow::anyhow!("avg_lat_cri absent"))?;
    let headroom_max = result.run_metric("scx_lavd_lat_headroom_max")
        .ok_or_else(|| anyhow::anyhow!("lat_headroom_max absent"))?;
    anyhow::ensure!(avg_lat_cri.is_finite() && headroom_max.is_finite());
    Ok(())
}
#[ktstr_test(
    scheduler = SCX_LAVD,
    watch_bpf_maps = [AVG_LAT_CRI, LAT_HEADROOM],
    post_vm = check,
)]
fn lat_metrics_surface(ctx: &Ctx) -> anyhow::Result<AssertResult> { /* workload */ }
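The key names passed to run_metric in the example follow the <scheduler-obj>_<label> rule described earlier. As a hedged illustration only (metric_keys is a hypothetical helper, not part of ktstr), the naming can be written out like this:

```rust
/// Hypothetical helper: compose the metric keys a watched field produces.
/// Per-CPU gauges yield `_avg` and `_max` variants; every other class
/// yields the bare `<scheduler-obj>_<label>` key.
fn metric_keys(scheduler_obj: &str, label: &str, per_cpu_gauge: bool) -> Vec<String> {
    let base = format!("{scheduler_obj}_{label}");
    if per_cpu_gauge {
        vec![format!("{base}_avg"), format!("{base}_max")]
    } else {
        vec![base]
    }
}
```

With the object name scx_lavd, the Scalar watch yields scx_lavd_avg_lat_cri and the PerCpu watch yields scx_lavd_lat_headroom_avg and scx_lavd_lat_headroom_max, matching the keys the check function reads.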
Resolution is lazy: the maps appear only after the scheduler attaches, so the monitor retries until the named map is present, then caches the resolved offset and re-reads only the leaf bytes each tick.
What the macro generates
The macro renames the function, registers it in the KTSTR_TESTS
distributed slice, and emits a #[test] wrapper that boots the VM
and dispatches. Details in the
attribute rustdoc.
Scheduler Definitions
A Scheduler tells the framework how to find, configure, and launch
the scheduler under test. declare_scheduler! builds one and
registers it so both #[ktstr_test] and the
verifier sweep can see it:
use ktstr::declare_scheduler;
use ktstr::prelude::*;
declare_scheduler!(MY_SCHED, {
name = "my_sched",
binary = "scx_my_sched",
sched_args = ["--exit-dump-len", "1048576"],
topology = (1, 2, 4, 1),
});
#[ktstr_test(scheduler = MY_SCHED)]
fn basic(ctx: &Ctx) -> Result<AssertResult> {
execute_defs(ctx, vec![ctx.cgroup_def("cg_0"), ctx.cgroup_def("cg_1")])
}
MY_SCHED is the Rust handle tests reference; name = "my_sched"
is the user-visible label in nextest output, sidecars, and the CLI.
Rename either independently. Once declared, the scheduler shows up
in the verifier sweep’s cells with no further wiring:
Nextest run ID 3522bea7-... with nextest profile: default
Starting 4 tests across 1 binary (55 tests skipped)
PASS [ 12.406s] (1/4) ktstr::kaslr_axis_e2e verifier/ktstr_sched/kernel_7_0/odd-3llc
PASS [ 12.432s] (2/4) ktstr::kaslr_axis_e2e verifier/ktstr_sched/kernel_7_0/smt-2llc
...
Defining a scheduler
declare_scheduler! emits a pub static MY_SCHED: Scheduler and
registers a reference to it in the KTSTR_SCHEDULERS distributed
slice, which is what cargo ktstr verifier enumerates.
#[ktstr_test(scheduler = ...)] expects the bare ident; the macro
takes the reference internally. The ident can carry a visibility
prefix (pub, pub(crate)).
Accepted fields
name plus exactly one binary-source key (binary, binary_path,
or the kernel_builtin_enable/kernel_builtin_disable pair) are
required; every other key is optional.
- name = "...": short human name (required).
- binary = "scx_name": discover a binary by name. Resolution happens entirely on the host, before the VM boots, and the resolved binary is packed into the guest initramfs; nothing is resolved inside the guest. The cascade: a per-name KTSTR_SCHEDULER_BIN_<NAME> env override, then the global KTSTR_SCHEDULER, then a fresh workspace build via cargo build -p <name>. A failed build refuses to serve a possibly-stale pre-built binary unless KTSTR_SCHEDULER_ALLOW_STALE_FALLBACK is set, which enables the pre-built fallbacks (a sibling of the test binary, then target/{release,debug}/). Test binaries run outside the cargo-ktstr pipeline (KTSTR_CARGO_TEST_MODE=1) skip the build and instead consult the host PATH and the pre-built fallbacks first.
- binary_path = "/abs/path": explicit pre-built binary; must exist on the host, packed into the initramfs as-is.
- kernel_builtin_enable = [...] + kernel_builtin_disable = [...]: paired guest shell-command lists for a scheduler compiled into the kernel (no userspace binary). Both keys must appear together.
- sched_args = ["--a", "--b"]: scheduler CLI args applied to every test; per-test extra_sched_args append after them.
- kargs = ["nosmt"]: extra guest kernel command-line arguments (not the scheduler’s CLI; that is sched_args). Do not override the kargs ktstr injects itself (console=, loglevel=, rdinit=); overriding them breaks guest init.
- sysctls = [Sysctl::new("kernel.foo", "1")]: applied at guest boot, before the scheduler starts (see below).
- topology = (numa_nodes, llcs, cores, threads): default VM topology tests inherit dimension-by-dimension.
- constraints = TopologyConstraints { ... }: gauntlet constraints tests inherit; see the macro reference.
- cgroup_parent = "/path": cgroup subtree the guest creates for the scheduler before it starts (see below).
- config_file = "configs/my.toml" / config_file_def = (...): the two config-file seams (see below).
- assert = Assert::NO_OVERRIDES.max_imbalance_ratio(2.0): scheduler-level checking overrides, merged between the library defaults and per-test attributes. See Customize Checking.
- kernels = ["6.14", "6.15..=7.0"]: filters which kernel-list entries this scheduler verifies against in the verifier sweep. Entries use the same grammar as cargo ktstr verifier --kernel (exact versions, inclusive ranges, paths, git specs); empty means no filter. Match semantics live in BPF Verifier Sweep.
Manual definition
The const builder still works when the macro doesn’t fit — e.g. a programmatically composed scheduler, or a fixture that must stay out of the verifier sweep:
use ktstr::prelude::*;
const MITOSIS: Scheduler = Scheduler::named("scx_mitosis")
.binary(SchedulerSpec::Discover("scx_mitosis"))
.topology(1, 2, 4, 1)
.sched_args(&["--exit-dump-len", "1048576"])
.cgroup_parent("/ktstr")
.assert(Assert::NO_OVERRIDES.max_imbalance_ratio(2.0));
Scheduler::named("foo").binary_discover("scx_foo") is shorthand
for .binary(SchedulerSpec::Discover("scx_foo")) — the argument is
the binary name to discover, not the scheduler name. A manual const
is not registered in KTSTR_SCHEDULERS, so the verifier sweep does
not see it; use declare_scheduler! for anything that should
participate in cargo ktstr verifier.
SchedulerSpec
pub enum SchedulerSpec {
    Eevdf,                  // no sched_ext binary — kernel EEVDF
    Discover(&'static str), // host-side discovery by name
    Path(&'static str),     // explicit host path
    KernelBuiltin {         // compiled into the kernel
        enable: &'static [&'static str],
        disable: &'static [&'static str],
    },
}
Scheduler::EEVDF (binary SchedulerSpec::Eevdf) runs tests under
the kernel’s default scheduler and is what #[ktstr_test] uses when
scheduler = is omitted. It is not reachable via
declare_scheduler! — reference Scheduler::EEVDF directly.
Eevdf and KernelBuiltin are excluded from the verifier sweep:
neither has a userspace binary to load BPF programs from.
Kernel-builtin example
declare_scheduler!(MINLAT, {
name = "minlat",
kernel_builtin_enable = ["echo minlat > /sys/kernel/debug/sched/ext/root/ops"],
kernel_builtin_disable = ["echo none > /sys/kernel/debug/sched/ext/root/ops"],
});
The enable commands run in the guest before scenarios start;
disable runs after they complete.
Sysctls
sysctls takes Sysctl::new("key", "value") pairs (dot-separated
keys; duplicates apply in order, last write wins). The framework
injects each as sysctl.<key>=<value> on the guest kernel command
line, so the kernel applies them at boot — each test gets a fresh
VM, so there is no apply/revert step. Sysctl::new is const fn,
so a shared tuning block can live in a const slice:
const RT_TUNING: &[Sysctl] = &[
Sysctl::new("kernel.sched_rt_runtime_us", "950000"),
Sysctl::new("kernel.numa_balancing", "0"),
];
declare_scheduler!(RT_TUNED, {
name = "rt_tuned_scx",
binary = "scx_rt_tuned",
sysctls = RT_TUNING,
});
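The sysctl-to-command-line mapping described above is simple enough to sketch. The helpers below are hypothetical (not ktstr's API) and work on plain (key, value) string pairs rather than the crate's Sysctl type; they only illustrate the sysctl.<key>=<value> rendering and the last-write-wins rule for duplicates.

```rust
/// Render (key, value) pairs as `sysctl.<key>=<value>` kernel
/// command-line arguments, preserving declaration order.
fn sysctl_kargs(pairs: &[(&str, &str)]) -> Vec<String> {
    pairs.iter().map(|(k, v)| format!("sysctl.{k}={v}")).collect()
}

/// The value that ends up in effect for `key` when duplicates are
/// applied in order: the last declaration wins.
fn effective_value<'a>(pairs: &[(&str, &'a str)], key: &str) -> Option<&'a str> {
    pairs.iter().rev().find(|(k, _)| *k == key).map(|(_, v)| *v)
}
```

For the RT_TUNING block above, the rendered arguments would be sysctl.kernel.sched_rt_runtime_us=950000 and sysctl.kernel.numa_balancing=0.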
Config files
Pick one of config_file / config_file_def — they are
alternatives.
- The config is the same file for every test → config_file = "configs/my_sched.toml". The framework packs the host file into the guest at /include-files/{filename} and prepends --config /include-files/{filename} to the scheduler args. The --config flag name is fixed; a scheduler that uses a different flag can still take the packed path via sched_args, but must tolerate the extra --config argument.
- The config varies per test → config_file_def = ("--config={file}", "/include-files/my.json") declares the arg-template + guest-path pair, and each test supplies content via #[ktstr_test(config = …)]. The pairing is enforced both ways at compile time; see Inline scheduler config.
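As a rough model of the config_file branch, the sketch below derives the guest path from the host path's filename and prepends the fixed --config flag to the scheduler's own args. config_file_args is a hypothetical name, and passing the flag and the path as two separate argv entries is an assumption.

```rust
use std::path::Path;

/// Hypothetical model of `config_file`: compute the guest path
/// `/include-files/{filename}` and prepend `--config <guest path>`
/// to the scheduler's own args. Returns None when the host path has
/// no usable filename component.
fn config_file_args(host_path: &str, sched_args: &[&str]) -> Option<Vec<String>> {
    let filename = Path::new(host_path).file_name()?.to_str()?;
    let guest_path = format!("/include-files/{filename}");
    // Assumption: flag and path are two separate argv entries.
    let mut argv = vec!["--config".to_owned(), guest_path];
    argv.extend(sched_args.iter().map(|a| a.to_string()));
    Some(argv)
}
```

Because the config flag comes first, a scheduler that reads its config from a different flag still sees --config in its argv, which is the tolerance requirement noted in the first bullet.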
Both fields may technically coexist (the config_file path is
always packed and its flag prepended; the inline config is written
when a test supplies config = …), but a two-config launch is
rarely what anyone wants — pick one.
Cgroup parent
cgroup_parent = "/ktstr" makes guest init create
/sys/fs/cgroup/ktstr (enabling cpuset/cpu controllers on its
ancestors) before the scheduler starts. It does not pass
--cell-parent-cgroup to the scheduler — a cell-aware scheduler
that needs the flag must carry it in sched_args or per-test
extra_sched_args, and the guest then also creates the directory
named by the flag. Paths are validated at compile time by
CgroupPath: they must start with /, must not be / alone, and
must not contain a .. (parent-directory) sequence.
The same validation applies to any --cell-parent-cgroup value
found in sched_args / extra_sched_args at test setup: empty
values, bare /, relative paths, and a trailing flag with no value
all panic with an actionable message instead of resolving to (or
next to) the host cgroup root and corrupting host state.
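The three path rules can be expressed as a small runtime check. This is only a sketch: validate_cgroup_path is a hypothetical function, ktstr's CgroupPath performs its validation at compile time, and rejecting .. per path component (rather than as a raw substring) is an assumption.

```rust
/// Hypothetical runtime version of the cgroup path rules:
/// must start with '/', must not be '/' alone, must not contain '..'.
fn validate_cgroup_path(path: &str) -> Result<(), String> {
    if !path.starts_with('/') {
        return Err(format!("cgroup path {path:?} must start with '/'"));
    }
    if path == "/" {
        return Err("cgroup path must not be '/' alone".to_owned());
    }
    // Assumption: '..' is rejected as a whole path component.
    if path.split('/').any(|component| component == "..") {
        return Err(format!("cgroup path {path:?} must not contain '..'"));
    }
    Ok(())
}
```

An empty value fails the first rule, so the empty, bare-root, and relative cases listed above are all rejected before any directory is created.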
Default topology
topology = (numa_nodes, llcs, cores_per_llc, threads_per_core)
sets the VM topology tests inherit. Scheduler::named() defaults to
(1, 1, 2, 1) — a minimal 2-CPU VM. Tests override individual
dimensions; unset ones still inherit:
// Inherits llcs=2, cores=4 from MITOSIS; overrides threads to 2.
#[ktstr_test(scheduler = MITOSIS, threads = 2)]
fn smt_test(ctx: &Ctx) -> Result<AssertResult> { /* ... */ }
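Dimension-by-dimension inheritance amounts to an Option-based merge. The types below are hypothetical stand-ins for illustration, not ktstr's own structs:

```rust
/// Hypothetical stand-in for a fully resolved VM topology.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Topology {
    numa_nodes: u32,
    llcs: u32,
    cores: u32,
    threads: u32,
}

/// Hypothetical per-test overrides: None means "inherit".
#[derive(Debug, Clone, Copy, Default)]
struct TopologyOverride {
    numa_nodes: Option<u32>,
    llcs: Option<u32>,
    cores: Option<u32>,
    threads: Option<u32>,
}

/// Each dimension takes the test's value when set, otherwise the
/// scheduler's default.
fn resolve(default: Topology, over: TopologyOverride) -> Topology {
    Topology {
        numa_nodes: over.numa_nodes.unwrap_or(default.numa_nodes),
        llcs: over.llcs.unwrap_or(default.llcs),
        cores: over.cores.unwrap_or(default.cores),
        threads: over.threads.unwrap_or(default.threads),
    }
}
```

Applied to the MITOSIS default of (1, 2, 4, 1) with only threads = 2 set, the resolved topology is (1, 2, 4, 2).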
Related test-attribute slots
Two #[ktstr_test] attributes complement the scheduler definition:
staged_schedulers = [PATH, …] packs extra scheduler binaries for
runtime swaps via Op::ReplaceScheduler / Op::AttachScheduler,
and workload_root_cgroup = "/path" roots workload cgroups
independently of the scheduler’s cgroup_parent. Both are
documented in the macro reference.
Payloads
Payload authoring — #[derive(Payload)], metric hints, include
files — lives on its own page:
Payloads and Included Files.
Payloads and Included Files
Scheduler tests often need a real benchmark running alongside the
cgroup workers — schbench for wakeup latency, fio for IO
pressure, stress-ng for raw contention. A Payload declares that
binary once: its default args, how to parse its output, the metrics
it emits, the checks that gate them, and the files it needs packed
into the guest.
Declaring a payload
#[derive(Payload)] on a marker struct generates a const Payload.
This is a real fixture from ktstr’s own test suite
(tests/common/fixtures.rs) — schbench with a machine-parseable
JSON summary on stdout:
use ktstr::Payload;
#[derive(Payload)]
#[payload(binary = "schbench", name = "schbench_json", output = Json)]
#[default_args("--runtime", "5", "--message-threads", "2", "--json", "-")]
#[default_check(exit_code_eq(0))]
#[metric(name = "int.rps_pct50.0", polarity = HigherBetter, unit = "rps")]
#[metric(name = "int.wakeup_latency_pct99.0", polarity = LowerBetter, unit = "us")]
#[metric(name = "int.request_latency_pct99.0", polarity = LowerBetter, unit = "us")]
pub struct SchbenchJsonPayload;
The derive emits pub const SCHBENCH_JSON: Payload — the const name
is the struct name with a trailing Payload stripped and converted
to SCREAMING_SNAKE_CASE (FioPayload → FIO; a suffixless
BenchDriver → BENCH_DRIVER). The const’s visibility matches the
struct’s.
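The naming rule can be checked against the three examples given here with a short sketch. payload_const_name is a hypothetical function, and the real derive may treat acronym runs or digits differently:

```rust
/// Hypothetical model of the derive's naming rule: strip a trailing
/// `Payload`, then convert CamelCase to SCREAMING_SNAKE_CASE by
/// inserting `_` before every interior uppercase letter.
fn payload_const_name(struct_name: &str) -> String {
    let stem = struct_name.strip_suffix("Payload").unwrap_or(struct_name);
    let mut out = String::with_capacity(stem.len() + 4);
    for (i, ch) in stem.chars().enumerate() {
        if ch.is_ascii_uppercase() && i > 0 {
            out.push('_');
        }
        out.push(ch.to_ascii_uppercase());
    }
    out
}
```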
The attributes:
- #[payload(binary = "...", name = "...", output = ...)]: binary (required) names the executable the include-file pipeline resolves and packs; name (default: the binary name) is the display label in sidecars and logs, so two fixtures can share one binary; output is Json (parse numeric leaves from the output) or ExitCode (status code only, the default).
- #[default_args("--a", "--b")]: CLI args prepended to every invocation; per-test .arg(...) calls append after them.
- #[default_check(exit_code_eq(0))]: a MetricCheck constructor (min, max, range, exists, exit_code_eq); the MetricCheck:: prefix is optional. Repeat the attribute for several checks.
- #[metric(name = "...", polarity = ..., unit = "...")]: declares a metric the payload emits. polarity is HigherBetter, LowerBetter, TargetValue(x), or Unknown; it drives list-metrics and comparison direction. Duplicate metric names are rejected at expansion.
- #[include_files("helper", "config.json")]: extra files packed into the guest alongside the binary. The binary itself is auto-prepended, so it never needs listing.
Payload is #[non_exhaustive]: downstream crates cannot use
struct-literal construction. For a binary with no declared metrics
or args, Payload::binary(name, executable) is the one-line
constructor; for anything richer, use the derive. Metrics extracted
with no matching #[metric] hint still land in the sidecar with
Polarity::Unknown — declare a hint for any metric a comparison
verdict should classify.
Using a payload in a test
Reference the const from the test attribute, run it from the body:
#[ktstr_test(scheduler = MY_SCHED, payload = SCHBENCH_JSON, duration_s = 10)]
fn wakeup_latency_under_load(ctx: &Ctx) -> Result<AssertResult> {
ctx.payload(&SCHBENCH_JSON)
.arg("--runtime").arg("8")
.run()
.map(|(assert_result, _metrics)| assert_result)
}
payload = is the primary slot; workloads = [A, B] composes more
payloads alongside it (each runnable via ctx.payload(&A)). The
builder returned by ctx.payload(...) inherits the payload’s
default args and checks; .arg(...) / .args(...) extend,
.in_cgroup(...) places the child, .timeout(...) bounds it, and
the terminal .run() blocks and returns
Result<(AssertResult, PayloadMetrics)>. Only binary-kind payloads
are runnable; the scheduler = slot is separate and takes a bare
Scheduler, never a Payload.
Two dedup rules: the same const may not appear in both payload and
workloads (or twice in workloads), but two distinct consts that
share a binary — like FIO and FIO_JSON — are not deduped and
will spawn the binary twice, each with its own argv. Pick one
fixture per binary unless two instances are the point.
Metric extraction: stdout first, then stderr
OutputFormat::Json reads the payload’s stdout as the primary
stream, then falls back to stderr if stdout is empty or yields no
metrics. Some benchmarks emit their numbers only to stderr —
schbench, for example, writes its Wakeup Latencies percentiles /
Request Latencies percentiles blocks via fprintf(stderr, ...)
and leaves stdout blank (pass --json - for a machine-parseable
summary on stdout). The fallback keeps those benchmarks usable
without a redirect.
Consequence: a payload that writes mixed output to both streams has metrics extracted from stdout only, because the fallback fires solely when the primary stream yields nothing parseable. If you care about stderr-side numbers for a stdout-emitting binary, redirect stderr into stdout at the payload layer.
stress-ng is the mirror trap: progress and per-stressor summaries
go to stderr and stdout is blank, so the fallback sees prose and
OutputFormat::Json returns zero metrics. Keep
OutputFormat::ExitCode for stress-ng unless the payload is wired
to emit JSON on stdout.
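The stdout-first, stderr-fallback rule and both traps can be reproduced with a toy extractor. This is a sketch under stated assumptions: parse_metrics is a stand-in that reads one name=value pair per line instead of JSON, and both function names are hypothetical.

```rust
/// Toy parser for this sketch only: one `name=value` metric per line;
/// lines that do not parse are ignored. (The real format is JSON.)
fn parse_metrics(text: &str) -> Vec<(String, f64)> {
    text.lines()
        .filter_map(|line| {
            let (name, value) = line.split_once('=')?;
            Some((name.trim().to_owned(), value.trim().parse::<f64>().ok()?))
        })
        .collect()
}

/// stdout is the primary stream; stderr is consulted only when stdout
/// yields no metrics at all.
fn extract_with_fallback(stdout: &str, stderr: &str) -> Vec<(String, f64)> {
    let primary = parse_metrics(stdout);
    if primary.is_empty() {
        parse_metrics(stderr)
    } else {
        primary
    }
}
```

The fallback is all-or-nothing: as soon as stdout produces one metric, stderr is never read, and prose on stderr with a blank stdout produces an empty result rather than an error.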
Included files
Payloads declare their guest-filesystem dependencies on the
Payload itself via #[include_files(...)], instead of relying on
the CLI -i / --include-files flag at every invocation. Specs are
resolved at test time through the same pipeline the CLI flag uses
(see ktstr shell).
Spec shapes
Which branch fires is decided by the shape of the string:
- Bare name (single component, no /): looked up in the current working directory first, then the host PATH. Packed as include-files/<filename>. "fio" → host /usr/bin/fio → guest /include-files/fio.
- Relative or absolute path: used verbatim and must exist; relative paths resolve against the harness’s working directory at test time. Packed as include-files/<filename>. "./test-fixtures/workload.json" → guest /include-files/workload.json.
- Directory: walked recursively (symlinks followed, non-regular files skipped); the basename becomes the root. "./helpers" containing a.sh and sub/b.sh → guest /include-files/helpers/a.sh and /include-files/helpers/sub/b.sh.
Strings in the test-level extra_include_files attribute follow the
same three shapes. They are not anchored to CARGO_MANIFEST_DIR —
they resolve against the working directory at test time, plus PATH
for bare names, and the attribute accepts plain string literals only
(no concat!(env!(...))). For fixtures shipped alongside test
source, the reliable options are a bare name placed on PATH by a
setup step, or a relative path rooted where the test is invoked.
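The shape rules and the resulting archive paths can be sketched without touching the filesystem. All names below are hypothetical; distinguishing a directory from a file needs a filesystem check that this sketch leaves to the caller.

```rust
use std::path::Path;

/// Shapes decidable from the string alone. Telling a directory from a
/// file requires a filesystem lookup, which this sketch omits.
#[derive(Debug, PartialEq, Eq)]
enum SpecShape {
    BareName,
    ExplicitPath,
}

/// A spec with no '/' is a bare name; anything else is used as a path.
fn classify(spec: &str) -> SpecShape {
    if spec.contains('/') {
        SpecShape::ExplicitPath
    } else {
        SpecShape::BareName
    }
}

/// Files are packed as `include-files/<filename>`.
fn file_archive_path(spec: &str) -> Option<String> {
    let filename = Path::new(spec).file_name()?.to_str()?;
    Some(format!("include-files/{filename}"))
}

/// Directory contents keep their relative layout under the basename.
fn dir_archive_path(dir_spec: &str, relative: &str) -> Option<String> {
    let root = Path::new(dir_spec).file_name()?.to_str()?;
    Some(format!("include-files/{root}/{relative}"))
}
```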
A fully declarative test
use ktstr::declare_scheduler;
use ktstr::prelude::*;
declare_scheduler!(MY_SCHED, {
name = "my_sched",
binary = "scx_my_sched",
topology = (1, 1, 2, 1),
});
#[derive(Payload)]
#[payload(binary = "bench-driver")]
#[include_files("bench-helper")]
#[metric(name = "ops_per_sec", polarity = HigherBetter, unit = "ops/s")]
struct BenchDriver;
#[ktstr_test(
scheduler = MY_SCHED,
payload = BENCH_DRIVER,
extra_include_files = ["test-fixtures/workload.json"],
duration_s = 5,
)]
fn bench_driver_runs_with_declared_helpers(ctx: &Ctx) -> Result<AssertResult> {
// bench-driver, bench-helper, and workload.json all land in the
// guest at /include-files/ and are on the worker's PATH; no -i
// flag on any host-side invocation.
ctx.payload(&BENCH_DRIVER)
.run()
.map(|(assert_result, _metrics)| assert_result)
}
The declarative set — the payload’s include_files, each
workload’s, and extra_include_files — is aggregated at test time
and deduped on identical (archive path, host path) pairs. Two
declarations that resolve to the same archive slot with different
host paths are a hard error naming both host paths, rather than one
silently overwriting the other.
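A minimal sketch of that aggregation rule, with a hypothetical aggregate_includes function and an illustrative error string (not ktstr's actual message):

```rust
use std::collections::BTreeMap;

/// Aggregate (archive path, host path) declarations. Identical pairs
/// collapse into one entry; one archive slot claimed by two different
/// host paths is an error that names both.
fn aggregate_includes(decls: &[(&str, &str)]) -> Result<BTreeMap<String, String>, String> {
    let mut slots: BTreeMap<String, String> = BTreeMap::new();
    for &(archive, host) in decls {
        if let Some(existing) = slots.get(archive) {
            if existing.as_str() != host {
                return Err(format!(
                    "archive path {archive} claimed by both {existing} and {host}"
                ));
            }
            continue; // identical pair: deduped
        }
        slots.insert(archive.to_owned(), host.to_owned());
    }
    Ok(slots)
}
```

Keying on the archive path is what turns a silent overwrite into a detectable conflict: the second declaration is compared against the first instead of replacing it.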
Probe-wiring environment variables
Two variables pack the jemalloc allocator probe pair into the guest:
KTSTR_JEMALLOC_PROBE_BINARY and
KTSTR_JEMALLOC_ALLOC_WORKER_BINARY (absolute host paths; unset
means no probe is packed). They must be populated before ktstr’s
nextest pre-dispatch runs — plain test-body code is too late — so
tests that need them set both from a #[ctor] constructor, using
the re-export at ktstr::__private::ctor to avoid a second ctor
crate in the dependency tree. See
Environment Variables for
the reference rows.
Custom Scenarios
The body of a #[ktstr_test] function is the scenario — there is
no separate registration step. Most bodies hand control to a canned
scenario or to execute_defs / execute_steps; a custom scenario
is the same function keeping control and driving cgroups, workers,
and assertions itself.
For dynamic scenarios (cgroup creation/removal, cpuset changes), prefer the ops/steps system over a hand-written scenario. Reach for custom code only when ops cannot express the logic:
- Cgroups created, removed, or resized at fixed points in the run — ops cover it.
- Different work types, worker counts, or phase-scoped checks per step — ops cover it.
- Snapshot captures at chosen points — ops cover it (Snapshots).
- Branching on state observed mid-run, computing cpusets from
runtime conditions, or asserting directly on raw
WorkerReports — ops cannot; write a custom scenario.
A worked custom scenario
Shrink one cgroup’s cpuset mid-run — a decision ops cannot make, because the second half’s cpuset and the assertion both depend on runtime state — then assert the scheduler actually moved the workers:
use ktstr::prelude::*;
use ktstr::scenario::*;
#[ktstr_test(llcs = 2, cores = 4, threads = 1, duration_s = 10)]
fn workers_follow_cpuset_shrink(ctx: &Ctx) -> Result<AssertResult> {
    let wl = dfl_wl(ctx);
    // Creates cg_0 and cg_1, spawns and starts workers in each.
    let (mut handles, _guard) = setup_cgroups(ctx, 2, &wl)?;
    // First half: full topology.
    std::thread::sleep(ctx.duration / 2);
    // Mid-run: pin cg_0 to LLC 1 only.
    let llc1 = ctx.topo.llc_aligned_cpuset(1);
    ctx.cgroups.set_cpuset("cg_0", &llc1)?;
    // Second half: workers must migrate off LLC 0.
    std::thread::sleep(ctx.duration / 2);
    let cg0_reports = handles.remove(0).stop_and_collect();
    let migrations: u64 = cg0_reports.iter().map(|r| r.migration_count).sum();
    anyhow::ensure!(
        migrations > 0,
        "cpuset shrink forced no migrations — cg_0 workers never moved"
    );
    let mut result = ctx.assert.assert_cgroup(&cg0_reports, None);
    result.merge(collect_all(handles, &ctx.assert));
    Ok(result)
}
Bind the CgroupGroup to a named variable (_guard) so the cgroups
live until end of scope — see
CgroupGroup for drop semantics.
Sleeping ctx.duration (rather than a hard-coded period) keeps the
scenario composable with duration_s = N overrides and the gauntlet
budget controller.
Imports: setup_cgroups, dfl_wl, collect_all, and spawn_diverse live in ktstr::scenario, not in the prelude. The use ktstr::scenario::*; line is required; use ktstr::prelude::*; alone does not bring them into scope.
Helper functions
setup_cgroups(ctx, n, wl) — creates cgroups cg_0..cg_{n-1},
spawns and starts workers in each, and returns
(Vec<WorkloadHandle>, CgroupGroup) with handles in cgroup order.
collect_all(handles, checks) — stops all workers and collects
reports. Per-cgroup telemetry is always produced; only the checks
the caller enabled record assertion outcomes, and with no checks
enabled the result stays pass (there is no implicit starvation
fallback).
dfl_wl(ctx) — a WorkloadConfig with ctx.workers_per_cgroup
workers and default settings (WorkType::SpinWait).
spawn_diverse(ctx, cgroup_names) — spawns rotating
work types across cgroups (SpinWait,
Bursty, IoSyncWrite, Mixed, YieldHeavy); IoSyncWrite cgroups always
get 2 workers so blocking IO does not drown the scenario.
Custom work functions
When the built-in work types don’t
generate the load pattern you need, WorkType::Custom runs a
user-supplied work function inside each worker. The framework
handles fork, cgroup placement, affinity, and signal setup; the
function owns the work loop and all WorkerReport population —
framework telemetry (migration tracking, gap detection, schedstat
deltas) is not provided.
use std::sync::atomic::Ordering;
use ktstr::workload::{WorkType, WorkerCtx, WorkerReport};
fn my_workload(ctx: &WorkerCtx) -> WorkerReport {
    let tid: i32 = std::process::id() as i32; // one worker = one process
    let start = std::time::Instant::now();
    let mut work_units = 0u64;
    while !ctx.stop().load(Ordering::Relaxed) {
        // ... custom work ...
        work_units += 1;
    }
    // Start from default() so unpopulated fields stay zero/empty.
    WorkerReport {
        tid,
        work_units,
        iterations: work_units,
        wall_time_ns: start.elapsed().as_nanos() as u64,
        ..WorkerReport::default()
    }
}
let wt = WorkType::custom("my_workload", my_workload);
WorkerCtx exposes the stop flag (ctx.stop()), ctx.cpus(),
ctx.sibling_pids(), ctx.cgroup_dir(), and ctx.cfg(). Only
plain function pointers are accepted — they carry no captured state
across the fork boundary; closures are not supported. To pass
per-worker configuration, build the work type with
WorkType::custom_with(name, run, cfg): CustomCfg is a Copy POD
payload inherited byte-faithfully across fork. For genuinely
shared state, allocate a MAP_SHARED region and pass its address
through a u64 slot.
Warning
Every worker calls setpgid(0, 0) after fork, and teardown SIGKILLs the worker’s whole process group twice (at collect and at handle drop). Any child a custom function spawns inherits that pgid and dies with it. A child that must outlive the worker needs setpgid(child_pid, 0) after fork, or an explicit wait before the function returns.
The Ctx fields scenario authors use
- ctx.cgroups: create/remove cgroups, set cpusets, move tasks. A &dyn CgroupOps trait object; CgroupManager is the production implementation.
- ctx.topo: CPU/LLC/NUMA queries and cpuset generation. See Topology.
- ctx.duration: the workload wall-clock budget; sleep against this, not a literal.
- ctx.settle: time to wait after cgroup creation for the scheduler to stabilize.
- ctx.workers_per_cgroup: default per-cgroup worker count (dfl_wl reads it; there is no workers test attribute, so set counts via CgroupDef::named("x").workers(n)).
- ctx.sched_pid: scheduler PID for liveness checks; None when running under kernel-default EEVDF.
- ctx.assert: the merged check set (defaults → scheduler → per-test). Pass to collect_all / assert_cgroup so attribute overrides actually apply.
- ctx.work_type_override: gauntlet-supplied work type applied to CgroupDefs marked swappable; it does not affect dfl_wl.
- ctx.current_step: live phase counter (0 = baseline, 1..=N = step ordinal), readable via ctx.current_step.load(Ordering::Acquire) to gate behavior on phase; periodic captures are stamped with the same value.
The remaining fields are framework wiring; see the Ctx rustdoc.
Snapshots
Was the scheduler’s per-task state right in the middle of the run? A snapshot answers that: the freeze coordinator pauses every vCPU long enough to walk the kernel’s BPF maps, BTF-render every captured value, and store the result under a name you choose. Test code reads it back through a typed accessor whose errors carry the available alternatives — a typo’d map or field name tells you what was actually there.
Three capture triggers share this machinery:
| Capture | Trigger | The question it answers |
|---|---|---|
| Op::capture_snapshot (this page) | a chosen point in the scenario | what does state look like right now? |
| Watch Snapshots | a kernel write to a named symbol | what was state at the instant the kernel touched X? |
| Periodic Capture | evenly spaced boundaries | how does state evolve across the run? |
In a #[ktstr_test] scenario the pipeline is wired automatically:
the op sends a request from the guest to the host coordinator, which
freezes, captures, and stores the report on the host-side
SnapshotBridge. The test reads captures after the VM exits, in a
post_vm callback. No bridge setup is needed — manual wiring exists
only for host-side unit tests.
Capturing and reading
use ktstr::prelude::*;
fn inspect_after_spawn(result: &VmResult) -> anyhow::Result<()> {
    let drained = result.snapshot_bridge.drain_ordered_with_stats();
    let entry = drained
        .iter()
        .find(|e| e.tag == "after_spawn")
        .ok_or_else(|| anyhow::anyhow!("snapshot 'after_spawn' missing"))?;
    let snap = Snapshot::new(&entry.report);
    let nr_dispatched = snap.var("nr_dispatched").as_u64()?;
    anyhow::ensure!(nr_dispatched > 0, "scheduler never dispatched");
    Ok(())
}
#[ktstr_test(scheduler = MY_SCHED, post_vm = inspect_after_spawn)]
fn snapshot_then_inspect(ctx: &Ctx) -> Result<AssertResult> {
    let steps = vec![Step {
        setup: vec![ctx.cgroup_def("workers")].into(),
        ops: vec![Op::capture_snapshot("after_spawn")],
        hold: HoldSpec::FULL,
    }];
    execute_steps(ctx, steps)
}
A scenario may issue any number of Op::capture_snapshot ops with
distinct names; reusing a name overwrites the prior capture (with a
warning). If the capture pipeline is unavailable, the op fails
loudly — a snapshot that silently didn’t happen would let
assertions that depend on it pass vacuously.
The accessor surface
Snapshot::new(report) builds a borrowed view; accessors walk the
report in place.
Maps and globals
let map = snap.map("scx_per_task")?; // a captured map by name
let nr = snap.var("nr_cpus_onln").as_u64()?; // a top-level global
var(name) searches every *.bss / *.data / *.rodata
global-section map for a top-level member. When several schedulers’
sections carry the same name, var first tries to resolve the
active scheduler’s copy automatically; live_var(name) opts into
that active-scheduler filter explicitly, and map(name) addresses
one scheduler’s section directly. Note var does not split dotted
paths — to walk into a struct global, chain:
snap.var("ctx").get("weight").
Entries inside a map
let first = map.at(0); // by index
let busy = map.find(|e| e.get("tid").as_i64().unwrap_or(-1) == 1234);
let busiest = map.max_by(|e| e.get("runtime_ns").as_u64().unwrap_or(0));
let active = map.filter(|e| e.get("runtime_ns").as_u64().unwrap_or(0) > 0);
Per-CPU maps (BPF_MAP_TYPE_PERCPU_*) need narrowing before
reading: map.cpu(1).at(0). Calling get on a per-CPU entry
without .cpu(N) first is an error, not a silent first-slot read.
Dotted paths and terminal reads
get(path) walks struct members along a dotted path
(entry.get("ctx.weight") ≡ entry.get("ctx").get("weight")),
transparently following pointer dereferences up to 16 hops — you
write the path the BTF suggests, indirection is invisible. get("")
returns the current value, for terminal reads on scalar per-CPU
slots.
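The walk described above can be illustrated with a toy value tree. Everything here is hypothetical (the Value enum, the free function get, and the error strings are stand-ins, not ktstr's Snapshot types); the sketch only shows the three behaviors named in the text: dotted-path descent, bounded pointer following, and the empty path returning the current value.

```rust
/// Toy rendered-value tree for this sketch only.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Uint(u64),
    Struct(Vec<(String, Value)>),
    Ptr(Box<Value>),
}

const MAX_PTR_HOPS: usize = 16;

/// Follow pointer dereferences until a non-pointer value, giving up
/// after MAX_PTR_HOPS hops.
fn deref(mut v: &Value) -> Result<&Value, String> {
    let mut hops = 0;
    while let Value::Ptr(inner) = v {
        if hops == MAX_PTR_HOPS {
            return Err("pointer chain exceeds 16 hops".to_owned());
        }
        v = &**inner;
        hops += 1;
    }
    Ok(v)
}

/// Walk a dotted path; the empty path returns the current value.
/// A miss lists the members available at that depth.
fn get<'a>(root: &'a Value, path: &str) -> Result<&'a Value, String> {
    let mut cur = deref(root)?;
    if path.is_empty() {
        return Ok(cur);
    }
    for component in path.split('.') {
        let Value::Struct(members) = cur else {
            return Err(format!("path '{path}': '{component}' is not inside a struct"));
        };
        let next = members
            .iter()
            .find(|(name, _)| name.as_str() == component)
            .map(|(_, v)| v);
        match next {
            Some(v) => cur = deref(v)?,
            None => {
                let available: Vec<&str> = members.iter().map(|(n, _)| n.as_str()).collect();
                return Err(format!(
                    "path '{path}': component '{component}' not found (members at this depth: {available:?})"
                ));
            }
        }
    }
    Ok(cur)
}
```

In this model, get(entry, "ctx.weight") and get(get(entry, "ctx")?, "weight") reach the same value even when ctx is stored behind a pointer, which is the equivalence the text states.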
| Method | Returns | Accepts |
|---|---|---|
| as_u64() | u64 | Uint, non-negative Int/Enum, Bool, Char, Ptr (raw pointer value) |
| as_i64() | i64 | Int, Uint ≤ i64::MAX, Bool, Char, Enum |
| as_bool() | bool | Bool; non-zero scalar is true |
| as_f64() | f64 | Float, Int, Uint, Enum |
| as_str() | &str | Enum with a resolved variant name |
| raw() | Option<&RenderedValue> | the underlying rendered value |
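The acceptance rules in the table are range-checked conversions. The sketch below models a subset of them (Int, Uint, Bool, Float) on a hypothetical Rendered enum; it is not ktstr's RenderedValue, and the error strings are illustrative.

```rust
/// Toy subset of rendered variants for this sketch only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Rendered {
    Int(i64),
    Uint(u64),
    Bool(bool),
    Float(f64),
}

/// Accepts Uint, non-negative Int, and Bool; rejects everything else.
fn as_u64(v: Rendered) -> Result<u64, String> {
    match v {
        Rendered::Uint(u) => Ok(u),
        Rendered::Int(i) if i >= 0 => Ok(i as u64),
        Rendered::Bool(b) => Ok(b as u64),
        other => Err(format!("cannot read as u64: actual rendered variant is {other:?}")),
    }
}

/// Accepts Int, Bool, and any Uint that fits in i64.
fn as_i64(v: Rendered) -> Result<i64, String> {
    match v {
        Rendered::Int(i) => Ok(i),
        Rendered::Uint(u) => {
            i64::try_from(u).map_err(|_| format!("cannot read as i64: {u} exceeds i64::MAX"))
        }
        Rendered::Bool(b) => Ok(b as i64),
        other => Err(format!("cannot read as i64: actual rendered variant is {other:?}")),
    }
}
```

Out-of-range values are errors rather than wrapping casts, so a negative Int never reads back as a huge u64.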
Errors carry the fix
Every accessor returns Result<_, SnapshotError>, and each variant
carries what you need to correct the call site without re-running
the test. The rendered messages (quoted from the Display impl):
- Snapshot::map miss:
  snapshot has no map '{requested}' (captured maps: {available:?})
- Snapshot::var miss:
  snapshot has no global variable '{requested}' in any *.bss/*.data/*.rodata map (available globals: {available:?})
- ambiguous global:
  snapshot global '{requested}' is ambiguous (found in {found_in:?}); use Snapshot::active().var(name) (or the shorthand Snapshot::live_var(name)) to pick the active scheduler's copy automatically, or Snapshot::map(name) to address a specific scheduler's bss explicitly
- path-walk miss:
  path '{requested}': component '{component}' (after walking '{walked}') not found (members at this depth: {available:?})
- wrong terminal type:
  path '{requested}': cannot read as {expected} — actual rendered variant is {actual}
- predicate miss (find / max_by):
  map '{map}': {op} matched none of {len} entries (first {sampled}: {available_keys:?})
  An empty map instead renders
  map '{map}': {op} matched no entries (map is empty)
  which distinguishes it from a populated map whose every entry the predicate rejected. When every sampled key renders as raw hex (no BTF for the key type at capture time), the message appends a hint naming CONFIG_DEBUG_INFO_BTF=y as the fix.
Two variants matter for series-based assertions and are routed
specially by the temporal patterns:
PlaceholderSample (the freeze rendezvous timed out, so the report
carries no real data — skipped, never counted as zero progress) and
MissingStats (the per-sample scx_stats request failed or no stats
client was wired — distinct from an in-JSON path miss so the
assertion site can branch on the cause).
SnapshotError implements std::error::Error, so it composes with
? and anyhow.
Cast-recovered pointers
Schedulers stash kernel and arena pointers in fields whose BTF says
u64, because BTF cannot express a pointer to a per-allocation
type. The host-side cast analyzer
recovers the real target type from the scheduler’s instruction
stream, and the renderer chases the pointer into the right address
space. For the test author:
- as_u64() still returns the raw pointer value, so existing tests keep working.
- Dotted-path walks follow the recovered chase transparently; nested fields appear under the same path a natively-typed pointer would give.
- Rendered dumps annotate recovered pointers so you can tell them from BTF-typed ones; no extra calls are needed to consume them.
This is what the annotations look like in a real failure dump
(scx-ktstr’s .bss, from the run on the
macro reference page):
map bpf_bpf.bss (type=array, value_size=448, max_entries=1)
.bss:
scx_arena_verify_once=true ktstr_alloc_count=76 nr_dispatched=907
nr_enqueued=495 nr_select_cpu=372 stats_magic=6004496034161779060
...
scx_task_allocator scx_allocator:
...
root 0x100000006000 → sdt_desc:
nr_free=512
chunk 0x100000007000 (sdt_alloc) → ktstr_arena_ctx{}
ktstr_bss_arena_holder ktstr_bss_arena_holder:
bss_plain_counter=76
arena_target 0x10000000aa80 (cast→arena) [chase: arena chase: STX-flow path tagged slot as Arena with deferred resolve; bridge had no entry for 0x10000000aa80]
(cast→arena) / (cast→kernel) mark analyzer-recovered pointers;
(sdt_alloc) marks a forward-declared arena type resolved through
the allocator bridge. The full annotation taxonomy lives in
Monitor.
Composing reads with writes
Snapshots are the read half of host↔guest interaction. The write
half is the #[ktstr_test] attribute bpf_map_write = CONST — a
one-shot host-side poke at scheduler-load time:
use ktstr::prelude::*;
const TRIGGER_FAULT: BpfMapWrite = BpfMapWrite::new(".bss", "crash", 1);
// (map_name_suffix, BPF global variable name, u32 value). The
// variable's byte offset is resolved from the map's program BTF at
// write time.
#[ktstr_test(scheduler = MY_SCHED, bpf_map_write = TRIGGER_FAULT, expect_err = true)]
fn fault_then_inspect(ctx: &Ctx) -> Result<AssertResult> {
// The host writes 1 into the scheduler's `crash` global before
// workers start; the scheduler reads the flag and reacts.
/* Op::capture_snapshot + post_vm read as above */
Ok(AssertResult::pass())
}
The write waits for the scheduler’s map to appear, resolves the
named variable to an offset via BTF, writes the value, and signals
completion to the guest before workers spawn. Only
BPF_MAP_TYPE_ARRAY maps are supported. A read+write test then
composes naturally: seed a flag with bpf_map_write, run the
scenario, capture with Op::capture_snapshot, assert on the
scheduler’s reaction through the Snapshot accessors.
There is no op for runtime writes — mid-scenario mutation belongs to interfaces the scheduler itself exports (sysfs, debugfs, a BPF map command interface) driven from a workload process.
Harness internals: manual bridge wiring
Warning
Do not install a thread-local bridge inside a
#[ktstr_test] scenario that boots a VM: the host coordinator owns the bridge there, and a scenario-local one would shadow it. Read captures in post_vm from VmResult::snapshot_bridge instead.
Host-side unit tests that exercise the executor without booting a guest install a fixture bridge:
let cb: CaptureCallback = std::sync::Arc::new(|_name: &str| {
Some(FailureDumpReport::default()) // hand-crafted report
});
let bridge = SnapshotBridge::new(cb);
let handle = bridge.clone();
let _guard = bridge.set_thread_local();
// ... execute_steps(...) ... then handle.drain() ...
set_thread_local returns a guard that restores the prior bridge on
drop; bind it to _guard, not let _ = — the latter drops the
guard immediately and clears the bridge before any op runs.
tests/snapshot_e2e.rs exercises this pattern end-to-end.
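The difference between the two bindings is plain Rust drop timing and can be reproduced without ktstr. In the sketch below, INSTALLED, Guard, and install are hypothetical stand-ins for the thread-local bridge slot and its guard; this guard simply clears the slot on drop, whereas the real one restores the prior bridge.

```rust
use std::cell::Cell;

thread_local! {
    /// Stand-in for the thread-local bridge slot: true while installed.
    static INSTALLED: Cell<bool> = Cell::new(false);
}

/// Hypothetical guard that clears the slot when dropped.
struct Guard;

impl Drop for Guard {
    fn drop(&mut self) {
        INSTALLED.with(|slot| slot.set(false));
    }
}

fn install() -> Guard {
    INSTALLED.with(|slot| slot.set(true));
    Guard
}

fn is_installed() -> bool {
    INSTALLED.with(|slot| slot.get())
}

/// Binding to `_guard` keeps the guard alive until the end of the scope.
fn installed_with_named_binding() -> bool {
    let _guard = install();
    is_installed()
}

/// `let _ =` binds nothing, so the guard is dropped when the statement ends.
fn installed_with_underscore() -> bool {
    let _ = install();
    is_installed()
}
```

Here installed_with_named_binding() observes the slot as installed, while installed_with_underscore() observes it already cleared, which is the failure the paragraph above warns about.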
Watch Snapshots
What did the scheduler look like at the exact instant the kernel
wrote a specific variable? Op::watch_snapshot("symbol") arms a
hardware data-write watchpoint on a named kernel symbol; every guest
write to it triggers a full snapshot capture, tagged with the symbol
name. Where Op::capture_snapshot answers “what
does state look like at this point in my scenario”,
Op::watch_snapshot answers “what was state when the kernel did
X”.
Watch snapshots are supported on x86_64 and aarch64 KVM hosts; each architecture’s KVM plumbing maps the slots onto its native hardware-watchpoint facility.
Issuing a watch
use ktstr::prelude::*;
fn read_watch_fires(result: &VmResult) -> anyhow::Result<()> {
let drained = result.snapshot_bridge.drain_ordered_with_stats();
// Each fire is stored under the symbol name as its tag.
let fires = drained.iter().filter(|e| e.tag == "scx_watchdog_timestamp");
anyhow::ensure!(fires.count() > 0, "watchpoint never fired");
Ok(())
}
#[ktstr_test(scheduler = MY_SCHED, post_vm = read_watch_fires)]
fn watch_watchdog_writes(ctx: &Ctx) -> Result<AssertResult> {
let steps = vec![
Step::with_defs(vec![ctx.cgroup_def("workers")], HoldSpec::FULL)
.set_ops(vec![Op::watch_snapshot("scx_watchdog_timestamp")]),
];
execute_steps(ctx, steps)
}
In a VM-booting #[ktstr_test], the wiring is automatic: the op
registers the symbol with the host coordinator, which resolves the
address from the vmlinux ELF, arms a free hardware watchpoint slot
via KVM_SET_GUEST_DEBUG, and stores one capture per fire on the
host-side bridge. Read the captures in post_vm through the same
Snapshot accessors every
capture kind shares. When a sidecar dump path is configured for the
run, each fire’s report is also mirrored to a tagged JSON file for
post-hoc inspection.
Choosing a symbol
Production resolution is a verbatim, byte-for-byte match against the
vmlinux ELF symbol table — no prefix stripping, no BTF lookup, no
kallsyms walk. Use exactly the name nm prints:
nm vmlinux | grep -w scx_watchdog_timestamp
A string that matches nothing fails the step with
symbol '<name>' not found in vmlinux symtab (typo, symbol stripped
from the build, or a non-ELF kernel image).
Warning
High-frequency symbols soft-lock the guest. Watching a symbol the kernel writes every jiffy (e.g.
jiffies_64 at HZ=1000) fires 1000+ captures per second, and each capture freezes all vCPUs for the full dump pipeline. The guest spends almost all of its wall time paused: schedulers stall, watchdogs fire, and the test wedges before any meaningful work runs. Pick symbols the kernel writes at scenario-relevant cadence: a state field, a per-event counter.
Three watches per scenario
The cap is 3, tied to the hardware watchpoint slots KVM exposes:
slot 0 is permanently reserved for the *scx_root->exit_kind
trigger that drives the failure-dump pipeline on SCX_EXIT_ERROR
(it always runs, whether or not a scenario declares watches), and
the remaining three user slots are yours. A fourth
Op::watch_snapshot fails the step with the pinned message:
Op::WatchSnapshot cap exceeded: scenario already registered 3
watchpoints (3 user watchpoint slots occupied; slot 0 reserved for
the error-class exit_kind trigger). Drop a watch or use
Op::CaptureSnapshot for a time-driven capture instead.
A failed registration — cap exceeded, resolution failure, callback error — does not consume a slot; the bridge rolls the count back so the scenario can retry with a different symbol.
Failure modes
Registration is the single point where the production pipeline can fail. The callback returns an error when:
- The symbol does not match any vmlinux ELF symtab entry.
- The resolved address is not 4-byte aligned (the 4-byte watch length requires addr & 0x3 == 0 on every supported architecture).
- All three user watchpoint slots are already allocated.
- KVM_SET_GUEST_DEBUG rejected the arm (host kernel limitation).
When registration fails, the executor bails the step immediately with the symbol and the reason. Silent degradation is deliberately avoided — a watch that never fires would look identical to a healthy passing run, and the test author would never notice the captures were missing.
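The registration checks and the "a failed registration does not consume a slot" rule can be modeled in a few lines. WatchRegistry and its register method are hypothetical names for this sketch; the order of the checks is an assumption, symbol resolution is reduced to an Option, and the KVM arm step is left out.

```rust
/// Number of user watchpoint slots (slot 0 is reserved by the harness).
const USER_SLOTS: usize = 3;

/// Hypothetical registry tracking armed (symbol, address) pairs.
#[derive(Default)]
struct WatchRegistry {
    armed: Vec<(String, u64)>,
}

impl WatchRegistry {
    /// `addr` is the symtab lookup result: None models "symbol not found".
    /// Returns the user slot number (1..=3) on success.
    fn register(&mut self, symbol: &str, addr: Option<u64>) -> Result<usize, String> {
        if self.armed.len() >= USER_SLOTS {
            return Err(format!(
                "cap exceeded: scenario already registered {USER_SLOTS} watchpoints"
            ));
        }
        let addr =
            addr.ok_or_else(|| format!("symbol '{symbol}' not found in vmlinux symtab"))?;
        // A 4-byte watch needs a 4-byte-aligned address.
        if addr & 0x3 != 0 {
            return Err(format!("symbol '{symbol}' at {addr:#x} is not 4-byte aligned"));
        }
        // Only a fully validated registration takes a slot.
        self.armed.push((symbol.to_string(), addr));
        Ok(self.armed.len())
    }
}
```

Because the slot is taken only after every check passes, a scenario in this model can retry with a different symbol after any failure, and a fourth successful registration is impossible.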
Host-side unit tests
Outside a VM, a watch-capable fixture bridge needs both callbacks —
a bridge built with only SnapshotBridge::new(cb) rejects every
Op::watch_snapshot with an error naming the missing wiring:
let cb: CaptureCallback = std::sync::Arc::new(|_name: &str| {
Some(FailureDumpReport::default())
});
let reg: WatchRegisterCallback = std::sync::Arc::new(|symbol: &str| {
println!("would arm watchpoint on {symbol}");
Ok(())
});
let bridge = SnapshotBridge::new(cb).with_watch_register(reg);
let _guard = bridge.set_thread_local();
Do not install a thread-local bridge in a VM-booting scenario — see the warning in Snapshots.
Periodic Capture
A single snapshot proves state was right once; scheduler bugs are usually about how state evolves — a counter that stops advancing, utilization that drifts after warmup. Periodic capture samples guest BPF state on a cadence across the workload window, driven entirely by the host: no scenario-code changes, no capture calls in the test body. The result is a time-ordered series of samples that feeds the temporal assertion patterns.
Enabling it
Set num_snapshots = N on the test; 0 (the default) disables
periodic capture entirely.
use ktstr::prelude::*;
#[ktstr_test(num_snapshots = 3, duration_s = 10)]
fn paced_capture(ctx: &Ctx) -> Result<AssertResult> {
execute_defs(ctx, vec![
CgroupDef::named("workers").workers(2).work_type(WorkType::SpinWait),
])
}
When boundaries fire
The window is the 10%–90% slice of the workload duration,
anchored at the moment the scenario actually starts — VM boot and
BPF verifier time do not eat the budget. The 10% buffers at each
end keep samples off ramp-up and ramp-down transients. The
remaining 80% divides into N + 1 equal intervals, yielding N
interior boundaries at 0.1·d + (i+1)·0.8·d/(N+1). For a 10 s
workload, num_snapshots = 3 captures at scenario start +
{3 s, 5 s, 7 s}.
The boundary clock is workload time, not wall-clock: a scenario pause shifts every un-fired boundary by the pause duration.
Two validation rules, enforced when the entry is built:
- Minimum spacing: 0.8 · duration / (N + 1) >= 100 ms. Boundaries closer than that would fire back-to-back with no workload progress between them. Reduce num_snapshots or extend duration_s.
- Bridge cap: num_snapshots cannot exceed 64 (MAX_STORED_SNAPSHOTS). Validation rejects higher values rather than silently evicting the earliest samples.
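The boundary formula and both validation rules fit in one small function. This is a sketch under stated assumptions: periodic_boundaries_ms is a hypothetical name, offsets are in milliseconds from scenario start, and integer arithmetic is used so the example is exact (the real implementation may round differently).

```rust
const MAX_STORED_SNAPSHOTS: u64 = 64;

/// Offsets (ms from scenario start) of the N interior boundaries:
/// 0.1*d + k * 0.8*d / (N + 1) for k = 1..=N.
fn periodic_boundaries_ms(duration_ms: u64, n: u64) -> Result<Vec<u64>, String> {
    if n == 0 {
        return Ok(Vec::new()); // periodic capture disabled
    }
    if n > MAX_STORED_SNAPSHOTS {
        return Err(format!("num_snapshots {n} exceeds {MAX_STORED_SNAPSHOTS}"));
    }
    // Minimum spacing: 0.8*d/(N+1) >= 100 ms  <=>  8*d >= 1000*(N+1).
    if 8 * duration_ms < 1000 * (n + 1) {
        return Err(format!(
            "spacing below 100 ms for duration {duration_ms} ms and {n} snapshots"
        ));
    }
    let denom = 10 * (n + 1);
    Ok((1..=n)
        // Same formula over a common denominator: (d*(N+1) + 8*d*k) / (10*(N+1)).
        .map(|k| (duration_ms * (n + 1) + 8 * duration_ms * k) / denom)
        .collect())
}
```

For a 10 s workload with three snapshots this reproduces the 3 s, 5 s, and 7 s offsets quoted above; a 1 s workload accepts at most seven snapshots before the spacing floor rejects the configuration.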
What a capture costs
Each boundary runs the same pipeline as an on-demand
Op::capture_snapshot: every vCPU is parked, the BPF maps are
walked, the report is stored. On a healthy guest the freeze is tens
of milliseconds (10–100 ms steady state; cold-cache or large
guest-memory walks push higher). The host watchdog deadline is
extended by each freeze’s duration, so periodic captures do not eat
the workload’s wall-clock budget — but they do briefly stop the
guest, which is why the spacing floor exists.
Tags and best-effort delivery
Each capture lands on the host SnapshotBridge under
periodic_NNN (periodic_000, periodic_001, …), coexisting with
on-demand and watchpoint tags on the same bridge — filter with
SampleSeries::periodic_only() before asserting.
Delivery is best-effort: an early VM exit, rendezvous timeout, or watchdog deadline can cut the sequence short, and the run loop abandons the remainder after 2 consecutive rendezvous timeouts so a sustained host overload does not pile up placeholder samples. Under KASLR (the default), a boundary that would fire before the guest’s address slide is published is deferred, not dropped — it fires on the next loop iteration. Assert a lower bound on coverage, not equality:
fn check_coverage(result: &VmResult) -> Result<()> {
anyhow::ensure!(result.periodic_target == 3);
anyhow::ensure!(
result.periodic_fired >= 2,
"too few periodic samples ({}/{})",
result.periodic_fired,
result.periodic_target,
);
Ok(())
}
periodic_target mirrors the configured num_snapshots;
periodic_fired counts boundaries actually serviced (including
rendezvous-timeout placeholders). When post_vm is omitted on a
periodic-configured test, the macro installs a default callback
asserting at least one boundary fired with real BPF state.
Draining the bridge
The assertion pipeline runs on the host after vm.run() returns —
inside a post_vm callback. The recommended path is
drain_ordered_with_stats fed into
SampleSeries::from_drained_typed, which preserves insertion order,
per-sample stats results, and timestamps:
use ktstr::prelude::*;
fn post_vm(result: &VmResult) -> Result<()> {
let series = SampleSeries::from_drained_typed(
result.snapshot_bridge.drain_ordered_with_stats(),
result.monitor.clone(),
)
.periodic_only();
anyhow::ensure!(
!series.is_empty(),
"no periodic samples — coordinator never fired",
);
// ... project a field and feed a temporal pattern ...
Ok(())
}
#[ktstr_test(num_snapshots = 3, duration_s = 10, post_vm = post_vm)]
fn my_test(ctx: &Ctx) -> Result<AssertResult> {
execute_defs(ctx, vec![
CgroupDef::named("workers").workers(2).work_type(WorkType::SpinWait),
])
}
Each drained entry carries the tag, the captured report, the typed
per-sample stats result (Err(MissingStatsReason) when the stats
request failed or no scheduler stats client was wired), a
pause-adjusted elapsed_ms timestamp, the scheduled
boundary_offset_ms, and the scenario phase stamp (step_index).
The other drain variants drop metadata the temporal pipeline needs —
see the
SnapshotBridge rustdoc
if you need them.
Temporal Assertions owns the sample
anatomy and projection surface;
Snapshots owns the per-sample
error routing (PlaceholderSample, MissingStats).
What to assert
Two stages: compose the series (drain, periodic_only()), then
project a column and pick a pattern. For monotonic counters,
nondecreasing is the canonical choice; for utilization-style
metrics that should hold once warmup ends, steady_within; for
“stabilizes near a target by a deadline”, converges_to. The full
pattern surface, projection helpers, and failure rendering live in
Temporal Assertions.
Temporal Assertions
Periodic snapshots produce a series of samples over time. Temporal assertions answer questions about the trajectory — does a counter only ever advance? Does a utilization metric stay near its mean once warmup ends? Does a load average converge before a deadline?
The shape is two-stage: build a SampleSeries from
the drained periodic captures, then project a
SeriesField<T> — one column of T-typed values
across every sample — and feed it through a pattern. Pick by the
question:
| Pattern | Question it answers | Type | On a projection error |
|---|---|---|---|
| nondecreasing | does this counter only go up? | any ordered | skip the pair, note the gap |
| strictly_increasing | does it advance every period? | any ordered | skip the pair, note the gap |
| rate_within(lo, hi) | does it advance at the right speed? | f64 | gap (no rate across it), note |
| steady_within(warmup, tol) | does it hold near its mean after warmup? | f64 | skip the sample, note |
| converges_to(target, tol, deadline) | does it stabilize near a target in time? | f64 | interrupts the witness run, note |
| always_true | does this invariant hold at every sample? | bool | fail (strict) |
| ratio_within(other, lo, hi) | do two series stay in proportion? | f64 | skip the index, note |
| each(...) | is every sample inside a scalar bound? | any | fail (strict) |
Every pattern takes &mut Verdict and returns it, so assertions
chain onto one accumulator; each failure records a
DetailKind::Temporal detail, and coverage gaps record Notes.
For enabling capture and draining the bridge, see
Periodic Capture — this page covers
projection and assertion.
SampleSeries
SampleSeries is the ordered sample sequence drained from the
bridge after the VM exits:
use ktstr::prelude::*;
let drained = vm_result.snapshot_bridge.drain_ordered_with_stats();
let series = SampleSeries::from_drained_typed(drained, monitor).periodic_only();
periodic_only() filters to tags beginning with "periodic_",
stripping on-demand captures and watchpoint fires that share the
bridge; periodic_ref() is the borrowed-iterator equivalent when
one test needs both views.
SampleSeries exposes:
- len(), is_empty(): sample count.
- iter_samples(): borrowed Sample<'_> views. Each sample carries tag, a pause-adjusted elapsed_ms, a Snapshot<'_> over the captured BPF state, a step_index: Option<u16> phase stamp, and stats: Result<&Value, &MissingStatsReason> (the per-sample scx_stats JSON, or the typed reason the stats request failed).
- bpf(label, |snap| …) / stats(label, |sv| …): closure projection along the BPF or stats axis.
- bpf_live_u64(name) / bpf_live_i64 / bpf_live_f64: terse BPF-axis shorthand that resolves name via the auto-disambiguating Snapshot::live_var accessor (no closure); mirrored on the stats axis as stats_live_u64(path) / _i64 / _f64.
- bpf_map(map_name) / stats_path(path): typed auto-projection (see Auto-projection).
- by_stamped_phase(): group samples by the bridge-stamped scenario phase (BTreeMap<u16, Vec<Sample>>; 0 = baseline, 1..=N = step ordinals). Prefer by_stimulus_phase(stimulus_events) when a stimulus timeline is available: it re-derives the phase from each sample’s boundary_offset_ms and is immune to deferred-fire bursts that collapse stamped phases.
SeriesField
A SeriesField<T> is one per-sample column. Each slot is a
SnapshotResult<T>, so a missing field, type mismatch, or
placeholder report on one sample does not abort the projection — it
surfaces at the assertion site as a per-sample error the pattern
decides how to handle (see the table above). The field carries each
sample’s tag and timestamp alongside the value, so failure messages
name the offending sample without re-threading the series.
Projecting from BPF state
The bpf closure receives each sample’s Snapshot<'_>; the body is
a normal Snapshot accessor expression:
let nr_dispatched: SeriesField<u64> = series.bpf(
"nr_dispatched",
|snap| snap.var("nr_dispatched").as_u64(),
);
Projecting from scx_stats JSON
The stats closure receives a StatsValue<'_> wrapper over the
per-sample stats JSON:
let busy: SeriesField<f64> = series.stats(
"busy",
|sv| sv.get("busy").as_f64(),
);
A sample whose stats slot is Err (the stats request failed, or no
scheduler stats client was wired) yields a
SnapshotError::MissingStats { tag, reason } slot — distinct from
an in-JSON path miss (FieldNotFound / TypeMismatch) so coverage
gaps and data errors stay distinguishable.
Auto-projection
The typed auto-projectors emit ready-to-feed SeriesFields without
a closure:
// Top-level scalar member of a BPF map's first entry.
let dispatched = series
.bpf_map("scx_obj.bss")
.at(0)
.field_u64("nr_dispatched");
// Stats path drilling into nested layer/cgroup keys.
let layer_util = series
.stats_path("layers")
.key("batch")
.field_f64("util");
Bulk discovery: member_names() / u64_fields() / f64_fields()
on the BPF projector (key_names() on the stats projector) project
every member that yields at least one Ok across the series —
useful for blanket “every counter must be nondecreasing” sweeps.
The typed field_* helpers reach top-level scalars only; nested
members ("ctx.weight") need the closure path. Per-CPU maps use
the projector’s cross-CPU reductions (field_cpu_sum_* /
field_cpu_max_* / field_cpu_min_*) or .cpu(n).field_*.
The patterns
nondecreasing / strictly_increasing
Pass when every consecutive pair satisfies
values[i] <= values[i+1] (or < for the strict variant). The
shape for kernel counters whose only legal direction is up.
let mut v = Verdict::new();
nr_dispatched.nondecreasing(&mut v);
nr_dispatched.strictly_increasing(&mut v); // require advance every period
Projection errors are skipped — the affected pair is dropped, the
skip is logged as a Note, and the verdict is not flipped on
missing data; adjacent samples on either side of a gap are still
checked. Fewer than 2 samples records a “vacuously holds” Note
and passes.
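A minimal model of the pairwise rule, assuming each slot is either a projected value or a projection error. Outcome and the free function nondecreasing are hypothetical names, not the library's Verdict API, and this sketch takes the reading that any pair touching an errored slot is dropped with a note.

```rust
/// Failures flip the verdict; notes only record coverage gaps.
#[derive(Debug, Default, PartialEq)]
struct Outcome {
    failures: Vec<String>,
    notes: Vec<String>,
}

fn nondecreasing(samples: &[Result<u64, String>]) -> Outcome {
    let mut out = Outcome::default();
    if samples.len() < 2 {
        out.notes.push("vacuously holds: fewer than 2 samples".to_string());
        return out;
    }
    for (i, pair) in samples.windows(2).enumerate() {
        match (&pair[0], &pair[1]) {
            // Both values present and the later one is smaller: a regression.
            (Ok(prev), Ok(next)) if next < prev => out.failures.push(format!(
                "regression at sample {}: value {next} after prior value {prev}",
                i + 1
            )),
            (Ok(_), Ok(_)) => {}
            // Missing data is a gap, not a failure.
            _ => out
                .notes
                .push(format!("skipped pair ({i}, {}): projection error", i + 1)),
        }
    }
    out
}
```

Fed [1, error, 3, 2], this model records two skipped pairs as notes and one regression for the 3 to 2 step, so missing data and real regressions stay separate in the result.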
rate_within(lo, hi) (f64 only)
Pass when every consecutive (delta_value / delta_ms) lies in
[lo, hi], computed from the per-sample timestamps — a counter that
should advance at ~1 unit/ms reads as rate_within(0.5, 2.0).
let ticks: SeriesField<f64> = series.bpf("ticks",
|snap| snap.var("ticks").as_f64());
ticks.rate_within(&mut v, 0.5, 2.0);
A zero-time delta records an inconclusive detail (zero denominator)
naming the pair; a non-finite rate records its own detail rather
than slipping past the band; lo > hi is a single caller-error
detail. Projection errors are gaps — no rate is computed across
them.
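The per-pair classification can be sketched as follows. RateCheck and check_rates are hypothetical names; samples are (elapsed_ms, value) pairs with timestamps assumed to be nondecreasing, and projection errors are left out of this model.

```rust
#[derive(Debug, PartialEq)]
enum RateCheck {
    InBand,
    OutOfBand(f64),
    ZeroTimeDelta,
    NonFinite,
}

/// Classify delta_value / delta_ms for every consecutive pair.
fn check_rates(samples: &[(u64, f64)], lo: f64, hi: f64) -> Result<Vec<RateCheck>, String> {
    if lo > hi {
        return Err(format!("caller error: lo {lo} > hi {hi}"));
    }
    Ok(samples
        .windows(2)
        .map(|w| {
            let (t0, v0) = w[0];
            let (t1, v1) = w[1];
            let dt = t1 as f64 - t0 as f64;
            if dt == 0.0 {
                // Zero denominator: inconclusive, not a pass or a fail.
                return RateCheck::ZeroTimeDelta;
            }
            let rate = (v1 - v0) / dt;
            if !rate.is_finite() {
                RateCheck::NonFinite
            } else if rate >= lo && rate <= hi {
                RateCheck::InBand
            } else {
                RateCheck::OutOfBand(rate)
            }
        })
        .collect())
}
```

The explicit finiteness check matters because a NaN rate fails both range comparisons and an infinite one would otherwise be reported as an ordinary out-of-band number.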
steady_within(warmup_ms, tolerance) (f64 only)
Pass when every post-warmup sample (elapsed_ms >= warmup_ms) lies
inside [mean·(1-tolerance), mean·(1+tolerance)]. The mean is
computed over post-warmup samples only, so ramp-up does not bias
the baseline. tolerance is a fraction (0.10 = ±10%).
let util: SeriesField<f64> = series.stats("busy",
|sv| sv.get("busy").as_f64());
util.steady_within(&mut v, /*warmup_ms=*/ 1000, /*tolerance=*/ 0.10);
Projection errors are skipped with a Note. When warmup absorbs
every sample, the pattern notes “no samples beyond warmup” and
passes vacuously.
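A small model of the band check, assuming (elapsed_ms, value) samples with no projection errors and a non-negative mean. The free function steady_within here is a hypothetical stand-in that returns the post-warmup mean and the timestamps of out-of-band samples instead of writing to a Verdict.

```rust
/// Returns (post-warmup mean, elapsed_ms of each out-of-band sample).
/// The mean is None when warmup absorbs every sample.
fn steady_within(samples: &[(u64, f64)], warmup_ms: u64, tolerance: f64) -> (Option<f64>, Vec<u64>) {
    // Keep only samples at or after the warmup cutoff.
    let post: Vec<(u64, f64)> = samples
        .iter()
        .copied()
        .filter(|(t, _)| *t >= warmup_ms)
        .collect();
    if post.is_empty() {
        return (None, Vec::new());
    }
    // The baseline is the mean of the post-warmup samples only.
    let mean = post.iter().map(|(_, v)| v).sum::<f64>() / post.len() as f64;
    let lo = mean * (1.0 - tolerance);
    let hi = mean * (1.0 + tolerance);
    let offenders = post
        .iter()
        .filter(|(_, v)| *v < lo || *v > hi)
        .map(|(t, _)| *t)
        .collect();
    (Some(mean), offenders)
}
```

One consequence worth keeping in mind: because the band is centered on the observed mean, a single large outlier shifts the mean and can push the otherwise steady samples out of band as well.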
converges_to(target, tolerance, deadline_ms) (f64 only)
Pass when three consecutive samples land inside
[target - tolerance, target + tolerance] at or before
deadline_ms — the convergence-witness shape for “the system
stabilizes near target by the deadline”.
load.converges_to(&mut v, /*target=*/ 1.0, /*tol=*/ 0.5, /*deadline_ms=*/ 5_000);
Distinct outcomes: witness found — pass. No witness before the
deadline — a temporal failure naming the sample count (and any
errored samples that interrupted in-progress runs). Fewer than 3
successfully-projected samples in the window — a Note, not a
failure: absence of data is a coverage gap, not a negative finding,
and the note distinguishes “did not collect enough” from “collected
enough but never converged”.
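The three outcomes can be modeled with a run counter. Convergence and the free function converges_to are hypothetical names for this sketch; samples are (elapsed_ms, projected value or error) pairs, and an errored sample resets the in-progress run.

```rust
#[derive(Debug, PartialEq)]
enum Convergence {
    /// Three consecutive in-band samples; at_ms is the third one's timestamp.
    Witness { at_ms: u64 },
    /// Enough data before the deadline, but no witness run.
    NotConverged,
    /// Fewer than 3 projected samples before the deadline: a coverage gap.
    InsufficientData,
}

fn converges_to(
    samples: &[(u64, Result<f64, String>)],
    target: f64,
    tolerance: f64,
    deadline_ms: u64,
) -> Convergence {
    let mut run = 0usize;
    let mut projected = 0usize;
    for (t, slot) in samples.iter().filter(|(t, _)| *t <= deadline_ms) {
        match slot {
            Ok(v) => {
                projected += 1;
                if (v - target).abs() <= tolerance {
                    run += 1;
                } else {
                    run = 0;
                }
                if run == 3 {
                    return Convergence::Witness { at_ms: *t };
                }
            }
            // An errored sample interrupts the run in progress.
            Err(_) => run = 0,
        }
    }
    if projected < 3 {
        Convergence::InsufficientData
    } else {
        Convergence::NotConverged
    }
}
```

Keeping InsufficientData separate from NotConverged mirrors the distinction drawn above between "did not collect enough" and "collected enough but never converged".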
always_true (bool only)
Pass when every sample’s value is true. Projection errors fail
the assertion — this is a strict pattern; a missing boolean is a
coverage gap that must surface.
let alive: SeriesField<bool> = series.bpf("scheduler_alive",
|snap| snap.var("scheduler_alive").as_bool());
alive.always_true(&mut v);
ratio_within(other, lo, hi) (f64 only)
Pass when every per-index self[i] / other[i] lies in [lo, hi] —
two same-length series walked in lock-step.
util.ratio_within(&mut v, &runtime, 0.4, 0.6);
A length mismatch fires one caller-error detail and aborts. A zero
denominator records an inconclusive detail naming the sample;
out-of-band ratios record the lhs/rhs values. Projection errors on
either side are skipped with a Note naming each gap and which
side errored.
Per-sample scalar checks: each
For per-sample bounds, bypass the trajectory patterns via
SeriesField::each:
nr_dispatched.each(&mut v).at_least(1u64);
util.each(&mut v).between(0.0_f64, 100.0_f64);
ticks.each(&mut v).at_most(10_000.0_f64);
each runs the comparator on every successfully-projected sample;
the first failure records a detail and subsequent failures pile on,
so the timeline shows every offending sample. Projection errors
flip the verdict (each is strict, matching always_true). NaN
samples report an incomparable failure by name — without that
branch, IEEE-754 comparisons against NaN are always false, and a
NaN would silently pass value < floor checks.
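The NaN hazard is ordinary IEEE-754 behavior and is easy to reproduce in isolation. Both functions below are hypothetical illustrations of a floor check, not the library's comparator.

```rust
/// Naive floor check: "pass unless the value is below the floor".
/// NaN < floor is false, so a NaN sample passes.
fn naive_at_least(value: f64, floor: f64) -> bool {
    !(value < floor)
}

/// Floor check with an explicit incomparable branch for NaN.
fn checked_at_least(value: f64, floor: f64) -> Result<(), String> {
    if value.is_nan() {
        return Err("incomparable: value is NaN".to_string());
    }
    if value < floor {
        return Err(format!("{value} is below floor {floor}"));
    }
    Ok(())
}
```

With a NaN input, naive_at_least returns true while checked_at_least returns an error, which is the behavior the strict each pattern is described as having.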
Phase-bucketed comparisons
Steps stamp each capture with a scenario phase
(Phase::BASELINE, then Phase::step(0), Phase::step(1), … in
step order). Per-phase reducers on a projected field —
counter_delta_per_phase(), first_per_phase(),
last_per_phase(), value_at_phase(phase) — reduce the series to
one value per phase, and ratio_across_phases pins a later phase
against an earlier one. The swap-A/B shape (step 0 runs scheduler
A, step 1 swaps in B via Op::ReplaceScheduler):
use ktstr::assert::Phase;
let dispatched = series.bpf_live_f64("nr_dispatched");
dispatched
.ratio_across_phases(&mut v, Phase::step(0), Phase::step(1))
.at_most(1.5); // B may cost at most 1.5x A on this counter
at_most records the computed ratio and both phase values in the
verdict — pass or fail — so the margin is visible without extra
printing. A phase with no Ok-samples or a zero baseline records an
inconclusive detail rather than a fake ratio.
PhaseMapExt::ratio_across_phases does the same on a pre-reduced
BTreeMap<Phase, _> for caller-derived per-phase values.
Failure rendering
Every temporal failure carries the field’s label, the pattern name, and the offending sample’s tag and timestamp. A nondecreasing regression renders as (shape pinned by the library’s format strings):
nr_dispatched (nondecreasing): regression at sample periodic_004 (+850ms): \
value 100 after prior value 200 at sample periodic_003 (+700ms)
Coverage Notes render with the per-sample error variant, so
PlaceholderSample (rendezvous timeout), MissingStats (stats
request failed), FieldNotFound (typo / wrong map), and
TypeMismatch are distinguishable without a debugger:
nr_dispatched (nondecreasing): skipped 1 sample(s) with projection errors: \
periodic_002(+500ms): snapshot has no global variable 'nrdispatch' \
in any *.bss/*.data/*.rodata map (available globals: ["nr_dispatched", \
"stall"])
Worked example
The pipeline runs on the host: post_vm receives the VmResult
after vm.run() returns, drains the bridge, and walks the series:
use ktstr::prelude::*;
fn assert_temporal_patterns(result: &VmResult) -> Result<()> {
let series = SampleSeries::from_drained_typed(
result.snapshot_bridge.drain_ordered_with_stats(),
result.monitor.clone(),
)
.periodic_only();
let mut v = Verdict::new();
// BPF axis: counter must never regress.
let nr_dispatched: SeriesField<u64> = series.bpf(
"nr_dispatched",
|snap| snap.var("nr_dispatched").as_u64(),
);
nr_dispatched.nondecreasing(&mut v);
// Stats axis: stay under a generous ceiling.
let stats_dispatched: SeriesField<u64> = series.stats(
"nr_dispatched",
|sv| sv.get("nr_dispatched").as_u64(),
);
stats_dispatched.each(&mut v).at_most(1_000_000_000u64);
v.into_anyhow_or_log()
}
#[ktstr_test(num_snapshots = 3, duration_s = 10, post_vm = assert_temporal_patterns)]
fn dispatch_counter_advances(ctx: &Ctx) -> Result<AssertResult> {
execute_defs(ctx, vec![
CgroupDef::named("workers").workers(2).work_type(WorkType::SpinWait),
])
}
For capture wiring and num_snapshots semantics, see
Periodic Capture; for the Snapshot
accessors the projection closures call into, see
Snapshots.
Running Tests
Every #[ktstr_test] boots a fresh KVM microVM with the topology the
test declares, on the exact kernel you target. cargo ktstr test
resolves that kernel (building and caching it when needed) and wraps
cargo nextest run, so nextest’s filtering, retries, and parallelism
all apply.
Quick reference
# Run all tests
cargo ktstr test --kernel ../linux
# Run a specific test
cargo ktstr test --kernel ../linux -- -E 'test(sched_basic_proportional)'
# Run all ktstr-managed tests, skipping non-ktstr tests in the same crate
cargo ktstr test --kernel ../linux -- -E 'test(/^ktstr/)'
# Run ignored gauntlet variants
cargo ktstr test --kernel ../linux -- --run-ignored ignored-only -E 'test(gauntlet/)'
What’s in this chapter
- cargo ktstr: the host-side command for kernel resolution, test dispatch, replay, coverage, and export.
- ktstr (standalone): the debugging companion, with interactive VM shells, topo, ctprof, and locks.
- Gauntlet: run every test across a matrix of topology presets.
- BPF Verifier Sweep: verify, attach, and dispatch every declared scheduler across topologies.
- Reading Failure Output: what a failed test prints, section by section, and how to investigate.
- Auto-Repro: the second VM that replays a scheduler crash with probes attached.
- Runs and Regression Gates: result sidecars, stats, and perf-delta.
Test names and variants
Tests registered through #[ktstr_test] show up in nextest output
in one of four shapes:
- ktstr/{name}: single-kernel run (or any host_only test, which never boots a VM and so never multiplies across kernels).
- ktstr/{name}/{kernel}: one case per (test × kernel) when --kernel resolves to two or more kernels.
- gauntlet/{name}/{preset}: one case per topology preset (see Gauntlet).
- gauntlet/{name}/{preset}/{kernel}: the full (test × preset × kernel) expansion under a multi-kernel run.
This is what those names look like in a real run:
Nextest run ID 98581174-246f-4824-a170-50992df166d7 with nextest profile: default
Starting 1 test across 121 binaries (12531 tests skipped)
PASS [ 34.459s] (1/1) ktstr::failure_dump_e2e ktstr/failure_dump_renders_bss_fields
...
ktstr::worktype_coverage_fork_gauntlet_e2e gauntlet/worktype_fork_gauntlet_covers_all_arms/smt-3llc
ktstr::worktype_coverage_fork_gauntlet_e2e gauntlet/worktype_fork_gauntlet_covers_all_arms/tiny-1llc
ktstr::worktype_coverage_fork_gauntlet_e2e gauntlet/worktype_fork_gauntlet_covers_all_arms/tiny-2llc
Filter by prefix with -E 'test(/^ktstr/)' or -E 'test(/^gauntlet/)'.
Tip
test(NAME) is a substring match; the exact-match form test(=NAME) matches the full nextest name, prefix included. Use test(=ktstr/sched_basic_proportional), not the bare function name: test(=sched_basic_proportional) matches nothing.
The {kernel} suffix is a sanitized kernel label: kernel_ prefix,
lowercase, non-alphanumeric characters collapsed to _ — 6.16.1
becomes kernel_6_16_1, and a path spec becomes
kernel_path_{basename}_{hash6} (with _dirty appended when the
source tree has uncommitted changes). The 6-character hash
disambiguates two source paths that share a basename.
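For version specs, the sanitization rule can be approximated as below. kernel_label is a hypothetical helper written for this example; it assumes that a run of non-alphanumeric characters collapses to a single underscore, and it does not model the path form with its hash and _dirty suffix.

```rust
/// Sketch of the label rule for version strings: `kernel_` prefix,
/// lowercase, runs of non-alphanumerics collapsed to one `_`.
fn kernel_label(version: &str) -> String {
    let mut out = String::from("kernel_");
    // The prefix already ends in '_', so a leading separator is swallowed.
    let mut last_was_sep = true;
    for ch in version.chars() {
        if ch.is_ascii_alphanumeric() {
            out.push(ch.to_ascii_lowercase());
            last_was_sep = false;
        } else if !last_was_sep {
            out.push('_');
            last_was_sep = true;
        }
    }
    out
}
```

Under these assumptions 6.16.1 maps to kernel_6_16_1, which is the form to expect as the trailing component of a multi-kernel test name.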
RUST_BACKTRACE=1 controls panic backtraces and verbose failure
output, not guest console streaming — see
Reading Failure Output for the
investigation knobs.
Budget-based test selection
Set KTSTR_BUDGET_SECS to select the subset of tests that maximizes
configuration coverage within a time budget — useful for CI pipelines
and quick smoke tests:
KTSTR_BUDGET_SECS=300 cargo ktstr test --kernel ../linux
The selector encodes each test as a bitset of properties (scheduler, topology class, SMT, workload characteristics) and greedily picks the tests with the highest marginal coverage per estimated second, with duration estimates accounting for VM boot overhead by vCPU count. A summary is printed to stderr during budget-mode listing:
ktstr budget: 42/1200 tests, 295/300s used, 38/38 configurations covered
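The greedy step described above (highest marginal coverage per estimated second, within the budget) can be sketched like this. Candidate and select_within_budget are hypothetical names; the property bits and duration estimates are made-up inputs, and the real selector's tie-breaking and stopping rules are not specified here.

```rust
/// One test: a bitset of covered properties plus an estimated duration.
struct Candidate {
    name: &'static str,
    props: u64,
    est_secs: u64,
}

/// Repeatedly pick the candidate with the most newly covered bits per
/// estimated second that still fits in the remaining budget.
fn select_within_budget(candidates: &[Candidate], budget_secs: u64) -> (Vec<&'static str>, u64) {
    let mut covered: u64 = 0;
    let mut used: u64 = 0;
    let mut picked = Vec::new();
    let mut remaining: Vec<&Candidate> = candidates.iter().collect();
    loop {
        let best = remaining
            .iter()
            .enumerate()
            .filter(|(_, c)| used + c.est_secs <= budget_secs)
            .map(|(i, c)| {
                let gain = (c.props & !covered).count_ones() as f64;
                (i, gain / c.est_secs.max(1) as f64)
            })
            // Stop considering tests that add no new coverage.
            .filter(|(_, score)| *score > 0.0)
            .max_by(|a, b| a.1.total_cmp(&b.1));
        let Some((idx, _)) = best else { break };
        let c = remaining.swap_remove(idx);
        covered |= c.props;
        used += c.est_secs;
        picked.push(c.name);
    }
    (picked, used)
}
```

A cheap test covering one property can outrank an expensive test covering three, because the score is coverage per second rather than raw coverage; the expensive test is then re-scored on the properties it still adds.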
Testing your own scheduler
Declare it with declare_scheduler! and reference it from
#[ktstr_test(scheduler = ...)] — see
Scheduler Definitions and
the Test a New Scheduler recipe.
cargo ktstr
cargo ktstr is the host-side command for the whole workflow: it
resolves (and if needed builds and caches) the kernel, then drives
cargo nextest run so every #[ktstr_test] boots its VM against
exactly the kernel you asked for.
Install
cargo install --locked ktstr # installs both `ktstr` and `cargo-ktstr`
The test-fixture binaries (jemalloc probes, schbench/taobench
validators) are behind the non-default integration feature and are
not installed by default. To build from a workspace checkout
instead: cargo build --bin cargo-ktstr.
Task map
| I want to… | Command | Depth |
|---|---|---|
| Run tests on a kernel | cargo ktstr test | this page |
| Re-run last session’s failures | cargo ktstr replay | this page |
| Manage cached kernels | cargo ktstr kernel | this page |
| Sweep schedulers through the BPF verifier | cargo ktstr verifier | Verifier |
| Analyze results, gate regressions | cargo ktstr stats / perf-delta | Runs |
| Narrow CI to affected schedulers | cargo ktstr affected | CI |
| Reproduce a test on bare metal | cargo ktstr export | this page |
| Debug interactively in a VM | cargo ktstr shell | ktstr shell |
Common flags
These flags mean the same thing on every subcommand that takes them.
--kernel ID — one grammar everywhere:
cargo ktstr test --kernel ../linux # local source tree (builds + caches)
cargo ktstr test --kernel 6.14.2 # version (auto-downloads on miss)
cargo ktstr test --kernel 6.14 # major.minor prefix → latest patch release
cargo ktstr test --kernel 6.14.2-tarball-x86_64-kc... # cache key (from `kernel list`)
cargo ktstr test --kernel 6.12..6.14 # range: every stable+longterm release inside
cargo ktstr test --kernel git+https://example.com/r.git#tag=v6.14 # git tag (#branch= / #sha= too)
cargo ktstr test --kernel 6.14.2 --kernel 7.0 # repeatable → multi-kernel matrix
Ranges expand against kernel.org’s releases.json; both endpoints
are series-inclusive (6.11..6.14 covers every 6.14.N; spell
6.14.2 for an exact bound), and EOL series silently drop out
unless you pass --include-eol. Git sources are fetched at the ref,
built, and cached; a moved branch tip rebuilds.
When --kernel resolves to two or more kernels, the kernel becomes
another gauntlet dimension: each (test × preset × kernel) tuple is a
distinct nextest case with a kernel-label suffix (see
test name shapes), and
result sidecars partition per kernel under
target/ktstr/{kernel}-{project_commit}/. Kernel resolution
finishes for every requested kernel before any test runs — a
missing kernel aborts up front rather than mid-matrix. host_only
tests run once regardless of kernel count.
--no-perf-mode — disable all
performance mode features
(flock, pinning, RT scheduling, hugepages, NUMA mbind, KVM exit
suppression). Also via KTSTR_NO_PERF_MODE.
--no-skip-mode — convert resource-contention and
host-topology-insufficient skips into hard failures (exit 1 instead
of 0). The default skips so a contended runner does not fail tests
that simply could not start.
The three profiles — independent of each other:
- --release: the harness build profile. Release mode applies stricter assertion thresholds (gap_threshold_ms 2000 vs debug’s 3000, spread_threshold_pct 15% vs 35%), so tests that barely pass in debug may fail under --release.
- --profile NAME: the cargo build profile for the scheduler-under-test (a discovered scheduler package). Defaults to release; pass --profile dev for a fast unoptimized scheduler build.
- --nextest-profile NAME: the nextest test profile from .config/nextest.toml (retries, timeouts, output settings).
--relevant / --base / --base-ref / --default-branch —
narrow the run to only the tests whose scheduler your working-tree
change touches, against a baseline commit (merge-base with main by
default). A broad or unattributable change fails safe and runs
everything; a docs-only change runs nothing. The CI-matrix
counterpart is affected; both are documented in
CI.
test
Build the kernel (if needed) and run tests via cargo nextest run.
Also available as cargo ktstr nextest (a clap alias). Arguments
after -- are passed through to nextest:
cargo ktstr test --kernel ../linux # everything
cargo ktstr test --kernel 7.0 -- -E 'test(my_test)' # nextest filter
cargo ktstr test --kernel 7.0 -- --retries 2 # nextest retries
cargo ktstr test --kernel 7.0 -- --features integration # cargo features
cargo ktstr test --relevant # only tests my edits affect
A real single-test run (the 7.0 series was already built and
cached, so resolution maps 7.0 to the latest patch release and
reuses the cached image):
$ cargo ktstr test --kernel 7.0 -- --features integration -E 'test(=ktstr/failure_dump_renders_bss_fields)'
cargo ktstr: fetching latest 7.0.x kernel version
cargo ktstr: latest 7.0.x kernel: 7.0.14
cargo ktstr: resolved kernel "7.0"
...
Finished `test` profile [unoptimized + debuginfo] target(s) in 0.22s
────────────
Nextest run ID 98581174-246f-4824-a170-50992df166d7 with nextest profile: default
Starting 1 test across 121 binaries (12531 tests skipped)
PASS [ 34.459s] (1/1) ktstr::failure_dump_e2e ktstr/failure_dump_renders_bss_fields
────────────
Summary [ 34.498s] 1 test run: 1 passed, 12531 skipped
cargo ktstr: test outputs
~/ktstr/target/ktstr/7.0.14-73730e0-dirty
(1 stats sidecar(s), 0 wprof trace(s) written this run)
An immediate re-run of the same command took the same ~34 s — with
the kernel cached, wall time is the test itself, not
infrastructure. For a --kernel <path> source tree, a cache hit is
announced on stderr as
cargo ktstr: cache hit for {path} ({cache_key}, built {age} ago)
and skips the build entirely.
Per-test exit codes
Each #[ktstr_test] process exits with one of three codes, so CI
gates and dashboards can triage runs:
| Code | Verdict | Meaning |
|---|---|---|
| 0 | Pass / Skip | Assertions passed, or the test never ran (host too small, resource contention, perf mode unavailable). Skips degrade to pass unless --no-skip-mode. |
| 1 | Fail | An assertion failed; an operator --cpu-cap the host cannot satisfy; a skip under --no-skip-mode; or expect_err = true and the test passed. |
| 2 | Inconclusive | A zero-denominator ratio gate could not evaluate; the workload produced no signal to ratio against. |
Exit code 2 is the silent-pass guard. Without it, a ≤ threshold gate
evaluated against a 0 / 0 ratio synthesized to 0.0 would pass and
ship a false-green CI run. The harness records Inconclusive
instead (see Checking) and
projects it to a distinct exit code so external tooling can route
the run separately from real regressions. The values are exported as
EXIT_PASS / EXIT_FAIL / EXIT_INCONCLUSIVE in the prelude.
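The table can be read as a small mapping from verdict to exit code. The sketch below uses local stand-ins for the prelude constants and a hypothetical Verdict enum and ratio_gate function; it illustrates the zero-denominator rule and is not the harness's own code.

```rust
// Local stand-ins mirroring the table; the real values are exported by the
// prelude as EXIT_PASS / EXIT_FAIL / EXIT_INCONCLUSIVE.
const EXIT_PASS: i32 = 0;
const EXIT_FAIL: i32 = 1;
const EXIT_INCONCLUSIVE: i32 = 2;

#[derive(Debug, PartialEq, Clone, Copy)]
enum Verdict {
    Pass,
    Skip,
    Fail,
    Inconclusive,
}

/// Hypothetical "ratio <= max_ratio" gate that refuses to score x / 0.
fn ratio_gate(numerator: u64, denominator: u64, max_ratio: f64) -> Verdict {
    if denominator == 0 {
        return Verdict::Inconclusive; // no signal to ratio against
    }
    let ratio = numerator as f64 / denominator as f64;
    if ratio <= max_ratio {
        Verdict::Pass
    } else {
        Verdict::Fail
    }
}

/// Project a verdict to a process exit code.
fn exit_code(verdict: Verdict, no_skip_mode: bool) -> i32 {
    match verdict {
        Verdict::Pass => EXIT_PASS,
        Verdict::Skip if no_skip_mode => EXIT_FAIL,
        Verdict::Skip => EXIT_PASS,
        Verdict::Fail => EXIT_FAIL,
        Verdict::Inconclusive => EXIT_INCONCLUSIVE,
    }
}
```

A CI wrapper can then treat 2 as "rerun or investigate the workload" instead of counting it as a regression.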
replay
Re-run only the tests that failed in the last session, from the
result sidecars under target/ktstr/:
cargo ktstr replay # print the nextest filter (dry-run)
cargo ktstr replay --exec # actually run it
cargo ktstr replay -E starve # narrow by test-name substring
cargo ktstr replay --dir PATH # source sidecars from an archived tree
Dry-run is the default: the filter prints to stdout so you can
inspect it (or paste it into CI) before committing to the re-run.
--profile / --nextest-profile apply with --exec. Distinct from
auto-repro, which fires inside the failing test
process; replay is post-hoc, across a whole session.
coverage
Run tests with coverage via cargo llvm-cov nextest. Same kernel
resolution, multi-kernel semantics, and common flags as test;
arguments after -- pass through to cargo llvm-cov nextest:
cargo ktstr coverage --kernel ../linux
cargo ktstr coverage -- --workspace --lcov --output-path lcov.info
Requires cargo-llvm-cov and the llvm-tools-preview rustup
component. Guest-side coverage is flushed over shared memory at VM
exit and merged into the report automatically; multi-kernel runs
merge every variant’s profraw into a single report.
Profraw files accumulate across runs — including host-side files
that plain cargo ktstr test writes next to the cargo-ktstr binary
(an LLVM_PROFILE_FILE injection that keeps default.profraw out
of your kernel source tree; export LLVM_PROFILE_FILE yourself to
opt out). To clean up:
cargo ktstr llvm-cov clean --profraw-only # only *.profraw under target/llvm-cov-target/
rm -f target/debug/llvm-cov-target/default-*.profraw
rm -f target/release/llvm-cov-target/default-*.profraw
Avoid bare cargo ktstr llvm-cov clean (wipes reports too) and
--workspace (also runs cargo clean).
llvm-cov
Raw passthrough to cargo llvm-cov for subcommands that don’t fit
the coverage flow (report, clean, show-env), with the same
kernel-resolution plumbing:
cargo ktstr llvm-cov report --lcov --output-path lcov.info
Always pass a subcommand: a bare cargo ktstr llvm-cov falls
through to cargo test, which skips gauntlet variants and verifier
cells entirely (they exist only under the nextest harness).
kernel
Manage cached kernel images: list, build, clean. The
standalone ktstr kernel subcommands are identical.
How kernels are discovered
There are two flows, and they intentionally differ:
Explicit (--kernel ...) — the full pipeline. For a source-tree
path: validate it is a kernel tree, look up the cache (clean git
trees only — a cache hit short-circuits straight to tests),
auto-configure (make defconfig if needed, append ktstr’s kconfig
fragment, make olddefconfig), build with
make -j$(nproc) KCFLAGS=-Wno-error, validate that the critical
config options survived (CONFIG_SCHED_CLASS_EXT,
CONFIG_DEBUG_INFO_BTF, CONFIG_BPF_SYSCALL, tracing options —
the kernel build system silently drops options with unmet
dependencies, and each failure gets a remediation hint), generate
compile_commands.json, store the result in the cache, and run
nextest with KTSTR_KERNEL (and KTSTR_KERNEL_LIST for
multi-kernel) exported. Dirty or non-git trees build in place, get a
_dirty kernel label, and never cache. Version, range, and git
identifiers download/fetch + build + cache on miss.
Implicit (no --kernel) — discovery only, no builds. The test
framework checks KTSTR_TEST_KERNEL, then KTSTR_KERNEL, then the
newest valid cache entry, then local build trees (./linux,
../linux, /lib/modules/$(uname -r)/build), then host images
(/boot/vmlinuz*). Whatever pre-built image it finds is used as-is
— nothing is compiled and nothing new lands in the cache.
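The implicit search order can be written down as an ordered list of probes. This sketch only models the order described above; the Probe type and discovery_order are hypothetical names, and validation of each candidate (is the cache entry valid, does the tree hold a built image) is left out.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum Probe {
    Env(&'static str, String), // variable name and its value
    NewestCacheEntry,
    BuildTree(String),
    HostImages(&'static str),
}

/// Probes in the order implicit discovery is described to try them.
/// `env` stands in for the process environment so the sketch is testable.
fn discovery_order(env: &HashMap<&str, &str>, uname_r: &str) -> Vec<Probe> {
    let mut order = Vec::new();
    for var in ["KTSTR_TEST_KERNEL", "KTSTR_KERNEL"] {
        if let Some(value) = env.get(var) {
            order.push(Probe::Env(var, value.to_string()));
        }
    }
    order.push(Probe::NewestCacheEntry);
    let trees = [
        "./linux".to_string(),
        "../linux".to_string(),
        format!("/lib/modules/{uname_r}/build"),
    ];
    for tree in trees {
        order.push(Probe::BuildTree(tree));
    }
    order.push(Probe::HostImages("/boot/vmlinuz*"));
    order
}
```

The first probe that yields a usable pre-built image wins; nothing in this flow triggers a build.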
Tip
If you iterate on a local kernel tree, pass
--kernel ../linux (or run cargo ktstr kernel build --kernel ../linux once) so the build is cached and reused. Implicit discovery will find the tree but never populate the cache for it.
kernel list
List cached kernels, newest first:
$ cargo ktstr kernel list
cache: ~/.cache/ktstr/kernels
KEY VERSION SOURCE ARCH BUILT
7.0.14-tarball-x86_64-kcabd40422 7.0.14 tarball x86_64 2026-07-04T23:06:12Z
local-b4dc42d-x86_64-kcabd40422 7.1.0 local x86_64 2026-07-04T23:00:04Z
local-5123e5a-x86_64-cfg1982ad42-kcabd40422 7.1.0 local x86_64 2026-07-04T22:20:13Z
...
local-c5d2724-x86_64-kcabd40422 7.0.0 local x86_64 2026-07-04T22:05:35Z
...
7.0.13-tarball-x86_64-kc5f2631f0 7.0.13 tarball x86_64 2026-06-22T21:54:56Z (stale kconfig)
7.1.1-tarball-x86_64-kc5f2631f0 7.1.1 tarball x86_64 2026-06-22T21:46:19Z (stale kconfig)
warning: entries marked (stale kconfig) were built against a different ktstr.kconfig. Rebuild with: kernel build --force --kernel <entry version> (add --extra-kconfig PATH if the entry also carries the (extra kconfig) tag).
(stale kconfig) marks entries built against a different ktstr
kconfig fragment (they are rebuilt automatically when requested);
(EOL) marks entries whose series has left kernel.org’s active
releases list. --json emits the same data for scripting. With
--kernel START..END, list switches to preview mode: it prints
the versions the range expands to without downloading or building
anything — the cheap answer to “what does 6.12..6.16 actually
cover?”.
kernel build
Download (or use a source tree / git ref), build, and cache one kernel or a range:
cargo ktstr kernel build # latest stable from kernel.org
cargo ktstr kernel build --kernel 6.14.2 # specific version
cargo ktstr kernel build --kernel 6.12 # latest 6.12.x patch release
cargo ktstr kernel build --kernel 6.11..6.14 # every release in the range
cargo ktstr kernel build --kernel ../linux # local source tree
cargo ktstr kernel build --force --kernel 6.14.2
With no --kernel, it builds the latest stable series that has had
at least 8 maintenance releases — keeping CI off brand-new majors
whose early builds are more likely to break. Already-cached entries
are skipped unless --force.
| Flag | Description |
|---|---|
| --force | Rebuild even if cached. |
| --clean | make mrproper first (source-tree builds only). |
| --cpu-cap N | Reserve exactly N host CPUs for the build (build parallelism and a cgroup sandbox follow). See Resource Budget. |
| --extra-kconfig PATH | Extra kconfig fragment merged over ktstr’s (user values win); lands in its own cache slot. |
| --skip-sha256 | Skip tarball checksum verification (emits a warning). |
| --include-eol | With a range, also build EOL series from the linux-stable mirror. |
A bare relative name is read as a cache key — prefix a relative
source directory with ./.
kernel clean
cargo ktstr kernel clean --keep 3 # keep the 3 most recent valid entries
cargo ktstr kernel clean --corrupt-only --force # drop only broken entries
Corrupt entries (missing metadata or image) never consume a --keep
slot. --force skips the confirmation prompt (required
non-interactively).
verifier
Sweep every declare_scheduler!-registered scheduler through the
real kernel’s BPF verifier across topologies, checking verify +
attach + dispatch per cell:
cargo ktstr verifier --kernel 7.0
cargo ktstr verifier --scheduler scx-ktstr # one scheduler
cargo ktstr verifier --raw # no cycle collapse
See BPF Verifier Sweep for the cell model, real output, and the kernels-filter contract.
shell
Shares the VM boot flow and flags with
ktstr shell, plus two additions: --kernel
also accepts raw image files (bzImage, Image), and --test NAME
derives topology, memory, and include files from a registered
#[ktstr_test] (mutually exclusive with --topology /
--memory-mib; -i is additive):
cargo ktstr shell --test my_failing_test
cargo ktstr shell --kernel ./arch/x86/boot/bzImage
export
Export a registered test as a self-extracting .run file that
reproduces the scenario on bare metal, no VM: the ktstr binary, the
scheduler binary, and every declared include file, embedded in a
bash preamble that validates root, sched_ext support, cgroup2, the
no-other-scheduler-attached invariant, and topology compatibility
before launching.
cargo ktstr export sched_basic_proportional -o /tmp/sched_basic_proportional.run
wrote /tmp/sched_basic_proportional.run (90074903 bytes archive, 0 include files)
The generated script opens with the frozen test specification and the preflight checks:
#!/bin/bash
# Generated by `cargo ktstr export`. Do not edit; regenerate to update.
set -euo pipefail
# --- frozen test specification ---
KTSTR_TEST_NAME=sched_basic_proportional
KTSTR_SCHED_NAME=ktstr_sched
KTSTR_GIT_HASH=73730e0
NEED_LLCS=1
NEED_CORES_PER_LLC=2
NEED_THREADS_PER_CORE=1
NEED_NUMA_NODES=1
TEST_DURATION_SECS=12
TEST_WATCHDOG_SECS=15
Scheduler choice, scheduler args, and topology are frozen at export
time; --duration, --watchdog-timeout, and --quiet can be
overridden at .run invocation. --package NAME scopes the
workspace search (and disambiguates duplicate test names —
otherwise the first matching binary in path order wins);
--release builds the embedded binaries with the release profile to
match how you will run them.
Not exportable (rejected with actionable errors): host_only tests,
tests using bpf_map_write (they need the host-side probe surface),
and KernelBuiltin schedulers.
completions
cargo ktstr completions bash >> ~/.local/share/bash-completion/completions/cargo
cargo ktstr completions zsh > ~/.zfunc/_cargo-ktstr
cargo ktstr completions fish > ~/.config/fish/completions/cargo-ktstr.fish
SHELL is one of bash, zsh, fish, elvish, powershell;
--binary overrides the completion target (default cargo).
stats
Sidecar analysis for past runs: stats (analysis of the newest
run), stats list (run table), stats list-metrics (the regression
metric registry), stats list-values (distinct filter values in the
pool), stats show-host --run ID (archived host context), and
stats explain-sidecar --run ID (why optional fields are absent).
cargo ktstr show-host prints the same host context live, and
cargo ktstr show-thresholds TEST prints the merged assertion
thresholds a test will run with. See
Runs and Regression Gates.
perf-delta
Compare performance_mode test metrics between HEAD and a baseline
commit, exiting non-zero when enough metrics regress to trip the
gate:
cargo ktstr perf-delta --noise-adjust 5 --kernel 6.14
Baseline resolution, the --noise-adjust statistics, and the full
flag set are documented in
Runs and Regression Gates; the worked
walkthrough is A/B Compare Branches.
affected
Emit the scheduler packages a base..HEAD diff affects, as a JSON
array for a GitHub Actions dynamic matrix
(["scx_lavd","scx_rusty"]). Attribution is fail-safe: any
uncertainty widens to the full set, and only a strictly docs-only
change emits []. --relevant is the same engine applied locally
to narrow test / coverage / perf-delta runs. Both are
documented with the complete CI workflow in CI.
locks
Enumerate every ktstr flock held on this host, read-only, naming
holder PIDs and cmdlines — the troubleshooting companion when a run
is stalled behind a peer’s reservation. Identical to
ktstr locks, where the lock roots and real
output are documented.
Gauntlet
Some scheduler bugs only exist on topologies you don’t develop on: a
per-LLC work-splitting heuristic that breaks on an odd LLC count, an
idle-core picker that lands on both SMT siblings, a migration policy that
never crosses a NUMA boundary. The gauntlet expands every
#[ktstr_test] into one variant per topology preset — up to 24
presets (14 on aarch64) — so those bugs surface as a named, re-runnable
test case instead of a production report.
Gauntlet variants are prefixed gauntlet/ and ignored by default:
# Run only base tests (default)
cargo ktstr test --kernel ../linux
# Run only gauntlet variants
cargo ktstr test --kernel ../linux -- --run-ignored ignored-only -E 'test(gauntlet/)'
# Run everything
cargo ktstr test --kernel ../linux -- --run-ignored all
# Run a single variant
cargo ktstr test --kernel ../linux -- --run-ignored ignored-only \
-E 'test(=gauntlet/my_test/smt-2llc)'
This is what the expansion looks like when nextest lists a test with
min_llcs = 1 and default constraints on this host:
ktstr::worktype_coverage_fork_gauntlet_e2e gauntlet/worktype_fork_gauntlet_covers_all_arms/medium-4llc
ktstr::worktype_coverage_fork_gauntlet_e2e gauntlet/worktype_fork_gauntlet_covers_all_arms/medium-4llc-nosmt
ktstr::worktype_coverage_fork_gauntlet_e2e gauntlet/worktype_fork_gauntlet_covers_all_arms/odd-3llc
ktstr::worktype_coverage_fork_gauntlet_e2e gauntlet/worktype_fork_gauntlet_covers_all_arms/smt-2llc
ktstr::worktype_coverage_fork_gauntlet_e2e gauntlet/worktype_fork_gauntlet_covers_all_arms/smt-3llc
ktstr::worktype_coverage_fork_gauntlet_e2e gauntlet/worktype_fork_gauntlet_covers_all_arms/tiny-1llc
ktstr::worktype_coverage_fork_gauntlet_e2e gauntlet/worktype_fork_gauntlet_covers_all_arms/tiny-2llc
Under a multi-kernel run a {kernel_label} segment is appended —
gauntlet/{name}/{preset}/{kernel}. See
Test names and variants for
the label format.
Topology presets
Note
Multi-NUMA and scale-boundary presets are opt-in. The default constraints (max_numa_nodes = 1, max_llcs = 12, max_cpus = 192) exclude the five numa* presets plus near-max-llc, max-cpu, and their -nosmt variants, so 15 of the 24 presets are active by default. Raise max_numa_nodes, max_llcs, or max_cpus on the test to opt in.
| Preset | Topology | CPUs | LLCs | NUMA | Description |
|---|---|---|---|---|---|
| tiny-1llc | 1n1l4c1t | 4 | 1 | 1 | Single LLC |
| tiny-2llc | 1n2l2c1t | 4 | 2 | 1 | Minimal multi-LLC |
| odd-3llc | 1n3l3c1t | 9 | 3 | 1 | Odd CPU count |
| odd-5llc | 1n5l3c1t | 15 | 5 | 1 | Prime LLC count |
| odd-7llc | 1n7l2c1t | 14 | 7 | 1 | Prime LLC count |
| smt-2llc | 1n2l2c2t | 8 | 2 | 1 | SMT enabled |
| smt-3llc | 1n3l2c2t | 12 | 3 | 1 | SMT, 3 LLCs |
| medium-4llc | 1n4l4c2t | 32 | 4 | 1 | Medium topology |
| medium-8llc | 1n8l4c2t | 64 | 8 | 1 | Medium, many LLCs |
| large-4llc | 1n4l16c2t | 128 | 4 | 1 | Large, few LLCs |
| large-8llc | 1n8l8c2t | 128 | 8 | 1 | Large, many LLCs |
| near-max-llc | 1n15l8c2t | 240 | 15 | 1 | Near maximum |
| max-cpu | 1n14l9c2t | 252 | 14 | 1 | Near KVM vCPU limit |
| medium-4llc-nosmt | 1n4l8c1t | 32 | 4 | 1 | Medium, no SMT |
| medium-8llc-nosmt | 1n8l8c1t | 64 | 8 | 1 | Medium, many LLCs, no SMT |
| large-4llc-nosmt | 1n4l32c1t | 128 | 4 | 1 | Large, no SMT |
| large-8llc-nosmt | 1n8l16c1t | 128 | 8 | 1 | Large, many LLCs, no SMT |
| near-max-llc-nosmt | 1n15l16c1t | 240 | 15 | 1 | Near maximum, no SMT |
| max-cpu-nosmt | 1n14l18c1t | 252 | 14 | 1 | Near KVM vCPU limit, no SMT |
| numa2-4llc | 2n4l4c1t | 16 | 4 | 2 | Multi-NUMA, 2 nodes |
| numa2-8llc | 2n8l8c2t | 128 | 8 | 2 | Multi-NUMA, 2 nodes, SMT |
| numa2-8llc-nosmt | 2n8l16c1t | 128 | 8 | 2 | Multi-NUMA, 2 nodes, no SMT |
| numa4-8llc | 4n8l4c1t | 32 | 8 | 4 | Multi-NUMA, 4 nodes |
| numa4-12llc | 4n12l8c2t | 192 | 12 | 4 | Multi-NUMA, 4 nodes, SMT |
Topology format: {numa_nodes}n{llcs}l{cores_per_llc}c{threads_per_core}t
— 1n2l4c2t is 1 NUMA node, 2 LLCs, 4 cores per LLC, 2 threads per
core = 16 CPUs. Note that llcs is the total across the machine, not
per node.
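To make the format concrete, here is a small parser sketch for the topology string. The Topology struct and parse_topology are hypothetical names for illustration, not ktstr types. Because llcs is a machine-wide total, the CPU count does not multiply by the NUMA node count.

```rust
#[derive(Debug, PartialEq)]
struct Topology {
    numa_nodes: u32,
    llcs: u32, // total across the machine, not per node
    cores_per_llc: u32,
    threads_per_core: u32,
}

impl Topology {
    fn total_cpus(&self) -> u32 {
        self.llcs * self.cores_per_llc * self.threads_per_core
    }
}

/// Parse "{numa_nodes}n{llcs}l{cores_per_llc}c{threads_per_core}t".
fn parse_topology(s: &str) -> Option<Topology> {
    let mut rest = s;
    let mut fields = [0u32; 4];
    for (slot, suffix) in fields.iter_mut().zip(['n', 'l', 'c', 't']) {
        let end = rest.find(suffix)?;
        *slot = rest[..end].parse().ok()?;
        rest = &rest[end + 1..];
    }
    if !rest.is_empty() {
        return None; // trailing garbage after the `t` field
    }
    Some(Topology {
        numa_nodes: fields[0],
        llcs: fields[1],
        cores_per_llc: fields[2],
        threads_per_core: fields[3],
    })
}
```

Parsing 1n2l4c2t with this sketch gives 2 × 4 × 2 = 16 CPUs, and 4n12l8c2t gives 192, in line with the preset table.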
aarch64: ARM64 CPUs do not have SMT. Presets with
threads_per_core > 1 are excluded on aarch64, leaving 14 presets
(the 5 small presets, 6 -nosmt variants, and 3 non-SMT NUMA
presets).
Constraint filtering
#[ktstr_test] topology constraints filter which presets a test runs
on. A preset is skipped when any constraint is not met:
- num_numa_nodes() < min_numa_nodes
- max_numa_nodes is set and num_numa_nodes() > max_numa_nodes
- num_llcs() < min_llcs
- max_llcs is set and num_llcs() > max_llcs
- requires_smt and threads_per_core < 2
- total_cpus() < min_cpus
- max_cpus is set and total_cpus() > max_cpus
See The #[ktstr_test] Attribute for the attribute table.
Authoring gauntlet-ready tests
Worked example
A test with min_llcs = 2, requires_smt = true, and default
max_numa_nodes = 1 against the preset table above:
- tiny-1llc (1 LLC): excluded, below min_llcs
- All non-SMT presets (tiny-2llc, odd-*, *-nosmt): excluded by requires_smt
- near-max-llc (15 LLCs): excluded, above default max_llcs = 12
- max-cpu (252 CPUs, 14 LLCs): excluded, above default max_cpus = 192 (also above default max_llcs = 12)
- All numa* presets: excluded, above default max_numa_nodes = 1
Result: 6 of 24 presets survive (smt-2llc, smt-3llc,
medium-4llc, medium-8llc, large-4llc, large-8llc). On
aarch64, none survive — all aarch64 presets lack SMT.
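The constraint rules and this worked example can be checked mechanically. The sketch below copies the preset table into a constant and applies the seven skip conditions; the Constraints struct, its zero defaults for the min_* fields, and the function names are assumptions for illustration, not the attribute's real representation.

```rust
/// (name, numa_nodes, llcs, cores_per_llc, threads_per_core), copied from the preset table.
const PRESETS: [(&str, u32, u32, u32, u32); 24] = [
    ("tiny-1llc", 1, 1, 4, 1), ("tiny-2llc", 1, 2, 2, 1), ("odd-3llc", 1, 3, 3, 1),
    ("odd-5llc", 1, 5, 3, 1), ("odd-7llc", 1, 7, 2, 1), ("smt-2llc", 1, 2, 2, 2),
    ("smt-3llc", 1, 3, 2, 2), ("medium-4llc", 1, 4, 4, 2), ("medium-8llc", 1, 8, 4, 2),
    ("large-4llc", 1, 4, 16, 2), ("large-8llc", 1, 8, 8, 2), ("near-max-llc", 1, 15, 8, 2),
    ("max-cpu", 1, 14, 9, 2), ("medium-4llc-nosmt", 1, 4, 8, 1), ("medium-8llc-nosmt", 1, 8, 8, 1),
    ("large-4llc-nosmt", 1, 4, 32, 1), ("large-8llc-nosmt", 1, 8, 16, 1),
    ("near-max-llc-nosmt", 1, 15, 16, 1), ("max-cpu-nosmt", 1, 14, 18, 1),
    ("numa2-4llc", 2, 4, 4, 1), ("numa2-8llc", 2, 8, 8, 2), ("numa2-8llc-nosmt", 2, 8, 16, 1),
    ("numa4-8llc", 4, 8, 4, 1), ("numa4-12llc", 4, 12, 8, 2),
];

struct Constraints {
    min_numa_nodes: u32,
    max_numa_nodes: Option<u32>,
    min_llcs: u32,
    max_llcs: Option<u32>,
    requires_smt: bool,
    min_cpus: u32,
    max_cpus: Option<u32>,
}

impl Constraints {
    /// Documented default maxima; the zero minima are an assumption.
    fn defaults() -> Self {
        Constraints {
            min_numa_nodes: 0, max_numa_nodes: Some(1),
            min_llcs: 0, max_llcs: Some(12),
            requires_smt: false,
            min_cpus: 0, max_cpus: Some(192),
        }
    }
}

/// A preset is accepted only when none of the skip conditions holds.
fn accepts(c: &Constraints, p: &(&str, u32, u32, u32, u32)) -> bool {
    let (_, numa, llcs, cores, threads) = *p;
    let cpus = llcs * cores * threads;
    let over = |max: Option<u32>, v: u32| max.map_or(false, |m| v > m);
    !(numa < c.min_numa_nodes
        || over(c.max_numa_nodes, numa)
        || llcs < c.min_llcs
        || over(c.max_llcs, llcs)
        || (c.requires_smt && threads < 2)
        || cpus < c.min_cpus
        || over(c.max_cpus, cpus))
}

fn surviving(c: &Constraints) -> Vec<&'static str> {
    PRESETS.iter().filter(|p| accepts(c, p)).map(|p| p.0).collect()
}
```

With min_llcs = 2 and requires_smt = true on top of the defaults, surviving returns the same six presets as the worked example; with the defaults alone it returns 15.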
Variant count
The total number of gauntlet variants for a test is
valid_presets × resolved_kernels: the 6 surviving presets above
produce 6 variants under a single kernel and 12 under
--kernel A --kernel B.
Tests that skip gauntlet
Entries with host_only = true never produce gauntlet variants —
they run on the host without booting a VM, so topology variation
carries no signal. Tests whose names start with demo_ are ignored
by default, gauntlet variants included.
Operator notes
- Wall time. Each variant boots its own VM and runs the full scenario, so a sweep costs roughly (surviving presets × the per-run wall time you observe for the base test). nextest runs variants in parallel within your host’s budget. For a coverage-per-second subset under a deadline, use budget-based selection.
- Memory. Each gauntlet VM gets max(cpus × 64 MiB, 256 MiB, entry.memory_mib) of guest RAM (plus an initramfs-derived floor). For the 252-CPU max-cpu presets that is at least 16128 MiB; the host needs that much free memory to run the variant.
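The memory rule is simple enough to compute ahead of a sweep. A sketch, with a hypothetical function name and with the initramfs-derived floor left out because its size is not given here:

```rust
/// Guest RAM in MiB per the formula above (initramfs floor not modeled).
fn guest_memory_mib(cpus: u64, entry_memory_mib: u64) -> u64 {
    (cpus * 64).max(256).max(entry_memory_mib)
}
```

For a 252-CPU preset with no per-test override this yields 16128 MiB, and for a 2-CPU guest the 256 MiB floor applies.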
BPF Verifier Sweep
A scheduler that loads on your machine can still be rejected — or
attach and then wedge — on a topology you never booted.
verified_insns varies with topology whenever topology-derived
config like nr_cpus is baked into .rodata: the verifier sees
different known constants, walks different branches, and can reach a
different verdict. The verifier sweep boots every declared scheduler
in a KVM VM across a range of topologies and checks three things
against the real kernel: the BPF programs verify, the scheduler
attaches as the active sched_ext scheduler, and it dispatches an
injected workload.
The verifier that runs is the real verifier in the real target
kernel — no host-side BPF loading, no version skew. And there is no
subprocess to bpftool or veristat: the host reads per-program
verified_insns directly from guest memory via bpf_prog_aux
introspection, and applies cycle collapse to verifier logs instead of
truncating them.
Quick start
# Every declared scheduler, kernel discovered via KTSTR_KERNEL / cache
cargo ktstr verifier
# Pin the kernel
cargo ktstr verifier --kernel ../linux
# Sweep across kernels (each cell runs against its own)
cargo ktstr verifier --kernel 6.14.2 --kernel 7.0
# One scheduler across topologies
cargo ktstr verifier --scheduler scx-ktstr
# Raw verifier log, no cycle collapse
cargo ktstr verifier --raw
See cargo-ktstr verifier for the flag list.
A healthy sweep
Four small cells, one scheduler, one kernel — each cell boots its own VM, loads the scheduler, and confirms attach + dispatch:
cargo ktstr: resolved kernel "7.0"
cargo ktstr verifier: dispatching to nextest (verifier/ cells only) on 1 resolved kernel(s)
forwarding to nextest: --test kaslr_axis_e2e tiny-1llc tiny-2llc odd-3llc smt-2llc
...
Starting 4 tests across 1 binary (55 tests skipped)
PASS [ 12.406s] (1/4) ktstr::kaslr_axis_e2e verifier/ktstr_sched/kernel_7_0/odd-3llc
PASS [ 12.432s] (2/4) ktstr::kaslr_axis_e2e verifier/ktstr_sched/kernel_7_0/smt-2llc
PASS [ 12.656s] (3/4) ktstr::kaslr_axis_e2e verifier/ktstr_sched/kernel_7_0/tiny-1llc
PASS [ 12.929s] (4/4) ktstr::kaslr_axis_e2e verifier/ktstr_sched/kernel_7_0/tiny-2llc
────────────
Summary [ 12.929s] 4 tests run: 4 passed, 55 skipped
verifier verified_insns (per scheduler; rows: kernel, cols: BPF program, cell: range across topologies):
ktstr_sched:
kernel ktstr_dispatch ktstr_dump ktstr_dump_cpu ktstr_dump_task ktstr_enqueue ktstr_exit ktstr_exit_task ktstr_init ktstr_init_task ktstr_select_cp ktstr_yield
kernel_7_0 102 81 13 70 74 25 419 2296 29077 39 8
verifier summary: 4 ✅ 0 ❌ 0 🇽
topology ktstr_sched
odd-3llc ✅
smt-2llc ✅
tiny-1llc ✅
tiny-2llc ✅
A cell in the verified_insns table shows a single number when the
count is flat across topologies, lo..hi when it varies, and -
when that program reported no stats on that kernel. In the grid, ✅
means the scheduler verified, attached, and dispatched on every
kernel that ran the cell; ❌ means it failed on every kernel; 🇽 means
mixed results across kernels (the 🇽 glyph renders inconsistently in
some terminal fonts — the failing-combinations list below the grid is
the authoritative record). This 4-cell sweep ran its VMs in parallel
and finished in about 13 seconds of test time.
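The rendering rules for both tables can be sketched as two small functions. The names are hypothetical and this is not the tool's own formatter; grid_glyph assumes at least one kernel ran the cell.

```rust
/// verified_insns cell: one number when flat, lo..hi when it varies,
/// and "-" when the program reported no stats.
fn render_insns_cell(counts: &[u64]) -> String {
    match (counts.iter().min(), counts.iter().max()) {
        (Some(lo), Some(hi)) if lo == hi => lo.to_string(),
        (Some(lo), Some(hi)) => format!("{lo}..{hi}"),
        _ => "-".to_string(),
    }
}

/// Grid glyph from per-kernel results for one (topology, scheduler) cell.
/// Assumption: the slice is non-empty.
fn grid_glyph(passed_per_kernel: &[bool]) -> &'static str {
    if passed_per_kernel.iter().all(|&p| p) {
        "✅"
    } else if passed_per_kernel.iter().all(|&p| !p) {
        "❌"
    } else {
        "🇽"
    }
}
```

A program whose count is 102 on every topology renders as 102, while one that ranges from 2296 to 2310 would render as 2296..2310 (the second figure is a made-up illustration, not measured output).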
What a cell checks
- Verify: inside the VM the scheduler loads its BPF programs; the target kernel’s verifier runs against them. The host reads per-program verified_insns from bpf_prog_aux via guest memory introspection. On load failure, libbpf’s verifier log is forwarded to the host.
- Attach (positive confirmation): the guest confirms the scheduler process survived load and /sys/kernel/sched_ext/state reached enabled. The kernel sets enabled only after ops.init, per-task init, and switching eligible tasks to the sched_ext class, so this proves the scheduler is scheduling, not merely that its BPF loaded. Attach is confirmed only when the guest reaches its post-attach dispatch phase; a guest that vanishes early (e.g. a panic before any frame is emitted) fails rather than passing by default.
- Dispatch probe: the verifier VM has no #[ktstr_test] body, so it injects a SpinWait workload sized to the guest’s online CPUs, running as SCHED_EXT. A cell passes only when a worker makes forward progress after attach: a scheduler that attaches but never dispatches a runnable task is a distinct, worse failure the attach gate alone cannot catch.
Every cell boots with performance mode disabled
(no_perf_mode) — verified_insns
is perf-mode-independent, so cells share LLC reservations instead of
serializing on them.
A real rejection
The fixture scheduler ships rejection knobs (see
fixture knobs) precisely so this path stays
exercised. Here --verify-loop plants an unrolled loop ending in a
store through a null pointer — the verifier walks the loop, then
rejects the store. Note the collapse markers: the loop body is shown
once, not eight times:
=== ktstr_broken | kernel kernel_7_0 | topology tiny-1llc ===
verifier scheduler: NOT ATTACHED — scheduler process exited during BPF load/startup
verifier --- verifier stats ---
processed=186 states=7/7
verifier --- scheduler log ---
Global function ktstr_dispatch() doesn't return scalar. Only those are supported.
0: R1=ctx() R10=fp0
; if (crash) @ main.bpf.c:423
0: (18) r1 = 0xff5d3bb3000f60dc ; R1=map_value(map=bpf_bpf.bss,ks=4,vs=280,off=220)
...
; volatile u32 acc = 0; @ main.bpf.c:450
37: (63) *(u32 *)(r10 -8) = r1 ; R1=0 R10=fp0 fp-8=mmmm0
--- 8x of the following 25 lines ---
; u64 t = bpf_ktime_get_ns(); @ main.bpf.c:453
38: (85) call bpf_ktime_get_ns#5 ; R0=scalar()
; acc += (u32)t; @ main.bpf.c:454
39: (61) r1 = *(u32 *)(r10 -8) ; R1=0 R10=fp0 fp-8=mmmm0
...
--- 6 identical iterations omitted ---
; u64 t = bpf_ktime_get_ns(); @ main.bpf.c:453
171: (85) call bpf_ktime_get_ns#5 ; R0=scalar()
...
--- end repeat ---
190: (b7) r1 = 0 ; R1=0
; *p = (int)acc; @ main.bpf.c:464
...
192: (63) *(u32 *)(r1 +0) = r2
R1 invalid mem access 'scalar'
processed 186 insns (limit 1000000) max_states_per_insn 0 total_states 7 peak_states 7 mark_read 0
...
verifier summary: 0 ✅ 1 ❌ 0 🇽
topology ktstr_broken
tiny-1llc ❌
failing combinations (scheduler / kernel / topology):
ktstr_broken / kernel_7_0 / tiny-1llc
error: cargo nextest run exited with 100
The interleaved ; source line @ file:line comments name the C
statement each instruction group came from — the offending store is
*p = (int)acc; at main.bpf.c:464.
Cycle collapse
The kernel verifier unrolls loops, re-verifying each instruction with updated register state. A bounded 8-instruction loop verified 100 times produces 800 near-identical lines that differ only in register-state annotations; naive truncation loses the context you came for. Cycle collapse keeps the structure: first iteration (what the loop does), an omission count, last iteration (final state).
The algorithm normalizes lines by stripping register-state
annotations (source comments are preserved as anchors), finds the
most frequent normalized line to establish the cycle period (minimum
period 5 lines, minimum 3 repetitions), verifies consecutive blocks
match, and collapses — iterating up to 5 passes for nested loops.
--raw skips all of this and prints the full log.
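A single-pass sketch of that idea follows. It is a simplification under stated assumptions, not ktstr's implementation: normalize also strips the leading instruction index (an assumption, since indices differ between unrolled iterations), the period is taken as the distance between the first two occurrences of the most frequent line, ties go to the earliest line, and nested loops are not handled. The marker text imitates the log shown earlier.

```rust
use std::cmp::Reverse;
use std::collections::HashMap;

/// Strip the parts of a verifier log line that vary between iterations.
/// Source-comment lines (starting with `;`) are kept whole as anchors.
fn normalize(line: &str) -> String {
    let t = line.trim();
    if t.starts_with(';') {
        return t.to_string();
    }
    // Assumption: drop a leading "NN: " instruction index.
    let body = match t.split_once(": ") {
        Some((idx, rest)) if !idx.is_empty() && idx.chars().all(|c| c.is_ascii_digit()) => rest,
        _ => t,
    };
    // Drop the trailing " ; R1=..." register-state annotation.
    body.split(" ; ").next().unwrap_or(body).trim().to_string()
}

/// Collapse one run of repeated blocks: first block, omission count, last block.
fn collapse_once(lines: &[String]) -> Vec<String> {
    const MIN_PERIOD: usize = 5;
    const MIN_REPS: usize = 3;
    let norm: Vec<String> = lines.iter().map(|l| normalize(l)).collect();

    // The most frequent normalized line anchors the cycle (earliest wins ties).
    let mut positions: HashMap<&str, Vec<usize>> = HashMap::new();
    for (i, n) in norm.iter().enumerate() {
        positions.entry(n.as_str()).or_default().push(i);
    }
    let Some(anchor) = positions.values().max_by_key(|v| (v.len(), Reverse(v[0]))) else {
        return lines.to_vec();
    };
    if anchor.len() < MIN_REPS {
        return lines.to_vec();
    }
    let (start, period) = (anchor[0], anchor[1] - anchor[0]);
    if period < MIN_PERIOD {
        return lines.to_vec();
    }

    // Count consecutive blocks that match the first one after normalization.
    let mut reps = 1;
    while start + (reps + 1) * period <= norm.len()
        && norm[start..start + period] == norm[start + reps * period..start + (reps + 1) * period]
    {
        reps += 1;
    }
    if reps < MIN_REPS {
        return lines.to_vec();
    }

    let last = start + (reps - 1) * period;
    let mut out = lines[..start].to_vec();
    out.push(format!("--- {reps}x of the following {period} lines ---"));
    out.extend_from_slice(&lines[start..start + period]);
    out.push(format!("--- {} identical iterations omitted ---", reps - 2));
    out.extend_from_slice(&lines[last..last + period]);
    out.push("--- end repeat ---".to_string());
    out.extend_from_slice(&lines[start + reps * period..]);
    out
}
```

Keeping the first and last blocks verbatim preserves the register state on entry to and exit from the loop, which is usually what a rejection a few instructions later depends on.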
Matrix dimensions and filters
The sweep matrix is (declared scheduler × kernel × topology preset).
Schedulers come from the declare_scheduler! registry (--scheduler NAME narrows to one; EEVDF and kernel-builtin declarations are
skipped — no userspace binary to verify). Kernels come from the
operator’s --kernel set; with no flag, one auto-discovered kernel is
used. The topology axis is the set of
gauntlet presets each scheduler’s
constraints accept.
Each scheduler’s kernels = [...] declaration filters the
operator-supplied kernel set:
- kernels = [] (or omitted): accepts every kernel-list entry.
- Version specs ("6.14.2"): match entries whose label equals the version (raw or sanitized form).
- Range specs ("6.14..6.16", "6.14..=6.16"): match entries whose version falls in the inclusive range.
- Path / cache-key / git specs: match by sanitized-label equality.
# Scheduler declares kernels = ["6.14..6.16"]
# Operator passes 6.14.2, 6.15.0, 6.17.0 — the third is filtered out.
# Cells emitted per accepted preset:
# verifier/<sched>/kernel_6_14_2/<preset>
# verifier/<sched>/kernel_6_15_0/<preset>
cargo ktstr verifier --kernel 6.14.2 --kernel 6.15.0 --kernel 6.17.0
A cell whose kernel label matches nothing in the resolved set errors with a diagnostic naming the present labels — no silent fallback to an unrelated kernel.
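The version and range rules can be sketched as a predicate over one operator-supplied entry. All names here are hypothetical; the sketch assumes a range bound without a patch number covers its whole series, accepts entries in either raw (6.14.2) or sanitized (kernel_6_14_2) form, and does not model path, cache-key, or git specs.

```rust
/// Parse "6.14" or "6.14.2" into (major, minor, optional patch).
fn parse_version(v: &str) -> Option<(u32, u32, Option<u32>)> {
    let mut parts = v.split('.');
    let major = parts.next()?.parse().ok()?;
    let minor = parts.next()?.parse().ok()?;
    let patch = match parts.next() {
        Some(p) => Some(p.parse().ok()?),
        None => None,
    };
    Some((major, minor, patch))
}

/// Map a sanitized label back to dotted form; raw versions pass through.
fn dotted(label: &str) -> String {
    match label.strip_prefix("kernel_") {
        Some(rest) => rest.replace('_', "."),
        None => label.to_string(),
    }
}

/// Does a scheduler's `kernels = [...]` list accept this entry?
fn accepts_kernel(specs: &[&str], entry_label: &str) -> bool {
    if specs.is_empty() {
        return true; // empty or omitted list accepts everything
    }
    let entry = dotted(entry_label);
    specs.iter().any(|spec| match spec.split_once("..") {
        Some((lo, hi)) => {
            let hi = hi.trim_start_matches('='); // "..=" and ".." are both inclusive
            match (parse_version(lo), parse_version(hi), parse_version(&entry)) {
                (Some(lo), Some(hi), Some(v)) => {
                    // Assumption: a missing patch widens the bound to the whole series.
                    let key = |(a, b, c): (u32, u32, Option<u32>), fill: u32| (a, b, c.unwrap_or(fill));
                    key(lo, 0) <= key(v, 0) && key(v, 0) <= key(hi, u32::MAX)
                }
                _ => false,
            }
        }
        None => dotted(spec) == entry,
    })
}
```

With kernels = ["6.14..6.16"], this accepts 6.14.2 and 6.15.0 and rejects 6.17.0, matching the example above.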
Runtime: total cost is one VM boot per cell — schedulers × kernels × accepted presets. Cells run in parallel under nextest; the 4-cell example above cost ~13 s.
Fixture knobs
The scx-ktstr fixture scheduler ships two flags that make the
rejection path testable on demand:
- --fail-verify: sets a .rodata variable before scx_ops_load! that enables a store through a null pointer in ktstr_dispatch, the invalid access the verifier rejects.
- --verify-loop: same rejection, preceded by an unrolled 8-iteration loop so the log exercises cycle collapse. It is deliberately not a while(1): the verifier’s infinite-loop analysis could keep scx_ops_load from returning within the host’s scheduler-attach poll.
Pass them via sched_args on a scratch declare_scheduler! — that
is exactly how the rejection capture above was produced.
Reading Failure Output
When a test fails, everything ktstr knows lands in the test’s stderr as one bundle: the violated thresholds, the workload statistics, a phase timeline, the scheduler’s own log, the monitor’s summary, and — for scheduler crashes — the kernel’s sched_ext dump and an auto-repro trail. This page walks the sections in the order they appear, using two real failures.
A check-gate failure
This test set an iteration-rate floor its (deliberately slowed) scheduler could not meet. The first line names the test, scheduler, and topology; the indented lines under it are the violated checks:
TRY 1 FAIL [ 31.810s] (───) ktstr::docs_demo ktstr/throughput_gate
stderr ───
...
ktstr_test 'throughput_gate' [sched=scx-ktstr] [topo=1n1l2c1t] failed:
worker 71 iteration rate 41903.3/s below floor 50000000.0/s
worker 73 iteration rate 37834.5/s below floor 50000000.0/s
--- stats ---
2 workers, 4 cpus, 2 migrations, worst_spread=0.0%, worst_gap=21ms
cg0: workers=1 cpus=2 spread=0.0% gap=10ms migrations=1 iter=209600
cg1: workers=1 cpus=2 spread=0.0% gap=21ms migrations=1 iter=189252
--- timeline ---
topology: 1n1l2c1t (2 cpus) scheduler: my_sched scenario: throughput_gate duration: 15.0s
Phase 1: StepStart[0] ops=0 (4960ms, 0 samples):
imbalance: avg=1.2 max=5.0 | dsq: avg=0 max=0 | nr_run: avg=1.0 | fallback: 0/s | keep_last: 38/s | throughput: 79697 iter/s (stimulus-derived)
per-cgroup:
cg_a: off-cpu avg=0.3% min=0.3% max=0.3% spread=0.0% | run-delay mean=915µs worst=915µs | iters=209600 migrations=1 | gap=10ms@cpu0
cg_b: off-cpu avg=9.0% min=9.0% max=9.0% spread=0.0% | run-delay mean=5654µs worst=5654µs | iters=189252 migrations=1 | gap=21ms@cpu0
>>> StepStart[0]: ops=0 (2 cgroups, 2 workers)
Reading it:
- Header + details: the checks that failed, one line each, with the observed value and the threshold in the same line. The detail line is the verdict; everything below is context.
- --- stats ---: the per-run roll-up (worker count, distinct CPUs touched, migrations, worst per-cgroup spread and off-CPU gap). cg{i} is the positional index of the cgroup in this roll-up, not its name; it lines up with the per-cgroup rows underneath.
- --- timeline ---: one block per scenario phase with monitor averages and per-cgroup detail (named cgroups here). In this run, cg_b’s 9% off-CPU time and 5.6 ms mean run-delay against cg_a’s 0.3%/0.9 ms is the asymmetry the failed rate gate traces back to.
Two more sections follow every failure when they have content:
--- scheduler log ---
libbpf: struct_ops ktstr_ops: member sub_attach not found in kernel, skipping it as it's set to zero
...
--- monitor ---
samples=41 max_imbalance=2.00 max_dsq_depth=0 stuck=0
avg: imbalance=1.32 nr_running/cpu=1.2 dsq/cpu=0.0
events: fallback=0 (0.0/s) keep_last=210 (52.5/s) offline=0
events+: refill_slice_dfl=210
schedstat: csw=586 (146/s) run_delay=381246314ns/s ttwu=204 goidle=1
bpf: ktstr_select_cp cnt=189 145ns/call
bpf: ktstr_enqueue cnt=373 34ns/call
bpf: ktstr_dispatch cnt=584 237ns/call
verdict: monitor OK
The scheduler log is whatever the scheduler binary printed (libbpf
noise included). The monitor block is the host-side observer’s
summary — see Monitor for what each
line means. verdict: monitor OK here says the monitor’s checks
passed; the test still failed on the worker-side rate gate. The two
channels are independent.
A scheduler crash
When the scheduler itself dies, the trail grows: a BUG SUMMARY
line, a --- diagnostics --- section for the run stage, the kernel’s
sched_ext debug dump, and an auto-repro section. From a real
scx_bpf_error crash (triggered on purpose by a demo test):
BUG SUMMARY: scx_bpf_error (src/bpf/main.bpf.c:424: ktstr: host-triggered crash)
ktstr_test 'bpf_crash_auto_repro_e2e' [sched=scx-ktstr] [topo=1n1l4c1t] failed:
scheduler process died unexpectedly during workload (2.2s into test)
--- stats ---
4 workers, 0 cpus, 0 migrations, worst_spread=0.0%, worst_gap=0ms
cg0: workers=4 cpus=0 spread=n/a gap=0ms migrations=0 iter=0
--- diagnostics ---
stage: payload started but produced no test result
exit_code=1
BUG SUMMARY is the one-line cause, extracted from the kernel’s
triggered exit kind emission or the scheduler log. The
--- diagnostics --- stage line tells you how far the run got before
dying — here the workload started but never reported, because the
scheduler died under it.
The scheduler-log section carries the kernel’s full debug dump for
scheduler exits — exit kind, backtrace, and (if the scheduler
implements ops.dump) its own state:
--- scheduler log ---
...
DEBUG DUMP
================================================================================
swapper/3[0] triggered exit kind 1025:
scx_bpf_error (src/bpf/main.bpf.c:424: ktstr: host-triggered crash)
Backtrace:
scx_exit+0x50/0x70
scx_bpf_error_bstr+0x78/0x90
bpf_prog_1fed99378f3a8055_ktstr_dispatch+0x4d/0x1cb
bpf__sched_ext_ops_dispatch+0x4b/0xa7
do_pick_task_scx+0x379/0x770
__schedule+0x5ca/0xfc0
...
ktstr scheduler state:
stall=0 crash=1 degrade_rt=0
A --- sched_ext dump --- section repeats the same dump as captured
from the kernel trace channel, and --- auto-repro --- reports the
second VM’s replay of the crash:
--- auto-repro ---
--- probe pipeline ---
extracted: 10 functions from crash backtrace
traceable: 7 passed, 3 dropped: bpf_prog_1fed99378f3a8055_ktstr_dispatch, bpf__sched_ext_ops_dispatch, ret_from_fork_asm
...
repro VM duration: 16.9s
See Auto-Repro for how to read the probe pipeline and its output.
Detail-line catalog
The worker-side checks emit a fixed set of detail-line shapes (each format string is pinned by a unit test, so these stay accurate):
- worker {N} iteration rate {R}/s below floor {F}/s
  A benchmark rate gate failed.
- tid {N} starved (0 work units)
  A worker made no progress at all (not_starved).
- tid {N} stuck {X}ms on cpu{C} at +{T}ms (threshold {N}ms)
  A worker’s longest off-CPU gap crossed max_gap_ms.
- unfair cgroup: spread={P}% ({lo}-{hi}%) {N} workers on {N} cpus (threshold {P}%)
  Per-cgroup fairness exceeded max_spread_pct.
See Checking for the model behind these and the monitor-side violations.
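Because the shapes are fixed, tooling that scrapes stderr can classify detail lines by their leading words. The sketch below is one hypothetical way to do that in Rust; the `Detail` enum and `classify` function are invented for this example and are not part of ktstr.

```rust
/// Hypothetical classification of the detail-line shapes listed above.
#[derive(Debug, PartialEq)]
enum Detail {
    RateBelowFloor { worker: u32, rate: f64, floor: f64 },
    Starved { tid: u32 },
    Stuck,
    UnfairCgroup,
    Other,
}

fn classify(line: &str) -> Detail {
    let words: Vec<&str> = line.split_whitespace().collect();
    match words.as_slice() {
        // worker {N} iteration rate {R}/s below floor {F}/s
        ["worker", n, "iteration", "rate", rate, "below", "floor", floor] => {
            let parsed = (
                n.parse::<u32>(),
                rate.trim_end_matches("/s").parse::<f64>(),
                floor.trim_end_matches("/s").parse::<f64>(),
            );
            match parsed {
                (Ok(worker), Ok(rate), Ok(floor)) => {
                    Detail::RateBelowFloor { worker, rate, floor }
                }
                _ => Detail::Other,
            }
        }
        // tid {N} starved (0 work units)
        ["tid", n, "starved", ..] => match n.parse::<u32>() {
            Ok(tid) => Detail::Starved { tid },
            Err(_) => Detail::Other,
        },
        // tid {N} stuck {X}ms on cpu{C} at +{T}ms (threshold {N}ms)
        ["tid", _, "stuck", ..] => Detail::Stuck,
        // unfair cgroup: spread={P}% ...
        ["unfair", "cgroup:", ..] => Detail::UnfairCgroup,
        _ => Detail::Other,
    }
}
```

Anything that does not match one of the four shapes, such as the crash message "scheduler process died unexpectedly ...", falls through to `Detail::Other`.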
Artifacts on disk
After the run, cargo ktstr prints where everything landed:
cargo ktstr: test outputs
~/ktstr/target/ktstr/7.0.14-73730e0-dirty
FAILED throughput_gate [my_sched 1n1l2c1t]
failure dump ~/ktstr/target/ktstr/7.0.14-73730e0-dirty/throughput_gate-2ecd2624f3df7276.failure-dump.json
stats ~/ktstr/target/ktstr/7.0.14-73730e0-dirty/throughput_gate-2ecd2624f3df7276.ktstr.json
replay cargo ktstr replay --filter throughput_gate --exec
Every failed test writes a
{test_name}-{variant_hash:016x}.failure-dump.json next to its
result sidecar in the run directory (see Runs for the
directory semantics). Auto-repro runs write a sibling
.repro.failure-dump.json with the repro VM’s own snapshot. The
path is pre-cleared at each dispatch, so a passing rerun never
leaves a stale dump behind.
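For tooling that needs to locate a dump, the naming scheme maps directly onto a format string. A minimal sketch, assuming the variant hash is available as a u64 and that the repro sibling shares the same stem (the helper names are hypothetical):

```rust
use std::path::{Path, PathBuf};

/// {test_name}-{variant_hash:016x}.failure-dump.json inside the run directory.
fn failure_dump_path(run_dir: &Path, test_name: &str, variant_hash: u64) -> PathBuf {
    run_dir.join(format!("{test_name}-{variant_hash:016x}.failure-dump.json"))
}

/// Assumed sibling name for the auto-repro VM's dump (same stem, `.repro` infix).
fn repro_dump_path(run_dir: &Path, test_name: &str, variant_hash: u64) -> PathBuf {
    run_dir.join(format!("{test_name}-{variant_hash:016x}.repro.failure-dump.json"))
}
```

With the run directory and hash from the listing above, `failure_dump_path` reproduces the printed `throughput_gate-2ecd2624f3df7276.failure-dump.json` path.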
The dump comes in two shapes. When the scheduler attached and its exit path triggered, it is a full post-mortem: BPF map contents with BTF-typed field names, per-vCPU registers, per-program runtime stats — the JSON form of what Snapshots renders. When the failure happened before the BPF probe attached, a placeholder is written instead, and says so:
{
"schema": "single",
"maps": [],
"sdt_alloc_unavailable": "test failed at stage `payload started but produced no test result`; no BPF state captured (probe did not attach before failure)",
...
"is_placeholder": true
}
The JSON is for tooling that walks the run directory; humans should
read the stderr — the actionable diagnostics are the BUG SUMMARY
line and the --- sched_ext dump --- section.
Investigation workflow
- Read the header and detail lines: they name the check and the margin by which it failed.
- For check failures, correlate against --- stats --- and --- timeline ---: which cgroup, which phase, migrations or gaps?
- For crashes, start from BUG SUMMARY and the backtrace in the debug dump, then read the auto-repro trail for the state on the way to the exit.
- Re-run exactly the failing variant, including the gauntlet preset segment if it was a gauntlet case (see test name shapes):
  cargo ktstr test --kernel 7.0 -- -E 'test(=gauntlet/my_test/smt-2llc)'
  or re-run everything that failed last session with
  cargo ktstr replay
- Poke at the same environment interactively:
  cargo ktstr shell --test my_test
  boots a VM with the test’s topology, memory, and include files (see ktstr shell).
Verbosity knobs
- RUST_BACKTRACE=1 enables Rust panic backtraces plus ktstr’s verbose mode: the full guest kernel console is appended to --- diagnostics --- on failure, the auto-repro VM’s console is forwarded live, and the guest boots with loglevel=7.
- RUST_LOG=ktstr=debug enables host-side tracing (probe attach reasons, libbpf errors).
- --dmesg streams the guest kernel console in real time; it is a flag on cargo ktstr shell / ktstr shell, not on test.
Environment errors (kernel not found, cgroup controllers missing, flock timeouts) are cataloged in Troubleshooting.
Auto-Repro
The crash log tells you where the scheduler died; auto-repro tells
you what the state was on the way there. When a test fails because
the scheduler crashed or exited, ktstr boots a second VM, reruns the
scenario with BPF probes attached to the functions from the crash
backtrace, and prints each probed call with decoded arguments and
struct fields — entry and exit values side by side. The trail
appears in the --- auto-repro --- section of the failure output
(see Reading Failure Output); the end-to-end
debugging story is the
Investigate a Crash recipe.
Example output
The probe dump shows each function with decoded fields and source locations (DWARF for kernel functions, BPF line info for callbacks). Where fexit captured post-mutation state, changed fields show an arrow between entry and exit values:
ktstr_test 'bpf_crash_auto_repro_e2e' [sched=scx-ktstr] [topo=1n1l4c1t] failed:
scheduler process died unexpectedly during workload (2.0s into test)
--- auto-repro ---
=== AUTO-PROBE: scx_exit fired ===
ktstr_enqueue main.bpf.c:21
task_struct *p
pid 97
cpus_ptr 0xf(0-3)
dsq_id SCX_DSQ_INVALID
enq_flags NONE
slice 0
vtime 0
weight 100
sticky_cpu -1
scx_flags QUEUED|ENABLED
do_enqueue_task kernel/sched/ext.c
rq *rq
cpu 1
task_struct *p
pid 97
cpus_ptr 0xf(0-3)
dsq_id SCX_DSQ_INVALID → SCX_DSQ_LOCAL
enq_flags NONE
slice 20000000
vtime 0
weight 100
sticky_cpu -1
scx_flags QUEUED|DEQD_FOR_SLEEP → QUEUED
Reading it: the task entered the scheduler’s enqueue callback with
dsq_id = SCX_DSQ_INVALID (on no dispatch queue) and an expired
slice. By the time do_enqueue_task returned, the task sat on the
local DSQ (SCX_DSQ_INVALID → SCX_DSQ_LOCAL) with a refilled
default slice (20000000 ns), and the DEQD_FOR_SLEEP flag had been
cleared. That is a healthy enqueue path — captured at the moment
scx_exit fired, so you can see exactly what the scheduler did with
its last tasks before the error.
After the probe data, the section appends the repro VM’s wall time and, when non-empty, the last lines of its scheduler log, sched_ext dump, failure-dump JSON, and dmesg.
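The arrow convention is simple to mimic when post-processing probe data. The following is only a sketch of the rendering rule as described here (one value when nothing changed, "entry → exit" when it did); the function name and column width are assumptions for the example.

```rust
/// Render one probed field. Shows a single value when the exit value is
/// missing or equal to the entry value, and "entry → exit" when it changed.
fn render_field(name: &str, entry: &str, exit: Option<&str>) -> String {
    match exit {
        Some(exit) if exit != entry => format!("{name:<12}{entry} → {exit}"),
        _ => format!("{name:<12}{entry}"),
    }
}
```

For example, `render_field("dsq_id", "SCX_DSQ_INVALID", Some("SCX_DSQ_LOCAL"))` produces a line ending in `SCX_DSQ_INVALID → SCX_DSQ_LOCAL`, while an unchanged `pid` renders without an arrow.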
Enabling it — and what it costs
Auto-repro is on by default for every #[ktstr_test] with a
scheduler. Opt out per test:
#[ktstr_test(scheduler = MY_SCHED, auto_repro = false)]
fn my_test(ctx: &Ctx) -> Result<AssertResult> { ... }
It fires only when the primary run fails, and it is disabled
automatically when expect_err = true (no point probing a
deliberately failing test). The cost is a second VM boot plus a full
scenario rerun — in the captured demo below, the repro VM added
about 17 seconds.
How it works
- Stack extraction: function names are parsed from the crash trace in the scheduler log or kernel console. BPF program symbols (bpf_prog_*) are recognized and their short names extracted; generic frames (spinlocks, syscall entry, sched_ext exit machinery, trampolines) are filtered out.
- BPF discovery: in the repro VM, loaded struct_ops programs are discovered and added to the probe list along with their kernel-side callers (e.g. enqueue → do_enqueue_task), so the pipeline still probes something when the crash produced no extractable stack.
- BTF resolution: signatures come from vmlinux BTF and program BTF; known structs (task_struct, rq, dispatch queues) have curated fields resolved to offsets, and other struct pointers get scalar/enum/cpumask fields auto-discovered.
- Probed rerun: the second VM reruns the scenario with kprobes on kernel entry, fentry/fexit on BPF callbacks and kernel exits, and a one-shot trigger on the sched_ext_exit tracepoint that fires at the moment the exit is claimed.
- Stitching: events are filtered to the task that triggered the exit, sorted by timestamp, and rendered with decoded values.
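To make the stack-extraction step concrete, here is a small Rust sketch that pulls symbols out of backtrace frames and shortens bpf_prog_* names. It is an approximation for illustration: the helper names are invented, and the generic-frame filter is left out because its exact list is not given here.

```rust
/// Extract the symbol from a backtrace frame such as "scx_exit+0x50/0x70".
fn frame_symbol(frame: &str) -> Option<&str> {
    let (sym, offsets) = frame.trim().split_once('+')?;
    if sym.is_empty() || !offsets.starts_with("0x") {
        return None;
    }
    Some(sym)
}

/// For "bpf_prog_<hash>_<name>" symbols, return the short program name.
fn bpf_short_name(sym: &str) -> Option<&str> {
    let rest = sym.strip_prefix("bpf_prog_")?;
    let (hash, name) = rest.split_once('_')?;
    let is_hash = !hash.is_empty() && hash.chars().all(|c| c.is_ascii_hexdigit());
    if is_hash && !name.is_empty() {
        Some(name)
    } else {
        None
    }
}

/// Turn a raw backtrace into function names, in frame order.
fn extract(trace: &str) -> Vec<String> {
    trace
        .lines()
        .filter_map(frame_symbol)
        .map(|sym| bpf_short_name(sym).unwrap_or(sym).to_string())
        .collect()
}
```

Fed the backtrace from the debug dump in Reading Failure Output, `extract` turns `bpf_prog_1fed99378f3a8055_ktstr_dispatch+0x4d/0x1cb` into `ktstr_dispatch` and passes kernel symbols through unchanged.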
If the primary VM failed before the scheduler ever attached and the
workload ever ran, the repro has nothing to reproduce — the framework
prepends a PRIMARY DID NOT REACH WORKLOAD label to the repro
verdict so you chase the primary’s startup failure (see its
--- diagnostics --- and --- timeline --- sections) instead of
reading the repro as evidence.
Kernel requirement
The probe trigger needs the sched_ext_exit tracepoint, which is
currently only in the sched_ext for-7.2 development branch — no
released stable kernel has it. On a kernel without it, the rest of
the pipeline still runs — the crash call chain is extracted and
probes are prepared — but the trigger cannot attach and no events are
captured. The --- probe pipeline --- block says exactly that; this
is the shape to recognize:
--- auto-repro ---
--- probe pipeline ---
extracted: 10 functions from crash backtrace
traceable: 7 passed, 3 dropped: bpf_prog_1fed99378f3a8055_ktstr_dispatch, bpf__sched_ext_ops_dispatch, ret_from_fork_asm
bpf_discover: 0 programs found
after_expand: 7 total probe targets
kprobes: 0 attached
trigger: attach failed (skeleton load (retry): No such process (os error 3); original error before retry: No such process (os error 3))
probe_data: 0 keys, 0 unmatched IPs
events: 0 captured, 0 after stitch
repro VM duration: 16.9s
The diagnostic tails (repro VM sched_ext dump, dmesg) are still appended, so the repro run remains useful as a crash-reproduction check even without probe events.
Example test
bpf_crash_auto_repro_e2e in ktstr’s tests/scenario_coverage.rs
drives the path end to end: a host-side BPF map write sets the
fixture scheduler’s crash global, the scheduler calls
scx_bpf_error, and the auto-repro VM replays it.
Runs and Regression Gates
Every test run writes machine-readable results — one JSON sidecar
per test, grouped into a run directory keyed by (kernel, project
commit). That makes “did my change regress anything?” a one-command
question: cargo ktstr perf-delta pairs two commits’ sidecars and
fails the build when metrics regress past their gates.
The workflow
- Run tests. Each invocation writes sidecars into target/ktstr/{kernel}-{project_commit}/:
  cargo ktstr test --kernel 6.14
- List runs:
  $ cargo ktstr stats list
  RUN                    TESTS  DATE                  ARCH
  7.0.14-73730e0-dirty   1      2026-07-04T23:28:34Z  x86_64
  7.1.0-73730e0-dirty    0      -                     -
  7.1.1-73730e0-dirty    0      -                     -
  7.0.0-73730e0-dirty    1      2026-07-04T22:16:24Z  x86_64
  7.1.0-73730e0          6      2026-07-04T21:41:43Z  x86_64
  Rows sort by directory mtime, most recent first. DATE is the run’s first sidecar timestamp; ARCH comes from the first sidecar with host context (- when none has one).
- Gate a change against a baseline commit:
  cargo ktstr perf-delta --noise-adjust 5 --kernel 6.14                    # HEAD vs merge-base(HEAD, main)
  cargo ktstr perf-delta --base abc1234                                    # vs an explicit commit, cached sidecars
  cargo ktstr perf-delta --noise-adjust 5 --kernel 6.14 -E cgroup_steady   # narrow the perf set
  The canonical WIP-vs-baseline pattern: run perf-delta --base abc1234 from a -dirty working tree against the clean commit you edited from.
- Print analysis of the most recent run (gauntlet outliers, BPF verifier stats, callback profile, KVM stats):
  cargo ktstr stats
- Inspect a run’s archived host context (the fingerprint the host-delta comparison uses: CPU identity, THP policy, sched sysctls):
  cargo ktstr stats show-host --run 6.14-abc1234
perf-delta
perf-delta compares performance_mode test metrics between HEAD
and a baseline commit, per scenario, using the metric registry’s
polarity and thresholds (enumerate it with
cargo ktstr stats list-metrics — see
Assertable Metrics). Output is
one row per compared metric with the baseline and HEAD values,
colored red for regressions and green for improvements; the command
exits non-zero once enough metrics regress to trip the failure gate:
by default 5 or more (--fail-threshold), so a lone noisy regression
does not flip CI red, or any regression in a metric named in
--must-fail. If the baseline produced no performance_mode
sidecars at all, it prints a notice and exits 0 — an empty perf set
is “nothing to compare”, not a failure.
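The exit decision described above reduces to a small predicate. A minimal sketch, with a hypothetical function name and metric names that are placeholders rather than registry entries:

```rust
/// True when the run should exit non-zero: either enough metrics regressed
/// to reach the threshold, or a regressed metric is on the must-fail list.
fn gate_trips(regressed: &[&str], fail_threshold: usize, must_fail: &[&str]) -> bool {
    regressed.len() >= fail_threshold || regressed.iter().any(|m| must_fail.contains(m))
}
```

With the default threshold of 5, a single regressed metric leaves the gate untripped unless that metric was named in --must-fail.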
Baseline resolution (highest precedence first):
- --base <commit>: compare HEAD directly against this commit-ish, no merge-base.
- --base-ref <ref>: compare against merge-base(HEAD, <ref>).
- $GITHUB_BASE_REF (set on pull_request events): compare against merge-base(HEAD, origin/<ref>).
- Otherwise merge-base(HEAD, main) (override the branch with --default-branch).
The resolved baseline is shortened to the 7-hex form sidecars record, and the command bails if it resolves to HEAD (nothing to compare).
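The precedence order can be written down as a short resolver. This Rust sketch only models the decision; the `Inputs` and `Baseline` types are invented for the example, and the actual git calls (merge-base, rev-parse) are left out.

```rust
/// What to compare HEAD against: an explicit commit, or merge-base(HEAD, ref).
#[derive(Debug, PartialEq)]
enum Baseline {
    Commit(String),
    MergeBase(String),
}

/// Hypothetical bundle of the flags and environment the resolver consults.
#[derive(Clone, Copy)]
struct Inputs<'a> {
    base: Option<&'a str>,            // --base <commit>
    base_ref: Option<&'a str>,        // --base-ref <ref>
    github_base_ref: Option<&'a str>, // $GITHUB_BASE_REF
    default_branch: Option<&'a str>,  // --default-branch
}

/// Highest precedence first, mirroring the list above.
fn resolve(i: &Inputs<'_>) -> Baseline {
    if let Some(c) = i.base {
        return Baseline::Commit(c.to_string());
    }
    if let Some(r) = i.base_ref {
        return Baseline::MergeBase(r.to_string());
    }
    if let Some(r) = i.github_base_ref.filter(|r| !r.is_empty()) {
        return Baseline::MergeBase(format!("origin/{r}"));
    }
    Baseline::MergeBase(i.default_branch.unwrap_or("main").to_string())
}

/// Shorten a full hash to the 7-hex form the sidecars record.
fn short_hex(full: &str) -> &str {
    &full[..full.len().min(7)]
}
```

Treating an empty $GITHUB_BASE_REF as unset is an assumption of this sketch, not something the text specifies.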
Two ways to get the baseline’s numbers:
- Cached (default): both sides’ sidecars must already be in the pool (a prior run, or a CI artifact you downloaded). perf-delta only resolves the pair and compares, applying --threshold PCT (uniform gate) or --policy PATH (per-metric JSON) over the registry defaults.
- --noise-adjust N (requires --kernel, N ≥ 2): produces both sides fresh. It checks the baseline and HEAD out into scratch checkouts, runs each side’s performance_mode tests N times, and gates on the observed spread. A regression counts only when the two sides are separated (a two-sided Welch t-test at α = 0.05, or fully disjoint min-max bands) and material (past the registry’s absolute + relative dual gate). This is the mode to trust on a noisy machine; budget N × per-side wall time for it.
perf-delta compares on the commit axis. A cross-config question —
scheduler A vs scheduler B at the same commit — is answered in-test
(see Compare a Scheduler vs EEVDF);
the worked A/B walkthrough with real gates is
A/B Compare Branches, and the CI
perf-gate job lives in CI.
Run directories
target/
└── ktstr/
├── 6.14-abc1234/ # kernel 6.14, project commit abc1234 (clean)
│ ├── test_a.ktstr.json
│ └── test_b.ktstr.json
└── 7.0-def5678-dirty/ # kernel 7.0, commit def5678 + uncommitted changes
├── test_a.ktstr.json
└── test_b.ktstr.json
The key is {kernel}-{project_commit}: the resolved kernel version,
plus the project tree’s HEAD short hex, suffixed -dirty when the
worktree differs from HEAD.
The commit is discovered from the test process’s working directory — for a scheduler crate using ktstr as a dev-dependency, that is the scheduler crate’s commit, not ktstr’s. Run from whichever clone you want the run keyed on.
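The key derivation is easy to mirror in scripts that need to predict where a run will land. A minimal sketch with a hypothetical helper; the `unknown` sentinel for non-git trees is described under Environment notes, and dropping the -dirty suffix in that case is an assumption.

```rust
/// Build a run-directory key: {kernel}-{project_commit}[-dirty].
/// `commit` is None when the working directory is not inside a git repository.
fn run_key(kernel: &str, commit: Option<&str>, dirty: bool) -> String {
    match commit {
        Some(c) if dirty => format!("{kernel}-{c}-dirty"),
        Some(c) => format!("{kernel}-{c}"),
        None => format!("{kernel}-unknown"),
    }
}
```

`run_key("7.0.14", Some("73730e0"), true)` gives the `7.0.14-73730e0-dirty` directory name seen in the listings on this page.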
Warning
A run directory is a last-writer-wins snapshot, not an archive. Re-running the suite at the same kernel and project commit pre-clears the prior sidecars at the new run’s first write. To preserve a run, move the directory out of the runs root (mv target/ktstr/6.14-abc1234 ~/ktstr-archives/...), since a sibling inside target/ktstr/ would still be walked by stats list, or commit your changes so the next run lands under a new key.
Pre-clear is shallow: only *.ktstr.json files at the top level are
removed. Subdirectories created by external orchestrators (per-job
gauntlet layouts) are left alone but still read by stats, so clean
those yourself when reusing them.
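What "shallow" means in practice can be shown with a small model of the rule. This is not ktstr's implementation, only a sketch of the stated behavior using the standard library; the function name is made up.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Remove only top-level `*.ktstr.json` files and return how many were
/// removed. Subdirectories and other files are left untouched.
fn pre_clear(run_dir: &Path) -> io::Result<usize> {
    let mut removed = 0;
    for entry in fs::read_dir(run_dir)? {
        let entry = entry?;
        let is_file = entry.file_type()?.is_file();
        let name = entry.file_name();
        let is_sidecar = name.to_string_lossy().ends_with(".ktstr.json");
        if is_file && is_sidecar {
            fs::remove_file(entry.path())?;
            removed += 1;
        }
    }
    Ok(removed)
}
```

A sidecar inside a per-job subdirectory survives this pass, which is exactly why stale data from an external orchestrator can linger in later stats output.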
Inspecting sidecars
Each sidecar records the test name, topology, scheduler, work type, verdict, per-cgroup stats, monitor summary, verifier and KVM stats, kernel version, host context, and timestamps. Discovery tooling:
- cargo ktstr stats list-values: the distinct values per filterable dimension (kernel, commit, scheduler, topology, work type, …) across the pool. It is the upstream answer to “what have I got?” before narrowing a perf-delta.
- cargo ktstr stats list-metrics: the regression metric registry (names, polarity, default gates, units).
- cargo ktstr stats explain-sidecar --run ID: why optional fields are absent, per sidecar, with a fix when one exists:
  walked 1 sidecar file(s), parsed 1 valid
  test: throughput_gate
    topology: 1n1l2c1t
    scheduler: my_sched
    ...
    populated optional fields (8): resolve_source, project_commit, monitor, kvm_stats, kernel_version, host, cleanup_duration_ms, run_source
    none fields (3):
      scheduler_commit [expected] - no SchedulerSpec variant currently exposes a reliable commit source — reserved on the schema for future enrichment (e.g. --version probe or ELF-note read on the resolved scheduler binary)
      payload [expected] - test declared no binary payload (scheduler-only test or pure-scenario test that never invokes ctx.payload(...))
      kernel_commit [actionable] - KTSTR_KERNEL is unset or empty ...
        fix: set KTSTR_KERNEL to a local kernel source tree that is a git repository (e.g. a git clone of the kernel)
  expected means None is the steady state; actionable means a different environment would populate the field. --json emits an aggregate object for dashboards.
stats list-values, show-host, and explain-sidecar all take
--dir DIR to point at an archived sidecar tree copied off a CI
host.
Environment notes
- Local filesystem required. The runs root must live on ext4 / xfs / btrfs / tmpfs; the advisory lock that serializes concurrent sidecar writes rejects NFS and other remote filesystems.
- Non-git runs collide. When the test process is not in a git repository, the commit slot is the literal unknown, and every such run shares {kernel}-unknown (with pre-clear between them). Set KTSTR_SIDECAR_DIR or put the tree under git to disambiguate. The sidecar’s own project_commit field stays null for these runs; the dirname sentinel and the JSON field intentionally diverge.
- KTSTR_SIDECAR_DIR overrides the sidecar directory itself (used as-is, no key suffix) for writes and for bare cargo ktstr stats reads. Pre-clear is skipped under the override: you chose the directory, you own its contents. The stats list / list-values / show-host subcommands do not consult it; use --dir.
Failure artifacts
Failed tests additionally write a failure-dump JSON next to their sidecar — see Reading Failure Output for the path scheme, the placeholder-vs-full distinction, and the investigation workflow.
Core Concepts
ktstr tests compose from three layers:
- Scenarios — the scheduling condition the test creates: cgroup layout, CPU partitioning, workloads, mid-run changes.
- Work types — what each worker process does, each variant targeting a specific kernel scheduling path.
- Checking — how results are judged: starvation, fairness, gaps, monitor thresholds, temporal patterns.
One test, all three layers visible:
#[ktstr_test(
scheduler = MY_SCHED, // scheduler under test
llcs = 2, cores = 4, threads = 1, // topology the VM boots with
not_starved = true, // checking: every worker progressed
max_spread_pct = 20.0, // checking: fairness bound
)]
fn steady_two_cells(ctx: &Ctx) -> Result<AssertResult> {
scenarios::steady(ctx) // scenario: 2 cgroups of CPU-spin workers
// (work type: the default SpinWait)
}
The layers compose orthogonally: the same scenario body runs across every topology a gauntlet sweep declares, and the checks apply uniformly to every variant.
Five more concepts round out the picture:
- Ops, Steps, and Backdrop: the API scenarios are built from. Most tests declare cgroups with CgroupDef; tests that change state mid-run compose Ops into Steps.
- Topology: the NUMA/LLC/core/thread layout a test declares and the VM actually boots with.
- MemPolicy: per-worker NUMA memory placement, for tests that measure memory locality.
- Performance Mode: host-side isolation for noise-sensitive measurements.
- Resource Budget: how concurrent VMs and kernel builds share host CPUs safely.
Read Scenarios, Work types, and Checking first — every test touches all three. Ops matters once a canned scenario stops being enough, and Topology once placement behavior is under test. Performance Mode and Resource Budget are operational: read them when measurements get noisy or hosts get shared.
Scenarios
A scenario is the scheduling condition a test creates — which cgroups
exist, which CPUs they may use, what their workers do, and what
changes mid-run. The canned scenarios in scenarios::* exist so those
conditions have names: scenarios::steady(ctx) produces the same
reproducible condition against every scheduler you point it at, which
is what makes results comparable across schedulers and commits.
use ktstr::prelude::*;
#[ktstr_test(llcs = 1, cores = 2, threads = 1)]
fn my_test(ctx: &Ctx) -> Result<AssertResult> {
scenarios::steady(ctx)
}
Canned scenarios (scenarios::*)
| Function | Condition tested | Setup |
|---|---|---|
| steady | Baseline fairness | 2 cgroups, no cpusets, equal CPU-spin load |
| steady_llc | LLC-boundary scheduling | 2 cgroups on different LLCs (skips on 1-LLC topologies) |
| oversubscribed | Dispatch under oversubscription | 2 cgroups, 32 mixed workers each |
| cpuset_apply | Cpuset assignment on running tasks | Disjoint cpusets applied mid-run |
| cpuset_clear | Cpuset removal on confined tasks | Cpusets cleared mid-run |
| cpuset_resize | Cpuset resizing adaptation | Cpusets shrink then grow |
| cgroup_add | Scheduler reaction to a new cgroup | Cgroups created while others run |
| cgroup_remove | Scheduler reaction to cgroup removal | Cgroups torn down while others run |
| affinity_change | Affinity mask changes | Worker affinities randomized mid-run |
| affinity_pinned | Narrow-affinity contention | Workers pinned to a 2-CPU subset |
| host_contention | Cgroup vs host-task fairness | Root-cgroup workers beside managed cgroups |
| mixed_workloads | Mixed workload fairness | Heavy + bursty + IO cgroups |
| nested_steady | Nested cgroup hierarchy | Workers in nested sub-cgroups |
| nested_task_move | Cross-level task migration | Tasks moved between nested cgroups |
More specialized custom_* functions live in the
ktstr::scenario::{affinity, basic, cpuset, dynamic, interaction, nested, performance, stress} modules — see the
API docs.
Start here
Against a new scheduler, run steady first — it is the smallest
condition that can fail (two cgroups, spin load, nothing dynamic).
Then steady_llc on a 2-LLC topology to see cache-boundary
placement, then mixed_workloads and oversubscribed for load
diversity. The dynamic scenarios (cpuset_*, cgroup_*,
affinity_*) each isolate one reconfiguration path; reach for the
one matching the code you changed.
Run parameters
A scenario body does not pick its own duration or topology — the
#[ktstr_test] attribute does. The workload runs for the test’s
duration_s (see the
macro reference), on the
topology the attribute declares, and a
gauntlet run re-executes the same
body across a whole topology matrix. Worker counts and cpusets come
from the scenario’s own CgroupDefs.
Every scenario ends the same way: worker reports are collected and the opted-in checks run against them. A run’s stats roll-up looks like:
--- stats ---
2 workers, 4 cpus, 2 migrations, worst_spread=0.0%, worst_gap=21ms
cg0: workers=1 cpus=2 spread=0.0% gap=10ms migrations=1 iter=209600
cg1: workers=1 cpus=2 spread=0.0% gap=21ms migrations=1 iter=189252
Reading Failure Output walks the full anatomy.
From canned to custom
Scenarios graduate in three stages; move down only when the stage above can’t express the condition:
- Canned: call a scenarios::* function. Zero setup, named, comparable.
- Your own cgroup layout: execute_defs(ctx, vec![...]) with CgroupDefs you declare (your worker counts, work types, cpusets), still one static phase.
- Steps: execute_steps / execute_scenario with Steps and Ops for anything that changes mid-run (cpuset swaps, scheduler replacement, snapshots, kernel-memory reads), plus a Backdrop for state that must outlive the steps.
Ops, Steps, and Backdrop documents stages 2 and 3: the
CgroupDef builder, every Op, and all the execute_* entry
points. A custom scenario is just the #[ktstr_test] function body
itself; Custom Scenarios
covers writing bodies that go beyond Steps entirely.
Ops, Steps, and Backdrop
Dynamic scenarios used to mean hand-written scenario bodies that create cgroups, sleep, poke sysfs, sleep again, and collect — with every ordering rule and error path yours to get right:
// By hand: manual setup, manual timing, manual teardown.
fn shrink_midrun(ctx: &Ctx) -> Result<AssertResult> {
let (mgr, cgroups, handles) = setup_cgroups(ctx, /* defs */)?;
std::thread::sleep(first_half);
mgr.set_cpuset("cg_hot", &ctx.topo.llc_aligned_cpuset(0))?;
std::thread::sleep(second_half);
// ...collect every handle, run checks, tear down in order...
}
The ops system expresses the same scenario declaratively — the framework owns timing, teardown, liveness checking, and report collection:
execute_steps(ctx, vec![
Step::with_defs(
vec![CgroupDef::named("cg_hot").workers(4)],
HoldSpec::frac(0.5),
),
Step::with_op(
Op::set_cpuset("cg_hot", CpusetSpec::llc(0)),
HoldSpec::frac(0.5),
),
])
Op
An Op is one atomic operation on the running scenario. The enum is
#[non_exhaustive] — pattern matches must end with ...
| Op | Effect |
|---|---|
| AddCgroup | Create an empty cgroup |
| AddCgroupDef | Create cgroup + cpuset + workers from a CgroupDef, mid-step |
| RemoveCgroup | Stop workers and remove a cgroup |
| StopCgroup | Stop a cgroup’s workers, keep the cgroup |
| SetCpuset / ClearCpuset / SwapCpusets | Set, clear, or swap cgroup cpusets |
| Spawn | Spawn workers into a named cgroup or the runner’s own cgroup |
| SetAffinity | Set worker affinity via AffinityIntent |
| MoveAllTasks | Move all tasks from one cgroup to another |
| FreezeCgroup / UnfreezeCgroup | Kernel-side cgroup freeze (not SIGSTOP); teardown auto-unfreezes |
| SteerIrq | Re-steer a hardware IRQ to one CPU (system-wide, not cpuset-scoped) |
| CaptureSnapshot | On-demand host-side snapshot of BPF maps, vCPU registers, per-CPU counters |
| WatchSnapshot | Snapshot every time the guest writes a named kernel symbol |
| CaptureCgroupProcs | Record a cgroup’s cgroup.procs PIDs under a tag |
| ReadKernelHot / ReadKernelCold | Read kernel memory (symbol, KVA, per-CPU or task field) |
| WriteKernelHot / WriteKernelCold | Write kernel memory; adjacent cold writes are batched |
| RunPayload / WaitPayload / KillPayload | Launch, await, or kill a binary payload |
| AttachScheduler / DetachScheduler / RestartScheduler / ReplaceScheduler | Manage the live scheduler mid-scenario |
| PinBpfMap | Hold a BPF map fd open across a scheduler swap |
Constructors take string literals directly (no .into()):
Op::add_cgroup("cg_0")
Op::add_cgroup_def(CgroupDef::named("cg_1").workers(4))
Op::set_cpuset("cg_0", CpusetSpec::disjoint(0, 2))
Op::spawn_workers("cg_0", WorkSpec::default().workers(4))
Op::spawn_host(WorkSpec::default().workers(4))
Op::set_affinity("cg_0", AffinityIntent::random_subset([0, 1, 2, 3], 2))
Op::capture_snapshot("after_spawn")
Op::freeze_cgroup("cg_0")
Op::spawn_host puts workers in the test runner’s own cgroup —
typically the guest root — to simulate host-level contention beside
managed cgroups.
Snapshot and watch ops
CaptureSnapshot pauses every vCPU through the freeze coordinator,
reads BPF map state, vCPU registers, and per-CPU counters, then
resumes; the report is keyed by the op’s name. With no snapshot
bridge installed it fails loudly rather than dropping the capture.
WatchSnapshot fires one capture per guest write to the named
symbol; the name must match the guest kernel’s vmlinux symbol table
verbatim, and at most 3 watch ops fit in a scenario (hardware debug
slots; one is reserved for the error-exit trigger). Details and
failure modes: Snapshots and
Watch Snapshots.
Kernel-memory ops
Hot variants read/write against the running vCPU; Cold variants
take a freeze rendezvous first. Targets are KernelTargets: a
symbol, a kernel virtual address, a per-CPU field, or a task field.
Payload ops
RunPayload spawns a binary-kind
Payload in the background;
WaitPayload blocks until it exits naturally, then evaluates its
checks and records its metrics; KillPayload does the same after
SIGKILL. Payloads are addressed by (name, cgroup); cgroup: None
resolves to the unique live copy. WaitPayload has no timeout — pair
it with a bounded hold or the payload’s own runtime flag.
Scheduler-kind payloads are rejected: the scheduler slot is the
#[ktstr_test(scheduler = ...)] attribute.
Scheduler ops
ReplaceScheduler swaps to a different staged scheduler binary
(declared via #[ktstr_test(staged_schedulers = [...])]).
PinBpfMap keeps a map fd alive so a same-binary swap window’s
.bss survives the replacement.
Permissive removal — a footgun
Op::RemoveCgroup and Op::StopCgroup are permitted against any
cgroup, including Backdrop-owned ones, and removing a
nonexistent cgroup silently succeeds (rmdir on a missing path is a
no-op). A typo’d name therefore surfaces later, as the kernel’s
No such file or directory on the next op that references the real
name. If a later step fails with a missing-cgroup error, grep the
test for Op::remove_cgroup calls naming a similar identifier
first. Op::MoveAllTasks is the exception: it rejects moves that
would strand Backdrop workers in a step-local cgroup.
CpusetSpec
CpusetSpec computes a cpuset from the topology at runtime. Build
via constructors — the enum is #[non_exhaustive]:
pub enum CpusetSpec {
Llc(usize), // all CPUs in one LLC
Numa(usize), // all CPUs in one NUMA node
Range { start_frac: f64, end_frac: f64 }, // fraction of usable CPUs
Disjoint { index: usize, of: usize }, // equal disjoint partitions
Overlap { index: usize, of: usize, frac: f64 }, // overlapping partitions
Exact(BTreeSet<usize>), // caller-supplied set
}
Constructors: CpusetSpec::llc(0), CpusetSpec::numa(0), CpusetSpec::range(0.0, 0.5), CpusetSpec::disjoint(0, 2), CpusetSpec::overlap(0, 2, 0.5), CpusetSpec::exact([0, 1, 2]).
Fractional and partition variants operate on usable_cpus(); Llc and Numa
cover their full domain.
CgroupDef
CgroupDef bundles the three ops that always travel together —
create cgroup, set cpuset, spawn workers — and is the primary way to
declare cgroups:
let def = CgroupDef::named("cg_0")
.cpuset(CpusetSpec::disjoint(0, 2))
.workers(4)
.work_type(WorkType::SpinWait);
Builder methods:
- .cpuset(CpusetSpec) / .cpuset_mems(set): CPU set, and an explicit cpuset.mems override (default derives from the cpuset’s NUMA nodes).
- .workers(n) / .workers_pct(p): worker count, absolute or as a fraction of the resolved cpuset (see below). Setting both is rejected with a diagnostic.
- .work_type(WorkType): what workers do (default SpinWait); see Work Types.
- .work(WorkSpec): add another worker group; call repeatedly for concurrent groups.
- .workload(&'static Payload): run a binary payload inside the cgroup alongside the workers. Panics on a scheduler-kind payload (there is no scenario-level recovery at build time; the step-level Op::RunPayload returns an error instead).
- .sched_policy(SchedPolicy): Linux scheduling policy (default Normal); see Scheduling policies.
- .affinity(AffinityIntent): per-worker affinity (default Inherit).
- .mem_policy(MemPolicy) / .mpol_flags(MpolFlags): NUMA memory placement; see MemPolicy.
- .nice(n), .comm(name), .pcomm(name), .uid(u) / .gid(g), .numa_node(node): per-worker identity defaults, merged into every WorkSpec that doesn’t set its own.
- .swappable(bool): opt into gauntlet work-type overrides (see below).
Cgroup-v2 controller knobs (default unconstrained):
.cpu_quota_pct(pct) / .cpu_quota(quota, period) /
.cpu_unlimited(), .cpu_weight(w), .memory_max(b) /
.memory_high(b) / .memory_low(b) / .memory_unlimited(),
.memory_swap_max(b) / .memory_swap_unlimited(), .io_weight(w),
.pids_max(n) / .pids_unlimited().
Cpuset-scaled worker counts
Tests that span topologies need worker counts that scale with the cpuset. Hand-computing couples the test to a manual resolution step:
// Before: hand-computed via Ctx::cpuset_cpus.
let n = (ctx.cpuset_cpus(&CpusetSpec::Llc(0)) as f64 * 0.9).ceil() as usize;
let def = CgroupDef::named("cg_hot").cpuset(CpusetSpec::Llc(0)).workers(n);
// After: resolved from the cgroup's own cpuset at apply time.
let def = CgroupDef::named("cg_hot")
.cpuset(CpusetSpec::Llc(0))
.workers_pct(0.9); // ceil(cpuset_cpus * 0.9)
Fractions above 1.0 are accepted as deliberate oversubscription.
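The rounding rule in the comment above, ceil(cpuset_cpus * fraction), can be sketched in isolation. This is a minimal standalone model, not ktstr's implementation; the function name workers_for_pct is hypothetical.

```rust
/// Hypothetical helper modelling the documented rule:
/// worker count = ceil(resolved cpuset size * fraction).
/// Fractions above 1.0 oversubscribe the cpuset on purpose.
fn workers_for_pct(cpuset_cpus: usize, pct: f64) -> usize {
    (cpuset_cpus as f64 * pct).ceil() as usize
}
```

Under this model an 8-CPU LLC at 0.9 rounds 7.2 up to 8 workers, and a 4-CPU cpuset at 1.5 yields 6.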
Work-type overrides and swappable
A gauntlet run can sweep work types (--ktstr-work-type=NAME,
surfaced as Ctx.work_type_override). The override replaces a
def’s work type only when that CgroupDef is marked
.swappable(true) (default false), and is skipped when the
override is a grouped work type whose group size does not divide the
resolved worker count. Non-swappable defs keep their declared type;
Op::Spawn always uses the type as given. This is the single
override mechanism for both #[ktstr_test] and ops-based scenarios.
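The decision described above reduces to two conditions: the def must be swappable, and a grouped override must divide the resolved worker count. A minimal sketch of that rule follows; the function and its parameters are hypothetical stand-ins, with override_group_size playing the role of WorkType::worker_group_size() on the override.

```rust
/// Hypothetical model of the override rule, not ktstr's code.
/// `override_group_size` is None for an ungrouped override work type.
fn override_applies(
    swappable: bool,
    override_group_size: Option<usize>,
    resolved_workers: usize,
) -> bool {
    if !swappable {
        // Non-swappable defs always keep their declared work type.
        return false;
    }
    match override_group_size {
        None => true,
        // Defensive: a zero group size can never divide anything.
        Some(0) => false,
        Some(group) => resolved_workers % group == 0,
    }
}
```

For example, a paired override (group size 2) would apply to a swappable def with 4 workers but be skipped for one with 3.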
Step
A Step is a list of ops plus a hold period:
pub struct Step {
pub setup: Setup, // CgroupDefs to create (after ops run)
pub ops: Vec<Op>, // operations to apply
pub hold: HoldSpec, // how long to hold afterward
}
Setup is Defs(Vec<CgroupDef>) or a topology-dependent
Setup::with_factory(fn(&Ctx) -> Vec<CgroupDef>).
Constructors:
- Step::with_defs(defs, hold): the primary constructor; create cgroups with workers, then hold.
- Step::new(ops, hold): ops only, no cgroup setup.
- Step::hold(hold): hold only; the canonical phase-A shape in an A/B scenario (Step::hold(HoldSpec::frac(0.3)) then Step::with_op(Op::replace_scheduler(&ALT), HoldSpec::frac(0.7))).
- Step::with_op(op, hold): one op, then hold.
- Step::with_payload(payload, hold): run one binary payload for the hold; it is drained at step teardown, nothing blocks on it.
Builder methods Step::set_ops and Step::set_hold replace their
field. The verb prefixes are consistent across the API: set_X
replaces, push_X appends one, extend_X appends many. So
Step::new(ops, hold).set_ops(more) discards the original ops, while
Backdrop::new().extend_ops(a).extend_ops(b) accumulates both.
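The difference between the three verb prefixes is easiest to see on a toy builder. The sketch below is a standalone illustration; OpList is a hypothetical type that stores op names as strings and is not part of ktstr.

```rust
/// Hypothetical builder that mirrors the set_X / push_X / extend_X
/// naming convention on a plain list of op names.
#[derive(Debug, Default, PartialEq)]
struct OpList {
    ops: Vec<&'static str>,
}

impl OpList {
    /// set_X: replace the whole field.
    fn set_ops(mut self, ops: impl IntoIterator<Item = &'static str>) -> Self {
        self.ops = ops.into_iter().collect();
        self
    }

    /// push_X: append exactly one element.
    fn push_op(mut self, op: &'static str) -> Self {
        self.ops.push(op);
        self
    }

    /// extend_X: append many elements, keeping what was there.
    fn extend_ops(mut self, ops: impl IntoIterator<Item = &'static str>) -> Self {
        self.ops.extend(ops);
        self
    }
}
```

Calling set_ops twice leaves only the second list, while chaining extend_ops and push_op keeps everything in call order.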
HoldSpec
| Variant | Meaning |
|---|---|
| Frac(f64) | Fraction of the scenario duration |
| Fixed(Duration) | Fixed time |
| Loop { interval } | Re-apply the step’s ops at interval until time runs out |
Sugar: HoldSpec::frac(0.5), HoldSpec::fixed(d),
HoldSpec::loop_at(d), and HoldSpec::FULL for Frac(1.0).
A Loop step is the natural shape for “every N seconds, do X”:
Step::new(
vec![Op::capture_snapshot("periodic")],
HoldSpec::loop_at(Duration::from_secs(2)),
)
The step’s setup runs once at step entry; only the ops repeat. Prior steps’ step-local state was already torn down at their own boundaries, so each loop iteration sees only Backdrop-owned state and this step’s own setup.
Backdrop
Steps tear their state down at each step boundary. A Backdrop is
the scenario-wide layer for state that must persist across steps —
long-lived cgroups every step references, background payloads that
run for the whole scenario, setup ops that seed state once:
let backdrop = Backdrop::new()
.push_cgroup(CgroupDef::named("bg_cell").cpuset(CpusetSpec::disjoint(0, 2)))
.push_op(Op::add_cgroup("bg_overflow")) // empty move-target cgroup
.push_payload(&BG_LOAD);
execute_scenario(ctx, backdrop, steps)
- Backdrop::from_cgroups([...]) builds one from CgroupDefs; push_cgroup / extend_cgroups, push_op / extend_ops, and push_payload / extend_payloads compose incrementally.
- Backdrop cgroups are created in declaration order, before the first step; every CgroupDef spawns at least one worker. Declare empty cgroups (move targets) via push_op(Op::add_cgroup(...)).
- Backdrop ops run after the cgroups, before the payloads, with full authority: they may remove or stop Backdrop cgroups where step-local ops are restricted.
- Payloads are spawned once and drained (killed, metrics preserved) at scenario teardown.
Any step can reference Backdrop cgroups by name (Op::MoveAllTasks,
Op::SetCpuset, …). The Backdrop tears down after the last step.
Executors
All in the prelude; each returns Result<AssertResult>:
- execute_defs(ctx, defs): the one-shot path. Create cgroups, run for the full duration, collect. Equivalent to execute_steps(ctx, vec![Step::with_defs(defs, HoldSpec::FULL)]).
- execute_steps(ctx, steps): run a step sequence. For each step, apply ops, then setup, then hold (Loop steps run setup once, then repeat ops); check scheduler liveness between steps; collect worker reports and run checks at the end.
- execute_steps_with(ctx, steps, Some(&assert)): same, with an explicit Assert overriding ctx.assert for worker checks. None falls back to ctx.assert (the merged scheduler + per-test config).
- execute_scenario(ctx, backdrop, steps) / execute_scenario_with(ctx, backdrop, steps, checks): the full composition. Backdrop setup, step sequence with per-step teardown, Backdrop teardown.
Phases
Steps give a scenario its timeline. The framework publishes the
active phase as steps progress: captures (periodic samples, watch
trips, on-demand snapshots) stamp with the phase active at capture
time, and every assertion detail constructed during a step’s hold
auto-stamps with that step’s label. Labels render as BASELINE (the
settle window before step 0) and Step[k] everywhere — sidecar
JSON, the timeline diagnostic, and per-assertion phase fields.
Phase-bucketed metrics are queryable from the result:
let baseline = r.stats.phase(Phase::BASELINE).expect("always populated");
let step_0 = r.stats.phase(Phase::step(0)).expect("Step 0 ran");
let thr = r.stats.phase_metric(Phase::step(0), "throughput");
Gate on r.stats.has_steps() before assuming step buckets exist —
a scenario that bailed in setup returns None from every
phase(Phase::step(k)) lookup. PhaseBucket::expect_metric panics
with the bucket’s label, sample count, and the metric keys actually
present, so a typo’d name and an empty phase are distinguishable at
a glance. Temporal Assertions
builds per-phase pattern checks on top of this.
The per-phase timeline also renders in every failure report:
--- timeline ---
topology: 1n1l2c1t (2 cpus) scheduler: my_sched scenario: throughput_gate duration: 15.0s
Phase 1: StepStart[0] ops=0 (4960ms, 0 samples):
imbalance: avg=1.2 max=5.0 | dsq: avg=0 max=0 | nr_run: avg=1.0 | fallback: 0/s | keep_last: 38/s | throughput: 79697 iter/s (stimulus-derived)
per-cgroup:
cg_a: off-cpu avg=0.3% min=0.3% max=0.3% spread=0.0% | run-delay mean=915µs worst=915µs | iters=209600 migrations=1 | gap=10ms@cpu0
cg_b: off-cpu avg=9.0% min=9.0% max=9.0% spread=0.0% | run-delay mean=5654µs worst=5654µs | iters=189252 migrations=1 | gap=21ms@cpu0
>>> StepStart[0]: ops=0 (2 cgroups, 2 workers)
Work Types
WorkType decides what each worker process does — and each variant
targets a specific kernel scheduling path, so a test pins down the
code path a regression lives in. ForkExit hammers
wake_up_new_task and the exit path (do_group_exit /
wait_task_zombie); AffinityChurn drives affine_move_task and
migration_cpu_stop. If you know which path your scheduler change
touches, there is usually a work type aimed at it.
Choosing a work type
| Scheduler behavior to test | Work type |
|---|---|
| Basic load balancing / fairness | SpinWait (default) |
| Wake placement / sleep-wake cycles | YieldHeavy, FutexPingPong |
| CPU borrowing / idle balance | Bursty, IdleChurn |
| Cross-CPU wake latency | PipeIo, CachePipe, WakeChain |
| Cache-aware scheduling | CachePressure, CacheYield |
| Fan-out wake storms | FutexFanOut, FanOutCompute |
| Broadcast wakeups (thundering herd) | ThunderingHerd |
| epoll exclusive-wake paths | EpollStorm |
| Timer (hrtimer) wake-to-run latency | TimerLatency |
| IRQ/softirq wake paths | IrqWake, NetTraffic |
| Wakeup + request latency (schbench parity) | Schbench |
| KV object-cache request mix (taobench parity) | Taobench |
| Task creation/destruction pressure | ForkExit |
| Priority reweighting / nice dynamics | NiceSweep |
| Affinity churn / forced migration | AffinityChurn, CrossAffinityChurn, NumaMigrationChurn |
| Scheduling-class transitions | PolicyChurn |
| Cgroup migration paths | CgroupChurn, CgroupAttachStorm |
| Page fault / TLB pressure | PageFaultChurn |
| NUMA locality under migration | NumaWorkingSetSweep |
| Lock contention / convoy effect | MutexContention |
| Priority inversion | PriorityInversion |
| RT starving or preempting CFS | RtStarvation, PreemptStorm |
| Signal delivery pressure | SignalStorm |
| Producer/consumer imbalance | ProducerConsumerImbalance |
| Block-I/O D-state cycles | IoSyncWrite, IoRandRead, IoConvoy |
| Mixed / phased real-world patterns | Mixed, Sequence |
| High-IPC compute, SMT interference | AluHot, SmtSiblingSpin, IpcVariance |
| Arbitrary user-defined workload | Custom |
Variants by intent
The WorkType enum in ktstr::workload is the source of truth —
run cargo doc for full per-variant semantics, parameters, and
kernel-path citations. The shape of each family:
CPU primitives. SpinWait — tight spin loop, pure CPU. YieldHeavy
— sched_yield every iteration, exercising wake/sleep paths. Mixed
— spin burst then yield. AluHot { width } — parallel multiply
chains at high IPC, optionally SIMD. SmtSiblingSpin — paired
PAUSE-spin on two SMT siblings. IpcVariance { hot_iters, cold_iters, period_iters } — alternating high-IPC and cache-miss phases.
Block I/O (against /dev/vda; per-worker tempfile fallback when
absent). IoSyncWrite — striped O_SYNC pwrites + fdatasync,
fsync-heavy D-state cycles. IoRandRead — 4 KB O_DIRECT preads at
random offsets, high-IOPS short D-states. IoConvoy — interleaved
sequential writes and random reads with periodic fdatasync.
Burst-and-sleep. Bursty { burst_duration, sleep_duration } —
CPU burst then sleep, freeing CPUs for borrowing. IdleChurn — burst
then nanosleep, exercising hrtimer + idle-class paths.
Cache pressure. CachePressure { size_kib, stride } — strided
read-modify-write sized to pressure L1. CacheYield — the same plus
sched_yield, testing re-placement with a cache-hot working set.
Wake placement and cross-CPU paths. PipeIo — CPU burst then
1-byte pipe exchange with a partner. FutexPingPong { spin_iters } —
paired futex wait/wake (non-WF_SYNC path). CachePipe — cache-hot
working set + pipe wake. FutexFanOut { fan_out, spin_iters } — one
messenger wakes N receivers, which measure wake-to-run latency.
FanOutCompute — fan-out plus matrix-multiply think time per
receiver. WakeChain { depth, wake, work_per_hop } — a ring of
waker-wakee hops via pipe (WF_SYNC) or futex. AsymmetricWaker —
paired workers in mismatched scheduling classes sharing a futex.
EpollStorm { producers, consumers, events_per_burst } — eventfd
producers + epoll_wait consumers (exclusive autoremove wake).
ThunderingHerd { waiters, batches, inter_batch_ms } — N waiters on
one futex word, broadcast-woken.
Timer and IRQ wakes (the AF_PACKET variants need
#[ktstr_test(network = ...)]). TimerLatency { interval_us } —
cyclictest-style absolute-deadline hrtimer wake. NetTraffic —
AF_PACKET self-traffic driving virtio-net RX hardirq + NAPI softirq.
IrqWake — paired sender/receiver; the receiver blocked in
recvfrom is woken from NET_RX softirq context.
Lifecycle and class churn. ForkExit — rapid fork + _exit +
waitpid cycles. NiceSweep — nice level cycled -20..19
(reweight_task; negative values skipped without CAP_SYS_NICE).
AffinityChurn { spin_iters } — self-directed sched_setaffinity
to random CPUs. CrossAffinityChurn — workers rewrite their cgroup
siblings’ affinity; needs a dedicated cgroup. PolicyChurn —
SCHED_OTHER → BATCH → IDLE (→ FIFO/RR with CAP_SYS_NICE)
via __sched_setscheduler. NumaMigrationChurn { period_ms } —
affinity rotated across NUMA nodes. CgroupChurn { groups, cycle_ms }
— membership cycled between sibling cgroups. CgroupAttachStorm
— transient children migrated into a sibling cgroup mid-exit
(attach-path leader race).
Memory pressure / NUMA. PageFaultChurn { region_kib, touches_per_cycle, spin_iters } — mmap, fault random 4 KiB pages
through do_anonymous_page, MADV_DONTNEED, repeat.
NumaWorkingSetSweep — the working set rotated across NUMA nodes
via mbind.
Lock contention. MutexContention { contenders, hold_iters, work_iters } — N-way futex mutex contention (convoy effect,
lock-holder preemption). PriorityInversion — three priority tiers
contending for one lock, PI or plain futex mode.
Signal / preemption pressure. SignalStorm — paired workers
fire tkill(partner, SIGUSR2) between bursts. PreemptStorm — one
SCHED_FIFO worker preempts CFS spinners at ~kHz. RtStarvation —
SCHED_FIFO workers monopolize CPUs while CFS workers starve.
Compound. Sequence { first, rest } — ordered WorkPhases
(Spin / Sleep / Yield / Io / AluHot, each with a
Duration) looped for the run:
WorkType::Sequence {
first: WorkPhase::Spin(Duration::from_millis(100)),
rest: vec![
WorkPhase::Sleep(Duration::from_millis(50)),
WorkPhase::Yield(Duration::from_millis(20)),
],
}
User-supplied. Custom — your own work function, with a
fork-safe config payload. The how-to lives in
Custom Scenarios.
use ktstr::prelude::*; brings in WorkType, WorkSpec, WorkPhase, SchedPolicy, and the parameter enums (SchbenchConfig, TaobenchConfig, FutexLockMode, WakeMechanism, SchedClass, ReapMode, AluWidth). Note the prelude also exports an unrelated Phase from the assertion layer; WorkType::Sequence uses WorkPhase.
Constructors and defaults
Every parameterized variant has a snake-case constructor —
WorkType::bursty(burst, sleep), WorkType::mutex_contention(4, 256, 1024), WorkType::wake_chain(depth, wake, work_per_hop), and so on
— with parameter validation where zero values are meaningless.
Duration-typed parameters (Bursty, IdleChurn, WakeChain) take
std::time::Duration, not raw integers.
WorkType::from_name("FutexPingPong") resolves a PascalCase name to
a default-parameterized instance; the per-variant defaults are the
constants in the
ktstr::workload::defaults
module. Sequence and Custom require explicit construction and
return None from name lookup. WorkType::ALL_NAMES lists every
name; WorkType::name() maps back.
Grouped work types
PipeIo, FutexPingPong, and CachePipe pair workers and require
even num_workers. FutexFanOut and FanOutCompute require
num_workers divisible by fan_out + 1 (one messenger + N
receivers per group). MutexContention requires divisibility by
contenders. WorkType::worker_group_size() returns the group size
for these variants, None for ungrouped types.
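The divisibility rules above can be written down as a small check. This is a sketch under the stated rules only; the Grouping enum and both functions are hypothetical and stand in for the relevant WorkType variants and worker_group_size().

```rust
/// Hypothetical stand-in for the grouping behaviour of a work type.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Grouping {
    /// No pairing or grouping requirement.
    Ungrouped,
    /// PipeIo, FutexPingPong, CachePipe: workers come in pairs.
    Paired,
    /// FutexFanOut, FanOutCompute: one messenger plus `fan_out` receivers.
    FanOut { fan_out: usize },
    /// MutexContention: groups of `contenders` workers.
    Contenders { contenders: usize },
}

/// Group size per the rules in the text; None for ungrouped types.
fn group_size(g: Grouping) -> Option<usize> {
    match g {
        Grouping::Ungrouped => None,
        Grouping::Paired => Some(2),
        Grouping::FanOut { fan_out } => Some(fan_out + 1),
        Grouping::Contenders { contenders } => Some(contenders),
    }
}

/// True when `num_workers` splits into whole groups.
fn worker_count_ok(g: Grouping, num_workers: usize) -> bool {
    match group_size(g) {
        None => true,
        Some(0) => false,
        Some(size) => num_workers % size == 0,
    }
}
```

With fan_out = 3 each group holds 4 workers, so 8 workers form two groups while 6 would be rejected.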
Schbench
Schbench re-expresses schbench’s default mode natively: message
threads batch-wake worker threads (wakeup latency), each worker
think-sleeps then does matrix work under a per-CPU lock (request
latency). SchbenchConfig’s fields map schbench’s
-m/-t/-F/-n/-s/-L/-R/-A/-p flags — the rustdoc has
the full CLI-parity table, including which knobs ktstr’s topology
sets for you. Use a single ktstr worker (workers(1)): the
message/worker parallelism is this variant’s internal thread
topology, not ktstr worker processes.
Worker teardown and process groups
Every worker calls setpgid(0, 0) after fork, and teardown SIGKILLs
the worker’s whole process group — on graceful stop, on escalation,
and again at handle drop. Anything a worker spawns that inherits its
pgid (a helper binary, a subshell) dies with it. A child that must
outlive the worker needs its own process group
(setpgid(child_pid, 0)) or an explicit wait before the worker
returns.
Clone mode and pcomm
Workers fork by default (CloneMode::Fork: one process per worker);
CloneMode::Thread runs them as threads sharing the parent’s thread
group. Setting pcomm on a WorkSpec or CgroupDef routes workers
through a fork-then-thread path: one forked leader whose comm is
the pcomm value hosts the matching workers as its threads — the
per-process-leader shape schedulers expect from real applications.
See the WorkloadConfig and WorkSpec rustdoc for the mechanics.
WorkloadConfig
WorkloadConfig is the low-level spawn spec CgroupDef builds
internally; use it directly only when calling WorkloadHandle::spawn
from a custom scenario. Its default is one SpinWait worker with
inherited affinity and policy. The composed field carries secondary
WorkSpec groups spawned alongside the primary; reports identify
them by group_idx. Topology-aware AffinityIntent variants
(SingleCpu, LlcAligned, CrossCgroup, SmtSiblingPair) need
scenario context and are rejected at the direct-spawn gate. See
Workers and Workloads.
Scheduling policies
pub enum SchedPolicy {
Normal,
Batch,
Idle,
Fifo(u32), // priority 1-99
RoundRobin(u32), // priority 1-99
Deadline { runtime: Duration, deadline: Duration, period: Duration },
Ext, // SCHED_EXT — route through the loaded BPF scheduler
}
Fifo, RoundRobin, and Deadline require CAP_SYS_NICE. A
malformed Deadline (runtime <= deadline <= period violated) fails
with a diagnostic before the syscall. Ext is SCHED_EXT: it routes
the worker through the loaded sched_ext scheduler even under a
SCX_OPS_SWITCH_PARTIAL scheduler that leaves other tasks in the fair
class, and requires CONFIG_SCHED_CLASS_EXT in the guest kernel.
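The Deadline precondition, runtime <= deadline <= period, is simple enough to check before any syscall. The sketch below shows one way such a pre-flight check could look; validate_deadline and its error strings are hypothetical and not ktstr's diagnostic.

```rust
use std::time::Duration;

/// Hypothetical pre-syscall check for SCHED_DEADLINE parameters:
/// the kernel requires runtime <= deadline <= period.
fn validate_deadline(
    runtime: Duration,
    deadline: Duration,
    period: Duration,
) -> Result<(), String> {
    if runtime > deadline {
        return Err(format!(
            "runtime {:?} exceeds deadline {:?}",
            runtime, deadline
        ));
    }
    if deadline > period {
        return Err(format!(
            "deadline {:?} exceeds period {:?}",
            deadline, period
        ));
    }
    Ok(())
}
```

Equal values pass, since both inequalities are non-strict.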
Checking
ktstr judges scheduler behavior through two channels: worker-side telemetry (every worker process reports what happened to it) and host-side monitoring (the monitor reads guest kernel state from outside). Both channels always measure; nothing asserts until the test opts in — a test with no checking attributes passes as long as the VM boots and the scenario completes.
Which API to reach for:
- #[ktstr_test] attributes: cover most tests. not_starved, max_gap_ms, max_spread_pct, min_iteration_rate, and every other threshold below has an attribute (see the macro reference).
- Verdict + claim!: labeled assertions on values you compute inside a custom scenario body.
- AbsoluteThresholds: a one-call multi-field bound check against collected reports, bypassing the config merge.
- assert_scx_events_clean: bounds on SCX event counters (“no fallbacks fired”).
Worker checks
After each scenario, ktstr collects a
WorkerReport from every worker and
runs the opted-in checks against them:
- Starvation (not_starved): any worker with zero work units fails: tid N starved (0 work units).
- Scheduling gaps (max_gap_ms): the longest wall-clock gap observed at work-unit checkpoints. A violation renders as tid N stuck Xms on cpuY at +Zms (threshold Nms).
- Fairness (max_spread_pct): workers in one cgroup should get similar CPU time; the spread (max off-CPU% − min off-CPU%) must stay below the bound.
- Cpuset isolation (isolation): workers may only run on CPUs in their assigned cpuset; any excursion fails.
- Throughput: max_throughput_cv bounds the coefficient of variation of per-worker work rate (some workers quietly slower); min_work_rate sets an absolute floor (all workers equally slow).
- Benchmarking: max_p99_wake_latency_ns and max_wake_latency_cv bound wake-to-run latency for work types that block and measure it (see Work Types for which do); min_iteration_rate floors outer-loop iterations per second per worker.
The loop, end to end
A test sets a threshold, the run violates it, the failure output names the check, the value, and the bound:
#[ktstr_test(
scheduler = MY_SCHED,
llcs = 1, cores = 2, threads = 1,
min_iteration_rate = 50_000_000.0, // deliberately unreachable floor
)]
fn throughput_gate(ctx: &Ctx) -> Result<AssertResult> {
execute_defs(ctx, vec![
CgroupDef::named("cg_a").workers(1).cpuset(CpusetSpec::disjoint(0, 2)),
CgroupDef::named("cg_b").workers(1).cpuset(CpusetSpec::disjoint(1, 2)),
])
}
ktstr_test 'throughput_gate' [sched=scx-ktstr] [topo=1n1l2c1t] failed:
worker 71 iteration rate 41903.3/s below floor 50000000.0/s
worker 73 iteration rate 37834.5/s below floor 50000000.0/s
--- stats ---
2 workers, 4 cpus, 2 migrations, worst_spread=0.0%, worst_gap=21ms
cg0: workers=1 cpus=2 spread=0.0% gap=10ms migrations=1 iter=209600
cg1: workers=1 cpus=2 spread=0.0% gap=21ms migrations=1 iter=189252
...
--- monitor ---
samples=41 max_imbalance=2.00 max_dsq_depth=0 stuck=0
avg: imbalance=1.32 nr_running/cpu=1.2 dsq/cpu=0.0
events: fallback=0 (0.0/s) keep_last=210 (52.5/s) offline=0
...
verdict: monitor OK
Both channels report: the worker check that tripped, and the monitor verdict that did not. The full failure anatomy — timeline, scheduler log, dump sections — is in Reading Failure Output.
Monitor checks
The host-side monitor samples guest per-CPU runqueue state (via BTF offsets, no guest instrumentation) roughly every 100ms and evaluates:
- Imbalance ratio: max(nr_running) / max(1, min(nr_running)) across CPUs.
- Local DSQ depth: per-CPU dispatch queue depth.
- Stall detection: rq_clock not advancing on a CPU with runnable tasks; idle CPUs and preempted vCPUs are exempt.
- Event rates: select_cpu_fallback and dispatch_keep_last counters per second.
Monitor violations always land in the failure report’s --- monitor --- section, but they flip the test result only when the test
enforces them — set the corresponding attributes, call
.with_monitor_defaults() on an Assert, or set
enforce_monitor_thresholds. A monitor that produced no usable
signal (empty samples, uninitialized guest memory) reports
inconclusive, never a silent pass — a CI gate can always tell
“verified OK” from “never measured”.
The defaults with_monitor_defaults() applies:
| Threshold | Default | Rationale |
|---|---|---|
| max_imbalance_ratio | 4.0 | max(nr_running) / max(1, min(nr_running)) across CPUs (denominator clamped so an all-idle sample does not divide by zero). Lower values (2-3) false-positive during cpuset transitions. |
| max_local_dsq_depth | 50 | Per-CPU dispatch queue overflow. Sustained depth above this means the scheduler is not consuming dispatched tasks. |
| fail_on_stall | true | Fail when rq_clock does not advance on a CPU with runnable tasks. Idle CPUs (NOHZ) and preempted vCPUs are exempt. |
| sustained_samples | 5 | At ~100ms sample interval, requires ~500ms of sustained violation. Filters transient spikes from cpuset reconfiguration. |
| max_fallback_rate | 200.0/s | select_cpu_fallback events per second across all CPUs. Sustained rate indicates systematic select_cpu failure. |
| max_keep_last_rate | 100.0/s | dispatch_keep_last events per second across all CPUs. Sustained rate indicates dispatch starvation. |
Every monitor threshold uses the sustained_samples window — a
violation must persist for N consecutive samples before it counts.
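Both the imbalance formula and the sustained-window rule are small enough to model directly. The sketch below is a standalone illustration, not the monitor's code: the function names are hypothetical, and treating a sample as violating only when it is strictly above the threshold is an assumption.

```rust
/// Imbalance ratio as documented: max(nr_running) / max(1, min(nr_running)).
/// The clamp keeps an all-idle sample from dividing by zero.
fn imbalance_ratio(nr_running: &[u32]) -> f64 {
    let max = nr_running.iter().copied().max().unwrap_or(0);
    let min = nr_running.iter().copied().min().unwrap_or(0);
    max as f64 / min.max(1) as f64
}

/// True when at least `sustained_samples` consecutive samples exceed
/// `threshold`. Assumption: "exceed" means strictly greater than.
fn sustained_violation(samples: &[f64], threshold: f64, sustained_samples: usize) -> bool {
    let mut run = 0usize;
    for &sample in samples {
        if sample > threshold {
            run += 1;
            if run >= sustained_samples {
                return true;
            }
        } else {
            // Any compliant sample resets the streak.
            run = 0;
        }
    }
    false
}
```

With the default window of 5, four bad samples followed by one good one do not trip the check, which is how transient spikes during cpuset reconfiguration get filtered out.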
NUMA checks
For workers with a MemPolicy, three thresholds
gate page placement:
- min_page_locality: minimum fraction of pages on the expected NUMA nodes (the cgroup’s cpuset nodes, derived at evaluation time). Zero observed pages counts as zero locality, not a vacuous pass.
- max_cross_node_migration_ratio: bound on migrated pages relative to allocated pages (from /proc/vmstat deltas).
- max_slow_tier_ratio: bound on the fraction of pages landing on memory-only (CXL-tier) nodes.
Default thresholds
not_starved = true also enables the built-in fairness and gap
checks at these defaults:
| Check | Release | Debug |
|---|---|---|
| Scheduling gap | 2000 ms | 3000 ms |
| Fairness spread | 15% | 35% |
Debug builds run with higher scheduling overhead, so thresholds are relaxed.
How configuration merges
Assert is the threshold-config struct; every field is an Option
where None means “inherit”. Three layers merge, last-Some wins:
the baseline (all None), then the scheduler’s assert, then the
per-test attributes — so a scheduler-wide bound applies to every
test and any single test can override or disable it.
enforce_monitor_thresholds is the one sticky field: once any layer
sets it, it stays set. Worked override recipes live in
Customize Checking.
execute_steps_with(ctx, steps, Some(&assert)) bypasses the merged
config with an explicit Assert for that scenario’s worker checks.
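The layering rule (last Some wins, with one sticky field) can be modelled in a few lines. This is a sketch with a hypothetical AssertModel that carries only two thresholds; the real Assert has more fields, and representing the sticky field as a plain bool is an assumption.

```rust
/// Hypothetical cut-down model of the threshold config.
#[derive(Debug, Clone, Copy, Default, PartialEq)]
struct AssertModel {
    max_gap_ms: Option<u64>,
    max_spread_pct: Option<f64>,
    /// Sticky: once any layer sets it, it stays set (assumed to be a bool).
    enforce_monitor_thresholds: bool,
}

impl AssertModel {
    /// Layer `over` on top of `self`: a Some in `over` wins,
    /// a None inherits the value from `self`.
    fn merge(self, over: AssertModel) -> AssertModel {
        AssertModel {
            max_gap_ms: over.max_gap_ms.or(self.max_gap_ms),
            max_spread_pct: over.max_spread_pct.or(self.max_spread_pct),
            enforce_monitor_thresholds: self.enforce_monitor_thresholds
                || over.enforce_monitor_thresholds,
        }
    }
}
```

Merging baseline, then scheduler, then per-test layers in that order reproduces the precedence described above: the per-test value overrides where it is set, and the scheduler-wide value survives where it is not.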
Verdicts and outcomes
Every assertion produces one of four outcomes, and a result’s
terminal verdict is the fold over all of them, most severe first:
Fail > Inconclusive > Pass > Skip.
| Outcome | Meaning |
|---|---|
| Pass | the assertion ran and the value satisfied the bound |
| Fail | the assertion ran and the value violated the bound |
| Inconclusive | the assertion ran but had no signal to evaluate |
| Skip | the scenario couldn’t run (unmet precondition) |
Inconclusive exists for instrument-derived denominators — a ratio
whose denominator (iterations, samples, wall-clock interval)
legitimately reached zero because the workload produced no signal.
Policy-derived denominators stay Fail on zero: under
MemPolicy::Bind the policy says pages will exist, so their absence
is a defect, not “couldn’t measure”.
CI gates read the verdict through four accessors:
if r.is_pass() { /* ship */ }
if r.is_fail() { /* block; surface r.failure_details() */ }
if r.is_skip() || r.is_inconclusive() { /* no verdict — triage */ }
is_pass() is deliberately strict: inconclusive and all-skip both
read false.
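The severity fold can be expressed with a derived ordering. The sketch below is a standalone model, not ktstr's types: Outcome, fold_verdict, and is_pass are hypothetical, and folding an empty list to Skip is an assumption.

```rust
/// Hypothetical outcome type. Variants are declared from least to most
/// severe so the derived Ord matches Fail > Inconclusive > Pass > Skip.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Outcome {
    Skip,
    Pass,
    Inconclusive,
    Fail,
}

/// Terminal verdict: the most severe outcome seen.
/// Assumption: an empty list folds to Skip.
fn fold_verdict(outcomes: &[Outcome]) -> Outcome {
    outcomes.iter().copied().max().unwrap_or(Outcome::Skip)
}

/// Strict pass: only a terminal Pass counts.
fn is_pass(outcomes: &[Outcome]) -> bool {
    fold_verdict(outcomes) == Outcome::Pass
}
```

A single Inconclusive among passes makes the terminal verdict Inconclusive, so a strict is_pass reads false for it, as it does for an all-skip result.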
Beyond attributes
- Verdict + claim!: the claim accumulator for custom scenario bodies. Labels come from the code itself (stringify!-derived), so they cannot drift from the value they describe:
let mut v = Assert::default_checks().verdict();
stats.claim_max_gap_ms(&mut v).at_most(100);
claim!(v, iter_delta).at_least(1000);
let result = v.into_result();
- AbsoluteThresholds: flat per-run bounds (max_p99_wake_latency_ns, max_iteration_cost_p99_ns, max_migrations, min_work_units) checked in one call: assert_thresholds(&reports, &AbsoluteThresholds::strict()). Empty report slices return a skip rather than a vacuous pass.
- assert_scx_events_clean(events, bound): SCX event counters under a cap (None = exactly zero); negative counts always fail.
- Composition: AssertResult::merge accumulates results in a loop; all_of / any_of fold sibling results as AND / OR.
Signatures, comparators, and construction details are in the
ktstr::assert rustdoc.
For phase-scoped checks over a stepped scenario, see
Phases and
Temporal Assertions.
Topology
Schedulers make placement decisions across LLC and NUMA boundaries — where to wake a task, when a migration is worth the cache cost. Each ktstr test declares the topology those decisions should be tested against, and the VM it runs in actually has it: the declared NUMA nodes, cache domains, and SMT siblings are what the guest kernel sees.
The notation
Topologies render as {n}n{l}l{c}c{t}t — NUMA nodes, LLCs, cores per
LLC, threads per core. One quirk to internalize:
Note
The l count is the total LLC count across the VM, not per-node. 2n4l4c2t is 2 NUMA nodes and 4 LLCs total (2 per node), 4 cores per LLC, 2 threads per core = 4 × 4 × 2 = 32 vCPUs.
Containment is strict — threads in a core, cores in an LLC, LLCs in
a NUMA node — and guest CPUs are numbered sequentially through it.
1n2l4c2t (16 vCPUs) lays out as:
node 0
├─ LLC 0 ├─ LLC 1
│ ├─ core 0: cpu 0, 1 │ ├─ core 4: cpu 8, 9
│ ├─ core 1: cpu 2, 3 │ ├─ core 5: cpu 10, 11
│ ├─ core 2: cpu 4, 5 │ ├─ core 6: cpu 12, 13
│ └─ core 3: cpu 6, 7 │ └─ core 7: cpu 14, 15
Most tests use one NUMA node; multi-NUMA topologies matter when the scheduler weighs memory locality. The gauntlet sweeps a test across a whole preset matrix of these shapes.
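The notation and the sequential numbering can be checked with a few lines of arithmetic. The sketch below parses a topology string and derives the vCPU count and one LLC's CPU range; Topo and parse_topo are hypothetical names for illustration, not ktstr's TestTopology.

```rust
/// Hypothetical parsed form of the {n}n{l}l{c}c{t}t notation.
#[derive(Debug, PartialEq)]
struct Topo {
    numa_nodes: usize,
    llcs: usize, // total across the VM, not per node
    cores_per_llc: usize,
    threads_per_core: usize,
}

/// Parse strings such as "2n4l4c2t"; None on any malformed input.
fn parse_topo(s: &str) -> Option<Topo> {
    let mut vals = [0usize; 4];
    let mut rest = s;
    for (i, suffix) in ['n', 'l', 'c', 't'].iter().enumerate() {
        let end = rest.find(*suffix)?;
        vals[i] = rest[..end].parse::<usize>().ok()?;
        rest = &rest[end + 1..];
    }
    if !rest.is_empty() {
        return None;
    }
    Some(Topo {
        numa_nodes: vals[0],
        llcs: vals[1],
        cores_per_llc: vals[2],
        threads_per_core: vals[3],
    })
}

impl Topo {
    /// vCPUs = LLCs x cores per LLC x threads per core.
    fn total_cpus(&self) -> usize {
        self.llcs * self.cores_per_llc * self.threads_per_core
    }

    /// CPU ids of one LLC under sequential guest numbering.
    fn cpus_of_llc(&self, llc: usize) -> std::ops::Range<usize> {
        let per_llc = self.cores_per_llc * self.threads_per_core;
        llc * per_llc..(llc + 1) * per_llc
    }
}
```

For 1n2l4c2t this gives 16 vCPUs with LLC 1 covering CPUs 8 through 15, matching the layout drawn above.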
What a test declares — and what it gets
The #[ktstr_test] attributes numa_nodes, llcs, cores,
threads declare the shape (see the
macro reference for defaults
and inheritance). The run output echoes the topology the guest
booted with — the [topo=...] tag in failure headers and the
timeline header:
ktstr_test 'throughput_gate' [sched=scx-ktstr] [topo=1n1l2c1t] failed:
...
topology: 1n1l2c1t (2 cpus) scheduler: my_sched scenario: throughput_gate duration: 15.0s
To see a host’s physical layout in the same vocabulary, run ktstr topo:
CPUs: 64
LLCs: 4
NUMA nodes: 1
LLC 0 (node 0): [0, 1, 2, 3, 4, 5, 6, 7, 32, 33, 34, 35, 36, 37, 38, 39]
LLC 1 (node 0): [8, 9, 10, 11, 12, 13, 14, 15, 40, 41, 42, 43, 44, 45, 46, 47]
LLC 2 (node 0): [16, 17, 18, 19, 20, 21, 22, 23, 48, 49, 50, 51, 52, 53, 54, 55]
LLC 3 (node 0): [24, 25, 26, 27, 28, 29, 30, 31, 56, 57, 58, 59, 60, 61, 62, 63]
(Host CPU numbering differs from the guest’s sequential scheme — here SMT siblings sit 32 apart — which is exactly why tests declare a topology instead of inheriting the host’s.)
Cpusets from topology
Scenarios don’t hard-code CPU lists; a
CpusetSpec resolves against the test’s
topology at runtime. On 1n2l4c2t, CpusetSpec::Llc(0) resolves to
CPUs 0-7, so the cgroup’s cpuset.cpus is written as 0-7; Llc
and Numa cover their full domain, while the fractional and
partition variants (Range, Disjoint, Overlap) slice the
usable-CPU pool.
Querying topology from a scenario
Ctx.topo is a TestTopology. The queries scenario authors
actually use:
- total_cpus(), num_llcs(), num_numa_nodes() — sizes, e.g. for skip guards (if ctx.topo.num_llcs() < 2 { return Ok(AssertResult::skip(...)) }).
- usable_cpus() / usable_cpuset() — CPUs available for workload placement. On topologies with more than 2 CPUs the last CPU is reserved for the root cgroup (on 8 CPUs: usable = 0-6). Built-in scenarios and fractional CpusetSpecs use this pool automatically.
- llc_aligned_cpuset(idx) / numa_aligned_cpuset(node) — the CPU set of one LLC or one node’s LLCs.
- numa_nodes_for_cpuset(cpus) — which nodes a CPU set touches; this derives the expected-node set for NUMA checks.
- numa_distance(from, to) — kernel conventions: 10 local, higher is farther, 255 unreachable/unknown. VM topologies without explicit distances report 10 local / 20 remote.
- node_meminfo(node) / is_memory_only(node) — per-node memory and CXL-style memory-only node detection.
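Two of the rules in this list are simple enough to state as code. The functions below are a hypothetical sketch of the documented behavior, not the `TestTopology` implementation; treating topologies of 2 CPUs or fewer as fully usable is an inference from the "more than 2 CPUs" wording.

```rust
/// CPUs left for workload placement: on more than 2 CPUs the last one is
/// reserved for the root cgroup. (Smaller topologies keeping every CPU is an
/// assumption drawn from the wording above.)
pub fn usable_cpus(total_cpus: usize) -> std::ops::Range<usize> {
    if total_cpus > 2 {
        0..total_cpus - 1
    } else {
        0..total_cpus
    }
}

/// Default distance for VM topologies without explicit distances:
/// 10 for the local node, 20 for any remote node.
pub fn default_numa_distance(from: usize, to: usize) -> u8 {
    if from == to {
        10
    } else {
        20
    }
}
```

On an 8-CPU topology `usable_cpus(8)` is 0..7, that is CPUs 0-6, which is the pool the fractional cpuset variants slice.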
Ctx::cpuset_cpus(&spec) returns the CPU count a spec resolves to —
useful for sizing worker counts by hand. Its denominator is the
topology-level cpuset, not any cgroup’s currently-effective one; for
cgroup-aware sizing prefer
CgroupDef::workers_pct,
which resolves against the cgroup’s own cpuset at apply time.
The full method catalog (construction, LlcInfo, CPU-list parsing)
is in the
TestTopology rustdoc.
Related
- Gauntlet — preset topology matrices and the constraints that filter them.
- MemPolicy — NUMA memory placement to pair with multi-node topologies.
- Resource Budget — how the host’s topology is carved up when tests run concurrently.
MemPolicy
Testing whether a scheduler keeps tasks near their memory requires a
measurable locality signal — workers whose pages verifiably live on
specific NUMA nodes, so that placement decisions show up as page
counts instead of guesswork. MemPolicy creates that signal: it
wraps set_mempolicy(2) per worker (applied after fork, before the
work loop), and the NUMA checks then gate
on where the pages actually landed. Pair it with multi-NUMA
gauntlet presets to sweep the same
test across node counts.
pub enum MemPolicy {
Default,
Bind(BTreeSet<usize>),
Preferred(usize),
Interleave(BTreeSet<usize>),
Local,
PreferredMany(BTreeSet<usize>),
WeightedInterleave(BTreeSet<usize>),
}
- Default — inherit the parent’s policy; no syscall made.
- Bind(nodes) (MemPolicy::bind([0, 1])) — allocate only from these nodes (MPOL_BIND); allocation fails with ENOMEM when they are exhausted.
- Preferred(node) (::preferred(0)) — prefer one node, fall back silently when it is full (MPOL_PREFERRED).
- Interleave(nodes) (::interleave([0, 1])) — round-robin allocations across the nodes (MPOL_INTERLEAVE).
- Local — nearest node to the allocating CPU (MPOL_LOCAL).
- PreferredMany(nodes) (::preferred_many([0, 1])) — prefer any of the nodes, fall back when all are full (MPOL_PREFERRED_MANY, kernel 5.15+).
- WeightedInterleave(nodes) (::weighted_interleave([0, 1])) — interleave proportional to the per-node weights in /sys/kernel/mm/mempolicy/weighted_interleave/ (MPOL_WEIGHTED_INTERLEAVE, kernel 6.9+).
Node-set constructors accept any IntoIterator<Item = usize>.
MemPolicy::node_set() returns the referenced nodes (empty for
Default / Local).
MpolFlags
Optional mode flags OR’d into the set_mempolicy mode:
| Flag | Meaning |
|---|---|
| NONE | No flags |
| STATIC_NODES | Nodemask is absolute — not remapped when the task’s cpuset changes |
| RELATIVE_NODES | Nodemask is relative to the task’s current cpuset |
| NUMA_BALANCING | Enable NUMA-balancing optimization for this policy |
Flags combine with |. STATIC_NODES | RELATIVE_NODES is rejected
at setup time (the kernel would return EINVAL), as is any unknown
bit. The kernel accepts NUMA_BALANCING only alongside MPOL_BIND
or MPOL_PREFERRED_MANY — ktstr does not pre-validate that pairing,
so other combinations surface as EINVAL from the worker’s
set_mempolicy call.
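The setup-time flag check described above can be sketched as follows. The bit values are the kernel's `MPOL_F_*` constants from the uapi mempolicy header; the function name and error strings are hypothetical, and this is not ktstr's `MpolFlags` implementation.

```rust
// Kernel uapi mode-flag bits (include/uapi/linux/mempolicy.h).
pub const MPOL_F_STATIC_NODES: u32 = 1 << 15;
pub const MPOL_F_RELATIVE_NODES: u32 = 1 << 14;
pub const MPOL_F_NUMA_BALANCING: u32 = 1 << 13;

const KNOWN_FLAGS: u32 = MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES | MPOL_F_NUMA_BALANCING;

/// Hypothetical setup-time check: reject unknown bits and the
/// STATIC_NODES | RELATIVE_NODES pair. The NUMA_BALANCING pairing is
/// deliberately not checked here, mirroring the text above.
pub fn validate_mpol_flags(bits: u32) -> Result<(), String> {
    let unknown = bits & !KNOWN_FLAGS;
    if unknown != 0 {
        return Err(format!("unknown mempolicy flag bits {unknown:#x}"));
    }
    let both = MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES;
    if (bits & both) == both {
        return Err("STATIC_NODES and RELATIVE_NODES are mutually exclusive".to_string());
    }
    Ok(())
}
```

Rejecting the pair up front turns what would be an `EINVAL` inside a worker into a setup error that names the conflicting flags.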
Usage
WorkSpec and CgroupDef both take .mem_policy() and
.mpol_flags():
let def = CgroupDef::named("cg_0")
.cpuset(CpusetSpec::numa(0))
.workers(4)
.mem_policy(MemPolicy::bind([0]));
Cpuset validation
When a cgroup has a cpuset and no remapping flag is set, ktstr
validates at setup time that the policy’s nodes are reachable from
that cpuset — MemPolicy::Bind([1]) on a cgroup confined to node 0
fails before the run starts, not as a mystery ENOMEM mid-run.
The check is flag-aware: STATIC_NODES swaps it for a
node-exists-on-host check (the nodemask is absolute and deliberately
allowed outside the cpuset), and RELATIVE_NODES bypasses it (the
kernel remaps the ordinals internally). Policies without a node set
(Default, Local) skip validation.
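A minimal sketch of the flag-aware check, assuming node sets are plain `BTreeSet<usize>` values. The `Remap` enum, the function name, and the error text are hypothetical; only the decision rules come from the description above.

```rust
use std::collections::BTreeSet;

/// Which remapping flag accompanies the policy (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Remap {
    NoFlag,
    StaticNodes,
    RelativeNodes,
}

/// Setup-time reachability check for a policy's node set.
pub fn check_policy_nodes(
    policy_nodes: &BTreeSet<usize>,
    cpuset_nodes: &BTreeSet<usize>,
    host_nodes: &BTreeSet<usize>,
    remap: Remap,
) -> Result<(), String> {
    // Default / Local carry no node set, so there is nothing to validate.
    if policy_nodes.is_empty() {
        return Ok(());
    }
    let (allowed, what) = match remap {
        // The kernel remaps relative ordinals itself; skip the check.
        Remap::RelativeNodes => return Ok(()),
        // Absolute nodemask: nodes only need to exist on the host.
        Remap::StaticNodes => (host_nodes, "host"),
        // No flag: nodes must be reachable from the cgroup's cpuset.
        Remap::NoFlag => (cpuset_nodes, "cgroup cpuset"),
    };
    match policy_nodes.difference(allowed).next() {
        Some(node) => Err(format!("node {node} is not reachable from the {what}")),
        None => Ok(()),
    }
}
```

Binding to node 1 from a cgroup confined to node 0 fails under `NoFlag` and passes under `StaticNodes` as long as node 1 exists on the host.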
What gets checked
Locality results feed the NUMA checking
thresholds — min_page_locality,
max_cross_node_migration_ratio, max_slow_tier_ratio. The
expected node set is derived from the cgroup’s cpuset at
evaluation time, not from the worker’s MemPolicy; in the common
case where memory is bound to the same nodes the cpuset pins, the
two coincide. A locality violation renders with the observed
fraction, the threshold, and the page counts (format from the
assertion source):
page locality <observed> (<pct>%) below threshold <min> (<pct>%) (<local>/<total> pages local)
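The `min_page_locality` gate reduces to a ratio comparison. The sketch below is hypothetical: the function name, the float precision, and the choice to pass when no pages were sampled are assumptions, and only the overall message shape follows the format shown above.

```rust
/// Compare the observed local-page fraction against a minimum threshold.
pub fn check_page_locality(local: u64, total: u64, min: f64) -> Result<(), String> {
    // Assumption: with no pages sampled there is nothing to judge.
    if total == 0 {
        return Ok(());
    }
    let observed = local as f64 / total as f64;
    if observed < min {
        return Err(format!(
            "page locality {:.2} ({:.0}%) below threshold {:.2} ({:.0}%) ({}/{} pages local)",
            observed,
            observed * 100.0,
            min,
            min * 100.0,
            local,
            total
        ));
    }
    Ok(())
}
```

With a threshold of 0.8, 900 local pages out of 1000 pass and 700 out of 1000 fail with the page counts in the message.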
Example: NUMA-aware locality test
use ktstr::prelude::*;
#[ktstr_test(
numa_nodes = 2, llcs = 4, cores = 4, threads = 1,
min_numa_nodes = 2, max_numa_nodes = 2,
min_page_locality = 0.8,
)]
fn numa_locality(ctx: &Ctx) -> Result<AssertResult> {
execute_defs(ctx, vec![
CgroupDef::named("node0")
.cpuset(CpusetSpec::numa(0))
.workers(4)
.mem_policy(MemPolicy::bind([0])),
CgroupDef::named("node1")
.cpuset(CpusetSpec::numa(1))
.workers(4)
.mem_policy(MemPolicy::bind([1])),
])
}
Each cgroup’s workers are pinned to one NUMA node’s CPUs via
CpusetSpec::numa() and their allocations bound to the same node
via MemPolicy::bind(); the test fails if fewer than 80% of pages
land on the expected node, which is derived from the cpuset and
here matches the bind target.
Node-set policies only mean something on multi-NUMA topologies. The
constraint pair min_numa_nodes = 2, max_numa_nodes = 2 keeps
gauntlet expansion on two-node presets — single-node presets are
filtered out rather than failing. Both bounds are needed: the
default constraints cap at one NUMA node, and an inverted pair
(min above max) is rejected at validation time. See
Gauntlet for the preset matrix.
Performance Mode
Without performance mode, a 50ms scheduling gap in a measurement could be host noise; with it, the same gap indicates a scheduler problem. Performance mode removes host-side variance — vCPU threads pinned to dedicated cores, hugepage-backed guest memory, NUMA-local allocation, real-time scheduling — so timing thresholds measure the scheduler under test, not the host it happens to share.
Usage
#[ktstr_test(
llcs = 2,
cores = 4,
threads = 2,
performance_mode = true,
)]
fn my_perf_test(ctx: &Ctx) -> Result<AssertResult> {
scenarios::steady(ctx)
}
The VM builder API takes the same switch:
KtstrVm::builder().performance_mode(true).
When to use
Performance mode is for tests where host-side scheduling noise affects results — fairness spread measurements, scheduling gap detection, imbalance ratio checks. It is not needed for correctness tests (cpuset isolation, starvation detection) where pass/fail is binary.
The gauntlet runs many VMs in parallel. Performance mode on parallel
VMs can oversubscribe the host if scheduled naively. Avoid
performance_mode unless the host has enough CPUs for the topology
matrix.
With stable measurements, tests can set tight thresholds
(max_gap_ms, min_iteration_rate, max_p99_wake_latency_ns) to
catch regressions against a fixed bar;
cargo ktstr perf-delta builds on the
same tests to catch regressions against a previous commit. Perf-mode
results are comparable only against runs on the same host — guest-side
jitter from shared caches and memory bandwidth remains.
What it does
On x86_64:
- vCPU pinning — each virtual LLC maps to a physical LLC group and vCPU threads are pinned to cores within it, so the host scheduler cannot migrate them across cache domains mid-measurement.
- Hugepages — guest memory is allocated from 2MB hugepages when enough are free, eliminating host-side TLB pressure.
- NUMA mbind — guest memory is bound (MPOL_BIND, strict — no silent fallback to remote nodes) to the NUMA nodes of the pinned vCPUs.
- RT scheduling — vCPU threads run SCHED_FIFO priority 1; the monitor and watchdog run at priority 2 on a dedicated host CPU no vCPU shares, so sampling and timeout enforcement can always preempt a vCPU thread.
- PAUSE and HLT exit suppression — guest spinlock PAUSE loops and idle HLT normally trap to the hypervisor so it can schedule other vCPUs; with dedicated cores that reschedule is pure overhead, so both exits are disabled. (HLT disable is skipped when the host’s SMT-RSB mitigation forbids it; PAUSE alone is still disabled.)
- KVM_HINTS_REALTIME — a CPUID hint telling the guest kernel its vCPUs own dedicated cores; the guest drops paravirt yield paths and polls briefly before halting instead of paying wakeup latency.
On aarch64, the four host-side items apply (pinning, hugepages, NUMA mbind, RT scheduling); the x86-specific exit suppression and CPUID hint do not exist there.
Prerequisites
Sufficient host CPUs — at least
(llcs * cores * threads) + 1 online CPUs; the extra CPU hosts the
monitor and watchdog threads. The host also needs at least as many
physical LLC groups as the test declares virtual LLCs.
2MB hugepages (optional) — check
/sys/kernel/mm/hugepages/hugepages-2048kB/free_hugepages. Without
them guest memory uses regular pages and a warning is printed.
CAP_SYS_NICE or an rtprio limit (optional) — SCHED_FIFO
requires root or RLIMIT_RTPRIO at or above the requested priority.
For non-root use:
# /etc/security/limits.conf
username - rtprio 99
Log out and back in for the limit to take effect. Without it, RT scheduling is skipped with a warning and results may be noisier.
Sizing the host
A single perf-mode test needs (llcs * cores * threads) + 1 online
CPUs and llcs free physical LLC groups — the test holds an
exclusive lock on one host LLC group per virtual LLC for the run’s
duration. To run K perf-mode tests concurrently without contention
skips, the host needs K * llcs free LLC groups; with fewer, the
excess tests skip with ResourceContention and nextest retries them
after a holder releases. The vm-perf test group in
.config/nextest.toml caps how many run at once.
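The sizing arithmetic is easy to get wrong by one, so here it is as a small sketch. The helper names are hypothetical; the formulas are the ones stated above.

```rust
/// Online host CPUs one perf-mode test needs: every vCPU plus one CPU
/// for the monitor and watchdog threads.
pub fn perf_mode_min_cpus(llcs: usize, cores: usize, threads: usize) -> usize {
    llcs * cores * threads + 1
}

/// Free physical LLC groups needed to run `concurrent_tests` perf-mode
/// tests at once without contention skips.
pub fn llc_groups_needed(concurrent_tests: usize, llcs: usize) -> usize {
    concurrent_tests * llcs
}

/// How many perf-mode tests of a given shape fit into the free LLC groups.
pub fn max_concurrent_without_skips(free_llc_groups: usize, llcs: usize) -> usize {
    if llcs == 0 {
        0
    } else {
        free_llc_groups / llcs
    }
}
```

The `llcs = 2, cores = 4, threads = 2` example from the Usage section needs 17 online CPUs and 2 free LLC groups; a host with 4 free groups runs two such tests side by side.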
Failure modes
Performance mode never runs unisolated: if the host cannot honor the guarantee, the VM build fails before boot and the test skips visibly rather than shipping a measurement that does not match what was asked for.
- PerfModeUnavailable — permanent host insufficiency: too few CPUs or LLC groups for the topology, no satisfiable pinning plan, or no free CPU left for service threads. Skips by default with a visible banner (ktstr: SKIP: <reason> on stderr, exit 0, skip recorded in the run sidecar); promoted to a hard fail under KTSTR_NO_SKIP_MODE for runs that demand execution.
- ResourceContention — transient: another run holds a lock on a needed LLC or CPU (the reason names it, e.g. LLC 3 busy). Skips with the same SKIP: banner; a retry after the holder finishes succeeds.
- Warnings (non-fatal) — insufficient free hugepages (regular pages used); high host load (procs_running above half the vCPU count — results may be noisy); unstable TSC (x86_64, common in nested virtualization — timing variance is higher).
The full skip-vs-fail model — which requester gets a skip, which gets a hard error, and what the default path does instead — is in Resource Budget.
Disabling performance mode
--no-perf-mode (or KTSTR_NO_PERF_MODE=1) forces
performance_mode = false and routes the run through the budgeted
coordination path: a shared LLC reservation sized to a CPU budget,
enforced by a cgroup cpuset instead of pinning — none of the
isolation features above apply. The mode comparison, the CPU budget,
and the --cpu-cap flag live in
Resource Budget.
Resource Budget
ktstr boots KVM VMs and builds kernels on hosts that are usually doing other things at the same time — more tests, kernel builds, a developer session. The resource budget is how concurrent ktstr processes share host CPUs without silently corrupting each other’s measurements: every run reserves host LLCs through advisory file locks, and budgeted runs are additionally confined to an exact CPU count by a cgroup v2 cpuset sandbox.
When to use it
- Multi-tenant CI hosts where unbounded parallelism starves concurrent jobs but the full performance-mode contract (RT scheduling, hugepages, NUMA mbind) is too heavy.
- Kernel builds beside perf-mode tests — the build’s shared lock coordinates with the perf-mode exclusive lock, so make never stomps a measurement in progress.
- Concurrent no-perf-mode VMs — a cap of N CPUs bounds how much capacity each run reserves; peers wait instead of racing for CPU.
The three coordination modes
Every VM run takes one of three coordination paths, selected by two
switches: performance_mode on the test or builder, and
--no-perf-mode / KTSTR_NO_PERF_MODE (any non-empty value).
| Mode | Selected by | LLC lockfiles | Per-CPU lockfiles | Enforcement |
|---|---|---|---|---|
| Performance mode | performance_mode = true | exclusive (LOCK_EX), one per virtual LLC | none — the exclusive LLC lock covers its CPUs | vCPU pinning, RT scheduling, hugepages, NUMA mbind |
| Budgeted (no-perf-mode) | --no-perf-mode / KTSTR_NO_PERF_MODE | shared (LOCK_SH) on the planned LLC set | none — the cgroup cpuset is the enforcement layer | cgroup v2 cpuset sandbox + soft affinity mask |
| Default | neither | shared (LOCK_SH) on the 1:1 plan’s LLCs | exclusive (LOCK_EX), one per assigned host CPU | none — reservation only |
Lockfiles live at {KTSTR_LOCK_DIR or /tmp}/ktstr-llc-{N}.lock and
{KTSTR_LOCK_DIR or /tmp}/ktstr-cpu-{C}.lock. The modes compose:
shared holders coexist with each other, an exclusive holder blocks
every shared acquirer and vice versa. So any number of budgeted and
default runs share LLCs among themselves, a perf-mode run waits for
all of them to release, and while a perf-mode run holds its LLCs
nobody else touches those CPUs. Default runs additionally exclude
each other per CPU, so two default VMs never time-slice the same
host CPU. Kernel builds take the budgeted path.
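The composition rule is ordinary reader/writer compatibility, as `flock(2)` provides it. A hypothetical model (the `Lock` enum and `can_acquire` are illustrative names, not ktstr API):

```rust
/// Lock mode on one lockfile (models LOCK_SH / LOCK_EX).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Lock {
    Shared,
    Exclusive,
}

/// Would a non-blocking request succeed given the current holders?
pub fn can_acquire(requested: Lock, held: &[Lock]) -> bool {
    match requested {
        // Shared acquirers coexist with other shared holders only.
        Lock::Shared => held.iter().all(|h| *h == Lock::Shared),
        // An exclusive acquirer needs the lockfile completely free.
        Lock::Exclusive => held.is_empty(),
    }
}
```

Two budgeted runs on the same LLC both succeed, a perf-mode run behind them has to wait, and once it holds the LLC every later shared request is refused.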
When the default path cannot map its topology 1:1 onto the host it
does not fail: if a plan exists but every slot is busy, the run skips
with ResourceContention and nextest retries; if no plan can exist
(host too small), the run proceeds overcommitted — every vCPU
thread masked to the allowed CPUs — and warns when that means
oversubscription (see below).
Too-small hosts: who asked determines the verdict
The outcome of an unsatisfiable request depends on where the request came from — an explicit guarantee must never silently degrade, and an operator typo must not look like a host limitation:
| Request | Error | Outcome |
|---|---|---|
| performance_mode = true, host can’t honor isolation | PerfModeUnavailable | skip (fail under KTSTR_NO_SKIP_MODE) |
| Per-test cpu_budget above the allowed CPUs | TopologyInsufficient | skip (fail under KTSTR_NO_SKIP_MODE) |
| Operator --cpu-cap / KTSTR_CPU_CAP above the allowed CPUs | CpuBudgetUnsatisfiable | hard fail |
| Default mode, no 1:1 placement possible | — | runs overcommitted, warns |
A test attribute is a capability requirement a bigger host would satisfy, so it skips. An operator-typed number that does not exist on this host is a misconfiguration, so it fails. The over-cap error names both numbers:
--cpu-cap N = 96 exceeds the 64 CPUs this process is allowed on (from
sched_getaffinity / Cpus_allowed_list). Pick a value ≤ 64, release the
cgroup/taskset constraint restricting this process, or omit --cpu-cap
to use the auto-sized default (30% of the allowed set for kernel
builds; the vCPU count, floored at 30%, for VMs).
The default-mode overcommit warning fires only when the allowed CPU set is genuinely smaller than the vCPU count (a CI runner or systemd slice can be narrower than the online host):
ktstr: WARNING: only 8 host CPUs available for 16 vCPUs (2.0x
oversubscription) — the process cpuset is smaller than the guest, so
the auto-sized CPU budget collapsed to it. NOTHING opted into this.
The host time-slices the vCPU threads, confounding guest-scheduler
measurement (absolute work scales ~1/2; timing metrics are host
artifacts). Widen the process cpuset, or shrink the guest topology.
The stamped cpu_budget in the run’s sidecar also drops below the
vCPU count, so an A/B comparison against an overcommitted run is
flagged rather than silently confounded.
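The verdict table in this section can be read as a pure function of who made the request. The enums and function names below are hypothetical; the mapping and the oversubscription ratio follow the table and the warning text.

```rust
/// Where an unsatisfiable request came from (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Request {
    PerfMode,
    TestCpuBudget,
    OperatorCpuCap,
    DefaultPlacement,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Skip,
    HardFail,
    RunOvercommitted,
}

/// Map a request on a too-small host to its outcome.
pub fn verdict(request: Request, no_skip_mode: bool) -> Outcome {
    match request {
        // Test-declared requirements skip unless skips are forbidden.
        Request::PerfMode | Request::TestCpuBudget => {
            if no_skip_mode {
                Outcome::HardFail
            } else {
                Outcome::Skip
            }
        }
        // An operator-typed number that cannot be met is a misconfiguration.
        Request::OperatorCpuCap => Outcome::HardFail,
        // The default path proceeds and warns.
        Request::DefaultPlacement => Outcome::RunOvercommitted,
    }
}

/// Oversubscription factor, or None when the allowed set covers the guest.
pub fn oversubscription(vcpus: usize, allowed: usize) -> Option<f64> {
    (allowed > 0 && allowed < vcpus).then(|| vcpus as f64 / allowed as f64)
}
```

For the warning shown above, `oversubscription(16, 8)` is `Some(2.0)`.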
The CPU budget
The budget is resolved in precedence order:
- --cpu-cap N on the command line.
- KTSTR_CPU_CAP=N when the flag is absent (empty string = unset).
- Neither: kernel builds get 30% of the allowed CPUs (rounded up, minimum 1); no-perf-mode VMs get max(30%, min(vcpus, allowed)) so a wide VM’s vCPU threads are not host-oversubscribed by the 30% mask — an oversubscribed guest measures host contention, not its own scheduler. An explicit cap below the vCPU count is the deliberate opt-in to oversubscription for contention testing.
0 is rejected with --cpu-cap must be ≥ 1 CPU (got 0) — zero is a
scripting sentinel, not a silent “no cap”.
The reference set is the calling process’s allowed CPUs
(sched_getaffinity, with a /proc/self/status fallback), not the
host’s online count — so the reservation stays valid under
cgroup-restricted CI runners. An empty allowed set is a hard error:
guessing on a misconfigured host is worse than failing visibly.
A per-test cpu_budget attribute on #[ktstr_test] overrides the
auto-size for that test; an operator --cpu-cap / KTSTR_CPU_CAP
wins over both.
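Putting the precedence and the auto-size formula together gives roughly the following. This is a sketch under stated assumptions: the function, the `Workload` enum, and the `EmptyAllowedSet`, `ZeroCap`, and `UnparsableEnv` variants are hypothetical names; `CpuBudgetUnsatisfiable` and `TopologyInsufficient` are the error names used in this chapter.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Workload {
    KernelBuild,
    Vm { vcpus: usize },
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum BudgetError {
    EmptyAllowedSet,
    ZeroCap,
    UnparsableEnv(String),
    CpuBudgetUnsatisfiable { cap: usize, allowed: usize },
    TopologyInsufficient { budget: usize, allowed: usize },
}

/// 30% of the allowed CPUs, rounded up, minimum 1.
fn thirty_percent(allowed: usize) -> usize {
    ((allowed * 3 + 9) / 10).max(1)
}

/// Resolve the budget: flag, then env var, then per-test attribute, then auto-size.
pub fn resolve_cpu_budget(
    flag: Option<usize>,
    env: Option<&str>,
    per_test: Option<usize>,
    workload: Workload,
    allowed: usize,
) -> Result<usize, BudgetError> {
    if allowed == 0 {
        return Err(BudgetError::EmptyAllowedSet);
    }
    // The flag wins; an empty env string counts as unset.
    let operator = match (flag, env) {
        (Some(n), _) => Some(n),
        (None, Some(s)) if !s.is_empty() => Some(
            s.parse::<usize>()
                .map_err(|_| BudgetError::UnparsableEnv(s.to_string()))?,
        ),
        _ => None,
    };
    if let Some(cap) = operator {
        return match cap {
            0 => Err(BudgetError::ZeroCap),
            n if n > allowed => Err(BudgetError::CpuBudgetUnsatisfiable { cap: n, allowed }),
            n => Ok(n),
        };
    }
    if let Some(budget) = per_test {
        return if budget > allowed {
            Err(BudgetError::TopologyInsufficient { budget, allowed })
        } else {
            Ok(budget)
        };
    }
    Ok(match workload {
        Workload::KernelBuild => thirty_percent(allowed),
        Workload::Vm { vcpus } => thirty_percent(allowed).max(vcpus.min(allowed)),
    })
}
```

On a 64-CPU allowed set a kernel build auto-sizes to 20 CPUs, a 32-vCPU VM to 32, and a 16-vCPU VM confined to 8 allowed CPUs collapses to 8, which is the case the overcommit warning describes.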
Flag availability
- --no-perf-mode: cargo ktstr test/coverage/llvm-cov/shell, and ktstr shell. KTSTR_NO_PERF_MODE (any non-empty value) works everywhere.
- --cpu-cap N: ktstr shell, ktstr kernel build, cargo ktstr shell, cargo ktstr kernel build — and it requires --no-perf-mode (perf mode already holds whole LLCs exclusively, so a cap would double-reserve). For cargo ktstr test/coverage/llvm-cov set KTSTR_CPU_CAP=N instead.
How a reservation is planned
Budgeted acquisition runs three phases:
- Discover — stat every LLC lockfile and read /proc/locks once to snapshot current holders. No locks taken.
- Plan — rank LLCs: prefer LLCs that already have holders (consolidation packs shared runs together), seed on the best-ranked LLC’s NUMA node, and greedily fill that node before spilling to nearest-by-distance neighbors. Accumulate LLCs until their allowed CPUs cover the budget.
- Acquire — non-blocking shared locks on every selected LLC, all-or-nothing. If any lock is busy, every held lock is dropped and the whole cycle retries a few times with short ascending backoff; after the final attempt it bails with a ResourceContention error naming the winning holders.
The lock granularity is per-LLC, but the reserved CPU list holds exactly the budget — the last selected LLC typically contributes only a prefix of its CPUs. When the plan spans more than one NUMA node, stderr warns:
ktstr: reserving LLCs [0 (node 0), 2 (node 1)] across 2 NUMA nodes
(preferred single-node contiguous unavailable). Build will run;
memory-access latency may be higher.
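The Plan phase can be approximated with a single sort. This is a simplified, hypothetical sketch, not ktstr's planner: the `Llc` struct and `plan` function are made up for illustration, and the tie-breaking (holder count, then LLC id) is an assumption beyond what the text specifies.

```rust
/// One host LLC as seen by the planner (hypothetical struct).
#[derive(Debug, Clone)]
pub struct Llc {
    pub id: usize,
    pub node: usize,
    pub cpus: Vec<usize>, // allowed CPUs in this LLC
    pub holders: usize,   // current lock holders from the Discover snapshot
}

/// Pick LLCs until their CPUs cover `budget`. Returns the selected LLC ids
/// and exactly `budget` CPUs, or None if the host cannot cover the budget.
pub fn plan(
    llcs: &[Llc],
    budget: usize,
    dist: impl Fn(usize, usize) -> u32,
) -> Option<(Vec<usize>, Vec<usize>)> {
    // Seed on the LLC with the most holders (lowest id on ties).
    let seed = llcs
        .iter()
        .max_by_key(|l| (l.holders, std::cmp::Reverse(l.id)))?;
    // Fill the seed's node first, then spill to nearer nodes before farther
    // ones; within a node prefer LLCs that already have holders.
    let mut order: Vec<&Llc> = llcs.iter().collect();
    order.sort_by_key(|l| (dist(seed.node, l.node), std::cmp::Reverse(l.holders), l.id));

    let mut picked = Vec::new();
    let mut cpus = Vec::new();
    for l in order {
        if cpus.len() >= budget {
            break;
        }
        picked.push(l.id);
        // The last LLC typically contributes only a prefix of its CPUs.
        let need = budget - cpus.len();
        cpus.extend(l.cpus.iter().copied().take(need));
    }
    (cpus.len() == budget).then_some((picked, cpus))
}
```

The locks are still taken per LLC, so a plan that uses two CPUs of an LLC reserves the whole LLC's lockfile while the cpuset holds only those two CPUs.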
Cgroup v2 cpuset sandbox
Budgeted runs write the reserved CPUs and their NUMA nodes into a
child cgroup — cpuset.cpus, then cpuset.mems, then the pid into
cgroup.procs, in that order because the kernel may kill a task
migrated into a cgroup whose cpuset.mems is still empty. After each
write the effective value is read back: narrowing by a parent cgroup
(a systemd slice, a container limit) is a fatal error under an
explicit --cpu-cap and a warning otherwise. Kernel builds inside
the sandbox also get their make -j width set to the reserved CPU
count — without that, make -j$(nproc) fans gcc children out to a
width the cpuset then has to time-slice, silently defeating the
budget in scheduling terms.
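The write ordering can be sketched with plain file writes. The function names are hypothetical, the sketch omits the read-back of effective values, and it treats the cgroup directory as an ordinary path so it can be exercised against a scratch directory.

```rust
use std::path::Path;

/// Render a CPU or node set in kernel list format, e.g. "0-3,5".
pub fn cpu_list(ids: &[usize]) -> String {
    let mut sorted = ids.to_vec();
    sorted.sort_unstable();
    sorted.dedup();
    let mut parts = Vec::new();
    let mut i = 0;
    while i < sorted.len() {
        let start = sorted[i];
        let mut end = start;
        // Extend the run while the next id is consecutive.
        while i + 1 < sorted.len() && sorted[i + 1] == end + 1 {
            i += 1;
            end = sorted[i];
        }
        parts.push(if start == end {
            start.to_string()
        } else {
            format!("{start}-{end}")
        });
        i += 1;
    }
    parts.join(",")
}

/// Populate a child cgroup in the safe order: cpus, then mems, then the pid.
pub fn enter_sandbox(cgroup: &Path, cpus: &[usize], mems: &[usize], pid: u32) -> std::io::Result<()> {
    std::fs::write(cgroup.join("cpuset.cpus"), cpu_list(cpus))?;
    // cpuset.mems must be non-empty before any task is migrated in.
    std::fs::write(cgroup.join("cpuset.mems"), cpu_list(mems))?;
    std::fs::write(cgroup.join("cgroup.procs"), pid.to_string())
}
```

A real implementation would follow each write with a read of the corresponding effective file and compare it against what was requested, as described above.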
Observing locks
ktstr locks (or cargo ktstr locks) prints every ktstr lock
currently held on the host — LLC, per-CPU, kernel-cache, and run-dir
locks — with each holder’s PID and command line. It is read-only and
takes no locks itself. Use it when an acquire fails with
ResourceContention: the error names the busy LLCs, the snapshot
shows every contending peer at once. The full output and flags are in
ktstr (standalone).
KTSTR_BYPASS_LLC_LOCKS — escape hatch
Setting KTSTR_BYPASS_LLC_LOCKS=1 skips lock acquisition entirely:
the VM boots or the build starts immediately, with no coordination
against concurrent runs. Use it only when measurement noise is
acceptable — an isolated workstation, or a CI queue that already
serializes jobs at a higher layer. It is mutually exclusive with
--cpu-cap / KTSTR_CPU_CAP at every entry point; the rejection
message always contains "resource contract" so it is greppable.
Filesystem requirement
Every lockfile path must live on a local filesystem — tmpfs, ext4,
xfs, btrfs, f2fs, and bcachefs are the accepted set. NFS, CIFS/SMB,
CephFS, AFS, and FUSE mounts are rejected at open time: flock(2)
coordination or /proc/locks holder enumeration is unreliable on
these configurations, and ktstr refuses to run on a lock it cannot
trust. The error names the offending filesystem and the fix: move the
lockfile path (KTSTR_LOCK_DIR, the cache root, or the runs root) to
a local filesystem. Unknown-but-local filesystems (zfs, erofs, …)
pass through.
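The three-way outcome can be modeled as a classifier over filesystem type names. This is a hypothetical sketch: matching on mount-table type strings (including the `nfs4`, `smb3`, and `fuse.*` spellings) is an assumption for illustration, and ktstr's actual detection mechanism is not described here.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum FsVerdict {
    Accepted,
    Rejected,
    UnknownLocal, // passes through
}

/// Classify a lockfile's filesystem by its type name (illustrative only).
pub fn classify_lock_fs(fstype: &str) -> FsVerdict {
    const ACCEPTED: [&str; 6] = ["tmpfs", "ext4", "xfs", "btrfs", "f2fs", "bcachefs"];
    const REJECTED: [&str; 6] = ["nfs", "nfs4", "cifs", "smb3", "ceph", "afs"];
    if ACCEPTED.iter().any(|fs| *fs == fstype) {
        FsVerdict::Accepted
    } else if REJECTED.iter().any(|fs| *fs == fstype) || fstype.starts_with("fuse") {
        FsVerdict::Rejected
    } else {
        FsVerdict::UnknownLocal
    }
}
```

Under this model `ext4` is accepted, `nfs4` and `fuse.sshfs` are rejected, and `zfs` passes through as unknown-but-local.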
Related
- Performance Mode — the full-isolation mode.
- Environment Variables — KTSTR_CPU_CAP, KTSTR_LOCK_DIR, KTSTR_BYPASS_LLC_LOCKS.
Recipes
Task-oriented walkthroughs. Each recipe is self-contained: pick the one that matches your problem and follow it top to bottom. For the model behind the commands, read Core Concepts; for flag-by-flag detail, the Running Tests chapters.
Note
Two binaries appear below. cargo ktstr <subcommand> is the host-side cargo wrapper for test workflows; bare ktstr is the guest-init binary that doubles as a host CLI for a few tools (ctprof, topo, locks). Both install with cargo install ktstr. See cargo ktstr and ktstr (standalone).
Which recipe do I want?
| Symptom | Recipe |
|---|---|
| I have a scheduler binary and no tests | Test a New Scheduler |
| A test failed and the scheduler died | Investigate a Crash |
| Default checks don’t fit my scheduler — or nothing is checked at all | Customize Checking |
| I want gates that catch performance regressions — and proof they fire | Benchmark Gates and Negative Tests |
| Is my scheduler at least as good as the kernel default? | Compare a Scheduler vs EEVDF |
Three recipes compare two runs. They answer different questions:
| Two runs differ because… | Recipe |
|---|---|
| …the scheduler source changed (branch vs baseline commit) | A/B Compare Branches |
| …a workload got slower even though tests still pass | Diagnose a Slow Scheduler with ctprof |
| …the host changed (machine, reboot, sysctl drift) | Capture and Compare Host State |
All recipes
In rough lifecycle order:
- Test a New Scheduler — define the scheduler, write tests, sweep the BPF verifier, host the tests in your own crate
- Investigate a Crash — read the crash report, use auto-repro, pin the bug as a regression test
- A/B Compare Branches — cargo ktstr perf-delta between HEAD and a baseline commit
- Capture and Compare Host State — cargo ktstr show-host snapshots and the perf-delta host-delta section
- Diagnose a Slow Scheduler with ctprof — per-thread off-CPU diff between two ktstr ctprof snapshots
- Customize Checking — scheduler-level thresholds, per-test overrides, merge order
- Benchmark Gates and Negative Tests — performance gates plus the negative tests that prove they fire
- Compare a Scheduler vs EEVDF — detach the scheduler mid-run and compare phases within one test
Test a New Scheduler
End-to-end workflow: define a scheduler, write tests, run them, sweep the BPF verifier. At the end you have a scheduler that boots under real kernels on declared topologies, a test suite that fails when its behavior regresses, and (optionally) all of it hosted in your own crate.
1. Define the scheduler
declare_scheduler! generates a pub static MY_SCHED: Scheduler
and registers it so cargo ktstr verifier discovers it
automatically. Tests reference the bare MY_SCHED ident via
#[ktstr_test(scheduler = MY_SCHED)].
use ktstr::declare_scheduler;
use ktstr::prelude::*;
declare_scheduler!(MY_SCHED, {
name = "my_sched",
binary = "scx_my_sched",
topology = (1, 2, 4, 1),
kernels = ["6.14", "6.15..=7.0"],
sched_args = ["--exit-dump-len", "1048576"],
});
topology = (1, 2, 4, 1) is 1 NUMA node, 2 LLCs, 4 cores per LLC,
1 thread per core — written 1n2l4c1t in test names and output. See
Topology for the notation and
Scheduler Definitions
for every supported field.
2. Write integration tests
Tests inherit the scheduler’s topology. Override with explicit
llcs, cores, or threads when needed.
use ktstr::prelude::*;
#[ktstr_test(scheduler = MY_SCHED)]
fn basic_steady(ctx: &Ctx) -> Result<AssertResult> {
// Inherits 1n2l4c1t from MY_SCHED
scenarios::steady(ctx)
}
#[ktstr_test(scheduler = MY_SCHED, threads = 2)]
fn smt_steady(ctx: &Ctx) -> Result<AssertResult> {
// Inherits llcs=2, cores=4; overrides threads to exercise SMT
scenarios::steady(ctx)
}
While iterating on a single test, mark the others with
#[ktstr_test(scheduler = MY_SCHED, ignore = true)]: nextest skips
them by default, but they stay registered so the verifier sweep
still sees them. Clear the attribute when the test is ready to
land — leaving it on permanently silently drops coverage.
3. Build a kernel
Build a kernel with sched_ext support:
cargo ktstr kernel build
See Getting Started for version selection and local source builds.
4. Run
cargo ktstr test resolves the kernel from KTSTR_KERNEL, the
cache, or an explicit --kernel <spec> — a version like 7.0, a
cache key from cargo ktstr kernel list, or a path to a kernel
source tree. (It does not accept a prebuilt bzImage/Image;
only cargo ktstr shell does.)
cargo ktstr test # auto-discover from cache / KTSTR_KERNEL
cargo ktstr test --kernel 7.0 # pin to a version (latest 7.0.x)
cargo ktstr test --kernel ../linux # pin to a local source checkout
A run looks like this — each PASS line is a fresh VM that booted,
ran the scenario, and shut down:
cargo ktstr: fetching latest 7.0.x kernel version
cargo ktstr: latest 7.0.x kernel: 7.0.14
cargo ktstr: resolved kernel "7.0"
...
Nextest run ID 24c18577-cd34-43bd-9d14-b0197701c187 with nextest profile: default
Starting 1 test across 121 binaries (12531 tests skipped)
PASS [ 34.451s] (1/1) ktstr::failure_dump_e2e ktstr/failure_dump_renders_bss_fields
────────────
Summary [ 34.490s] 1 test run: 1 passed, 12531 skipped
...
The run footer names the output directory
(target/ktstr/{kernel}-{project_commit}) where per-test stats
sidecars land — see
Runs and Regression Gates.
5. Sweep the BPF verifier
The verifier sweep loads your scheduler’s BPF programs under the
real kernel verifier, on every accepted topology preset, and
reports per-program verified-instruction counts. Instruction counts
vary with topology when nr_cpus bakes into .rodata — a
scheduler can attach on one topology and wedge on another, which is
exactly what the sweep catches.
cargo ktstr verifier # kernel from KTSTR_KERNEL / cache
cargo ktstr verifier --kernel ../linux # pin one kernel
cargo ktstr verifier --kernel 6.14 --kernel 7.0 # sweep several
Each scheduler’s kernels = [...] declaration filters the
operator-supplied set; an empty or omitted kernels field runs
against every kernel in the sweep.
cargo ktstr: resolved kernel "7.0"
...
Starting 4 tests across 1 binary (55 tests skipped)
PASS [ 12.406s] (1/4) ktstr::kaslr_axis_e2e verifier/ktstr_sched/kernel_7_0/odd-3llc
PASS [ 12.432s] (2/4) ktstr::kaslr_axis_e2e verifier/ktstr_sched/kernel_7_0/smt-2llc
PASS [ 12.656s] (3/4) ktstr::kaslr_axis_e2e verifier/ktstr_sched/kernel_7_0/tiny-1llc
PASS [ 12.929s] (4/4) ktstr::kaslr_axis_e2e verifier/ktstr_sched/kernel_7_0/tiny-2llc
────────────
Summary [ 12.929s] 4 tests run: 4 passed, 55 skipped
verifier verified_insns (per scheduler; rows: kernel, cols: BPF program, cell: range across topologies):
ktstr_sched:
kernel ktstr_dispatch ktstr_dump ktstr_dump_cpu ktstr_dump_task ktstr_enqueue ktstr_exit ktstr_exit_task ktstr_init ktstr_init_task ktstr_select_cp ktstr_yield
kernel_7_0 102 81 13 70 74 25 419 2296 29077 39 8
verifier summary: 4 ✅ 0 ❌ 0 🇽
topology ktstr_sched
odd-3llc ✅
smt-2llc ✅
tiny-1llc ✅
tiny-2llc ✅
One glance shows where the complexity lives (ktstr_init_task at
~29k verified instructions dwarfs every other program) and that all
four topologies attach cleanly. See
BPF Verifier Sweep for the output
format, cycle collapse on rejections, and the kernel-matching
contract. Kernel cache hygiene (kernel list / kernel clean)
lives in the cargo ktstr
reference.
6. Debug failures
Boot an interactive shell with the scheduler binary packed into the
guest. -i (--include-files) adds host-side files to the guest’s
/include-files/ directory:
cargo ktstr shell -i ./target/debug/scx_my_sched
Inside the guest, run /include-files/scx_my_sched manually to
inspect behavior. Use --exec CMD to run a single command
non-interactively instead. See
ktstr (standalone) and the
cargo ktstr reference for all
flags, and Investigate a Crash for the
crash-report workflow.
7. Write a crash test
Schedulers ship their own failure-handling paths; a negative test
pins them. The pattern: define a BpfMapWrite constant naming a
.bss global in your scheduler, have ktstr write a trigger value
into it after the scheduler loads, and have the scheduler’s error
path read the global and call scx_bpf_error(...) with a known
message. The test passes only when that exact error fires — a
wrong or missing error fails, so silent regressions in the error
path become visible.
use ktstr::prelude::*;
// ".bss" names the libbpf-named .bss map; "crash" is the global the
// host writes; 1 is the trigger value. The field offset is resolved
// from the map's program BTF at write time.
static BPF_CRASH: BpfMapWrite = BpfMapWrite::new(".bss", "crash", 1);
#[ktstr_test(
scheduler = MY_SCHED,
bpf_map_write = BPF_CRASH,
expect_err = true,
expect_scx_bpf_error_contains = "my_sched: host-triggered crash",
)]
fn crash_path_emits_expected_error(ctx: &Ctx) -> Result<AssertResult> {
ktstr::scenario::basic::custom_sched_mixed(ctx)
}
The substring contract is yours to define — the framework only
enforces that what you declare matches what the scheduler emits.
Use expect_scx_bpf_error_matches = r"…" for regex matching; the
full matcher semantics live in the
#[ktstr_test] reference.
8. Host a ktstr test in an external scheduler crate
A scheduler that lives in its own crate (outside the ktstr workspace) can host ktstr tests directly. Gate ktstr behind a feature so it never enters a normal build — ktstr pulls in Linux-only, heavyweight dependencies (KVM, libbpf, a kernel loader, guest memory) that would break non-Linux development and bloat ordinary CI.
In the scheduler crate’s Cargo.toml:
[dependencies]
# Pin the exact installed cargo-ktstr version — ktstr is pre-1.0 and
# a minor bump can break the test-facing API (see the README's
# version-compatibility note). Must be an optional [dependencies]
# entry, not a dev-dependency: `dep:ktstr` in [features] only
# resolves an optional normal dep (Cargo has no optional
# dev-dependencies).
ktstr = { version = "=X.Y.Z", optional = true }
[features]
ktstr-tests = ["dep:ktstr"]
[dev-dependencies]
# Only if a test body uses raw libc (e.g. fork / _exit); the
# prelude does not re-export libc.
libc = "0.2"
The test file gates its whole contents on the feature, so the crate compiles to nothing extra when the feature is off:
#![cfg(feature = "ktstr-tests")]
use ktstr::prelude::*;
// `MY_SCHED` is your `declare_scheduler!(MY_SCHED, { ... })` constant
// (section 1). Drop the `scheduler =` attribute to run under the
// kernel's default scheduler instead.
#[ktstr_test(scheduler = MY_SCHED)]
fn my_sched_runs(ctx: &Ctx) -> Result<AssertResult> {
ktstr::scenario::basic::custom_sched_mixed(ctx)
}
Build a kernel once (section 3), then run the gated tests. The
feature flag rides the nextest passthrough after --:
cargo ktstr kernel build --kernel /path/to/linux
cargo ktstr test --kernel /path/to/linux -- --features ktstr-tests
cargo ktstr test forwards everything after -- to
cargo nextest run, which routes --features ktstr-tests to the
test compile.
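The passthrough split described above can be pictured with a small sketch. This is an illustration only, not ktstr's argument parser: split_passthrough is a hypothetical helper, and the assumption is simply that the first bare -- separates the two argument groups.

```rust
/// Split an argument list at the first bare `--` separator.
/// Returns (arguments before `--`, arguments after `--`).
/// Hypothetical helper for illustration; not part of ktstr.
fn split_passthrough(args: &[String]) -> (&[String], &[String]) {
    match args.iter().position(|a| a == "--") {
        Some(i) => (&args[..i], &args[i + 1..]),
        // No separator: nothing is forwarded.
        None => (args, &args[args.len()..]),
    }
}

fn main() {
    let argv: Vec<String> = [
        "test", "--kernel", "/path/to/linux", "--", "--features", "ktstr-tests",
    ]
    .iter()
    .map(|s| s.to_string())
    .collect();

    let (own, forwarded) = split_passthrough(&argv);
    println!("handled before the separator: {own:?}");
    println!("forwarded after the separator: {forwarded:?}");
}
```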
Investigate a Crash
When the scheduler under test dies mid-run, the test fails with a structured crash report: the scx exit reason, a kernel-side backtrace, per-CPU scheduler state, and — by default — the results of a second VM that ktstr boots automatically to replay the crash with probes attached. This recipe walks a real crash from report to regression test.
First step: rerun with full diagnostics
RUST_BACKTRACE=1 cargo ktstr test --kernel 7.0 -- -E 'test(my_test)'
RUST_BACKTRACE=1 (or full) does two things: it appends the
--- diagnostics --- section (init stage, VM exit code, kernel
console tail) to every failure — not only scheduler deaths — and it
boots the guest with a verbose console (the same switch
KTSTR_VERBOSE=1 flips).
The crash report
A failure prints as a header line plus sections, each present only
when relevant: --- stats --- (per-cgroup worker results),
--- diagnostics ---, --- timeline --- (phases with monitor
samples), --- scheduler log --- (scheduler stdout+stderr,
including the kernel’s DEBUG DUMP when the scheduler died),
--- monitor --- (host-side observations and verdict),
--- sched_ext dump --- (the same dump as traced by the guest
kernel), and --- auto-repro ---. See
Reading Failure Output for the full
anatomy; this recipe focuses on the crash workflow.
A real crash, top to bottom
The crash below is real: ktstr’s fixture scheduler was told to call
scx_bpf_error() via a host-written .bss trigger (the same
mechanism step 7 of Test a New Scheduler
uses). Trimmed to the load-bearing sections:
BUG SUMMARY: scx_bpf_error (src/bpf/main.bpf.c:424: ktstr: host-triggered crash)
ktstr_test 'bpf_crash_auto_repro_e2e' [sched=scx-ktstr] [topo=1n1l4c1t] failed:
  scheduler process died unexpectedly during workload (2.2s into test)
...
--- scheduler log ---
...
DEBUG DUMP
================================================================================
swapper/3[0] triggered exit kind 1025:
  scx_bpf_error (src/bpf/main.bpf.c:424: ktstr: host-triggered crash)
Backtrace:
  scx_exit+0x50/0x70
  scx_bpf_error_bstr+0x78/0x90
  bpf_prog_1fed99378f3a8055_ktstr_dispatch+0x4d/0x1cb
  bpf__sched_ext_ops_dispatch+0x4b/0xa7
  do_pick_task_scx+0x379/0x770
  __schedule+0x5ca/0xfc0
  schedule+0x44/0x1b0
  worker_thread+0xa2/0x2d0
  kthread+0xf3/0x130
  ret_from_fork+0x19b/0x260
  ret_from_fork_asm+0x1a/0x30
ktstr scheduler state: stall=0 crash=1 degrade_rt=0
  rodata: degrade=0 slow=0 scattershot=0 verify_loop=0 fail_verify=0
  ktstr_alloc_count=87 degrade_cnt=0 slow_cnt=0
...
Error: EXIT: scx_bpf_error (src/bpf/main.bpf.c:424: ktstr: host-triggered crash)
Three lines identify the cause:
- The BUG SUMMARY / exit line carries the message and C source line your scheduler passed to scx_bpf_error().
- The backtrace names the BPF program that raised it (…ktstr_dispatch) and shows it fired from the kernel’s pick path (do_pick_task_scx inside __schedule).
- The ktstr scheduler state block is the scheduler’s own dump callback output: whatever your scheduler prints in its .dump() op appears here. In this case crash=1 confirms the host-written trigger was read.
Auto-repro
auto_repro defaults to true in #[ktstr_test]. When the
scheduler crashes, ktstr automatically:
- Captures the crash stack trace from the scenario output.
- Boots a second VM with kprobes (kernel functions) and fentry probes (BPF callbacks) on each function in the crash chain, plus a tp_btf/sched_ext_exit tracepoint trigger.
- Reruns the scenario to capture function arguments at each crash point.
The --- auto-repro --- section starts with the probe pipeline’s
own accounting, so you can always see how much of the chain
attached. From the run above:
--- auto-repro ---
--- probe pipeline ---
extracted: 10 functions from crash backtrace
traceable: 7 passed, 3 dropped: bpf_prog_1fed99378f3a8055_ktstr_dispatch, bpf__sched_ext_ops_dispatch, ret_from_fork_asm
bpf_discover: 0 programs found
after_expand: 7 total probe targets
kprobes: 0 attached
trigger: attach failed (skeleton load (retry): No such process (os error 3); original error before retry: No such process (os error 3))
probe_data: 0 keys, 0 unmatched IPs
events: 0 captured, 0 after stitch
repro VM duration: 16.9s
Read the trigger: line honestly: the probe trigger needs the
tp_btf/sched_ext_exit tracepoint, which not every kernel carries —
on this 7.0.14 guest the attach failed, so no per-function argument
events were captured. The section still delivers value: after the
pipeline accounting it appends the repro VM’s diagnostics (in this
run, its sched_ext dump and dmesg tail — a scheduler log and the
freeze-coordinator failure dump appear when the repro run produces
them), and here those reproduced the same crash. On kernels with
the tracepoint, the pipeline instead lists each probed function
with the argument values captured on the way to the crash. Cost:
one extra VM boot plus a scenario replay (16.9 s here). Auto-repro
is skipped when expect_err = true — an expected failure is not
worth a repro VM — and can be turned off with auto_repro = false.
See Auto-Repro for how the two-VM
cycle works and its kernel requirements.
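To make the first stage of the pipeline concrete, here is a rough sketch of pulling symbol names out of backtrace frames like the ones in the crash report above. It is not ktstr's extractor: frame_symbol and is_bpf_program are hypothetical helpers, and ktstr applies its own filtering on top (its report lists 10 extracted functions, while this sketch simply returns every frame it can parse). The only assumptions are the kernel's symbol+0xOFFSET/0xSIZE frame format and the bpf_prog_ prefix that JIT-compiled BPF programs carry.

```rust
/// Return the symbol name of one backtrace frame, e.g.
/// `scx_exit+0x50/0x70` -> `scx_exit`. Lines without a `+0x` suffix
/// (headers, blank lines) yield `None`.
/// Hypothetical helper for illustration; not part of ktstr.
fn frame_symbol(line: &str) -> Option<&str> {
    let (name, rest) = line.trim().split_once('+')?;
    if name.is_empty() || !rest.starts_with("0x") {
        return None;
    }
    Some(name)
}

/// JIT-compiled BPF programs show up as `bpf_prog_<tag>_<name>`.
fn is_bpf_program(symbol: &str) -> bool {
    symbol.starts_with("bpf_prog_")
}

const BACKTRACE: &str = "Backtrace:
  scx_exit+0x50/0x70
  scx_bpf_error_bstr+0x78/0x90
  bpf_prog_1fed99378f3a8055_ktstr_dispatch+0x4d/0x1cb
  bpf__sched_ext_ops_dispatch+0x4b/0xa7
  do_pick_task_scx+0x379/0x770
  __schedule+0x5ca/0xfc0
  schedule+0x44/0x1b0
  worker_thread+0xa2/0x2d0
  kthread+0xf3/0x130
  ret_from_fork+0x19b/0x260
  ret_from_fork_asm+0x1a/0x30";

fn main() {
    for symbol in BACKTRACE.lines().filter_map(frame_symbol) {
        let kind = if is_bpf_program(symbol) { "bpf program" } else { "kernel symbol" };
        println!("{symbol}: {kind}");
    }
}
```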
Pin the bug as a regression test
Once the crash is understood, pin its signature so the same bug
fails the next CI run instead of silently regressing. Two
#[ktstr_test] attributes attach matchers to the captured
scx_bpf_error text (the combined scheduler log and
--- sched_ext dump --- corpus):
- expect_scx_bpf_error_contains = "literal": substring match. Use for the common case of pinning an exact error fragment without escaping regex metacharacters.
- expect_scx_bpf_error_matches = "regex": full regex match via the regex crate. Use for anchored patterns, character classes, and wildcards.
Both require expect_err = true, and both compose with AND
semantics — set both to require both to match. The test body must
actually set up the trigger; here the host-side BpfMapWrite from
the crash above does it:
use ktstr::prelude::*;
// Host writes 1 into the scheduler's `crash` .bss global after
// load; the scheduler's dispatch path reads it and calls
// scx_bpf_error(...) — the crash this test pins.
static BPF_CRASH: BpfMapWrite = BpfMapWrite::new(".bss", "crash", 1);
#[ktstr_test(
scheduler = MY_SCHED,
bpf_map_write = BPF_CRASH,
expect_err = true,
expect_scx_bpf_error_contains = "host-triggered crash",
expect_scx_bpf_error_matches = r"src/bpf/main\.bpf\.c:\d+",
)]
fn crash_regression(ctx: &Ctx) -> Result<AssertResult> {
ktstr::scenario::basic::custom_crash_light(ctx)
}
The test fails if either matcher misses. A passing run means the scheduler still hits the pinned bug; a failure means the error text drifted (update the matcher) or the bug was fixed (delete the regression test).
Regex anchors use string-boundary semantics against the whole
captured corpus: ^/$ match its start and end, and . does not
cross \n. Opt in to line-level anchoring with (?m) and to
dotall with (?s). Pattern-validity edge cases (empty patterns,
trivially-matching patterns, builder vs attribute validation) are
covered in the
#[ktstr_test] reference.
A/B Compare Branches
One command answers “did my branch make the scheduler slower?”:
cargo ktstr perf-delta runs
the same performance_mode scenarios against your branch and its
baseline commit, diffs every metric, and exits non-zero when enough
metrics regress to trip the failure gate. (For host-context diffs
or per-thread profiling instead, see the
compare picker.)
Automated: perf-delta --noise-adjust
cd ~/src/my-sched # the scheduler crate under test
cargo ktstr perf-delta --noise-adjust 5 --kernel ../linux # HEAD vs merge-base(HEAD, main)
cargo ktstr perf-delta --noise-adjust 5 --kernel ../linux --base-ref release # vs merge-base(HEAD, release)
cargo ktstr perf-delta --noise-adjust 5 --kernel ../linux -E cgroup_steady # narrow the perf set
perf-delta resolves the baseline as merge-base(HEAD, <ref>) (or
a $GITHUB_BASE_REF PR target), then --noise-adjust N checks
both commits out into their own plain checkouts, runs each
side’s performance_mode tests N times, and compares from the
observed spread — no manual worktree bookkeeping.
A single run per side cannot tell a real regression from
run-to-run noise, so --noise-adjust gates a confident regression
on two conditions: the sides must be separated (a Welch
two-sample t-test, or fully disjoint [min, max] bands) and the
delta must be material (each metric’s registry significance
gate). N must be at least 2 — variance needs two samples — and 5
or more is recommended for a well-powered test. Budget wall time
accordingly: the command produces 2×N full runs of your
performance_mode set, so at N=5 a one-minute suite costs about
ten minutes.
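The two-condition gate can be sketched in a few lines. This is a toy model under stated assumptions, not perf-delta's implementation: the function names, the critical value t_crit, and the relative-delta threshold min_rel_delta are all hypothetical stand-ins (the real materiality gate comes from each metric's registry entry), and the sketch checks only separation and size, not which direction counts as worse.

```rust
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Unbiased sample variance (divides by n - 1); needs at least 2 samples.
fn sample_variance(xs: &[f64]) -> f64 {
    let m = mean(xs);
    xs.iter().map(|x| (x - m).powi(2)).sum::<f64>() / (xs.len() as f64 - 1.0)
}

/// Welch's t statistic and Welch-Satterthwaite degrees of freedom.
/// Returns None when either side has fewer than 2 samples or both
/// sides have zero variance.
fn welch(a: &[f64], b: &[f64]) -> Option<(f64, f64)> {
    if a.len() < 2 || b.len() < 2 {
        return None;
    }
    let (na, nb) = (a.len() as f64, b.len() as f64);
    let (va, vb) = (sample_variance(a) / na, sample_variance(b) / nb);
    let se2 = va + vb;
    if se2 == 0.0 {
        return None;
    }
    let t = (mean(b) - mean(a)) / se2.sqrt();
    let df = se2 * se2 / (va * va / (na - 1.0) + vb * vb / (nb - 1.0));
    Some((t, df))
}

/// True when the [min, max] bands of the two sides do not overlap.
fn bands_disjoint(a: &[f64], b: &[f64]) -> bool {
    let min = |xs: &[f64]| xs.iter().copied().fold(f64::INFINITY, f64::min);
    let max = |xs: &[f64]| xs.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    max(a) < min(b) || max(b) < min(a)
}

/// Hypothetical gate: the sides are separated AND the delta is material.
fn confident_change(a: &[f64], b: &[f64], t_crit: f64, min_rel_delta: f64) -> bool {
    if a.len() < 2 || b.len() < 2 {
        return false;
    }
    let separated =
        bands_disjoint(a, b) || welch(a, b).map_or(false, |(t, _)| t.abs() > t_crit);
    let material = ((mean(b) - mean(a)) / mean(a)).abs() >= min_rel_delta;
    separated && material
}

fn main() {
    let baseline = [100.0, 102.0, 98.0, 101.0, 99.0];
    let candidate = [110.0, 112.0, 108.0, 111.0, 109.0];
    if let Some((t, df)) = welch(&baseline, &candidate) {
        println!("t = {t:.3}, df = {df:.1}");
    }
    // 2.306 and 0.05 are illustrative thresholds only.
    println!("confident: {}", confident_change(&baseline, &candidate, 2.306, 0.05));
}
```

With only one sample per side the variance is undefined, which is why the sketch returns None below two samples, mirroring the N ≥ 2 requirement.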
The command exits non-zero once enough metrics regress to trip the
failure gate — 5 or more by default, so a lone noisy regression
does not flip CI red. --fail-threshold tunes the count;
--must-fail M1,M2 fails on the named metrics regardless of
count. This drops straight into a CI perf-gate on a pull request —
see CI for the workflow.
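The exit-code rule reduces to a count check plus a named-metric check. The sketch below is a hypothetical model of that rule, not perf-delta's code; gate_trips and the metric names are placeholders chosen for illustration.

```rust
/// Hypothetical model of the failure gate: trip when the number of
/// regressed metrics reaches `fail_threshold`, or when any regressed
/// metric is listed in `must_fail`.
fn gate_trips(regressed: &[&str], fail_threshold: usize, must_fail: &[&str]) -> bool {
    let by_count = regressed.len() >= fail_threshold;
    let by_name = regressed.iter().any(|m| must_fail.contains(m));
    by_count || by_name
}

fn main() {
    // Placeholder metric names; real names come from the metric registry.
    let regressed = ["metric_a", "metric_b"];
    println!("default gate (5): {}", gate_trips(&regressed, 5, &[]));
    println!("with must-fail:   {}", gate_trips(&regressed, 5, &["metric_b"]));
}
```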
Manual: compare already-pooled runs
Every cargo ktstr test run writes one stats sidecar per test into
target/ktstr/{kernel}-{project_commit}/; the accumulated sidecars
are the pool that perf-delta compares. When you want control
over the worktrees or test selection — or you already have both
runs’ sidecars from CI artifacts — run the two branches yourself
and point perf-delta --base at the baseline commit. It compares
the cached pool without producing new runs (so it needs no
--kernel).
cd ~/src/my-sched
# Baseline: check out and run the baseline branch's suite.
git worktree add ~/src/my-sched-main upstream/main
cd ~/src/my-sched-main
cargo ktstr test --kernel ../linux -- -E 'test(/performance_mode/)'
# Experimental: run HEAD's suite.
cd ~/src/my-sched
cargo ktstr test --kernel ../linux -- -E 'test(/performance_mode/)'
# Compare the pooled sidecars: HEAD vs the baseline commit.
cargo ktstr perf-delta --base <baseline-short-hex>
The {project_commit} half of the sidecar directory is the
project tree’s HEAD short hex captured at first sidecar write
(suffixed -dirty when the worktree differs from HEAD), so two
branches with distinct HEADs land in distinct directories and
coexist under one runs root. perf-delta --base <hex> partitions
that pool by project_commit: the baseline commit’s sidecars are
side A, HEAD’s are side B.
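The directory naming is easy to model. The sketch below assumes only what the paragraph above states (the {kernel}-{project_commit} layout and the -dirty suffix); sidecar_dir is a hypothetical helper, and the kernel label and short hex values are made up for illustration.

```rust
/// Hypothetical model of the sidecar directory name:
/// target/ktstr/{kernel}-{project_commit}, with `-dirty` appended to
/// the commit when the worktree differs from HEAD.
fn sidecar_dir(kernel: &str, short_hex: &str, dirty: bool) -> String {
    let suffix = if dirty { "-dirty" } else { "" };
    format!("target/ktstr/{kernel}-{short_hex}{suffix}")
}

fn main() {
    // Made-up kernel label and commit hashes.
    let baseline = sidecar_dir("7.0.14", "1a2b3c4", false);
    let head = sidecar_dir("7.0.14", "5d6e7f8", false);
    println!("{baseline}\n{head}");
    // Distinct HEADs give distinct directories, so both pools coexist.
    println!("coexist: {}", baseline != head);
}
```

Two checkouts at the same HEAD produce the same string, which is the collision that makes distinct commits a requirement.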
Warning
The two runs must be at distinct commits. If both checkouts share the same HEAD they land in the same directory and the second run’s pre-clear overwrites the first, so the comparison degenerates to an identical pool. Confirm distinct commits with
git -C ~/src/my-sched rev-parse HEAD before the second run.
The project commit is discovered by walking up from the test
process’s current working directory to the enclosing .git, so the
cd steps are load-bearing: without them the probe records the
wrong commit. Use cargo ktstr stats list-values to see the
project_commit values a pool actually carries before choosing
--base.
Comparing configurations (not commits)
perf-delta compares on the commit axis (HEAD vs a baseline). A
cross-config question — scheduler A vs scheduler B, or two tunings,
at the same commit — is answered in-test: run both configurations
as phases of one scenario and assert the relationship directly
(e.g. VmResult::better_across_phases), so the verdict travels
with the test rather than a separate compare invocation.
Compare a Scheduler vs EEVDF is the worked
example of that pattern.
Cleanup
git worktree remove ~/src/my-sched-main
Capture and Compare Host State
When a gauntlet run passes on one machine and fails on another — or
passes on Monday and fails on Wednesday — the first thing to check
is whether the host itself changed. cargo ktstr show-host
captures a snapshot of the kernel, CPU, memory, scheduler tunables,
and kernel cmdline; cargo ktstr perf-delta surfaces the changes
between two runs in a host-delta section so you can see what moved.
(For per-thread profiling see
ctprof; for scheduler-behavior diffs
between commits see A/B Compare Branches.)
Live vs archived
Two subcommands print host context; pick the one whose target matches your question:
- cargo ktstr show-host reads the live host (/proc, /sys, uname()) at invocation time. Use it to inspect the current machine: before a benchmark, after a sysctl change, or to confirm what the next run here would record.
- cargo ktstr stats show-host --run RUN_ID prints the archived host context captured at sidecar-write time for a past run (run keys from cargo ktstr stats list). Use it when investigating a regression in a past run: what looked like a code change might trace back to a host change.
Both render through the same formatter, so the two outputs are byte-for-byte comparable when the host is unchanged.
Capture: show-host
cargo ktstr show-host
Prints a key: value report, one field per line, in a fixed order
(the order and field set are pinned by unit tests):
kernel_name, kernel_release, arch — the uname() triple ·
cpu_model, cpu_vendor — first /proc/cpuinfo entry ·
total_memory_kib, hugepages_total, hugepages_free,
hugepages_size_kib — from /proc/meminfo · online_cpus,
numa_nodes — node count from the CPU→node mapping (memory-only
nodes are not counted) · thp_enabled, thp_defrag — transparent
hugepage policy with the bracketed selection preserved verbatim ·
kernel_cmdline — /proc/cmdline verbatim · task_delayacct —
delay-accounting state (on, runtime-off, or config-off);
gates which taskstats delay fields populate ·
config_task_xacct — CONFIG_TASK_XACCT build state; gates the
taskstats memory-watermark fields · cpufreq_governor — one line
per CPU · sched_tunables — every /proc/sys/kernel/sched_*
sysctl, one entry per line · heap_state — the process’s own
jemalloc allocator state (rarely relevant to host comparison).
Missing-value rendering is consistent everywhere: a field that
failed to populate prints (unknown); a map that was captured but
empty prints (empty); in per-key diffs, a key present on only one
side prints (absent). The distinction matters — (empty) means
the dimension was inspected and had nothing, (unknown) means the
capture itself failed.
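The three placeholders map naturally onto Option and emptiness checks. The sketch below is one way to model the distinction, assuming a field is None when capture failed; the function names are hypothetical and the real formatter may be structured differently.

```rust
use std::collections::BTreeMap;

/// A scalar field: `None` means the capture failed.
fn render_scalar(value: Option<&str>) -> String {
    value.map(str::to_string).unwrap_or_else(|| "(unknown)".to_string())
}

/// A map field: `None` means the capture failed, an empty map means
/// the dimension was inspected and had nothing.
fn render_map(map: Option<&BTreeMap<String, String>>) -> String {
    match map {
        None => "(unknown)".to_string(),
        Some(m) if m.is_empty() => "(empty)".to_string(),
        Some(m) => m
            .iter()
            .map(|(k, v)| format!("{k}={v}"))
            .collect::<Vec<_>>()
            .join(", "),
    }
}

/// One side of a per-key diff: a key missing on this side is absent.
fn render_diff_side(map: &BTreeMap<String, String>, key: &str) -> String {
    map.get(key).cloned().unwrap_or_else(|| "(absent)".to_string())
}

fn main() {
    let empty = BTreeMap::new();
    println!("cpu_model: {}", render_scalar(None));
    println!("sched_tunables: {}", render_map(Some(&empty)));
    println!("one-sided key: {}", render_diff_side(&empty, "sched_rt_period_us"));
}
```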
The output is human-oriented. The same data, same schema, is
attached to every gauntlet-run sidecar under its host field:
jq '.host' path/to/sidecar.ktstr.json
Compare: perf-delta’s host-delta section
cargo ktstr perf-delta --noise-adjust 5 --kernel 6.14
perf-delta picks the first sidecar with a
populated host field from each side and prints one of: nothing
(neither side carried host context), host: captured in 'A' only, delta unavailable (a one-sided capture failure — your capture
broke, not the host), host: identical between 'A' and 'B' (arch: x86_64), or a diff. The diff suppresses fields that match —
it is a diff, not a snapshot — and renders one row per changed
field, in this shape:
host delta ('<baseline>' → '<candidate>'):
<field>: <baseline value> → <candidate value>
...
An unchanged host is the precondition for a clean A/B of scheduler
behavior. A CI perf-gate that runs perf-delta on a pull request
surfaces this section automatically whenever any host field differs
— treat its appearance as a signal the comparison may not hold, and
fail or annotate the PR.
Typical hits
Each row names the show-host field carrying the signal, so you
can cargo ktstr show-host | grep <field> or
jq '.host.<field>' sidecar.ktstr.json directly.
| Field | Symptom | Fix / interpretation |
|---|---|---|
| thp_enabled / thp_defrag | Latency-sensitive regressions that come and go between runs | Compare the bracket position (the active setting), not the whole string; pin via transparent_hugepage= on the kernel cmdline |
| sched_tunables.* | Idle-steal pressure shifted on scx_* schedulers that read these sysctls | Restore the changed sysctl; the captured set is whatever /proc/sys/kernel/sched_* lists at capture time |
| kernel_cmdline | Whole scheduling surface changed (isolcpus=, nohz_full=, mitigations=, numa_balancing= are all boot-time) | Reboot the host to match; it is the only remediation that makes the comparison hold |
| kernel_release (with kernel_name, arch) | Everything is suspect | Cross-kernel comparison; rebuild the baseline on the same kernel |
| hugepages_total / hugepages_free / hugepages_size_kib | performance_mode throughput flips when the 2 MiB pool shrinks | Restore the hugepage reservation |
| numa_nodes | Cross-node migration and locality signals mean different things across the runs | Hardware/firmware or topology reconfiguration; note memory-only nodes are not counted |
| cpu_model / cpu_vendor | Cache-sensitive benchmarks moved | Different machine; inspect alongside kernel_cmdline, which will usually differ too |
Two disambiguations worth knowing:
- The field is named kernel_cmdline (not cmdline) in both the printed output and the sidecar JSON, to distinguish it from SidecarResult.kargs, which holds the extra kargs the ktstr VMM appended when booting the guest, not the running host’s boot line.
- The CFS/EEVDF tuning knobs (base_slice_ns, migration_cost_ns, latency_ns, …) live in debugfs (/sys/kernel/debug/sched/), where they moved in Linux 5.13, not in /proc/sys/kernel. show-host reads only /proc/sys/kernel, so sched_tunables never captures them on any kernel.
The three-line investigation
The shape this recipe usually takes: a suite regresses with no
scheduler change in the diff. Compare cargo ktstr stats show-host --run <old-run> against live cargo ktstr show-host; one delta
appears — say, the thp_enabled bracket moved after a distro
update. Pin the setting on the kernel cmdline, rerun, baseline
restored: it was never a scheduler bug.
Diagnose a Slow Scheduler with ctprof
When a scheduler change makes the workload slower but the test
suite still passes, the regression is usually buried in per-thread
off-CPU time. ktstr ctprof capture snapshots every live thread’s
scheduling, memory, I/O, and taskstats delay counters;
ktstr ctprof compare diffs two snapshots and surfaces the buckets
where time went. This recipe walks through a typical A/B
comparison.
See the ctprof reference for the full metric registry, aggregation rules, derived-metric formulas, and taskstats kconfig gating.
Capture before and after
Before capturing on a fresh boot, run the taskstats pre-flight below — it saves a capture round-trip.
# Baseline: scheduler A loaded, workload running.
ktstr ctprof capture --output baseline.ctprof.zst
# Switch schedulers, restart workload, wait for steady state.
# ...
# Candidate: scheduler B, same workload.
ktstr ctprof capture --output candidate.ctprof.zst
capture walks /proc once and writes the snapshot. The
scheduling, I/O, and taskstats delay data are read from procfs and
netlink (genetlink) with no kernel tracing. The jemalloc memory
counters require a ptrace(PTRACE_SEIZE) attach that briefly stops
each probed thread — an observer effect bounded per thread; every
counter is cumulative-from-birth, so the recorded values are
unbiased by attach timing. The default capture covers every live
tgid; on a busy host this is hundreds of threads. The snapshot is
zstd-compressed JSON, typically a few MB, and capturing is fast: in
the session below (≈800 processes, ≈1,200 threads, no
jemalloc-probed targets), each snapshot completed in under a
second.
Reading the compare table
Each compared metric renders as a baseline → candidate arrow
cell plus delta, %, and %uptime (what fraction of the
snapshot interval the group was alive) — the default arrow
layout; --display-format full splits baseline and candidate into
separate columns. Rows group by process family and sort so the
biggest movers land on top. This excerpt brackets a compile job, and
the table points straight at it — the mm_percpu_wq kworkers did
~10× more waiting:
## Primary metrics
comm                          threads  metric             value                  delta      %         %uptime
kworker/{N}:{N}-mm_percpu_wq
kworker/{N}:{N}-mm_percpu_wq  11→37    voluntary_csw      8.697K → 101.154K      +92.457K   +1063.1%  93%
kworker/{N}:{N}-mm_percpu_wq  11→37    timeslices         8.699K → 101.166K      +92.467K   +1063.0%  93%
kworker/{N}:{N}-mm_percpu_wq  11→37    wait_time_ns       2.684s → 27.653s       +24.969s   +930.2%   93%
kworker/{N}:{N}-mm_percpu_wq  11→37    stime_clock_ticks  22ticks → 217ticks     +195ticks  +886.4%   93%
kworker/{N}:{N}-mm_percpu_wq  11→37    run_time_ns        243.378ms → 2.320s     +2.077s    +853.4%   93%
...
kworker/{N}:{N}-events
kworker/{N}:{N}-events        87→60    nonvoluntary_csw   22 → 11                -11        -50.0%    95%
kworker/{N}:{N}-events        87→60    timeslices         222.140K → 127.813K    -94.327K   -42.5%    95%
kworker/{N}:{N}-events        87→60    voluntary_csw      222.118K → 127.802K    -94.316K   -42.5%    95%
kworker/{N}:{N}-events        87→60    wait_time_ns       64.861s → 39.243s      -25.618s   -39.5%    95%
...
Large positive deltas on a process that should not have moved are
the suspects — here wait_time_ns grew 930% (runqueue wait, not
work) while run_time_ns grew less, the signature of queueing
pressure. The taskstats-delay lens below renders rows in exactly
the same shape.
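The delta and % columns are plain arithmetic on the two cell values. A minimal sketch, assuming % is the change relative to the baseline; percent_change is a hypothetical helper, and the zero-baseline handling is this sketch's choice rather than a documented ctprof rule.

```rust
/// Absolute change: candidate minus baseline.
fn delta(baseline: f64, candidate: f64) -> f64 {
    candidate - baseline
}

/// Change relative to the baseline, in percent. Returns None for a
/// zero baseline, where a relative change is undefined (a choice
/// made for this sketch).
fn percent_change(baseline: f64, candidate: f64) -> Option<f64> {
    if baseline == 0.0 {
        None
    } else {
        Some(delta(baseline, candidate) / baseline * 100.0)
    }
}

fn main() {
    // voluntary_csw for the mm_percpu_wq kworkers: 8.697K → 101.154K.
    let (base, cand) = (8_697.0, 101_154.0);
    println!("delta: {:+}", delta(base, cand));
    if let Some(pct) = percent_change(base, cand) {
        println!("%: {pct:+.1}%");
    }
}
```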
Compare with the taskstats lens
The taskstats-delay section bundles the eight kernel
delay-accounting buckets (CPU, blkio, swapin, freepages, thrashing,
compact, wpcopy, irq) plus their nine derived metrics
(avg_*_delay_ns per bucket, and the total_offcpu_delay_ns
rollup). Filter the output down to just the off-CPU view:
ktstr ctprof compare baseline.ctprof.zst candidate.ctprof.zst \
--sections taskstats-delay \
--sort-by total_offcpu_delay_ns:desc
The sort puts the processes with the largest absolute off-CPU
growth at the top. The total_offcpu_delay_ns derivation is:
cpu + blkio + freepages + compact + wpcopy + irq + max(swapin, thrashing)
max(swapin, thrashing) rather than swapin + thrashing because
every thrashing event is also a swapin event from the syscall
perspective; summing both would double-count.
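Written as code, the rollup looks like this. The struct and function are a hypothetical transcription of the formula above for illustration, not ctprof's types.

```rust
/// Per-bucket delay totals in nanoseconds (illustrative struct).
struct DelayTotals {
    cpu: u64,
    blkio: u64,
    swapin: u64,
    freepages: u64,
    thrashing: u64,
    compact: u64,
    wpcopy: u64,
    irq: u64,
}

/// cpu + blkio + freepages + compact + wpcopy + irq + max(swapin, thrashing).
/// swapin and thrashing overlap, so only the larger of the two is counted.
fn total_offcpu_delay_ns(d: &DelayTotals) -> u64 {
    d.cpu + d.blkio + d.freepages + d.compact + d.wpcopy + d.irq
        + d.swapin.max(d.thrashing)
}

fn main() {
    let d = DelayTotals {
        cpu: 100,
        blkio: 20,
        swapin: 7,
        freepages: 3,
        thrashing: 5,
        compact: 2,
        wpcopy: 1,
        irq: 4,
    };
    println!("total_offcpu_delay_ns = {}", total_offcpu_delay_ns(&d));
}
```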
Drill into the per-bucket averages
If total_offcpu_delay_ns jumped on a process, the per-bucket
avg_*_delay_ns derivations identify which off-CPU phase grew
(the --sections taskstats-delay filter keeps the raw counters and
all nine derivations together):
| Bucket | Average derivation | Meaning |
|---|---|---|
| CPU runqueue wait | avg_cpu_delay_ns | Time waiting for the scheduler to pick the task |
| Block I/O wait | avg_blkio_delay_ns | Synchronous block-device wait; the canonical delay-accounting reading, distinct from schedstat iowait_sum |
| Swap-in / Thrashing | avg_swapin_delay_ns / avg_thrashing_delay_ns | Memory pressure; the two overlap (a thrashing event is also a swapin) |
| Direct memory reclaim | avg_freepages_delay_ns | Allocator hit the __alloc_pages slowpath |
| Memory compaction | avg_compact_delay_ns | High-order allocation stalled on compaction |
| CoW page-fault | avg_wpcopy_delay_ns | Write-protect-copy fault, e.g. fork-then-write |
| IRQ handling | avg_irq_delay_ns | Time charged to the task by IRQ accounting |
One caveat on avg_cpu_delay_ns: the kernel updates its count and
total locklessly, so a reader can catch one ahead of the other —
the quotient is approximate at the sub-event scale and stable at
the integrated scale.
A growing avg_cpu_delay_ns with flat blkio/swap/freepages
suggests the new scheduler is making poor placement choices — the
task is queueing more often or for longer, but no other subsystem
is to blame. A growing avg_blkio_delay_ns with flat
avg_cpu_delay_ns points away from the scheduler entirely (disk,
network filesystem, or a userspace lock pattern).
Cross-reference the primary table
Once a bucket is identified, look at the underlying counters without the section filter:
ktstr ctprof compare baseline.ctprof.zst candidate.ctprof.zst \
--metrics nr_wakeups,nr_migrations,wait_sum,wait_count,run_time_ns,timeslices
Useful pairings when the suspect bucket is CPU runqueue wait:
- wait_sum / wait_count: schedstat’s average wait per scheduling event (the avg_wait_ns derivation). If this confirms avg_cpu_delay_ns, both delay-accounting paths agree.
- nr_migrations: the new scheduler may be moving the task more aggressively; cross-CPU migrations cost wall-clock time even when run_time_ns is identical.
- nr_wakeups_affine / nr_wakeups_affine_attempts: the affine_success_ratio derivation (CFS-only). A large drop with growing avg_cpu_delay_ns is a strong signal for cache-unfriendly placement.
Confirm taskstats data is actually populated
Pre-flight (save a capture round-trip): verify the host has the delayacct runtime toggle enabled before capturing on a fresh boot:
sysctl kernel.task_delayacct # must report `kernel.task_delayacct = 1`
zcat /proc/config.gz | grep -E 'CONFIG_TASKSTATS|CONFIG_TASK_DELAY_ACCT'
# both must read `=y` for delayacct to fire
If task_delayacct is 0, set it with
sysctl -w kernel.task_delayacct=1 (or persist via
/etc/sysctl.d/) before capture. If the kconfig lines are absent,
the running kernel was built without delayacct support and
re-capturing won’t help. capture also logs read-failure tallies
with a hint when a whole class of files is unreadable — a real
example from a kernel missing two kconfigs:
2026-07-04T22:00:48.765819Z INFO ktstr::ctprof: ctprof parse: 1215 tids walked, 1646 read failures (dominant: io); hint: schedstat / io read failures dominate — kernel may be built without CONFIG_SCHED_INFO and/or CONFIG_TASK_IO_ACCOUNTING
If every taskstats column reads zero after the pre-flight, the snapshot likely hit a gating problem rather than a real “no delay” reading. The snapshot records a structured per-capture tally; no subcommand prints it, but the snapshot is zstd-compressed JSON, so read it directly:
zstd -dc candidate.ctprof.zst | \
jq '{taskstats: .taskstats_summary, tids_walked: .parse_summary.tids_walked}'
- eperm_count > 0: the capturing process lacked CAP_NET_ADMIN. Re-run as root, or grant cap_net_admin+eip via setcap.
- esrch_count near tids_walked: every tid raced exit before the per-tid query landed. Lengthen the workload’s steady-state window and re-capture.
- ok_count == 0 and eperm_count == 0: the netlink open failed, almost always meaning the kernel was built without CONFIG_TASKSTATS. Rebuild with the kconfig.
- ok_count > 0 but every delay column reads zero: kernel built with the kconfigs but launched without the runtime toggle. Add delayacct to the kernel cmdline, or set sysctl kernel.task_delayacct=1 and re-capture.
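The triage above can be encoded as a small decision function. This is a sketch for reasoning about the tally, not a ktstr API: the Tally struct, the Diagnosis enum, and the 90% cutoff used for "near tids_walked" are all assumptions made for illustration.

```rust
/// Counts read from the snapshot's taskstats tally (illustrative struct).
struct Tally {
    ok_count: u64,
    eperm_count: u64,
    esrch_count: u64,
}

#[derive(Debug, PartialEq)]
enum Diagnosis {
    MissingCapNetAdmin,
    ThreadsExitedBeforeQuery,
    TaskstatsUnavailable,
    DelayacctDisabledAtRuntime,
    Populated,
}

/// Walk the checks in the order the list above gives them. "Near" is
/// modeled as esrch_count >= 90% of tids_walked, an arbitrary cutoff.
fn diagnose(t: &Tally, tids_walked: u64, all_delay_columns_zero: bool) -> Diagnosis {
    if t.eperm_count > 0 {
        return Diagnosis::MissingCapNetAdmin;
    }
    if tids_walked > 0 && t.esrch_count * 10 >= tids_walked * 9 {
        return Diagnosis::ThreadsExitedBeforeQuery;
    }
    if t.ok_count == 0 {
        return Diagnosis::TaskstatsUnavailable;
    }
    if all_delay_columns_zero {
        return Diagnosis::DelayacctDisabledAtRuntime;
    }
    Diagnosis::Populated
}

fn main() {
    let tally = Tally { ok_count: 1215, eperm_count: 0, esrch_count: 0 };
    println!("{:?}", diagnose(&tally, 1215, true));
}
```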
What next
- If the candidate scheduler is a branch of the baseline, gate the regression with cargo ktstr perf-delta so CI catches the next one.
- If the question is “is this scheduler worth its overhead at all”, run the scheduler-vs-EEVDF comparison: one test, both schedulers, per-phase deltas.
- ctprof reference: full metric registry and gating documentation.
Customize Checking
Override checking thresholds for schedulers that tolerate higher imbalance, different gap thresholds, or relaxed event rates — and opt in to the checks that are off by default.
Warning
Assert::default_checks() is Assert::NO_OVERRIDES: every field None. Until a scheduler-level or per-test override sets a threshold, no worker assertions run. A green suite with no overrides proves only that the VM booted and the scheduler didn’t crash.
What a tripped gate looks like
Here a test set min_iteration_rate to a floor the workload could
never meet (deliberately, to force the failure). The report names
each worker that missed the gate, with the measured rate and the
floor it was compared against:
ktstr_test 'throughput_gate' [sched=scx-ktstr] [topo=1n1l2c1t] failed:
worker 71 iteration rate 41903.3/s below floor 50000000.0/s
worker 73 iteration rate 37834.5/s below floor 50000000.0/s
--- stats ---
2 workers, 4 cpus, 2 migrations, worst_spread=0.0%, worst_gap=21ms
cg0: workers=1 cpus=2 spread=0.0% gap=10ms migrations=1 iter=209600
cg1: workers=1 cpus=2 spread=0.0% gap=21ms migrations=1 iter=189252
...
--- monitor ---
samples=41 max_imbalance=2.00 max_dsq_depth=0 stuck=0
avg: imbalance=1.32 nr_running/cpu=1.2 dsq/cpu=0.0
events: fallback=0 (0.0/s) keep_last=210 (52.5/s) offline=0
events+: refill_slice_dfl=210
schedstat: csw=586 (146/s) run_delay=381246314ns/s ttwu=204 goidle=1
bpf: ktstr_select_cp cnt=189 145ns/call
bpf: ktstr_enqueue cnt=373 34ns/call
bpf: ktstr_dispatch cnt=584 237ns/call
verdict: monitor OK
Note the two channels: the worker gate tripped (the two below floor lines) while the monitor verdict is OK — worker checks and
host-side monitor checks are evaluated independently. See
Checking for the model. The fix is
whichever of these matches the intent: set a floor the scheduler
can actually meet, or fix the scheduler until it meets the floor
you wrote.
Scheduler-level overrides
Declare a scheduler with assertion overrides that apply to every test using it:
use ktstr::declare_scheduler;
use ktstr::prelude::*;
declare_scheduler!(RELAXED, {
name = "relaxed",
binary = "scx_relaxed",
assert = Assert::NO_OVERRIDES
.max_imbalance_ratio(5.0) // tolerate 5:1 imbalance
.max_fallback_rate(500.0) // higher fallback rate ok
.fail_on_stall(false), // don't fail on stall
});
These are the first layer that can carry an actual check — without them (or a per-test override), nothing asserts.
Per-test overrides
Attributes on #[ktstr_test] merge last and win:
#[ktstr_test(
scheduler = RELAXED,
not_starved = true,
max_gap_ms = 5000,
max_imbalance_ratio = 10.0,
sustained_samples = 10,
)]
fn high_imbalance_test(ctx: &Ctx) -> Result<AssertResult> {
// Inherits topology from RELAXED
Ok(AssertResult::pass())
}
not_starved = true enables the starvation, fairness-spread, and
scheduling-gap checks as a group; each threshold can still be
overridden independently. The full attribute list and default
thresholds live in the
#[ktstr_test] reference.
Merge order
The runtime evaluates
Assert::default_checks().merge(&scheduler.assert).merge(&test.assert)
— three layers, last-Some-wins per field. Worked example:
- scheduler layer: max_imbalance_ratio(5.0)
- test layer: max_imbalance_ratio = 10.0
- effective: 10.0. The test’s Some wins; every field the test leaves None falls through to the scheduler layer, then to the (all-None) defaults.
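The last-Some-wins rule can be modeled with Option::or. The sketch below uses a toy Thresholds struct with three fields, not ktstr's Assert type; only the per-field merge behavior is taken from the description above.

```rust
/// Toy stand-in for a threshold set; every field is optional.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
struct Thresholds {
    max_imbalance_ratio: Option<f64>,
    max_gap_ms: Option<u64>,
    fail_on_stall: Option<bool>,
}

impl Thresholds {
    /// Per field: take the overriding layer's value when it is Some,
    /// otherwise keep the value from the layer below.
    fn merge(self, over: &Thresholds) -> Thresholds {
        Thresholds {
            max_imbalance_ratio: over.max_imbalance_ratio.or(self.max_imbalance_ratio),
            max_gap_ms: over.max_gap_ms.or(self.max_gap_ms),
            fail_on_stall: over.fail_on_stall.or(self.fail_on_stall),
        }
    }
}

fn main() {
    let defaults = Thresholds::default(); // all None
    let scheduler = Thresholds {
        max_imbalance_ratio: Some(5.0),
        fail_on_stall: Some(false),
        ..Thresholds::default()
    };
    let test = Thresholds {
        max_imbalance_ratio: Some(10.0),
        max_gap_ms: Some(5000),
        ..Thresholds::default()
    };
    let effective = defaults.merge(&scheduler).merge(&test);
    println!("{effective:?}");
}
```

Here the test layer overrides the imbalance ratio, adds a gap limit the scheduler layer never set, and inherits fail_on_stall from the scheduler layer.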
To see the merged result for a registered test without reading source:
cargo ktstr show-thresholds high_imbalance_test
It prints every threshold field of the exact Assert the runtime
will evaluate, with none for unset fields.
Using Assert directly in ops scenarios
fn my_scenario(ctx: &Ctx) -> Result<AssertResult> {
let checks = Assert::NO_OVERRIDES
.check_not_starved()
.max_gap_ms(3000);
let steps = vec![/* ... */];
execute_steps_with(ctx, steps, Some(&checks))
}
execute_steps_with applies the given Assert for worker checks,
overriding the merged config. execute_steps (without _with)
passes None and falls back to ctx.assert — the merged
three-layer config above. Reaching for _with when you meant to
add to the merged config is a classic trap: the explicit Assert
replaces ctx.assert, it does not compose with it.
See Ops, Steps, and Backdrop for the step execution model.
Benchmark Gates and Negative Tests
A performance gate you have never seen fail proves nothing. This recipe builds gates in pairs: a positive test that holds the scheduler to a floor, and a negative twin that degrades the scheduler on purpose and asserts the same gate trips. (Extracting metrics from benchmark payloads like schbench or fio is covered in Payloads and Included Files; cross-commit regression gates are A/B Compare Branches.)
Positive: gate a scenario
use ktstr::declare_scheduler;
use ktstr::prelude::*;
declare_scheduler!(MY_SCHED, {
name = "my_sched",
binary = "scx_my_sched",
topology = (1, 1, 2, 1),
});
#[ktstr_test(
scheduler = MY_SCHED,
performance_mode = true,
duration_s = 5,
sustained_samples = 15,
)]
fn perf_positive(ctx: &Ctx) -> Result<AssertResult> {
let checks = Assert::default_checks()
.min_iteration_rate(5000.0)
.max_gap_ms(500);
let steps = vec![Step::with_defs(
vec![CgroupDef::named("cg_0").workers(2)],
HoldSpec::FULL,
)];
execute_steps_with(ctx, steps, Some(&checks))
}
Key points:
- performance_mode = true pins vCPUs to reserved host cores so the measurement isn’t host noise (see Performance Mode).
- Assert::default_checks() is all-None: every gate is opt-in, and nothing fails until you chain a setter. Threshold layering and merge order live in Customize Checking.
- execute_steps_with applies the Assert during worker checks.
A passing gated run looks like any passing run:
cargo ktstr: resolved kernel "7.0"
...
Starting 1 test across 121 binaries (12531 tests skipped)
PASS [ 22.158s] (1/1) ktstr::ktstr_sched_tests ktstr/sched_basic_proportional
────────────
Summary [ 22.197s] 1 test run: 1 passed, 12531 skipped
Choosing the floor
Don’t guess thresholds. Run the scenario once with no gates, read
the observed per-cgroup iteration counts from the --- stats ---
block (or the run’s stats sidecar), and set the floor at a healthy
margin below the observed rate — far enough down that run-to-run
noise never trips it, close enough up that a real regression does.
Tighten later once
perf-delta --noise-adjust has shown you the
actual run-to-run spread.
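One way to turn the observed rates into a starting floor is sketched below. The helper name suggest_floor and the margin parameter are assumptions for illustration; ktstr does not ship this function, and the right margin depends on your own run-to-run spread.

```rust
/// Suggest a floor from observed per-worker iteration rates (iterations/sec).
/// `margin` is the fraction of headroom below the slowest observation:
/// 0.5 places the floor at half the slowest observed rate.
/// Returns `None` for empty input or a margin outside [0, 1).
pub fn suggest_floor(observed: &[f64], margin: f64) -> Option<f64> {
    if observed.is_empty() || !(0.0..1.0).contains(&margin) {
        return None;
    }
    // Anchor on the slowest worker so a healthy run never trips the gate.
    let slowest = observed.iter().copied().fold(f64::INFINITY, f64::min);
    Some(slowest * (1.0 - margin))
}
```

Feeding it the rates from a few ungated runs and passing the result to min_iteration_rate gives a first floor that you can tighten later.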
Negative: prove the gate fires
expect_err = true inverts the harness: the test must fail. An
Ok return panics with expected test to fail but it passed, so a
gate that silently stopped firing turns the negative test red.
Skips are handled separately: a run that could not boot a kernel or
lost the host-resources race emits a SKIP banner and does not
count as the expected failure — an environment problem is never
mistaken for proof that the gate fired (the skip-vs-fail taxonomy
lives in Troubleshooting). Auto-repro is
disabled automatically for expected-error tests.
#[ktstr_test(
scheduler = MY_SCHED,
performance_mode = true,
duration_s = 5,
extra_sched_args = ["--degrade"],
expect_err = true,
)]
fn perf_negative(ctx: &Ctx) -> Result<AssertResult> {
let checks = Assert::default_checks()
.min_iteration_rate(5000.0)
.max_gap_ms(500);
let steps = vec![Step::with_defs(
vec![CgroupDef::named("cg_0").workers(2)],
HoldSpec::FULL,
)];
execute_steps_with(ctx, steps, Some(&checks))
}
extra_sched_args passes CLI args to the scheduler binary.
--degrade is a real knob on ktstr’s fixture scheduler
(scx-ktstr) that deliberately worsens its scheduling; the fixture
also exposes --slow, --stall-after, and --fail-verify for
other failure classes. Substitute your own scheduler’s equivalent —
a degradation flag your scheduler ships for exactly this purpose is
a feature, not a wart.
When the degraded run trips the gate, the failure the harness expects looks like the real thing, because it is:
ktstr_test 'throughput_gate' [sched=scx-ktstr] [topo=1n1l2c1t] failed:
worker 71 iteration rate 41903.3/s below floor 50000000.0/s
worker 73 iteration rate 37834.5/s below floor 50000000.0/s
The in-repo pattern to copy is tests/assert_gate_matrix.rs: a
macro stamps out a positive and a negative variant for each of ten
worker gates (max_p99_wake_latency_ns, max_wake_latency_cv,
min_iteration_rate, max_gap_ms, …), with the negative variant
passing --degrade and setting expect_err. Each gate in the
matrix is proven to fire by its own negative test — hold your gates
to the same standard.
Related
- Payloads and Included Files — metric extraction from real benchmark binaries, declarative include files.
- A/B Compare Branches — the cross-commit complement: gates on the delta between two commits instead of an absolute floor.
- Customize Checking — threshold layers and merge order.
Compare a Scheduler vs EEVDF
A standard regression guard for a sched_ext scheduler: does it match (or beat) the kernel default (EEVDF) on the same workload — not just for throughput, but for latency and CPU overhead too? Run the workload under the scheduler in one phase, detach the scheduler mid-run so the kernel default takes over for a second phase, then compare the two phases metric by metric.
The workload must persist across the detach — a Backdrop population,
not per-step workers — so its cumulative counters span both phases. That
shared, continuous measurement is what makes a per-phase delta meaningful
(per-step workers reset each phase and read ~0).
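Why a continuous counter matters can be seen in a small sketch: with one cumulative counter read at each phase boundary, a per-phase rate is just a delta over a duration. The Boundary type and both function names are hypothetical and only model the arithmetic; they are not the VmResult readers.

```rust
/// One reading of a cumulative iteration counter at a phase boundary.
#[derive(Clone, Copy, Debug)]
pub struct Boundary {
    pub t_ns: u64,       // timestamp of the boundary
    pub iterations: u64, // cumulative iterations at that time
}

/// Iterations per second between two boundaries. Returns `None` when time
/// did not advance or the counter went backwards (for example, a counter
/// that was reset between phases).
pub fn phase_rate(start: Boundary, end: Boundary) -> Option<f64> {
    let dt_ns = end.t_ns.checked_sub(start.t_ns)?;
    let delta = end.iterations.checked_sub(start.iterations)?;
    if dt_ns == 0 {
        return None;
    }
    Some(delta as f64 / (dt_ns as f64 / 1e9))
}

/// Ratio of phase A's rate to phase B's rate; above 1.0 means A was faster.
pub fn rate_ratio(a: (Boundary, Boundary), b: (Boundary, Boundary)) -> Option<f64> {
    let ra = phase_rate(a.0, a.1)?;
    let rb = phase_rate(b.0, b.1)?;
    if rb == 0.0 { None } else { Some(ra / rb) }
}
```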
Two readers cover the comparison, both on the &VmResult a post_vm
callback receives (the host-side hook that runs after the VM exits):
- VmResult::throughput_ratio(a, b): iterations/sec from the stimulus timeline. The timeline carries per-step boundaries independent of the periodic-capture pipeline, so throughput works even for --cell-parent-cgroup schedulers.
- VmResult::phase_metric(phase, name): any other per-phase metric by its registry name (see Checking): CPU overhead (system_time_ns, user_time_ns) and scheduling quality (avg_imbalance_ratio, avg_dsq_depth). Wake-latency and run-delay distributions are run-level (pooled across cgroups into one whole-run value), so they cannot be split into the scheduler phase vs the EEVDF phase; to compare them, run the scheduler and EEVDF as two separate tests and read each run’s run-level metric. Everything else flows through the one per-phase bucket pipeline, so a new metric becomes comparable here the moment it lands in that pipeline.
use anyhow::{ensure, Result};
use ktstr::assert::{AssertResult, Phase};
use ktstr::ktstr_test;
use ktstr::prelude::{Backdrop, VmResult};
use ktstr::scenario::Ctx;
use ktstr::scenario::ops::{execute_scenario, CgroupDef, HoldSpec, Op, Step};
use ktstr::test_support::{Scheduler, SchedulerSpec};
// Built directly rather than via declare_scheduler! so this comparison
// harness stays out of the verifier sweep (manual consts are not
// registered for sweeping). Use declare_scheduler! for the scheduler
// definition you ship.
const MY_SCHED: Scheduler =
Scheduler::named("my_sched").binary(SchedulerSpec::Discover("scx_my_sched"));
// Runs on the host after the VM exits; the &VmResult carries the stimulus
// timeline and the per-phase metric buckets the comparison reads.
fn compare_vs_eevdf(result: &VmResult) -> Result<()> {
let sched = Phase::step(0); // first Step ran under the scheduler under test
let eevdf = Phase::step(1); // second Step ran under EEVDF, after the detach
// Throughput: > 1.0 means the scheduler out-throughputs EEVDF; < 1.0
// is a regression.
let throughput = result
.throughput_ratio(sched, eevdf)
.ok_or_else(|| anyhow::anyhow!("no per-phase throughput — did both phases run?"))?;
ensure!(
throughput >= 0.8,
"my_sched throughput is {throughput:.2}x EEVDF (below the 0.8x floor)"
);
// Scheduling quality: any per-phase metric compares the same way via
// phase_metric. Skip the gate when a phase has no reading (None)
// rather than failing. (Wake-latency / run-delay distributions are
// run-level and not readable here — see the reader list above.)
if let (Some(s), Some(e)) = (
result.phase_metric(sched, "avg_imbalance_ratio"),
result.phase_metric(eevdf, "avg_imbalance_ratio"),
) {
ensure!(s <= e * 1.5, "my_sched imbalance {s:.2} is >1.5x EEVDF {e:.2}");
}
// CPU overhead: per-phase kernel (system) CPU time.
if let (Some(s), Some(e)) = (
result.phase_metric(sched, "system_time_ns"),
result.phase_metric(eevdf, "system_time_ns"),
) {
ensure!(s <= e * 2.0, "my_sched system time {s:.0}ns is >2x EEVDF {e:.0}ns");
}
Ok(())
}
#[ktstr_test(
scheduler = MY_SCHED,
duration_s = 10,
watchdog_timeout_s = 10,
post_vm = compare_vs_eevdf,
)]
fn scheduler_vs_eevdf(ctx: &Ctx) -> Result<AssertResult> {
// Persistent Backdrop population: runs across both phases so its
// cumulative counters span the detach.
let backdrop = Backdrop::new().push_cgroup(CgroupDef::named("cg").workers(4));
let steps = vec![
// Phase A: workload under the scheduler under test.
Step::new(vec![], HoldSpec::frac(0.5)),
// Phase B: detach -> the kernel default (EEVDF) takes over.
Step::new(vec![Op::detach_scheduler()], HoldSpec::frac(0.5)),
];
execute_scenario(ctx, backdrop, steps)
}
The 0.8x / 1.5x / 2.0x bounds above are illustrative, not
recommendations. Calibrate yours: run the test a few times with
generous bounds, note the observed ratios (each ensure! message
prints them; a run’s failure output leads with whichever message
tripped), and set each floor just outside the observed noise band.
A gate inside the noise band fails honest runs; one far outside it
never fails at all.
Notes:
- Op::detach_scheduler() cleanly hands the workload to the kernel default. Each step emits its own boundary, so no trailing closer step is needed, and the intentional detach is not promoted to a scheduler-died failure.
- Phases are keyed by Phase: Phase::step(0) is the first scenario Step, Phase::step(1) the second. Phase::BASELINE is the pre-Step settle window. Use Phase rather than the raw stimulus step_index.
- phase_metric returns None when a phase has no reading for a metric, so gate inside if let (Some(..), Some(..)) rather than unwrapping; a metric that did not populate skips its gate instead of failing the run.
- For cross-cell balance rather than a phase-vs-phase comparison, read result.stats.cgroup_balance_ratio() in the test body (the test body’s AssertResult carries stats).
This test gates scheduler-vs-EEVDF within one run. To gate your
scheduler against its own past self across commits, use
cargo ktstr perf-delta — the two nets catch
different regressions, and CI wants both.
Architecture Overview
A scheduler bug rarely returns an error code — it wedges a CPU, strands a runqueue, or panics the kernel. ktstr’s architecture follows from that: every test boots its own KVM microVM, so a crash takes down a disposable guest instead of your machine; the CPU topology is whatever the test declared; and the kernel is the exact build you targeted.
ktstr has three execution domains:
- Host process: the test binary running on the host. Manages VM lifecycle, monitors guest memory, evaluates results.
- Guest process: the same test binary running inside the VM as PID 1. Mounts filesystems, starts the scheduler, creates cgroups, forks workers, runs scenarios, and writes results back to the host.
- Monitor thread: runs on the host while the guest executes. Reads guest VM memory directly to observe scheduler state without instrumenting it.
Execution flow
Host Guest
---- -----
test binary
|
+-- build initramfs
| (test binary as /init
| + optional scheduler)
|
+-- boot KVM VM
| test binary (PID 1 init)
| |
+-- start monitor thread +-- mount filesystems
| (reads guest memory) +-- start scheduler (if any)
| +-- create cgroups
| +-- fork workers
| +-- move workers to cgroups
| +-- signal workers to start
| +-- poll scheduler liveness
| +-- stop workers, collect reports
| +-- evaluate results
| +-- write result to virtio-console port 1
|
+-- read result from virtio-console port 1
+-- evaluate monitor data
+-- report pass/fail
Results travel on virtio-console port 1; panics, crashes, and other non-blockable diagnostics fall back to the COM2 serial port (see VMM — guest–host transports).
From the host, a passing run looks like this:
cargo ktstr: fetching latest 7.0.x kernel version
cargo ktstr: latest 7.0.x kernel: 7.0.14
cargo ktstr: resolved kernel "7.0"
...
Starting 1 test across 121 binaries (12531 tests skipped)
PASS [ 34.451s] (1/1) ktstr::failure_dump_e2e ktstr/failure_dump_renders_bss_fields
...
Summary [ 34.490s] 1 test run: 1 passed, 12531 skipped
Key design decisions
Same binary, two roles. The test binary serves as both host
controller and guest test runner. The initramfs embeds the binary as
/init; when the binary finds itself running as PID 1, it executes
the guest lifecycle (mounts, scheduler start, test dispatch, reboot)
instead of the host one. One cargo build produces everything needed
for both sides — there is no separate guest agent to version or ship.
Forked workers (default), threads optional. The default Fork
clone mode spawns each worker as its own process so cgroup placement
via cgroup.procs is tgid-granular. The Thread clone mode shares
the harness’s tgid and routes placement through cgroup.threads
instead — useful when workers need a shared address space or when
measuring thread-only scheduler paths. See
Workers and Workloads.
Host-side monitoring. The monitor reads guest memory via KVM, avoiding BPF instrumentation of the scheduler under test. This eliminates observer effects on scheduling decisions.
Where to go next
- VMM — how VMs boot, topology modeling, guest–host transports.
- Monitor — what is observed from the host and how violations become verdicts.
- Workers and Workloads — worker lifecycle and the telemetry each worker reports.
- CgroupManager / CgroupGroup — cgroup plumbing and RAII cleanup inside the guest.
VMM
ktstr includes a purpose-built VMM (virtual machine monitor) that boots Linux kernels in KVM for testing.
Why a purpose-built VMM
Three requirements rule out reusing a general-purpose VMM:
- Direct guest-memory access. The monitor reads scheduler state straight out of guest DRAM through a host-side pointer into the VM’s memory mapping. Owning the VMM means owning that mapping — no guest agent, no hypercall surface, no negotiation with someone else’s memory model.
- Topology is the product. Tests declare NUMA nodes, LLCs, cores, and SMT threads, and the guest must actually have that shape — down to asymmetric node sizes, inter-node distances, and CXL memory-only nodes. The VMM builds the ACPI tables to the declared shape rather than approximating it with generic knobs.
- Boot cost is paid per test. Every #[ktstr_test] boots a fresh VM, so setup has to be cheap. From a real run (2-vCPU guest, warm caches):
initramfs spawn: 55.583µs
kvm+kernel: 867.005µs
setup_memory (joins initramfs): 1.409360963s
setup_vcpus: 1.409565321s
VM setup total: 1.409619773s
Creating the KVM VM and loading the kernel costs under a millisecond; the dominant cost is populating guest memory, which joins the cached initramfs build (below). After setup, the guest still has to boot the kernel — total wall-clock per test is dominated by the scenario’s own duration.
KtstrVm builder
let result = vmm::KtstrVm::builder()
.kernel(&kernel_path)
.init_binary(&ktstr_binary)
.topology(Topology::new(numa_nodes, llcs, cores_per_llc, threads_per_core))
.memory_mib(4096)
.run_args(&["run".into(), "--ktstr-test-fn".into(), "my_test".into()])
.build()?
.run()?;
Test authors do not touch this directly — #[ktstr_test] drives it —
but every attribute on the macro (topology dims, memory, kargs) lands
here.
Topology
The VM topology is specified as (numa_nodes, llcs, cores_per_llc, threads_per_core). On x86_64, the VMM creates ACPI tables (MADT,
SRAT, SLIT, and HMAT when numa_nodes > 1) and MP tables. On
aarch64, topology is expressed via FDT cpu nodes with MPIDR-derived
reg properties.
pub struct Topology {
pub llcs: u32,
pub cores_per_llc: u32,
pub threads_per_core: u32,
pub numa_nodes: u32,
pub nodes: Option<&'static [NumaNode]>,
pub distances: Option<&'static NumaDistance>,
}
total_cpus() = llcs × cores_per_llc × threads_per_core.
When nodes is None (the default), memory and LLCs are distributed
uniformly across NUMA nodes with default 10/20 distances. When
Some, each NumaNode specifies its LLC count, memory size, and
optional HMAT attributes (latency_ns, bandwidth_mbs,
mem_side_cache). A NumaNode with llcs = 0 models a CXL
memory-only node.
NumaDistance is an NxN inter-node distance matrix. Diagonal entries
must be 10 and off-diagonal > 10 (ACPI SLIT requirements); ktstr
additionally requires the matrix to be symmetric.
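The three matrix rules are easy to check mechanically. The following is a sketch under assumptions: validate_distances and its Vec-of-rows input are hypothetical, and ktstr's NumaDistance may store and validate the matrix differently.

```rust
/// Check an NxN inter-node distance matrix against the stated rules:
/// diagonal entries equal 10, off-diagonal entries exceed 10, and the
/// matrix is symmetric.
pub fn validate_distances(m: &[Vec<u8>]) -> Result<(), String> {
    let n = m.len();
    // Verify squareness first so the symmetry lookup below cannot go out of bounds.
    if let Some(i) = m.iter().position(|row| row.len() != n) {
        return Err(format!("row {} has {} entries, expected {}", i, m[i].len(), n));
    }
    for i in 0..n {
        for j in 0..n {
            let d = m[i][j];
            if i == j {
                if d != 10 {
                    return Err(format!("diagonal [{}][{}] is {}, must be 10", i, j, d));
                }
            } else if d <= 10 {
                return Err(format!("off-diagonal [{}][{}] is {}, must exceed 10", i, j, d));
            } else if d != m[j][i] {
                return Err(format!("[{}][{}] and [{}][{}] differ; matrix must be symmetric", i, j, j, i));
            }
        }
    }
    Ok(())
}
```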
Use Topology::new(numa_nodes, llcs, cores, threads) for uniform
topologies, or Topology::with_nodes(cores, threads, &nodes) for
explicit per-node configuration. The test-author view of all this is
Topology.
initramfs
The VMM builds a cpio initramfs containing:
- The test binary (as /init)
- Optional scheduler binary (as /scheduler)
- Shared library dependencies (resolved via ELF DT_NEEDED parsing)
The initramfs is split into a cached base plus a per-run suffix. The base cache key is derived from the payload’s shared-library set and the content hashes of the packed scheduler/probe/worker binaries and include files — not the test binary’s own bytes, which ride the per-run suffix. So recompiling your tests keeps the base cache warm, while recompiling the scheduler invalidates it. The cached base lives in a shared-memory segment that concurrent VMs map zero-copy, sharing physical pages across parallel tests.
Guest–host transports
| Transport | Carries |
|---|---|
| COM1 (serial) | Guest kernel console. Forwarded to stderr with --dmesg. |
| COM2 (serial) | Crash diagnostics only: the guest panic hook writes PANIC: <info> plus a backtrace here. |
| /dev/hvc0 (console port 0) | Interactive console for ktstr shell. |
| Console port 1 | The primary guest-to-host data channel: test results, exit codes, scenario markers, payload metrics, coverage data, scheduler-exit notifications. |
| Console port 2 | Transparent byte relay for scx_stats requests/responses between the host and the in-guest scheduler. |
Two details worth internalizing:
- COM2 is crash-only. Ordinary guest stdout/stderr does not use COM2; it travels over the port-1 stream as framed messages. COM2 exists for diagnostics that must get out even when the framed transport can’t be trusted (panics, fatal signals). The host parses the PANIC: header and surfaces the backtrace in test failure output.
- Port 1 frames are integrity-checked. Each frame on the port-1 stream carries a CRC32, so a corrupted result is detected rather than mis-parsed.
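To make the integrity check concrete, here is a sketch of a CRC32-protected frame. The checksum is the standard CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320); the frame layout (payload followed by a 4-byte little-endian checksum) and the function names are assumptions for illustration, not ktstr's actual wire format.

```rust
/// Bitwise CRC-32 (IEEE 802.3), reflected polynomial 0xEDB88320.
pub fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // mask is all ones when the low bit is set, zero otherwise
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Hypothetical layout: payload bytes followed by a little-endian CRC32.
pub fn encode_frame(payload: &[u8]) -> Vec<u8> {
    let mut frame = payload.to_vec();
    frame.extend_from_slice(&crc32(payload).to_le_bytes());
    frame
}

/// Returns the payload only when the stored checksum matches.
pub fn decode_frame(frame: &[u8]) -> Option<&[u8]> {
    let split = frame.len().checked_sub(4)?;
    let (payload, tail) = frame.split_at(split);
    let stored = u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]);
    if crc32(payload) == stored { Some(payload) } else { None }
}
```

A single flipped bit anywhere in the frame makes decode_frame return None, which is the "detected rather than mis-parsed" behavior described above.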
Performance mode
When performance mode is enabled, the VMM applies host-side isolation
(vCPU pinning, hugepages, NUMA mbind, RT scheduling), guest-visible
hints (KVM_HINTS_REALTIME CPUID), and KVM exit suppression.
Non-performance-mode VMs set the KVM halt-poll interval to 200µs;
overcommitted topologies set it to 0. See
Performance Mode.
Dual-role dispatch
The same test binary is the host controller and the guest /init —
Architecture Overview tells the story. The
mechanics: a constructor function runs before main() in every
ktstr-linked binary. Running as PID 1, it executes the guest init
path (mounts, scheduler start, test dispatch, reboot); given
--ktstr-test-fn plus a topology argument, it boots a VM as the host
side; given only --ktstr-test-fn, it runs the test function
directly because it is already inside a VM.
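The dispatch rule reduces to a small decision function. The sketch below is hypothetical: the Role enum, the boolean inputs, and the final fall-through arm are assumptions (the text does not say what happens when neither condition holds), and the real constructor inspects the process and its arguments itself.

```rust
/// Which lifecycle a ktstr-linked binary should run.
#[derive(Debug, PartialEq)]
pub enum Role {
    GuestInit, // PID 1: mounts, scheduler start, test dispatch, reboot
    HostBoot,  // test fn plus topology: boot a VM as the host side
    RunDirect, // test fn only: already inside a VM, run the function
    Normal,    // assumption: none of the above, continue into main()
}

/// `has_test_fn` models the presence of --ktstr-test-fn; `has_topology`
/// models the presence of a topology argument.
pub fn pick_role(pid: u32, has_test_fn: bool, has_topology: bool) -> Role {
    if pid == 1 {
        Role::GuestInit
    } else if has_test_fn && has_topology {
        Role::HostBoot
    } else if has_test_fn {
        Role::RunDirect
    } else {
        Role::Normal
    }
}
```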
Boot process
- Load the kernel (bzImage on x86_64, Image on aarch64).
- Create KVM vCPUs matching the declared topology. High vCPU counts add measurable boot latency — see Performance Mode for sizing.
- Build and load the initramfs.
- Set up serial devices (COM1 kernel console, COM2 crash diagnostics), the virtio console, and virtio block/net devices for disk- and network-shaped workloads.
- Boot the kernel.
- The kernel starts /init (the test binary); PID 1 detection routes into the guest lifecycle: mount filesystems, start the scheduler, dispatch the test function, reboot.
Monitor
When your scheduler leaves a CPU starving, the monitor is what notices. It runs on the host while the guest executes and reads scheduler state directly out of guest memory — the scheduler under test never executes an extra instruction to be observed, and no BPF probe perturbs the decisions being measured.
Every test report ends with the monitor’s summary. From a real run:
--- monitor ---
samples=41 max_imbalance=2.00 max_dsq_depth=0 stuck=0
avg: imbalance=1.32 nr_running/cpu=1.2 dsq/cpu=0.0
events: fallback=0 (0.0/s) keep_last=210 (52.5/s) offline=0
events+: refill_slice_dfl=210
schedstat: csw=586 (146/s) run_delay=381246314ns/s ttwu=204 goidle=1
bpf: ktstr_select_cp cnt=189 145ns/call
bpf: ktstr_enqueue cnt=373 34ns/call
bpf: ktstr_dispatch cnt=584 237ns/call
verdict: monitor OK
Reading it: samples is how many point-in-time snapshots the monitor
took; max_imbalance/max_dsq_depth/stuck are the peaks the
threshold checks evaluate; events are sched_ext event-counter rates
(select-cpu fallbacks, keep-last dispatches); the bpf: lines are
per-callback invocation counts and mean cost from the guest kernel’s
BPF program-runtime stats; verdict is what folds into the test
result.
What it reads
The monitor resolves kernel structure offsets from the guest kernel’s
BTF — nothing is hardcoded per kernel version. Per CPU, it reads the
runqueue’s nr_running, scx_nr_running, rq_clock,
local_dsq_depth, and scx_flags, plus the sched_ext event counters
(select-cpu fallback, dispatch keep-last, bypass activity, and the
rest of the family). When the guest kernel has CONFIG_SCHEDSTATS, it
also reads per-CPU struct rq schedstat fields (run_delay, pcount,
ttwu_count, …).
It also walks the struct sched_domain tree — from rq->sd up the
sd->parent chain — whenever the BTF exposes it, capturing per-level
topology metadata (level, name, flags, span_weight) and
runtime fields (balance_interval, nr_balance_failed,
max_newidle_lb_cost), plus load-balancing stats when
CONFIG_SCHEDSTATS is enabled. Fields that newer kernels added (the
proportional-newidle counters newidle_call / newidle_success /
newidle_ratio, new in 7.0 with some stable backports) resolve as
optional: on kernels whose BTF lacks them they are simply absent, and
the rest of the walk proceeds.
Sampling
The monitor takes periodic snapshots (MonitorSample) of all per-CPU
state; each sample is a point-in-time view of every CPU.
MonitorSummary aggregates samples into peak values (max imbalance
ratio, max DSQ depth, stall detection), per-sample averages, and
event-counter deltas. Averages are computed over valid samples only
(excluding uninitialized guest memory — see below).
Threshold evaluation
MonitorThresholds defines the pass/fail conditions:
| Threshold | Default | Trips when |
|---|---|---|
| max_imbalance_ratio | 4.0 | max/min per-CPU nr_running exceeds the ratio |
| max_local_dsq_depth | 50 | any CPU’s local DSQ exceeds the depth |
| fail_on_stall | true | a CPU’s rq_clock stops advancing (exemptions below) |
| max_fallback_rate | 200.0/s | sustained select-cpu-fallback event rate |
| max_keep_last_rate | 100.0/s | sustained dispatch-keep-last event rate |
| sustained_samples | 5 | not a trip condition; sets the window: a violation must persist this many consecutive samples |
A violation must persist for sustained_samples consecutive samples
before it counts — at the ~100ms sample interval, the default 5 means
roughly 500ms of sustained violation. This filters transient spikes
from cpuset transitions and cgroup creation/destruction. The
reasoning behind each default value lives in
Checking.
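The sustained window amounts to a consecutive-sample counter. The sketch below illustrates the idea only; the Sustained type is hypothetical, and clamping a zero window to 1 is an assumption of this sketch rather than documented ktstr behavior.

```rust
/// Tracks how many consecutive samples have violated a threshold.
pub struct Sustained {
    needed: u32, // sustained_samples
    run: u32,    // current streak of violating samples
}

impl Sustained {
    pub fn new(sustained_samples: u32) -> Self {
        // Assumption: treat a window of 0 as 1 so a clean sample never trips.
        Sustained { needed: sustained_samples.max(1), run: 0 }
    }

    /// Feed one sample. Returns true once the violation has persisted for
    /// `needed` consecutive samples; any clean sample resets the streak.
    pub fn observe(&mut self, violating: bool) -> bool {
        self.run = if violating { self.run.saturating_add(1) } else { 0 };
        self.run >= self.needed
    }
}
```

With the default of 5, four violating samples followed by one clean sample never trip the check, which is how a transient spike gets filtered.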
Test authors do not construct MonitorThresholds directly: the
#[ktstr_test] threshold attributes (max_imbalance_ratio,
fail_on_stall, …) and Assert::with_monitor_defaults() feed it —
see the macro reference.
enforce is the on/off gate for the threshold-violation path.
The default is report-only: monitor evaluations record every
violation in the verdict’s details, but the verdict passes. Opting in
— via Assert::with_monitor_defaults(), which fills unset threshold
fields and sets enforce — promotes recorded violations to failures.
Setting a field like fail_on_stall without enforcement is a no-op
for the violation path: the violation appears in the monitor report,
the verdict still passes, and the summary carries a report-only
advisory flagging the missing enforcement.
The no-signal arms bypass enforce entirely. An empty sample buffer,
or data that fails the plausibility check below, always produces a
verdict with passed: false, inconclusive: true, which folds into
the test’s AssertResult as Inconclusive (exit code 2). “Couldn’t
evaluate” is not the same as “evaluated and OK,” so the no-signal
path always surfaces distinct from Pass. Only threshold violations
are gated by enforce.
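The interaction between enforce and the no-signal arms can be summarized in a few lines. This is a hypothetical model of the folding logic described above, not the monitor's code; the Outcome enum and fold function are invented for the sketch.

```rust
/// How a monitor evaluation folds into the test result.
#[derive(Debug, PartialEq)]
pub enum Outcome {
    Pass,
    Fail,
    Inconclusive,
}

/// `has_signal` is false for an empty sample buffer or implausible data.
/// `violations` counts recorded threshold violations.
pub fn fold(has_signal: bool, violations: usize, enforce: bool) -> Outcome {
    // The no-signal arms bypass `enforce` entirely.
    if !has_signal {
        return Outcome::Inconclusive;
    }
    // Violations are always recorded; they only fail the run when enforced.
    if violations > 0 && enforce {
        Outcome::Fail
    } else {
        Outcome::Pass
    }
}
```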
Stall detection
A stall is detected when a CPU’s rq_clock does not advance between
consecutive samples. Three exemptions prevent false positives:
- Idle CPUs: when nr_running == 0 in both the current and previous sample, the CPU has no runnable tasks. The kernel stops the tick (NOHZ) on idle CPUs, so rq_clock legitimately does not advance.
- Preempted vCPUs: when the vCPU thread’s CPU time did not advance past the preemption threshold between samples, the host preempted the vCPU; the guest never got a chance to run, which is not the scheduler’s fault.
- Sustained window: stall detection uses per-CPU consecutive counters and the sustained_samples threshold, matching the other checks. A single stuck sample does not trigger failure.
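The first two exemptions can be expressed as a per-CPU predicate over two consecutive samples. The sketch is illustrative: CpuSample, is_stuck, and the preempt_threshold_ns parameter are hypothetical names, and the actual threshold value is not stated here. The sustained window would then be applied to this predicate's output per CPU.

```rust
/// The per-CPU fields this check needs from one monitor sample.
#[derive(Clone, Copy)]
pub struct CpuSample {
    pub rq_clock: u64,
    pub nr_running: u32,
    pub vcpu_cpu_time_ns: u64, // host CPU time consumed by the vCPU thread
}

/// True when `cur` counts as a stuck sample relative to `prev`.
pub fn is_stuck(prev: CpuSample, cur: CpuSample, preempt_threshold_ns: u64) -> bool {
    // The clock advanced: not a stall.
    if cur.rq_clock > prev.rq_clock {
        return false;
    }
    // Idle exemption: no runnable tasks in either sample, so the tick may be stopped.
    if prev.nr_running == 0 && cur.nr_running == 0 {
        return false;
    }
    // Preemption exemption: the vCPU barely ran, so the guest had no chance to advance.
    let ran_ns = cur.vcpu_cpu_time_ns.saturating_sub(prev.vcpu_cpu_time_ns);
    if ran_ns <= preempt_threshold_ns {
        return false;
    }
    true
}
```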
Uninitialized memory detection
Before the guest kernel initializes per-CPU structures, monitor reads
return garbage. Two layers handle this: summary computation skips
individual samples where any CPU’s local_dsq_depth exceeds a
plausibility ceiling (10,000), and threshold evaluation checks the
whole report — if all rq_clock values are identical across every
CPU and sample, or any sample exceeds the ceiling, the report is
classified “not yet initialized” and no per-threshold checks run
(this is one of the Inconclusive arms above).
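Both layers can be sketched as two small predicates. The types and function names below are hypothetical; only the ceiling of 10,000 and the two classification rules come from the description above. Treating an empty report as "no signal" here is a simplification, since the empty-buffer arm is handled separately.

```rust
pub const DSQ_PLAUSIBILITY_CEILING: u32 = 10_000;

/// The per-CPU fields the plausibility checks look at.
#[derive(Clone, Copy)]
pub struct Cpu {
    pub rq_clock: u64,
    pub local_dsq_depth: u32,
}

/// Layer 1: a sample is usable for averages only if every CPU's
/// local DSQ depth is at or below the ceiling.
pub fn sample_is_plausible(sample: &[Cpu]) -> bool {
    sample.iter().all(|c| c.local_dsq_depth <= DSQ_PLAUSIBILITY_CEILING)
}

/// Layer 2: the whole report is "not yet initialized" if any sample is
/// implausible, or if rq_clock is identical across every CPU and sample.
pub fn report_uninitialized(samples: &[Vec<Cpu>]) -> bool {
    if samples.iter().any(|s| !sample_is_plausible(s)) {
        return true;
    }
    let mut clocks = samples.iter().flatten().map(|c| c.rq_clock);
    match clocks.next() {
        None => true, // nothing to evaluate (simplification, see lead-in)
        Some(first) => clocks.all(|c| c == first),
    }
}
```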
The monitor never instruments the guest
Everything above is passive memory reading. The one ktstr feature that does load BPF probes into a guest is auto-repro — and it runs them in a separate, disposable repro VM booted after the original test failed, never in the VM whose behavior is being measured. The run your verdict is based on is unperturbed.
How guest memory is read
Three address-translation modes cover the kernel’s address spaces:
- Text/data/bss — linear offset from the kernel’s static map, for statically-linked kernel variables.
- Direct mapping: kva - PAGE_OFFSET, for SLAB allocations and per-CPU data.
- Vmalloc/vmap: a real page-table walk through the guest’s CR3 (4- and 5-level paging on x86_64; 4/16/64 KB granules on aarch64), for BPF maps and vmalloc’d memory.
All reads are bounds-checked and volatile (the guest modifies memory
concurrently), and the runtime KASLR offset is recovered at startup
so ELF symbols from the matching vmlinux resolve to live guest
addresses. vmlinux must match the guest kernel — it supplies both
the symbol table and the BTF. The implementing types (GuestMem,
GuestKernel, GuestMemMapAccessor) are documented in the monitor
module’s rustdoc (cargo doc --document-private-items).
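The direct-mapping case is the simplest of the three and shows what a bounds-checked translation looks like. This is a sketch under assumptions: the function name is hypothetical, guest physical memory is assumed to start at address 0 and be contiguous, and PAGE_OFFSET is passed in because its live value depends on the kernel configuration and KASLR.

```rust
/// Translate a direct-map kernel virtual address into a guest physical
/// address, rejecting anything that would read outside guest memory.
///
/// Assumes guest RAM is one contiguous range [0, guest_mem_size).
pub fn direct_map_to_gpa(
    kva: u64,
    page_offset: u64,
    guest_mem_size: u64,
    len: u64,
) -> Option<u64> {
    // kva - PAGE_OFFSET, refusing addresses below the direct map.
    let gpa = kva.checked_sub(page_offset)?;
    // The whole [gpa, gpa + len) range must fit inside guest memory.
    let end = gpa.checked_add(len)?;
    if end <= guest_mem_size { Some(gpa) } else { None }
}
```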
BPF map access
The monitor also discovers and reads/writes the scheduler’s BPF maps
directly through guest physical memory — no guest cooperation, no BPF
syscalls. Maps are found by walking the kernel’s map_idr and
matched by name suffix (".bss" matches "mitosis.bss"); values are
read at BTF-resolved offsets, including per-CPU array maps. When a
map carries program BTF, the dump renderer uses it to render the
value struct field by field — which is why failure dumps show your
BPF globals by name.
Tests use this through BpfMapWrite: a host-side write to a BPF map
during VM execution. The test runner waits for the scheduler to load
(the map becomes discoverable), writes the value, then signals the
guest to start the scenario:
const BPF_CRASH: BpfMapWrite = BpfMapWrite::new(".bss", "crash", 1);
#[ktstr_test(bpf_map_write = BPF_CRASH, expect_err = true)]
fn crash_test(ctx: &Ctx) -> Result<AssertResult> {
Ok(AssertResult::pass())
}
The field’s byte offset and width are resolved from the map’s program BTF at write time (which also disambiguates same-suffix maps by picking the one whose BTF names the field). Only array maps and 4-byte scalar fields are supported. For reading map state from test code, see Snapshots.
What a dump shows: cast analysis
When a test fails, the failure dump renders the scheduler’s BPF map state — with BTF names, not raw bytes. From a real failure:
map bpf_bpf.bss (type=array, value_size=448, max_entries=1)
.bss:
scx_arena_verify_once=true ktstr_alloc_count=76 nr_dispatched=907
nr_enqueued=495 nr_select_cpu=372 stats_magic=6004496034161779060
...
root 0x100000006000 → sdt_desc:
nr_free=512
chunk 0x100000007000 (sdt_alloc) → ktstr_arena_ctx{}
ktstr_bss_arena_holder ktstr_bss_arena_holder:
bss_plain_counter=76
arena_target 0x10000000aa80 (cast→arena) [chase: arena chase: STX-flow path tagged slot as Arena with deferred resolve; bridge had no entry for 0x10000000aa80]
BPF schedulers frequently store kernel and arena pointers in u64
fields, because BTF cannot express a pointer to a per-allocation
type. Without help, the renderer would print those as meaningless
integers. The cast analyzer closes that gap by analyzing the
scheduler binary’s BPF bytecode to learn which u64 fields actually
hold pointers, so the renderer can chase them. The annotations tell
you what happened:
- (cast→arena) / (cast→kernel): the pointer was recovered by cast analysis and chased into arena or kernel memory; the rendered fields after it are real dereferenced state, visually distinct from natively BTF-typed pointers.
- (sdt_alloc): the chase resolved the pointee’s type through a live arena-allocator slot (the common pattern for scx_task_data()-style per-task state).
- [chase: …]: the chase stopped, and this is why. A stopped chase falls back to showing the raw value; the analyzer is deliberately conservative, preferring a raw u64 over chasing garbage.
The analysis is unconditional — no opt-in, no test-author configuration — and applies to every snapshot, periodic capture, and failure dump. Failure dumps are also written machine-readable as a JSON artifact next to the test outputs, carrying the same BTF-resolved field names and cast annotations:
{
"name": "bpf_bpf.bss",
"map_kva": 18400526959283003096,
"map_type": 2,
"value_size": 448,
"max_entries": 1,
"value": {
"kind": "struct",
"type_name": ".bss",
"members": [
{
"name": "scx_arena_verify_once",
"value": {
"kind": "bool",
"value": true
}
},
...
{
"name": "nr_dispatched",
"value": {
"kind": "uint",
"bits": 64,
"value": 849
}
},
...
Every field name here was resolved on the host from the guest’s BTF — the guest wrote nothing but its normal map state. See Reading Failure Output for the full anatomy of a failure report.
Workers and Workloads
Workers are the processes that generate load for scenarios. They run
inside the VM, each placed in a cgroup, and each one reports detailed
telemetry (WorkerReport) when the workload stops. WorkloadHandle
is the RAII handle that owns their whole lifecycle: spawn → place →
start → stop and collect → drop.
Spawning
let config = WorkloadConfig {
num_workers: 4,
work_type: WorkType::Mixed,
..Default::default()
};
let mut handle = WorkloadHandle::spawn(&config)?;
Set only the fields that matter for the test and let
..Default::default() fill in the rest — WorkloadConfig’s default
is a known-good single-worker SpinWait baseline, and the spread
form keeps examples pinned to intent as fields are added. Consult the
WorkloadConfig rustdoc for the current field list. (Do not
extrapolate this to every ktstr type: CgroupDef deliberately has no
Default because a derived empty name would silently produce an
invalid cgroup — use CgroupDef::named(...).)
The worker-creation primitive is selected by CloneMode:
- CloneMode::Fork (default): forks N child processes; each child installs a SIGUSR1 handler, then blocks on a pipe waiting for the start signal. Each worker has its own tgid, so cgroup.procs placement is per-worker.
- CloneMode::Thread: spawns N threads inside the harness; each blocks on a rendezvous channel until start(). Workers share the harness’s tgid, so cgroup placement must go through cgroup.threads (see Placement below).
For grouped work types (PipeIo, FutexPingPong, FutexFanOut,
MutexContention, ThunderingHerd, and the rest of the
communicating families), spawn() validates that num_workers is
divisible by the variant’s group size and sets up the inter-worker
plumbing the variant requires (pipes, shared futex pages). See
Work Types for choosing a variant.
pcomm containers are not created by spawn() — it bails when a
composed WorkSpec::pcomm is set, pointing at
WorkloadHandle::spawn_pcomm_cgroup (or the CgroupDef::pcomm
path), which spawns one thread-group-leader process hosting N worker
threads so the group’s comm matches the pcomm name.
Two-phase start
Workers wait for a “start” signal after spawn:
- Parent spawns the worker (fork or thread), which blocks.
- Parent moves the worker to its target cgroup.
- Parent calls start(), releasing all workers at once.
This ensures workers run inside their target cgroup from the first instruction of their workload — there is no window where load runs in the wrong cgroup and pollutes the measurement.
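The handshake can be sketched with standard-library threads. This is an illustration only, not ktstr's implementation: BlockedWorker, spawn_blocked, and two_phase_demo are hypothetical names, and the "placement" step is just a log entry standing in for the cgroup move.

```rust
use std::sync::mpsc::{channel, Sender};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for a spawned-but-blocked worker.
pub struct BlockedWorker {
    start_tx: Sender<()>,
    join: JoinHandle<()>,
}

/// Spawn a worker that waits for the start signal before doing any work.
pub fn spawn_blocked(id: usize, log: Arc<Mutex<Vec<String>>>) -> BlockedWorker {
    let (start_tx, start_rx) = channel::<()>();
    let join = thread::spawn(move || {
        // Phase 1: block until the parent releases us.
        start_rx.recv().expect("parent dropped the start channel");
        // Phase 2: the workload begins only after placement is done.
        log.lock().unwrap().push(format!("worker {id} running"));
    });
    BlockedWorker { start_tx, join }
}

/// Run spawn -> place -> start and return the ordered event log.
pub fn two_phase_demo(n: usize) -> Vec<String> {
    let log = Arc::new(Mutex::new(Vec::new()));
    let workers: Vec<BlockedWorker> =
        (0..n).map(|i| spawn_blocked(i, Arc::clone(&log))).collect();
    // Placement happens while every worker is still blocked.
    for i in 0..n {
        log.lock().unwrap().push(format!("placed worker {i}"));
    }
    // Release all workers, then wait for them to finish.
    for w in &workers {
        w.start_tx.send(()).unwrap();
    }
    for w in workers {
        w.join.join().unwrap();
    }
    let events = log.lock().unwrap().clone();
    events
}
```

Because no worker can log before its start message arrives, every "placed" entry precedes every "running" entry, which is the ordering guarantee the two-phase start provides.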
Placement
// 1. Spawn workers (blocked, waiting for start signal)
let mut handle = WorkloadHandle::spawn(&config)?;
// 2. Move workers into their target cgroup. `cgroup.procs` is
// tgid-scoped, so use `worker_pids_for_cgroup_procs()` — it
// bails for Thread-mode workers (whose pids share the harness's
// tgid) and points at `cgroup.threads` instead. Plain
// `worker_pids()` returns the raw pid set without that check.
ctx.cgroups.move_tasks("cg_0", &handle.worker_pids_for_cgroup_procs()?)?;
// 3. Signal workers to start
handle.start();
// 4. Wait for the workload duration
std::thread::sleep(ctx.duration);
// 5. Stop workers and collect telemetry
let reports: Vec<WorkerReport> = handle.stop_and_collect();
Step 2’s Thread-mode bail exists because the kernel resolves any pid
written to cgroup.procs to its thread-group leader — writing a
Thread-mode worker’s pid there would migrate the entire harness
into the test cgroup.
Which placement tool, when:
| You want to | Use |
|---|---|
| Pin one worker to CPUs | handle.set_affinity(idx, cpus) |
| Pin a whole cgroup of workers | CgroupGroup::add_cgroup (writes cpuset.cpus once, RAII-removes on drop) |
| A cgroup that outlives the current scope | CgroupManager directly |
Start and observing progress
start() signals all workers to begin (a start-pipe byte for
fork children, a channel send for threads). Idempotent — the second
call is a no-op. Call it after cgroup placement.
snapshot_iterations() reads every worker’s current iteration
count from a shared-memory region without stopping anything. Call it
periodically during the run window to detect stalls or compute
instantaneous rates; final totals come from stop_and_collect().
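A minimal sketch of what to do with two consecutive snapshots, assuming each snapshot is a slice of per-worker iteration counts in worker order. The real return type of snapshot_iterations() is not shown here, and stalled and rates_per_sec are hypothetical helpers.

```rust
use std::time::Duration;

/// Indices of workers whose iteration count did not advance between snapshots.
pub fn stalled(prev: &[u64], curr: &[u64]) -> Vec<usize> {
    prev.iter()
        .zip(curr)
        .enumerate()
        .filter(|(_, (p, c))| c <= p)
        .map(|(i, _)| i)
        .collect()
}

/// Per-worker iterations per second over the interval between snapshots.
pub fn rates_per_sec(prev: &[u64], curr: &[u64], elapsed: Duration) -> Vec<f64> {
    let secs = elapsed.as_secs_f64();
    prev.iter()
        .zip(curr)
        .map(|(p, c)| c.saturating_sub(*p) as f64 / secs)
        .collect()
}
```

A worker that appears in stalled across several consecutive polls is a starvation candidate worth flagging before the run window ends.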
Stop and collect
stop_and_collect(self) signals workers to stop (SIGUSR1 flips a
stop flag in fork children; a per-thread flag for thread workers),
then collects each worker’s WorkerReport — read from a report pipe
under a shared 5-second deadline for fork children, returned from the
thread join for thread workers. It auto-starts workers if start()
was never called, and consumes the handle — workers cannot be
restarted.
A worker that fails to produce a report (died, timed out, wrote
corrupt data) gets a zeroed sentinel report: completed: false,
work_units: 0, and exit_info: Some(_) preserving how it ended
(Exited(code) / Signaled(sig) / TimedOut / WaitFailed /
Panicked). Live-worker reports always carry exit_info: None, so
consumers can distinguish “ran to completion and did nothing” from
“died before reporting” — and the starvation gate counts dead workers
as starved instead of silently passing.
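The distinction can be sketched with mirror types. Everything below is hypothetical: the real WorkerReport has many more fields, the variant payload types are guesses, and treating an alive worker with zero work units as starved is an assumption about the gate rather than a documented rule.

```rust
/// Hypothetical mirror of the exit_info variants named above.
#[derive(Debug, Clone, PartialEq)]
pub enum ExitInfo {
    Exited(i32),
    Signaled(i32),
    TimedOut,
    WaitFailed,
    Panicked,
}

/// Hypothetical subset of WorkerReport.
#[derive(Debug, Clone)]
pub struct Report {
    pub completed: bool,
    pub work_units: u64,
    pub exit_info: Option<ExitInfo>,
}

#[derive(Debug, PartialEq)]
pub enum Fate {
    Worked,
    Idle,           // alive, reported, but did nothing
    Died(ExitInfo), // sentinel report: never delivered real telemetry
}

pub fn classify(r: &Report) -> Fate {
    match &r.exit_info {
        Some(how) => Fate::Died(how.clone()),
        None if r.work_units == 0 => Fate::Idle,
        None => Fate::Worked,
    }
}

/// Count workers a starvation check should treat as starved:
/// dead workers count, they do not silently pass.
pub fn starved(reports: &[Report]) -> usize {
    reports.iter().filter(|r| classify(r) != Fate::Worked).count()
}
```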
After collection, SIGKILL is delivered to each fork worker’s process group unconditionally to reap stragglers.
Warning
The teardown SIGKILL is a process-group sweep. Every worker calls
setpgid(0, 0) after fork, so any child a Custom work function spawns (a helper via execv, a subshell) inherits the worker’s pgid and is SIGKILLed at teardown. A child that must outlive the worker needs setpgid(child_pid, 0) after fork, or an explicit wait before the worker returns its report. Details in Work Types: Custom.
Drop behavior
Dropping a WorkloadHandle without calling stop_and_collect()
sends SIGKILL to all child processes (the same process-group sweep)
and waits for them, so error paths never leak orphaned workers.
Shared mmap regions (futex pages, iteration counters) are unmapped on
drop. The type is #[must_use] — an accidentally dropped handle
tears its workload down immediately.
Telemetry: WorkerReport
Each worker produces one WorkerReport. The fields you will actually
assert on:
| Field | Meaning | Populated by |
|---|---|---|
| work_units | Cumulative work counter; feeds the starvation gate | Every framework work type |
| iterations | Outer-loop count; feeds throughput rates | Every framework work type |
| cpu_time_ns / wall_time_ns / off_cpu_ns | On-CPU vs total vs off-CPU time | Every framework work type |
| migration_count, migrations, cpus_used | Cross-CPU movement | Checked every 1024 work units |
| max_gap_ms (+ _cpu, _at_ms) | Longest wall-clock gap between checkpoints; the starvation/preemption tell | Every framework work type |
| wake_latencies_ns + wake_sample_total | Per-wakeup latency samples | Blocking work types only (futex, pipe, I/O, yield, sleep) |
| iteration_costs_ns + iteration_cost_sample_total | Per-iteration wall-clock cost | Pure-compute variants (AluHot, SmtSiblingSpin, IpcVariance) |
| timer_latencies_ns + timer_sample_total | Timer-wake jitter vs absolute deadline | TimerLatency only |
| schedstat_run_delay_ns / schedstat_run_count / schedstat_cpu_time_ns | /proc/self/schedstat deltas over the work loop | Every framework work type |
| numa_pages, vmstat_numa_pages_migrated | Per-node residency and migration counters | Every framework work type; feed the NUMA checks |
| completed, exit_info | Natural end vs sentinel (see above) | Framework |
| affinity_error, sched_policy_error | Setup calls that failed; worker ran anyway | Framework |
Consult the WorkerReport rustdoc for the full field list and
per-field semantics — the table above summarizes, the rustdoc is
authoritative.
Semantics worth knowing before asserting:
- Sampling caps. wake_latencies_ns is reservoir-sampled and capped at 100,000 entries; wake_sample_total keeps counting past the cap. Report “total wakeups” from the total; compute percentiles from the vector. (The cap is pinned by a unit test, max_wake_samples_pins_doc_value, so this paragraph cannot silently rot.)
- schedstat_run_count is pcount, not context switches. It increments each time the scheduler picks the task to run; a task that keeps running on one CPU does not advance it. For true context-switch counts read /proc/<pid>/status.
- Checkpoint cadence. Migration and gap checks run when work_units is a multiple of 1024, so a variant contributing N units per outer iteration checks every 1024 / gcd(N, 1024) iterations. Per-variant unit contributions live in the worker source and its rustdoc; the key defaults are pinned by unit tests.
- Custom populates nothing. The framework fills no telemetry for WorkType::Custom: migration tracking, gap detection, schedstat deltas, and iteration counts exist only if the user’s run function fills them.
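The checkpoint-cadence formula is easy to check by hand. The sketch below only evaluates 1024 / gcd(N, 1024); checkpoint_interval is a hypothetical helper, not a ktstr function, and the per-variant values of N still come from the worker source.

```rust
/// Greatest common divisor (Euclid's algorithm).
pub fn gcd(mut a: u64, mut b: u64) -> u64 {
    while b != 0 {
        let t = a % b;
        a = b;
        b = t;
    }
    a
}

/// Outer iterations between checkpoints for a variant that contributes
/// `units_per_iter` work units per iteration, given that checks fire
/// whenever the cumulative count is a multiple of 1024.
pub fn checkpoint_interval(units_per_iter: u64) -> u64 {
    assert!(units_per_iter > 0, "a variant must contribute at least one unit");
    1024 / gcd(units_per_iter, 1024)
}
```

A variant contributing 1 unit per iteration checks every 1024 iterations, one contributing 512 checks every 2, and one contributing an odd count such as 3 still checks only every 1024, because an odd N shares no factor with 1024.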
What the reports become
Test output rolls WorkerReports up per cgroup. From a real failing
run:
--- stats ---
2 workers, 4 cpus, 2 migrations, worst_spread=0.0%, worst_gap=21ms
cg0: workers=1 cpus=2 spread=0.0% gap=10ms migrations=1 iter=209600
cg1: workers=1 cpus=2 spread=0.0% gap=21ms migrations=1 iter=189252
iter sums iterations, gap is the worst max_gap_ms,
migrations sums migration_count, and cpus counts distinct
cpus_used entries. Reading this one: both cgroups made steady
progress with sub-25ms worst gaps — the workers were scheduled fine;
this failure came from a throughput floor, not starvation. A report
showing migrations=0 plus a growing gap on a multi-CPU cpuset
would tell the opposite story: the scheduler is not spreading.
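The roll-up rules above amount to a small fold. This sketch uses a hypothetical four-field subset of WorkerReport with assumed field types; it omits spread and makes no claim about how ktstr computes it.

```rust
use std::collections::BTreeSet;

/// Hypothetical subset of WorkerReport (field types are assumptions).
pub struct Report {
    pub iterations: u64,
    pub max_gap_ms: u64,
    pub migration_count: u64,
    pub cpus_used: BTreeSet<usize>,
}

#[derive(Debug, PartialEq)]
pub struct CgroupStats {
    pub workers: usize,
    pub cpus: usize,
    pub gap_ms: u64,
    pub migrations: u64,
    pub iter: u64,
}

/// Roll one cgroup's worker reports into the per-cgroup stats line.
pub fn roll_up(reports: &[Report]) -> CgroupStats {
    // cpus counts distinct entries across all workers in the cgroup.
    let mut cpus = BTreeSet::new();
    for r in reports {
        cpus.extend(r.cpus_used.iter().copied());
    }
    CgroupStats {
        workers: reports.len(),
        cpus: cpus.len(),
        // gap is the worst (largest) max_gap_ms.
        gap_ms: reports.iter().map(|r| r.max_gap_ms).max().unwrap_or(0),
        // migrations and iter are plain sums.
        migrations: reports.iter().map(|r| r.migration_count).sum(),
        iter: reports.iter().map(|r| r.iterations).sum(),
    }
}
```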
How reports become verdicts — thresholds, defaults, and the merge rules — is Checking’s territory.
CgroupManager
sched_ext schedulers see cgroups — weights, cpusets, hierarchy — so
ktstr scenarios create, mutate, and destroy cgroups mid-test, and the
cleanup has to survive kernel-side hangs a buggy scheduler can cause.
CgroupManager is that layer: cgroup v2 filesystem operations under a
parent directory, with timeouts and failure caps where the kernel can
wedge.
Scenarios reach it through Ctx.cgroups. The typical pattern pairs it
with the RAII guard:
fn custom_scenario(ctx: &Ctx) -> Result<AssertResult> {
    let mut guard = CgroupGroup::new(ctx.cgroups);
    guard.add_cgroup("cg_0", &cpuset)?;
    let mut h = WorkloadHandle::spawn(&config)?;
    ctx.cgroups.move_tasks("cg_0", &h.worker_pids_for_cgroup_procs()?)?;
    h.start(); // workers block until start() is called
    // ... run workload ...
    // `guard` drops at end of scope and removes cg_0 even on error.
    Ok(result)
}
Bypass CgroupGroup only when the cgroup’s
lifetime must outlive the current scope; the RAII wrapper removes the
cgroup on every error path, not just the happy one.
Construction
use std::collections::BTreeSet;
use ktstr::cgroup::Controller;
let cgroups = CgroupManager::new("/sys/fs/cgroup/ktstr");
let mut controllers = BTreeSet::new();
controllers.insert(Controller::Cpuset);
controllers.insert(Controller::Cpu);
cgroups.setup(&controllers)?; // create parent dir, enable cpuset + cpu
setup() creates the parent directory, checks each requested
controller (Cpuset, Cpu, Memory, Pids, Io) against
/sys/fs/cgroup/cgroup.controllers, and enables the requested set on
every ancestor down to and including the parent. A missing controller
fails early with a diagnostic of this shape rather than a later
ENOENT:
cgroup controller 'memory' not available at /sys/fs/cgroup/cgroup.controllers;
cgroup.controllers reports {...}. CONFIG_MEMORY_CONTROLLER may be unset, or
the controller is masked at this level of the hierarchy
Walk root. By default the ancestor walk and task-drain destination
is /sys/fs/cgroup (a root-owned tree). with_walk_root(root)
retargets both for cgroup-v2 user delegation (systemd Delegate=yes,
container nsdelegate): the walk stops at the delegated subtree, and
the constructor validates that parent sits at or below it.
Routine operations
| Method | Effect |
|---|---|
| create_cgroup(name) | Create a child directory; idempotent; supports nested paths |
| set_cpuset(name, cpus) / clear_cpuset(name) | Write cpuset.cpus as a compact range string ("0-3,5"); clear inherits the parent |
| set_cpuset_mems / clear_cpuset_mems | NUMA-node analogue (cpuset.mems) |
| move_task(name, pid) | Write one PID to the child’s cgroup.procs |
| set_cpu_max / set_cpu_weight | cpu controller knobs |
| set_memory_max / set_memory_high / set_memory_low / set_memory_swap_max | memory controller knobs |
| set_io_weight / set_pids_max / set_freeze | io / pids / freezer knobs |
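The compact range string that set_cpuset writes can be illustrated with a small formatter. This is a sketch of the format only; compact_ranges and fmt_run are hypothetical names, and the input type is an assumption.

```rust
use std::collections::BTreeSet;

/// Format a CPU set as a compact range string such as "0-3,5".
pub fn compact_ranges(cpus: &BTreeSet<usize>) -> String {
    let mut parts: Vec<String> = Vec::new();
    // (start, end) of the run of consecutive CPUs being extended.
    let mut run: Option<(usize, usize)> = None;
    for &cpu in cpus {
        run = match run {
            // Consecutive CPU: extend the current run.
            Some((start, end)) if cpu == end + 1 => Some((start, cpu)),
            // Gap: flush the finished run and start a new one.
            Some((start, end)) => {
                parts.push(fmt_run(start, end));
                Some((cpu, cpu))
            }
            None => Some((cpu, cpu)),
        };
    }
    if let Some((start, end)) = run {
        parts.push(fmt_run(start, end));
    }
    parts.join(",")
}

fn fmt_run(start: usize, end: usize) -> String {
    if start == end {
        start.to_string()
    } else {
        format!("{start}-{end}")
    }
}
```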
The CgroupDef builder routes its per-controller setters through
these, and the CgroupOps trait abstracts the surface so scenarios
consume &dyn CgroupOps (test doubles substitute cleanly). Cgroup
names are validated at every entry point: empty names, leading
slashes, NUL bytes, and . or .. path components are rejected.
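The rejection rules listed above can be sketched as one validator. The function name and error strings are hypothetical; only the four rejected shapes come from the text, and ktstr may well reject more than this.

```rust
/// Reject cgroup names that could escape or corrupt the parent directory.
pub fn validate_cgroup_name(name: &str) -> Result<(), String> {
    if name.is_empty() {
        return Err("cgroup name is empty".to_string());
    }
    if name.starts_with('/') {
        return Err(format!("cgroup name '{name}' has a leading slash"));
    }
    if name.contains('\0') {
        return Err("cgroup name contains a NUL byte".to_string());
    }
    // Nested paths are allowed, but no component may be "." or "..".
    if name.split('/').any(|c| c == "." || c == "..") {
        return Err(format!("cgroup name '{name}' contains a . or .. component"));
    }
    Ok(())
}
```

Validating per component rather than by substring keeps names such as "cg..0" legal while still rejecting "a/../b".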
Warning
For nested paths ("nested/leaf"), only +cpuset is propagated to intermediate cgroups’ subtree_control; +cpu, +memory, +pids, and +io are not. A nested leaf exposes cpuset.* knobs, but driving a memory/pids/io knob on it (e.g. CgroupDef::named("nested/leaf").memory_max(N)) fails with ENOENT at apply-setup time. See Troubleshooting for the operator-facing diagnostic.
Operations with a story
move_tasks(name, pids) — moves a batch of PIDs into a child
cgroup. Tolerates ESRCH (a task exited between listing and
migration) with a warning, but bails when every supplied pid
vanished — silence there would mask a dead-worker cascade. Retries
transient EBUSY from sched_ext cgroup_prep_move callbacks up to 3
attempts with 100ms backoff, then propagates. And it refuses to write
cgroup.procs at all when the destination has cpuset.cpus set but
cpuset.mems.effective reads empty — a half-configured cgroup whose
kernel behavior is path-dependent; the refusal names the fix
(set_cpuset_mems or widen an ancestor).
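The EBUSY retry policy can be sketched as a generic helper. This is not ktstr's code: retry_on_ebusy is a hypothetical name, the backoff is assumed to be a fixed delay (the text does not say whether it grows), and 16 is the Linux errno value for EBUSY.

```rust
use std::io;
use std::thread::sleep;
use std::time::Duration;

/// Linux errno for "device or resource busy".
const EBUSY: i32 = 16;

/// Call `op` until it succeeds, retrying only on EBUSY and only up to
/// `max_attempts` total calls. Returns the attempt number that succeeded.
pub fn retry_on_ebusy<F>(max_attempts: u32, backoff: Duration, mut op: F) -> io::Result<u32>
where
    F: FnMut() -> io::Result<()>,
{
    let mut attempt = 1;
    loop {
        match op() {
            Ok(()) => return Ok(attempt),
            // Transient: wait, then try again while attempts remain.
            Err(e) if e.raw_os_error() == Some(EBUSY) && attempt < max_attempts => {
                sleep(backoff);
                attempt += 1;
            }
            // Any other error, or EBUSY on the last attempt, propagates.
            Err(e) => return Err(e),
        }
    }
}
```

With the numbers from the text this would be called as retry_on_ebusy(3, Duration::from_millis(100), ...), which bounds the added delay to two backoff periods before the error surfaces.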
remove_cgroup(name) — auto-unfreezes frozen tasks (a frozen task
cannot be reparented), drains tasks to the walk root, waits for
cgroup.events to report populated 0 (inotify-driven, 1s
deadline), then removes the directory. Draining targets the walk root
because the parent has subtree_control set, and the kernel’s
no-internal-process constraint rejects task writes to a cgroup with
active controllers. Removing a cgroup that does not exist is Ok.
drain_tasks(name) / cleanup_all() — the pieces of the
above: drain one cgroup’s tasks to the walk root; or recursively
remove every child under the parent, depth-first, draining at each
level.
Failure modes
Write timeout. Every cgroup filesystem write runs under a 2-second timeout in a helper thread. A write the kernel never completes (scheduler bug, wedged freezer) errors with
cgroup write to <path> timed out after 2000ms
instead of hanging the test forever.
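A minimal sketch of the helper-thread pattern, assuming a channel with a receive deadline. write_with_timeout is a hypothetical name, and only the "timed out" message shape is taken from the text; the other two error strings are invented for the example.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Run `write` on a helper thread and give up after `timeout`.
/// `path` is used only to build the error message.
pub fn write_with_timeout<F>(path: &str, timeout: Duration, write: F) -> Result<(), String>
where
    F: FnOnce() -> std::io::Result<()> + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    // The helper is detached: if the write never returns, only the
    // helper stays blocked, not the caller.
    thread::spawn(move || {
        let _ = tx.send(write());
    });
    match rx.recv_timeout(timeout) {
        Ok(Ok(())) => Ok(()),
        Ok(Err(e)) => Err(format!("cgroup write to {path} failed: {e}")),
        Err(RecvTimeoutError::Timeout) => Err(format!(
            "cgroup write to {path} timed out after {}ms",
            timeout.as_millis()
        )),
        // The sender was dropped without sending: the helper panicked.
        Err(RecvTimeoutError::Disconnected) => {
            Err(format!("cgroup write to {path} aborted: helper thread panicked"))
        }
    }
}
```

The cost of this design is that a write the kernel never completes leaves its helper thread behind, so something has to bound how many such threads can accumulate.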
Stuck-cgroup cap. Each failed remove increments an
outstanding-removes counter (successful removes decrement it). Past
10 outstanding, further remove_cgroup calls fail fast with a
message of this shape:
remove_cgroup 'cg_42' refused: 11 cgroups outstanding (cap 10); cgroup.procs
draining wedged or churn loop outpacing the kernel's RCU grace period —
bailing to avoid unbounded cgroupfs accumulation
This bounds the leak from a churn scenario outrunning the kernel’s
cleanup instead of accumulating writer threads without limit.
outstanding_removes() exposes the count for diagnostics.
See also: CgroupGroup for RAII cleanup, Workers and Workloads for worker lifecycle, Topology for cpuset generation.
CgroupGroup
CgroupGroup is an RAII guard that removes cgroups on drop. It
prevents cgroup leaks when workload spawning or any other operation
fails between cgroup creation and cleanup.
#[must_use = "dropping a CgroupGroup immediately destroys the cgroups it manages"]
pub struct CgroupGroup<'a> { /* ... */ }
The #[must_use] is deliberate: binding the guard to _ (rather
than _guard) drops it immediately and destroys the cgroups before
the workload runs.
Methods
new(cgroups: &dyn CgroupOps) — creates an empty group bound to
any CgroupOps implementor (CgroupManager in
production, an in-memory fake in tests).
add_cgroup(name, cpuset) — creates a cgroup and sets its
cpuset. Auto-enables the Cpuset controller on the parent’s
cgroup.subtree_control first — the difference that matters vs
add_cgroup_no_cpuset, which creates the cgroup without a cpuset and
without touching controllers. Both track the cgroup for removal on
drop.
names() — the names of all tracked cgroups.
Drop behavior
On drop, the group calls remove_cgroup() on each tracked cgroup in
reverse insertion order, so nested children are removed before their
parents (a parent still holding child directories fails with
ENOTEMPTY).
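The reverse-order teardown can be sketched with a toy guard that records removals instead of touching cgroupfs. Guard and its shared log are hypothetical stand-ins, not the CgroupGroup implementation.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Toy RAII guard: tracks names and "removes" them on drop.
pub struct Guard {
    names: Vec<String>,
    removed: Rc<RefCell<Vec<String>>>,
}

impl Guard {
    pub fn new(removed: Rc<RefCell<Vec<String>>>) -> Self {
        Guard { names: Vec::new(), removed }
    }

    pub fn add(&mut self, name: &str) {
        self.names.push(name.to_string());
    }
}

impl Drop for Guard {
    fn drop(&mut self) {
        // Reverse insertion order: children were added after their
        // parents, so they are removed first.
        for name in self.names.iter().rev() {
            self.removed.borrow_mut().push(name.clone());
        }
    }
}
```

Adding "parent" and then "parent/child" yields the removal order child first, parent second, which is the order a real directory removal needs.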
ENOENT is the one errno the drop swallows silently: it means the
directory is already gone, so the post-condition already holds and no
cleanup is owed. (It can legitimately appear via a narrow race
between the existence check and remove_dir.) Every other error
surfaces as a tracing::warn! record carrying the cgroup name and
the full error chain — the drop never panics, but teardown failures
are visible in logs rather than silently swallowed. The record’s
shape:
CgroupGroup::drop: remove_cgroup returned non-ENOENT error
cgroup=<name> err=<error chain>
hint=EBUSY: cgroup still has live tasks — workloads were not drained before teardown
EBUSY at drop means exactly what the hint says: something is still
running in the cgroup — typically a WorkloadHandle that outlives
the guard, so its workers were never stopped before teardown. Drop
(or stop_and_collect) the handle before the guard goes out of
scope. EACCES gets its own hint pointing at cgroup ownership and
delegation.
Usage
CgroupGroup is the standard cgroup-lifecycle pattern for custom
scenarios — CgroupManager shows the full worked
example. The shape in brief:
let mut guard = CgroupGroup::new(ctx.cgroups);
guard.add_cgroup("cg_0", &cpuset_a)?;
guard.add_cgroup("cg_1", &cpuset_b)?;
// If anything below fails, `guard` drops and removes both cgroups.
The helper setup_cgroups(ctx, n, &wl) bundles the pattern: it
creates n cgroups, spawns workers in each, and returns the handles
alongside the guard.
See also: CgroupManager for filesystem operations, Workers and Workloads for worker lifecycle.
CI
ktstr boots KVM microVMs and builds Linux kernels, so a CI job has two
unusual needs: runners that expose /dev/kvm, and aggressive caching —
the first run on a fresh runner downloads and compiles a full Linux
kernel, by far the slowest step in any workflow. Once the kernel cache
is warm, the kernel resolves in under a second and wall-clock is
dominated by the tests themselves:
cargo ktstr: fetching latest 7.0.x kernel version
cargo ktstr: latest 7.0.x kernel: 7.0.14
cargo ktstr: resolved kernel "7.0"
...
Finished `test` profile [unoptimized + debuginfo] target(s) in 0.23s
────────────
Nextest run ID 24c18577-cd34-43bd-9d14-b0197701c187 with nextest profile: default
Starting 1 test across 121 binaries (12531 tests skipped)
PASS [ 34.451s] (1/1) ktstr::failure_dump_e2e ktstr/failure_dump_renders_bss_fields
────────────
Summary [ 34.490s] 1 test run: 1 passed, 12531 skipped
Everything below is a variation on: get KVM, cache the kernel, run the
tests, keep the stats. This repo’s own CI is the living reference:
.github/workflows/ci.yml.
Runner requirements
GitHub-hosted ubuntu-latest runners do not expose /dev/kvm.
Use self-hosted runners with project-specific labels (this repo uses
ktstr-x64 and ktstr-arm64; substitute your own pool’s labels):
runs-on: [ktstr-x64] # x86_64 self-hosted KVM runner
runs-on: [ktstr-arm64] # aarch64 self-hosted KVM runner
See Troubleshooting: /dev/kvm not accessible
for diagnosing KVM on runners, including cloud-VM nested
virtualization setup (GCP, AWS, Azure). Runners also need the build
dependencies from Getting Started and at least
5 GB of free disk for kernel sources, build artifacts, and cached
images. Gauntlet topology presets go up to 252 vCPUs; tests whose
preset exceeds the runner’s capacity skip cleanly (or fail under
--no-skip-mode), so small runners run a subset rather than
breaking.
A minimal workflow
Builds a kernel, caches it, runs the tests:
name: CI
on:
  push:
    branches: [main]
  pull_request:
jobs:
  test:
    runs-on: [ktstr-x64]
    env:
      KTSTR_GHA_CACHE: "1"
    steps:
      - uses: actions/checkout@v5
      - uses: dtolnay/rust-toolchain@stable
      - uses: taiki-e/install-action@v2
        with:
          tool: cargo-nextest
      - name: Install ktstr
        run: cargo install --path . --locked --features remote-cache
      - name: Cache kernel images
        uses: actions/cache@v4
        with:
          path: ~/.cache/ktstr/kernels
          key: ktstr-kernels-x64-${{ hashFiles('ktstr.kconfig') }}
          restore-keys: ktstr-kernels-x64-
      - name: Build test kernel
        run: cargo ktstr kernel build
      - run: cargo ktstr test -- --profile ci --features integration
The load-bearing lines: KTSTR_GHA_CACHE: "1" enables a remote
kernel-cache layer on top of the local one (Caching); the
actions/cache key hashes ktstr.kconfig, so a kconfig change
invalidates cached kernels; --profile ci selects the nextest
profile tuned for contended runners
(Nextest CI profile); --features integration
enables ktstr’s full end-to-end suite when testing ktstr itself — in
a scheduler repo, pass your own crate’s feature flags or drop it.
The test harness auto-discovers the built kernel; to pin versions,
use the matrix below.
Kernel pinning
Pin kernel versions via the matrix strategy (this repo’s CI tests
6.14 and 7.1 this way):
strategy:
  fail-fast: false
  matrix:
    kernel-version: ['6.14', '7.1']
# then, in steps:
- run: cargo ktstr kernel build --kernel ${{ matrix.kernel-version }}
- run: cargo ktstr test --kernel ${{ matrix.kernel-version }} -- --profile ci --features integration
--kernel tells cargo ktstr test which cached kernel to use at
runtime. A major.minor prefix (e.g. 6.14) resolves to the highest
patch release in that series; see
cargo ktstr kernel for the full
resolution chain.
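The prefix rule (highest patch release in the series) can be sketched as follows. This illustrates the rule only, under the assumption that versions are plain major.minor or major.minor.patch strings; resolve_prefix is a hypothetical helper, and the real resolution chain has more steps than this.

```rust
/// Parse "major.minor" or "major.minor.patch"; a missing patch counts as 0.
fn parse(v: &str) -> Option<(u32, u32, u32)> {
    let mut it = v.split('.');
    let major = it.next()?.parse().ok()?;
    let minor = it.next()?.parse().ok()?;
    let patch = match it.next() {
        Some(p) => p.parse().ok()?,
        None => 0,
    };
    if it.next().is_some() {
        return None; // more than three components
    }
    Some((major, minor, patch))
}

/// Pick the highest patch release matching a major.minor prefix.
pub fn resolve_prefix(prefix: &str, available: &[&str]) -> Option<String> {
    let mut parts = prefix.split('.');
    let major: u32 = parts.next()?.parse().ok()?;
    let minor: u32 = parts.next()?.parse().ok()?;
    available
        .iter()
        .filter_map(|v| parse(v).map(|t| (t, *v)))
        .filter(|((ma, mi, _), _)| *ma == major && *mi == minor)
        // Compare patches numerically so that 14 ranks above 9.
        .max_by_key(|((_, _, patch), _)| *patch)
        .map(|(_, v)| v.to_string())
}
```

The numeric comparison matters: sorting the strings lexically would rank 7.0.9 above 7.0.14.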
The cache-key footgun: when testing multiple kernel versions, add
${{ matrix.kernel-version }} to the cache key and restore-keys —
the minimal workflow’s version-less key would make matrix cells
evict each other’s kernels.
Caching
actions/cache persists ~/.cache/ktstr/kernels across runs.
KTSTR_GHA_CACHE=1 adds a remote layer that shares kernels across
jobs and workflow runs; remote failures are non-fatal and the local
cache is authoritative. The remote layer is compiled in only with
--features remote-cache (off by default) — without it the variable
is a no-op, which is why the install steps above pass the feature.
If you set a global RUSTC_WRAPPER: sccache for compile caching (as
this repo’s CI does), sccache must be on $PATH on every targeted
runner — x64 and arm64 alike — or the first cargo invocation fails.
Dynamic matrix: cargo ktstr affected
On a fleet repo with many schedulers, cargo ktstr affected emits
the scheduler packages a base..HEAD diff touches, as a flat JSON
array for a GitHub Actions dynamic matrix — one job per affected
scheduler instead of building and testing everything on every push:
cargo ktstr affected # vs merge-base(HEAD, main)
# -> e.g. ["scx_lavd","scx_rusty"]
Attribution is the union of the cargo dependency closure (shared
Rust library changes) and per-scheduler dep-info parsing of the
compiled BPF sources (shared .bpf.c / header includes). The design
is fail-safe: a false negative — silently skipping an affected
scheduler — is the worst outcome, so every uncertainty (unresolvable
base, diff failure, build-graph or Cargo.lock change, unattributable
non-docs path) widens to the full testable set, never to a skip.
Only a strictly docs-only change (or base == HEAD) emits [].
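The fail-safe rule reads naturally as a decision function: anything uncertain widens to the full set. The sketch below is a hypothetical model of that policy, not the tool's code; PathClass and affected are invented names, and real attribution (dependency closure, dep-info parsing) is reduced to a pre-classified input.

```rust
use std::collections::BTreeSet;

/// Hypothetical classification of one changed path.
#[derive(Debug, Clone, PartialEq)]
pub enum PathClass {
    DocsOnly,
    /// Path attributed to these scheduler packages.
    Attributed(Vec<String>),
    /// Cargo.lock or another build-graph change.
    BuildGraph,
    /// Non-docs path that maps to no scheduler.
    Unattributable,
}

/// `diff` is None when the base is unresolvable or the diff failed.
pub fn affected(diff: Option<&[PathClass]>, all: &[&str]) -> Vec<String> {
    let full = || all.iter().map(|s| s.to_string()).collect::<Vec<String>>();
    let paths = match diff {
        Some(p) => p,
        None => return full(), // uncertainty widens, never skips
    };
    let mut hit = BTreeSet::new();
    for p in paths {
        match p {
            PathClass::DocsOnly => {}
            PathClass::Attributed(names) => hit.extend(names.iter().cloned()),
            PathClass::BuildGraph | PathClass::Unattributable => return full(),
        }
    }
    // Empty only for a docs-only diff or an empty diff (base == HEAD).
    hit.into_iter().collect()
}
```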
Only Discover (cargo-package) schedulers appear in the array —
package-less schedulers (EEVDF, kernel-builtin) have no package to
key a matrix cell on and need a separate unconditional CI leg. On a
pull_request event the baseline defaults to
merge-base(HEAD, origin/$GITHUB_BASE_REF); check out with full
history so the merge-base exists.
jobs:
  matrix:
    runs-on: [ktstr-x64]
    outputs:
      schedulers: ${{ steps.affected.outputs.schedulers }}
    steps:
      - uses: actions/checkout@v5
        with:
          fetch-depth: 0 # merge-base needs history
      - name: Install ktstr
        run: cargo install ktstr --locked
      - id: affected
        run: echo "schedulers=$(cargo ktstr affected)" >> "$GITHUB_OUTPUT"
  test:
    needs: matrix
    if: needs.matrix.outputs.schedulers != '[]'
    runs-on: [ktstr-x64]
    strategy:
      fail-fast: false
      matrix:
        scheduler: ${{ fromJSON(needs.matrix.outputs.schedulers) }}
    steps:
      - uses: actions/checkout@v5
      # ... install ktstr + nextest, restore the kernel cache, and
      # `cargo ktstr kernel build` as in the minimal workflow ...
      # Adjust the filter to how your repo organizes per-scheduler tests.
      - run: cargo ktstr test -- --profile ci -E 'package(${{ matrix.scheduler }})'
The local counterpart is cargo ktstr test --relevant, which runs
the same attribution against your working tree — see
cargo-ktstr.
Perf gate on pull requests
cargo ktstr perf-delta --noise-adjust runs the performance_mode
tests at HEAD and at the PR’s merge-base, then exits non-zero when
metrics regress with statistical confidence — a performance gate in
one step. On pull_request events the baseline resolves from
$GITHUB_BASE_REF automatically:
perf-gate:
  if: github.event_name == 'pull_request'
  runs-on: [ktstr-x64]
  steps:
    - uses: actions/checkout@v5
      with:
        fetch-depth: 0 # merge-base needs history
    # ... install ktstr + nextest as in the minimal workflow ...
    - run: cargo ktstr perf-delta --noise-adjust 5 --kernel 7.0
Budget for it: --noise-adjust 5 runs every performance_mode test
ten times (five per side). Narrow with -E or --relevant, and add
--must-fail <metric> for metrics that must never regress. See
Runs and Regression Gates for how the
verdict is computed and
A/B Compare Branches for the local
equivalent.
Budget-based test selection
Set KTSTR_BUDGET_SECS (e.g. "300") on the test step to bound a
smoke-test job: the selector greedily picks the tests that maximize
feature coverage within the time budget. See
Running Tests for the selection model.
Coverage
Same job shape as the minimal workflow, with the
llvm-tools-preview rustup component and cargo-llvm-cov added, and
the test step swapped for:
- run: cargo ktstr coverage -- --profile ci --lcov --output-path lcov.info --features integration --exclude-from-report scx-ktstr
--exclude-from-report <crate> keeps scheduler crates out of the
coverage report — the example excludes scx-ktstr, ktstr’s own
fixture scheduler.
Test statistics
- name: Test statistics
  if: ${{ !cancelled() }}
  run: cargo ktstr stats
stats reads the sidecar JSON files under target/ktstr/ and prints
gauntlet analysis, BPF verifier stats, callback profile, and KVM
stats (Runs and Regression Gates).
if: !cancelled() collects stats even when the test step failed —
which is exactly when you want them.
aarch64
aarch64 runners use the same workflows with two substitutions: runner
labels ([ktstr-arm64] or your pool’s) and the cache-key prefix
(arm64 instead of x64). The guest image name differs (Image
instead of bzImage) but ktstr handles that internally.
Performance mode
CI runners often lack CAP_SYS_NICE, rtprio limits, or enough host
CPUs for exclusive LLC reservation. Set KTSTR_NO_PERF_MODE: "1" on
the test step to disable performance mode; tests with
performance_mode=true are then skipped entirely. See
Performance Mode, and
Tests pass locally but fail in CI
for the wider skip/fail triage.
Nextest CI profile
The workspace ships a ci profile in .config/nextest.toml. VM
boots on a contended runner run slower and flake differently than on
a dev box, so the CI profile trades latency for stability — longer
slow-timeouts, one more retry, deferred failure output, and no
fail-fast:
[profile.ci]
slow-timeout = { period = "90s", terminate-after = 3 }
retries = { backoff = "exponential", count = 6, delay = "1s", jitter = true, max-delay = "3s" }
failure-output = "final"
fail-fast = false
# Heavier test classes get their own budgets, e.g.:
[[profile.ci.overrides]]
filter = "test(verifier_)"
slow-timeout = { period = "180s", terminate-after = 3 }
Use it with --profile ci. If a test in your repo drives an
unusually slow boot (huge topology, nested VM), give it its own
override rather than raising the profile-wide timeout.
The CI-relevant environment variables are KTSTR_GHA_CACHE,
KTSTR_BUDGET_SECS, KTSTR_NO_PERF_MODE, KTSTR_KERNEL, KTSTR_CI
(tags sidecars as CI-produced), and KTSTR_CACHE_DIR — see the
full reference.
Troubleshooting
Find your error message, jump to its section:
| You see | Go to |
|---|---|
| clang: No such file or directory | Build errors |
| pkg-config: command not found | Build errors |
| autoreconf: command not found | Build errors |
| busybox build requires 'make' | Build errors |
| no BTF source found | BTF errors |
| failed to obtain busybox source | busybox download failure |
| /dev/kvm not found / permission denied | /dev/kvm not accessible |
| no kernel found | No kernel found |
| scheduler 'NAME' not found | Scheduler not found |
| scheduler process died unexpectedly | Scheduler died |
| scheduler did not turn on + verifier log | Scheduler fails the BPF verifier |
| libbpf: … func_proto … incompatible with vmlinux | Scheduler cannot load: kfunc BTF mismatch |
| send_sys_rdy failed within boot budget | send_sys_rdy timeout |
| no 2MB hugepages available | Insufficient hugepages |
| tid N stuck … / unfair cgroup: spread=… | Worker assertion failures |
| cgroup-state-snapshot: … | Cgroup name typos |
| requires +cpu in parent cgroup.subtree_control | Cgroup controller not enabled |
| CpusetSpec validation failed | CpusetSpec errors |
| requires num_workers divisible by | Worker count mismatches |
| (corrupt: metadata.json malformed…) | Cache corruption |
| HOME is unset; cannot resolve cache directory | Cache directory not found |
| entries marked (stale kconfig) | Stale kconfig |
| fetch https://www.kernel.org/releases.json: … | Kernel auto-download failures |
| version X not found / RC tarball not found | Kernel download failures |
| stdin must be a terminal / -i NAME: not found | Shell mode issues |
| flock LOCK_EX … timed out / filesystem NFS is not supported | Flock timeout / NFS rejection |
| test marked SLOW, then killed by nextest | Test hangs / nextest timeout |
| green locally, red in CI | Tests pass locally but fail in CI |
Build errors
clang not found
error: failed to run custom build command for `ktstr`
...
clang: No such file or directory
The BPF skeleton build (libbpf-cargo) invokes clang to compile
.bpf.c sources. Install clang:
- Debian/Ubuntu: sudo apt install clang
- Fedora: sudo dnf install clang
pkg-config not found
error: failed to run custom build command for `libbpf-sys`
...
pkg-config: command not found
libbpf-sys uses pkg-config during its vendored build. Install it:
- Debian/Ubuntu: sudo apt install pkg-config
- Fedora: sudo dnf install pkgconf
autotools errors (autoconf, autopoint, aclocal)
autoreconf: command not found
aclocal: command not found
autopoint: command not found
The vendored libbpf-sys build compiles bundled libelf and zlib from source using autotools. These libraries are not system dependencies – they ship with libbpf-sys – but the autotools toolchain is needed to build them. Install:
- Debian/Ubuntu: sudo apt install autoconf autopoint flex bison gawk
- Fedora: sudo dnf install autoconf gettext-devel flex bison gawk
make or gcc not found
busybox build requires 'make' — install build-essential (Debian/Ubuntu) or base-devel (Fedora/Arch)
busybox build requires 'gcc' — install build-essential (Debian/Ubuntu) or base-devel (Fedora/Arch)
The build script compiles busybox from source for guest shell mode.
- Debian/Ubuntu: sudo apt install make gcc
- Fedora: sudo dnf install make gcc
BTF errors
no BTF source found. Set KTSTR_KERNEL to a kernel build directory,
or ensure /sys/kernel/btf/vmlinux exists.
build.rs generates vmlinux.h from kernel BTF data. It searches
the kernel discovery chain (KTSTR_KERNEL, ./linux, ../linux,
installed kernel) for a vmlinux file, falling back to
/sys/kernel/btf/vmlinux. Most distros ship
/sys/kernel/btf/vmlinux with CONFIG_DEBUG_INFO_BTF enabled.
Fixes:
- Verify BTF is available: ls /sys/kernel/btf/vmlinux
- If missing, set KTSTR_KERNEL to a kernel build directory that contains a vmlinux with BTF: export KTSTR_KERNEL=/path/to/linux
- Build a kernel with CONFIG_DEBUG_INFO_BTF=y.
- Some minimal/cloud kernels strip BTF. Use a distro kernel or build your own.
busybox download failure
failed to obtain busybox source after 4 attempts.
tarball (https://github.com/mirror/busybox/archive/refs/tags/1_36_1.tar.gz): ...
Remediation:
• Check network connectivity (the build script needs HTTPS access to github.com to fetch the upstream tarball).
• If behind a proxy, ensure HTTP_PROXY/HTTPS_PROXY environment variables are set.
• Or set KTSTR_BUSYBOX_TARBALL=<path> to point at a pre-fetched local copy.
• Or set KTSTR_SKIP_BUSYBOX_BUILD=1 to skip the busybox compile entirely (shell mode will be unavailable).
build.rs downloads the busybox tarball on first build (4 attempts
with backoff); subsequent builds use the cached binary. Follow the
remediation lines in the error itself — after one successful build,
no network access is needed unless cargo clean removes the cached
binary.
/dev/kvm not accessible
The host-side pre-flight emits one of the following, depending on whether the device node is missing or merely unreadable:
/dev/kvm not found. KVM requires:
- Linux kernel with KVM support (CONFIG_KVM)
- Access to /dev/kvm (check permissions or add user to 'kvm' group)
- Hardware virtualization enabled in BIOS (VT-x/AMD-V)
/dev/kvm: permission denied. Add your user to the 'kvm' group:
sudo usermod -aG kvm $USER
then log out and back in.
ktstr boots Linux kernels in KVM virtual machines. The host must have
KVM enabled and the user must have read+write access to /dev/kvm.
Diagnose:
- ls -l /dev/kvm (typical output: crw-rw---- 1 root kvm 10, 232 ...)
- getent group kvm: confirm the group exists and see its members.
Fixes:
- Load the KVM module: modprobe kvm_intel or modprobe kvm_amd.
- Follow the group-membership hint in the error text (log out and back in afterward).
- On cloud VMs (GCP, AWS, Azure) or nested hypervisors, nested virtualization is typically off by default. Enable it per the provider’s instructions (e.g. GCP --enable-nested-virtualization, AWS .metal instance types, Azure Dv3/Ev3+ with nested virt).
- In CI, ensure the runner has KVM access; see CI.
No kernel found
no kernel found — the test harness was likely invoked outside `cargo ktstr test` (which builds and injects a kernel automatically).
hint: run `cargo ktstr test --kernel <path-or-version>` to drive this test, or set KTSTR_TEST_KERNEL=/path/to/{bzImage|Image} to point at a pre-built bootable image directly.
hint: set KTSTR_KERNEL to one of: exact version (`6.14`), inclusive range (`6.14..7.0` or `6.14..=7.0`), git source (`git+URL#tag=NAME`, `git+URL#branch=NAME`, or `git+URL#sha=<40-hex>`), absolute or `~`-prefixed path, or cache key. List cached keys with `cargo ktstr kernel list`; build new ones with `cargo ktstr kernel build`
On aarch64 the first hint’s image filename is Image instead of
bzImage. ktstr needs a bootable kernel image; see
cargo ktstr kernel for the discovery
chain. ktstr shell and cargo ktstr shell auto-download the latest
stable kernel when nothing is found — see
Kernel auto-download failures for
download-specific errors.
Fixes:
- Download and cache a kernel: cargo ktstr kernel build
- Build from a local tree: cargo ktstr kernel build --kernel ../linux
- Set KTSTR_TEST_KERNEL to an explicit image path.
- The host’s installed kernel works for basic testing.
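The spec forms named in the second hint can be told apart by syntax alone. The following is a minimal sketch of such a classifier, assuming a check order (git source, then path, then range) that ktstr itself may not use. The KernelSpec enum and classify function are hypothetical names, not ktstr's API, and exact versions and cache keys share one bucket because syntax alone cannot separate them.

```rust
/// Hypothetical classification of a KTSTR_KERNEL / --kernel value.
/// Illustrative only; ktstr's real parser may differ.
#[derive(Debug, PartialEq)]
pub enum KernelSpec<'a> {
    /// `git+URL#tag=NAME`, `git+URL#branch=NAME`, or `git+URL#sha=<40-hex>`
    Git { url: &'a str, selector: &'a str },
    /// `A..B` or `A..=B`; the hint text describes both spellings as inclusive
    Range { start: &'a str, end: &'a str },
    /// Absolute or `~`-prefixed path
    Path(&'a str),
    /// Exact version or cache key
    VersionOrCacheKey(&'a str),
}

pub fn classify(spec: &str) -> KernelSpec<'_> {
    // Git sources carry a `git+` prefix and a `#selector` suffix.
    if let Some(rest) = spec.strip_prefix("git+") {
        if let Some((url, selector)) = rest.split_once('#') {
            return KernelSpec::Git { url, selector };
        }
    }
    // Paths are checked before ranges so that `/a/../b` is not read as a range.
    if spec.starts_with('/') || spec.starts_with('~') {
        return KernelSpec::Path(spec);
    }
    if let Some((start, end)) = spec.split_once("..") {
        let end = end.strip_prefix('=').unwrap_or(end);
        return KernelSpec::Range { start, end };
    }
    KernelSpec::VersionOrCacheKey(spec)
}
```

In this sketch, classify("6.14..=7.0") and classify("6.14..7.0") yield the same Range value, mirroring the hint's statement that both spellings are inclusive.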
Scheduler not found
scheduler 'scx_mitosis' not found. Set KTSTR_SCHEDULER or
place it next to the test binary or in target/{debug,release}/
SchedulerSpec::Discover resolves the scheduler binary entirely on
the host. The order depends on how the test was launched:
Under cargo ktstr test (the normal path):
- KTSTR_SCHEDULER_BIN_<NAME>, then KTSTR_SCHEDULER env overrides.
- cargo build -p <scheduler>: the build runs up front, so an edited scheduler is never validated against a stale pre-built binary. If that build fails, the test hard-fails rather than falling back; set KTSTR_SCHEDULER_ALLOW_STALE_FALLBACK=1 to re-enable the sibling / target/{debug,release}/ pre-built fallback while the workspace build is broken.
Under bare cargo test / cargo nextest run (marked with
KTSTR_CARGO_TEST_MODE=1):
- The env overrides, with $PATH also consulted, so an installed scheduler binary resolves without an in-tree build.
- Sibling of the test binary, then the target/release/ and target/debug/ build dirs; the scheduler’s build profile (release by default) is probed first.
- The on-demand cargo build -p <scheduler> runs last, only after the pre-built probes miss.
Fixes:
- cargo build -p scx_mitosis: on the orchestrated path this only primes the cache; on the bare path it makes the probe hit.
- Set KTSTR_SCHEDULER=/path/to/binary (or the per-name KTSTR_SCHEDULER_BIN_<NAME> variant).
- Use SchedulerSpec::Path for an explicit path.
Scheduler died
scheduler process died unexpectedly after completing step 2 of 5 (12.3s into test)
The scheduler process died while the scenario was running — usually a
crash. The exact message varies by when the crash was detected. The
failure output contains diagnostic sections (each present only when
relevant): --- scheduler log --- (the scheduler’s own output,
cycle-collapsed), --- diagnostics --- (init stage, VM exit code,
kernel console tail), and --- sched_ext dump --- (when a SysRq-D
dump fired). Set RUST_BACKTRACE=1 to force --- diagnostics --- on
all failures.
Next steps:
- Read the --- scheduler log --- for the crash reason; see Reading Failure Output for the full section-by-section anatomy.
- A second VM automatically reproduces the crash with BPF probes attached; see Auto-Repro.
- Follow Investigate a Crash for the crash-to-pin workflow.
Scheduler fails the BPF verifier
verifier
scheduler: NOT ATTACHED — scheduler process exited during BPF load/startup
verifier --- verifier stats ---
processed=186 states=7/7
verifier --- scheduler log ---
Global function ktstr_dispatch() doesn't return scalar. Only those are supported.
0: R1=ctx() R10=fp0
; if (crash) @ main.bpf.c:423
0: (18) r1 = 0xff5d3bb3000f60dc ; R1=map_value(map=bpf_bpf.bss,ks=4,vs=280,off=220)
...
; *p = (int)acc; @ main.bpf.c:464
191: (61) r2 = *(u32 *)(r10 -8) ; R2=scalar(id=53,smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) R10=fp0 fp-8=mmmmscalar(id=53,smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))
192: (63) *(u32 *)(r1 +0) = r2
R1 invalid mem access 'scalar'
processed 186 insns (limit 1000000) max_states_per_insn 0 total_states 7 peak_states 7 mark_read 0
The in-guest BPF verifier rejected the program, so the scheduler
never attached. Read the log bottom-up: the last few lines name the
rejected instruction (R1 invalid mem access 'scalar') and the
source line the C-line comments (@ main.bpf.c:464) map it to. The
first line is the verifier’s summary of the top-level complaint.
Verifier acceptance depends on kernel version and topology —
values like nr_cpus bake into .rodata, so a program that
verifies on one CPU count can blow up on another. Sweep your
scheduler across kernels and topologies with
cargo ktstr verifier, which also
collapses repeated loop iterations (--- N identical iterations omitted ---) so real rejections stay readable.
Scheduler cannot load: kfunc BTF mismatch
--- scheduler log ---
libbpf: extern (func ksym) 'scx_bpf_create_dsq': func_proto [755] incompatible with vmlinux [54769]
libbpf: failed to load BPF skeleton 'bpf_bpf': -EINVAL
Error: Failed to load BPF program
ktstr surfaces this as “scheduler did not turn on — scheduler process exited during BPF load/startup” in verifier cells, or as a scheduler
death / “no test result received from guest” in test runs, with the
libbpf lines above in the scheduler log.
The cause is the kernel image, not your scheduler. Newer kernels
(first released in v7.1) give scx kfuncs an implicit trailing
struct bpf_prog_aux *aux argument; kernel build tooling
(resolve_btfids, driven by pahole’s decl_tag_kfuncs BTF feature)
is supposed to publish a BPF-facing twin of each kfunc with the
trimmed prototype so schedulers built against released scx headers
and libbpf still match. When the toolchain drops that tag for a
kfunc — observed with some pahole builds, and varying by config —
the plain-name prototype keeps the extra argument and no released
scheduler can load on that kernel.
Check any kernel in one command:
bpftool btf dump file <vmlinux> format raw | grep -E "FUNC 'scx_bpf_(create_dsq|error_bstr)"
# loadable: the plain name points at a trimmed proto (no 'aux' param)
# broken: a single 4-arg entry — released libbpf/scx headers cannot match it
Warning
expect_err = true tests invert this load failure into a pass, and post_vm assertions skip when the scheduler never attached, so a suite can look green with zero schedulers ever loading. If a kernel’s expect_err tests all “pass” while everything else reports the scheduler never turned on, check the kernel’s BTF before trusting the run.
Fixes:
- Test against a kernel whose BTF passes the check above (kernels before the implicit-args change, e.g. --kernel 7.0 or --kernel 6.14, are unaffected).
- Rebuild the kernel with a pahole/toolchain combination that preserves the kfunc tags, and re-run the check.
send_sys_rdy timeout
WARN ktstr::vmm::rust_init: ktstr-init: send_sys_rdy failed within boot budget; see https://ktstr.dev/guide/troubleshooting.html#send_sys_rdy-timeout budget_ms=11200 vcpus=8 elapsed_ms=11342 port_exists=false kern_addrs_sent=false
The guest init could not send its “ready” signal to the host within the boot budget (10 s plus 150 ms per vCPU, capped at 90 s). The WARN itself is non-fatal — the guest continues and the host starts sampling anyway — but the test usually then fails through the normal VM-teardown path (see Scheduler died); the authoritative deadline is the host watchdog, which scales with host overcommit.
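The budget rule stated above (10 s plus 150 ms per vCPU, capped at 90 s) can be sketched as a small function. The function and constant names are illustrative assumptions, not ktstr identifiers; only the arithmetic comes from the text.

```rust
// Hypothetical names; the values are the ones stated in the text.
const BASE_MS: u64 = 10_000; // 10 s base
const PER_VCPU_MS: u64 = 150; // 150 ms per vCPU
const CAP_MS: u64 = 90_000; // 90 s cap

/// Boot budget in milliseconds for a guest with `vcpus` vCPUs.
pub fn boot_budget_ms(vcpus: u64) -> u64 {
    BASE_MS
        .saturating_add(PER_VCPU_MS.saturating_mul(vcpus))
        .min(CAP_MS)
}
```

For vcpus=8 this gives 11200, which matches the budget_ms=11200 field in the sample WARN line. Under this formula the cap takes effect from 534 vCPUs upward.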
The diagnostic fields split the cause in two:
- port_exists=false: the virtio-console port device never appeared in the guest. Almost always a slow or starved boot (or an early guest panic; check the --- diagnostics --- console tail).
- port_exists=true: the port exists but writes did not complete. This is a host-side virtio-console issue, not guest CPU contention; file a bug with the failure dump.
Fixes (for the port_exists=false case):
- Pass --no-perf-mode (or KTSTR_NO_PERF_MODE=1) to reduce host-side contention starving the guest’s vCPU threads.
- Reduce the test’s topology; fewer vCPUs boot faster.
- KASAN / KCSAN / lockdep kernels add substantial boot overhead; re-run on a non-instrumented kernel to separate instrumentation cost from a real stall.
Insufficient hugepages
performance_mode: WARNING: no 2MB hugepages available, guest memory will use regular pages
performance_mode: WARNING: need N 2MB hugepages, only K free — falling back to regular pages
Performance mode requests 2MB hugepages for guest memory. The first form fires when none are reserved on the host; the second when fewer than the run needs. In both cases the VM falls back to regular pages and continues to boot.
Fix:
echo 2048 | sudo tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
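To pick a reservation size other than 2048, divide the guest memory by the 2 MiB page size. The sketch below assumes the count is rounded up to whole pages; that rounding, and the function names, are assumptions rather than ktstr's documented behavior.

```rust
const HUGEPAGE_BYTES: u64 = 2 * 1024 * 1024; // one 2 MiB hugepage

/// Number of 2 MiB hugepages needed to back `guest_mem_bytes`,
/// rounding up to whole pages (assumed).
pub fn hugepages_needed(guest_mem_bytes: u64) -> u64 {
    guest_mem_bytes / HUGEPAGE_BYTES + u64::from(guest_mem_bytes % HUGEPAGE_BYTES != 0)
}

/// True when the VM would fall back to regular pages, per the two
/// warning forms above (none free, or fewer free than needed).
pub fn falls_back(needed: u64, free: u64) -> bool {
    needed > 0 && free < needed
}
```

The 2048 pages written by the Fix command correspond to 4 GiB of guest memory in total.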
Worker assertion failures
tid 2 stuck 4500ms on cpu2 at +3200ms (threshold 3000ms)
unfair cgroup: spread=42% (8-50%) 4 workers on 4 cpus (threshold 35%)
The Assert checks (max_gap_ms, max_spread_pct, etc.) detected a
worker metric outside the configured thresholds. The tid N prefix
names the thread so you can cross-reference the --- timeline ---
and --- stats --- sections, which key per-thread metrics by tid;
unfair cgroup is per-cgroup and cross-references the per-cgroup
spread / workers / cpus columns in --- stats --- instead.
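The sample numbers suggest how the two checks compare a metric with its threshold. The sketch below assumes that spread is the maximum share minus the minimum share in percentage points (42 = 50 - 8 in the sample line) and that a check fails only when the observed value is strictly greater than the threshold. Both are inferences from the messages, and the function names are hypothetical.

```rust
/// Spread in percentage points: max share minus min share (assumed definition).
/// Returns None for an empty slice.
pub fn spread_pct(shares_pct: &[u32]) -> Option<u32> {
    let min = shares_pct.iter().min()?;
    let max = shares_pct.iter().max()?;
    Some(max - min)
}

/// True when an observed metric violates its threshold
/// (strictly greater is an assumption).
pub fn exceeds(observed: u64, threshold: u64) -> bool {
    observed > threshold
}
```

With shares of 8% and 50% at the extremes, spread_pct returns 42, which exceeds the 35% threshold; a 4500 ms gap likewise exceeds the 3000 ms threshold.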
Fixes:
- Check whether the topology has enough CPUs for the scenario — small topologies produce higher contention, larger gaps, and more spread.
- Override thresholds for scenarios that need relaxed limits — see Customize Checking.
- Check the scheduler’s behavior under the specific flag profile that triggered the failure.
Cgroup name typos
A typo’d cgroup name surfaces only when an op tries to write to a non-existent cgroup directory; names are not pre-validated. The diagnostic depends on which op references the typo:
- Op::RemoveCgroup / Op::StopCgroup against a typo silently succeed (rmdir / kill against a non-existent path are no-ops); the failure surfaces on the next op that touches the name.
- Op::SetCpuset falls through to the kernel’s ENOENT, wrapped with a one-line cgroup-state snapshot:
cgroup-state-snapshot: parent=/sys/fs/cgroup/ktstr name=nonexistent parent.cgroup.controllers="cpuset cpu memory io pids" parent.cgroup.subtree_control="cpuset cpu memory" child.cgroup.controllers="<read failed: No such file or directory (os error 2)>" child.cpuset.cpus.exists=false child.listing=<read_dir failed: No such file or directory (os error 2)>: No such file or directory (os error 2)
The child.listing=<read_dir failed: ...> segment is the tell: a typo’d name has no directory to list, distinguishing this from “cgroup exists but the write was rejected” (where the listing would enumerate the cgroupfs knobs).
- Other setters (cpu.max, memory.max, cpuset.mems, …) against a typo produce the same wrapped form as Cgroup controller not enabled; distinguish by checking whether the directory exists.
- Op::AddCgroup colliding with an already-tracked name bails:
Op::AddCgroup 'cg_0' collides with a cgroup already tracked (by a prior Backdrop or step-local CgroupDef) — declare it in exactly one place; use a fresh name for the step-local cgroup
Fixes: verify the name matches its Op::AddCgroup /
CgroupDef::named() / Backdrop.cgroups declaration, and that
dynamically formatted names (format!("cg_{i}")) use the same
formatting everywhere.
Cgroup controller not enabled
cgroup 'cg_0': set cpu.max='100000 100000' (requires +cpu in parent cgroup.subtree_control): No such file or directory (os error 2)
cgroup 'cg_0': set memory.max='4294967296' (requires +memory in parent cgroup.subtree_control): No such file or directory (os error 2)
cgroup 'cg_0': set memory.swap.max='1073741824' (requires +memory in parent cgroup.subtree_control; file absent on CONFIG_SWAP=n kernels): No such file or directory (os error 2)
cgroup 'cg_0': set cpuset.mems='0-1' (requires +cpuset in parent cgroup.subtree_control): No such file or directory (os error 2)
The cgroup exists but the controller knob is missing from its
directory. ktstr’s setup auto-enables the controllers it detects on
the scenario’s CgroupDef / Op set, so a missing controller means
either: the framework’s detection did not see a declared knob (file
a bug); an outer parent (systemd user.slice, container runtime)
stripped controllers from the subtree before ktstr ran; or the
kernel was built without CONFIG_SWAP (the memory.swap.max wrap
spells this out).
Diagnostic command:
cat /sys/fs/cgroup/<parent>/cgroup.subtree_control
A controller named in the wrapped error must appear in this list; if
it does not, fix the parent first (echo '+memory' > .../cgroup.subtree_control from a sufficiently-privileged shell) or
remove the knob from the scenario.
CpusetSpec errors
cgroup 'cg_0': CpusetSpec validation failed: not enough usable CPUs (4) for 8 partitions
cgroup 'cg_1': CpusetSpec validation failed: index 3 >= partition count 3
cgroup 'cg_2': CpusetSpec validation failed: Range fracs must lie in [0.0, 1.0]: start_frac=-1, end_frac=0.5
A CpusetSpec cannot produce a valid cpuset for the test topology;
the step aborts as a hard error before any downstream slicing runs.
Fixes:
- Guard with a topology check before creating the step: if ctx.topo.usable_cpus().len() < needed { return Ok(AssertResult::skip(...)); }
- Call CpusetSpec::validate(&ctx) in your scenario builder so failures surface before execute_steps runs.
- Reduce the partition count, or use CpusetSpec::Llc instead of Disjoint on topologies with fewer CPUs than partitions.
- For Range / Overlap, keep fractions finite and inside [0.0, 1.0]; Range additionally requires start_frac < end_frac.
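The three sample errors correspond to three simple preconditions. The following standalone sketch restates them; SpecError, check_partition, and check_range are hypothetical names for illustration and do not reproduce CpusetSpec::validate.

```rust
/// Hypothetical error type mirroring the three sample messages.
#[derive(Debug, PartialEq)]
pub enum SpecError {
    NotEnoughCpus { usable: usize, partitions: usize },
    IndexOutOfRange { index: usize, partitions: usize },
    FractionOutOfBounds,
    EmptyRange,
}

/// A partition index must be below the partition count, and there must be
/// at least one usable CPU per partition.
pub fn check_partition(index: usize, partitions: usize, usable: usize) -> Result<(), SpecError> {
    if usable < partitions {
        return Err(SpecError::NotEnoughCpus { usable, partitions });
    }
    if index >= partitions {
        return Err(SpecError::IndexOutOfRange { index, partitions });
    }
    Ok(())
}

/// Range fractions must be finite, lie in [0.0, 1.0], and satisfy start < end.
pub fn check_range(start_frac: f64, end_frac: f64) -> Result<(), SpecError> {
    let in_bounds = |f: f64| f.is_finite() && (0.0..=1.0).contains(&f);
    if !in_bounds(start_frac) || !in_bounds(end_frac) {
        return Err(SpecError::FractionOutOfBounds);
    }
    if start_frac >= end_frac {
        return Err(SpecError::EmptyRange);
    }
    Ok(())
}
```

Running checks like these while building the scenario serves the same purpose as the validate call suggested above: the failure appears before any step executes.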
Worker count mismatches
PipeIo (group 0) requires num_workers divisible by 2, got 3
Grouped work types (PipeIo, FutexPingPong, CachePipe,
FutexFanOut, FanOutCompute, and the contention / waker families —
see Workers and Workloads) require
num_workers divisible by their group size. The (group N) segment
names the composed entry the violation belongs to, so multi-group
scenarios point at the entry to fix.
Fixes:
- Set CgroupDef::workers(n) to a multiple of the work type’s group size (2 for pipe/futex pairs, fan_out + 1 for FutexFanOut and FanOutCompute).
- Use an ungrouped work type (SpinWait, Mixed, Bursty, IoSyncWrite, IoRandRead, IoConvoy, YieldHeavy) if worker count flexibility is needed.
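The divisibility rule is easy to check before building a scenario. This is a minimal sketch with hypothetical helper names; the group sizes (2 for pairs, fan_out + 1 for the fan-out types) come from the text, and the error string is illustrative rather than ktstr's exact wording.

```rust
/// Group size for pipe/futex pair work types.
pub const PAIR_GROUP: usize = 2;

/// Group size for FutexFanOut / FanOutCompute.
pub fn fan_out_group(fan_out: usize) -> usize {
    fan_out + 1
}

/// Ok when `num_workers` is a multiple of `group`.
pub fn check_workers(num_workers: usize, group: usize) -> Result<(), String> {
    if group == 0 {
        return Err("group size must be non-zero".to_string());
    }
    if num_workers % group != 0 {
        return Err(format!(
            "requires num_workers divisible by {group}, got {num_workers}"
        ));
    }
    Ok(())
}

/// Smallest multiple of `group` that is >= `num_workers`.
pub fn round_up_to_group(num_workers: usize, group: usize) -> usize {
    assert!(group > 0, "group size must be non-zero");
    match num_workers % group {
        0 => num_workers,
        rem => num_workers + (group - rem),
    }
}
```

For the sample failure, round_up_to_group(3, PAIR_GROUP) suggests 4 workers; with fan_out = 3 the group size is 4, so 5 workers would round up to 8.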
Cache corruption
6.14.2-tarball-x86_64-kc... (corrupt: metadata.json malformed: ...)
warning: entries marked (corrupt) cannot be used — cached metadata is missing, malformed, or references a missing image. Inspect the entry directory under ~/.cache/ktstr/kernels to remove it manually, or run `kernel clean --corrupt-only --force` which removes ONLY corrupt entries and leaves valid ones intact. ...
A cached kernel entry has missing, unparseable, or schema-drifted
metadata.json, or references an image that is no longer present —
typically after a partial write (disk full, killed process) or a
ktstr upgrade that changed the metadata schema. Corrupt entries are
never used; runs fall through to a rebuild. The JSON listing
(kernel list --json) carries a stable error_kind token per
corrupt entry for CI scripts — see
cargo ktstr kernel.
Fixes:
- Remove only corrupt entries: cargo ktstr kernel clean --corrupt-only --force
- Rebuild a specific version after cleanup: cargo ktstr kernel build --force --kernel 6.14.2
- Move the cache with KTSTR_CACHE_DIR if the default location is on a problematic filesystem.
Stale vmlinux.btf or default.profraw in kernel source tree
Older ktstr versions could leave two files in a kernel source
directory: <source>/vmlinux.btf (a BTF sidecar, now written only
inside the cache root) and <source>/default.profraw (an LLVM
coverage artifact, now redirected next to the cargo-ktstr binary).
Both are leftover state and safe to remove:
rm -f /path/to/linux/vmlinux.btf /path/to/linux/default.profraw
If they keep reappearing, you are running an old ktstr binary — rebuild or reinstall, then delete again. See profraw layout for where coverage artifacts land now.
Cache directory not found
HOME is unset; cannot resolve cache directory. The container init or login shell did not assign HOME — set it to an absolute path, or set KTSTR_CACHE_DIR to an absolute path (e.g. /tmp/ktstr-cache) or XDG_CACHE_HOME to specify a cache location explicitly.
HOME is set to the empty string; cannot resolve cache directory. An empty HOME usually means a Dockerfile or shell rc has `export HOME=` or `ENV HOME=` with no value. Either set HOME to a real absolute path, or set KTSTR_CACHE_DIR to an absolute path (e.g. /tmp/ktstr-cache) or XDG_CACHE_HOME to specify a cache location explicitly.
The kernel image cache requires a writable directory, resolved as
KTSTR_CACHE_DIR > $XDG_CACHE_HOME/ktstr/ > $HOME/.cache/ktstr/.
The first form fires when HOME is absent (bare container inits,
systemd units without Environment=HOME=…); the second when HOME
is set to the empty string.
Fix: Set KTSTR_CACHE_DIR to an explicit path, or ensure HOME
is a real absolute path.
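The precedence chain can be sketched as a lookup over an injected environment reader. The function name is hypothetical, the error strings are shortened, and treating an empty KTSTR_CACHE_DIR or XDG_CACHE_HOME as unset is an assumption; only the ordering and the two HOME failure cases come from the text. The sketch returns the cache root and does not append any per-type subdirectory.

```rust
use std::path::PathBuf;

/// Resolve the cache root: KTSTR_CACHE_DIR > $XDG_CACHE_HOME/ktstr > $HOME/.cache/ktstr.
/// `get` reads one environment variable (injected so the logic is testable).
pub fn resolve_cache_root(get: impl Fn(&str) -> Option<String>) -> Result<PathBuf, String> {
    // Assumption: an empty override counts as unset.
    let non_empty = |key: &str| get(key).filter(|v| !v.is_empty());

    if let Some(dir) = non_empty("KTSTR_CACHE_DIR") {
        return Ok(PathBuf::from(dir));
    }
    if let Some(xdg) = non_empty("XDG_CACHE_HOME") {
        return Ok(PathBuf::from(xdg).join("ktstr"));
    }
    match get("HOME") {
        None => Err("HOME is unset".to_string()),
        Some(home) if home.is_empty() => Err("HOME is set to the empty string".to_string()),
        Some(home) => Ok(PathBuf::from(home).join(".cache").join("ktstr")),
    }
}
```

In real use the reader would be something like `|k| std::env::var(k).ok()`; passing a closure keeps the sketch free of process-global state.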
Stale kconfig
warning: entries marked (stale kconfig) were built against a different ktstr.kconfig. Rebuild with: kernel build --force --kernel <entry version> (add --extra-kconfig PATH if the entry also carries the (extra kconfig) tag).
cargo ktstr kernel list marks entries whose stored kconfig hash
differs from the current embedded ktstr.kconfig fragment — typical
after updating ktstr. Stale entries rebuild automatically on the next
cargo ktstr kernel build; --force overrides the cache for other
reasons.
Kernel auto-download failures
ktstr: no kernel found, downloading latest stable
fetch https://www.kernel.org/releases.json: <error>
ktstr auto-downloads a kernel when no --kernel is specified and
the discovery chain finds nothing; the same path runs when --kernel
names a version not in the cache. The <error> is the underlying
network error (DNS, connection refused, timeout, TLS). Variants:
fetch https://www.kernel.org/releases.json: HTTP 503
kernel.org returned a non-success status.
no stable kernel with patch >= 8 found in releases.json
ktstr requires a stable or longterm release with patch version >= 8 to avoid brand-new majors with build issues; releases.json contained no qualifying version.
extract tarball: <error>
Disk full, bad permissions on the temp directory, or a truncated download.
Fixes:
- Verify connectivity: curl -sI https://www.kernel.org/releases.json
- If behind a proxy, set HTTP_PROXY / HTTPS_PROXY / NO_PROXY.
- Check disk space; override the cache location with KTSTR_CACHE_DIR if needed.
- Pre-download explicitly: cargo ktstr kernel build --kernel 6.14.10 isolates version resolution from download failures.
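The "patch >= 8" rule described above can be illustrated with a small selector over (moniker, version) pairs. This is a sketch under stated assumptions: the function names are hypothetical, picking the numerically highest qualifying version is a guess at what "latest" means, and the version strings in the usage note are made-up inputs rather than real releases.json content.

```rust
/// Parse "major.minor" or "major.minor.patch"; anything else (e.g. "-rc" tags) is None.
fn parse_version(v: &str) -> Option<(u32, u32, u32)> {
    let mut parts = v.split('.');
    let major = parts.next()?.parse().ok()?;
    let minor = parts.next()?.parse().ok()?;
    let patch = match parts.next() {
        Some(p) => p.parse().ok()?,
        None => 0,
    };
    if parts.next().is_some() {
        return None;
    }
    Some((major, minor, patch))
}

/// Pick the highest stable or longterm version whose patch component is >= 8.
pub fn pick_latest_stable<'a>(releases: &[(&str, &'a str)]) -> Option<&'a str> {
    releases
        .iter()
        .filter(|(moniker, _)| *moniker == "stable" || *moniker == "longterm")
        .filter_map(|(_, v)| parse_version(v).map(|parsed| (parsed, *v)))
        .filter(|((_, _, patch), _)| *patch >= 8)
        .max_by_key(|(parsed, _)| *parsed)
        .map(|(_, v)| v)
}
```

Given the illustrative inputs ("stable", "7.0.3"), ("stable", "6.18.12"), and ("longterm", "6.12.40"), the sketch skips 7.0.3 because its patch level is below 8 and returns 6.18.12. When nothing qualifies it returns None, which corresponds to the "no stable kernel with patch >= 8 found" error.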
Kernel download failures
These fire when an explicit version is requested:
version 6.14.22 not found. latest 6.14.x: 6.14.10
The requested version does not exist; when a sibling in the same series is available, the error suggests it. An EOL series gets only the bare “not found”.
RC tarball not found: https://git.kernel.org/torvalds/t/linux-6.15-rc3.tar.gz
RC releases are removed from git.kernel.org after the stable version ships.
Use --kernel git+URL#tag=NAME with a git.kernel.org URL to clone
the tag instead.
download ...: server returned HTML instead of tarball (URL may be invalid)
Some CDN error pages return HTTP 200 with HTML; the download rejects
these responses. Check the URL / version against
https://www.kernel.org/releases.json.
Shell mode issues
stdin must be a terminal
stdin must be a terminal for interactive shell mode
cargo ktstr shell requires a terminal for bidirectional I/O
forwarding; piped or redirected stdin is rejected.
include file not found
-i strace: not found in filesystem or PATH
Bare names (without /, ., or ..) are searched in PATH; if the
binary is not there, use an explicit path.
--include-files path not found: ./missing-file
Explicit paths must exist on disk.
include directory contains no files
warning: -i ./empty-dir: directory contains no regular files
The directory was walked recursively but contained no regular files (FIFOs, device nodes, and sockets are skipped).
Flock timeout / NFS rejection
flock LOCK_EX on run-dir target/ktstr/6.14-abc1234 timed out after
30s (lockfile target/ktstr/.locks/6.14-abc1234.lock, holders:
pid=12345 cmd=cargo-ktstr test --kernel 6.14). A peer cargo
ktstr test process is writing sidecars to the same
{kernel}-{project_commit} directory; wait for it to finish or kill
it, then retry.
A peer process is holding the per-run-key advisory flock(2) that
serializes sidecar writes; the helper polled for 30 s and gave up.
Run-dir locks live at {runs_root}/.locks/{kernel}-{project_commit}.lock
and serialize the pre-clear + write cycle so two concurrent runs
sharing a key cannot tear each other’s sidecars.
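The path layout above can be written out as two small helpers. The function names are hypothetical; the layout itself ({runs_root}/{kernel}-{project_commit} for the run directory and {runs_root}/.locks/{kernel}-{project_commit}.lock for its lockfile) is taken from the text and the sample message.

```rust
use std::path::{Path, PathBuf};

/// Run directory for one {kernel}-{project_commit} key.
pub fn run_dir(runs_root: &Path, kernel: &str, project_commit: &str) -> PathBuf {
    runs_root.join(format!("{kernel}-{project_commit}"))
}

/// Advisory lockfile that serializes writes to that run directory.
pub fn run_lock_path(runs_root: &Path, kernel: &str, project_commit: &str) -> PathBuf {
    runs_root
        .join(".locks")
        .join(format!("{kernel}-{project_commit}.lock"))
}
```

With a runs root of target/ktstr, kernel 6.14, and commit abc1234, these reproduce the two paths in the sample timeout message, which is useful when matching a lockfile to a run directory by hand.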
target/ktstr/.locks/6.14-abc1234.lock: filesystem NFS is not
supported for ktstr lockfiles (NFSv3 is advisory-only without
an NLM peer; NFSv4 byte-range locking does not cover flock(2)).
Move the lockfile path to a local filesystem (tmpfs, ext4, xfs,
btrfs, f2fs, bcachefs).
ktstr rejects NFS, CIFS, SMB2, CephFS, AFS, and FUSE mounts for
lockfiles because flock(2) semantics there are unreliable — see
Resource Budget for the rationale.
Diagnose:
- cargo ktstr locks (or ktstr locks --watch 1s) prints every ktstr flock currently held on the host with PID + cmdline; see ktstr (standalone).
- cat /proc/locks | grep '<lockfile-path-from-error>' falls back to the kernel’s own flock enumeration when the holder is outside ktstr.
- stat -f -c '%T' <runs-root> reports the filesystem type.
Fix:
- Peer-holder timeout: wait for the peer, kill it (kill <pid> from the holder list), or retry.
- NFS / remote-fs rejection: relocate the runs root to a local filesystem via KTSTR_SIDECAR_DIR, noting that the override path also skips the cross-process flock, so give each concurrent run its own path. The kernel cache’s lockfiles face the same constraint; override KTSTR_CACHE_DIR if the default resolves to NFS.
Test hangs / nextest timeout
A VM test that stops making progress is eventually flagged SLOW by
nextest and then terminated when it exceeds the profile’s
slow-timeout budget: 60 s × 2 periods on the default profile, 90 s ×
3 on the ci profile, with larger per-test overrides for heavy
classes (verifier sweeps 180 s, the wide-SMP boots up to 960 s — see
.config/nextest.toml).
ktstr’s own per-VM watchdog is sized to fire before nextest’s kill so you get a failure dump instead of a blunt termination. If nextest kills first, you lose the dump — so:
- Re-run just the failing test with its exact variant name and read the dump; see Reading Failure Output.
- Check for peers holding CPU locks (cargo ktstr locks); a contended host makes VM boots slow enough to blow timeouts.
- On a busy or small machine, pass --no-perf-mode and use --profile ci locally for the bigger budgets.
- If one test legitimately needs longer (huge topology), give it a per-test override in .config/nextest.toml rather than raising the profile-wide timeout.
Tests pass locally but fail in CI
Common causes:
- No KVM: CI runners need hardware virtualization. Check for /dev/kvm access.
- Fewer CPUs: gauntlet topology presets up to 252 CPUs may exceed the runner’s capacity. Use smaller topologies.
- No kernel: set KTSTR_TEST_KERNEL in the CI environment, or build and cache one per CI.
- No CAP_SYS_NICE or rtprio: performance-mode tests require CAP_SYS_NICE or an rtprio limit for RT scheduling, and enough host CPUs for exclusive LLC reservation. Pass --no-perf-mode (or set KTSTR_NO_PERF_MODE=1) to disable all performance mode features; tests with performance_mode=true are then skipped entirely.
- Debug thresholds: CI often runs debug builds. Debug builds use relaxed thresholds (3000ms gap, 35% spread) but may still hit limits on slow runners. See Checking.
Environment Variables
Every environment variable ktstr reads, grouped by task. Unless a row says otherwise, an empty value is treated the same as unset — the exceptions are called out per row.
Most of these have a CLI flag equivalent; prefer the flag in scripts
and the variable in CI job-level env: blocks.
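The Accepted values columns in the tables of this section use four different activation conventions for switch-like variables. The sketch below spells them out side by side; the Accept enum and is_enabled function are hypothetical names for illustration and are not how ktstr necessarily implements the checks.

```rust
/// The four activation conventions that appear in the tables.
#[derive(Clone, Copy, Debug)]
pub enum Accept {
    /// "Any non-empty value" (e.g. KTSTR_NO_PERF_MODE)
    NonEmpty,
    /// Exactly "1" (e.g. KTSTR_VERBOSE)
    ExactlyOne,
    /// "Anything except empty or 0" (e.g. KTSTR_LOG_PASSES)
    NotEmptyOrZero,
    /// "Presence": even an empty value activates it (e.g. KTSTR_NO_SKIP_MODE)
    Presence,
}

/// `value` is None when the variable is unset, Some("") when set but empty.
pub fn is_enabled(value: Option<&str>, rule: Accept) -> bool {
    match (rule, value) {
        (_, None) => false,
        (Accept::Presence, Some(_)) => true,
        (Accept::NonEmpty, Some(v)) => !v.is_empty(),
        (Accept::ExactlyOne, Some(v)) => v == "1",
        (Accept::NotEmptyOrZero, Some(v)) => !v.is_empty() && v != "0",
    }
}
```

The practical consequence is that a value such as "true" activates the non-empty variables but not the exactly-"1" ones, and an empty export only activates presence-only variables.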
Daily knobs
| Variable | Effect | Accepted values | Default |
|---|---|---|---|
KTSTR_KERNEL | Selects the kernel for every entry point (build-time BTF resolution and runtime image discovery). Set automatically by cargo ktstr test --kernel. | Exact version (6.14), range (6.14..7.0), git+URL#tag=…/#branch=…/#sha=…, path, or cache key | Auto-discovered |
KTSTR_TEST_KERNEL | Points the test harness directly at a bootable image (bzImage on x86_64, Image on aarch64). Set-but-empty is a hard error, not a fallback. | Image path | Auto-discovered |
KTSTR_SCHEDULER | Global binary override for every SchedulerSpec::Discover scheduler. See Troubleshooting for the full resolution order. | Binary path | Build/discover cascade |
KTSTR_SCHEDULER_BIN_<NAME> | Per-scheduler binary override, checked before the global one. <NAME> is the discover name uppercased with non-alphanumerics mapped to _ (scx-ktstr → KTSTR_SCHEDULER_BIN_SCX_KTSTR). | Binary path | Unset |
KTSTR_SCHEDULER_PROFILE | Cargo build profile for the scheduler-under-test (independent of the harness profile). Set by cargo ktstr … --profile. | Profile name | release |
KTSTR_SCHEDULER_ALLOW_STALE_FALLBACK | After a failed orchestrated cargo build -p <sched>, fall back to a pre-built binary instead of failing the test. | Any non-empty value | Refuse stale fallback |
KTSTR_NO_PERF_MODE | Disable performance mode (pinning, RT scheduling, hugepages, KVM exit suppression). A budget-sized CPU reservation is still taken — see Resource Budget. Flag: --no-perf-mode. | Any non-empty value | Perf mode available |
KTSTR_NO_SKIP_MODE | Turn resource-contention / insufficient-topology skips into hard failures. Flag: --no-skip-mode. Presence-only: even an empty value activates it. | Presence | Skip on contention |
KTSTR_CARGO_TEST_MODE | Marks a direct cargo test / cargo nextest run without the cargo ktstr wrapper: no gauntlet expansion, no host CPU flocks, per-process initramfs builds, and $PATH-first scheduler discovery. | Any non-empty value | Full orchestration |
KTSTR_VERBOSE | Verbose guest console output (loglevel=7, plus earlyprintk=serial on x86_64). | Exactly "1" | Quiet console |
KTSTR_LOG_PASSES | Log every Verdict pass detail, not just failures — for “the test passed but what did the assertion see?”. | Anything except empty or "0" | Failures only |
KTSTR_BUDGET_SECS | Time budget for greedy coverage-maximizing test selection at list time. See Running Tests. | Positive number (fractional ok); invalid values warn and are ignored | All tests listed |
RUST_BACKTRACE | Verbose diagnostics on failure; "1" or "full" also enables the verbose guest console. Propagated to the guest. | 1, full | Off |
RUST_LOG | Tracing filter, host-side and guest-side (forwarded on the guest kernel command line). Example: RUST_LOG=ktstr::flock=debug surfaces flock-contention heartbeats. | tracing filter syntax | Off |
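The name mapping for KTSTR_SCHEDULER_BIN_<NAME> described in the table is mechanical: uppercase the discover name and map every non-alphanumeric character to an underscore. Here is a minimal sketch of that rule; the function name is hypothetical, and restricting "alphanumeric" to ASCII is an assumption.

```rust
/// Build the per-scheduler override variable name from a discover name.
pub fn scheduler_bin_var(discover_name: &str) -> String {
    let suffix: String = discover_name
        .chars()
        .map(|c| {
            if c.is_ascii_alphanumeric() {
                c.to_ascii_uppercase()
            } else {
                '_' // every other character, including '-' and '.', maps to '_'
            }
        })
        .collect();
    format!("KTSTR_SCHEDULER_BIN_{suffix}")
}
```

For the table's own example, scheduler_bin_var("scx-ktstr") produces KTSTR_SCHEDULER_BIN_SCX_KTSTR.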
Kernel builds and caches
| Variable | Effect | Accepted values | Default |
|---|---|---|---|
KTSTR_CACHE_DIR | Override the cache root (kernel images, BTF anchors, blobs). The value is used verbatim — no per-type subdirectory is appended. | Absolute path | $XDG_CACHE_HOME/ktstr/ or ~/.cache/ktstr/ (per-type subdir appended) |
KTSTR_GHA_CACHE | Enable the GitHub Actions remote kernel cache. Needs ACTIONS_CACHE_URL (set by the runner) and a ktstr built with --features remote-cache. Local cache stays authoritative; remote failures are non-fatal. | Exactly "1" | Disabled |
KTSTR_KERNEL_PARALLELISM | Width of the download/resolve fan-out for multi-kernel --kernel specs. Affects downloads only — builds serialize on the host CPU locks. | Positive integer; 0 / unparseable falls back to default | Host logical CPU count |
KTSTR_CACHE_STORE_LOCK_TIMEOUT | Timeout for the exclusive lock taken while storing a built kernel into the cache. Raise on CI runners with slow shared disks. | humantime duration (30s, 2m) | Compile-time default |
KTSTR_BUSYBOX_TARBALL | Build-time (build.rs): read the busybox source tarball from a local path instead of downloading it. | Tarball path | Download |
KTSTR_SKIP_BUSYBOX_BUILD | Build-time (build.rs): skip the busybox compile entirely; shell mode becomes unavailable. | Any non-empty value | Build busybox |
KTSTR_SKIP_WPROF_BUILD | Build-time (build.rs, wprof feature only): skip fetching and compiling the bundled wprof tooling. | Any non-empty value | Build wprof |
Sidecars and stats
| Variable | Effect | Accepted values | Default |
|---|---|---|---|
KTSTR_SIDECAR_DIR | Override the per-test sidecar output directory. Skips both the pre-clear and the cross-process flock — the operator owns the directory, so concurrent runs pointing at the same path are unserialized. stats subcommands read the default pool; pass --dir to point them elsewhere. See Runs and Regression Gates. | Directory path | {runs root}/{kernel}-{project_commit}/ |
KTSTR_CI | Stamp every sidecar’s run_source as "ci" instead of "local", so CI-produced runs are filterable in the stats pool. | Any non-empty value | "local" |
Resource coordination and escape hatches
See Resource Budget for how these interact; the first two are mutually exclusive at every entry point.
| Variable | Effect | Accepted values | Default |
|---|---|---|---|
KTSTR_CPU_CAP | Cap the host CPUs reserved by a no-perf-mode VM or kernel build. Flag --cpu-cap N takes precedence. | Integer ≥ 1; 0 / non-numeric rejected | Kernel build: 30% of allowed CPUs (min 1). No-perf VM: the vCPU count, floored at 30%. |
KTSTR_BYPASS_LLC_LOCKS | Skip host-side LLC flock acquisition entirely — no coordination against concurrent runs. | Any non-empty value | Coordinate |
KTSTR_LOCK_DIR | Directory for the per-LLC / per-CPU flock files. Use when /tmp is constrained on a runner. | Directory path | /tmp |
KTSTR_CONTENTION_BYPASS | Make transient KVM errnos hard failures instead of ResourceContention skips (only when the host is not near its limits) — stricter, for catching kernel-side regressions. | Exactly "1" | Skip on contention |
KTSTR_HOST_CGROUP_PARENT | cgroup-v2 parent under which host_only tests create per-test cgroups. Must be a non-root subdirectory of /sys/fs/cgroup. | Path under /sys/fs/cgroup | /sys/fs/cgroup/ktstr |
KTSTR_CGROUP_WALK_ROOT | Where the setup-time controller-enable walk starts, for delegated cgroup subtrees (systemd Delegate=yes, container nsdelegate). Must be a prefix of the configured parent. | Path prefix of the parent | /sys/fs/cgroup |
KTSTR_STALL_POLL_MS | Host-mode stall-monitor poll cadence. | Milliseconds; empty / 0 / unparseable falls back | 500 ms |
KTSTR_WORKER_READY_MARKER_OVERRIDE | Path where the jemalloc alloc worker writes its ready marker, for noexec or quota-constrained temp filesystems. | Absolute path | /tmp/ktstr-worker-ready-<pid> |
Set by ktstr itself
cargo ktstr stamps these across the orchestrator → nextest →
test-binary boundary. Listed so you can recognize them in ps output
and CI logs — do not set them by hand.
| Variable | Carries |
|---|---|
| KTSTR_KERNEL_LIST | Multi-kernel fan-out list (label=path;…) when a run resolves 2+ kernels; each test expands to one variant per kernel. Takes precedence over KTSTR_KERNEL during variant expansion. |
| KTSTR_KERNEL_COMMIT | dir=commit map of each source kernel’s HEAD, so per-test processes skip re-walking the kernel tree. |
| KTSTR_PROJECT_COMMIT | The project commit label perf-delta children must record in their sidecars. |
| KTSTR_ORCHESTRATED | Orchestration marker; VM-booting integration tests skip when it is absent (raw cargo nextest run would starve their resource budgets). |
| KTSTR_RUN_EPOCH | Per-invocation session token that keeps parallel test processes from pre-clearing each other’s freshly written sidecars. |
| KTSTR_RUNS_ROOT | Absolute runs root, stamped once so sidecar writers and post-run readers resolve the same directory regardless of CWD. |
| KTSTR_PERF_ONLY | Set by perf-delta runs: skip every test without performance_mode. Exporting it manually restricts any run the same way. |
| KTSTR_VERIFIER_RAW | Set by cargo ktstr verifier --raw: emit verifier logs verbatim, no cycle collapsing. |
| KTSTR_VERIFIER_RESULT_DIR | Directory where verifier cells write per-cell PASS/FAIL records for the summary grid. |
| KTSTR_VERIFIER_SCHEDULER | The verifier --scheduler NAME filter, forwarded to cell emission. |
| KTSTR_BUSYBOX_PATH | Path to the busybox blob cargo ktstr extracts at startup for shell-mode VMs and disk-template builds. |
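When one of these shows up in a CI log, it can help to decode it. The sketch below splits a KTSTR_KERNEL_LIST value of the documented `label=path;…` shape into pairs. It assumes `;` separates entries and the first `=` separates label from path; the function name is hypothetical and ktstr's own parser may be stricter.

```rust
use std::path::PathBuf;

/// Split a `label=path;label=path` list into (label, path) pairs.
fn parse_kernel_list(raw: &str) -> Result<Vec<(String, PathBuf)>, String> {
    raw.split(';')
        .filter(|entry| !entry.is_empty()) // tolerate a trailing ';'
        .map(|entry| -> Result<(String, PathBuf), String> {
            let (label, path) = entry
                .split_once('=')
                .ok_or_else(|| format!("entry without '=': {entry:?}"))?;
            if label.is_empty() || path.is_empty() {
                return Err(format!("empty label or path in {entry:?}"));
            }
            Ok((label.to_string(), PathBuf::from(path)))
        })
        .collect()
}
```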
Probe wiring
Consulted by integration tests that boot a jemalloc-linked allocator
worker and attach the jemalloc probe to it. Both must be populated
before ktstr’s early nextest dispatch runs, so tests set them from a
#[ctor] — see Payloads and Included Files
for the wiring pattern. Leaving them unset is the normal case: no
probe is packed into the initramfs.
| Variable | Effect | Default |
|---|---|---|
| KTSTR_JEMALLOC_PROBE_BINARY | Absolute host path to ktstr-jemalloc-probe; packed into every VM’s initramfs at /bin/ktstr-jemalloc-probe when set. | No probe packed |
| KTSTR_JEMALLOC_ALLOC_WORKER_BINARY | Absolute host path to the paired ktstr-jemalloc-alloc-worker, packed alongside the probe. | No worker packed |
LLVM coverage
| Variable | Effect | Default |
|---|---|---|
| LLVM_COV_TARGET_DIR | Directory for extracted profraw files. | Parent of LLVM_PROFILE_FILE, or <exe-dir>/llvm-cov-target/ |
| LLVM_PROFILE_FILE | Standard LLVM profiling output path; ktstr reads its parent as a fallback profraw directory. | None |
Nextest protocol
| Variable | Effect | Default |
|---|---|---|
| NEXTEST | Set by nextest when it invokes the test binary. ktstr’s early dispatch inspects it to decide whether to intercept --list / --exact for gauntlet expansion and budget selection. | None |
VM-internal
Mostly set by the host on the guest kernel command line and read by
the guest init (via /proc/cmdline); a few (noted below) are
process-internal markers set inside the guest. Not intended for user
configuration; listed here for debugging.
| Variable | Description |
|---|---|
| KTSTR_MODE | Guest execution mode. shell requests the interactive shell; disk_template requests a one-shot mkfs template-build VM. Absent means the default test-dispatch path. |
| KTSTR_TOPO | Topology string (numa_nodes,llcs,cores,threads) for guest-side scenario resolution. |
| KTSTR_TERM | Terminal type forwarded from the host (sets guest TERM). |
| KTSTR_COLORTERM | Color capability forwarded from the host (sets guest COLORTERM). |
| KTSTR_COLS / KTSTR_ROWS | Host terminal size, used to size the guest pty when available. |
| KTSTR_GUEST_INIT | Process-internal marker set by the guest init — not a host-emitted cmdline token. Used to detect re-entrant worker spawns under PID-1 init. |
| KTSTR_DISK0_FS / KTSTR_DISK0_MOUNT / KTSTR_DISK0_RO | Disk-attach metadata (fs type, mount point, ro flag) for #[ktstr_test(disk = ...)], consumed by the guest to mount the virtio-blk backing. |
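For debugging, the cmdline-carried variables can be pulled out of a saved /proc/cmdline with a few lines of code. The sketch assumes each token has the form `KEY=value` with no quoting, and that KTSTR_TOPO follows the same "all values >= 1" rule as the --topology flag. Both helper names are hypothetical.

```rust
/// Find `KEY=value` on a kernel command line and return the value.
fn cmdline_value<'a>(cmdline: &'a str, key: &str) -> Option<&'a str> {
    cmdline
        .split_whitespace()
        .filter_map(|tok| tok.split_once('='))
        .find(|(k, _)| *k == key)
        .map(|(_, v)| v)
}

/// Parse a `numa_nodes,llcs,cores,threads` string; every value must be >= 1.
fn parse_topo(raw: &str) -> Option<[u32; 4]> {
    let parts = raw
        .split(',')
        .map(|p| p.trim().parse::<u32>().ok())
        .collect::<Option<Vec<u32>>>()?;
    match parts.as_slice() {
        &[n, l, c, t] if n >= 1 && l >= 1 && c >= 1 && t >= 1 => Some([n, l, c, t]),
        _ => None, // wrong arity or a zero component
    }
}
```

Chaining the two, `cmdline_value(line, "KTSTR_TOPO").and_then(parse_topo)` yields the four topology components or `None`.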
ctprof
The ctprof profiler captures a host-wide per-thread snapshot of scheduling counters, memory / I/O accounting, CPU affinity, cgroup state, and thread identity, then compares two snapshots to surface what changed. It is a manually-invoked CLI companion to the automated scheduler tests — useful when a run passes on one machine and fails on another, or for A/B comparing host behavior across kernel / sysctl / workload changes.
This is a different tool from cargo ktstr show-host,
which captures the host context (kernel, CPU model, sched_*
tunables, NUMA layout, kernel cmdline) — aggregate state that
does not change between scenarios. The profiler captures
per-thread cumulative counters that do change, and its
comparison surface is designed for the thread-level diff.
When to use it
- Workload investigation — you observe a regression and want to know which process / thread pool moved in run time, context-switch rate, or migration count.
- Kernel / sysctl A/B — capture before and after flipping a sched_* tunable on an otherwise-identical workload; the compare output surfaces every counter that responded.
- Host baselining — capture on a known-good host, capture on a failing host, compare to isolate what differs at the thread-behavior level.
The profiler is not invoked automatically by scenarios or the
gauntlet. It is opt-in and operator-driven via the
ktstr ctprof subcommand.
Capture, then compare
The whole workflow is three commands wrapped around one change: snapshot, change something, snapshot again, then diff.
ktstr ctprof capture --output base.ctprof.zst
# ... run a workload, flip a tunable, swap a scheduler ...
ktstr ctprof capture --output cand.ctprof.zst
ktstr ctprof compare base.ctprof.zst cand.ctprof.zst
capture walks /proc for every live thread group, enumerates
each thread, and reads a handful of procfs sources for each one.
The output is a zstd-compressed JSON snapshot (conventional
extension: .ctprof.zst). On a workstation with ~1,200 live
threads, each snapshot in the run below took about a second.
Here is a real compare — two captures taken a couple of seconds apart on a busy workstation. Rows sort by largest absolute percent delta, so the biggest movers are the first thing you see:
## Primary metrics
comm threads metric value delta % %uptime
kworker/{N}:{N}-mm_percpu_wq
kworker/{N}:{N}-mm_percpu_wq 11→37 voluntary_csw 8.697K → 101.154K +92.457K +1063.1% 93%
kworker/{N}:{N}-mm_percpu_wq 11→37 timeslices 8.699K → 101.166K +92.467K +1063.0% 93%
kworker/{N}:{N}-mm_percpu_wq 11→37 wait_time_ns 2.684s → 27.653s +24.969s +930.2% 93%
kworker/{N}:{N}-mm_percpu_wq 11→37 stime_clock_ticks 22ticks → 217ticks +195ticks +886.4% 93%
kworker/{N}:{N}-mm_percpu_wq 11→37 run_time_ns 243.378ms → 2.320s +2.077s +853.4% 93%
kworker/{N}:{N}-mm_percpu_wq 11→37 nonvoluntary_csw 2 → 12 +10 +500.0% 93%
kworker/{N}:{N}-mm_percpu_wq 11→37 thread_count 11 → 37 +26 +236.4% 93%
kworker/{N}:{N}-mm_percpu_wq 11→37 nr_migrations 11 → 34 +23 +209.1% 93%
kworker/{N}:{N}-events
kworker/{N}:{N}-events 87→60 nonvoluntary_csw 22 → 11 -11 -50.0% 95%
kworker/{N}:{N}-events 87→60 timeslices 222.140K → 127.813K -94.327K -42.5% 95%
user.slice
user-{N}.slice
session-{H}.scope
ktstr
ktstr 1 processor 9 → 43 +34 +377.8% 0%
ktstr 1 wait_time_ns 6.850µs → 22.693µs +15.843µs +231.3% 0%
ktstr 1 nonvoluntary_csw 2 → 6 +4 +200.0% 0%
... 22 more lines truncated (use --limit 0 for unlimited)
Reading it:
- value is baseline → candidate for the group’s aggregated reading; delta and % carry the signed move. The mm_percpu_wq pool grew from 11 to 37 threads and its voluntary context switches went up 11×, all inside the capture window — the eye lands there first because the sort put it first.
- threads shows the group population on each side (11→37). A population change is often the story by itself.
- %uptime is the group’s average thread lifetime relative to the longest-lived group in the snapshot — low values flag young threads whose counters had little time to accumulate.
- {N}/{H} placeholders come from name-pattern normalization: kworker/3:1 and kworker/7:0 are the same logical pool, so they land in one kworker/{N}:{N} bucket.
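The delta and % columns follow from the two sides of value by plain arithmetic. A minimal sketch, assuming the percent is taken relative to the baseline and is undefined for a zero baseline (the function name is hypothetical):

```rust
/// Signed move and percent change from a baseline to a candidate reading.
/// The percent is `None` when the baseline is zero.
fn delta(baseline: f64, candidate: f64) -> (f64, Option<f64>) {
    let d = candidate - baseline;
    let pct = if baseline == 0.0 {
        None
    } else {
        Some(d / baseline * 100.0)
    };
    (d, pct)
}
```

With the voluntary_csw row above, `delta(8697.0, 101154.0)` gives a move of 92457 and a percent that rounds to +1063.1, matching the table.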
Groups present on only one side surface as unmatched — a row is missing because the process did not exist, not because it did zero work. A full (unfiltered) compare lists them in a trailer:
1 group(s) only in baseline (/tmp/ktstr-docs-base.ctprof.zst):
kworker/u{N}:{N}-flush-btrfs-{N} kworker/u{N}:{N}-flush-btrfs-{N}
2 group(s) only in candidate (/tmp/ktstr-docs-cand.ctprof.zst):
kworker/u{N}:{N}-writeback kworker/u{N}:{N}-writeback
kworker/{N}:{N}-events_freezable kworker/{N}:{N}-events_freezable
What is captured per thread
- Identity — tid, tgid, process and thread name, cgroup v2 path, start time, scheduling policy, nice, CPU affinity mask.
- Scheduling counters (cumulative, from /proc/<tid>/sched and /proc/<tid>/schedstat) — run / wait / sleep / block / iowait time, context switches, wakeups with locality splits, migrations, plus lifetime peaks (wait_max, slice_max, …).
- Memory — page faults; jemalloc per-thread allocated/deallocated counters read via ptrace + process_vm_readv (jemalloc-linked processes only — other allocators read zero rather than failing capture); per-process smaps_rollup.
- I/O — rchar/wchar, syscall counts, and block-level byte counters from /proc/<tid>/io (requires CONFIG_TASK_IO_ACCOUNTING).
- Taskstats delay accounting + watermarks — eight delay categories plus peak-RSS/VM watermarks via the TASKSTATS genetlink family; see Taskstats delay accounting for gating and semantics.
- PSI and cgroup aggregates — host-level and per-cgroup pressure (CONFIG_PSI), cpu.stat / memory.* / pids.* per cgroup that hosted a sampled thread — read from cgroup files directly, not derived from per-thread data.
- sched_ext sysfs — state, switch_all, nr_rejected, and the hotplug/enable sequence counters, when CONFIG_SCHED_CLASS_EXT is built.
Three timing families matter when interpreting a diff:
- Cumulative counters (the majority) only increase, so probe attachment time does not bias the reading — a diff between two captures measures exactly the activity in the window.
- Lifetime extrema (*_max, hiwater_*, *_delay_min_ns) are per-event peaks kept by the kernel, not sums over the window.
- Instantaneous gauges (nr_threads, fair_slice_ns, state, affinity, processor) are sampled at capture time and can legitimately differ between two probes of the same thread.
Metrics that reset on attachment (perf_event_open counters, BPF tracing samples) are intentionally absent — they require long-lived instrumentation the capture layer cannot install without disturbing the system it is measuring.
Capture is best-effort
Each internal reader returns Option; a kernel missing a config
gate (no CONFIG_SCHED_DEBUG, no CONFIG_SCHEDSTATS) yields
None from that reader without failing the rest of the thread.
Counters collapse to 0, identity strings collapse to empty.
A missing reading is indistinguishable from a genuine zero in
the output — the contract is “never fail the snapshot.” The
capture summary lines on stderr tally read failures and hint at
the likely missing kconfig.
Pulling the jemalloc counters briefly stops each probed thread
via ptrace(PTRACE_SEIZE), which needs root, CAP_SYS_PTRACE,
or kernel.yama.ptrace_scope=0; without the privilege those
fields fall through to zero and the rest of the snapshot still
populates.
Cgroup namespace caveat
The per-thread cgroup path is read verbatim from
/proc/<tid>/cgroup — it is relative to the cgroup namespace
root the capturing process sees, not the system-global v2
mount root. A process inside a nested cgroup namespace sees a
truncated path. Cross-namespace comparison requires external
canonicalization; the capture layer deliberately does not
attempt it.
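For reference, the cgroup v2 entry in /proc/<tid>/cgroup is the line with hierarchy ID 0 and an empty controller list, that is, `0::<path>`. A minimal sketch of extracting it, with a hypothetical function name and no canonicalization, in line with the caveat above:

```rust
/// Extract the cgroup v2 path from the text of /proc/<tid>/cgroup.
/// Returns the path exactly as the kernel printed it, which is relative
/// to the reader's cgroup namespace root.
fn cgroup_v2_path(proc_cgroup: &str) -> Option<&str> {
    proc_cgroup
        .lines()
        .find_map(|line| line.strip_prefix("0::"))
}
```

On a live system the input would come from something like `std::fs::read_to_string(format!("/proc/{tid}/cgroup"))`.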
Compare options
Grouping
compare defaults to --group-by all: all three pattern-aware
axes (cgroup, pcomm, comm) contribute to one view — cgroup-grouped
rows render as an indented path tree, name-pattern buckets render
flat, as in the excerpt above — and renamed-but-identical cgroups
are joined for diffing (a [fudged: <leaf>] marker) instead of
surfacing as orphans.
- --group-by pcomm — aggregate every thread of the same process together (the show default).
- --group-by cgroup — aggregate by cgroup path; enables the per-cgroup sections. Use --cgroup-flatten '<glob>' to collapse dynamic segments (pod UUIDs, session scopes) so the same logical workload lands on the same row across runs.
- --group-by comm — aggregate by thread-name pattern across every process (tokio-worker-{0..N} → one bucket). Choose it when a thread-pool name spans many binaries.
- --group-by comm-exact — literal thread names, no pattern collapse, for when distinct token values carry meaning (each kworker/u8:N tracked independently).
- --no-thread-normalize disables the pattern collapse on the name axes.
Rule of thumb: start with the default all to find which axis
moved, then re-run with that single axis plus --sections /
--metrics filters for a narrow, pasteable table.
Filtering: --sections vs --metrics
- --sections picks which sub-tables render: primary, taskstats-delay, derived, cgroup-stats, cgroup-limits, memory-stat, memory-events, pressure, host-pressure, smaps-rollup, sched-ext. The five cgroup sections require --group-by cgroup. The taskstats rows render inside the primary/derived tables but match the taskstats-delay section name, so you can scope to them alone or exclude them.
- --metrics picks which rows render inside the primary and derived tables, by metric name from the metric-list vocabulary. Secondary sub-tables have fixed shapes and ignore it.
They compose: --sections primary --metrics run_time_ns shows a
single row and nothing else. --sort-by 'wait_sum:desc,run_time_ns:desc' re-ranks rows by your own key
instead of the default biggest-|Δ%| ordering; --limit N caps
lines per section.
How groups aggregate
Every metric declares how per-thread values reduce into a group row; the registry binds each metric to exactly one reduction so a nonsensical fold (summing peaks) cannot be expressed.
| Metric class | Group reduction | Why |
|---|---|---|
| Cumulative counters (csw, wakeups, migrations, run/wait time, io bytes, delay totals) | sum | totals compose; deltas stay meaningful |
| Lifetime peaks (wait_max, *_delay_max_ns, hiwater_*) | max | summing peaks conflates one 1 s spike with 1000 × 1 ms spikes |
| Instantaneous gauges (nr_threads, fair_slice_ns) | max | summing a sampled instant has no physical meaning |
| Bounded ordinals (nice, priority, processor) | [min, max] range | a shift on either end stays visible |
| Categorical (policy, state) | mode + count/total | no arithmetic on categories; delta is same / differs |
| CPU affinity | min/max CPU count + uniform flag | heterogeneous groups render N-M cpus (mixed) |
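One way to make a nonsensical fold inexpressible is to give each metric class its own type with a single reduction. The sketch below shows that idea for three rows of the table; the type and function names are hypothetical and this is not the registry's actual design.

```rust
/// Cumulative counter (csw, wakeups, run/wait time): reduces by sum.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Counter(u64);

/// Lifetime peak (wait_max, hiwater_*): reduces by max.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Peak(u64);

impl Counter {
    fn reduce(group: &[Counter]) -> Counter {
        Counter(group.iter().map(|c| c.0).sum())
    }
}

impl Peak {
    // There is deliberately no sum for Peak: one 1 s spike and
    // 1000 x 1 ms spikes must not collapse to the same number.
    fn reduce(group: &[Peak]) -> Peak {
        Peak(group.iter().map(|p| p.0).max().unwrap_or(0))
    }
}

/// Bounded ordinals (nice, priority) reduce to a [min, max] range.
fn ordinal_range(group: &[i32]) -> Option<(i32, i32)> {
    let min = *group.iter().min()?;
    let max = *group.iter().max()?;
    Some((min, max))
}
```

Because `Peak` exposes only `reduce`, a caller cannot accidentally add two peaks without unwrapping the newtype first.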
Three cross-cutting caveats, stated once:
Swapin / thrashing overlap
Every thrashing event is also a swapin event at the syscall
layer. Never sum the two families; rollups OR them with max().
The *_delay_min_ns sentinel
The kernel keeps the smallest non-zero observation, so 0
means “no events observed”, not “saw a zero-ns event”.
Disambiguate against the matching *_count.
Shared-mm watermarks
The kernel reads hiwater_rss_bytes / hiwater_vm_bytes from
the shared mm_struct, so sibling threads of one process all
report the same value; kernel threads read zero by design.
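The first two caveats translate into small guards in any code that consumes the snapshot. A sketch with hypothetical helper names, treating a raw 0 minimum as "no events" regardless of the count:

```rust
/// Shortest observed delay window, or `None` when none was observed.
/// A raw 0 is the "no events" sentinel, so the count is checked as well.
fn min_delay_ns(count: u64, delay_min_ns: u64) -> Option<u64> {
    if count == 0 || delay_min_ns == 0 {
        None
    } else {
        Some(delay_min_ns)
    }
}

/// Combine swap-in and thrashing delay without double counting:
/// take the larger of the two totals, never their sum.
fn swap_or_thrash_ns(swapin_total_ns: u64, thrashing_total_ns: u64) -> u64 {
    swapin_total_ns.max(thrashing_total_ns)
}
```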
Derived metrics
Derived metrics combine already-aggregated inputs into a scalar
with its own scale — they render in a separate ## Derived metrics table on both compare and show. A missing input or
zero denominator yields - (not computable), distinct from a
computed zero. Representative entries:
| Metric | Formula | Reading it |
|---|---|---|
| cpu_efficiency | run / (run + wait) | fraction of scheduler-tracked time on-CPU; lower = more runqueue waiting |
| avg_slice_ns | run_time_ns / timeslices | average on-CPU slice; catches timeslice-tuning regressions |
| involuntary_csw_ratio | nonvol / (vol + nonvol) | preemption pressure vs cooperative blocking |
| avg_cpu_delay_ns | cpu_delay_total_ns / count | runqueue wait per event, from the delayacct path |
| live_heap_estimate | allocated - deallocated | jemalloc-only live heap; zero is genuine for other allocators |
| total_offcpu_delay_ns | sum of delay buckets, swapin/thrashing OR’d | one off-CPU number; - when delayacct is off entirely |
The full registry (17 derived + every primary metric) is
enumerated by metric-list; all names are valid
--sort-by and --metrics keys.
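Three of the formulas above, written out with the "zero denominator yields -" rule. This is a sketch of the arithmetic only: the function names are hypothetical and the three-decimal rendering is an arbitrary choice for the example.

```rust
/// run / (run + wait); `None` when there is no scheduler-tracked time.
fn cpu_efficiency(run_ns: u64, wait_ns: u64) -> Option<f64> {
    let total = run_ns.checked_add(wait_ns)?;
    if total == 0 { None } else { Some(run_ns as f64 / total as f64) }
}

/// run_time_ns / timeslices; `None` when the task never ran.
fn avg_slice_ns(run_time_ns: u64, timeslices: u64) -> Option<f64> {
    if timeslices == 0 { None } else { Some(run_time_ns as f64 / timeslices as f64) }
}

/// nonvol / (vol + nonvol); `None` when there were no context switches.
fn involuntary_csw_ratio(vol: u64, nonvol: u64) -> Option<f64> {
    let total = vol.checked_add(nonvol)?;
    if total == 0 { None } else { Some(nonvol as f64 / total as f64) }
}

/// Render a derived value: "-" for not computable, a number otherwise.
fn render(value: Option<f64>) -> String {
    match value {
        Some(x) => format!("{x:.3}"),
        None => "-".to_string(),
    }
}
```

Keeping `None` separate from `Some(0.0)` is what lets the table distinguish "not computable" from a computed zero.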
Output and interpretation
The comparison prints raw numbers and percent delta. There are
no judgment labels (regression vs. improvement) — whether
“run_time went up 15%” is good depends on whether you measured a
CPU-bound workload (more work done) or a spin-wait pathology
(more time wasted). The interpretation is scheduler-specific and
left to the operator. Ratio-valued rows suppress the % column:
the absolute delta of a [0, 1] quantity already carries
percentage-point semantics.
show
show renders a single snapshot as a per-group table — same
grouping, filtering, and sorting surface as compare, minus the
diff columns. Default grouping is pcomm.
## Primary metrics
pcomm threads metric value
...
kworker/u{N}:{N}-btrfs-endio-write 23 run_time_ns 41.186s
kworker/u{N}:{N}-btrfs-endio-write 23 voluntary_csw 1.249M
kworker/u{N}:{N}-btrfs-endio-write 23 nr_migrations 26.759K
kworker/u{N}:{N}-btrfs-endio 47 run_time_ns 40.940s
... 464 more lines truncated (use --limit 0 for unlimited)
--columns controls the rendered column set; show accepts
group, threads, metric, value, tags, uptime, while
compare accepts group, threads, metric, baseline,
candidate, delta, %, arrow, tags, uptime — each side
rejects the other’s diff/value columns.
metric-list
metric-list prints every registered metric with its tags and a
one-line description — the authoritative vocabulary for
--metrics and --sort-by, plus a tag legend explaining which
kernels populate each counter:
## Metrics
metric tags description
...
run_time_ns [SCHED_INFO] Cumulative on-CPU time, ns; /proc/<tid>/schedstat field 1.
wait_time_ns [SCHED_INFO] Cumulative time waiting on the runqueue, ns; schedstat field 2.
timeslices [SCHED_INFO] Number of times the task was run on a CPU; schedstat field 3.
voluntary_csw Voluntary context switches (task gave up the CPU itself).
nonvoluntary_csw Involuntary context switches (task was preempted).
nr_wakeups [SCHEDSTATS] Total wakeups via try_to_wake_up().
...
nr_wakeups_affine [cfs-only] [SCHEDSTATS] Wakeups that succeeded under the wake_affine() heuristic.
The bracketed tags mark scheduler-class gates ([cfs-only]
counters stay zero under sched_ext) and kconfig gates
([SCHEDSTATS], [TASK_DELAY_ACCT], …) so you know whether a
zero means “idle” or “not compiled in”.
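The three [SCHED_INFO] metrics listed above map one-to-one onto the three whitespace-separated fields of /proc/<tid>/schedstat. A minimal parser sketch; the struct and function names are hypothetical:

```rust
/// The three fields of /proc/<tid>/schedstat, in file order.
#[derive(Debug, PartialEq)]
struct SchedStat {
    run_time_ns: u64,  // field 1: cumulative on-CPU time
    wait_time_ns: u64, // field 2: cumulative runqueue wait
    timeslices: u64,   // field 3: number of times run on a CPU
}

/// Parse the file's single line; `None` if a field is missing or malformed.
fn parse_schedstat(text: &str) -> Option<SchedStat> {
    let mut fields = text.split_whitespace().map(|f| f.parse::<u64>().ok());
    Some(SchedStat {
        run_time_ns: fields.next()??,
        wait_time_ns: fields.next()??,
        timeslices: fields.next()??,
    })
}
```

Returning `None` rather than an error mirrors the best-effort contract described earlier, where an unreadable source degrades one reading instead of failing the snapshot.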
Taskstats delay accounting
The kernel’s TASKSTATS genetlink family delivers per-task
delay-accounting and memory-watermark fields that are not
exposed via /proc/<tid>/sched or /proc/<tid>/stat. The 34
captured fields (8 delay categories × 4 bucket fields + 2
watermarks) all tag the taskstats-delay section so they can be
filtered as a unit.
Capability and kconfig gating
Querying the netlink family requires CAP_NET_ADMIN on the
capturing process. A non-root operator running
ktstr ctprof capture hits EPERM on the first query and every
taskstats field collapses to zero per the best-effort contract.
- Delay-accounting fields require
CONFIG_TASKSTATS=yandCONFIG_TASK_DELAY_ACCT=yand the runtimedelayacct=ontoggle (boot param orkernel.task_delayacct=1). A kernel built with both configs but launched without the toggle produces all-zero delay readings. ktstr’s standard kernel build includes both kconfigs, and the test harness addsdelayacctto the guest cmdline. - Memory-watermark fields (
hiwater_rss_bytes,hiwater_vm_bytes) requireCONFIG_TASKSTATS=yandCONFIG_TASK_XACCT=y, and do not respond to the runtime toggle. See the shared-mm caveat.
The structured tally on CtprofSnapshot::taskstats_summary
(ok_count / eperm_count / esrch_count / other_err_count)
distinguishes “kernel doesn’t expose this” (netlink open failed,
all counters zero) from “every tid raced exit” (high
esrch_count) from “CAP_NET_ADMIN missing” (high
eperm_count). There is no CLI lens for it yet — read it from
the snapshot JSON (zstd -d < snap.ctprof.zst | jq .taskstats_summary).
The eight delay categories
| Category | Kernel source | Notes |
|---|---|---|
| cpu_delay_* | tsk->sched_info via delayacct_add_tsk (kernel/delayacct.c) | Runqueue wait. Count and total update locklessly, so a reader can transiently observe one ahead of the other — averages are approximate at sub-event scale, stable integrated. Same bucket as schedstat wait_* via a different path. |
| blkio_delay_* | delayacct_blkio_start/_end | Synchronous block-I/O wait; updates serialize through the task’s delay lock. The canonical delayacct block-I/O reading, distinct from schedstat iowait_sum. |
| swapin_delay_* | delayacct_swapin_start/_end | Swap-in wait. Overlaps thrashing. |
| freepages_delay_* | called from mm/vmscan.c | Direct-reclaim wait. |
| thrashing_delay_* | called from mm/filemap.c, mm/page_io.c | Thrashing wait; refines swapin tracking. |
| compact_delay_* | called from mm/page_alloc.c | Memory-compaction wait. |
| wpcopy_delay_* | called from mm/memory.c, mm/hugetlb.c | Write-protect-copy (CoW) fault wait. Taskstats v13+. |
| irq_delay_* | delayacct_irq | IRQ-handler windows charged to the task. Taskstats v14+. On kernels predating a bucket, the missing fields read zero from the truncated payload. |
Each category carries four fields: *_count (windows observed),
*_delay_total_ns (cumulative), *_delay_max_ns (longest single
window), and *_delay_min_ns (shortest non-zero window — mind
the sentinel).
File format
.ctprof.zst is zstd-compressed JSON of CtprofSnapshot. The
schema is #[non_exhaustive] so field additions do not break
existing snapshots. The top level carries threads,
cgroup_stats, host-level psi, optional sched_ext sysfs
state, the capture summaries (probe_summary, parse_summary,
taskstats_summary), and an embedded HostContext — the same
structure show-host prints — for round-trip tooling. Thread
start times are recorded in USER_HZ (100 on x86_64 and aarch64),
so cross-host comparison between differently-configured kernels
on those architectures is meaningful.
Extending ctprof
Adding a metric to the registry is a typed three-step change
(field newtype → capture wiring → registry entry) designed so a
mismatched aggregation fails to compile. See the module
documentation for ktstr::ctprof_compare in the
rustdoc.
Related
- Diagnose a Slow Scheduler with ctprof — the worked investigation recipe built on this tool.
cargo ktstr show-host— host context capture (kernel, CPU, tunables), without the per-thread walk; Capture and Compare Host State compares it across runs.
ktstr (standalone)
ktstr is the standalone debugging companion to the
#[ktstr_test] test harness.
It owns interactive VM shells, host topology inspection, host-wide
per-thread profiling, and lock introspection — the operations a
scheduler author reaches for when investigating a test failure.
To run the test suite, use
cargo ktstr test; to reproduce a test as a
self-contained script without a VM, use
cargo ktstr export.
A typical failure investigation chains the two binaries: a test
fails (read the output), you boot the same
environment interactively with cargo ktstr shell --test NAME, and
if the question is “what else changed on this host?”, you bracket
the workload with ktstr ctprof capture and diff the snapshots.
Build from the workspace:
cargo build --bin ktstr
Subcommands
topo
Show the host CPU topology — the same view the resource planner and performance mode use:
$ ktstr topo
CPUs: 64
LLCs: 4
NUMA nodes: 1
LLC 0 (node 0): [0, 1, 2, 3, 4, 5, 6, 7, 32, 33, 34, 35, 36, 37, 38, 39]
LLC 1 (node 0): [8, 9, 10, 11, 12, 13, 14, 15, 40, 41, 42, 43, 44, 45, 46, 47]
LLC 2 (node 0): [16, 17, 18, 19, 20, 21, 22, 23, 48, 49, 50, 51, 52, 53, 54, 55]
LLC 3 (node 0): [24, 25, 26, 27, 28, 29, 30, 31, 56, 57, 58, 59, 60, 61, 62, 63]
This box is 1n4l8c2t in ktstr’s
topology notation: 1 NUMA node, 4 LLCs,
8 cores per LLC, 2 threads per core — note the SMT siblings (CPU 0
pairs with CPU 32).
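The notation and the CPU count can be cross-checked against the `ktstr topo` output with a few lines of code. The sketch assumes the LLC count is host-wide, as in the "LLCs: 4" line above, so the NUMA count does not enter the product; the struct and method names are hypothetical.

```rust
/// A topology in numa / llc / core / thread terms.
struct Topology {
    numa_nodes: u32,
    llcs: u32,             // total LLCs on the host (assumption)
    cores_per_llc: u32,
    threads_per_core: u32,
}

impl Topology {
    /// Compact form such as "1n4l8c2t".
    fn notation(&self) -> String {
        format!(
            "{}n{}l{}c{}t",
            self.numa_nodes, self.llcs, self.cores_per_llc, self.threads_per_core
        )
    }

    /// Logical CPU count: LLCs x cores per LLC x threads per core.
    fn cpus(&self) -> u32 {
        self.llcs * self.cores_per_llc * self.threads_per_core
    }
}
```

For the box above, 4 LLCs of 8 cores with 2 threads each give 64 logical CPUs, which agrees with the "CPUs: 64" line.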
kernel
Manage cached kernel images: list, build, clean. Identical to
the cargo-ktstr kernel subcommands; see that reference for full
documentation.
shell
Boot an interactive shell in a KVM VM. The guest is a busybox
userland with your files mounted at /include-files/:
ktstr shell
ktstr shell --kernel ../linux
ktstr shell --kernel 6.14.2 --topology 1,2,4,1
ktstr shell -i ./my-binary -i strace
ktstr shell --exec 'cat /proc/schedstat'
Files and directories passed via -i land at
/include-files/<name> inside the guest. Directories are walked
recursively; bare names (no path separator) are resolved via PATH;
dynamically-linked ELF binaries get automatic shared-library
resolution, and non-ELF files are copied as-is.
Stdin must be a terminal: the host terminal enters raw mode for
bidirectional forwarding, and the saved terminal state is restored
on exit paths ktstr controls (normal exit, errors, catchable fatal
signals). A SIGKILL cannot be intercepted — run reset if the
terminal is left raw.
| Flag | Default | Description |
|---|---|---|
| --kernel ID | auto | Same kernel grammar as cargo ktstr test --kernel (path, version, cache key, range, git source), resolving to a single kernel; raw image files are rejected here. When absent, resolves via cache then filesystem, falling back to downloading the latest stable kernel. |
| --topology N,L,C,T | 1,1,1,1 | Virtual topology as numa_nodes,llcs,cores,threads. All values must be >= 1. |
| -i, --include-files PATH | — | Files or directories to include in the guest. Repeatable. |
| --memory-mib MiB | auto | Guest memory in MiB (minimum 128). When absent, estimated from payload and include-file sizes. |
| --dmesg | off | Forward the guest kernel console (COM1/dmesg) to stderr in real time; sets loglevel=7. |
| --exec CMD | — | Run a command instead of an interactive shell; the VM exits when it completes. |
| --exec-timeout DURATION | 120s | Max wall-clock for a --exec payload before the VM is killed (30s, 5m, 1h). |
| --no-perf-mode | off | Disable all performance mode features. Also via KTSTR_NO_PERF_MODE. |
| --cpu-cap N | unset | Reserve only N host CPUs for the shell VM; requires --no-perf-mode. See Resource Budget. |
| --disk SIZE | unset | Attach a raw virtio-blk disk at /dev/vda, backed by a fresh sparse tempfile. IEC sizes only (256mib, 1gib). |
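The DURATION values shown for --exec-timeout (30s, 5m, 1h) follow a number-plus-unit shape. The sketch below parses only those three suffixes and is a guess at the grammar from the examples, not ktstr's parser; the function name is hypothetical and other units the real flag may accept are rejected here.

```rust
use std::time::Duration;

/// Parse `30s`, `5m`, or `1h` into a Duration; anything else is `None`.
fn parse_timeout(raw: &str) -> Option<Duration> {
    let raw = raw.trim();
    let (digits, per_unit_secs) = if let Some(d) = raw.strip_suffix('s') {
        (d, 1u64)
    } else if let Some(d) = raw.strip_suffix('m') {
        (d, 60)
    } else if let Some(d) = raw.strip_suffix('h') {
        (d, 3600)
    } else {
        return None; // a bare number has no unit
    };
    let n: u64 = digits.parse().ok()?;
    Some(Duration::from_secs(n.checked_mul(per_unit_secs)?))
}
```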
cargo ktstr shell runs the same boot flow
with two additions: it accepts raw kernel image files for
--kernel, and it has --test NAME to derive topology, memory, and
include files from a registered #[ktstr_test].
ctprof
Capture or compare a host-wide per-thread snapshot — for “the scheduler looks fine but something on the host is still behaving oddly”. Every visible thread’s scheduling, memory, and I/O counters are snapshotted as zstd-compressed JSON:
ktstr ctprof capture --output baseline.ctprof.zst
# ... run the workload of interest ...
ktstr ctprof capture --output candidate.ctprof.zst
ktstr ctprof compare baseline.ctprof.zst candidate.ctprof.zst
Cumulative counters and lifetime peaks are probe-timing-invariant —
sampled twice, a value either increased monotonically or stayed at
its high-water mark — so a diff between two snapshots measures
exactly the activity over the window. Capture uses no kprobes or
kernel tracing and does not modify thread state; the only exception
is the jemalloc-only memory fields, read by briefly ptrace-attaching
jemalloc-linked processes (needs root, CAP_SYS_PTRACE, or
ptrace_scope=0; recorded as zero when denied).
compare joins two snapshots on a grouping axis and renders
per-metric baseline/candidate/delta rows, sorted by largest relative
change. Real output (a cargo build ran between the snapshots):
## Primary metrics
comm threads metric value delta % %uptime
kworker/{N}:{N}-mm_percpu_wq
kworker/{N}:{N}-mm_percpu_wq 11→37 voluntary_csw 8.697K → 101.154K +92.457K +1063.1% 93%
kworker/{N}:{N}-mm_percpu_wq 11→37 timeslices 8.699K → 101.166K +92.467K +1063.0% 93%
kworker/{N}:{N}-mm_percpu_wq 11→37 wait_time_ns 2.684s → 27.653s +24.969s +930.2% 93%
kworker/{N}:{N}-mm_percpu_wq 11→37 stime_clock_ticks 22ticks → 217ticks +195ticks +886.4% 93%
kworker/{N}:{N}-mm_percpu_wq 11→37 run_time_ns 243.378ms → 2.320s +2.077s +853.4% 93%
...
Thread names are token-normalized (kworker/3:1 and kworker/7:0
fold into kworker/{N}:{N}), so the join key survives across
process restarts and even across hosts — deltas reflect the named
workload, not a specific pid.
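To see why the join key is stable, here is a deliberately simplified normalizer that collapses every run of ASCII digits into `{N}`. It is an illustration only: the function name is hypothetical, and the real normalization has more rules (the `{H}` placeholder seen in compare output, for one) that this sketch does not attempt.

```rust
/// Collapse every run of ASCII digits into the `{N}` placeholder.
fn normalize_comm(name: &str) -> String {
    let mut out = String::with_capacity(name.len());
    let mut in_digits = false;
    for ch in name.chars() {
        if ch.is_ascii_digit() {
            // Emit the placeholder once per run of digits.
            if !in_digits {
                out.push_str("{N}");
            }
            in_digits = true;
        } else {
            out.push(ch);
            in_digits = false;
        }
    }
    out
}
```

Both `kworker/3:1` and `kworker/7:0` map to `kworker/{N}:{N}`, so threads from either CPU land in the same bucket on both sides of the diff.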
Choosing --group-by: start with the default all (cgroup, then
process, then thread pattern — it folds renamed-but-identical
cgroups together); use pcomm when you think in processes, cgroup
when comparing services or containers, and comm / comm-exact
when a single thread pool is the suspect. Most-used compare flags:
| Flag | Default | Description |
|---|---|---|
| --group-by AXIS | all | all, pcomm, cgroup, comm, or comm-exact (literal thread names). |
| --sections NAMES | every | Sub-tables to render, e.g. primary, taskstats-delay, derived, pressure, smaps-rollup. |
| --metrics NAMES | every | Metric allowlist (names from ktstr ctprof metric-list). |
| --sort-by SPEC | largest \|delta_pct\| | Multi-key sort: metric[:asc\|desc],.... |
| --limit N | 500 | Max rendered lines per section; 0 disables truncation. |
show renders a single snapshot without diff math, and
metric-list prints the metric vocabulary — see the
ctprof reference for those, the full flag
tables, aggregation rules, and taskstats kconfig gating.
locks
Enumerate every ktstr flock held on this host — read-only, never
acquires anything. When a build or test stalls behind a peer’s
reservation, ktstr locks names the peer without disturbing it:
$ ktstr locks
LLC locks:
LLC NODE LOCKFILE HOLDERS
0 0 /tmp/ktstr-llc-0.lock <none recorded>
1 0 /tmp/ktstr-llc-1.lock <none recorded>
...
Run-dir locks:
RUN KEY LOCKFILE HOLDERS
7.0.14-73730e0-dirty target/ktstr/.locks/7.0.14-73730e0-dirty.lock <none recorded>
An idle host shows <none recorded>; while a lock is held, the
HOLDERS column names the holder’s PID and cmdline
(cross-referenced against /proc/locks). Four lock-file roots are
scanned:
- {KTSTR_LOCK_DIR}/ktstr-llc-*.lock (default /tmp) — per-LLC reservations held by perf-mode test runs and --cpu-cap-bounded builds.
- {KTSTR_LOCK_DIR}/ktstr-cpu-*.lock — per-CPU reservations from the same flow.
- {cache_root}/.locks/*.lock — kernel-cache entry locks held during kernel build writes, plus per-source-tree locks held while building from a path.
- {runs_root}/.locks/{kernel}-{project_commit}.lock — sidecar write locks serializing concurrent runs targeting the same run directory.
| Flag | Default | Description |
|---|---|---|
| --json | off | JSON snapshot (pretty in one-shot mode; ndjson under --watch). |
| --watch DURATION | unset | Redraw at the interval until SIGINT (100ms, 1s, 5m). |
Available identically as cargo ktstr locks. The reservation model
behind these locks is documented in
Resource Budget.
completions
Generate shell completions (bash, zsh, fish, elvish,
powershell):
ktstr completions bash
--binary NAME overrides the registered name when invoking ktstr
through a differently-named symlink. The same subcommand exists as
cargo ktstr completions.
Assertable Metrics
Every regression comparison — cargo ktstr perf-delta and the
per-test PerfDeltaAssertion gate — is
driven by the metric registry: the static ktstr::stats::METRICS
table. Each entry carries a metric’s name, its regression
polarity, its aggregation kind, the dual-gate
significance thresholds, and a display unit. This chapter explains
those fields, how to enumerate the live catalog, which workloads emit
which metric families, and how to pin a per-test regression gate.
The catalog: stats list-metrics
The authoritative, always-current catalog is the command output — it enumerates the registry directly, so it never drifts from the code:
cargo ktstr stats list-metrics # text table
cargo ktstr stats list-metrics --json # machine-readable (includes kind + every field)
NAME                          POLARITY  DEFAULT_ABS  DEFAULT_REL  UNIT
worst_spread                  lower     5            0.25         %
worst_gap_ms                  lower     500          0.5          ms
total_migrations              lower     2            0.3
worst_migration_ratio         lower     0.05         0.2
max_imbalance_ratio           lower     1            0.25         x
...
worst_p99_wake_latency_us     lower     50           0.25         µs
worst_median_wake_latency_us  lower     20           0.25         µs
...
iteration_rate                higher    1            0.3          iter/s
total_iterations              higher    2            0.1
list-metrics reads only the static registry; it needs no sidecar
pool. Which of these metrics a particular run actually carries
depends on the emitting workload — see
Workload → emitted metrics.
(cargo ktstr stats list-values enumerates the pool’s filter
dimensions — kernels, commits, schedulers, topologies, work types —
not its metric keys, so it cannot answer which metrics are present.)
Registry fields
- name: the metric key (e.g. worst_spread, worst_gap_ms, sched_count_per_sec). This is the string a PerfDeltaAssertion names and the key perf-delta reports on.
- polarity: the regression direction:
  - LowerBetter: an increase is a regression (latency, spread).
  - HigherBetter: a decrease is a regression (throughput, iterations).
  - Informational: directionless. A change is shown but never counted as a regression or improvement, and never gates the exit.
  - TargetValue(t) / Unknown: also exist (rendered target(t) / unknown by list-metrics), but no registered metric uses them today.
- kind: how per-sample readings fold into the run-level value: Counter (sum), Peak (max-of-max), Gauge (average, last, or max, per metric), Rate (re-derived ratio), plus phase-aware kinds such as DeltaSum that fold pre-deltaed per-phase readings. The kind decides whether the cross-run fold is a mean, a max, or a re-derived ratio.
- default_abs / default_rel: the dual gate. A move counts as a confident regression only when it clears both the absolute floor (default_abs, in the metric’s units) and the relative threshold (default_rel, a fraction). The absolute floor’s role depends on the metric’s dynamic range:
  - Scale-bounded metrics (fractions, ratios, % spread, ms/µs latencies) use default_abs as a fixed unit-scale noise floor: a sub-unit move is immaterial regardless of its relative size.
  - Scale-varying metrics (*_per_sec rates, ops/s, req/s, raw counts) can span orders of magnitude across workloads, so a fixed floor would mask a large relative regression on a low-throughput workload. For these, default_abs is only a near-idle activity guard and default_rel carries materiality: a 40% drop is flagged whether the baseline is 50/s or 50000/s.

  perf-delta --threshold PCT / --policy FILE override the relative gate; the absolute gate is per-metric.
- display_unit: the unit rendered in tables (ms, /s, ns, …).
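The dual gate is easiest to see with numbers. The sketch below is a minimal, self-contained model of it, not ktstr's implementation: the Gate struct and is_regression function are hypothetical names, the comparison operators (>= for the absolute floor, > for the relative threshold) are assumptions, and so is the choice to never gate on a zero baseline. The thresholds are the worst_gap_ms and iteration_rate values from the list-metrics sample output.

```rust
/// Regression direction, mirroring the polarities described above.
#[derive(Clone, Copy)]
enum Polarity {
    LowerBetter,
    HigherBetter,
    Informational,
}

/// Hypothetical stand-in for the gate fields of one registry entry.
struct Gate {
    polarity: Polarity,
    default_abs: f64, // absolute floor, in the metric's units
    default_rel: f64, // relative threshold, as a fraction of the baseline
}

/// True only when the move worsens the metric AND clears both gates.
/// Assumption: a zero baseline never gates (the relative move is undefined).
fn is_regression(gate: &Gate, baseline: f64, current: f64) -> bool {
    let worsening = match gate.polarity {
        Polarity::LowerBetter => current - baseline,
        Polarity::HigherBetter => baseline - current,
        Polarity::Informational => return false, // shown, never gated
    };
    if worsening <= 0.0 || baseline == 0.0 {
        return false;
    }
    worsening >= gate.default_abs && worsening / baseline.abs() > gate.default_rel
}

fn main() {
    // worst_gap_ms: lower is better, abs floor 500 ms, relative threshold 0.5.
    let gap = Gate { polarity: Polarity::LowerBetter, default_abs: 500.0, default_rel: 0.5 };
    // iteration_rate: higher is better, abs floor 1, relative threshold 0.3.
    let rate = Gate { polarity: Polarity::HigherBetter, default_abs: 1.0, default_rel: 0.3 };
    let info = Gate { polarity: Polarity::Informational, default_abs: 0.0, default_rel: 0.0 };

    println!("gap 800 -> 1000:     {}", is_regression(&gap, 800.0, 1000.0));
    println!("gap 800 -> 1400:     {}", is_regression(&gap, 800.0, 1400.0));
    println!("gap 2000 -> 2600:    {}", is_regression(&gap, 2000.0, 2600.0));
    println!("rate 50 -> 30:       {}", is_regression(&rate, 50.0, 30.0));
    println!("rate 50000 -> 30000: {}", is_regression(&rate, 50000.0, 30000.0));
    println!("informational:       {}", is_regression(&info, 1.0, 100.0));
}
```

In this model the 800 → 1000 ms move fails the absolute floor, the 2000 → 2600 ms move fails the relative threshold, and only 800 → 1400 ms clears both. The two iteration_rate cases show the scale-varying behavior: with a floor of 1, the same 40% drop is flagged at 50/s and at 50000/s.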
Workload → emitted metrics
A metric only appears in a comparison if the run actually emitted it.
| Family | Example metrics | Emitted by | Present when |
|---|---|---|---|
| Spread / gap | worst_spread, worst_gap_ms | every scenario (scheduling-latency capture) | always |
| Iteration throughput | total_iterations, worst_iterations_per_cpu_sec | compute / spin workloads | the workload iterates; the *_per_cpu_sec form is overcommit-invariant |
| schedstat counters / rates | total_run_delay_ns_per_sched, total_ttwu_count, sched_count_per_sec | schedstat sampling over the run | schedstat capture enabled |
| IRQ / pressure | avg_irq_util, total_irq_pressure_us, max_cgroup_psi_irq_avg10 | IRQ-heavy scenarios, periodic host-pressure capture | those captures ran |
| NUMA locality | worst_page_locality, worst_cross_node_migration_ratio | NUMA-aware scenarios | multi-node topology |
| Payload metrics | sched_delay_msg_us, taobench_total_qps, schbench_loop_count | schbench / taobench payloads | the payload ran and reported |
Not every registry name can back a gate. perf-delta --must-fail
rejects the following up front, rather than accepting a gate that
would silently never fire: unknown names, internal rate components,
per-phase-only metrics, and (without --noise-adjust) whole-run
distribution metrics and informational metrics.
PerfDeltaAssertion how-to
A PerfDeltaAssertion is a per-test performance-regression gate. It
is inert during a normal cargo ktstr test run (the in-VM verdict
never consults it) and active only under cargo ktstr perf-delta --noise-adjust, which serializes the declaration into the sidecar
and enforces it host-side. Plain (scalar) perf-delta does not
evaluate declared gates: gating on a single run would flip CI on
noise, so only the multi-run --noise-adjust path (Welch /
disjoint-band separation) is a sound basis for a gate. Declaring a gate requires
performance_mode (checked by the macro at compile time and by test
discovery at run time).
A declaration names a registry metric and overrides, for this test,
the gate that decides a confident regression on it. It layers on top
of the --noise-adjust all-metrics regression net (which still runs
to catch unknown-unknown regressions) — it is an explicit contract
check, not a whitelist.
Bind each gate to a const and list it on the macro:
use ktstr::prelude::*;
// Name any metric from `cargo ktstr stats list-metrics`.
const SPREAD_GATE: PerfDeltaAssertion =
PerfDeltaAssertion::new("worst_spread").with_max_regression_pct(5.0);
#[ktstr_test(performance_mode = true, perf_delta_assertions = [SPREAD_GATE])]
fn schbench_steady() -> Scenario {
    // ... a degenerate / steady-state scenario whose worst_spread
    // must not regress more than 5% against the baseline commit.
}
Builders (all const fn, chainable):
- .with_max_regression_pct(pct): relative gate. A worsening move larger than pct% of the baseline gates. Unset → registry default_rel.
- .with_min_abs(min): absolute-materiality floor. A move smaller than min (in the metric’s units) never gates. Unset → registry default_abs.
- .with_direction(polarity): pin the regression direction instead of inheriting the registry polarity (e.g. treat an Informational metric as LowerBetter for this test).
- .with_phase(step_index): scope the gate to one phase (0 = BASELINE, 1..=N = scenario Step ordinals) instead of the whole-run value.
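To illustrate the "unset → registry default" behavior of these builders, here is a minimal sketch of a const fn builder whose unset fields fall back to registry values. GateSketch and its effective method are hypothetical stand-ins written for this example, not ktstr's PerfDeltaAssertion; the conversion of pct to a fraction (pct / 100) is an assumption based on default_rel being described as a fraction.

```rust
/// Hypothetical stand-in for a per-test gate; NOT ktstr's PerfDeltaAssertion.
#[derive(Clone, Copy, Debug)]
struct GateSketch {
    metric: &'static str,
    max_regression_pct: Option<f64>, // None = inherit registry default_rel
    min_abs: Option<f64>,            // None = inherit registry default_abs
}

impl GateSketch {
    const fn new(metric: &'static str) -> Self {
        Self { metric, max_regression_pct: None, min_abs: None }
    }

    const fn with_max_regression_pct(self, pct: f64) -> Self {
        Self { metric: self.metric, max_regression_pct: Some(pct), min_abs: self.min_abs }
    }

    const fn with_min_abs(self, min: f64) -> Self {
        Self { metric: self.metric, max_regression_pct: self.max_regression_pct, min_abs: Some(min) }
    }

    /// Resolve (relative fraction, absolute floor) against the registry defaults.
    /// Assumption: a percentage override maps to a fraction as pct / 100.
    fn effective(&self, default_rel: f64, default_abs: f64) -> (f64, f64) {
        let rel = self.max_regression_pct.map(|p| p / 100.0).unwrap_or(default_rel);
        (rel, self.min_abs.unwrap_or(default_abs))
    }
}

// Because the builders are const fn, a gate can be bound to a const.
const SPREAD: GateSketch = GateSketch::new("worst_spread").with_max_regression_pct(5.0);
const GAP: GateSketch = GateSketch::new("worst_gap_ms").with_min_abs(100.0);

fn main() {
    // Registry defaults taken from the list-metrics sample output.
    println!("{:?} -> {:?}", SPREAD, SPREAD.effective(0.25, 5.0));
    println!("{:?} -> {:?}", GAP, GAP.effective(0.5, 500.0));
}
```

In this sketch SPREAD overrides only the relative gate and keeps the registry's absolute floor of 5, while GAP overrides only the floor and keeps the registry's relative threshold of 0.5.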
Then gate CI with the noise-adjusted compare:
cargo ktstr perf-delta --noise-adjust 5 --kernel 7.0 \
-E 'test(schbench_steady)'
This runs schbench_steady five times at HEAD and five at the
baseline, and fails when worst_spread regresses past the declared
5% gate with statistical confidence. See
Runs and Regression Gates for the full
perf-delta workflow and the CI chapter for wiring the
gate into a pull-request job.
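For intuition about why five runs per side can support a verdict where one cannot, the sketch below computes a textbook Welch's t statistic with Welch-Satterthwaite degrees of freedom, plus a simple disjoint-band check, over two sets of five samples. It is an illustration under stated assumptions, not ktstr's code: the function names and sample values are invented for this example, the reading of "disjoint band" as non-overlapping min/max ranges is an assumption, and the confidence cutoff ktstr applies is not shown here.

```rust
/// Sample mean.
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Unbiased sample variance (n - 1 denominator).
fn variance(xs: &[f64]) -> f64 {
    let m = mean(xs);
    xs.iter().map(|x| (x - m).powi(2)).sum::<f64>() / (xs.len() as f64 - 1.0)
}

/// Welch's t statistic and Welch-Satterthwaite degrees of freedom
/// for head versus baseline (positive t means head is larger).
fn welch(baseline: &[f64], head: &[f64]) -> (f64, f64) {
    let (nb, nh) = (baseline.len() as f64, head.len() as f64);
    // Squared standard errors of the two means.
    let (sb, sh) = (variance(baseline) / nb, variance(head) / nh);
    let t = (mean(head) - mean(baseline)) / (sb + sh).sqrt();
    let df = (sb + sh).powi(2) / (sb.powi(2) / (nb - 1.0) + sh.powi(2) / (nh - 1.0));
    (t, df)
}

/// Assumed reading of "disjoint band": every head sample lies above
/// every baseline sample, so the two min/max ranges do not overlap.
fn bands_disjoint(baseline: &[f64], head: &[f64]) -> bool {
    let base_max = baseline.iter().cloned().fold(f64::MIN, f64::max);
    let head_min = head.iter().cloned().fold(f64::MAX, f64::min);
    head_min > base_max
}

fn main() {
    // Invented worst_spread readings (%), five runs per side.
    let baseline = [10.0, 11.0, 9.0, 10.0, 10.0];
    let head = [13.0, 14.0, 12.0, 13.0, 13.0];
    let (t, df) = welch(&baseline, &head);
    println!("t = {t:.3}, df = {df:.1}, disjoint = {}", bands_disjoint(&baseline, &head));
}
```

With these invented samples the means differ by 3 against a per-side variance of 0.5, giving t ≈ 6.71 on 8 degrees of freedom, and the ranges do not overlap. A single run per side has no variance estimate at all, which is the reason a scalar comparison cannot separate a real regression from noise.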
API Reference
The guide and the rustdoc split the work: the guide explains why and
when to reach for an API; the rustdoc carries signatures, field
semantics, and edge cases. The complete rustdoc for every ktstr
workspace crate is published at
ktstr.dev/rustdoc/ktstr, and
ktstr::prelude
re-exports everything a test author needs.
| You want to | Reach for | Rustdoc | Guide chapter |
|---|---|---|---|
| Declare a test | #[ktstr_test] | attr.ktstr_test | The #[ktstr_test] Attribute |
| Declare a scheduler | declare_scheduler!, Scheduler, SchedulerSpec | macro.declare_scheduler, test_support | Scheduler Definitions |
| Drive a scenario | Ctx, scenarios::* | scenario | Scenarios, Custom Scenarios |
| Compose steps and ops | Step, Op, HoldSpec, Backdrop | scenario::ops | Ops, Steps, and Backdrop |
| Shape cgroups and cpusets | CgroupDef, CpusetSpec | scenario::ops | Topology |
| Check results | Assert, AssertResult, Verdict, claim! | assert | Checking, Customize Checking |
| Gate performance regressions | PerfDeltaAssertion | test_support | Assertable Metrics |
| Generate load | WorkType, WorkloadConfig, WorkloadHandle | workload | Work Types, Workers and Workloads |
| Run guest binaries | #[derive(Payload)], Payload | derive.Payload | Payloads and Included Files |
| Capture guest state | Snapshot, SnapshotBridge, Sample, SampleSeries | scenario::snapshot, scenario::sample | Snapshots, Temporal Assertions |
The #[ktstr_test] attribute’s arguments (topology dimensions,
thresholds, execution flags) are documented in the
macro reference chapter and on
the attribute’s own
rustdoc page;
the macro itself lives in the ktstr-macros crate and is re-exported
at the crate root.