The #[ktstr_test] Attribute
#[ktstr_test] registers a function as an integration test that
boots a VM with a declared topology and runs the function body inside
it. This page is the attribute reference.
Most tests need only a handful of attributes: a scheduler, a topology dimension or two, a duration, and the checking thresholds the test is actually about.
use ktstr::prelude::*;
declare_scheduler!(MY_SCHED, {
    name = "my_sched",
    binary = "scx_my_sched",
    topology = (1, 2, 4, 1),
});
#[ktstr_test(
    scheduler = MY_SCHED, // scheduler under test (default: kernel EEVDF)
    threads = 2, // override one dimension; the rest inherit
    duration_s = 10, // workload window (default 12 s)
    not_starved = true, // enable starvation / spread / gap checks
    max_spread_pct = 20.0, // tighten the fairness-spread threshold
)]
fn smt_fairness(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![ctx.cgroup_def("cg_a"), ctx.cgroup_def("cg_b")])
}
The function signature is
fn(&Ctx) -> anyhow::Result<AssertResult>. The scheduler attribute
expects the bare const emitted by declare_scheduler!; see
Scheduler Definitions.
Attribute forms
All attributes are optional, with defaults, and most take
key = value. The sixteen bool attributes (auto_repro,
expect_auto_repro, not_starved, isolation, performance_mode,
pci, no_perf_mode, requires_smt, expect_err,
survives_storm, allow_inconclusive, fail_on_stall, host_only,
ignore, kaslr, wprof) also accept a bare form as shorthand for
= true — #[ktstr_test(host_only)] equals
#[ktstr_test(host_only = true)]. auto_repro and kaslr default
to true, so their meaningful spelling is auto_repro = false /
kaslr = false; the other fourteen default to false (or unset).
Each attribute key may appear at most once per invocation; a duplicate key fails at macro expansion rather than silently letting the later value win.
Topology
| Attribute | Default | Description |
|---|---|---|
| numa_nodes | inherited | NUMA nodes |
| llcs | inherited | Total LLCs (not per node) |
| cores | inherited | Cores per LLC |
| threads | inherited | Threads per core |
| memory_mib | 2048 | VM memory floor in MiB (see below) |
Each dimension independently inherits from Scheduler.topology when
a scheduler is specified and that dimension is not set. Without a
scheduler, unset dimensions use the macro defaults (numa_nodes = 1,
llcs = 1, cores = 2, threads = 1). See
Topology for the notation and what the
guest actually gets.
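The per-dimension fallback described above can be sketched in plain Rust. This is a standalone illustration, not the macro's implementation: the `Topology` and `Overrides` types, the `resolve` function, and the `total_cpus` formula (LLCs times cores times threads, read off the "Total LLCs" and "per LLC" / "per core" wording in the table) are all assumptions made for the example.

```rust
/// Hypothetical stand-in for a resolved topology; not a ktstr type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Topology {
    numa_nodes: u32,
    llcs: u32,    // total LLCs, not per node
    cores: u32,   // cores per LLC
    threads: u32, // threads per core
}

/// Macro defaults used when no scheduler is specified.
const MACRO_DEFAULTS: Topology = Topology { numa_nodes: 1, llcs: 1, cores: 2, threads: 1 };

/// Per-test attribute values; `None` means the dimension was not set.
#[derive(Default)]
struct Overrides {
    numa_nodes: Option<u32>,
    llcs: Option<u32>,
    cores: Option<u32>,
    threads: Option<u32>,
}

/// Resolve each dimension independently:
/// attribute value, else scheduler topology, else macro default.
fn resolve(over: &Overrides, scheduler: Option<Topology>) -> Topology {
    let base = scheduler.unwrap_or(MACRO_DEFAULTS);
    Topology {
        numa_nodes: over.numa_nodes.unwrap_or(base.numa_nodes),
        llcs: over.llcs.unwrap_or(base.llcs),
        cores: over.cores.unwrap_or(base.cores),
        threads: over.threads.unwrap_or(base.threads),
    }
}

/// Total vCPUs, assuming `llcs` is already a machine-wide count.
fn total_cpus(t: Topology) -> u32 {
    t.llcs * t.cores * t.threads
}
```

Under these assumptions, a scheduler topology of (1, 2, 4, 1) combined with `threads = 2` resolves to 1 node, 2 LLCs, 4 cores per LLC, and 2 threads per core.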
Memory
memory_mib is one of three floors: the framework allocates
max(total_cpus * 64, 256, memory_mib) MiB at VM launch. Above 32
vCPUs the CPU-based floor (64 MiB per vCPU) exceeds the default 2048,
so a 126-vCPU test gets 126 * 64 = 8064 MiB unless memory_mib is set
higher still. Raise memory_mib only when the test needs more headroom
than the per-CPU budget provides.
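The three-floor rule is a plain maximum. A minimal sketch, where `vm_memory_mib` is a hypothetical helper name chosen for the example rather than a framework function:

```rust
/// MiB allocated at VM launch: the largest of the per-CPU floor
/// (64 MiB per vCPU), the 256 MiB absolute floor, and `memory_mib`.
fn vm_memory_mib(total_cpus: u64, memory_mib: u64) -> u64 {
    (total_cpus * 64).max(256).max(memory_mib)
}
```

With the default `memory_mib` of 2048, the crossover sits at 32 vCPUs: 32 vCPUs still get 2048 MiB, and 33 vCPUs get 2112 MiB.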
Timing
| Attribute | Default | Description |
|---|---|---|
| duration_s | 12 | Workload window in seconds (ctx.duration) |
| watchdog_timeout_s | 5 | sched_ext watchdog override in seconds |
The watchdog override is applied via scx_sched.watchdog_timeout on
7.1+ kernels and via the static scx_watchdog_timeout symbol on
earlier kernels; when neither path is available the override logs a
warning and the kernel default stands.
Checking thresholds
Checking attributes override the merged check set
(library defaults → scheduler-level assert → per-test attributes).
Checking explains the two evaluation
channels; Customize Checking owns
the merge rules and worked overrides. Everything here is inherited
when unset.
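The layering order can be pictured as successive overlays of optional values. The sketch below assumes a simple "last layer that sets a field wins" rule; the `Checks` struct, its two fields, and the `overlay` / `merged` names are hypothetical, and the authoritative merge rules are the ones on the Customize Checking page.

```rust
/// Hypothetical two-field subset of a check set.
/// `None` means "unset at this layer".
#[derive(Debug, Clone, Copy, Default, PartialEq)]
struct Checks {
    max_gap_ms: Option<u64>,
    max_spread_pct: Option<f64>,
}

impl Checks {
    /// Overlay `upper` on `self`: a value set in `upper` wins,
    /// an unset field inherits from `self`.
    fn overlay(self, upper: Checks) -> Checks {
        Checks {
            max_gap_ms: upper.max_gap_ms.or(self.max_gap_ms),
            max_spread_pct: upper.max_spread_pct.or(self.max_spread_pct),
        }
    }
}

/// Merge in the documented order:
/// library defaults, then scheduler-level assert, then per-test attributes.
fn merged(library: Checks, scheduler: Checks, test: Checks) -> Checks {
    library.overlay(scheduler).overlay(test)
}
```

In this model a test that sets only `max_spread_pct` keeps whatever `max_gap_ms` the scheduler or library layer supplied.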
Worker checks — evaluated from per-worker telemetry after the scenario:
| Attribute | Unit | Example | Fails when |
|---|---|---|---|
| not_starved | bool | not_starved = true | any worker finishes with zero work units; also enables the spread and gap checks |
| isolation | bool | isolation = true | a worker ran on a CPU outside its cgroup’s cpuset |
| max_gap_ms | ms | max_gap_ms = 500 | a worker’s longest scheduling gap exceeds the cap |
| max_spread_pct | percentage points | max_spread_pct = 20.0 | max−min worker off-CPU% exceeds the cap |
| max_throughput_cv | coefficient of variation | max_throughput_cv = 0.35 | per-worker throughput CV exceeds the cap |
| min_work_rate | work units / CPU-second | min_work_rate = 1000.0 | a worker’s work rate falls below the floor |
| min_iteration_rate | iterations / second | min_iteration_rate = 50000.0 | a worker’s wall-clock iteration rate falls below the floor |
| max_migration_ratio | migrations / iteration | max_migration_ratio = 0.5 | a cgroup’s migration ratio exceeds the cap |
| max_p99_wake_latency_ns | ns | max_p99_wake_latency_ns = 2000000 | p99 wake latency exceeds the cap |
| max_wake_latency_cv | coefficient of variation | max_wake_latency_cv = 1.0 | wake-latency CV exceeds the cap |
| min_page_locality | fraction 0.0–1.0 | min_page_locality = 0.9 | fraction of pages on expected NUMA nodes falls below the floor |
| max_cross_node_migration_ratio | fraction 0.0–1.0 | max_cross_node_migration_ratio = 0.1 | NUMA-migrated pages / total pages exceeds the cap |
| max_slow_tier_ratio | fraction 0.0–1.0 | max_slow_tier_ratio = 0.05 | pages on memory-only (CXL) nodes exceed the cap |
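Two of the units in the table are easy to misread, so here is a small sketch of how a spread in percentage points and a coefficient of variation are commonly computed. The function names are hypothetical, and the choice of population (rather than sample) standard deviation is an assumption; the framework's exact formula is not stated on this page.

```rust
/// Spread in percentage points: max minus min of the per-worker
/// off-CPU percentages. Returns `None` for an empty input.
fn spread_pct(off_cpu_pct: &[f64]) -> Option<f64> {
    if off_cpu_pct.is_empty() {
        return None;
    }
    let max = off_cpu_pct.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    let min = off_cpu_pct.iter().copied().fold(f64::INFINITY, f64::min);
    Some(max - min)
}

/// Coefficient of variation: population standard deviation divided
/// by the mean. Returns `None` for an empty input or a zero mean.
fn coefficient_of_variation(values: &[f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    let n = values.len() as f64;
    let mean = values.iter().sum::<f64>() / n;
    if mean == 0.0 {
        return None;
    }
    let variance = values.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / n;
    Some(variance.sqrt() / mean)
}
```

Workers at 10%, 25%, and 18% off-CPU give a spread of 15 percentage points, which would pass `max_spread_pct = 20.0`.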
Monitor thresholds — evaluated from the host monitor’s samples of guest scheduler state:
| Attribute | Unit | Example | Fails when |
|---|---|---|---|
| max_imbalance_ratio | ratio | max_imbalance_ratio = 2.0 | observed run-queue imbalance exceeds the cap |
| max_local_dsq_depth | tasks | max_local_dsq_depth = 8 | a local DSQ grows deeper than the cap |
| fail_on_stall | bool | fail_on_stall | the monitor’s stall detection fails the test instead of reporting |
| sustained_samples | samples | sustained_samples = 3 | window size a violation must persist for before it counts |
| max_fallback_rate | events/s | max_fallback_rate = 5.0 | fallback-dispatch event rate exceeds the cap |
| max_keep_last_rate | events/s | max_keep_last_rate = 100.0 | keep-last event rate exceeds the cap |
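The role of sustained_samples is to filter out one-off spikes. A rough sketch of that idea follows, assuming that "persist" means consecutive monitor samples above the cap and that a window of 0 behaves like 1; both are assumptions, and `sustained_violation` is a hypothetical name.

```rust
/// Returns true if `samples` contains a run of at least `sustained`
/// consecutive values strictly above `cap`.
fn sustained_violation(samples: &[f64], cap: f64, sustained: usize) -> bool {
    // Assumption: a window of 0 is treated like 1.
    let needed = sustained.max(1);
    let mut run = 0usize;
    for &sample in samples {
        if sample > cap {
            run += 1;
            if run >= needed {
                return true;
            }
        } else {
            // Any in-bounds sample resets the run.
            run = 0;
        }
    }
    false
}
```

With `sustained_samples = 3` and a cap of 2.0, the series 1, 3, 3, 1, 3 never counts as a violation in this model, while 3, 3, 3 does.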
What a failing gate looks like
A deliberately unreachable floor, to show the output shape — 50M iterations/s is far beyond what two workers deliver:
declare_scheduler!(MY_SCHED, { name = "my_sched", binary = "scx-ktstr" });
#[ktstr_test(scheduler = MY_SCHED, llcs = 1, cores = 2, threads = 1, duration_s = 5)]
fn throughput_gate(ctx: &Ctx) -> Result<AssertResult> {
    let checks = Assert::default_checks().min_iteration_rate(50_000_000.0);
    let steps = vec![Step {
        setup: vec![ctx.cgroup_def("cg_a"), ctx.cgroup_def("cg_b")].into(),
        ops: vec![],
        hold: HoldSpec::FULL,
    }];
    execute_steps_with(ctx, steps, Some(&checks))
}
TRY 1 FAIL [ 31.810s] (───) ktstr::docs_demo ktstr/throughput_gate
stderr ───
...
ktstr_test 'throughput_gate' [sched=scx-ktstr] [topo=1n1l2c1t] failed:
worker 71 iteration rate 41903.3/s below floor 50000000.0/s
worker 73 iteration rate 37834.5/s below floor 50000000.0/s
--- stats ---
2 workers, 4 cpus, 2 migrations, worst_spread=0.0%, worst_gap=21ms
cg0: workers=1 cpus=2 spread=0.0% gap=10ms migrations=1 iter=209600
cg1: workers=1 cpus=2 spread=0.0% gap=21ms migrations=1 iter=189252
...
--- monitor ---
samples=41 max_imbalance=2.00 max_dsq_depth=0 stuck=0
avg: imbalance=1.32 nr_running/cpu=1.2 dsq/cpu=0.0
events: fallback=0 (0.0/s) keep_last=210 (52.5/s) offline=0
...
verdict: monitor OK
The failure names the worker, the measured value, and the threshold it crossed; the stats and monitor sections that follow are the context for deciding whether the threshold or the scheduler is wrong. Reading Failure Output walks the full transcript.
Expected-error matchers
Two attributes narrow which failure counts as the expected bug in an
expect_err = true reproducer test (both require expect_err; both
may be set, composing with AND semantics):
- expect_scx_bpf_error_contains = "literal": the captured scx_bpf_error text must contain the literal substring. Empty strings panic at construction.
- expect_scx_bpf_error_matches = "regex": the text must match the regex. Empty patterns, invalid syntax, and any pattern that matches the empty string (a?, .*, ^$) panic at construction, so a vacuous matcher can never silently pass. ^ / $ anchor to the whole string by default (use (?m) for line anchors), and a bare \b slips the vacuity gate, so prefer a substring.
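The vacuity gate and the AND composition can be sketched without a regex engine. In the example below a plain function pointer stands in for the compiled regex, and construction returns `Err` where the text above describes a panic; the `ExpectedError` type and its methods are hypothetical and exist only for this illustration.

```rust
/// Hypothetical matcher pair. `matches` stands in for a compiled regex.
struct ExpectedError {
    contains: Option<String>,
    matches: Option<fn(&str) -> bool>,
}

impl ExpectedError {
    /// Reject vacuous matchers up front (the real attributes panic instead).
    fn new(contains: Option<&str>, matches: Option<fn(&str) -> bool>) -> Result<Self, String> {
        if let Some(literal) = contains {
            if literal.is_empty() {
                return Err("empty substring is contained in every error".to_string());
            }
        }
        if let Some(pattern) = matches {
            // Vacuity gate: a pattern that accepts "" is treated as vacuous.
            if pattern("") {
                return Err("pattern matches the empty string".to_string());
            }
        }
        Ok(Self { contains: contains.map(str::to_string), matches })
    }

    /// AND semantics: every matcher that is set must accept the text.
    fn accepts(&self, text: &str) -> bool {
        let literal_ok = self.contains.as_deref().map_or(true, |lit| text.contains(lit));
        let pattern_ok = self.matches.map_or(true, |pattern| pattern(text));
        literal_ok && pattern_ok
    }
}
```

Probing the pattern with the empty string is only a proxy for vacuity, which is why a pattern like a bare \b can still get through such a gate.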
See Investigate a Crash for the pin-an-error-as-a-regression-test workflow these serve.
Topology constraints
These filter which gauntlet presets a test expands into; the base
ktstr/ variant is unaffected.
| Attribute | Default | Description |
|---|---|---|
| min_llcs / max_llcs | 1 / 12 | LLC-count bounds |
| min_numa_nodes / max_numa_nodes | 1 / 1 | NUMA-node bounds; multi-NUMA presets are opt-in |
| min_cpus / max_cpus | 1 / 192 | Total-CPU bounds |
| requires_smt | false | Only SMT (threads > 1) presets; skips the test entirely on aarch64, which ships no SMT presets |
The gauntlet skips presets that fail any bound. See Gauntlet for the preset table, filtering rules, and a worked expansion.
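As a rough model of the filter, each preset is admitted only if it satisfies every bound. The `Preset` and `Constraints` types and the total-CPU formula below are assumptions made for the sketch; the defaults are the ones from the table above.

```rust
/// Hypothetical preset shape; the real preset table lives on the Gauntlet page.
#[derive(Debug, Clone, Copy)]
struct Preset {
    numa_nodes: u32,
    llcs: u32,
    cores: u32,
    threads: u32,
}

impl Preset {
    /// Total CPUs, assuming `llcs` is a machine-wide count.
    fn cpus(&self) -> u32 {
        self.llcs * self.cores * self.threads
    }
}

#[derive(Debug, Clone, Copy)]
struct Constraints {
    min_llcs: u32,
    max_llcs: u32,
    min_numa_nodes: u32,
    max_numa_nodes: u32,
    min_cpus: u32,
    max_cpus: u32,
    requires_smt: bool,
}

impl Default for Constraints {
    /// Defaults taken from the constraints table.
    fn default() -> Self {
        Self {
            min_llcs: 1,
            max_llcs: 12,
            min_numa_nodes: 1,
            max_numa_nodes: 1,
            min_cpus: 1,
            max_cpus: 192,
            requires_smt: false,
        }
    }
}

impl Constraints {
    /// A preset is kept only if it passes every bound.
    fn admits(&self, p: &Preset) -> bool {
        (self.min_llcs..=self.max_llcs).contains(&p.llcs)
            && (self.min_numa_nodes..=self.max_numa_nodes).contains(&p.numa_nodes)
            && (self.min_cpus..=self.max_cpus).contains(&p.cpus())
            && (!self.requires_smt || p.threads > 1)
    }
}
```

With the defaults, a two-node preset is skipped until the test raises `max_numa_nodes`, which is what "opt-in" means in the table.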
Execution attributes
auto_repro / expect_auto_repro
On scheduler crash, auto_repro (default true) boots a second VM
with probes attached to capture the state on the way to the crash —
see Auto-Repro. Set
auto_repro = false for faster iteration; expect_err = true also
disables it. expect_auto_repro = true inverts the assertion: the
test fails unless the auto-repro path actually fired (used to pin
the repro machinery itself). It requires a scheduler and wprof,
and is rejected alongside auto_repro = false, expect_err, or
host_only.
expect_err / survives_storm / allow_inconclusive
expect_err = true asserts the run returns Err — the negative
test: a scheduler crash or scenario failure is the expected outcome,
and a clean pass fails the test. survives_storm = true is the
positive inverse: the scx scheduler must stay attached and alive
through every hold; a death or ejection fails with a
survival-specific explainer. It requires a scheduler and is mutually
exclusive with expect_err and expect_auto_repro.
allow_inconclusive = true lets an Inconclusive verdict pass
instead of exiting 2 — see
Checking for when Inconclusive arises.
performance_mode / no_perf_mode / cpu_budget
performance_mode = true pins vCPUs to reserved host cores with
hugepages, NUMA binding, and RT scheduling, for runs whose numbers
must be comparable — see
Performance Mode.
no_perf_mode = true goes the other way: build the VM with the
declared topology even on a smaller host, skipping all pinning.
The two are mutually exclusive (rejected at compile time).
cpu_budget = N overrides the auto-derived host-CPU mask size in
no-perf mode (must be > 0, requires no_perf_mode; an explicit
--cpu-cap / KTSTR_CPU_CAP still wins) — see
Resource Budget.
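The precedence among the three sources of the host-CPU mask size, and the two validity rules for cpu_budget, can be sketched as follows. Both function names are hypothetical, and the real validation happens inside the macro and runner rather than in helpers like these.

```rust
/// Pick the host-CPU mask size in no-perf mode: an explicit cap
/// (--cpu-cap / KTSTR_CPU_CAP) wins, then `cpu_budget`, then the
/// auto-derived size.
fn host_cpu_mask_size(
    explicit_cap: Option<usize>,
    cpu_budget: Option<usize>,
    auto_derived: usize,
) -> usize {
    explicit_cap.or(cpu_budget).unwrap_or(auto_derived)
}

/// `cpu_budget` must be > 0 and is only meaningful with `no_perf_mode`.
fn validate_cpu_budget(cpu_budget: Option<usize>, no_perf_mode: bool) -> Result<(), &'static str> {
    match cpu_budget {
        Some(0) => Err("cpu_budget must be > 0"),
        Some(_) if !no_perf_mode => Err("cpu_budget requires no_perf_mode"),
        _ => Ok(()),
    }
}
```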
kaslr
Default true: the guest boots with KASLR enabled, so tests run
against the memory layout real systems have. kaslr = false
appends nokaslr to the guest command line for the rare workflow
that needs stable kernel addresses.
host_only
Run the function directly on the host — no VM. For tests that need
host tools (cargo, nested VMs) unavailable in the guest initramfs.
Mutually exclusive with scheduler, num_snapshots > 0,
auto_repro = true, disk, and networks — everything that
requires a VM to exist.
ignore
Emits #[ignore] on the generated test, so it is skipped by default
and runs only under nextest’s --run-ignored. Use for
slow-by-design tests that should not gate every local run.
pci / disk / networks
disk = CONST attaches a virtio-blk device from a const DiskConfig (built via DiskConfig::DEFAULT chained setters); the
framework owns the backing file. networks = [CONST, …] attaches
one virtio-net device per const NetConfig (aarch64 supports at
most one). On x86_64 either device auto-enables the virtio-PCI
transport; pci = true forces the PCI host bridge on without a
device (default: no bridge, pci=off on the guest command line).
bpf_map_write
bpf_map_write = CONST (or [A, B]) writes a u32 into a named
scheduler BPF-map field from the host, once, after the scheduler
loads — pre-seeding guest state before workers start. See
Snapshots — composing reads with writes.
watch_bpf_maps
watch_bpf_maps = CONST (or [A, B]) samples named scheduler
BPF-map fields observer-effect-free during the run and surfaces each
as a run-level metric. Full semantics in
Watching BPF-map fields below.
perf_delta_assertions
perf_delta_assertions = CONST (or [A, B]) declares per-test
performance-regression gates. Inert in a normal cargo ktstr test
run — enforced only under cargo ktstr perf-delta --noise-adjust.
Requires performance_mode (rejected at compile time otherwise,
since unpinned numbers would make the gate misfire). See
Assertable Metrics.
num_snapshots
num_snapshots = N fires N periodic BPF-state captures inside the
workload’s 10%–90% window. 0 (default) disables periodic capture.
Validation rejects N above the 64-capture bridge cap, any N > 0
combined with host_only, and boundary spacing under 100 ms. See
Periodic Capture.
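To make the window and spacing limits concrete, here is a sketch that places N capture points evenly inside the 10%–90% window. Even placement at the interior boundaries of N + 1 equal slices is purely an assumption for illustration; the page does not state how the framework positions captures, and `capture_offsets_ms` is a hypothetical helper.

```rust
/// Offsets (ms from workload start) for `n` evenly spaced captures
/// inside the 10%-90% window of a `duration_ms` workload.
/// Assumption: captures sit on the interior boundaries of n + 1 equal slices.
fn capture_offsets_ms(duration_ms: u64, n: u64) -> Result<Vec<u64>, String> {
    if n == 0 {
        return Ok(Vec::new()); // periodic capture disabled
    }
    if n > 64 {
        return Err(format!("{} captures exceed the 64-capture cap", n));
    }
    let start = duration_ms / 10;
    let end = duration_ms * 9 / 10;
    let step = (end - start) / (n + 1);
    if step < 100 {
        return Err(format!("boundary spacing {} ms is below the 100 ms minimum", step));
    }
    Ok((1..=n).map(|i| start + i * step).collect())
}
```

Under this model the default 12 s window with `num_snapshots = 3` yields captures at 3600, 6000, and 8400 ms, while ten captures in a 1 s workload are rejected for spacing.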
post_vm / post_vm_unconditional
Host-side callbacks (fn(&VmResult) -> anyhow::Result<()>) invoked
after the VM exits — the place to drain snapshot bridges, read run
metrics, and run temporal assertions.
post_vm is suppressed when the guest reported failure;
post_vm_unconditional always runs (guard it with
if !result.success { return Ok(()); } when it reads state a crash
may not have produced, and note it never turns a guest failure into
a pass). When num_snapshots > 0 and post_vm is omitted, the
macro installs a default callback asserting at least one periodic
capture landed with real BPF state.
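The difference between the two hooks comes down to when each runs and what it can change. The sketch below models that with hypothetical types (`VmOutcome` standing in for VmResult, `String` errors instead of anyhow); the order in which the two hooks run is an assumption.

```rust
/// Hypothetical stand-in for the VM result handed to the hooks.
struct VmOutcome {
    success: bool,
}

type Hook = fn(&VmOutcome) -> Result<(), String>;

/// Combine the guest verdict with the host-side hooks.
fn run_hooks(
    outcome: &VmOutcome,
    post_vm: Option<Hook>,
    post_vm_unconditional: Option<Hook>,
) -> Result<(), String> {
    let mut verdict = if outcome.success {
        Ok(())
    } else {
        Err("guest reported failure".to_string())
    };
    // post_vm is suppressed when the guest reported failure.
    if outcome.success {
        if let Some(hook) = post_vm {
            verdict = verdict.and(hook(outcome));
        }
    }
    // post_vm_unconditional always runs. It can add a failure,
    // but `and` keeps an earlier Err, so it never turns a guest
    // failure into a pass.
    if let Some(hook) = post_vm_unconditional {
        let hook_result = hook(outcome);
        verdict = verdict.and(hook_result);
    }
    verdict
}
```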
cleanup_budget_ms
Caps host-side VM teardown wall time; exceeding the budget folds a
failing detail into the test result. Unset disables the check; 0
is rejected.
staged_schedulers
staged_schedulers = [PATH, …] packs additional &'static Scheduler binaries into the guest at boot. Required for scenarios
that invoke Op::ReplaceScheduler / Op::AttachScheduler — the
swap target must already be on disk in the guest. See
Ops.
workload_root_cgroup
workload_root_cgroup = "/path" places the per-test workload
cgroups under a specific guest cgroup path, decoupled from the
scheduler’s cgroup_parent (which roots scheduler-side cells).
wprof / wprof_args
wprof = true attaches the wprof BPF tracer to the workload VM;
wprof_args = "..." passes space-separated CLI args. Both require
the wprof cargo feature.
payload / workloads / extra_include_files / extra_sched_args
payload = CONST declares the test’s primary benchmark binary and
workloads = [A, B] composes more alongside it; the include-file
pipeline packs every referenced binary into the guest. See
Payloads and Included Files.
extra_include_files = ["path", …] adds test-level host files that
belong to no particular payload. extra_sched_args = ["--flag", …]
appends scheduler CLI args after the scheduler’s own sched_args.
config
Inline scheduler config content, paired with a scheduler that
declares config_file_def — covered next.
Inline scheduler config
Some schedulers (e.g. scx_layered, scx_lavd) accept a JSON
config file via a CLI argument like --config /path/to/config.json.
Two pieces wire this into a test:
- Scheduler declaration: declares the arg template and the guest path via config_file_def:

      const LAYERED_SCHED: Scheduler = Scheduler::named("layered")
          .binary(SchedulerSpec::Discover("scx_layered"))
          .config_file_def("--config {file}", "/include-files/layered.json");

  {file} in the arg template is replaced with the guest path. The framework writes the config content to that path inside the guest before the scheduler binary starts.
- Test attribute: supplies the inline content:

      const LAYERED_CONFIG: &str = r#"{ "layers": [...] }"#;
      #[ktstr_test(scheduler = LAYERED_SCHED, config = LAYERED_CONFIG)]
      fn layered_test(ctx: &Ctx) -> Result<AssertResult> {
          Ok(AssertResult::pass())
      }

  Both a string literal and a path to a const &'static str are accepted.
The pairing gate is bidirectional and enforced at compile time (and
again at runtime for programmatic entry construction): a scheduler
with config_file_def requires config = … on every test, and a
scheduler without it rejects config = … — the content would
otherwise be silently dropped.
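Both halves of this mechanism, the {file} substitution and the bidirectional pairing gate, are small enough to model directly. The helpers below are hypothetical and only mirror the behaviour described above; splitting the expanded template on whitespace is an assumption that would not hold for guest paths containing spaces.

```rust
/// Expand the `{file}` placeholder and split into CLI arguments.
/// Assumption: arguments are separated by whitespace.
fn expand_config_args(template: &str, guest_path: &str) -> Vec<String> {
    template
        .replace("{file}", guest_path)
        .split_whitespace()
        .map(str::to_string)
        .collect()
}

/// The pairing gate: `config_file_def` and `config = ...` must
/// appear together or not at all.
fn check_pairing(
    scheduler_has_config_file_def: bool,
    test_has_config: bool,
) -> Result<(), &'static str> {
    match (scheduler_has_config_file_def, test_has_config) {
        (true, false) => Err("scheduler declares config_file_def but the test sets no config"),
        (false, true) => Err("test sets config but the scheduler declares no config_file_def"),
        _ => Ok(()),
    }
}
```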
For schedulers that take the same config file on every test, use
Scheduler::config_file(host_path) instead — see
Scheduler Definitions.
Watching BPF-map fields
watch_bpf_maps turns “the scheduler computed X” into a post-VM
assertion. The free-running host monitor reads the named field from
the running guest’s BPF-map memory via BTF — without freezing vCPUs
— and folds the samples into a run-level metric.
Each declared const is one WatchBpfMap::new(map_name_suffix, field, agg, label):
- map_name_suffix: matched against a loaded BPF map by ends_with (".bss" for a section global, or a named map like "cpu_ctx_stor").
- field: a dot-path into the map’s value type ("sys_stat.avg_lat_cri", or a bare global like "lat_headroom").
- agg: pick by the field’s semantic class:
  - BpfMapAgg::Scalar: a gauge; folded as the mean over the run’s samples.
  - BpfMapAgg::ScalarCounter: a monotonic counter; folded as the value at the last sample (the final total, not a mean of a rising series).
  - BpfMapAgg::PerCpu: a per-CPU gauge array; folded into a cross-CPU mean and max.
  - BpfMapAgg::PerCpuCounter: a per-CPU counter array; folded as the cross-CPU sum at the last sample. Watch at u64 width so no per-CPU slot truncates before the sum.
- label: the metric-key leaf; must be unique within a test.
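The four aggregation classes differ only in how a series of samples is folded into a number. A sketch of the folds as described above, using hypothetical free functions rather than the framework's types; pooling every (sample, CPU) reading for the per-CPU gauge is an assumption, since the text specifies only "cross-CPU mean and max".

```rust
/// Scalar gauge: mean over the run's samples.
fn fold_scalar(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    Some(samples.iter().sum::<f64>() / samples.len() as f64)
}

/// Scalar counter: the value at the last sample (the final total).
fn fold_scalar_counter(samples: &[u64]) -> Option<u64> {
    samples.last().copied()
}

/// Per-CPU gauge: (mean, max) across CPUs.
/// Assumption: readings from all samples are pooled.
fn fold_per_cpu(samples: &[Vec<f64>]) -> Option<(f64, f64)> {
    let readings: Vec<f64> = samples.iter().flatten().copied().collect();
    if readings.is_empty() {
        return None;
    }
    let mean = readings.iter().sum::<f64>() / readings.len() as f64;
    let max = readings.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    Some((mean, max))
}

/// Per-CPU counter: cross-CPU sum at the last sample, kept at u64 width.
fn fold_per_cpu_counter(samples: &[Vec<u64>]) -> Option<u64> {
    samples.last().map(|cpus| cpus.iter().sum())
}
```

The counter folds show why the class matters: averaging a rising counter such as 5, 9, 14 would report about 9.3, while the last-sample fold reports the actual total of 14.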
The metric key is <scheduler-obj>_<label> (per-CPU gauges get
_avg / _max variants). The prefix is libbpf’s object name from
the scheduler’s global-section map, which can differ from the ops
name — scx-ktstr’s object is bpf_bpf, so its prefix is bpf_bpf.
Read the metric back with VmResult::run_metric in a post_vm
hook; an absent metric returns None, never a false 0.0.
const AVG_LAT_CRI: WatchBpfMap =
    WatchBpfMap::new(".bss", "sys_stat.avg_lat_cri", BpfMapAgg::Scalar, "avg_lat_cri");
const LAT_HEADROOM: WatchBpfMap =
    WatchBpfMap::new("cpu_ctx_stor", "lat_headroom", BpfMapAgg::PerCpu, "lat_headroom");
fn check(result: &VmResult) -> anyhow::Result<()> {
    let avg_lat_cri = result.run_metric("scx_lavd_avg_lat_cri")
        .ok_or_else(|| anyhow::anyhow!("avg_lat_cri absent"))?;
    let headroom_max = result.run_metric("scx_lavd_lat_headroom_max")
        .ok_or_else(|| anyhow::anyhow!("lat_headroom_max absent"))?;
    anyhow::ensure!(avg_lat_cri.is_finite() && headroom_max.is_finite());
    Ok(())
}
#[ktstr_test(
    scheduler = SCX_LAVD,
    watch_bpf_maps = [AVG_LAT_CRI, LAT_HEADROOM],
    post_vm = check,
)]
fn lat_metrics_surface(ctx: &Ctx) -> anyhow::Result<AssertResult> { /* workload */ }
Resolution is lazy: the maps appear only after the scheduler attaches, so the monitor retries until the named map is present, then caches the resolved offset and re-reads only the leaf bytes each tick.
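The retry-then-cache behaviour can be sketched with in-memory stand-ins. Everything here is hypothetical: `LoadedMaps` replaces guest memory, `Layout` replaces the BTF lookup, and the leaf is assumed to be a little-endian u32. The point is only the control flow: return nothing until the map exists, then reuse the cached name and offset on every later tick.

```rust
use std::collections::HashMap;

/// Hypothetical guest view: loaded map name -> raw value bytes.
type LoadedMaps = HashMap<String, Vec<u8>>;
/// Hypothetical BTF stand-in: field path -> byte offset within the value.
type Layout = HashMap<String, usize>;

struct Watcher {
    map_suffix: String,
    field: String,
    /// Cached (full map name, leaf offset) once resolution succeeds.
    resolved: Option<(String, usize)>,
}

impl Watcher {
    /// One monitor tick: resolve lazily, then read only the 4 leaf bytes.
    fn tick(&mut self, maps: &LoadedMaps, layout: &Layout) -> Option<u32> {
        if self.resolved.is_none() {
            // Not attached yet: `?` returns None and the next tick retries.
            let name = maps.keys().find(|n| n.ends_with(self.map_suffix.as_str()))?;
            let offset = *layout.get(&self.field)?;
            self.resolved = Some((name.clone(), offset));
        }
        let (name, offset) = self.resolved.as_ref()?;
        let leaf = maps.get(name)?.get(*offset..*offset + 4)?;
        let mut buf = [0u8; 4];
        buf.copy_from_slice(leaf);
        Some(u32::from_le_bytes(buf))
    }
}
```

A tick before the scheduler attaches yields `None`, which lines up with an absent metric surfacing as `None` rather than a false 0.0.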
What the macro generates
The macro renames the function, registers it in the KTSTR_TESTS
distributed slice, and emits a #[test] wrapper that boots the VM
and dispatches. Details in the
attribute rustdoc.