The #[ktstr_test] Attribute

#[ktstr_test] registers a function as an integration test that boots a VM with a declared topology and runs the function body inside it. This page is the attribute reference.

Most tests need only a handful of attributes: a scheduler, a topology dimension or two, a duration, and the checking thresholds the test is actually about.

use ktstr::prelude::*;

declare_scheduler!(MY_SCHED, {
    name = "my_sched",
    binary = "scx_my_sched",
    topology = (1, 2, 4, 1),
});

#[ktstr_test(
    scheduler = MY_SCHED,   // scheduler under test (default: kernel EEVDF)
    threads = 2,            // override one dimension; the rest inherit
    duration_s = 10,        // workload window (default 12 s)
    not_starved = true,     // enable starvation / spread / gap checks
    max_spread_pct = 20.0,  // tighten the fairness-spread threshold
)]
fn smt_fairness(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![ctx.cgroup_def("cg_a"), ctx.cgroup_def("cg_b")])
}

The function signature is fn(&Ctx) -> anyhow::Result<AssertResult>. scheduler = expects the bare const emitted by declare_scheduler! — see Scheduler Definitions.

Attribute forms

All attributes are optional, with defaults, and most take key = value. The sixteen bool attributes (auto_repro, expect_auto_repro, not_starved, isolation, performance_mode, pci, no_perf_mode, requires_smt, expect_err, survives_storm, allow_inconclusive, fail_on_stall, host_only, ignore, kaslr, wprof) also accept a bare form as shorthand for = true: #[ktstr_test(host_only)] is equivalent to #[ktstr_test(host_only = true)]. auto_repro and kaslr default to true, so their meaningful spelling is auto_repro = false / kaslr = false; the other fourteen default to false (or unset).

Each attribute key may appear at most once per invocation; a duplicate key fails at macro expansion rather than silently letting the later value win.

Topology

Attribute | Default | Description
numa_nodes | inherited | NUMA nodes
llcs | inherited | Total LLCs (not per node)
cores | inherited | Cores per LLC
threads | inherited | Threads per core
memory_mib | 2048 | VM memory floor in MiB (see below)

Each dimension independently inherits from Scheduler.topology when a scheduler is specified and that dimension is not set. Without a scheduler, unset dimensions use the macro defaults (numa_nodes = 1, llcs = 1, cores = 2, threads = 1). See Topology for the notation and what the guest actually gets.
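The per-dimension fallback can be pictured with a minimal sketch. Topo, Overrides, resolve, and total_cpus are hypothetical names for illustration, not ktstr types. The sketch also assumes that the topology tuple in declare_scheduler! is ordered (numa_nodes, llcs, cores, threads) and that the total CPU count is llcs × cores × threads, which is what the column descriptions above suggest.

```rust
/// Hypothetical stand-in for a four-dimension topology (not a ktstr type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Topo {
    numa_nodes: u32,
    llcs: u32,    // total LLCs, not per node
    cores: u32,   // cores per LLC
    threads: u32, // threads per core
}

/// Per-test attribute values; `None` means the attribute was left unset.
#[derive(Debug, Clone, Copy, Default)]
struct Overrides {
    numa_nodes: Option<u32>,
    llcs: Option<u32>,
    cores: Option<u32>,
    threads: Option<u32>,
}

/// Macro defaults quoted above, used when no scheduler is specified.
const MACRO_DEFAULTS: Topo = Topo { numa_nodes: 1, llcs: 1, cores: 2, threads: 1 };

/// Resolve each dimension on its own: explicit attribute first, otherwise the
/// scheduler's topology, otherwise the macro default.
fn resolve(scheduler: Option<Topo>, attrs: Overrides) -> Topo {
    let base = scheduler.unwrap_or(MACRO_DEFAULTS);
    Topo {
        numa_nodes: attrs.numa_nodes.unwrap_or(base.numa_nodes),
        llcs: attrs.llcs.unwrap_or(base.llcs),
        cores: attrs.cores.unwrap_or(base.cores),
        threads: attrs.threads.unwrap_or(base.threads),
    }
}

/// Assumed total vCPU count: LLCs x cores per LLC x threads per core.
fn total_cpus(t: Topo) -> u32 {
    t.llcs * t.cores * t.threads
}
```

Under these assumptions, the opening example (scheduler topology (1, 2, 4, 1) with threads = 2) would resolve to 1 NUMA node, 2 LLCs, 4 cores per LLC, and 2 threads per core, for 16 vCPUs.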

Memory

memory_mib is one of three floors: the framework allocates max(total_cpus * 64, 256, memory_mib) MiB at VM launch. Above 32 vCPUs the CPU-based floor dominates the default 2048, so a 126-vCPU test gets 8064 MiB regardless. Raise memory_mib only when the test needs more headroom than the per-CPU budget provides.
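The three-floor rule is easy to check by hand. The sketch below restates the formula from the paragraph above; vm_memory_mib is a hypothetical helper name, not a ktstr function.

```rust
/// Memory the VM is launched with, in MiB: the largest of the per-CPU
/// budget (64 MiB per vCPU), the 256 MiB absolute floor, and `memory_mib`.
/// Hypothetical helper mirroring the formula in the text.
fn vm_memory_mib(total_cpus: u64, memory_mib: u64) -> u64 {
    (total_cpus * 64).max(256).max(memory_mib)
}
```

With the default memory_mib = 2048, vm_memory_mib(32, 2048) is 2048 (the two floors tie), vm_memory_mib(33, 2048) is 2112, and vm_memory_mib(126, 2048) is 8064, matching the figures quoted above.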

Timing

Attribute | Default | Description
duration_s | 12 | Workload window in seconds (ctx.duration)
watchdog_timeout_s | 5 | sched_ext watchdog override in seconds

The watchdog override is applied via scx_sched.watchdog_timeout on 7.1+ kernels and via the static scx_watchdog_timeout symbol on earlier kernels; when neither path is available the override logs a warning and the kernel default stands.

Checking thresholds

Checking attributes override the merged check set (library defaults → scheduler-level assert → per-test attributes). Checking explains the two evaluation channels; Customize Checking owns the merge rules and worked overrides. Everything here is inherited when unset.

Worker checks — evaluated from per-worker telemetry after the scenario:

Attribute | Unit | Example | Fails when
not_starved | bool | not_starved = true | any worker finishes with zero work units; also enables the spread and gap checks
isolation | bool | isolation = true | a worker ran on a CPU outside its cgroup’s cpuset
max_gap_ms | ms | max_gap_ms = 500 | a worker’s longest scheduling gap exceeds the cap
max_spread_pct | percentage points | max_spread_pct = 20.0 | max−min worker off-CPU% exceeds the cap
max_throughput_cv | coefficient of variation | max_throughput_cv = 0.35 | per-worker throughput CV exceeds the cap
min_work_rate | work units / CPU-second | min_work_rate = 1000.0 | a worker’s work rate falls below the floor
min_iteration_rate | iterations / second | min_iteration_rate = 50000.0 | a worker’s wall-clock iteration rate falls below the floor
max_migration_ratio | migrations / iteration | max_migration_ratio = 0.5 | a cgroup’s migration ratio exceeds the cap
max_p99_wake_latency_ns | ns | max_p99_wake_latency_ns = 2000000 | p99 wake latency exceeds the cap
max_wake_latency_cv | coefficient of variation | max_wake_latency_cv = 1.0 | wake-latency CV exceeds the cap
min_page_locality | fraction 0.0–1.0 | min_page_locality = 0.9 | fraction of pages on expected NUMA nodes falls below the floor
max_cross_node_migration_ratio | fraction 0.0–1.0 | max_cross_node_migration_ratio = 0.1 | NUMA-migrated pages / total pages exceeds the cap
max_slow_tier_ratio | fraction 0.0–1.0 | max_slow_tier_ratio = 0.05 | pages on memory-only (CXL) nodes exceed the cap
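Two of the units above, percentage-point spread and coefficient of variation, are worth spelling out. The sketch below is illustrative only: spread_pct and coefficient_of_variation are hypothetical helpers, and using the population standard deviation (dividing by n) is an assumption. The framework's exact estimator is not stated on this page.

```rust
/// Spread in percentage points: the largest per-worker off-CPU% minus the
/// smallest. Returns `None` for an empty slice.
fn spread_pct(off_cpu_pct: &[f64]) -> Option<f64> {
    if off_cpu_pct.is_empty() {
        return None;
    }
    let max = off_cpu_pct.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    let min = off_cpu_pct.iter().copied().fold(f64::INFINITY, f64::min);
    Some(max - min)
}

/// Coefficient of variation: standard deviation divided by the mean.
/// ASSUMPTION: population standard deviation (divide by n, not n - 1).
/// Returns `None` when the input is empty or the mean is zero.
fn coefficient_of_variation(xs: &[f64]) -> Option<f64> {
    if xs.is_empty() {
        return None;
    }
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    if mean == 0.0 {
        return None;
    }
    let variance = xs.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    Some(variance.sqrt() / mean)
}
```

For example, workers at 10%, 25%, and 18% off-CPU have a spread of 15 percentage points, which stays under max_spread_pct = 20.0. Per-worker throughputs with a mean of 5 and a standard deviation of 2 give a CV of 0.4, which would exceed max_throughput_cv = 0.35.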

Monitor thresholds — evaluated from the host monitor’s samples of guest scheduler state:

Attribute | Unit | Example | Fails when
max_imbalance_ratio | ratio | max_imbalance_ratio = 2.0 | observed run-queue imbalance exceeds the cap
max_local_dsq_depth | tasks | max_local_dsq_depth = 8 | a local DSQ grows deeper than the cap
fail_on_stall | bool | fail_on_stall | the monitor’s stall detection fails the test instead of reporting
sustained_samples | samples | sustained_samples = 3 | window size a violation must persist for before it counts
max_fallback_rate | events/s | max_fallback_rate = 5.0 | fallback-dispatch event rate exceeds the cap
max_keep_last_rate | events/s | max_keep_last_rate = 100.0 | keep-last event rate exceeds the cap
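One way to read sustained_samples is as a run-length filter over the monitor's samples: a single spike above a cap is ignored, and only a violation that persists across the whole window counts. The sketch below assumes "persist" means consecutive samples. sustained_violation is a hypothetical helper, not the monitor's actual implementation.

```rust
/// Returns true when `samples` contains at least `sustained` consecutive
/// values strictly above `cap`.
/// ASSUMPTION: the window is a run of consecutive samples; a value at or
/// below the cap resets the run.
fn sustained_violation(samples: &[f64], cap: f64, sustained: usize) -> bool {
    let needed = sustained.max(1); // a window of 0 behaves like 1
    let mut run = 0usize;
    for &s in samples {
        if s > cap {
            run += 1;
            if run >= needed {
                return true;
            }
        } else {
            run = 0;
        }
    }
    false
}
```

Under this reading, with max_imbalance_ratio = 2.0 and sustained_samples = 3, the series 1, 3, 3, 1, 3, 3 never trips the gate, while appending one more 3 does.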

What a failing gate looks like

A deliberately unreachable floor, to show the output shape — 50M iterations/s is far beyond what two workers deliver:

declare_scheduler!(MY_SCHED, { name = "my_sched", binary = "scx-ktstr" });

#[ktstr_test(scheduler = MY_SCHED, llcs = 1, cores = 2, threads = 1, duration_s = 5)]
fn throughput_gate(ctx: &Ctx) -> Result<AssertResult> {
    let checks = Assert::default_checks().min_iteration_rate(50_000_000.0);
    let steps = vec![Step {
        setup: vec![ctx.cgroup_def("cg_a"), ctx.cgroup_def("cg_b")].into(),
        ops: vec![],
        hold: HoldSpec::FULL,
    }];
    execute_steps_with(ctx, steps, Some(&checks))
}

cargo ktstr test --kernel 7.0 -- -E 'test(throughput_gate)'
  TRY 1 FAIL [  31.810s] (───) ktstr::docs_demo ktstr/throughput_gate
  stderr ───
...
    ktstr_test 'throughput_gate' [sched=scx-ktstr] [topo=1n1l2c1t] failed:
      worker 71 iteration rate 41903.3/s below floor 50000000.0/s
      worker 73 iteration rate 37834.5/s below floor 50000000.0/s

    --- stats ---
    2 workers, 4 cpus, 2 migrations, worst_spread=0.0%, worst_gap=21ms
      cg0: workers=1 cpus=2 spread=0.0% gap=10ms migrations=1 iter=209600
      cg1: workers=1 cpus=2 spread=0.0% gap=21ms migrations=1 iter=189252
...
    --- monitor ---
    samples=41 max_imbalance=2.00 max_dsq_depth=0 stuck=0
    avg: imbalance=1.32 nr_running/cpu=1.2 dsq/cpu=0.0
    events: fallback=0 (0.0/s) keep_last=210 (52.5/s) offline=0
...
    verdict: monitor OK

The failure names the worker, the measured value, and the threshold it crossed; the stats and monitor sections that follow are the context for deciding whether the threshold or the scheduler is wrong. Reading Failure Output walks the full transcript.

Expected-error matchers

Two attributes narrow which failure counts as the expected bug in an expect_err = true reproducer test (both require expect_err; both may be set, composing with AND semantics):

  • expect_scx_bpf_error_contains = "literal" — the captured scx_bpf_error text must contain the literal substring. Empty strings panic at construction.
  • expect_scx_bpf_error_matches = "regex" — the text must match the regex. Empty patterns, invalid syntax, and any pattern that matches the empty string (a?, .*, ^$) panic at construction, so a vacuous matcher can never silently pass. ^/$ anchor to the whole string by default (use (?m) for line anchors), and a bare \b slips the vacuity gate — prefer a substring.

See Investigate a Crash for the pin-an-error-as-a-regression-test workflow these serve.

Topology constraints

These filter which gauntlet presets a test expands into; the base ktstr/ variant is unaffected.

Attribute | Default | Description
min_llcs / max_llcs | 1 / 12 | LLC-count bounds
min_numa_nodes / max_numa_nodes | 1 / 1 | NUMA-node bounds; multi-NUMA presets are opt-in
min_cpus / max_cpus | 1 / 192 | Total-CPU bounds
requires_smt | false | Only SMT (threads > 1) presets; skips the test entirely on aarch64, which ships no SMT presets

The gauntlet skips presets that fail any bound. See Gauntlet for the preset table, filtering rules, and a worked expansion.
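The bound check amounts to a conjunction of range tests. The following sketch uses hypothetical Preset and Constraints types (not the ktstr definitions) with the defaults from the table above. The real preset table and filtering rules are on the Gauntlet page.

```rust
/// Hypothetical gauntlet preset, reduced to the fields the bounds inspect.
#[derive(Debug, Clone, Copy)]
struct Preset {
    numa_nodes: u32,
    llcs: u32,
    cpus: u32,
    smt: bool, // true when threads per core > 1
}

/// Hypothetical mirror of the topology-constraint attributes.
#[derive(Debug, Clone, Copy)]
struct Constraints {
    min_llcs: u32,
    max_llcs: u32,
    min_numa_nodes: u32,
    max_numa_nodes: u32,
    min_cpus: u32,
    max_cpus: u32,
    requires_smt: bool,
}

impl Default for Constraints {
    /// Defaults as listed in the table above.
    fn default() -> Self {
        Constraints {
            min_llcs: 1,
            max_llcs: 12,
            min_numa_nodes: 1,
            max_numa_nodes: 1,
            min_cpus: 1,
            max_cpus: 192,
            requires_smt: false,
        }
    }
}

impl Constraints {
    /// A preset is kept only if it satisfies every bound.
    fn admits(&self, p: &Preset) -> bool {
        (self.min_llcs..=self.max_llcs).contains(&p.llcs)
            && (self.min_numa_nodes..=self.max_numa_nodes).contains(&p.numa_nodes)
            && (self.min_cpus..=self.max_cpus).contains(&p.cpus)
            && (!self.requires_smt || p.smt)
    }
}
```

With the defaults, a two-node preset is filtered out until the test sets max_numa_nodes = 2, which is the sense in which multi-NUMA presets are opt-in.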

Execution attributes

auto_repro / expect_auto_repro

On scheduler crash, auto_repro (default true) boots a second VM with probes attached to capture the state on the way to the crash — see Auto-Repro. Set auto_repro = false for faster iteration; expect_err = true also disables it. expect_auto_repro = true inverts the assertion: the test fails unless the auto-repro path actually fired (used to pin the repro machinery itself). It requires a scheduler and wprof, and is rejected alongside auto_repro = false, expect_err, or host_only.

expect_err / survives_storm / allow_inconclusive

expect_err = true asserts the run returns Err — the negative test: a scheduler crash or scenario failure is the expected outcome, and a clean pass fails the test. survives_storm = true is the positive inverse: the scx scheduler must stay attached and alive through every hold; a death or ejection fails with a survival-specific explainer. It requires a scheduler and is mutually exclusive with expect_err and expect_auto_repro. allow_inconclusive = true lets an Inconclusive verdict pass instead of exiting 2 — see Checking for when Inconclusive arises.

performance_mode / no_perf_mode / cpu_budget

performance_mode = true pins vCPUs to reserved host cores with hugepages, NUMA binding, and RT scheduling, for runs whose numbers must be comparable — see Performance Mode. no_perf_mode = true goes the other way: build the VM with the declared topology even on a smaller host, skipping all pinning. The two are mutually exclusive (rejected at compile time). cpu_budget = N overrides the auto-derived host-CPU mask size in no-perf mode (must be > 0, requires no_perf_mode; an explicit --cpu-cap / KTSTR_CPU_CAP still wins) — see Resource Budget.

kaslr

Default true: the guest boots with KASLR enabled, so tests run against the memory layout real systems have. kaslr = false appends nokaslr to the guest command line for the rare workflow that needs stable kernel addresses.

host_only

Run the function directly on the host — no VM. For tests that need host tools (cargo, nested VMs) unavailable in the guest initramfs. Mutually exclusive with scheduler, num_snapshots > 0, auto_repro = true, disk, and networks — everything that requires a VM to exist.

ignore

Emits #[ignore] on the generated test, so it is skipped by default and runs only under nextest’s --run-ignored. Use for slow-by-design tests that should not gate every local run.

pci / disk / networks

disk = CONST attaches a virtio-blk device from a const DiskConfig (built from DiskConfig::DEFAULT with chained setters); the framework owns the backing file. networks = [CONST, …] attaches one virtio-net device per const NetConfig (aarch64 supports at most one). On x86_64 either device auto-enables the virtio-PCI transport; pci = true forces the PCI host bridge on without a device (default: no bridge, pci=off on the guest command line).

bpf_map_write

bpf_map_write = CONST (or [A, B]) writes a u32 into a named scheduler BPF-map field from the host, once, after the scheduler loads — pre-seeding guest state before workers start. See Snapshots — composing reads with writes.

watch_bpf_maps

watch_bpf_maps = CONST (or [A, B]) samples named scheduler BPF-map fields observer-effect-free during the run and surfaces each as a run-level metric. Full semantics in Watching BPF-map fields below.

perf_delta_assertions

perf_delta_assertions = CONST (or [A, B]) declares per-test performance-regression gates. Inert in a normal cargo ktstr test run — enforced only under cargo ktstr perf-delta --noise-adjust. Requires performance_mode (rejected at compile time otherwise, since unpinned numbers would make the gate misfire). See Assertable Metrics.

num_snapshots

num_snapshots = N fires N periodic BPF-state captures inside the workload’s 10%–90% window. 0 (default) disables periodic capture. Validated against the 64-capture bridge cap, host_only, and a 100 ms minimum boundary spacing. See Periodic Capture.

post_vm / post_vm_unconditional

Host-side callbacks (fn(&VmResult) -> anyhow::Result<()>) invoked after the VM exits — the place to drain snapshot bridges, read run metrics, and run temporal assertions. post_vm is suppressed when the guest reported failure; post_vm_unconditional always runs (guard it with if !result.success { return Ok(()); } when it reads state a crash may not have produced, and note it never turns a guest failure into a pass). When num_snapshots > 0 and post_vm is omitted, the macro installs a default callback asserting at least one periodic capture landed with real BPF state.

cleanup_budget_ms

Caps host-side VM teardown wall time; exceeding the budget folds a failing detail into the test result. Unset disables the check; 0 is rejected.

staged_schedulers

staged_schedulers = [PATH, …] packs additional &'static Scheduler binaries into the guest at boot. Required for scenarios that invoke Op::ReplaceScheduler / Op::AttachScheduler — the swap target must already be on disk in the guest. See Ops.

workload_root_cgroup

workload_root_cgroup = "/path" places the per-test workload cgroups under a specific guest cgroup path, decoupled from the scheduler’s cgroup_parent (which roots scheduler-side cells).

wprof / wprof_args

wprof = true attaches the wprof BPF tracer to the workload VM; wprof_args = "..." passes space-separated CLI args. Both require the wprof cargo feature.

payload / workloads / extra_include_files / extra_sched_args

payload = CONST declares the test’s primary benchmark binary and workloads = [A, B] composes more alongside it; the include-file pipeline packs every referenced binary into the guest. See Payloads and Included Files. extra_include_files = ["path", …] adds test-level host files that belong to no particular payload. extra_sched_args = ["--flag", …] appends scheduler CLI args after the scheduler’s own sched_args.

config

Inline scheduler config content, paired with a scheduler that declares config_file_def — covered next.

Inline scheduler config

Some schedulers (e.g. scx_layered, scx_lavd) accept a JSON config file via a CLI argument like --config /path/to/config.json. Two pieces wire this into a test:

  1. Scheduler declaration — declares the arg template and the guest path via config_file_def:

    const LAYERED_SCHED: Scheduler = Scheduler::named("layered")
        .binary(SchedulerSpec::Discover("scx_layered"))
        .config_file_def("--config {file}", "/include-files/layered.json");

    {file} in the arg template is replaced with the guest path. The framework writes the config content to that path inside the guest before the scheduler binary starts.

  2. Test attribute — supplies the inline content:

    const LAYERED_CONFIG: &str = r#"{ "layers": [...] }"#;
    
    #[ktstr_test(scheduler = LAYERED_SCHED, config = LAYERED_CONFIG)]
    fn layered_test(ctx: &Ctx) -> Result<AssertResult> {
        Ok(AssertResult::pass())
    }

    Both a string literal and a path to a const &'static str are accepted.
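The {file} substitution described in step 1 can be sketched as a plain string replacement. render_config_args is a hypothetical helper, and splitting the rendered template on whitespace into separate arguments is an assumption about how the framework builds the scheduler's argv.

```rust
/// Replace the `{file}` placeholder in an arg template with the guest path
/// and split the result into individual CLI arguments.
/// ASSUMPTION: arguments are separated by whitespace in the template.
fn render_config_args(template: &str, guest_path: &str) -> Vec<String> {
    template
        .replace("{file}", guest_path)
        .split_whitespace()
        .map(String::from)
        .collect()
}
```

Under that assumption, the LAYERED_SCHED declaration above would launch scx_layered with the two arguments --config and /include-files/layered.json.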

The pairing gate is bidirectional and enforced at compile time (and again at runtime for programmatic entry construction): a scheduler with config_file_def requires config = … on every test, and a scheduler without it rejects config = … — the content would otherwise be silently dropped.

For schedulers that take the same config file on every test, use Scheduler::config_file(host_path) instead — see Scheduler Definitions.

Watching BPF-map fields

watch_bpf_maps turns “the scheduler computed X” into a post-VM assertion. The free-running host monitor reads the named field from the running guest’s BPF-map memory via BTF — without freezing vCPUs — and folds the samples into a run-level metric.

Each declared const is one WatchBpfMap::new(map_name_suffix, field, agg, label):

  • map_name_suffix — matched against a loaded BPF map by ends_with (".bss" for a section global, or a named map like "cpu_ctx_stor").
  • field — a dot-path into the map’s value type ("sys_stat.avg_lat_cri", or a bare global like "lat_headroom").
  • agg — pick by the field’s semantic class:
    • BpfMapAgg::Scalar — a gauge; folded as the mean over the run’s samples.
    • BpfMapAgg::ScalarCounter — a monotonic counter; folded as the value at the last sample (the final total, not a mean of a rising series).
    • BpfMapAgg::PerCpu — a per-CPU gauge array; folded into a cross-CPU mean and max.
    • BpfMapAgg::PerCpuCounter — a per-CPU counter array; folded as the cross-CPU sum at the last sample. Watch at u64 width so no per-CPU slot truncates before the sum.
  • label — the metric-key leaf; must be unique within a test.

The metric key is <scheduler-obj>_<label> (per-CPU gauges get _avg / _max variants). The prefix is libbpf’s object name from the scheduler’s global-section map, which can differ from the ops name — scx-ktstr’s object is bpf_bpf, so its prefix is bpf_bpf. Read the metric back with VmResult::run_metric in a post_vm hook; an absent metric returns None, never a false 0.0.
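The four aggregation classes and the key naming can be summarized in a short sketch. Agg and fold are hypothetical stand-ins, not the ktstr API. Samples are held as f64 for brevity, whereas real counters are read at integer width. Taking the per-CPU gauge mean and max over every sample and CPU of the run is an assumption about how the cross-CPU values are combined over time.

```rust
/// Hypothetical mirror of the four aggregation classes (not the ktstr enum).
#[derive(Debug, Clone, Copy)]
enum Agg {
    Scalar,
    ScalarCounter,
    PerCpu,
    PerCpuCounter,
}

/// Fold monitor samples into `(metric_key, value)` pairs. Each sample holds
/// one value for scalar classes and one slot per CPU for per-CPU classes.
/// No samples means no metric: the result is empty rather than a false 0.0.
fn fold(obj: &str, label: &str, agg: Agg, samples: &[Vec<f64>]) -> Vec<(String, f64)> {
    let key = format!("{obj}_{label}");
    let Some(last) = samples.last() else {
        return Vec::new();
    };
    // Every value seen during the run, across all samples and CPUs.
    let all: Vec<f64> = samples.iter().flatten().copied().collect();
    if all.is_empty() {
        return Vec::new();
    }
    let mean = all.iter().sum::<f64>() / all.len() as f64;
    match agg {
        // Gauge: mean over the run's samples.
        Agg::Scalar => vec![(key, mean)],
        // Monotonic counter: value at the last sample, not a mean.
        Agg::ScalarCounter => last.first().map(|v| vec![(key, *v)]).unwrap_or_default(),
        // Per-CPU gauge: `_avg` and `_max` variants (assumed over all samples).
        Agg::PerCpu => {
            let max = all.iter().copied().fold(f64::NEG_INFINITY, f64::max);
            vec![(format!("{key}_avg"), mean), (format!("{key}_max"), max)]
        }
        // Per-CPU counter: cross-CPU sum at the last sample.
        Agg::PerCpuCounter => vec![(key, last.iter().sum::<f64>())],
    }
}
```

For instance, a per-CPU gauge labeled lat_headroom under the object prefix scx_lavd yields the keys scx_lavd_lat_headroom_avg and scx_lavd_lat_headroom_max in this sketch, matching the key shape described above.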

const AVG_LAT_CRI: WatchBpfMap =
    WatchBpfMap::new(".bss", "sys_stat.avg_lat_cri", BpfMapAgg::Scalar, "avg_lat_cri");
const LAT_HEADROOM: WatchBpfMap =
    WatchBpfMap::new("cpu_ctx_stor", "lat_headroom", BpfMapAgg::PerCpu, "lat_headroom");

fn check(result: &VmResult) -> anyhow::Result<()> {
    let avg_lat_cri = result.run_metric("scx_lavd_avg_lat_cri")
        .ok_or_else(|| anyhow::anyhow!("avg_lat_cri absent"))?;
    let headroom_max = result.run_metric("scx_lavd_lat_headroom_max")
        .ok_or_else(|| anyhow::anyhow!("lat_headroom_max absent"))?;
    anyhow::ensure!(avg_lat_cri.is_finite() && headroom_max.is_finite());
    Ok(())
}

#[ktstr_test(
    scheduler = SCX_LAVD,
    watch_bpf_maps = [AVG_LAT_CRI, LAT_HEADROOM],
    post_vm = check,
)]
fn lat_metrics_surface(ctx: &Ctx) -> anyhow::Result<AssertResult> { /* workload */ }

Resolution is lazy: the maps appear only after the scheduler attaches, so the monitor retries until the named map is present, then caches the resolved offset and re-reads only the leaf bytes each tick.

What the macro generates

The macro renames the function, registers it in the KTSTR_TESTS distributed slice, and emits a #[test] wrapper that boots the VM and dispatches. Details in the attribute rustdoc.