Checking
ktstr judges scheduler behavior through two channels: worker-side telemetry (every worker process reports what happened to it) and host-side monitoring (the monitor reads guest kernel state from outside). Both channels always measure; nothing asserts until the test opts in — a test with no checking attributes passes as long as the VM boots and the scenario completes.
Which API to reach for:
- #[ktstr_test] attributes: cover most tests. Every threshold below (not_starved, max_gap_ms, max_spread_pct, min_iteration_rate, and the rest) has an attribute; see the macro reference.
- Verdict + claim!: labeled assertions on values you compute inside a custom scenario body.
- AbsoluteThresholds: a one-call multi-field bound check against collected reports, bypassing the config merge.
- assert_scx_events_clean: bounds on SCX event counters (“no fallbacks fired”).
Worker checks
After each scenario, ktstr collects a WorkerReport from every worker and runs the opted-in checks against them:
- Starvation (not_starved): any worker with zero work units fails with “tid N starved (0 work units)”.
- Scheduling gaps (max_gap_ms): the longest wall-clock gap observed at work-unit checkpoints. A violation renders as “tid N stuck Xms on cpuY at +Zms (threshold Nms)”.
- Fairness (max_spread_pct): workers in one cgroup should get similar CPU time; the spread (max off-CPU% − min off-CPU%) must stay below the bound.
- Cpuset isolation (isolation): workers may only run on CPUs in their assigned cpuset; any excursion fails.
- Throughput: max_throughput_cv bounds the coefficient of variation of per-worker work rate (catches some workers being quietly slower); min_work_rate sets an absolute floor (catches all workers being equally slow).
- Benchmarking: max_p99_wake_latency_ns and max_wake_latency_cv bound wake-to-run latency for work types that block and measure it (see Work Types for which do); min_iteration_rate floors outer-loop iterations per second per worker.
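The fairness and throughput checks above reduce to two small statistics. The sketch below shows one way to compute them; the WorkerSample struct and both function names are hypothetical stand-ins, not ktstr's WorkerReport API, and the use of the population standard deviation for the coefficient of variation is an assumption.

```rust
/// Hypothetical per-worker summary. Field names are illustrative only and
/// are not taken from ktstr's WorkerReport.
struct WorkerSample {
    off_cpu_pct: f64,
    work_units: u64,
    elapsed_secs: f64,
}

/// Fairness spread: max off-CPU% minus min off-CPU% across the workers.
/// Returns None when there are no workers to compare.
fn spread_pct(workers: &[WorkerSample]) -> Option<f64> {
    if workers.is_empty() {
        return None;
    }
    let max = workers.iter().map(|w| w.off_cpu_pct).fold(f64::NEG_INFINITY, f64::max);
    let min = workers.iter().map(|w| w.off_cpu_pct).fold(f64::INFINITY, f64::min);
    Some(max - min)
}

/// Coefficient of variation of per-worker work rate:
/// population standard deviation divided by the mean (assumed definition).
/// Returns None when no rate can be computed or the mean rate is zero.
fn throughput_cv(workers: &[WorkerSample]) -> Option<f64> {
    let rates: Vec<f64> = workers
        .iter()
        .filter(|w| w.elapsed_secs > 0.0)
        .map(|w| w.work_units as f64 / w.elapsed_secs)
        .collect();
    if rates.is_empty() {
        return None;
    }
    let n = rates.len() as f64;
    let mean = rates.iter().sum::<f64>() / n;
    if mean == 0.0 {
        return None;
    }
    let variance = rates.iter().map(|r| (r - mean).powi(2)).sum::<f64>() / n;
    Some(variance.sqrt() / mean)
}
```

Two workers at 10% and 25% off-CPU give a spread of 15 points; two workers at 100 and 300 work units per second give a CV of 0.5.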
The loop, end to end
A test sets a threshold, the run violates it, and the failure output names the check, the value, and the bound:
```rust
#[ktstr_test(
    scheduler = MY_SCHED,
    llcs = 1, cores = 2, threads = 1,
    min_iteration_rate = 50_000_000.0, // deliberately unreachable floor
)]
fn throughput_gate(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        CgroupDef::named("cg_a").workers(1).cpuset(CpusetSpec::disjoint(0, 2)),
        CgroupDef::named("cg_b").workers(1).cpuset(CpusetSpec::disjoint(1, 2)),
    ])
}
```
```text
ktstr_test 'throughput_gate' [sched=scx-ktstr] [topo=1n1l2c1t] failed:
  worker 71 iteration rate 41903.3/s below floor 50000000.0/s
  worker 73 iteration rate 37834.5/s below floor 50000000.0/s
--- stats ---
2 workers, 4 cpus, 2 migrations, worst_spread=0.0%, worst_gap=21ms
  cg0: workers=1 cpus=2 spread=0.0% gap=10ms migrations=1 iter=209600
  cg1: workers=1 cpus=2 spread=0.0% gap=21ms migrations=1 iter=189252
...
--- monitor ---
samples=41 max_imbalance=2.00 max_dsq_depth=0 stuck=0
  avg: imbalance=1.32 nr_running/cpu=1.2 dsq/cpu=0.0
  events: fallback=0 (0.0/s) keep_last=210 (52.5/s) offline=0
...
verdict: monitor OK
```
Both channels report: the worker check that tripped, and the monitor verdict that did not. The full failure anatomy — timeline, scheduler log, dump sections — is in Reading Failure Output.
Monitor checks
The host-side monitor samples guest per-CPU runqueue state (via BTF offsets, no guest instrumentation) roughly every 100ms and evaluates:
- Imbalance ratio: max(nr_running) / max(1, min(nr_running)) across CPUs.
- Local DSQ depth: per-CPU dispatch queue depth.
- Stall detection: rq_clock not advancing on a CPU with runnable tasks; idle CPUs and preempted vCPUs are exempt.
- Event rates: select_cpu_fallback and dispatch_keep_last counters per second.
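As an illustration of the imbalance formula in the list above, here is a minimal sketch for a single sample. The function name and the slice-of-counts input are assumptions made for the example; the real monitor reads these values from guest memory.

```rust
/// Imbalance ratio for one sample: max(nr_running) / max(1, min(nr_running)).
/// `nr_running` holds one runnable-task count per CPU (hypothetical input shape).
/// Returns None when the sample contains no CPUs.
fn imbalance_ratio(nr_running: &[u32]) -> Option<f64> {
    let max = *nr_running.iter().max()?;
    let min = *nr_running.iter().min()?;
    // Clamp the denominator to 1 so a CPU with zero runnable tasks
    // does not cause a division by zero.
    Some(max as f64 / min.max(1) as f64)
}
```

With counts [4, 1, 2] the ratio is 4.0; an all-idle sample [0, 0, 0] yields 0.0 rather than a division error.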
Monitor violations always land in the failure report’s --- monitor --- section, but they flip the test result only when the test
enforces them — set the corresponding attributes, call
.with_monitor_defaults() on an Assert, or set
enforce_monitor_thresholds. A monitor that produced no usable
signal (empty samples, uninitialized guest memory) reports
inconclusive, never a silent pass — a CI gate can always tell
“verified OK” from “never measured”.
The defaults with_monitor_defaults() applies:
| Threshold | Default | Rationale |
|---|---|---|
| max_imbalance_ratio | 4.0 | max(nr_running) / max(1, min(nr_running)) across CPUs (denominator clamped so an all-idle sample does not divide by zero). Lower values (2-3) false-positive during cpuset transitions. |
| max_local_dsq_depth | 50 | Per-CPU dispatch queue overflow. Sustained depth above this means the scheduler is not consuming dispatched tasks. |
| fail_on_stall | true | Fail when rq_clock does not advance on a CPU with runnable tasks. Idle CPUs (NOHZ) and preempted vCPUs are exempt. |
| sustained_samples | 5 | At ~100ms sample interval, requires ~500ms of sustained violation. Filters transient spikes from cpuset reconfiguration. |
| max_fallback_rate | 200.0/s | select_cpu_fallback events per second across all CPUs. Sustained rate indicates systematic select_cpu failure. |
| max_keep_last_rate | 100.0/s | dispatch_keep_last events per second across all CPUs. Sustained rate indicates dispatch starvation. |
Every monitor threshold uses the sustained_samples window — a
violation must persist for N consecutive samples before it counts.
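The sustained-window rule can be sketched as a run-length check over per-sample violation flags. Both function names below are hypothetical, and treating a window of 0 as a window of 1 is an assumption of this sketch rather than documented ktstr behavior.

```rust
/// Returns true once `violations` contains at least `window` consecutive
/// `true` samples. A window of 0 is treated as 1 (assumption for this sketch).
fn sustained(violations: &[bool], window: usize) -> bool {
    let window = window.max(1);
    let mut run = 0usize;
    for &violated in violations {
        // A single clean sample resets the run.
        run = if violated { run + 1 } else { 0 };
        if run >= window {
            return true;
        }
    }
    false
}

/// Example use: per-CPU dispatch queue depth samples against a depth bound.
fn dsq_depth_fails(depths: &[u32], max_depth: u32, window: usize) -> bool {
    let over: Vec<bool> = depths.iter().map(|&d| d > max_depth).collect();
    sustained(&over, window)
}
```

With the default window of 5, four bad samples followed by one good sample do not count, no matter how many more short bursts follow.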
NUMA checks
For workers with a MemPolicy, three thresholds gate page placement:
- min_page_locality: minimum fraction of pages on the expected NUMA nodes (the cgroup’s cpuset nodes, derived at evaluation time). Zero observed pages counts as zero locality, not a vacuous pass.
- max_cross_node_migration_ratio: bound on migrated pages relative to allocated pages (from /proc/vmstat deltas).
- max_slow_tier_ratio: bound on the fraction of pages landing on memory-only (CXL-tier) nodes.
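The page-locality rule, including its zero-pages case, can be sketched as follows. The function name and the map-of-page-counts input are hypothetical; how ktstr actually gathers per-node page counts is not shown here.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Fraction of pages resident on the expected NUMA nodes.
/// `pages_per_node` maps node id to observed page count (hypothetical shape).
/// Zero observed pages yields 0.0, so a minimum-locality bound cannot
/// pass vacuously.
fn page_locality(pages_per_node: &BTreeMap<u32, u64>, expected: &BTreeSet<u32>) -> f64 {
    let total: u64 = pages_per_node.values().sum();
    if total == 0 {
        return 0.0;
    }
    let local: u64 = pages_per_node
        .iter()
        .filter(|(node, _)| expected.contains(*node))
        .map(|(_, count)| *count)
        .sum();
    local as f64 / total as f64
}
```

A check such as min_page_locality = 0.9 would then compare this fraction against the bound; with 75 pages on the expected node and 25 elsewhere, the fraction is 0.75.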
Default thresholds
not_starved = true also enables the built-in fairness and gap
checks at these defaults:
| Check | Release | Debug |
|---|---|---|
| Scheduling gap | 2000 ms | 3000 ms |
| Fairness spread | 15% | 35% |
Debug builds run with higher scheduling overhead, so thresholds are relaxed.
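One common way to express build-dependent defaults in Rust is the cfg!(debug_assertions) macro. The sketch below mirrors the table above under that assumption; whether ktstr keys its defaults off debug_assertions is not stated here, and the struct and function names are hypothetical.

```rust
/// Hypothetical container for the two built-in defaults from the table.
struct DefaultThresholds {
    max_gap_ms: u64,
    max_spread_pct: f64,
}

/// Picks the relaxed values when compiled with debug assertions
/// (assumed to correspond to "Debug" in the table), the strict ones otherwise.
fn default_thresholds() -> DefaultThresholds {
    if cfg!(debug_assertions) {
        DefaultThresholds { max_gap_ms: 3000, max_spread_pct: 35.0 }
    } else {
        DefaultThresholds { max_gap_ms: 2000, max_spread_pct: 15.0 }
    }
}
```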
How configuration merges
Assert is the threshold-config struct; every field is an Option
where None means “inherit”. Three layers merge, last-Some wins:
the baseline (all None), then the scheduler’s assert, then the
per-test attributes — so a scheduler-wide bound applies to every
test and any single test can override or disable it.
enforce_monitor_thresholds is the one sticky field: once any layer
sets it, it stays set. Worked override recipes live in
Customize Checking.
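The layering rule can be sketched with plain Option fields. The Thresholds struct below is a hypothetical three-field stand-in for Assert, and modeling the sticky field as a bool is an assumption; only the last-Some-wins and sticky behaviors are taken from the text above.

```rust
/// Hypothetical stand-in for Assert with two ordinary fields and one sticky one.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
struct Thresholds {
    max_gap_ms: Option<u64>,
    max_spread_pct: Option<f64>,
    /// Sticky: once any layer sets it, later layers cannot clear it.
    enforce_monitor: bool,
}

impl Thresholds {
    /// Applies `over` on top of `self`. For Option fields the later layer
    /// wins when it is Some; None means "inherit" from the earlier layer.
    fn merge(self, over: Thresholds) -> Thresholds {
        Thresholds {
            max_gap_ms: over.max_gap_ms.or(self.max_gap_ms),
            max_spread_pct: over.max_spread_pct.or(self.max_spread_pct),
            enforce_monitor: self.enforce_monitor || over.enforce_monitor,
        }
    }
}
```

Merging baseline, then scheduler, then test is `baseline.merge(scheduler).merge(test)`: a test-level max_gap_ms replaces the scheduler's, an unset max_spread_pct falls through to the scheduler's value, and enforce_monitor stays true if the scheduler layer set it.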
execute_steps_with(ctx, steps, Some(&assert)) bypasses the merged
config with an explicit Assert for that scenario’s worker checks.
Verdicts and outcomes
Every assertion produces one of four outcomes, and a result’s
terminal verdict is the fold over all of them, most severe first:
Fail > Inconclusive > Pass > Skip.
| Outcome | Meaning |
|---|---|
| Pass | the assertion ran and the value satisfied the bound |
| Fail | the assertion ran and the value violated the bound |
| Inconclusive | the assertion ran but had no signal to evaluate |
| Skip | the scenario couldn’t run (unmet precondition) |
Inconclusive exists for instrument-derived denominators — a ratio
whose denominator (iterations, samples, wall-clock interval)
legitimately reached zero because the workload produced no signal.
Policy-derived denominators stay Fail on zero: under
MemPolicy::Bind the policy says pages will exist, so their absence
is a defect, not “couldn’t measure”.
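The distinction between the two kinds of zero denominator can be sketched as a ratio check that tags where its denominator came from. Every name below is hypothetical; the sketch only illustrates the rule stated above and is not ktstr's implementation.

```rust
#[derive(Debug, PartialEq)]
enum Outcome {
    Pass,
    Fail,
    Inconclusive,
}

/// Where a ratio's denominator came from (hypothetical classification).
enum Denominator {
    /// Measured by the instrument: zero means the workload produced no signal.
    Instrument(u64),
    /// Promised by policy: zero means the promised thing is missing.
    Policy(u64),
}

/// Checks numerator / denominator <= max_ratio, treating a zero denominator
/// according to its origin.
fn check_ratio_at_most(numerator: u64, denom: Denominator, max_ratio: f64) -> Outcome {
    let d = match denom {
        Denominator::Instrument(0) => return Outcome::Inconclusive,
        Denominator::Policy(0) => return Outcome::Fail,
        Denominator::Instrument(d) | Denominator::Policy(d) => d,
    };
    if numerator as f64 / d as f64 <= max_ratio {
        Outcome::Pass
    } else {
        Outcome::Fail
    }
}
```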
CI gates read the verdict through four accessors:
```rust
if r.is_pass() { /* ship */ }
if r.is_fail() { /* block; surface r.failure_details() */ }
if r.is_skip() || r.is_inconclusive() { /* no verdict — triage */ }
```
is_pass() is deliberately strict: inconclusive and all-skip both
read false.
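The severity fold and the strict is_pass() reading can be sketched with an enum whose declaration order encodes severity. This is an illustrative model, not ktstr's types; in particular, returning Skip for an empty set of outcomes is an assumption.

```rust
/// Outcomes declared in ascending severity, so the derived Ord gives
/// Fail > Inconclusive > Pass > Skip.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum Outcome {
    Skip,
    Pass,
    Inconclusive,
    Fail,
}

/// Terminal verdict: the most severe outcome seen.
/// An empty slice folds to Skip (assumption for this sketch).
fn terminal_verdict(outcomes: &[Outcome]) -> Outcome {
    outcomes.iter().copied().max().unwrap_or(Outcome::Skip)
}

/// Strict pass: true only when the fold lands exactly on Pass, so
/// inconclusive and all-skip results both read false.
fn is_pass(outcomes: &[Outcome]) -> bool {
    terminal_verdict(outcomes) == Outcome::Pass
}
```

Under this model a single Inconclusive among many Pass outcomes is enough to keep is_pass() false, which is what lets a CI gate separate "verified OK" from "never measured".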
Beyond attributes
- Verdict + claim!: the claim accumulator for custom scenario bodies. Labels come from the code itself (stringify!-derived), so they cannot drift from the value they describe:

  ```rust
  let mut v = Assert::default_checks().verdict();
  stats.claim_max_gap_ms(&mut v).at_most(100);
  claim!(v, iter_delta).at_least(1000);
  let result = v.into_result();
  ```
- AbsoluteThresholds: flat per-run bounds (max_p99_wake_latency_ns, max_iteration_cost_p99_ns, max_migrations, min_work_units) checked in one call: assert_thresholds(&reports, &AbsoluteThresholds::strict()). Empty report slices return a skip rather than a vacuous pass.
- assert_scx_events_clean(events, bound): SCX event counters under a cap (None = exactly zero); negative counts always fail.
- Composition: AssertResult::merge accumulates results in a loop; all_of / any_of fold sibling results as AND / OR.
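The cap semantics described for SCX event counters can be modeled in a few lines. The function below is a hypothetical stand-in that returns a bool; it is not the signature of assert_scx_events_clean, and applying the cap to each counter individually (rather than to a total) is an assumption.

```rust
/// Returns true when every counter is within the cap.
/// `bound = None` means the cap is exactly zero; negative counts always fail.
fn events_within_cap(counts: &[i64], bound: Option<i64>) -> bool {
    let cap = bound.unwrap_or(0);
    counts.iter().all(|&c| c >= 0 && c <= cap)
}
```

With no bound, any nonzero counter fails; with Some(5), counters from 0 through 5 pass and a negative counter still fails.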
Signatures, comparators, and construction details are in the
ktstr::assert rustdoc.
For phase-scoped checks over a stepped scenario, see
Phases and
Temporal Assertions.