Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Checking

ktstr judges scheduler behavior through two channels: worker-side telemetry (every worker process reports what happened to it) and host-side monitoring (the monitor reads guest kernel state from outside). Both channels always measure; nothing asserts until the test opts in — a test with no checking attributes passes as long as the VM boots and the scenario completes.

Which API to reach for:

  • #[ktstr_test] attributes — cover most tests: not_starved, max_gap_ms, max_spread_pct, min_iteration_rate, and every other threshold below has an attribute (see the macro reference).
  • Verdict + claim! — labeled assertions on values you compute inside a custom scenario body.
  • AbsoluteThresholds — a one-call multi-field bound check against collected reports, bypassing the config merge.
  • assert_scx_events_clean — bounds on SCX event counters (“no fallbacks fired”).

Worker checks

After each scenario, ktstr collects a WorkerReport from every worker and runs the opted-in checks against them:

  • Starvation (not_starved) — any worker with zero work units fails: tid N starved (0 work units).
  • Scheduling gaps (max_gap_ms) — the longest wall-clock gap observed at work-unit checkpoints. A violation renders as tid N stuck Xms on cpuY at +Zms (threshold Nms).
  • Fairness (max_spread_pct) — workers in one cgroup should get similar CPU time; the spread (max off-CPU% − min off-CPU%) must stay below the bound.
  • Cpuset isolation (isolation) — workers may only run on CPUs in their assigned cpuset; any excursion fails.
  • Throughputmax_throughput_cv bounds the coefficient of variation of per-worker work rate (some workers quietly slower); min_work_rate sets an absolute floor (all workers equally slow).
  • Benchmarkingmax_p99_wake_latency_ns and max_wake_latency_cv bound wake-to-run latency for work types that block and measure it (see Work Types for which do); min_iteration_rate floors outer-loop iterations per second per worker.

The loop, end to end

A test sets a threshold, the run violates it, the failure output names the check, the value, and the bound:

#[ktstr_test(
    scheduler = MY_SCHED,
    llcs = 1, cores = 2, threads = 1,
    min_iteration_rate = 50_000_000.0,  // deliberately unreachable floor
)]
fn throughput_gate(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        CgroupDef::named("cg_a").workers(1).cpuset(CpusetSpec::disjoint(0, 2)),
        CgroupDef::named("cg_b").workers(1).cpuset(CpusetSpec::disjoint(1, 2)),
    ])
}
cargo ktstr test --kernel 7.0
ktstr_test 'throughput_gate' [sched=scx-ktstr] [topo=1n1l2c1t] failed:
  worker 71 iteration rate 41903.3/s below floor 50000000.0/s
  worker 73 iteration rate 37834.5/s below floor 50000000.0/s

--- stats ---
2 workers, 4 cpus, 2 migrations, worst_spread=0.0%, worst_gap=21ms
  cg0: workers=1 cpus=2 spread=0.0% gap=10ms migrations=1 iter=209600
  cg1: workers=1 cpus=2 spread=0.0% gap=21ms migrations=1 iter=189252
...
--- monitor ---
samples=41 max_imbalance=2.00 max_dsq_depth=0 stuck=0
avg: imbalance=1.32 nr_running/cpu=1.2 dsq/cpu=0.0
events: fallback=0 (0.0/s) keep_last=210 (52.5/s) offline=0
...
verdict: monitor OK

Both channels report: the worker check that tripped, and the monitor verdict that did not. The full failure anatomy — timeline, scheduler log, dump sections — is in Reading Failure Output.

Monitor checks

The host-side monitor samples guest per-CPU runqueue state (via BTF offsets, no guest instrumentation) roughly every 100ms and evaluates:

  • Imbalance ratiomax(nr_running) / max(1, min(nr_running)) across CPUs.
  • Local DSQ depth — per-CPU dispatch queue depth.
  • Stall detectionrq_clock not advancing on a CPU with runnable tasks; idle CPUs and preempted vCPUs are exempt.
  • Event ratesselect_cpu_fallback and dispatch_keep_last counters per second.

Monitor violations always land in the failure report’s --- monitor --- section, but they flip the test result only when the test enforces them — set the corresponding attributes, call .with_monitor_defaults() on an Assert, or set enforce_monitor_thresholds. A monitor that produced no usable signal (empty samples, uninitialized guest memory) reports inconclusive, never a silent pass — a CI gate can always tell “verified OK” from “never measured”.

The defaults with_monitor_defaults() applies:

ThresholdDefaultRationale
max_imbalance_ratio4.0max(nr_running) / max(1, min(nr_running)) across CPUs (denominator clamped so an all-idle sample does not divide by zero). Lower values (2-3) false-positive during cpuset transitions.
max_local_dsq_depth50Per-CPU dispatch queue overflow. Sustained depth above this means the scheduler is not consuming dispatched tasks.
fail_on_stalltrueFail when rq_clock does not advance on a CPU with runnable tasks. Idle CPUs (NOHZ) and preempted vCPUs are exempt.
sustained_samples5At ~100ms sample interval, requires ~500ms of sustained violation. Filters transient spikes from cpuset reconfiguration.
max_fallback_rate200.0/sselect_cpu_fallback events per second across all CPUs. Sustained rate indicates systematic select_cpu failure.
max_keep_last_rate100.0/sdispatch_keep_last events per second across all CPUs. Sustained rate indicates dispatch starvation.

Every monitor threshold uses the sustained_samples window — a violation must persist for N consecutive samples before it counts.

NUMA checks

For workers with a MemPolicy, three thresholds gate page placement:

  • min_page_locality — minimum fraction of pages on the expected NUMA nodes (the cgroup’s cpuset nodes, derived at evaluation time). Zero observed pages counts as zero locality, not a vacuous pass.
  • max_cross_node_migration_ratio — bound on migrated pages relative to allocated pages (from /proc/vmstat deltas).
  • max_slow_tier_ratio — bound on the fraction of pages landing on memory-only (CXL-tier) nodes.

Default thresholds

not_starved = true also enables the built-in fairness and gap checks at these defaults:

CheckReleaseDebug
Scheduling gap2000 ms3000 ms
Fairness spread15%35%

Debug builds run with higher scheduling overhead, so thresholds are relaxed.

How configuration merges

Assert is the threshold-config struct; every field is an Option where None means “inherit”. Three layers merge, last-Some wins: the baseline (all None), then the scheduler’s assert, then the per-test attributes — so a scheduler-wide bound applies to every test and any single test can override or disable it. enforce_monitor_thresholds is the one sticky field: once any layer sets it, it stays set. Worked override recipes live in Customize Checking.

execute_steps_with(ctx, steps, Some(&assert)) bypasses the merged config with an explicit Assert for that scenario’s worker checks.

Verdicts and outcomes

Every assertion produces one of four outcomes, and a result’s terminal verdict is the fold over all of them, most severe first: Fail > Inconclusive > Pass > Skip.

OutcomeMeaning
Passthe assertion ran and the value satisfied the bound
Failthe assertion ran and the value violated the bound
Inconclusivethe assertion ran but had no signal to evaluate
Skipthe scenario couldn’t run (unmet precondition)

Inconclusive exists for instrument-derived denominators — a ratio whose denominator (iterations, samples, wall-clock interval) legitimately reached zero because the workload produced no signal. Policy-derived denominators stay Fail on zero: under MemPolicy::Bind the policy says pages will exist, so their absence is a defect, not “couldn’t measure”.

CI gates read the verdict through four accessors:

if r.is_pass() { /* ship */ }
if r.is_fail() { /* block; surface r.failure_details() */ }
if r.is_skip() || r.is_inconclusive() { /* no verdict — triage */ }

is_pass() is deliberately strict: inconclusive and all-skip both read false.

Beyond attributes

  • Verdict + claim! — the claim accumulator for custom scenario bodies. Labels come from the code itself (stringify!-derived), so they cannot drift from the value they describe:

    let mut v = Assert::default_checks().verdict();
    stats.claim_max_gap_ms(&mut v).at_most(100);
    claim!(v, iter_delta).at_least(1000);
    let result = v.into_result();
  • AbsoluteThresholds — flat per-run bounds (max_p99_wake_latency_ns, max_iteration_cost_p99_ns, max_migrations, min_work_units) checked in one call: assert_thresholds(&reports, &AbsoluteThresholds::strict()). Empty report slices return a skip rather than a vacuous pass.

  • assert_scx_events_clean(events, bound) — SCX event counters under a cap (None = exactly zero); negative counts always fail.

  • CompositionAssertResult::merge accumulates results in a loop; all_of / any_of fold sibling results as AND / OR.

Signatures, comparators, and construction details are in the ktstr::assert rustdoc. For phase-scoped checks over a stepped scenario, see Phases and Temporal Assertions.