Monitor

When your scheduler leaves a CPU starving, the monitor is what notices. It runs on the host while the guest executes and reads scheduler state directly out of guest memory — the scheduler under test never executes an extra instruction to be observed, and no BPF probe perturbs the decisions being measured.

Every test report ends with the monitor’s summary. From a real run:

--- monitor ---
samples=41 max_imbalance=2.00 max_dsq_depth=0 stuck=0
avg: imbalance=1.32 nr_running/cpu=1.2 dsq/cpu=0.0
events: fallback=0 (0.0/s) keep_last=210 (52.5/s) offline=0
events+: refill_slice_dfl=210
schedstat: csw=586 (146/s) run_delay=381246314ns/s ttwu=204 goidle=1
bpf: ktstr_select_cp cnt=189 145ns/call
bpf: ktstr_enqueue cnt=373 34ns/call
bpf: ktstr_dispatch cnt=584 237ns/call
verdict: monitor OK

Reading it: samples is how many point-in-time snapshots the monitor took; max_imbalance/max_dsq_depth/stuck are the peaks the threshold checks evaluate; events are sched_ext event-counter rates (select-cpu fallbacks, keep-last dispatches); the bpf: lines are per-callback invocation counts and mean cost from the guest kernel’s BPF program-runtime stats; verdict is what folds into the test result.
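The parenthesised rates and the ns/call figures are simple quotients. As a minimal sketch (the function names are hypothetical, not ktstr's API), an event rate is the counter delta divided by the observation window, and the mean callback cost is accumulated runtime divided by invocation count. The keep_last line above is consistent with a window of about 4 seconds, since 210 / 52.5 = 4.0.

```rust
/// Events per second over an observation window.
/// Returns None for a zero, negative, or NaN window so callers never divide by zero.
pub fn rate_per_sec(delta: u64, elapsed_secs: f64) -> Option<f64> {
    if elapsed_secs > 0.0 {
        Some(delta as f64 / elapsed_secs)
    } else {
        None
    }
}

/// Mean cost of one BPF callback invocation, in nanoseconds (integer division).
pub fn mean_ns_per_call(total_runtime_ns: u64, calls: u64) -> Option<u64> {
    if calls == 0 {
        None
    } else {
        Some(total_runtime_ns / calls)
    }
}
```

For example, rate_per_sec(210, 4.0) yields 52.5, matching the keep_last rate in the sample report.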

What it reads

The monitor resolves kernel structure offsets from the guest kernel’s BTF — nothing is hardcoded per kernel version. Per CPU, it reads the runqueue’s nr_running, scx_nr_running, rq_clock, local_dsq_depth, and scx_flags, plus the sched_ext event counters (select-cpu fallback, dispatch keep-last, bypass activity, and the rest of the family). When the guest kernel has CONFIG_SCHEDSTATS, it also reads per-CPU struct rq schedstat fields (run_delay, pcount, ttwu_count, …).

It also walks the struct sched_domain tree — from rq->sd up the sd->parent chain — whenever the BTF exposes it, capturing per-level topology metadata (level, name, flags, span_weight) and runtime fields (balance_interval, nr_balance_failed, max_newidle_lb_cost), plus load-balancing stats when CONFIG_SCHEDSTATS is enabled. Fields that newer kernels added (the proportional-newidle counters newidle_call / newidle_success / newidle_ratio, new in 7.0 with some stable backports) resolve as optional: on kernels whose BTF lacks them they are simply absent, and the rest of the walk proceeds.

Sampling

The monitor takes periodic snapshots (MonitorSample) of all per-CPU state; each sample is a point-in-time view of every CPU. MonitorSummary aggregates samples into peak values (max imbalance ratio, max DSQ depth, stall detection), per-sample averages, and event-counter deltas. Averages are computed over valid samples only (excluding uninitialized guest memory — see below).
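To make the aggregation concrete, here is a minimal sketch of folding samples into peaks and averages. The Sample and Summary types below are simplified, hypothetical stand-ins for MonitorSample and MonitorSummary, and clamping the minimum to 1 is an assumption made here to avoid dividing by zero on an idle CPU; the real monitor may handle that case differently.

```rust
/// Hypothetical, simplified snapshot: one value per CPU.
#[derive(Debug, Clone)]
pub struct Sample {
    pub nr_running: Vec<u32>,
    pub local_dsq_depth: Vec<u32>,
}

/// Peaks and averages over a run, mirroring the first two summary lines.
#[derive(Debug, Clone, PartialEq)]
pub struct Summary {
    pub samples: usize,
    pub max_imbalance: f64,
    pub avg_imbalance: f64,
    pub max_dsq_depth: u32,
}

/// max/min per-CPU nr_running for one sample.
/// Assumption: the minimum is clamped to 1 so an idle CPU cannot cause a
/// division by zero.
pub fn imbalance_ratio(nr_running: &[u32]) -> Option<f64> {
    let max = *nr_running.iter().max()?;
    let min = *nr_running.iter().min()?;
    Some(f64::from(max) / f64::from(min.max(1)))
}

/// Fold samples into peak and average values. Samples with no CPUs are skipped.
pub fn summarize(samples: &[Sample]) -> Option<Summary> {
    let mut counted = 0usize;
    let mut max_imbalance = 0.0f64;
    let mut sum_imbalance = 0.0f64;
    let mut max_dsq_depth = 0u32;
    for s in samples {
        let Some(ratio) = imbalance_ratio(&s.nr_running) else { continue };
        counted += 1;
        max_imbalance = max_imbalance.max(ratio);
        sum_imbalance += ratio;
        let depth = s.local_dsq_depth.iter().copied().max().unwrap_or(0);
        max_dsq_depth = max_dsq_depth.max(depth);
    }
    if counted == 0 {
        return None;
    }
    Some(Summary {
        samples: counted,
        max_imbalance,
        avg_imbalance: sum_imbalance / counted as f64,
        max_dsq_depth,
    })
}
```

Returning None for an empty input keeps "no data" distinguishable from "all zeros", which matters for the inconclusive handling described later.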

Threshold evaluation

MonitorThresholds defines the pass/fail conditions:

Threshold             Default    Trips when
max_imbalance_ratio   4.0        max/min per-CPU nr_running exceeds the ratio
max_local_dsq_depth   50         any CPU’s local DSQ exceeds the depth
fail_on_stall         true       a CPU’s rq_clock stops advancing (exemptions below)
max_fallback_rate     200.0/s    sustained select-cpu-fallback event rate
max_keep_last_rate    100.0/s    sustained dispatch-keep-last event rate
sustained_samples     5          window: a violation must persist this many consecutive samples

A violation must persist for sustained_samples consecutive samples before it counts — at the ~100ms sample interval, the default 5 means roughly 500ms of sustained violation. This filters transient spikes from cpuset transitions and cgroup creation/destruction. The reasoning behind each default value lives in Checking.
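The sustained window amounts to a consecutive-sample counter that resets on the first clean sample. A minimal sketch, with a hypothetical SustainedCounter type that is not taken from the ktstr source:

```rust
/// Counts consecutive violating samples and reports when the run is long
/// enough to count as a sustained violation.
#[derive(Debug, Clone)]
pub struct SustainedCounter {
    consecutive: u32,
    required: u32,
}

impl SustainedCounter {
    /// `required` plays the role of sustained_samples (default 5 in the table).
    pub fn new(required: u32) -> Self {
        Self { consecutive: 0, required }
    }

    /// Feed one sample. Returns true once the violation has persisted for
    /// `required` consecutive samples; any clean sample resets the run.
    pub fn observe(&mut self, violating: bool) -> bool {
        if violating {
            self.consecutive = self.consecutive.saturating_add(1);
        } else {
            self.consecutive = 0;
        }
        self.consecutive >= self.required
    }
}
```

With required = 5, four bad samples followed by one good sample never trip the check; only an unbroken run of five does.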

Test authors do not construct MonitorThresholds directly: the #[ktstr_test] threshold attributes (max_imbalance_ratio, fail_on_stall, …) and Assert::with_monitor_defaults() feed it — see the macro reference.

enforce is the on/off gate for the threshold-violation path. The default is report-only: monitor evaluations record every violation in the verdict’s details, but the verdict passes. Opting in — via Assert::with_monitor_defaults(), which fills unset threshold fields and sets enforce — promotes recorded violations to failures. Setting a field like fail_on_stall without enforcement is a no-op for the violation path: the violation appears in the monitor report, the verdict still passes, and the summary carries a report-only advisory flagging the missing enforcement.

The no-signal arms bypass enforce entirely. An empty sample buffer, or data that fails the plausibility check below, always produces a verdict with passed: false, inconclusive: true, which folds into the test’s AssertResult as Inconclusive (exit code 2). “Couldn’t evaluate” is not the same as “evaluated and OK,” so the no-signal path always surfaces as an outcome distinct from Pass. Only threshold violations are gated by enforce.
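The interaction between enforce and the no-signal arms can be summarised as a small decision function. This is a sketch of the rules as described above, using a hypothetical Outcome enum and fold_verdict function rather than ktstr's own types:

```rust
/// Hypothetical three-way outcome corresponding to Pass, Fail, and Inconclusive.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Pass,
    Fail,
    Inconclusive,
}

/// `has_signal` is false for an empty sample buffer or implausible data.
/// `violations` is the number of recorded threshold violations.
pub fn fold_verdict(has_signal: bool, violations: usize, enforce: bool) -> Outcome {
    // No-signal arms ignore `enforce`: "couldn't evaluate" is never a pass.
    if !has_signal {
        return Outcome::Inconclusive;
    }
    // Violations are always recorded, but only fail the verdict when enforced.
    if violations > 0 && enforce {
        Outcome::Fail
    } else {
        Outcome::Pass
    }
}
```

Note that fold_verdict(true, 3, false) is Pass (report-only), while fold_verdict(false, 0, false) is Inconclusive regardless of the gate.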

Stall detection

A stall is detected when a CPU’s rq_clock does not advance between consecutive samples. Three exemptions prevent false positives:

  • Idle CPUs: when nr_running == 0 in both the current and previous sample, the CPU has no runnable tasks. The kernel stops the tick (NOHZ) on idle CPUs, so rq_clock legitimately does not advance.
  • Preempted vCPUs: when the vCPU thread’s CPU time did not advance past the preemption threshold between samples, the host preempted the vCPU — the guest never got a chance to run, which is not the scheduler’s fault.
  • Sustained window: stall detection uses per-CPU consecutive counters and the sustained_samples threshold, matching the other checks. A single stuck sample does not trigger failure.
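The first two exemptions can be expressed as a per-CPU predicate over a pair of consecutive samples. The sketch below is illustrative only: CpuObservation, looks_stuck, and the threshold parameter are hypothetical names, and the actual preemption threshold value is not stated here. The sustained window is then applied on top, by counting how many consecutive samples this predicate holds for each CPU.

```rust
/// Hypothetical per-CPU view of one sample.
#[derive(Debug, Clone, Copy)]
pub struct CpuObservation {
    pub rq_clock: u64,
    pub nr_running: u32,
    /// Host-side CPU time consumed by the vCPU thread, in nanoseconds.
    pub vcpu_cpu_time_ns: u64,
}

/// True when the CPU looks stuck between two consecutive samples and neither
/// the idle nor the preempted-vCPU exemption applies.
pub fn looks_stuck(
    prev: &CpuObservation,
    cur: &CpuObservation,
    preempt_threshold_ns: u64,
) -> bool {
    // The clock advanced: not a stall.
    if cur.rq_clock > prev.rq_clock {
        return false;
    }
    // Idle exemption: nothing runnable in either sample, so the tick may be stopped.
    if prev.nr_running == 0 && cur.nr_running == 0 {
        return false;
    }
    // Preemption exemption: the vCPU thread barely ran on the host.
    let vcpu_ran_ns = cur.vcpu_cpu_time_ns.saturating_sub(prev.vcpu_cpu_time_ns);
    if vcpu_ran_ns <= preempt_threshold_ns {
        return false;
    }
    true
}
```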

Uninitialized memory detection

Before the guest kernel initializes per-CPU structures, monitor reads return garbage. Two layers handle this: summary computation skips individual samples where any CPU’s local_dsq_depth exceeds a plausibility ceiling (10,000), and threshold evaluation checks the whole report — if all rq_clock values are identical across every CPU and sample, or any sample exceeds the ceiling, the report is classified “not yet initialized” and no per-threshold checks run (this is one of the Inconclusive arms above).
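The two layers can be sketched as a per-sample check and a whole-report check. The Sample type and function names are hypothetical; the ceiling of 10,000 and the identical-rq_clock rule are the ones described above. Treating an empty report as uninitialized is an assumption of this sketch.

```rust
/// Plausibility ceiling for local_dsq_depth, as described in the text.
pub const DSQ_PLAUSIBILITY_CEILING: u32 = 10_000;

/// Hypothetical, simplified snapshot: one value per CPU.
#[derive(Debug, Clone)]
pub struct Sample {
    pub rq_clock: Vec<u64>,
    pub local_dsq_depth: Vec<u32>,
}

/// Layer 1: a sample is usable only if every CPU's depth is at or below the ceiling.
pub fn sample_is_plausible(s: &Sample) -> bool {
    s.local_dsq_depth
        .iter()
        .all(|&d| d <= DSQ_PLAUSIBILITY_CEILING)
}

/// Layer 2: the whole report counts as initialized only if no sample exceeds
/// the ceiling and the rq_clock values are not all identical.
pub fn report_is_initialized(samples: &[Sample]) -> bool {
    if samples.iter().any(|s| !sample_is_plausible(s)) {
        return false;
    }
    let mut clocks = samples.iter().flat_map(|s| s.rq_clock.iter().copied());
    // Assumption: a report with no clock values at all is treated as uninitialized.
    let Some(first) = clocks.next() else { return false };
    clocks.any(|c| c != first)
}
```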

The monitor never instruments the guest

Everything above is passive memory reading. The one ktstr feature that does load BPF probes into a guest is auto-repro — and it runs them in a separate, disposable repro VM booted after the original test failed, never in the VM whose behavior is being measured. The run your verdict is based on is unperturbed.

How guest memory is read

Three address-translation modes cover the kernel’s address spaces:

  • Text/data/bss: linear offset from the kernel’s static map, for statically-linked kernel variables.
  • Direct mapping: kva - PAGE_OFFSET, for SLAB allocations and per-CPU data.
  • Vmalloc/vmap: a real page-table walk through the guest’s CR3 (4- and 5-level paging on x86_64; 4/16/64 KB granules on aarch64), for BPF maps and vmalloc’d memory.
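To illustrate the second and third modes, the sketch below shows the direct-map subtraction and how an x86_64 virtual address splits into 4-level page-table indices (9 bits per level, 4 KB pages). The function names are hypothetical, and a real walk would go on to read each table entry from guest physical memory, which is omitted here.

```rust
/// Direct-map translation: guest physical address = kva - PAGE_OFFSET.
/// Returns None if the address lies below the direct-map base.
pub fn direct_map_to_phys(kva: u64, page_offset: u64) -> Option<u64> {
    kva.checked_sub(page_offset)
}

/// Split an x86_64 virtual address into its 4-level page-table indices,
/// ordered from the top-level table down to the last level, plus the byte
/// offset within a 4 KB page.
pub fn split_4level(va: u64) -> ([usize; 4], u64) {
    // Each level indexes a 512-entry table, so each index is 9 bits wide.
    let idx = |shift: u32| ((va >> shift) & 0x1ff) as usize;
    ([idx(39), idx(30), idx(21), idx(12)], va & 0xfff)
}
```

For instance, split_4level(0xffff_c900_0012_3456) yields indices [0x192, 0, 0, 0x123] and a page offset of 0x456. With 5-level paging an additional 9-bit index at bit 48 sits above these four.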

All reads are bounds-checked and volatile (the guest modifies memory concurrently), and the runtime KASLR offset is recovered at startup so ELF symbols from the matching vmlinux resolve to live guest addresses. vmlinux must match the guest kernel — it supplies both the symbol table and the BTF. The implementing types (GuestMem, GuestKernel, GuestMemMapAccessor) are documented in the monitor module’s rustdoc (cargo doc --document-private-items).

BPF map access

The monitor also discovers and reads/writes the scheduler’s BPF maps directly through guest physical memory — no guest cooperation, no BPF syscalls. Maps are found by walking the kernel’s map_idr and matched by name suffix (".bss" matches "mitosis.bss"); values are read at BTF-resolved offsets, including per-CPU array maps. When a map carries program BTF, the dump renderer uses it to render the value struct field by field — which is why failure dumps show your BPF globals by name.

Tests use this through BpfMapWrite: a host-side write to a BPF map during VM execution. The test runner waits for the scheduler to load (the map becomes discoverable), writes the value, then signals the guest to start the scenario:

// Host-side write: set the `crash` field to 1 in the map whose name ends in ".bss".
const BPF_CRASH: BpfMapWrite = BpfMapWrite::new(".bss", "crash", 1);

#[ktstr_test(bpf_map_write = BPF_CRASH, expect_err = true)]
fn crash_test(ctx: &Ctx) -> Result<AssertResult> {
    Ok(AssertResult::pass())
}

The field’s byte offset and width are resolved from the map’s program BTF at write time (which also disambiguates same-suffix maps by picking the one whose BTF names the field). Only array maps and 4-byte scalar fields are supported. For reading map state from test code, see Snapshots.
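The 4-byte restriction makes the write itself small. The following sketch shows a bounds-checked scalar write into a local copy of a map value; write_scalar_field and WriteError are hypothetical names, the offset and width are assumed to have already been resolved from BTF, and little-endian byte order is assumed for the guest. The real write targets guest memory rather than a local buffer.

```rust
/// Reasons a host-side field write can be rejected in this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WriteError {
    /// Only 4-byte scalar fields are supported.
    UnsupportedWidth(usize),
    /// The field does not fit inside the map value.
    OutOfBounds,
}

/// Write `new` into the field at `offset` within a map value.
/// Assumption: the guest is little-endian.
pub fn write_scalar_field(
    value: &mut [u8],
    offset: usize,
    width: usize,
    new: u32,
) -> Result<(), WriteError> {
    if width != 4 {
        return Err(WriteError::UnsupportedWidth(width));
    }
    let end = offset.checked_add(width).ok_or(WriteError::OutOfBounds)?;
    let slot = value.get_mut(offset..end).ok_or(WriteError::OutOfBounds)?;
    slot.copy_from_slice(&new.to_le_bytes());
    Ok(())
}
```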

What a dump shows: cast analysis

When a test fails, the failure dump renders the scheduler’s BPF map state — with BTF names, not raw bytes. From a real failure:

map bpf_bpf.bss (type=array, value_size=448, max_entries=1)
.bss:
  scx_arena_verify_once=true   ktstr_alloc_count=76   nr_dispatched=907
  nr_enqueued=495              nr_select_cpu=372      stats_magic=6004496034161779060
...
    root 0x100000006000 → sdt_desc:
      nr_free=512
      chunk 0x100000007000 (sdt_alloc) → ktstr_arena_ctx{}
  ktstr_bss_arena_holder ktstr_bss_arena_holder:
    bss_plain_counter=76
    arena_target 0x10000000aa80 (cast→arena) [chase: arena chase: STX-flow path tagged slot as Arena with deferred resolve; bridge had no entry for 0x10000000aa80]

BPF schedulers frequently store kernel and arena pointers in u64 fields, because BTF cannot express a pointer to a per-allocation type. Without help, the renderer would print those as meaningless integers. The cast analyzer closes that gap by analyzing the scheduler binary’s BPF bytecode to learn which u64 fields actually hold pointers, so the renderer can chase them. The annotations tell you what happened:

  • (cast→arena) / (cast→kernel) — the pointer was recovered by cast analysis and chased into arena or kernel memory; the rendered fields after it are real dereferenced state, visually distinct from natively BTF-typed pointers.
  • (sdt_alloc) — the chase resolved the pointee’s type through a live arena-allocator slot (the common pattern for scx_task_data()-style per-task state).
  • [chase: …] — the chase stopped, and this is why. A stopped chase falls back to showing the raw value; the analyzer is deliberately conservative, preferring a raw u64 over chasing garbage.

The analysis is unconditional — no opt-in, no test-author configuration — and applies to every snapshot, periodic capture, and failure dump. Failure dumps are also written machine-readable as a JSON artifact next to the test outputs, carrying the same BTF-resolved field names and cast annotations:

{
  "name": "bpf_bpf.bss",
  "map_kva": 18400526959283003096,
  "map_type": 2,
  "value_size": 448,
  "max_entries": 1,
  "value": {
    "kind": "struct",
    "type_name": ".bss",
    "members": [
      {
        "name": "scx_arena_verify_once",
        "value": {
          "kind": "bool",
          "value": true
        }
      },
...
      {
        "name": "nr_dispatched",
        "value": {
          "kind": "uint",
          "bits": 64,
          "value": 849
        }
      },
...

Every field name here was resolved on the host from the guest’s BTF — the guest wrote nothing but its normal map state. See Reading Failure Output for the full anatomy of a failure report.