Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Reading Failure Output

When a test fails, everything ktstr knows lands in the test’s stderr as one bundle: the violated thresholds, the workload statistics, a phase timeline, the scheduler’s own log, the monitor’s summary, and — for scheduler crashes — the kernel’s sched_ext dump and an auto-repro trail. This page walks the sections in the order they appear, using two real failures.

A check-gate failure

This test set an iteration-rate floor its (deliberately slowed) scheduler could not meet. The first line names the test, scheduler, and topology; the indented lines under it are the violated checks:

cargo ktstr test --kernel 7.0 -- --features integration -E 'test(=ktstr/throughput_gate)'
  TRY 1 FAIL [  31.810s] (───) ktstr::docs_demo ktstr/throughput_gate
  stderr ───
...
    ktstr_test 'throughput_gate' [sched=scx-ktstr] [topo=1n1l2c1t] failed:
      worker 71 iteration rate 41903.3/s below floor 50000000.0/s
      worker 73 iteration rate 37834.5/s below floor 50000000.0/s

    --- stats ---
    2 workers, 4 cpus, 2 migrations, worst_spread=0.0%, worst_gap=21ms
      cg0: workers=1 cpus=2 spread=0.0% gap=10ms migrations=1 iter=209600
      cg1: workers=1 cpus=2 spread=0.0% gap=21ms migrations=1 iter=189252

    --- timeline ---
    topology: 1n1l2c1t (2 cpus)  scheduler: my_sched  scenario: throughput_gate  duration: 15.0s

    Phase 1: StepStart[0] ops=0 (4960ms, 0 samples):
      imbalance: avg=1.2 max=5.0 | dsq: avg=0 max=0 | nr_run: avg=1.0 | fallback: 0/s | keep_last: 38/s | throughput: 79697 iter/s (stimulus-derived)
      per-cgroup:
        cg_a: off-cpu avg=0.3% min=0.3% max=0.3% spread=0.0% | run-delay mean=915µs worst=915µs | iters=209600 migrations=1 | gap=10ms@cpu0
        cg_b: off-cpu avg=9.0% min=9.0% max=9.0% spread=0.0% | run-delay mean=5654µs worst=5654µs | iters=189252 migrations=1 | gap=21ms@cpu0
      >>> StepStart[0]: ops=0 (2 cgroups, 2 workers)

Reading it:

  • Header + details — the checks that failed, one line each, with the observed value and the threshold in the same line. The detail line is the verdict; everything below is context.
  • --- stats --- — the per-run roll-up: worker count, distinct CPUs touched, migrations, worst per-cgroup spread and off-CPU gap. cg{i} is the positional index of the cgroup in this roll-up, not its name — it lines up with the per-cgroup rows underneath.
  • --- timeline --- — one block per scenario phase with monitor averages and per-cgroup detail (named cgroups here). In this run, cg_b’s 9% off-CPU time and 5.6 ms mean run-delay against cg_a’s 0.3%/0.9 ms is the asymmetry the failed rate gate traces back to.

Two more sections follow every failure when they have content:

    --- scheduler log ---
    libbpf: struct_ops ktstr_ops: member sub_attach not found in kernel, skipping it as it's set to zero
...

    --- monitor ---
    samples=41 max_imbalance=2.00 max_dsq_depth=0 stuck=0
    avg: imbalance=1.32 nr_running/cpu=1.2 dsq/cpu=0.0
    events: fallback=0 (0.0/s) keep_last=210 (52.5/s) offline=0
    events+: refill_slice_dfl=210
    schedstat: csw=586 (146/s) run_delay=381246314ns/s ttwu=204 goidle=1
    bpf: ktstr_select_cp cnt=189 145ns/call
    bpf: ktstr_enqueue cnt=373 34ns/call
    bpf: ktstr_dispatch cnt=584 237ns/call
    verdict: monitor OK

The scheduler log is whatever the scheduler binary printed (libbpf noise included). The monitor block is the host-side observer’s summary — see Monitor for what each line means. verdict: monitor OK here says the monitor’s checks passed; the test still failed on the worker-side rate gate. The two channels are independent.

A scheduler crash

When the scheduler itself dies, the trail grows: a BUG SUMMARY line, a --- diagnostics --- section for the run stage, the kernel’s sched_ext debug dump, and an auto-repro section. From a real scx_bpf_error crash (triggered on purpose by a demo test):

BUG SUMMARY: scx_bpf_error (src/bpf/main.bpf.c:424: ktstr: host-triggered crash)
ktstr_test 'bpf_crash_auto_repro_e2e' [sched=scx-ktstr] [topo=1n1l4c1t] failed:
  scheduler process died unexpectedly during workload (2.2s into test)

--- stats ---
4 workers, 0 cpus, 0 migrations, worst_spread=0.0%, worst_gap=0ms
  cg0: workers=4 cpus=0 spread=n/a gap=0ms migrations=0 iter=0

--- diagnostics ---
stage: payload started but produced no test result
exit_code=1

BUG SUMMARY is the one-line cause, extracted from the kernel’s triggered exit kind emission or the scheduler log. The --- diagnostics --- stage line tells you how far the run got before dying — here the workload started but never reported, because the scheduler died under it.

The scheduler-log section carries the kernel’s full debug dump for scheduler exits — exit kind, backtrace, and (if the scheduler implements ops.dump) its own state:

--- scheduler log ---
...
DEBUG DUMP
================================================================================

swapper/3[0] triggered exit kind 1025:
  scx_bpf_error (src/bpf/main.bpf.c:424: ktstr: host-triggered crash)

Backtrace:
  scx_exit+0x50/0x70
  scx_bpf_error_bstr+0x78/0x90
  bpf_prog_1fed99378f3a8055_ktstr_dispatch+0x4d/0x1cb
  bpf__sched_ext_ops_dispatch+0x4b/0xa7
  do_pick_task_scx+0x379/0x770
  __schedule+0x5ca/0xfc0
...
ktstr scheduler state:
  stall=0 crash=1 degrade_rt=0

A --- sched_ext dump --- section repeats the same dump as captured from the kernel trace channel, and --- auto-repro --- reports the second VM’s replay of the crash:

--- auto-repro ---
--- probe pipeline ---
  extracted:   10 functions from crash backtrace
  traceable:   7 passed, 3 dropped: bpf_prog_1fed99378f3a8055_ktstr_dispatch, bpf__sched_ext_ops_dispatch, ret_from_fork_asm
...
repro VM duration: 16.9s

See Auto-Repro for how to read the probe pipeline and its output.

Detail-line catalog

The worker-side checks emit a fixed set of detail-line shapes (each format string is pinned by a unit test, so these stay accurate):

  • worker {N} iteration rate {R}/s below floor {F}/s — a benchmark rate gate failed.
  • tid {N} starved (0 work units) — a worker made no progress at all (not_starved).
  • tid {N} stuck {X}ms on cpu{C} at +{T}ms (threshold {N}ms) — a worker’s longest off-CPU gap crossed max_gap_ms.
  • unfair cgroup: spread={P}% ({lo}-{hi}%) {N} workers on {N} cpus (threshold {P}%) — per-cgroup fairness exceeded max_spread_pct.

See Checking for the model behind these and the monitor-side violations.

Artifacts on disk

After the run, cargo ktstr prints where everything landed:

cargo ktstr: test outputs
  ~/ktstr/target/ktstr/7.0.14-73730e0-dirty
    FAILED  throughput_gate  [my_sched 1n1l2c1t]
      failure dump  ~/ktstr/target/ktstr/7.0.14-73730e0-dirty/throughput_gate-2ecd2624f3df7276.failure-dump.json
      stats         ~/ktstr/target/ktstr/7.0.14-73730e0-dirty/throughput_gate-2ecd2624f3df7276.ktstr.json
      replay        cargo ktstr replay --filter throughput_gate --exec

Every failed test writes a {test_name}-{variant_hash:016x}.failure-dump.json next to its result sidecar in the run directory (see Runs for the directory semantics). Auto-repro runs write a sibling .repro.failure-dump.json with the repro VM’s own snapshot. The path is pre-cleared at each dispatch, so a passing rerun never leaves a stale dump behind.

The dump comes in two shapes. When the scheduler attached and its exit path triggered, it is a full post-mortem: BPF map contents with BTF-typed field names, per-vCPU registers, per-program runtime stats — the JSON form of what Snapshots renders. When the failure happened before the BPF probe attached, a placeholder is written instead, and says so:

{
    "schema": "single",
    "maps": [],
    "sdt_alloc_unavailable": "test failed at stage `payload started but produced no test result`; no BPF state captured (probe did not attach before failure)",
    ...
    "is_placeholder": true
}

The JSON is for tooling that walks the run directory; humans should read the stderr — the actionable diagnostics are the BUG SUMMARY line and the --- sched_ext dump --- section.

Investigation workflow

  1. Read the header and detail lines — they name the check and the margin by which it failed.

  2. For check failures, correlate against --- stats --- and --- timeline ---: which cgroup, which phase, migrations or gaps?

  3. For crashes, start from BUG SUMMARY and the backtrace in the debug dump, then read the auto-repro trail for the state on the way to the exit.

  4. Re-run exactly the failing variant — including the gauntlet preset segment if it was a gauntlet case (see test name shapes):

    cargo ktstr test --kernel 7.0 -- -E 'test(=gauntlet/my_test/smt-2llc)'
    

    or re-run everything that failed last session with cargo ktstr replay.

  5. Poke at the same environment interactively: cargo ktstr shell --test my_test boots a VM with the test’s topology, memory, and include files (see ktstr shell).

Verbosity knobs

  • RUST_BACKTRACE=1 — Rust panic backtraces, plus ktstr’s verbose mode: the full guest kernel console is appended to --- diagnostics --- on failure, the auto-repro VM’s console is forwarded live, and the guest boots with loglevel=7.
  • RUST_LOG=ktstr=debug — host-side tracing (probe attach reasons, libbpf errors).
  • --dmesg — streams the guest kernel console in real time; it is a flag on cargo ktstr shell / ktstr shell, not on test.

Environment errors (kernel not found, cgroup controllers missing, flock timeouts) are cataloged in Troubleshooting.