ktstr in Action

The feature tour: each capability in a few sentences, with real output where output is the point — captured from actual runs (ktstr 0.23.0, kernel 7.0.14 unless noted), trimmed but never edited.

Testing

Real kernel, clean slate, x86/arm parity

Every VM test boots its own Linux kernel in KVM — fresh state each run, no shared daemons, no containers. Topology is configurable per test: NUMA nodes, LLCs, cores per LLC, threads per core, with real ACPI SRAT/SLIT tables on x86_64 and FDT cpu nodes on aarch64 — 24 topology presets on x86_64, 14 on aarch64. The framework drives the whole lifecycle: boot, attach, scenario, collect, teardown. See Topology.

Fast boot

The initramfs base (test binary + busybox + shared libraries) is LZ4-compressed and cached in shared memory; concurrent VMs COW-map the cached base instead of rebuilding it, so boot is dominated by kernel init, not initramfs preparation. Measured on a 64-CPU host; each figure is cumulative elapsed time since setup began, which is why the last stage nearly equals the total:

  initramfs spawn: 55.583µs
  kvm+kernel: 867.005µs
  setup_memory (joins initramfs): 1.409360963s
  setup_vcpus: 1.409565321s
VM setup total: 1.409619773s

Declarative scheduler registration

One macro declares a scheduler — binary, default topology, kernel filter for the verifier sweep, assertion overrides, always-on CLI args — and tests reference the const it emits:

use ktstr::declare_scheduler;

declare_scheduler!(MITOSIS, {
    name = "mitosis",
    binary = "scx_mitosis",
    topology = (1, 2, 4, 1),
    sched_args = ["--exit-dump-len", "1048576"],
});

Schedulers that take a --config JSON file declare the arg template once; each test supplies its config inline via config = …. See Scheduler Definitions and The #[ktstr_test] Attribute.

Data-driven scenarios

You declare intent as data — cgroups, cpusets, workloads, mid-run ops — and the framework creates cgroups, spawns workers, applies policies and affinity, and tears it all down. Canned scenarios grouped by scheduling concern — affinity, cpusets, dynamic cgroups, nesting, contention, stress — cover the common patterns; the ops DSL underneath (Step, Op, Backdrop) expresses the rest. See Scenarios and Ops, Steps, and Backdrop.

45 work types

Each work type targets a specific scheduling pressure, so a test can pin the kernel path a regression lives in. A sample of the 45:

Pressure                 Examples
CPU and IPC              SpinWait, AluHot, IpcVariance
Wakeup placement         FutexPingPong, WakeChain, PipeIo
Task churn               ForkExit, CgroupAttachStorm, AffinityChurn
Priority and starvation  PreemptStorm, RtStarvation, PriorityInversion
Memory and NUMA          CachePressure, PageFaultChurn, NumaWorkingSetSweep
IRQ and I/O              IrqWake, NetTraffic, IoConvoy
Benchmarks               Schbench, Taobench

Workers can also set comm, nice level, and thread-group-leader name (pcomm) to model real applications, and Custom takes a user-supplied work function. Full catalog: Work Types.

Gauntlet

A single #[ktstr_test] auto-expands across topology presets, and multi-kernel runs (--kernel A --kernel B) add the kernel as another matrix dimension. Budget-based selection maximizes coverage within a CI time limit; multi-NUMA and very large presets are opt-in via constraint attributes. See Gauntlet.
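To make budget-based selection concrete, here is a minimal greedy sketch. It is not ktstr's selector: the `Candidate` struct, the per-preset time estimates, and the "features" used as a coverage measure are all hypothetical. Only the preset names are borrowed from the verifier output on this page. The sketch repeatedly picks the preset that adds the most uncovered features per estimated second and still fits in the remaining budget.

```rust
use std::collections::HashSet;

/// One gauntlet candidate. Every field is a hypothetical stand-in.
struct Candidate {
    name: &'static str,
    est_secs: u64,
    features: Vec<&'static str>,
}

/// Greedy pick: best new-coverage-per-second first, until nothing fits.
fn select_within_budget(candidates: &[Candidate], budget_secs: u64) -> Vec<&'static str> {
    let mut covered: HashSet<&'static str> = HashSet::new();
    let mut chosen: Vec<&'static str> = Vec::new();
    let mut remaining = budget_secs;
    loop {
        // Best candidate so far as (index, number of newly covered features).
        let mut best: Option<(usize, u64)> = None;
        for (i, c) in candidates.iter().enumerate() {
            if chosen.contains(&c.name) || c.est_secs > remaining {
                continue;
            }
            let gain = c.features.iter().filter(|f| !covered.contains(**f)).count() as u64;
            if gain == 0 {
                continue;
            }
            let better = match best {
                None => true,
                // gain/cost > best_gain/best_cost, compared without division.
                Some((j, best_gain)) => {
                    gain * candidates[j].est_secs.max(1) > best_gain * c.est_secs.max(1)
                }
            };
            if better {
                best = Some((i, gain));
            }
        }
        match best {
            Some((i, _)) => {
                let c = &candidates[i];
                remaining -= c.est_secs;
                chosen.push(c.name);
                covered.extend(c.features.iter().copied());
            }
            None => return chosen,
        }
    }
}
```

Ties keep the earlier candidate, so the result is deterministic for a given input order. A real selector would likely weigh more than a feature count; the point here is only the shape of the trade-off between coverage and time.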

Real-kernel BPF verifier analysis

The verifier sweep boots a VM per (scheduler × kernel × topology preset), loads the scheduler through struct_ops — the same path production uses — and reads actual verified instruction counts from guest memory. Topology is a real verification axis: values baked into .rodata (like CPU counts) change what the verifier explores, so a scheduler can attach on one topology and be rejected on another.

cargo ktstr verifier --kernel 7.0 --scheduler ktstr_sched
        PASS [  12.406s] (1/4) ktstr::kaslr_axis_e2e verifier/ktstr_sched/kernel_7_0/odd-3llc
...
verifier verified_insns (per scheduler; rows: kernel, cols: BPF program, cell: range across topologies):

ktstr_sched:
 kernel      ktstr_dispatch  ktstr_dump  ktstr_dump_cpu  ktstr_dump_task  ktstr_enqueue  ktstr_exit  ktstr_exit_task  ktstr_init  ktstr_init_task  ktstr_select_cp  ktstr_yield 
 kernel_7_0  102             81          13              70               74             25          419              2296        29077            39               8           

verifier summary: 4 ✅  0 ❌  0 🇽
 topology   ktstr_sched 
 odd-3llc   ✅          
 smt-2llc   ✅          
 tiny-1llc  ✅          
 tiny-2llc  ✅          
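The sweep matrix above is a plain cross product. The sketch below enumerates it under one assumption: that job names follow the `verifier/<scheduler>/kernel_<version>/<preset>` pattern visible in the PASS line, with dots in the kernel version replaced by underscores. The function name `sweep_jobs` is hypothetical and is not part of ktstr.

```rust
/// Enumerate one job per (scheduler, kernel, topology preset).
/// The name format is inferred from a single line of output and may not
/// match ktstr exactly.
fn sweep_jobs(schedulers: &[&str], kernels: &[&str], presets: &[&str]) -> Vec<String> {
    let mut jobs = Vec::new();
    for scheduler in schedulers {
        for kernel in kernels {
            for preset in presets {
                jobs.push(format!(
                    "verifier/{}/kernel_{}/{}",
                    scheduler,
                    kernel.replace('.', "_"),
                    preset
                ));
            }
        }
    }
    jobs
}
```

With one scheduler, one kernel, and the four presets shown in the summary, this yields four jobs, matching the `(1/4)` counter in the output.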

On rejection, the log is cycle-collapsed — repeated loop-unrolling iterations are deduplicated so the offending access is readable:

Global function ktstr_dispatch() doesn't return scalar. Only those are supported.
...
--- 8x of the following 25 lines ---
; u64 t = bpf_ktime_get_ns(); @ main.bpf.c:453
38: (85) call bpf_ktime_get_ns#5      ; R0=scalar()
...
--- 6 identical iterations omitted ---
...
--- end repeat ---
192: (63) *(u32 *)(r1 +0) = r2
R1 invalid mem access 'scalar'
processed 186 insns (limit 1000000) max_states_per_insn 0 total_states 7 peak_states 7 mark_read 0

See BPF Verifier Sweep.
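The idea behind cycle collapsing can be sketched in a few lines. This is a simplified illustration, not ktstr's implementation: it only folds exact, back-to-back repeats, the function name `collapse_repeats` and the `max_period` parameter are hypothetical, and the marker lines merely imitate the excerpt above.

```rust
/// Fold consecutive identical blocks of up to `max_period` lines into a
/// single copy wrapped in repeat markers.
fn collapse_repeats(lines: &[&str], max_period: usize) -> Vec<String> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < lines.len() {
        let mut collapsed = false;
        // Try the shortest period first so tight loops fold tightly.
        for period in 1..=max_period {
            if i + 2 * period > lines.len() {
                break;
            }
            let block = &lines[i..i + period];
            // Count how many times the block repeats back to back.
            let mut count = 1;
            while i + (count + 1) * period <= lines.len()
                && lines[i + count * period..i + (count + 1) * period] == *block
            {
                count += 1;
            }
            if count >= 2 {
                out.push(format!("--- {}x of the following {} lines ---", count, period));
                out.extend(block.iter().map(|l| l.to_string()));
                out.push("--- end repeat ---".to_string());
                i += count * period;
                collapsed = true;
                break;
            }
        }
        if !collapsed {
            out.push(lines[i].to_string());
            i += 1;
        }
    }
    out
}
```

Feeding it `a, x, y, x, y, x, y, b` with `max_period = 4` keeps `a` and `b` and replaces the middle with one `x, y` pair between a `3x of the following 2 lines` header and an `end repeat` footer.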

Bare-metal export

cargo ktstr export packages a registered test as a self-extracting .run script that reproduces the scenario on real hardware, no VM. The script freezes the scheduler binary, its args and config files, and the required topology; it validates the host and refuses to displace an already-attached sched_ext scheduler.

wrote /tmp/sched_basic_proportional.run (90074903 bytes archive, 0 include files)
----- head -40 of the generated script -----
#!/bin/bash
# Generated by `cargo ktstr export`. Do not edit; regenerate to update.
...
# --- frozen test specification ---
KTSTR_TEST_NAME=sched_basic_proportional
KTSTR_SCHED_NAME=ktstr_sched
KTSTR_GIT_HASH=73730e0
NEED_LLCS=1
NEED_CORES_PER_LLC=2
NEED_THREADS_PER_CORE=1
NEED_NUMA_NODES=1
...

See cargo ktstr.

Observability

Zero-perturbation introspection

Everything is built on direct reads of guest physical memory from the host via the KVM memory mapping. Kernel state — per-CPU runqueues, sched_domain trees, schedstat and sched_ext event counters — is read through BTF-resolved struct offsets; BPF maps get typed field access via program BTF. No guest-side instrumentation, no BPF syscalls: the observer does not perturb the scheduler under test. See Monitor.
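As a rough illustration of an offset-based read, the sketch below pulls a 64-bit field out of a byte slice standing in for mapped guest memory. It is not the monitor's code: `read_field_u64` is a hypothetical name, the slice replaces the real KVM mapping, and the offset is passed in by hand where the monitor would resolve it from BTF. Little-endian layout is assumed, which holds for x86_64 and for aarch64 Linux.

```rust
use std::convert::TryInto;

/// Read a little-endian u64 at `struct_base + field_offset`.
/// Returns None if the read would fall outside the mapped region.
fn read_field_u64(guest_mem: &[u8], struct_base: usize, field_offset: usize) -> Option<u64> {
    let start = struct_base.checked_add(field_offset)?;
    let end = start.checked_add(8)?;
    // `get` bounds-checks the range; `try_into` fixes the length at 8 bytes.
    let bytes: [u8; 8] = guest_mem.get(start..end)?.try_into().ok()?;
    Some(u64::from_le_bytes(bytes))
}
```

Because the read is a plain memory load on the host side, nothing runs in the guest to serve it, which is the property the section describes.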

Cast analysis

Schedulers stash kernel and arena pointers in BPF map fields declared as u64, because BTF cannot express those pointer types. The cast analyzer walks the scheduler’s instruction stream and proves which fields are really pointers and to what — so dumps chase through them and print typed structs annotated (cast→arena) or (cast→kernel) instead of raw hex. It runs on every scheduler load, no configuration (the failure-dump excerpt below shows it in action). See Monitor.

Periodic capture and temporal assertions

#[ktstr_test(num_snapshots = N)] samples BPF map fields and scheduler stats at N points across the workload window, from outside the guest. Temporal patterns — nondecreasing, rate_within, steady_within, converges_to, and friends — assert over the whole series; on-demand and write-triggered snapshots share the same machinery. See Periodic Capture, Temporal Assertions, and Snapshots.
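To show what asserting over a whole series means, here are two standalone checks named after the patterns above. The names come from this page, but the signatures and the exact semantics are guesses made for illustration, in particular the `tail` parameter of `converges_to`. They are not ktstr's API.

```rust
/// True when no sample is smaller than the one before it.
fn nondecreasing(series: &[u64]) -> bool {
    series.windows(2).all(|w| w[0] <= w[1])
}

/// Assumed meaning: the last `tail` samples all lie within `tolerance`
/// of `target`. Fails if the series is shorter than `tail`.
fn converges_to(series: &[f64], target: f64, tolerance: f64, tail: usize) -> bool {
    if tail == 0 || series.len() < tail {
        return false;
    }
    series[series.len() - tail..]
        .iter()
        .all(|v| (v - target).abs() <= tolerance)
}
```

A counter such as `nr_dispatched` sampled at N points would be a natural fit for `nondecreasing`: a single snapshot cannot show that the counter went backwards, but the series can.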

Statistical regression detection

Every run writes machine-readable results; cross-run comparison with dual-gate significance thresholds (absolute and relative) catches regressions single-run assertions miss. Gated metrics include worst_spread, worst_gap_ms, worst_p99_wake_latency_us, and the duration-invariant total_run_delay_ns_per_sched; run cargo ktstr stats list-metrics for the registry. See Runs and Regression Gates and Assertable Metrics.
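A dual gate can be sketched as a single predicate. The version below assumes that both gates must trip for a change to count as a regression and that a higher value is worse, as it would be for a metric like worst_gap_ms. The function name and the threshold values are hypothetical; ktstr's actual comparison may differ.

```rust
/// Flag a regression only when the change is large in absolute terms
/// AND large relative to the baseline. Assumes higher is worse.
fn is_regression(baseline: f64, candidate: f64, abs_gate: f64, rel_gate: f64) -> bool {
    let delta = candidate - baseline;
    let abs_tripped = delta > abs_gate;
    // With a zero baseline any increase counts as relatively large.
    let rel_tripped = if baseline > 0.0 {
        delta / baseline > rel_gate
    } else {
        delta > 0.0
    };
    abs_tripped && rel_tripped
}
```

The two gates cover each other's blind spots: the absolute gate ignores a jump from 1 to 1.5 that is large in percent but tiny in practice, and the relative gate ignores a jump from 1000 to 1006 that clears the absolute bar but sits within noise.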

Debugging

Failure dumps

A failing test’s stderr carries the whole story: the tripped check, per-cgroup stats, a phase timeline, the scheduler log, the monitor’s verdict. On scheduler crash, ktstr also snapshots every BPF map with fields rendered by name through BTF — .bss globals, arena allocator state, typed pointer chases — plus vCPU registers at the instant of death:

cargo ktstr test — failure dump excerpt
--- repro VM failure dump ---
DualFailureDumpReport: early=absent (max_age never crossed threshold (peak=66j, threshold=2500j)), late=(12 maps, 2 vcpu_regs)
...
map bpf_bpf.bss (type=array, value_size=448, max_entries=1)
.bss:
  scx_arena_verify_once=true   ktstr_alloc_count=76   nr_dispatched=907
  nr_enqueued=495              nr_select_cpu=372      stats_magic=6004496034161779060
...
  scx_task_allocator scx_allocator:
...
    root 0x100000006000 → sdt_desc:
      nr_free=512
      chunk 0x100000007000 (sdt_alloc) → ktstr_arena_ctx{}
...
vcpu_regs:
  vcpu 0: ip=0xffffffff96347fbf sp=0xffffffff97203e78 ptroot=0x0000000001e85003
  vcpu 1: ip=0xffffffff9560bdc5 sp=0xff3b18cb8000f778 ptroot=0x0000000001e85003

The same report is written as a JSON artifact next to the run’s stats sidecar. See Reading Failure Output and Snapshots.

Auto-repro

On a scheduler crash, ktstr extracts the crash stack, discovers the struct_ops callbacks, and reruns the scenario in a second VM with BPF probes attached along the crash path. The probe output shows decoded function arguments and struct state at each call site, with arrows marking entry-to-exit changes:

do_enqueue_task                                               kernel/sched/ext.c
  rq *rq
    cpu         1
  task_struct *p
    pid         97
    cpus_ptr    0xf(0-3)
    dsq_id      SCX_DSQ_INVALID          →  SCX_DSQ_LOCAL
...
    scx_flags   QUEUED|DEQD_FOR_SLEEP    →  QUEUED

On by default; requires a kernel with the sched_ext_exit tracepoint. See Auto-Repro.

Interactive shell

ktstr shell boots a busybox VM and drops you into it. --include-files injects host binaries with their shared-library closure resolved automatically (recursive DT_NEEDED discovery); --exec "cmd" runs one command non-interactively. For debugging, not tests. See ktstr (standalone).
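Recursive DT_NEEDED discovery is a transitive closure over a dependency graph. The sketch below shows only that traversal: the `needed` map stands in for entries that a real implementation would parse out of each ELF file's dynamic section, and the function name and the library names used in the example are hypothetical.

```rust
use std::collections::{BTreeMap, BTreeSet, VecDeque};

/// Collect every library reachable from `roots` through DT_NEEDED edges.
/// `needed` maps an object name to the libraries it directly requires.
fn library_closure(needed: &BTreeMap<&str, Vec<&str>>, roots: &[&str]) -> BTreeSet<String> {
    let mut seen: BTreeSet<String> = BTreeSet::new();
    let mut queue: VecDeque<String> = roots.iter().map(|r| r.to_string()).collect();
    while let Some(object) = queue.pop_front() {
        if let Some(deps) = needed.get(object.as_str()) {
            for dep in deps {
                // `insert` returns false for libraries already visited,
                // which also keeps dependency cycles from looping forever.
                if seen.insert(dep.to_string()) {
                    queue.push_back(dep.to_string());
                }
            }
        }
    }
    seen
}
```

The result holds each library once, no matter how many objects require it, so a shared dependency such as libc is copied into the VM a single time.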

ctprof

ktstr ctprof capture snapshots per-task and per-cgroup scheduler telemetry on the host — no VM involved. Capture before and after a change, then compare to see which processes scheduled differently:

## Primary metrics
 comm                              threads  metric             value                delta      %         %uptime 
 kworker/{N}:{N}-mm_percpu_wq                                                                                    
     kworker/{N}:{N}-mm_percpu_wq  11→37    voluntary_csw      8.697K → 101.154K    +92.457K   +1063.1%  93%     
     kworker/{N}:{N}-mm_percpu_wq  11→37    timeslices         8.699K → 101.166K    +92.467K   +1063.0%  93%     
     kworker/{N}:{N}-mm_percpu_wq  11→37    wait_time_ns       2.684s → 27.653s     +24.969s   +930.2%   93%     
...

See ctprof.

Infrastructure

Supported kernels

Capability                 Kernel requirement
CI-tested series           6.14 and 7.1, on x86_64 and aarch64, every push
Watchdog-timeout override  7.1+ via BTF (scx_sched.watchdog_timeout); older kernels via the static scx_watchdog_timeout symbol
sched_ext event counters   6.16+ (two BTF layouts); sampling is disabled when neither is present
Auto-repro probe trigger   kernels with the sched_ext_exit tracepoint

Outside the CI-tested series, the monitor degrades feature by feature rather than failing: tests still run, and unavailable capabilities are reported as absent.

Kernel management

cargo ktstr kernel build builds and caches kernel images from version numbers, local source paths, or git URLs; automatic discovery resolves cached images, host kernels, and CI-provided paths, and an optional GHA cache backend shares built kernels across CI runs. See cargo ktstr and CI.

Performance mode

Performance mode pins vCPUs to reserved host cores, pre-faults 2 MB hugepages, runs vCPU threads under SCHED_FIFO, and suppresses PAUSE/HLT exits — removing host scheduling noise so a latency spike in the guest points at the scheduler under test, not the host. Guest-side jitter from shared LLCs and memory bandwidth remains, so compare performance-mode runs only against the same host. When the host can’t physically satisfy a declared topology, no_perf_mode builds the VM anyway and skips the isolation. See Performance Mode.

Resource-budget coordination

--cpu-cap N confines kernel builds and no-perf-mode VMs to N host CPUs, reserved whole-LLC-at-a-time and coordinated between concurrent ktstr processes through per-LLC locks — so a kernel build and a performance run can share a box without trampling each other. ktstr locks lists every held lock with its holder. See Resource Budget and ktstr (standalone).

Change-scoped selection

cargo ktstr affected attributes a base..HEAD diff to the schedulers it touches and emits a JSON array ready for a GitHub Actions dynamic matrix — one job per affected scheduler instead of the whole fleet. --relevant applies the same attribution to the local working tree for a fast inner loop. Any uncertainty widens to “run all”: silently skipping an affected scheduler is the worst outcome. See CI.
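The "widen on uncertainty" rule can be sketched as follows. This is an illustration under assumptions, not how cargo ktstr affected attributes changes: the path-prefix ownership table, the `Selection` type, and the function name are all hypothetical. The one behavior taken from the text is that any change the sketch cannot attribute forces a full run.

```rust
use std::collections::BTreeSet;

#[derive(Debug, PartialEq)]
enum Selection {
    /// Something could not be attributed: run every scheduler.
    RunAll,
    /// Only these schedulers are touched by the diff.
    Only(Vec<String>),
}

/// `owners` pairs a path prefix with the scheduler it belongs to.
fn affected(changed: &[&str], owners: &[(&str, &str)]) -> Selection {
    let mut hit = BTreeSet::new();
    for path in changed {
        let mut attributed = false;
        for (prefix, scheduler) in owners {
            if path.starts_with(*prefix) {
                hit.insert(scheduler.to_string());
                attributed = true;
            }
        }
        // An unowned path might affect anything, so give up on narrowing.
        if !attributed {
            return Selection::RunAll;
        }
    }
    Selection::Only(hit.into_iter().collect())
}
```

A diff that touches only one scheduler's directory narrows to that scheduler, while adding a change to a shared file such as a lockfile flips the result to `RunAll`.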

Guest coverage

Profraw data is collected inside the VM over shared memory and merged with host coverage, so guest-side code paths count toward the same cargo llvm-cov report as host-side ones.


Ready to try it? Getting Started sets up your first test; the Recipes cover complete workflows.