pub enum OpKind {
AddCgroup,
AddCgroupDef,
RemoveCgroup,
SetCpuset,
ClearCpuset,
SwapCpusets,
Spawn,
StopCgroup,
SetAffinity,
MoveAllTasks,
RunPayload,
WaitPayload,
KillPayload,
FreezeCgroup,
UnfreezeCgroup,
CaptureSnapshot,
WatchSnapshot,
WriteKernelHot,
WriteKernelCold,
ReadKernelHot,
ReadKernelCold,
AttachScheduler,
DetachScheduler,
RestartScheduler,
ReplaceScheduler,
PinBpfMap,
CaptureCgroupProcs,
SteerIrq,
}
Auto-generated discriminant enum variants
Variants§
AddCgroup
Create a new cgroup under the managed cgroup parent, with no
cpuset, no controller knobs, and no workers — the
operator-friendly way to declare an empty move-target cgroup
that later receives tasks via Op::MoveAllTasks or
similar. For mid-step cgroups that need cpuset / cpu /
memory / io / pids / workers, use Op::add_cgroup_def
instead; for setup-time cgroups with the same knobs, declare
via super::super::Step::with_defs.
AddCgroupDef
Create a cgroup mid-step from a full CgroupDef — cpuset,
cpu/memory/io/pids knobs, and worker spawns all apply in one
op, mirroring the way Step::with_defs materializes a
step-local CgroupDef at setup time. Use this when the
add-cgroup-with-cpuset-and-workers sequence needs to happen
after the step’s setup pass (e.g. driven by an earlier op’s
observed state) instead of as part of the step’s setup. The
embedded def is dedup-checked the same way apply_setup
rejects collisions with prior Backdrop or step-local
CgroupDef declarations.
RemoveCgroup
Remove a cgroup (stops its workers first). Permitted against
both step-local and Backdrop-owned cgroups; removing a
Backdrop cgroup mid-scenario drops it from the Backdrop
tracking list so a later Op::AddCgroup with the same name
can re-create the cgroup. A typo’d cgroup name surfaces
later as a kernel-layer “cgroup missing” error on the next
op that references the name, not at the RemoveCgroup site.
SetCpuset
Set a cgroup’s cpuset to the resolved CPU set.
ClearCpuset
Clear a cgroup’s cpuset (allow all CPUs).
SwapCpusets
Read both cgroups’ cpusets and swap them.
Spawn
Spawn workers and place them according to placement.
The work type is used as-is; gauntlet work_type_override does
not apply. Use CgroupDef with swappable(true) when the
work type should be overridable.
Placement contract (bullets follow SpawnPlacement variant
declaration order):
- SpawnPlacement::RunnerCgroup: spawn workers in the spawner’s own cgroup; the handler issues ZERO cgroup ops and the workers inherit whatever cgroup the test runner sits in. WorkSpec::workers_pct is rejected for this placement because there’s no managed cgroup whose cpuset would supply the percentage denominator.
- SpawnPlacement::Cgroup: spawn workers and move them into the named cgroup; the cgroup must already exist (declared via CgroupDef in Step.setup, via Op::AddCgroup / Op::AddCgroupDef earlier in the same step, or on the persistent Backdrop).
StopCgroup
Stop all workers in a cgroup (does not remove the cgroup). Permitted against both step-local and Backdrop-owned cgroups; stopping a Backdrop cgroup’s workers mid-scenario leaves the cgroup hierarchy intact but makes subsequent ops that expect those workers (e.g. wait/kill payload) fail to find them.
SetAffinity
Set worker affinity in a cgroup. Resolved at apply time via
resolve_affinity_for_cgroup().
MoveAllTasks
Move all tasks from one cgroup to another.
Each task is moved via cgroup.procs. If any move fails, the
error propagates and handle name keys are left unchanged (workers
remain addressed under from). On success, handle name keys are
updated to to so subsequent ops address the moved workers.
§Self-move rejection
A self-move (from == to) is rejected at handler entry — the
kernel cgroup.procs write is idempotent on same-cgroup targets
so the op would silently no-op, masking either a stale op the
test author forgot to remove or a typo. The bail names both
sides so the operator can pick the right fix. The check also
catches the symmetric empty-string pair (("", "")), which
would otherwise no-op a RunnerCgroup-to-RunnerCgroup transfer.
§Empty-string source
Passing from = "" matches workers spawned by
Op::Spawn with SpawnPlacement::RunnerCgroup —
RunnerCgroup-placement handles are tracked under the
empty-string key (workers stay in the spawner’s own cgroup,
outside any managed hierarchy). Op::move_all_tasks("", "named") is the canonical way to materialize
RunnerCgroup-placement workers into a managed cgroup
mid-scenario; after the move the captured handles re-key
to "named" and lose their empty-string identity,
behaving like any other managed worker (lifetime tied to
"named"’s ownership slot per the table below).
§Lifetime / ownership-direction asymmetry
MoveAllTasks is asymmetric with respect to cgroup ownership:
the legality of a move depends on the relative lifetimes of
the from and to cgroups, not just on which one is the
source.
| from ownership | to ownership | Outcome |
|---|---|---|
| step-local | step-local | Allowed; both die at step teardown together. |
| step-local | Backdrop (persistent) | Allowed; handle ownership transfers from step-local set to Backdrop set so the worker survives step teardown. |
| Backdrop | Backdrop | Allowed; both persist for the scenario. |
| Backdrop | step-local | Rejected at apply time. A persistent worker would be stranded inside a cgroup that gets rmdir’d at step boundary; the kernel migrates the orphaned task to the cgroup root with a frozen-task warning in dmesg. The bail! diagnostic names the offending pair and tells the operator to either declare the destination in the Backdrop too, or move the worker back into a Backdrop-owned cgroup. |
The Backdrop→Backdrop and step→step cases are unconditionally
allowed because both endpoints share a lifetime; the
step→Backdrop case is allowed because the kernel moves
reference-count once and the framework’s
ScenarioState::rename_handles
transfers the handle into the persistent slot in the same
step. The Backdrop→step case is the only one that produces
a guaranteed orphan, hence the asymmetric reject.
§Backdrop-setup exemption
MoveAllTasks ops running INSIDE a Backdrop’s setup_ops
pass (state.target_backdrop=true) are exempt from the
Backdrop→step-local check: at that point, “step-local”
cgroups don’t exist yet (the Backdrop is the only cgroup
scope), and the rule reduces to a pure source-ownership
check that the apply path handles already.
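The self-move rejection, the ownership table, and the Backdrop-setup exemption can be read together as one apply-time predicate. The following is a minimal sketch under stated assumptions: `Ownership`, `check_move`, and the `in_backdrop_setup` flag (standing in for `state.target_backdrop`) are hypothetical names for illustration, not the framework's real types or signatures.

```rust
/// Hypothetical model of cgroup ownership; not the framework's real type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Ownership {
    StepLocal,
    Backdrop,
}

/// Sketch of the legality rules described above.
/// `in_backdrop_setup` is an assumed stand-in for `state.target_backdrop`.
pub fn check_move(
    from: &str,
    to: &str,
    from_owner: Ownership,
    to_owner: Ownership,
    in_backdrop_setup: bool,
) -> Result<(), String> {
    // A self-move (including the ("", "") pair) would silently no-op,
    // so it is rejected before anything else.
    if from == to {
        return Err(format!("self-move rejected: from={from:?} to={to:?}"));
    }
    // Backdrop -> step-local strands a persistent worker in a cgroup that
    // is removed at the step boundary. The check is skipped while the
    // Backdrop's own setup pass is running.
    if !in_backdrop_setup
        && from_owner == Ownership::Backdrop
        && to_owner == Ownership::StepLocal
    {
        return Err(format!(
            "Backdrop cgroup {from:?} cannot move tasks into step-local cgroup {to:?}"
        ));
    }
    // step->step, step->Backdrop, and Backdrop->Backdrop are all allowed.
    Ok(())
}
```

Only the Backdrop-to-step-local row returns an error; the other three rows of the table fall through to `Ok(())`.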
RunPayload
Spawn a userspace Payload
binary in the background and track its
PayloadHandle
under the step’s payload-handle set.
Subsequent Op::WaitPayload / Op::KillPayload address
the running child by the composite
(Payload::name, cgroup) key — the same payload can run
concurrently in two different cgroups without a dedup
collision, but the lookup from the waiting op must match
the pair the run op recorded. See Op::WaitPayload /
Op::KillPayload for the ambiguity rules when the
waiting op supplies only the name.
Only PayloadKind::Binary
payloads are spawnable; scheduler-kind payloads are rejected at
apply time with an actionable error.
args is appended to payload.default_args. cgroup, when
set, places the child in the named cgroup (resolved relative
to the scenario’s parent cgroup) via
PayloadRun::in_cgroup;
unset inherits the spawning process’s cgroup.
Handles not explicitly consumed by WaitPayload / KillPayload
are drained at step-teardown by collect_step (step-local) or
at scenario end by collect_backdrop (when the handle lives on
the Backdrop), matching the CgroupDef::workload semantics.
§Scheduler-kind rejection across surfaces
Three surfaces accept a &Payload and each rejects a
scheduler-kind Payload differently — deliberately, to match
the lifecycle of the caller:
| Surface | Rejection | When |
|---|---|---|
| PayloadRun::run (ctx.payload(&X)...) | Err(anyhow::Error) | scenario-time |
| CgroupDef::workload | panic! | declaration-time |
| Op::RunPayload (this variant) | Err(anyhow::Error) | apply-ops-time |
Rationale: CgroupDef::workload is a builder invoked during
test construction (nextest --list phase) — a panic there
surfaces the misuse before any VM boot, with a full
backtrace pointing at the offending call. ctx.payload()
and Op::RunPayload both run inside an executing scenario
where one bad misuse should not crash the whole test run;
they bail! with an actionable message and let the
surrounding step-sequence skip to teardown. The three
paths are symmetric in what they reject (scheduler-kind
Payloads in non-scheduler slots); they differ only in
how the misuse is surfaced, matched to caller context.
WaitPayload
Block until the payload named name exits naturally, then
evaluate its checks and record metrics to the per-test sidecar.
The target is looked up by composite key (name, cgroup).
cgroup: None matches the unique live copy (whatever its
placement); if two or more copies of the same payload are
live in different cgroups, the lookup bails with an
“ambiguous — specify cgroup” error so the test doesn’t
silently wait on the wrong one. Use
Op::wait_payload_in_cgroup to disambiguate.
A consumed or unknown (name, cgroup) pair returns Err
with an actionable message — test authors must not silently
wait for payloads that were never started or have already
been consumed by a prior WaitPayload/KillPayload.
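The composite-key lookup and its ambiguity rule can be sketched as a small resolver over the set of live payloads. This is an illustrative model only: `LiveKey` and `resolve_payload` are hypothetical names, and the real handle set is assumed to carry more than a name/cgroup pair.

```rust
/// Hypothetical record of a live payload: (payload name, cgroup it runs in).
pub type LiveKey = (String, Option<String>);

/// Sketch of the (name, cgroup) lookup: `cgroup: None` matches the unique
/// live copy, and bails when more than one copy is live.
pub fn resolve_payload(
    live: &[LiveKey],
    name: &str,
    cgroup: Option<&str>,
) -> Result<usize, String> {
    let matches: Vec<usize> = live
        .iter()
        .enumerate()
        .filter(|(_, key)| {
            // With no cgroup supplied, any placement of `name` is a candidate.
            key.0.as_str() == name && (cgroup.is_none() || key.1.as_deref() == cgroup)
        })
        .map(|(index, _)| index)
        .collect();

    match matches.as_slice() {
        [only] => Ok(*only),
        [] => Err(format!(
            "no live payload {name:?} (cgroup {cgroup:?}): never started or already consumed"
        )),
        _ => Err(format!("payload {name:?} is ambiguous: specify cgroup")),
    }
}
```

With two live copies of the same payload in different cgroups, a name-only lookup returns the "ambiguous" error, while supplying the cgroup selects exactly one entry.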
No timeout. WaitPayload waits indefinitely for the
child to exit. A binary that never terminates (e.g. a
benchmark configured without --runtime=N, or a stress-ng
run without --timeout) will hang the step until the
outer test watchdog fires. For time-boxed long-running
payloads, prefer KillPayload paired
with a super::super::HoldSpec::fixed / super::super::HoldSpec::frac step
boundary that guarantees forward progress; the payload’s
own CLI (--runtime, --timeout) is the reliable way to
cap a single invocation’s runtime.
Check failures from the payload are recorded to the sidecar
for regression analysis but do NOT fail the step or the test
in-process. Use
ctx.payload(&X).run()
directly if the test body needs to gate on check results.
KillPayload
SIGKILL the payload named name, reap the child, evaluate
checks, and record metrics. Mirrors the behavior of
step-teardown drain for an explicitly-targeted payload.
The target is looked up by composite key (name, cgroup)
— see Op::WaitPayload for the ambiguity rules.
A consumed or unknown (name, cgroup) pair returns Err
with an actionable message, identical to Op::WaitPayload’s
lookup semantics.
Check failures from the payload are recorded to the sidecar
for regression analysis but do NOT fail the step or the test
in-process. Use
ctx.payload(&X).run()
directly if the test body needs to gate on check results.
FreezeCgroup
Freeze every task in the named cgroup via cgroup.freeze.
Writes "1" to the cgroup’s cgroup.freeze file. The kernel’s
cgroup_freeze_write dispatches the asynchronous freeze path;
tasks transition to the frozen state without external SIGSTOP,
and cgroup.events reaches frozen 1 once every task has
parked. Idempotent — freezing an already-frozen cgroup is a
no-op.
§Auto-unfreeze at teardown
Op::FreezeCgroup is paired with Op::UnfreezeCgroup to
release. A test that omits the unfreeze still tears down
cleanly: crate::cgroup::CgroupManager::remove_cgroup
auto-unfreezes the cgroup before draining tasks (see the
kernel’s cgroup_freezer_migrate_task, which clears the
task’s freeze state when it migrates to an unfrozen
destination), so step teardown is robust to a stuck-frozen
cgroup. Pair the ops explicitly when the scenario needs
observable unfreeze timing inside the step body.
§Worked example
Three-Step suspend/resume sequence: a Backdrop-resident
long-running workload is paused mid-scenario and resumed
later, exercising how the scheduler responds to a sudden
idle window.
Step 1 (run):     apply cgroup; workload spins for 2s.
Step 2 (suspend): Op::freeze_cgroup("workers"); hold 1s.
                  The cgroup's tasks park via cgroup.freeze,
                  schedstat gauges drop to zero, and the
                  scheduler observes a sudden idle subtree.
Step 3 (resume):  Op::unfreeze_cgroup("workers"); hold 2s.
                  Tasks return to runnable state, the
                  scheduler must re-pick them onto the
                  cgroup's CPUs without spuriously preempting
                  unrelated workloads.
§Observer-cgroup deadlock warning
Do NOT freeze a cgroup that hosts the test’s own observation machinery. The freeze path stops every task in the cgroup — including any thread that:
- opens /proc/<pid>/sched or other procfs entries owned by tasks inside the frozen cgroup, then waits on the read,
- holds a futex shared with frozen tasks (the unfreeze must land before the wait can complete),
- synchronously waits on a stalled-task pipe whose producer is in the frozen cgroup.
The framework’s stimulus-event SHM ring and the BlkWorker
epoll loop both run outside the test cgroup tree, so they
are unaffected — but a test author who explicitly places an
observer thread inside the same cgroup as its observation
targets will deadlock the scenario when the freeze fires.
Place observers in a sibling cgroup (or in the parent) so
cgroup.freeze is scoped to the workload subtree alone.
Pair with Op::UnfreezeCgroup to release. Useful for
scheduler suspend/resume tests where the test body wants to
observe how the scheduler handles a suddenly-frozen workload
and the resumption sequence afterwards.
Treats a missing cgroup as a step failure: the
cgroup.freeze write fails with ENOENT and the error
propagates via the apply_ops with_context chain.
Freezing a non-existent cgroup is NOT a no-op; only
freezing an already-frozen cgroup is.
UnfreezeCgroup
Unfreeze every task in the named cgroup via cgroup.freeze.
Writes "0" to the cgroup’s cgroup.freeze file. Inverse of
Op::FreezeCgroup. Idempotent.
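At the file level, both freeze ops reduce to a one-byte write into the cgroup's cgroup.freeze file. The sketch below uses only the standard library; `set_cgroup_frozen` is a hypothetical helper name, and the framework's own write path (including its with_context error chain) is not reproduced here.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Sketch: write "1" (freeze) or "0" (unfreeze) to `<cgroup_dir>/cgroup.freeze`.
/// The file is opened without `create`, so a missing cgroup directory
/// surfaces as an error (ENOENT / `ErrorKind::NotFound`) instead of a no-op.
pub fn set_cgroup_frozen(cgroup_dir: &Path, frozen: bool) -> io::Result<()> {
    let mut file = OpenOptions::new()
        .write(true)
        .open(cgroup_dir.join("cgroup.freeze"))?;
    file.write_all(if frozen { b"1" } else { b"0" })
}
```

Repeating the same write leaves the file content unchanged, which matches the idempotency described for both ops; the missing-directory case is the one that fails.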
CaptureSnapshot
Capture a host-side diagnostic snapshot under name. The
freeze coordinator pauses every vCPU long enough to read
the BPF map state, vCPU registers, and per-CPU
counters into a
FailureDumpReport,
then resumes the guest. The report is keyed by name on
the active
SnapshotBridge;
downstream test code reads it via
Snapshot.
On-demand snapshots are orthogonal to the error-class
freeze trigger — the request flows through a separate
channel, does not transition the coordinator’s
freeze_state, and is serviced even after Done. The only
scheduling rule: at most one capture in flight at a time
(each request waits for the previous freeze’s vCPUs to
fully resume before issuing).
Guest → host wire. In-guest scenarios submit the request
over the virtio-console port-1 TLV stream: request_snapshot
builds a SnapshotRequestPayload and writes it via
write_msg(MsgType::SnapshotRequest, ...) to /dev/vport0p1
(src/vmm/guest_comms.rs). The host coordinator decodes the
MSG_TYPE_SNAPSHOT_REQUEST frame, runs
freeze_and_dispatch(FreezeMode::Capture { .. }), and the
installed CaptureCallback returns the resulting report
through a paired reply frame. See
CaptureCallback
for the full protocol.
No active bridge ⇒ no-op. When the executor runs in a
context with no installed
SnapshotBridge
(e.g. unit tests that exercise the executor without
spinning up a VM), this op emits a tracing::warn! and
continues. Existing scenarios that never declare snapshot
ops keep their behavior unchanged.
§Example
Declare a snapshot mid-step, fetch the captured report after the scenario completes, and assert against a BTF-rendered field:
use ktstr::scenario::ops::{CgroupDef, HoldSpec, Op, Step, execute_steps};
use ktstr::scenario::snapshot::{Snapshot, SnapshotBridge};
// Wire up the bridge before execute_steps runs (host-side
// VM setup typically performs this step automatically).
let bridge = SnapshotBridge::new(/* capture callback */);
let _guard = bridge.clone().set_thread_local();
let steps = vec![Step {
setup: vec![CgroupDef::named("workers").workers(2)].into(),
ops: vec![Op::capture_snapshot("after_spawn")],
hold: HoldSpec::FULL,
}];
execute_steps(ctx, steps)?;
// Inspection.
let captured = bridge.drain();
let report = captured.get("after_spawn").expect("snapshot recorded");
let snap = Snapshot::new(report);
let nr_cpus = snap.var("nr_cpus_onln").as_u64()?;
assert!(nr_cpus > 0, "snapshot captured live nr_cpus_onln");
WatchSnapshot
Capture a snapshot whenever the guest writes to the named kernel symbol. The snapshot is tagged with the symbol itself; one fire = one capture.
Symbol resolution at op execution time is a verbatim match
against the vmlinux ELF symbol table: the freeze coordinator
walks Elf::syms and accepts the symbol whose strtab entry
equals the requested string byte-for-byte. There is no
prefix stripping, BTF lookup, kallsyms walk, or per-CPU
offset arithmetic — the string must match an entry that
nm vmlinux would print (e.g. "jiffies_64",
"scx_watchdog_timestamp").
The register_watch callback on a host-side
SnapshotBridge
is for host-side unit testing only — it lets in-process
executor tests record the symbol and return without arming
any hardware. Production in-VM scenarios run via the
virtio-console port 1 MSG_TYPE_SNAPSHOT_REQUEST TLV frame
and the host coordinator’s arm_user_watchpoint path
(src/vmm/freeze_coord/mod.rs); the thread-local bridge is
never installed inside the guest.
§Guard rails
- Maximum of 3 watch ops per scenario. The KVM
  hardware-watchpoint plumbing reserves slot 0 for the
  existing *scx_root->exit_kind trigger (used by the
  error-trigger path); only the remaining three user
  watchpoint slots are available for on-demand watches. The
  bridge’s register_watch rejects a 4th Op::WatchSnapshot
  and fails the step when the cap is exceeded.
- Symbol resolution failures bail immediately. A
  missing symbol or unaligned address surfaces as an
  Err from execute_steps so the test author notices the
  watch did not attach. Silent degradation would leave the
  scenario running with no captures and look identical to a
  healthy passing run.
- 4-byte alignment. The resolved KVA must be 4-byte
  aligned: the framework arms 4-byte data-write watches,
  which require addr & 0x3 == 0 on every supported
  architecture. Mis-aligned addresses bail at setup with the
  resolved KVA in the error.
- Silent-misfire detection (KASLR-on guests). When the
  host coordinator’s kaslr_offset is zero AND the resolved
  kernel symbol lives in the x86_64 high-half address range,
  arm_user_watchpoint emits a tracing::warn! (once per
  unique (symbol, link_kva) per process) noting the arm
  targets the link-time KVA while the runtime symbol lives at
  link_kva + runtime_kaslr_slide. The arm STILL completes
  (rejecting it would regress every caller running before the
  host coordinator’s runtime-KASLR-slide derivation lands);
  operators who hit the warn can boot the guest with the
  nokaslr cmdline to use Op::WatchSnapshot, or omit the op
  from KASLR-on test runs entirely.
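The first three guard rails (slot cap, verbatim symbol match, 4-byte alignment) can be sketched as a single resolution step. This is a simplified model: `resolve_watch`, `MAX_USER_WATCHES`, and the `(name, address)` slice standing in for the vmlinux symbol table are assumptions for illustration, not the coordinator's actual code.

```rust
/// Three user watchpoint slots remain after slot 0 is reserved.
pub const MAX_USER_WATCHES: usize = 3;

/// Sketch of the guard rails: cap check, byte-for-byte symbol match,
/// then the 4-byte alignment requirement. `symtab` is a hypothetical
/// stand-in for the ELF symbol table as (name, link-time address) pairs.
pub fn resolve_watch(
    symtab: &[(&str, u64)],
    symbol: &str,
    already_armed: usize,
) -> Result<u64, String> {
    if already_armed >= MAX_USER_WATCHES {
        return Err(format!("at most {MAX_USER_WATCHES} watch ops per scenario"));
    }
    // Verbatim match only: no prefix stripping or fuzzy lookup.
    let kva = symtab
        .iter()
        .find(|(name, _)| *name == symbol)
        .map(|(_, addr)| *addr)
        .ok_or_else(|| format!("symbol {symbol:?} not found in vmlinux symbol table"))?;
    // 4-byte data-write watches need addr & 0x3 == 0.
    if kva & 0x3 != 0 {
        return Err(format!("symbol {symbol:?} at {kva:#x} is not 4-byte aligned"));
    }
    Ok(kva)
}
```

Note that a lookup for "jiffies" fails even when "jiffies_64" is present, which is the practical consequence of the byte-for-byte rule.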
Guest → host wire. The registration request rides the
same ioeventfd doorbell as Op::CaptureSnapshot (separate tag
namespace), so symbol resolution + user watchpoint slot
allocation + KVM_SET_GUEST_DEBUG arming happen on the host
without a vCPU userspace exit. Once armed, the
KVM_EXIT_DEBUG dispatch path drives the resulting
captures directly into the freeze coordinator (no
per-fire doorbell write needed). See
WatchRegisterCallback
for the full protocol.
Note: high-frequency variables (rq counters, jiffies) can
trigger the watch every few microseconds, producing
thousands of captures (each overwriting the prior capture
under the same tag); the framework does not rate-limit
captures, so the test author owns the frequency choice.
Use Op::CaptureSnapshot for time-driven captures when
frequency is the concern.
WriteKernelHot
Live-vCPU write of one or more KernelTarget / KernelValue
pairs into running guest memory. The host coordinator routes
each pair to the appropriate GuestKernel::write_* helper
(no freeze rendezvous, vCPUs keep executing). A Release fence
is issued after the last write so a weakly-ordered guest’s
smp_load_acquire observes the bytes in write order — but
concurrent guest readers can still race against in-flight
stores, and the caller owns any guest-side synchronisation
the test requires (READ_ONCE / smp_load_acquire on the
target field).
Same orchestration pattern as the existing
BpfMapAccessor::write_value path: synchronous host-side
memory mutation on a worker thread, no vCPU pause. Use this
for scratch fields, debug flags, scx-ktstr-private state,
and anything the guest reads with proper barriers.
Batch shape. writes carries 1+ pairs; the executor
issues them in order. For a single write the
Op::write_kernel_hot singleton
constructor wraps a 1-element vec.
Dispatch. The executor’s arm calls
dispatch_kernel_op_request (src/scenario/ops/dispatch.rs:2386), which
uses the in-process SnapshotBridge callback when one is
installed (the test-fixture seam) and falls back to the
virtio-console port-1 wire path (MsgType::KernelOpRequest)
in-guest. The wire request is consumed by
dispatch_kernel_op_batch (src/vmm/freeze_coord/kernel_op_dispatch.rs),
invoked from the freeze coordinator’s apply path.
See also. KernelTarget — scroll to the
“Semantic risk” section for the single source of truth
on which scheduler-bookkeeping targets are safe vs
silently load-bearing.
WriteKernelCold
Auto-freezing batched write of one or more
KernelTarget / KernelValue pairs while every vCPU is
parked at the freeze rendezvous. Reuses the same coordinator
path that Op::CaptureSnapshot triggers: one rendezvous,
every write in the batch lands while paused, then resume.
Batching is a hard correctness requirement. Multi-CPU
seeds (e.g. a planned with_uptime helper writing per-CPU
rq.clock on every CPU at the same instant) must land in
ONE freeze window —
N separate cold-write ops would mean N rendezvous cycles
and observable inter-CPU skew. The variant payload is a
Vec precisely to make batched writes the natural shape.
The executor’s apply_ops pre-pass auto-merges adjacent
singleton Op::WriteKernelCold ops into one merged op as
a safety net — N adjacent write_kernel_cold(...) calls
collapse into one rendezvous regardless of whether the
caller used crate::scenario::ops::Op::write_kernel_cold_batch
or chained singletons.
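The pre-pass merge can be pictured as a single fold over the op list. The sketch below is a simplified model, not the executor's code: `SketchOp` and `merge_adjacent_cold_writes` are hypothetical, targets are reduced to `(String, u64)` pairs, and it merges any adjacent cold writes rather than distinguishing singletons from batches.

```rust
/// Hypothetical, reduced op type for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub enum SketchOp {
    WriteKernelCold(Vec<(String, u64)>),
    Other(&'static str),
}

/// Sketch: collapse runs of adjacent cold writes into one op so the
/// whole run shares a single freeze rendezvous.
pub fn merge_adjacent_cold_writes(ops: Vec<SketchOp>) -> Vec<SketchOp> {
    let mut merged: Vec<SketchOp> = Vec::with_capacity(ops.len());
    for op in ops {
        if let SketchOp::WriteKernelCold(next) = op {
            if let Some(SketchOp::WriteKernelCold(prev)) = merged.last_mut() {
                // Previous op is also a cold write: append, preserving order.
                prev.extend(next);
            } else {
                merged.push(SketchOp::WriteKernelCold(next));
            }
        } else {
            // Any other op breaks the run.
            merged.push(op);
        }
    }
    merged
}
```

A non-write op between two cold writes keeps them in separate rendezvous windows, so callers who need one window must keep the writes adjacent or batch them explicitly.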
Dispatch. The executor’s arm calls
dispatch_kernel_op_request (src/scenario/ops/dispatch.rs:2386), which
uses the in-process SnapshotBridge callback when one is
installed (the test-fixture seam) and falls back to the
virtio-console port-1 wire path (MsgType::KernelOpRequest)
in-guest. The wire request lands at the freeze coordinator’s
rendezvous boundary via
dispatch_kernel_op_batch (src/vmm/freeze_coord/kernel_op_dispatch.rs).
Use this for: multi-field atomic writes, all-CPUs-at-once
seeding, one-shot setup that must complete before the guest
observes any partial state. Use Op::WriteKernelHot when
the guest is OK with live-write semantics + caller-side
synchronisation.
See also. KernelTarget — scroll to the
“Semantic risk” section for the single source of truth
on which scheduler-bookkeeping targets are safe vs
silently load-bearing.
ReadKernelHot
Live-vCPU read of a KernelTarget into the
SnapshotBridge
drain log keyed by tag. Mirrors Op::WriteKernelHot:
no freeze rendezvous, host-side worker thread issues the
read while the guest keeps executing. The caller assumes
the read may race against guest writes; for read-write
coherency pair the op with a guest-side smp_store_release
on the target.
Use this for: read-back of values previously written via
Op::WriteKernelHot, lightweight polling of single fields
the test wants to observe without pausing the guest.
Width. The width field picks which
crate::monitor::guest::GuestKernel read_* family the
host dispatcher invokes — u32 / u64 / Bytes(len).
The reply lands as a crate::vmm::wire::KernelOpValue of
the matching shape in the bridge’s drain log; a u32 field
must be read with KernelValueWidth::u32() (a u64 read of
a u32 field returns the field’s bytes plus 4 adjacent
bytes).
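The width hazard is easy to see on a plain byte buffer. This sketch assumes a little-endian guest and uses hypothetical helpers (`read_u32_le`, `read_u64_le`) in place of the real GuestKernel read_* family.

```rust
/// Sketch: read a little-endian u32 at `offset` from a guest-memory slice.
pub fn read_u32_le(mem: &[u8], offset: usize) -> u32 {
    let mut buf = [0u8; 4];
    buf.copy_from_slice(&mem[offset..offset + 4]);
    u32::from_le_bytes(buf)
}

/// Sketch: read a little-endian u64 at `offset`. On a u32 field this
/// also pulls in the 4 bytes that follow the field.
pub fn read_u64_le(mem: &[u8], offset: usize) -> u64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(&mem[offset..offset + 8]);
    u64::from_le_bytes(buf)
}
```

For a buffer holding a u32 value of 42 followed by four 0xff bytes, `read_u32_le` yields 42 while `read_u64_le` yields 0xffff_ffff_0000_002a, with the neighbouring field's bytes in the upper half.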
Dispatch. Same bridge-first / wire-fallback model as
Op::WriteKernelHot; the wire request is consumed by
dispatch_kernel_op_batch (src/vmm/freeze_coord/kernel_op_dispatch.rs).
ReadKernelCold
Auto-freezing read of a KernelTarget into the
SnapshotBridge
drain log keyed by tag, taken while every vCPU is parked
at the freeze rendezvous. Reuses the same coordinator path
that Op::CaptureSnapshot triggers. Coherent with
respect to guest state — no concurrent guest write can race
against the read.
Use this for: ground-truth reads that must reflect a stable
guest state, snapshot-style point-in-time reads. Note: each
Op::ReadKernelCold triggers its OWN freeze rendezvous —
apply_ops’s pre-pass folds adjacent
Op::WriteKernelCold ops into one rendezvous but does NOT
fold reads (per-entry wire tags are needed for the
multi-read reply-routing contract; queued as a wire-format
follow-up). For multi-read coherent snapshots, prefer
Op::CaptureSnapshot (which already orchestrates a single
rendezvous for all snapshot reads).
Width. Same width semantics as Op::ReadKernelHot:
pick the read family explicitly so the dispatcher invokes
the matching GuestKernel::read_* helper.
Dispatch. Bridge-first / wire-fallback like the other
*Kernel* variants; the wire request lands at the freeze
coordinator’s rendezvous boundary via
dispatch_kernel_op_batch (src/vmm/freeze_coord/kernel_op_dispatch.rs).
AttachScheduler
Attach a scheduler mid-scenario: spawn the named staged
scheduler from /staging/schedulers/<name>/ inside the guest
and wait for it to publish its first BPF object accessors.
Dispatch (dispatch_attach_scheduler at
src/scenario/ops/dispatch.rs:2032): waits up to 60s for the
accessor-init worker to quiesce (handles the case where the
boot scheduler’s first publish is still in flight), captures
the pre-spawn publish seqno, spawns the staged scheduler
binary, re-installs the sched_exit_monitor against the new
SCHED_PID, then waits up to 30s for a fresh accessor publish.
Already-attached behavior. No framework-level idempotency
guard: if a scheduler is already running, the kernel rejects
the new attach at the scx_enable_state() != SCX_DISABLED
gate (kernel/sched/ext.c:6837, returns -EBUSY); the
spawned binary exits, no fresh publish lands, and the dispatch
bails on the 30s publish-wait timeout. Use
Op::DetachScheduler (then AttachScheduler) or
Op::ReplaceScheduler to swap schedulers.
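The "capture the seqno, spawn, wait for a fresh publish" pattern, including the timeout that the already-attached case runs into, can be sketched with a shared counter. Everything here is an assumption for illustration: `wait_for_fresh_publish`, the `AtomicU64` sequence counter, and the 1 ms polling interval are not taken from the framework.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;
use std::time::{Duration, Instant};

/// Sketch: poll a publish sequence number until it advances past the
/// value captured before the spawn, or give up after `timeout`.
pub fn wait_for_fresh_publish(
    seqno: &AtomicU64,
    baseline: u64,
    timeout: Duration,
) -> Result<u64, String> {
    let deadline = Instant::now() + timeout;
    loop {
        let current = seqno.load(Ordering::Acquire);
        if current > baseline {
            return Ok(current);
        }
        if Instant::now() >= deadline {
            // Mirrors the already-attached case: the spawned binary exits,
            // nothing publishes, and the wait runs out.
            return Err(format!("no fresh accessor publish within {timeout:?}"));
        }
        thread::sleep(Duration::from_millis(1));
    }
}
```

Capturing the baseline before the spawn is what distinguishes a fresh publish from one left over by the previous scheduler.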
The scheduler reference holds a 'static lifetime: the
test author declares each crate::test_support::Scheduler
at static scope (via declare_scheduler! or a
static MY_SCHED: Scheduler = ... item) and passes the
borrow into the constructor. The staging slot that ships the
binary into the initramfs is KtstrTestEntry::staged_schedulers;
the dispatch arm reads its path via
test_support::staged::staged_scheduler_binary_path.
DetachScheduler
Detach the currently-running scheduler.
Dispatch (dispatch_detach_scheduler →
kill_current_scheduler at src/scenario/ops/dispatch.rs:1896):
stops the host’s sched_exit_monitor so the intentional kill
isn’t promoted into a test-fatal scheduler-died signal,
writes 'S' to /proc/sysrq-trigger to start the kernel-
side scx_disable cascade asynchronously (avoiding the
D-state stall inside scx_flush_disable_work’s
kthread_flush_work(&sch->disable_work) at
kernel/sched/ext.c:6145, reached on the struct_ops detach
path via bpf_scx_unreg at kernel/sched/ext.c:7666), sends
SIGTERM to the
scheduler pid, waits up to SCHED_LIFECYCLE_KILL_GRACE (10s)
for the kernel BPF state to reach SCX_DISABLED, then
clears the SCHED_PID atomic (defined in
src/vmm/rust_init/mod.rs) so subsequent
crate::vmm::rust_init::sched_pid() reads return None.
Bails when no scheduler is currently attached (SCHED_PID is
0), when the SIGTERM syscall fails, or when the
SCX_DISABLED wait times out. NOT idempotent: a second
detach with no scheduler attached bails rather than no-oping.
For defensive “ensure clean slate” scaffolds, gate on
crate::vmm::rust_init::sched_pid() returning Some before
emitting the Detach step rather than relying on no-op
tolerance.
RestartScheduler
Kill the currently-running scheduler and respawn the BOOT scheduler. Useful for hot-restart validation of the boot scheduler. Bails if no scheduler is currently attached.
v0 limitation. Always respawns the boot scheduler at
/scheduler + /sched_args regardless of which scheduler
was most-recently attached — after an Op::AttachScheduler
or Op::ReplaceScheduler to a staged scheduler, this op
restarts the BOOT scheduler, not the most-recently-attached
one. For restarting a staged scheduler, use
Op::ReplaceScheduler with the same staged spec.
Dispatch (dispatch_restart_scheduler at
src/scenario/ops/dispatch.rs:2129): kills the current scheduler
via the shared kill_current_scheduler helper, spawns the
boot scheduler from the hardcoded /scheduler + /sched_args
paths with log at /tmp/sched.log, then re-installs the
sched_exit_monitor against the re-spawned boot pid.
ReplaceScheduler
Detach the currently-running scheduler and attach a different
one. Equivalent to [DetachScheduler, AttachScheduler { scheduler: new }] but expressed as a single op so the
no-scheduler window is bounded and the per-phase scheduler
tagging on the sidecar can record the transition atomically.
The mid-experiment swap case the operator typically wants:
run scheduler A for the first phase of a multi-step test, swap
to scheduler B (or A-with-different-CLI-args, modeled as a
distinct Scheduler declaration) for the second phase, and
assert a per-phase metric delta across the boundary.
Bails if no scheduler is currently attached — there is no
scheduler to detach from, so the “replace” semantic has no
meaning. Use Op::AttachScheduler for the first attach.
Dispatch (dispatch_replace_scheduler at
src/scenario/ops/dispatch.rs:2153): kills the current scheduler
via the shared kill_current_scheduler helper, spawns the
named staged scheduler binary from
/staging/schedulers/<name>/, re-installs the
sched_exit_monitor against the new SCHED_PID, waits up to
REPLACE_NOT_TRYING_DEADLINE_S (5s) for the accessor-init
worker to quiesce, captures the pre-publish seqno, then
waits up to 10s for fresh accessors to publish against the
new BPF object. The 10s budget aligns with
SCHED_LIFECYCLE_KILL_GRACE and covers a cold-cache vmlinux
re-parse during the worker reinit.
PinBpfMap
Open a BPF map fd by name and hold it for the scenario lifetime.
Why this exists. Op::ReplaceScheduler kills the outgoing
scheduler process; libbpf’s drop path then releases the map
fds the loader was holding. Once the last refcount on a map
drops, the kernel frees it — typically before any post-swap
freeze captures, so the multi-bss “same-binary swap window”
case (two <obj>.bss copies coexisting briefly) closes too
fast to be reliably observed in a test. PinBpfMap holds an
extra refcount on the named map so the kernel keeps it alive
until the scenario ends.
Semantics. Walks the kernel’s map ID space (via
libbpf_rs::query::MapInfoIter, which wraps
BPF_MAP_GET_NEXT_ID + BPF_MAP_GET_FD_BY_ID +
BPF_OBJ_GET_INFO_BY_FD) and keeps the fd whose name matches.
The held fd lives in the scenario’s Backdrop state and drops
(via std OwnedFd Drop) at scenario teardown. Multiple
PinBpfMap ops with distinct names accumulate; pinning the
same name twice is a no-op (the second call returns without
re-opening the fd, so the originally-pinned map instance is the
one held — not the second-call-time instance).
Name truncation. BPF map names are capped at
BPF_OBJ_NAME_LEN = 16 bytes including the trailing NUL, so
15 usable chars max per kernel/bpf/syscall.c’s
bpf_obj_name_cpy. Pass the kernel-visible name (typically
<obj>.bss / <obj>.data / <obj>.rodata). When a libbpf
object name + section suffix exceeds the 15-char cap, libbpf
truncates the object prefix at load time and the kernel-side
name is the truncated form; the framework does not auto-
truncate the user-supplied string, so pass the post-truncation
form. Reading the map names from a prior
crate::monitor::dump::FailureDumpReport’s maps[].name
or via bpftool map list is the safe way to discover the
exact string the kernel sees.
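To estimate the post-truncation name before a run, the rule above can be approximated as "keep the section suffix, trim the object prefix to fit 15 characters". This is a rough sketch under that assumption: `kernel_visible_map_name` is a hypothetical helper, it assumes ASCII names, and it does not reproduce any other normalisation libbpf may apply, so the dump report or bpftool output remains the authoritative source.

```rust
/// Kernel-side cap on BPF object names, including the trailing NUL.
pub const BPF_OBJ_NAME_LEN: usize = 16;

/// Sketch: approximate the kernel-visible name for `<obj><suffix>` by
/// truncating the object prefix so the result fits in 15 characters.
/// Assumes ASCII input; this is not libbpf's actual implementation.
pub fn kernel_visible_map_name(obj_name: &str, section_suffix: &str) -> String {
    let usable = BPF_OBJ_NAME_LEN - 1;
    // The suffix is kept (clamped to the cap), and the prefix gets the rest.
    let suffix: String = section_suffix.chars().take(usable).collect();
    let prefix_len = usable - suffix.chars().count();
    let prefix: String = obj_name.chars().take(prefix_len).collect();
    format!("{prefix}{suffix}")
}
```

Under this model a 15-character object name such as "scx_ktstr_sched" with the ".bss" suffix becomes "scx_ktstr_s.bss", which is the string to pass to the pin op.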
Order. Place this op AFTER the scheduler that owns the
target map has attached (typically a small fixed hold suffices
— ~100ms for the small scx-ktstr fixture, longer for
heavyweight schedulers). For the same-binary swap-window
scenario specifically: pin the outgoing scheduler’s bss
before Op::ReplaceScheduler runs — pinning after the
swap is too late because the outgoing scheduler’s bss has
already been freed by libbpf’s drop path. The pin walker
picks the lowest-id matching map, so the outgoing copy (the
older id) is the one held; the incoming scheduler’s load
then creates a second copy that’s also kept alive because
the outgoing refcount blocks the kernel from freeing the id.
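The lowest-id selection described above can be illustrated on plain data. This is a sketch under the assumption that the walk yields (id, name) pairs; `MapEntry` and `pick_lowest_id` are hypothetical names, not the walker's real types.

```rust
/// Hypothetical (id, name) record for one map seen during the ID walk.
#[derive(Debug, Clone, PartialEq, Eq)]
struct MapEntry {
    id: u32,
    name: String,
}

/// Picks the lowest-id entry whose name matches, mirroring the walker rule.
fn pick_lowest_id<'a>(maps: &'a [MapEntry], name: &str) -> Option<&'a MapEntry> {
    maps.iter().filter(|m| m.name == name).min_by_key(|m| m.id)
}
```

With two same-named copies present, the older (lower-id) one is selected regardless of the order in which the entries are listed.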
Failure surface. The pin runs at Step apply time inside
execute_steps / execute_scenario. A failure (no matching
map found in the walk) bails out of the apply path as an
Err from execute_steps; the scenario stops before the
next Step runs and the post_vm callback is not invoked.
The underlying [libbpf_rs::query::MapInfoIter] silently
terminates iteration on any non-ENOENT errno from the BPF
ID walk (including EPERM from missing CAP_SYS_ADMIN), so
such errors surface as the no-matching-map case rather than
a distinct EPERM error — acceptable because ktstr always runs
as root inside the guest, so the CAP_SYS_ADMIN gates at
kernel/bpf/syscall.c:4761 (BPF_MAP_GET_NEXT_ID walk) and
:4869 (BPF_MAP_GET_FD_BY_ID) are always satisfied and the
EPERM path is unreachable in practice.
Example.
let steps = vec![
    // Phase 0: primary scheduler runs alone; pin BEFORE the swap.
    Step::with_op(
        Op::pin_bpf_map("<obj>.bss"),
        HoldSpec::frac(0.3),
    ),
    // Phase 1: swap to a same-binary alt; the pinned map
    // keeps the OUTGOING bss alive across the teardown.
    Step::with_op(
        Op::replace_scheduler(&STAGED_ALT_SCHED),
        HoldSpec::frac(0.7),
    ),
];
See also. crate::scenario::bpf_pin::open_bpf_map_fd_by_name
for the underlying helper and tests/live_var_disambiguation_e2e.rs
for the swap-window conditional walker-fired gate this pin is
designed to make deterministic.
CaptureCgroupProcs
Capture the current cgroup.procs of cgroup and store the
PID list on the active SnapshotBridge
under tag.
Synchronous read of the cgroup-v2 cgroup.procs pseudofile in
the dispatching thread (in-scenario — runs wherever
execute_scenario runs; inside the guest VM for #[ktstr_test]
e2e tests, on the host for host-only scenarios). Returns the
thread-group leaders (PIDs / TGIDs) the kernel reports at apply
time. The snapshot is appended to the bridge’s per-tag drain
log; test bodies drain via
SnapshotBridge::drain_cgroup_procs
(or the by-tag lookup
SnapshotBridge::cgroup_procs_by_tag)
after the scenario completes to read the captured pids back.
Distinct from Op::CaptureSnapshot: that op routes through
the host-side freeze coordinator (TLV transport in production,
thread-local bridge in test fixtures); this op runs entirely
in-process against the local cgroupfs.
§Use cases
Pin “did my workers land in cgroup X” assertions without the shell-probe + tmpfs-roundtrip pattern. Typical shape:
use ktstr::prelude::SnapshotBridge;
use std::sync::Arc;
// Install a bridge (dummy capture cb; only the cgroup-procs drain
// is used). MUST clone before set_thread_local, which consumes
// self. The clone shares the Arc-internal state and is what
// we drain on after the scenario completes.
let bridge = SnapshotBridge::new(Arc::new(|_| None));
let bridge_for_drain = bridge.clone();
let _guard = bridge.set_thread_local();
let backdrop = Backdrop::new().push_op(Op::add_cgroup("workers"));
let steps = vec![
    Step::new(
        vec![
            Op::spawn(
                SpawnPlacement::cgroup("workers"),
                WorkSpec::default().workers(4),
            ),
            Op::capture_cgroup_procs("after_spawn", "workers"),
        ],
        HoldSpec::fixed(Duration::ZERO),
    ),
];
let _ = execute_scenario(&ctx, backdrop, steps)?;
// Either drain the whole log or look up by tag.
let after = bridge_for_drain
    .cgroup_procs_by_tag("after_spawn")
    .expect("Op::CaptureCgroupProcs(\"after_spawn\", ...) snapshot");
assert_eq!(after.pids.len(), 4);
§Within-Step ordering
Ops in a single Step apply sequentially in vec order, so a
Op::CaptureCgroupProcs placed AFTER Op::Spawn /
Op::MoveAllTasks observes the post-spawn / post-migrate
kernel state. The producing ops complete synchronously (their
cgroup.procs writes block on kernel commit), so the capture
sees every PID those ops placed.
§PID vs TID grain
Reads cgroup.procs (thread-group leaders), NOT cgroup.threads
(per-thread TIDs). Grain implications by spawn op:
- Op::Spawn → ktstr workers are 1-thread-per-worker, so workers(N) produces N pids in cgroup.procs.
- Op::RunPayload → an execve’d binary is ONE process; even if the binary spawns 100 threads, cgroup.procs reports the single thread-group leader. Tests asserting per-thread placement would need a sibling cgroup.threads accessor (future Op variant if a use case arises).
§Tag uniqueness
tag is the snapshot key the test body uses to find the
capture in the drain log. The apply-ops dispatch rejects an
empty tag with an actionable bail. Multiple captures of
the same cgroup under DIFFERENT tags surface as separate
entries (lets a scenario capture pre/post snapshots of the
same cgroup); multiple captures with the same (tag, cgroup)
also append rather than overwrite — tag uniqueness is a caller
convention, not a framework-enforced contract. The by-tag
lookup SnapshotBridge::cgroup_procs_by_tag
returns the FIRST match; callers who care about multiplicity
must use SnapshotBridge::drain_cgroup_procs
and filter the Vec manually.
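The difference between the first-match lookup and a manual filter over the drained log can be sketched on plain data. `ProcsCapture`, `first_by_tag`, and `all_by_tag` are hypothetical stand-ins for illustration; the real snapshot type and its fields may differ.

```rust
/// Hypothetical stand-in for one captured cgroup.procs snapshot.
#[derive(Debug, Clone, PartialEq, Eq)]
struct ProcsCapture {
    tag: String,
    cgroup: String,
    pids: Vec<i32>,
}

/// First-match lookup, as the by-tag accessor is described to behave.
fn first_by_tag<'a>(log: &'a [ProcsCapture], tag: &str) -> Option<&'a ProcsCapture> {
    log.iter().find(|c| c.tag == tag)
}

/// Every capture under `tag`, in append order, for callers that reuse tags.
fn all_by_tag<'a>(log: &'a [ProcsCapture], tag: &str) -> Vec<&'a ProcsCapture> {
    log.iter().filter(|c| c.tag == tag).collect()
}
```

A scenario that reuses one tag for pre and post captures would see only the pre capture through the first-match path, which is why distinct tags are the simpler convention.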
§Empty / unknown cgroup
- Empty cgroup (exists but holds no tasks): captured snapshot
has pids = vec![]. Lets callers assert “no tasks landed here” without conflating with “no such cgroup.”
- Unknown cgroup (directory missing): apply bails with a
layered anyhow chain: the outer wrap names the op + tag +
cgroup; the inner crate::cgroup::CgroupOps::read_procs context surfaces the resolved path + the actionable hint about Op::AddCgroup / workload_root_cgroup. Use format!("{err:#}") (alternate display) to flatten both layers in test assertions.
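The empty versus unknown distinction falls out of how the pseudofile is read: an existing cgroup with no tasks yields an empty file, while a missing directory fails at open. A minimal sketch using only std follows; `parse_cgroup_procs` and `read_procs_file` are hypothetical helpers, not the CgroupOps::read_procs implementation, and the PID type is an assumption.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Parses cgroup.procs content: one decimal PID per line.
fn parse_cgroup_procs(content: &str) -> Result<Vec<i32>, std::num::ParseIntError> {
    content
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(str::parse::<i32>)
        .collect()
}

/// Reads <cgroup_dir>/cgroup.procs. A missing directory surfaces as an
/// io::Error (NotFound); an existing but empty cgroup yields Ok(vec![]).
fn read_procs_file(cgroup_dir: &Path) -> io::Result<Vec<i32>> {
    let text = fs::read_to_string(cgroup_dir.join("cgroup.procs"))?;
    parse_cgroup_procs(&text).map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}
```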
§See also
- Op::CaptureSnapshot: diagnostic-snapshot capture (full scheduler state dump via FailureDumpReport). Distinct from this op’s cgroup-procs read, and drains via a separate SnapshotBridge::drain / drain_ordered channel, not drain_cgroup_procs.
- crate::cgroup::CgroupOps::read_procs: the underlying trait method this op dispatches through.
SteerIrq
Re-steer a hardware IRQ to a single CPU by writing
/proc/irq/<N>/smp_affinity_list in the guest — the knob
that drives the kernel’s write_irq_affinity →
irq_set_affinity → irq_do_set_affinity → irqchip
set_affinity path (kernel/irq/proc.c,
kernel/irq/manage.c). Use it to place a NIC’s
RX-completion interrupt on a chosen CPU so the hardirq, the
NET_RX softirq it raises, and any task that path wakes all
land where the scenario wants them: the steering half of an
IRQ-locality test whose generating half is
crate::workload::WorkType::NetTraffic and whose observing
half is the per-CPU IRQ metric axis (max_cpu_hardirqs,
max_cpu_softirq_net_rx, and their *_concentration ratios).
§In-guest file write, NOT a kernel-memory poke
A write to the irq_desc affinity mask in kernel memory would
NOT re-route delivery — only the smp_affinity_list write
runs the full set-affinity path that reprograms the interrupt
controller (MSI-X message / IOAPIC RTE). So this Op is
dispatched as a plain std::fs::write from the executor
in-guest (mirroring the /proc/sysrq-trigger write
Op::DetachScheduler performs), NOT through the
kernel-memory rendezvous path of Op::WriteKernelHot /
Op::WriteKernelCold.
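The file write itself is small. The sketch below shows the shape of such a write with std only; `smp_affinity_list_path` and `steer_irq_to_cpu` are hypothetical names, not the executor's code, and the proc root is a parameter purely so the example can be exercised against a scratch directory.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Builds <proc_root>/irq/<irq>/smp_affinity_list.
fn smp_affinity_list_path(proc_root: &Path, irq: u32) -> PathBuf {
    proc_root
        .join("irq")
        .join(irq.to_string())
        .join("smp_affinity_list")
}

/// Writes a single-CPU list (e.g. "3") to the IRQ's affinity file.
/// Against the real /proc this needs root and an existing IRQ number.
fn steer_irq_to_cpu(proc_root: &Path, irq: u32, cpu: u32) -> io::Result<()> {
    fs::write(smp_affinity_list_path(proc_root, irq), cpu.to_string())
}
```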
§Online-CPU requirement
The kernel intersects the requested mask with
cpu_online_mask before programming the irqchip
(irq_do_set_affinity); a single-CPU target that is offline
leaves no online CPU in the mask and the write returns
-EINVAL (the !cpumask_intersects(new_value, cpu_online_mask) arm of write_irq_affinity). The
dispatcher pre-checks cpu against
/sys/devices/system/cpu/online and bails with an actionable
message before the write, so an out-of-range / offline target
produces an error that names the CPU instead of a bare EINVAL. IRQ
affinity is a system-wide property, NOT scoped to the writing
task’s cpuset — the target need not be in the runner’s
allowed set.
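A pre-check of this kind amounts to parsing the kernel's cpulist format (for example "0-3,5") and testing membership. The following is a sketch, not the dispatcher's code: `parse_cpu_list` and `ensure_online` are hypothetical names, the error text is illustrative, and only plain ids and ranges are handled.

```rust
/// Parses a kernel cpulist string such as "0-3,5" into CPU ids.
fn parse_cpu_list(list: &str) -> Result<Vec<u32>, String> {
    let mut cpus = Vec::new();
    for part in list.trim().split(',').filter(|p| !p.is_empty()) {
        // A part is either a single id ("5") or an inclusive range ("0-3").
        let (lo, hi) = match part.split_once('-') {
            Some((lo, hi)) => (lo, hi),
            None => (part, part),
        };
        let lo: u32 = lo.trim().parse().map_err(|_| format!("bad cpulist entry {part:?}"))?;
        let hi: u32 = hi.trim().parse().map_err(|_| format!("bad cpulist entry {part:?}"))?;
        if lo > hi {
            return Err(format!("descending range {part:?}"));
        }
        cpus.extend(lo..=hi);
    }
    Ok(cpus)
}

/// Pre-check: fail with a message naming the CPU if it is not online.
fn ensure_online(cpu: u32, online_list: &str) -> Result<(), String> {
    if parse_cpu_list(online_list)?.contains(&cpu) {
        Ok(())
    } else {
        Err(format!("CPU {cpu} is not online (online: {})", online_list.trim()))
    }
}
```

In practice `online_list` would be the contents of /sys/devices/system/cpu/online, read just before the affinity write.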
Construct via Op::steer_irq.
Trait Implementations§
impl Copy for OpKind
impl Eq for OpKind
impl StructuralPartialEq for OpKind
Auto Trait Implementations§
impl Freeze for OpKind
impl RefUnwindSafe for OpKind
impl Send for OpKind
impl Sync for OpKind
impl Unpin for OpKind
impl UnwindSafe for OpKind
Blanket Implementations§
impl<T> BorrowMut<T> for T where T: ?Sized
fn borrow_mut(&mut self) -> &mut T
impl<T> CloneToUninit for T where T: Clone
impl<Q, K> Equivalent<K> for Q
fn equivalent(&self, key: &K) -> bool
impl<Q, K> Equivalent<K> for Q
fn equivalent(&self, key: &K) -> bool
Compare self to key and return true if they are equal.
impl<Q, K> Equivalent<K> for Q
fn equivalent(&self, key: &K) -> bool
impl<Q, K> Equivalent<K> for Q
fn equivalent(&self, key: &K) -> bool
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.