Enum OpKind

pub enum OpKind {
    AddCgroup,
    AddCgroupDef,
    RemoveCgroup,
    SetCpuset,
    ClearCpuset,
    SwapCpusets,
    Spawn,
    StopCgroup,
    SetAffinity,
    MoveAllTasks,
    RunPayload,
    WaitPayload,
    KillPayload,
    FreezeCgroup,
    UnfreezeCgroup,
    CaptureSnapshot,
    WatchSnapshot,
    WriteKernelHot,
    WriteKernelCold,
    ReadKernelHot,
    ReadKernelCold,
    AttachScheduler,
    DetachScheduler,
    RestartScheduler,
    ReplaceScheduler,
    PinBpfMap,
    CaptureCgroupProcs,
    SteerIrq,
}

Auto-generated discriminant enum variants

Variants§

§

AddCgroup

Create a new cgroup under the managed cgroup parent, with no cpuset, no controller knobs, and no workers — the operator-friendly way to declare an empty move-target cgroup that later receives tasks via Op::MoveAllTasks or similar. For mid-step cgroups that need cpuset / cpu / memory / io / pids / workers, use Op::add_cgroup_def instead; for setup-time cgroups with the same knobs, declare via super::super::Step::with_defs.

§

AddCgroupDef

Create a cgroup mid-step from a full CgroupDef — cpuset, cpu/memory/io/pids knobs, and worker spawns all apply in one op, mirroring the way Step::with_defs materializes a step-local CgroupDef at setup time. Use this when the add-cgroup-with-cpuset-and-workers sequence needs to happen after the step’s setup pass (e.g. driven by an earlier op’s observed state) instead of as part of the step’s setup. The embedded def is dedup-checked the same way apply_setup rejects collisions with prior Backdrop or step-local CgroupDef declarations.

§

RemoveCgroup

Remove a cgroup (stops its workers first). Permitted against both step-local and Backdrop-owned cgroups; removing a Backdrop cgroup mid-scenario drops it from the Backdrop tracking list so a later Op::AddCgroup with the same name can re-create the cgroup. A typo’d cgroup name surfaces later as a kernel-layer “cgroup missing” error on the next op that references the name, not at the RemoveCgroup site.

§

SetCpuset

Set a cgroup’s cpuset to the resolved CPU set.

§

ClearCpuset

Clear a cgroup’s cpuset (allow all CPUs).

§

SwapCpusets

Read both cgroups’ cpusets and swap them.

§

Spawn

Spawn workers and place them according to placement.

The work type is used as-is; gauntlet work_type_override does not apply. Use CgroupDef with swappable(true) when the work type should be overridable.

Placement contract (bullets follow SpawnPlacement variant declaration order):

SpawnPlacement::RunnerCgroup — spawn workers in the spawner’s own cgroup; the handler issues ZERO cgroup ops and the workers inherit whatever cgroup the test runner sits in. WorkSpec::workers_pct is rejected for this placement because there’s no managed cgroup whose cpuset would supply the percentage denominator.
SpawnPlacement::Cgroup — spawn workers and move them into the named cgroup; the cgroup must already exist (declared via CgroupDef in Step.setup, via Op::AddCgroup / Op::AddCgroupDef earlier in the same step, or on the persistent Backdrop).

§

StopCgroup

Stop all workers in a cgroup (does not remove the cgroup). Permitted against both step-local and Backdrop-owned cgroups; stopping a Backdrop cgroup’s workers mid-scenario leaves the cgroup hierarchy intact but makes subsequent ops that expect those workers (e.g. wait/kill payload) fail to find them.

§

SetAffinity

Set worker affinity in a cgroup. Resolved at apply time via resolve_affinity_for_cgroup().

§

MoveAllTasks

Move all tasks from one cgroup to another.

Each task is moved via cgroup.procs. If any move fails, the error propagates and handle name keys are left unchanged (workers remain addressed under from). On success, handle name keys are updated to to so subsequent ops address the moved workers.

§Self-move rejection

A self-move (from == to) is rejected at handler entry — the kernel cgroup.procs write is idempotent on same-cgroup targets so the op would silently no-op, masking either a stale op the test author forgot to remove or a typo. The bail names both sides so the operator can pick the right fix. The check also catches the symmetric empty-string pair (("", "")), which would otherwise no-op a RunnerCgroup-to-RunnerCgroup transfer.

§Empty-string source

Passing from = "" matches workers spawned by Op::Spawn with SpawnPlacement::RunnerCgroup — RunnerCgroup-placement handles are tracked under the empty-string key (workers stay in the spawner’s own cgroup, outside any managed hierarchy). Op::move_all_tasks("", "named") is the canonical way to materialize RunnerCgroup-placement workers into a managed cgroup mid-scenario; after the move the captured handles re-key to "named" and lose their empty-string identity, behaving like any other managed worker (lifetime tied to "named"’s ownership slot per the table below).
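The re-keying described above can be pictured with a small model. This is only a sketch under assumptions: `WorkerHandle` and `rekey_handles` are hypothetical stand-ins, not the framework's types, and the real bookkeeping (which the text attributes to `ScenarioState::rename_handles`) may differ.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a tracked worker handle (illustration only).
#[derive(Debug, Clone, PartialEq)]
pub struct WorkerHandle {
    pub pid: u32,
}

/// Re-key every handle tracked under `from` so it is addressed under `to`.
/// The empty string is the key used for RunnerCgroup-placement workers.
/// Returns how many handles changed key.
pub fn rekey_handles(
    handles: &mut HashMap<String, Vec<WorkerHandle>>,
    from: &str,
    to: &str,
) -> usize {
    let moved = handles.remove(from).unwrap_or_default();
    let count = moved.len();
    if count > 0 {
        // Append to any handles already tracked under the destination key.
        handles.entry(to.to_string()).or_default().extend(moved);
    }
    count
}
```

After a successful call with `from = ""`, nothing is tracked under the empty-string key any more, which mirrors the "lose their empty-string identity" wording above.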

§Lifetime / ownership-direction asymmetry

MoveAllTasks is asymmetric with respect to cgroup ownership: the legality of a move depends on the relative lifetimes of the from and to cgroups, not just on which one is the source.

`from` ownership	`to` ownership	Outcome
step-local	step-local	Allowed; both die at step teardown together.
step-local	Backdrop (persistent)	Allowed; handle ownership transfers from step-local set to Backdrop set so the worker survives step teardown.
Backdrop	Backdrop	Allowed; both persist for the scenario.
Backdrop	step-local	Rejected at apply time. A persistent worker would be stranded inside a cgroup that gets `rmdir`’d at step boundary; the kernel migrates the orphaned task to the cgroup root with a frozen-task warning in dmesg. The `bail!` diagnostic names the offending pair and tells the operator to either declare the destination in the Backdrop too, or move the worker back into a Backdrop-owned cgroup.

The Backdrop→Backdrop and step→step cases are unconditionally allowed because both endpoints share a lifetime; the step→Backdrop case is allowed because the kernel moves reference-count once and the framework’s ScenarioState::rename_handles transfers the handle into the persistent slot in the same step. The Backdrop→step case is the only one that produces a guaranteed orphan, hence the asymmetric reject.

§Backdrop-setup exemption

MoveAllTasks ops running INSIDE a Backdrop’s setup_ops pass (state.target_backdrop=true) are exempt from the Backdrop→step-local check: at that point, “step-local” cgroups don’t exist yet (the Backdrop is the only cgroup scope), and the rule reduces to a pure source-ownership check that the apply path handles already.
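The self-move rejection, the ownership table, and the Backdrop-setup exemption combine into one legality check. The following is a minimal sketch of that decision, assuming a hypothetical `Ownership` enum and `check_move` function; the actual handler's signature and error text are not shown in this page.

```rust
/// Hypothetical ownership tag for a cgroup (illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ownership {
    StepLocal,
    Backdrop,
}

/// Decide whether a move is legal under the rules described above.
/// `in_backdrop_setup` models the `state.target_backdrop=true` exemption.
pub fn check_move(
    from: &str,
    from_owner: Ownership,
    to: &str,
    to_owner: Ownership,
    in_backdrop_setup: bool,
) -> Result<(), String> {
    // Self-moves (including the ("", "") pair) are rejected at entry.
    if from == to {
        return Err(format!("self-move rejected: from={from:?}, to={to:?}"));
    }
    // Backdrop -> step-local would strand a persistent worker in a
    // cgroup that is removed at the step boundary.
    if !in_backdrop_setup
        && from_owner == Ownership::Backdrop
        && to_owner == Ownership::StepLocal
    {
        return Err(format!(
            "Backdrop cgroup {from:?} cannot move tasks into step-local cgroup {to:?}"
        ));
    }
    Ok(())
}
```

The other three rows of the table fall through to `Ok(())`, since both endpoints either share a lifetime or the destination outlives the source.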

§

RunPayload

Spawn a userspace Payload binary in the background and track its PayloadHandle under the step’s payload-handle set.

Subsequent Op::WaitPayload / Op::KillPayload address the running child by the composite (Payload::name, cgroup) key — the same payload can run concurrently in two different cgroups without a dedup collision, but the lookup from the waiting op must match the pair the run op recorded. See Op::WaitPayload / Op::KillPayload for the ambiguity rules when the waiting op supplies only the name.

Only PayloadKind::Binary payloads are spawnable; scheduler-kind payloads are rejected at apply time with an actionable error.

args is appended to payload.default_args. cgroup, when set, places the child in the named cgroup (resolved relative to the scenario’s parent cgroup) via PayloadRun::in_cgroup; unset inherits the spawning process’s cgroup.

Handles not explicitly consumed by WaitPayload / KillPayload are drained at step-teardown by collect_step (step-local) or at scenario end by collect_backdrop (when the handle lives on the Backdrop), matching the CgroupDef::workload semantics.

§Scheduler-kind rejection across surfaces

Three surfaces accept a &Payload and each rejects a scheduler-kind Payload differently — deliberately, to match the lifecycle of the caller:

Surface	Rejection	When
`PayloadRun::run` (`ctx.payload(&X)...`)	`Err(anyhow::Error)`	scenario-time
`CgroupDef::workload`	`panic!`	declaration-time
`Op::RunPayload` (this variant)	`Err(anyhow::Error)`	apply-ops-time

Rationale: CgroupDef::workload is a builder invoked during test construction (nextest --list phase), so a panic there surfaces the misuse before any VM boot, with a full backtrace pointing at the offending call. ctx.payload() and Op::RunPayload both run inside an executing scenario, where a single misuse should not crash the whole test run; they bail! with an actionable message and let the surrounding step-sequence skip to teardown. The three paths are symmetric in what they reject (scheduler-kind Payloads in non-scheduler slots); they differ only in how the misuse is surfaced, matched to caller context.

§

WaitPayload

Block until the payload named name exits naturally, then evaluate its checks and record metrics to the per-test sidecar.

The target is looked up by composite key (name, cgroup). cgroup: None matches the unique live copy (whatever its placement); if two or more copies of the same payload are live in different cgroups, the lookup bails with an “ambiguous — specify cgroup” error so the test doesn’t silently wait on the wrong one. Use Op::wait_payload_in_cgroup to disambiguate.

A consumed or unknown (name, cgroup) pair returns Err with an actionable message — test authors must not silently wait for payloads that were never started or have already been consumed by a prior WaitPayload/KillPayload.
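The composite-key lookup and its ambiguity rule can be sketched as follows. `LiveKey` and `find_payload` are hypothetical names used only for illustration; the error strings are approximations of the messages described above, not the framework's exact text.

```rust
/// Hypothetical record of a live payload's composite key (illustration only).
#[derive(Debug, Clone, PartialEq)]
pub struct LiveKey {
    pub name: String,
    pub cgroup: Option<String>,
}

/// Find the index of the live payload matching `name` and, when given,
/// `cgroup`. `cgroup: None` matches any placement, but only if exactly
/// one copy of the payload is live.
pub fn find_payload(
    live: &[LiveKey],
    name: &str,
    cgroup: Option<&str>,
) -> Result<usize, String> {
    let matches: Vec<usize> = live
        .iter()
        .enumerate()
        .filter(|(_, key)| {
            key.name == name
                && match cgroup {
                    Some(cg) => key.cgroup.as_deref() == Some(cg),
                    None => true,
                }
        })
        .map(|(index, _)| index)
        .collect();

    match matches.as_slice() {
        // Never started, or already consumed by a prior wait/kill.
        [] => Err(format!("no live payload {name:?} for cgroup {cgroup:?}")),
        [only] => Ok(*only),
        // Two or more live copies: the caller must name the cgroup.
        _ => Err(format!("payload {name:?} is ambiguous: specify cgroup")),
    }
}
```

A caller would remove the entry at the returned index once the wait or kill completes, so a second lookup for the same pair takes the "no live payload" branch.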

No timeout. WaitPayload waits indefinitely for the child to exit. A binary that never terminates (e.g. a benchmark configured without --runtime=N, or a stress-ng run without --timeout) will hang the step until the outer test watchdog fires. For time-boxed long-running payloads, prefer KillPayload paired with a super::super::HoldSpec::fixed / super::super::HoldSpec::frac step boundary that guarantees forward progress; the payload’s own CLI (--runtime, --timeout) is the reliable way to cap a single invocation’s runtime.

Check failures from the payload are recorded to the sidecar for regression analysis but do NOT fail the step or the test in-process. Use ctx.payload(&X).run() directly if the test body needs to gate on check results.

§

KillPayload

SIGKILL the payload named name, reap the child, evaluate checks, and record metrics. Mirrors the behavior of step-teardown drain for an explicitly-targeted payload.

The target is looked up by composite key (name, cgroup) — see Op::WaitPayload for the ambiguity rules.

A consumed or unknown (name, cgroup) pair returns Err with an actionable message, identical to Op::WaitPayload’s lookup semantics.

Check failures from the payload are recorded to the sidecar for regression analysis but do NOT fail the step or the test in-process. Use ctx.payload(&X).run() directly if the test body needs to gate on check results.

§

FreezeCgroup

Freeze every task in the named cgroup via cgroup.freeze.

Writes "1" to the cgroup’s cgroup.freeze file. The kernel’s cgroup_freeze_write dispatches the asynchronous freeze path; tasks transition to the frozen state without external SIGSTOP, and cgroup.events reaches frozen 1 once every task has parked. Idempotent — freezing an already-frozen cgroup is a no-op.

§Auto-unfreeze at teardown

Op::FreezeCgroup is paired with Op::UnfreezeCgroup to release. A test that omits the unfreeze still tears down cleanly: crate::cgroup::CgroupManager::remove_cgroup auto-unfreezes the cgroup before draining tasks (see the kernel’s cgroup_freezer_migrate_task, which clears the task’s freeze state when it migrates to an unfrozen destination), so step teardown is robust to a stuck-frozen cgroup. Pair the ops explicitly when the scenario needs observable unfreeze timing inside the step body.

§Worked example

Three-Step suspend/resume sequence: a Backdrop-resident long-running workload is paused mid-scenario and resumed later, exercising how the scheduler responds to a sudden idle window.

Step 1 (run): apply cgroup; workload spins for 2s.
Step 2 (suspend): Op::freeze_cgroup("workers"); hold 1s.
                  The cgroup's tasks park via cgroup.freeze,
                  schedstat gauges drop to zero, and the
                  scheduler observes a sudden idle subtree.
Step 3 (resume): Op::unfreeze_cgroup("workers"); hold 2s.
                 Tasks return to runnable state, the
                 scheduler must re-pick them onto the
                 cgroup's CPUs without spuriously preempting
                 unrelated workloads.

§Observer-cgroup deadlock warning

Do NOT freeze a cgroup that hosts the test’s own observation machinery. The freeze path stops every task in the cgroup — including any thread that:

opens /proc/<pid>/sched or other procfs entries owned by tasks inside the frozen cgroup, then waits on the read,
holds a futex shared with frozen tasks (the unfreeze must land before the wait can complete),
synchronously waits on a stalled-task pipe whose producer is in the frozen cgroup.

The framework’s stimulus-event SHM ring and the BlkWorker epoll loop both run outside the test cgroup tree, so they are unaffected — but a test author who explicitly places an observer thread inside the same cgroup as its observation targets will deadlock the scenario when the freeze fires. Place observers in a sibling cgroup (or in the parent) so cgroup.freeze is scoped to the workload subtree alone.

Pair with Op::UnfreezeCgroup to release. Useful for scheduler suspend/resume tests where the test body wants to observe how the scheduler handles a suddenly-frozen workload and the resumption sequence afterwards.

Treats a missing cgroup as a step failure: the cgroup.freeze write fails with ENOENT and the error propagates via the apply_ops with_context chain. Freezing a non-existent cgroup is NOT a no-op; only freezing an already-frozen cgroup is.

§

UnfreezeCgroup

Unfreeze every task in the named cgroup via cgroup.freeze.

Writes "0" to the cgroup’s cgroup.freeze file. Inverse of Op::FreezeCgroup. Idempotent.
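At the file level, freezing and unfreezing reduce to writing "1" or "0" to the cgroup's `cgroup.freeze` file. The sketch below shows that write using only the standard library; `set_cgroup_frozen` is a hypothetical helper, not the framework's `CgroupManager` API, and it takes the cgroup directory as a parameter rather than resolving it under a managed parent.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Write "1" (freeze) or "0" (unfreeze) to `<cgroup_dir>/cgroup.freeze`.
/// A missing cgroup directory surfaces as an `io::Error` (ENOENT) instead
/// of being treated as a no-op.
pub fn set_cgroup_frozen(cgroup_dir: &Path, frozen: bool) -> io::Result<()> {
    let value = if frozen { "1" } else { "0" };
    fs::write(cgroup_dir.join("cgroup.freeze"), value)
}
```

Writing the same value twice is harmless, which matches the idempotency noted for both ops; confirming that every task has actually parked would additionally require reading `cgroup.events`, which this sketch does not do.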

§

CaptureSnapshot

Capture a host-side diagnostic snapshot under name. The freeze coordinator pauses every vCPU long enough to read the BPF map state, vCPU registers, and per-CPU counters into a FailureDumpReport, then resumes the guest. The report is keyed by name on the active SnapshotBridge; downstream test code reads it via Snapshot.

On-demand snapshots are orthogonal to the error-class freeze trigger — the request flows through a separate channel, does not transition the coordinator’s freeze_state, and is serviced even after Done. The only scheduling rule: at most one capture in flight at a time (each request waits for the previous freeze’s vCPUs to fully resume before issuing).

Guest → host wire. In-guest scenarios submit the request over the virtio-console port-1 TLV stream: request_snapshot builds a SnapshotRequestPayload and writes it via write_msg(MsgType::SnapshotRequest, ...) to /dev/vport0p1 (src/vmm/guest_comms.rs). The host coordinator decodes the MSG_TYPE_SNAPSHOT_REQUEST frame, runs freeze_and_dispatch(FreezeMode::Capture { .. }), and the installed CaptureCallback returns the resulting report through a paired reply frame. See CaptureCallback for the full protocol.

No active bridge ⇒ no-op. When the executor runs in a context with no installed SnapshotBridge (e.g. unit tests that exercise the executor without spinning up a VM), this op emits a tracing::warn! and continues. Existing scenarios that never declare snapshot ops keep their behavior unchanged.

§Example

Declare a snapshot mid-step, fetch the captured report after the scenario completes, and assert against a BTF-rendered field:

use ktstr::scenario::ops::{CgroupDef, HoldSpec, Op, Step, execute_steps};
use ktstr::scenario::snapshot::{Snapshot, SnapshotBridge};

// Wire up the bridge before execute_steps runs (host-side
// VM setup typically performs this step automatically).
let bridge = SnapshotBridge::new(/* capture callback */);
let _guard = bridge.clone().set_thread_local();

let steps = vec![Step {
    setup: vec![CgroupDef::named("workers").workers(2)].into(),
    ops: vec![Op::capture_snapshot("after_spawn")],
    hold: HoldSpec::FULL,
}];
execute_steps(ctx, steps)?;

// Inspection.
let captured = bridge.drain();
let report = captured.get("after_spawn").expect("snapshot recorded");
let snap = Snapshot::new(report);
let nr_cpus = snap.var("nr_cpus_onln").as_u64()?;
assert!(nr_cpus > 0, "snapshot captured live nr_cpus_onln");

§

WatchSnapshot

Capture a snapshot whenever the guest writes to the named kernel symbol. The snapshot is tagged with the symbol itself; one fire = one capture.

Symbol resolution at op execution time is a verbatim match against the vmlinux ELF symbol table: the freeze coordinator walks Elf::syms and accepts the symbol whose strtab entry equals the requested string byte-for-byte. There is no prefix stripping, BTF lookup, kallsyms walk, or per-CPU offset arithmetic — the string must match an entry that nm vmlinux would print (e.g. "jiffies_64", "scx_watchdog_timestamp").

The register_watch callback on a host-side SnapshotBridge is for host-side unit testing only — it lets in-process executor tests record the symbol and return without arming any hardware. Production in-VM scenarios run via the virtio-console port 1 MSG_TYPE_SNAPSHOT_REQUEST TLV frame and the host coordinator’s arm_user_watchpoint path (src/vmm/freeze_coord/mod.rs); the thread-local bridge is never installed inside the guest.

§Guard rails

Maximum of 3 watch ops per scenario. The KVM hardware-watchpoint plumbing reserves slot 0 for the existing *scx_root->exit_kind trigger (used by the error-trigger path); only the remaining three user watchpoint slots are available for on-demand watches. The bridge’s register_watch rejects a 4th Op::WatchSnapshot and fails the step when the cap is exceeded.
Symbol resolution failures bail immediately. A missing symbol or unaligned address surfaces as an Err from execute_steps so the test author notices the watch did not attach. Silent degradation would leave the scenario running with no captures and look identical to a healthy passing run.
4-byte alignment. The resolved KVA must be 4-byte aligned: the framework arms 4-byte data-write watches, which require addr & 0x3 == 0 on every supported architecture. Mis-aligned addresses bail at setup with the resolved KVA in the error.
Silent-misfire detection (KASLR-on guests). When the host coordinator’s kaslr_offset is zero AND the resolved kernel symbol lives in the x86_64 high-half address range, arm_user_watchpoint emits a tracing::warn! (once per unique (symbol, link_kva) per process) noting the arm targets the link-time KVA while the runtime symbol lives at link_kva + runtime_kaslr_slide. The arm STILL completes (rejecting it would regress every caller running before the host coordinator’s runtime-KASLR-slide derivation lands); operators who hit the warn can boot the guest with the nokaslr cmdline to use Op::WatchSnapshot, or omit the op from KASLR-on test runs entirely.
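The first three guard rails (slot cap, hard failure, 4-byte alignment) can be modelled with a small allocator. This is an illustrative sketch only: `WatchSlots` and `arm` are hypothetical names, the order of the two checks is an assumption, and no hardware is armed.

```rust
/// Number of user watchpoint slots; slot 0 is reserved for the
/// error-trigger watch described above.
pub const MAX_USER_WATCHES: usize = 3;

/// Hypothetical bookkeeping for armed user watches (illustration only).
#[derive(Debug, Default)]
pub struct WatchSlots {
    armed: Vec<(String, u64)>,
}

impl WatchSlots {
    /// Record a watch on `symbol` at kernel virtual address `kva`.
    /// Returns the hardware slot index (1..=3) on success.
    pub fn arm(&mut self, symbol: &str, kva: u64) -> Result<usize, String> {
        // 4-byte data-write watches require addr & 0x3 == 0.
        if kva & 0x3 != 0 {
            return Err(format!("{symbol}: KVA {kva:#x} is not 4-byte aligned"));
        }
        // A 4th watch exceeds the cap and fails instead of degrading silently.
        if self.armed.len() >= MAX_USER_WATCHES {
            return Err(format!(
                "{symbol}: all {MAX_USER_WATCHES} user watchpoint slots are in use"
            ));
        }
        self.armed.push((symbol.to_string(), kva));
        Ok(self.armed.len())
    }
}
```

Both failure paths return `Err` so that a caller propagating with `?` stops the step, in line with the "bail immediately" rule.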

Guest → host wire. The registration request rides the same ioeventfd doorbell as Op::CaptureSnapshot (separate tag namespace), so symbol resolution + user watchpoint slot allocation + KVM_SET_GUEST_DEBUG arming happen on the host without a vCPU userspace exit. Once armed, the KVM_EXIT_DEBUG dispatch path drives the resulting captures directly into the freeze coordinator (no per-fire doorbell write needed). See WatchRegisterCallback for the full protocol.

Note: high-frequency variables (rq counters, jiffies) can trigger a watch every few microseconds, producing thousands of captures, each overwriting the prior capture under the same tag. The framework does not rate-limit captures, so the test author owns the frequency choice. Use Op::CaptureSnapshot for time-driven captures when frequency is the concern.

§

WriteKernelHot

Live-vCPU write of one or more KernelTarget / KernelValue pairs into running guest memory. The host coordinator routes each pair to the appropriate GuestKernel::write_* helper (no freeze rendezvous, vCPUs keep executing). A Release fence is issued after the last write so a weakly-ordered guest’s smp_load_acquire observes the bytes in write order — but concurrent guest readers can still race against in-flight stores, and the caller owns any guest-side synchronisation the test requires (READ_ONCE / smp_load_acquire on the target field).

Same orchestration pattern as the existing BpfMapAccessor::write_value path: synchronous host-side memory mutation on a worker thread, no vCPU pause. Use this for scratch fields, debug flags, scx-ktstr-private state, and anything the guest reads with proper barriers.

Batch shape. writes carries 1+ pairs; the executor issues them in order. For a single write the Op::write_kernel_hot singleton constructor wraps a 1-element vec.

Dispatch. The executor’s arm calls dispatch_kernel_op_request (src/scenario/ops/dispatch.rs:2386), which uses the in-process SnapshotBridge callback when one is installed (the test-fixture seam) and falls back to the virtio-console port-1 wire path (MsgType::KernelOpRequest) in-guest. The wire request is consumed by dispatch_kernel_op_batch (src/vmm/freeze_coord/kernel_op_dispatch.rs), invoked from the freeze coordinator’s apply path.

See also. KernelTarget — scroll to the “Semantic risk” section for the single source of truth on which scheduler-bookkeeping targets are safe vs silently load-bearing.

§

WriteKernelCold

Auto-freezing batched write of one or more KernelTarget / KernelValue pairs while every vCPU is parked at the freeze rendezvous. Reuses the same coordinator path that Op::CaptureSnapshot triggers: one rendezvous, every write in the batch lands while paused, then resume.

Batching is a hard correctness requirement. Multi-CPU seeds (e.g. a planned with_uptime helper writing per-CPU rq.clock on every CPU at the same instant) must land in ONE freeze window — N separate cold-write ops would mean N rendezvous cycles and observable inter-CPU skew. The variant payload is a Vec precisely to make batched writes the natural shape. The executor’s apply_ops pre-pass auto-merges adjacent singleton Op::WriteKernelCold ops into one merged op as a safety net — N adjacent write_kernel_cold(...) calls collapse into one rendezvous regardless of whether the caller used crate::scenario::ops::Op::write_kernel_cold_batch or chained singletons.
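The auto-merge pre-pass can be pictured as a single fold over the op list. The sketch below uses a reduced, hypothetical `Op` enum with `(String, u64)` pairs standing in for `KernelTarget` / `KernelValue`; it merges any run of adjacent cold writes, which is a simplification of whatever conditions the real `apply_ops` pre-pass applies.

```rust
/// Reduced stand-in for the real op type (illustration only).
#[derive(Debug, Clone, PartialEq)]
pub enum Op {
    /// Batched cold write: each pair is (target, value).
    WriteKernelCold(Vec<(String, u64)>),
    /// Any other op, identified by a label.
    Other(&'static str),
}

/// Collapse each run of adjacent cold writes into one batched op so the
/// whole run needs a single freeze rendezvous.
pub fn merge_adjacent_cold_writes(ops: Vec<Op>) -> Vec<Op> {
    let mut merged: Vec<Op> = Vec::with_capacity(ops.len());
    for op in ops {
        match op {
            Op::WriteKernelCold(next) => match merged.last_mut() {
                // The previous op is also a cold write: extend its batch.
                Some(Op::WriteKernelCold(prev)) => prev.extend(next),
                _ => merged.push(Op::WriteKernelCold(next)),
            },
            // Any other op ends the current run.
            other => merged.push(other),
        }
    }
    merged
}
```

Writes separated by an unrelated op stay in separate batches, so they still cost one rendezvous each.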

Dispatch. The executor’s arm calls dispatch_kernel_op_request (src/scenario/ops/dispatch.rs:2386), which uses the in-process SnapshotBridge callback when one is installed (the test-fixture seam) and falls back to the virtio-console port-1 wire path (MsgType::KernelOpRequest) in-guest. The wire request lands at the freeze coordinator’s rendezvous boundary via dispatch_kernel_op_batch (src/vmm/freeze_coord/kernel_op_dispatch.rs).

Use this for: multi-field atomic writes, all-CPUs-at-once seeding, one-shot setup that must complete before the guest observes any partial state. Use Op::WriteKernelHot when the guest is OK with live-write semantics + caller-side synchronisation.

See also. KernelTarget — scroll to the “Semantic risk” section for the single source of truth on which scheduler-bookkeeping targets are safe vs silently load-bearing.

§

ReadKernelHot

Live-vCPU read of a KernelTarget into the SnapshotBridge drain log keyed by tag. Mirrors Op::WriteKernelHot: no freeze rendezvous, host-side worker thread issues the read while the guest keeps executing. The caller assumes the read may race against guest writes; for read-write coherency, pair the op with a guest-side smp_store_release on the target.

Use this for: read-back of values previously written via Op::WriteKernelHot, lightweight polling of single fields the test wants to observe without pausing the guest.

Width. The width field picks which crate::monitor::guest::GuestKernel read_* family the host dispatcher invokes — u32 / u64 / Bytes(len). The reply lands as a crate::vmm::wire::KernelOpValue of the matching shape in the bridge’s drain log; a u32 field must be read with KernelValueWidth::u32() (a u64 read of a u32 field returns the field’s bytes plus 4 adjacent bytes).
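The width pitfall is easy to reproduce on a plain byte buffer. The sketch below assumes a little-endian guest and uses hypothetical helper names (`read_u32_le`, `read_u64_le`); it does not call the real `GuestKernel` read helpers.

```rust
/// Read a little-endian u32 at `offset`, or None if out of bounds.
pub fn read_u32_le(mem: &[u8], offset: usize) -> Option<u32> {
    let mut bytes = [0u8; 4];
    bytes.copy_from_slice(mem.get(offset..offset + 4)?);
    Some(u32::from_le_bytes(bytes))
}

/// Read a little-endian u64 at `offset`, or None if out of bounds.
pub fn read_u64_le(mem: &[u8], offset: usize) -> Option<u64> {
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(mem.get(offset..offset + 8)?);
    Some(u64::from_le_bytes(bytes))
}

/// Model a u32 field holding 42, followed by 4 bytes of a neighbouring field.
pub fn example_memory() -> [u8; 8] {
    [0x2a, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff]
}
```

With this buffer, the u32 read yields 42, while the u64 read at the same offset folds the neighbouring bytes into the upper half of the result and no longer equals 42.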

Dispatch. Same bridge-first / wire-fallback model as Op::WriteKernelHot; the wire request is consumed by dispatch_kernel_op_batch (src/vmm/freeze_coord/kernel_op_dispatch.rs).

§

ReadKernelCold

Auto-freezing read of a KernelTarget into the SnapshotBridge drain log keyed by tag, taken while every vCPU is parked at the freeze rendezvous. Reuses the same coordinator path that Op::CaptureSnapshot triggers. Coherent with respect to guest state — no concurrent guest write can race against the read.

Use this for: ground-truth reads that must reflect a stable guest state, snapshot-style point-in-time reads. Note: each Op::ReadKernelCold triggers its OWN freeze rendezvous — apply_ops’s pre-pass folds adjacent Op::WriteKernelCold ops into one rendezvous but does NOT fold reads (per-entry wire tags are needed for the multi-read reply-routing contract; queued as a wire-format follow-up). For multi-read coherent snapshots, prefer Op::CaptureSnapshot (which already orchestrates a single rendezvous for all snapshot reads).

Width. Same width semantics as Op::ReadKernelHot: pick the read family explicitly so the dispatcher invokes the matching GuestKernel::read_* helper.

Dispatch. Bridge-first / wire-fallback like the other *Kernel* variants; the wire request lands at the freeze coordinator’s rendezvous boundary via dispatch_kernel_op_batch (src/vmm/freeze_coord/kernel_op_dispatch.rs).

§

AttachScheduler

Attach a scheduler mid-scenario: spawn the named staged scheduler from /staging/schedulers/<name>/ inside the guest and wait for it to publish its first BPF object accessors.

Dispatch (dispatch_attach_scheduler at src/scenario/ops/dispatch.rs:2032): waits up to 60s for the accessor-init worker to quiesce (handles the case where the boot scheduler’s first publish is still in flight), captures the pre-spawn publish seqno, spawns the staged scheduler binary, re-installs the sched_exit_monitor against the new SCHED_PID, then waits up to 30s for a fresh accessor publish.

Already-attached behavior. No framework-level idempotency guard: if a scheduler is already running, the kernel rejects the new attach at the scx_enable_state() != SCX_DISABLED gate (kernel/sched/ext.c:6837, returns -EBUSY); the spawned binary exits, no fresh publish lands, and the dispatch bails on the 30s publish-wait timeout. Use Op::DetachScheduler (then AttachScheduler) or Op::ReplaceScheduler to swap schedulers.

The scheduler reference holds a 'static lifetime: the test author declares each crate::test_support::Scheduler at static scope (via declare_scheduler! or a static MY_SCHED: Scheduler = ... item) and passes the borrow into the constructor. The staging slot that ships the binary into the initramfs is KtstrTestEntry::staged_schedulers; the dispatch arm reads its path via test_support::staged::staged_scheduler_binary_path.

§

DetachScheduler

Detach the currently-running scheduler.

Dispatch (dispatch_detach_scheduler → kill_current_scheduler at src/scenario/ops/dispatch.rs:1896): stops the host’s sched_exit_monitor so the intentional kill isn’t promoted into a test-fatal scheduler-died signal, writes 'S' to /proc/sysrq-trigger to start the kernel-side scx_disable cascade asynchronously (avoiding the D-state stall inside scx_flush_disable_work’s kthread_flush_work(&sch->disable_work) at kernel/sched/ext.c:6145, reached on the struct_ops detach path via bpf_scx_unreg at kernel/sched/ext.c:7666), sends SIGTERM to the scheduler pid, waits up to SCHED_LIFECYCLE_KILL_GRACE (10s) for the kernel BPF state to reach SCX_DISABLED, then clears the SCHED_PID atomic (defined in src/vmm/rust_init/mod.rs) so subsequent crate::vmm::rust_init::sched_pid() reads return None.

Bails when no scheduler is currently attached (SCHED_PID is 0), when the SIGTERM syscall fails, or when the SCX_DISABLED wait times out. NOT idempotent: a second detach with no scheduler attached bails rather than no-oping. For defensive “ensure clean slate” scaffolds, gate on crate::vmm::rust_init::sched_pid() returning Some before emitting the Detach step rather than relying on no-op tolerance.
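The "gate before emitting" advice can be sketched with a local model of the pid atomic. Everything here is a stand-in: the `SCHED_PID` static, its `AtomicI32` type, and `detach_ops_if_attached` are assumptions made for illustration, and the op is represented by a string label instead of the real `Op` type.

```rust
use std::sync::atomic::{AtomicI32, Ordering};

/// Local model of the scheduler pid atomic; 0 means "no scheduler attached".
/// The real atomic lives in the framework and its type is assumed here.
pub static SCHED_PID: AtomicI32 = AtomicI32::new(0);

/// Model of `sched_pid()`: None when no scheduler is attached.
pub fn sched_pid() -> Option<i32> {
    match SCHED_PID.load(Ordering::Acquire) {
        0 => None,
        pid => Some(pid),
    }
}

/// Emit a detach op only when a scheduler is attached, since a detach
/// with nothing attached bails instead of no-oping.
pub fn detach_ops_if_attached() -> Vec<&'static str> {
    if sched_pid().is_some() {
        vec!["DetachScheduler"]
    } else {
        Vec::new()
    }
}
```

The gate is evaluated when the op list is built, so it only helps if the attach state cannot change between building the list and applying it.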

§

RestartScheduler

Kill the currently-running scheduler and respawn the BOOT scheduler. Useful for hot-restart validation of the boot scheduler. Bails if no scheduler is currently attached.

v0 limitation. Always respawns the boot scheduler at /scheduler + /sched_args regardless of which scheduler was most-recently attached — after an Op::AttachScheduler or Op::ReplaceScheduler to a staged scheduler, this op restarts the BOOT scheduler, not the most-recently-attached one. For restarting a staged scheduler, use Op::ReplaceScheduler with the same staged spec.

Dispatch (dispatch_restart_scheduler at src/scenario/ops/dispatch.rs:2129): kills the current scheduler via the shared kill_current_scheduler helper, spawns the boot scheduler from the hardcoded /scheduler + /sched_args paths with log at /tmp/sched.log, then re-installs the sched_exit_monitor against the re-spawned boot pid.

§

ReplaceScheduler

Detach the currently-running scheduler and attach a different one. Equivalent to [DetachScheduler, AttachScheduler { scheduler: new }] but expressed as a single op so the no-scheduler window is bounded and the per-phase scheduler tagging on the sidecar can record the transition atomically.

The mid-experiment swap case the operator typically wants: run scheduler A for the first phase of a multi-step test, swap to scheduler B (or A-with-different-CLI-args, modeled as a distinct Scheduler declaration) for the second phase, and assert a per-phase metric delta across the boundary.

Bails if no scheduler is currently attached — there is no scheduler to detach from, so the “replace” semantic has no meaning. Use Op::AttachScheduler for the first attach.

Dispatch (dispatch_replace_scheduler at src/scenario/ops/dispatch.rs:2153): kills the current scheduler via the shared kill_current_scheduler helper, spawns the named staged scheduler binary from /staging/schedulers/<name>/, re-installs the sched_exit_monitor against the new SCHED_PID, waits up to REPLACE_NOT_TRYING_DEADLINE_S (5s) for the accessor-init worker to quiesce, captures the pre-publish seqno, then waits up to 10s for fresh accessors to publish against the new BPF object. The 10s budget aligns with SCHED_LIFECYCLE_KILL_GRACE and covers a cold-cache vmlinux re-parse during the worker reinit.

§

PinBpfMap

Open a BPF map fd by name and hold it for the scenario lifetime.

Why this exists. Op::ReplaceScheduler kills the outgoing scheduler process; libbpf’s drop path then releases the map fds the loader was holding. Once the last refcount on a map drops, the kernel frees it — typically before any post-swap freeze captures, so the multi-bss “same-binary swap window” case (two <obj>.bss copies coexisting briefly) closes too fast to be reliably observed in a test. PinBpfMap holds an extra refcount on the named map so the kernel keeps it alive until the scenario ends.

Semantics. Walks the kernel’s map ID space (via [libbpf_rs::query::MapInfoIter], which wraps BPF_MAP_GET_NEXT_ID + BPF_MAP_GET_FD_BY_ID + BPF_OBJ_GET_INFO_BY_FD) and keeps the fd whose name matches. The held fd lives in the scenario’s Backdrop state and drops (via std OwnedFd Drop) at scenario teardown. Multiple PinBpfMap ops with distinct names accumulate; pinning the same name twice is a no-op (the second call returns without re-opening the fd, so the originally-pinned map instance is the one held — not the second-call-time instance).

Name truncation. BPF map names are capped at BPF_OBJ_NAME_LEN = 16 bytes including the trailing NUL, so 15 usable chars max per kernel/bpf/syscall.c’s bpf_obj_name_cpy. Pass the kernel-visible name (typically <obj>.bss / <obj>.data / <obj>.rodata). When a libbpf object name + section suffix exceeds the 15-char cap, libbpf truncates the object prefix at load time and the kernel-side name is the truncated form; the framework does not auto-truncate the user-supplied string, so pass the post-truncation form. Reading the map names from a prior crate::monitor::dump::FailureDumpReport’s maps[].name or via bpftool map list is the safe way to discover the exact string the kernel sees.
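To predict the post-truncation form, the rule above (keep the section suffix, shorten the object prefix until the total fits in 15 characters) can be approximated as below. `kernel_visible_map_name` is a hypothetical helper that assumes ASCII names; libbpf applies further rules, such as character sanitization, that this sketch ignores, so checking the result against `bpftool map list` remains the reliable path.

```rust
/// Kernel cap on object names, including the trailing NUL.
pub const BPF_OBJ_NAME_LEN: usize = 16;

/// Approximate the kernel-visible name for an object's section map:
/// the suffix (e.g. ".bss") is kept and the object prefix is truncated
/// so the whole name fits in BPF_OBJ_NAME_LEN - 1 characters.
/// Assumes ASCII input.
pub fn kernel_visible_map_name(obj_name: &str, section_suffix: &str) -> String {
    let max_chars = BPF_OBJ_NAME_LEN - 1;
    let suffix: String = section_suffix.chars().take(max_chars).collect();
    let room = max_chars - suffix.chars().count();
    let prefix: String = obj_name.chars().take(room).collect();
    format!("{prefix}{suffix}")
}
```

For a hypothetical object named `scx_my_scheduler`, the `.bss` map comes out as `scx_my_sche.bss`: eleven prefix characters plus the four-character suffix.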

Order. Place this op AFTER the scheduler that owns the target map has attached (typically a small fixed hold suffices — ~100ms for the small scx-ktstr fixture, longer for heavyweight schedulers). For the same-binary swap-window scenario specifically: pin the outgoing scheduler’s bss before Op::ReplaceScheduler runs — pinning after the swap is too late because the outgoing scheduler’s bss has already been freed by libbpf’s drop path. The pin walker picks the lowest-id matching map, so the outgoing copy (the older id) is the one held; the incoming scheduler’s load then creates a second copy that’s also kept alive because the outgoing refcount blocks the kernel from freeing the id.
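The lowest-id selection and the pin-twice no-op can be modelled without touching the kernel. The sketch below is a toy model under stated assumptions: MapEntry and Pins are hypothetical types, and the real pin table holds file descriptors, not ids.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for one entry seen during the map ID walk.
struct MapEntry {
    id: u32,
    name: &'static str,
}

/// Hypothetical model of the held-pin table, keyed by map name.
#[derive(Default)]
struct Pins {
    held: HashMap<String, u32>,
}

impl Pins {
    /// Pin `name` against the maps visible in `walk`; returns the held id,
    /// or None when no map in the walk carries that name.
    fn pin(&mut self, walk: &[MapEntry], name: &str) -> Option<u32> {
        // Already pinned: keep the originally held instance, do not re-walk.
        if let Some(&id) = self.held.get(name) {
            return Some(id);
        }
        // Otherwise hold the lowest-id map with a matching name.
        let id = walk.iter().filter(|m| m.name == name).map(|m| m.id).min()?;
        self.held.insert(name.to_string(), id);
        Some(id)
    }
}
```

In this model, pinning before the swap holds the older id, and a second pin issued after the incoming scheduler has loaded its own copy still reports that older id.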

Failure surface. The pin runs at Step apply time inside execute_steps / execute_scenario. A failure (no matching map found in the walk) bails out of the apply path as an Err from execute_steps; the scenario stops before the next Step runs and the post_vm callback is not invoked. The underlying [libbpf_rs::query::MapInfoIter] silently terminates iteration on any non-ENOENT errno from the BPF ID walk (including EPERM from missing CAP_SYS_ADMIN), so such errors surface as the no-matching-map case rather than a distinct EPERM error — acceptable because ktstr always runs as root inside the guest, so the CAP_SYS_ADMIN gates at kernel/bpf/syscall.c:4761 (BPF_MAP_GET_NEXT_ID walk) and :4869 (BPF_MAP_GET_FD_BY_ID) are always satisfied and the EPERM path is unreachable in practice.

Example.

let steps = vec![
    // Phase 0: primary scheduler runs alone; pin BEFORE the swap.
    Step::with_op(
        Op::pin_bpf_map("<obj>.bss"),
        HoldSpec::frac(0.3),
    ),
    // Phase 1: swap to a same-binary alt — the pinned map
    // keeps the OUTGOING bss alive across the teardown.
    Step::with_op(
        Op::replace_scheduler(&STAGED_ALT_SCHED),
        HoldSpec::frac(0.7),
    ),
];

See also. crate::scenario::bpf_pin::open_bpf_map_fd_by_name for the underlying helper and tests/live_var_disambiguation_e2e.rs for the swap-window conditional walker-fired gate this pin is designed to make deterministic.

§

CaptureCgroupProcs

Capture the current cgroup.procs of cgroup and store the PID list on the active SnapshotBridge under tag.

Synchronous read of the cgroup-v2 cgroup.procs pseudofile in the dispatching thread (in-scenario — runs wherever execute_scenario runs; inside the guest VM for #[ktstr_test] e2e tests, on the host for host-only scenarios). Returns the thread-group leaders (PIDs / TGIDs) the kernel reports at apply time. The snapshot is appended to the bridge’s per-tag drain log; test bodies drain via SnapshotBridge::drain_cgroup_procs (or the by-tag lookup SnapshotBridge::cgroup_procs_by_tag) after the scenario completes to read the captured pids back.

Distinct from Op::CaptureSnapshot: that op routes through the host-side freeze coordinator (TLV transport in production, thread-local bridge in test fixtures); this op runs entirely in-process against the local cgroupfs.

§Use cases

Pin “did my workers land in cgroup X” assertions without the shell-probe + tmpfs-roundtrip pattern. Typical shape:

use ktstr::prelude::SnapshotBridge;
use std::sync::Arc;

// Install a bridge (dummy capture cb — only cgroup-procs drain
// is used). MUST clone before set_thread_local, which consumes
// self — the clone shares the Arc-internal state and is what
// we drain on after the scenario completes.
let bridge = SnapshotBridge::new(Arc::new(|_| None));
let bridge_for_drain = bridge.clone();
let _guard = bridge.set_thread_local();

let backdrop = Backdrop::new().push_op(Op::add_cgroup("workers"));
let steps = vec![
    Step::new(
        vec![
            Op::spawn(SpawnPlacement::cgroup("workers"),
                      WorkSpec::default().workers(4)),
            Op::capture_cgroup_procs("after_spawn", "workers"),
        ],
        HoldSpec::fixed(Duration::ZERO),
    ),
];
let _ = execute_scenario(&ctx, backdrop, steps)?;

// Either drain the whole log or look up by tag.
let after = bridge_for_drain.cgroup_procs_by_tag("after_spawn")
    .expect("Op::CaptureCgroupProcs(\"after_spawn\", ...) snapshot");
assert_eq!(after.pids.len(), 4);

§Within-Step ordering

Ops in a single Step apply sequentially in vec order, so an Op::CaptureCgroupProcs placed AFTER Op::Spawn / Op::MoveAllTasks observes the post-spawn / post-migrate kernel state. The producing ops complete synchronously (their cgroup.procs writes block on kernel commit), so the capture sees every PID those ops placed.

§PID vs TID grain

Reads cgroup.procs (thread-group leaders), NOT cgroup.threads (per-thread TIDs). Grain implications by spawn op:

Op::Spawn → ktstr workers are 1-thread-per-worker, so workers(N) produces N pids in cgroup.procs.
Op::RunPayload → an execve’d binary is ONE process; even if the binary spawns 100 threads, cgroup.procs reports the single thread-group leader. Tests asserting per-thread placement would need a sibling cgroup.threads accessor (future Op variant if a use case arises).

§Tag uniqueness

tag is the snapshot key the test body uses to find the capture in the drain log. The apply-ops dispatch rejects an empty tag with an actionable bail. Multiple captures of the same cgroup under DIFFERENT tags surface as separate entries (lets a scenario capture pre/post snapshots of the same cgroup); multiple captures with the same (tag, cgroup) also append rather than overwrite — tag uniqueness is a caller convention, not a framework-enforced contract. The by-tag lookup SnapshotBridge::cgroup_procs_by_tag returns the FIRST match; callers who care about multiplicity must use SnapshotBridge::drain_cgroup_procs and filter the Vec manually.
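The difference between the first-match lookup and a manual filter over the drained log can be illustrated with a stand-in type. ProcsSnapshot, first_by_tag and all_by_tag below are hypothetical; only the tag and pids concepts come from the description above, and the real drained entry type may differ.

```rust
/// Hypothetical stand-in for one captured cgroup.procs snapshot.
#[derive(Debug, Clone, PartialEq)]
struct ProcsSnapshot {
    tag: String,
    cgroup: String,
    pids: Vec<u32>,
}

/// First-match lookup, modelling the by-tag behaviour: later captures
/// under the same tag are not visible through this route.
fn first_by_tag<'a>(log: &'a [ProcsSnapshot], tag: &str) -> Option<&'a ProcsSnapshot> {
    log.iter().find(|s| s.tag == tag)
}

/// Every capture under `tag`, in append order, for callers who care
/// about multiplicity.
fn all_by_tag<'a>(log: &'a [ProcsSnapshot], tag: &str) -> Vec<&'a ProcsSnapshot> {
    log.iter().filter(|s| s.tag == tag).collect()
}
```

Giving each capture its own tag (for example one per phase) keeps the first-match lookup unambiguous and avoids the manual filter altogether.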

§Empty / unknown cgroup

Empty cgroup (exists but holds no tasks): captured snapshot has pids = vec![]. Lets callers assert “no tasks landed here” without conflating with “no such cgroup.”
Unknown cgroup (directory missing): apply bails with a layered anyhow chain — the outer wrap names the op + tag + cgroup; the inner crate::cgroup::CgroupOps::read_procs context surfaces the resolved path + the actionable hint about Op::AddCgroup / workload_root_cgroup. Use format!("{err:#}") (alternate display) to flatten both layers in test assertions.
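The empty-versus-missing distinction falls out of how cgroup.procs is read: an existing but empty file parses to an empty list, while a missing directory fails at the read. A minimal sketch under stated assumptions; parse_procs and read_procs here are hypothetical free functions, not the CgroupOps::read_procs trait method, and the u32 PID type is an assumption.

```rust
use std::{fs, io, path::Path};

/// Parse the text of a cgroup.procs file: one decimal PID per line.
/// An empty file yields an empty Vec, not an error.
fn parse_procs(text: &str) -> Result<Vec<u32>, std::num::ParseIntError> {
    text.lines()
        .map(|line| line.trim())
        .filter(|line| !line.is_empty())
        .map(|line| line.parse::<u32>())
        .collect()
}

/// Read and parse `<cgroup_dir>/cgroup.procs`. A missing directory
/// surfaces as an io::Error from the read itself.
fn read_procs(cgroup_dir: &Path) -> io::Result<Vec<u32>> {
    let text = fs::read_to_string(cgroup_dir.join("cgroup.procs"))?;
    parse_procs(&text).map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}
```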

§See also

Op::CaptureSnapshot — diagnostic-snapshot capture (full scheduler state dump via FailureDumpReport). Distinct from this op’s cgroup-procs read AND drains via a separate SnapshotBridge::drain / drain_ordered channel, not drain_cgroup_procs.
crate::cgroup::CgroupOps::read_procs — the underlying trait method this op dispatches through.

§

SteerIrq

Re-steer a hardware IRQ to a single CPU by writing /proc/irq/<N>/smp_affinity_list in the guest — the knob that drives the kernel’s write_irq_affinity → irq_set_affinity → irq_do_set_affinity → irqchip set_affinity path (kernel/irq/proc.c, kernel/irq/manage.c). Use it to place a NIC’s RX-completion interrupt on a chosen CPU so the hardirq, the NET_RX softirq it raises, and any task that path wakes all land where the scenario wants them: the steering half of an IRQ-locality test whose generating half is crate::workload::WorkType::NetTraffic and whose observing half is the per-CPU IRQ metric axis (max_cpu_hardirqs, max_cpu_softirq_net_rx, and their *_concentration ratios).

§In-guest file write, NOT a kernel-memory poke

A write to the irq_desc affinity mask in kernel memory would NOT re-route delivery — only the smp_affinity_list write runs the full set-affinity path that reprograms the interrupt controller (MSI-X message / IOAPIC RTE). So this Op is dispatched as a plain std::fs::write from the executor in-guest (mirroring the /proc/sysrq-trigger write Op::DetachScheduler performs), NOT through the kernel-memory rendezvous path of Op::WriteKernelHot / Op::WriteKernelCold.

§Online-CPU requirement

The kernel intersects the requested mask with cpu_online_mask before programming the irqchip (irq_do_set_affinity); a single-CPU target that is offline leaves no online CPU in the mask and the write returns -EINVAL (the !cpumask_intersects(new_value, cpu_online_mask) arm of write_irq_affinity). The dispatcher pre-checks cpu against /sys/devices/system/cpu/online and bails with an actionable message before the write, so an out-of-range / offline target names the CPU instead of surfacing a bare EINVAL. IRQ affinity is a system-wide property, NOT scoped to the writing task’s cpuset — the target need not be in the runner’s allowed set.
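The pre-check amounts to parsing the kernel's CPU-list format (for example 0-3,5) and testing membership. The sketch below illustrates that idea only; parse_cpu_list and check_online are hypothetical names, and the dispatcher's actual error wording is not reproduced.

```rust
/// Parse a kernel CPU-list string such as "0-3,5" into CPU ids.
/// Returns None on malformed input (non-numeric field or reversed range).
fn parse_cpu_list(text: &str) -> Option<Vec<u32>> {
    let mut cpus: Vec<u32> = Vec::new();
    for part in text.trim().split(',').filter(|p| !p.is_empty()) {
        match part.split_once('-') {
            Some((lo, hi)) => {
                let lo = lo.parse::<u32>().ok()?;
                let hi = hi.parse::<u32>().ok()?;
                if lo > hi {
                    return None;
                }
                cpus.extend(lo..=hi);
            }
            None => cpus.push(part.parse::<u32>().ok()?),
        }
    }
    Some(cpus)
}

/// Hypothetical pre-check: name the CPU in the error so the caller does
/// not have to decode a bare EINVAL from the later write.
fn check_online(cpu: u32, online_text: &str) -> Result<(), String> {
    let online = parse_cpu_list(online_text)
        .ok_or_else(|| format!("unparseable CPU list {:?}", online_text.trim()))?;
    if online.contains(&cpu) {
        Ok(())
    } else {
        Err(format!("CPU {} is not online (online: {})", cpu, online_text.trim()))
    }
}
```

With the text of /sys/devices/system/cpu/online passed as online_text, a successful check would be followed by the plain std::fs::write of the CPU number to /proc/irq/<N>/smp_affinity_list described earlier.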

Construct via Op::steer_irq.

Trait Implementations§

impl Clone for OpKind

fn clone(&self) -> OpKind

fn clone_from(&mut self, source: &Self)

impl Debug for OpKind

fn fmt(&self, f: &mut Formatter<'_>) -> Result

impl<'_enum> From<&'_enum Op> for OpKind

fn from(val: &'_enum Op) -> OpKind

impl From<Op> for OpKind

fn from(val: Op) -> OpKind

impl IntoEnumIterator for OpKind

type Iterator = OpKindIter

fn iter() -> OpKindIter

impl PartialEq for OpKind

fn eq(&self, other: &OpKind) -> bool

fn ne(&self, other: &Rhs) -> bool

impl Copy for OpKind

impl Eq for OpKind

impl StructuralPartialEq for OpKind

Auto Trait Implementations§

impl Freeze for OpKind

impl RefUnwindSafe for OpKind

impl Send for OpKind

impl Sync for OpKind

impl Unpin for OpKind

impl UnwindSafe for OpKind

Blanket Implementations§

impl<T> Any for T where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

impl<T> Borrow<T> for T where T: ?Sized,

fn borrow(&self) -> &T

impl<T> BorrowMut<T> for T
where T: ?Sized,

impl<T> CloneToUninit for T
where T: Clone,

impl<Q, K> Equivalent<K> for Q
where Q: Eq + ?Sized, K: Borrow<Q> + ?Sized,

impl<Q, K> Equivalent<K> for Q
where Q: Eq + ?Sized, K: Borrow<Q> + ?Sized,

impl<Q, K> Equivalent<K> for Q
where Q: Eq + ?Sized, K: Borrow<Q> + ?Sized,

impl<Q, K> Equivalent<K> for Q
where Q: Eq + ?Sized, K: Borrow<Q> + ?Sized,

impl<T, U> Into<U> for T
where U: From<T>,

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

impl<T> PolicyExt for T
where T: ?Sized,

fn and<P, B, E>(self, other: P) -> And<T, P>
where T: Policy<B, E>, P: Policy<B, E>,

fn or<P, B, E>(self, other: P) -> Or<T, P>
where T: Policy<B, E>, P: Policy<B, E>,

impl<T> ToOwned for T
where T: Clone,

impl<T, U> TryFrom<U> for T
where U: Into<T>,

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self>
where S: Into<Dispatch>,

impl<T> MaybeSend for T
where T: Send,

impl<T> MaybeSend for T
where T: Send,