Module ops

Composable ops/steps system for dynamic cgroup topology changes.

Op is an atomic cgroup operation. Step sequences ops with a hold period. CgroupDef bundles create + cpuset + spawn into a single declaration. execute_steps() runs a step sequence with scheduler liveness checks and stimulus event recording.

See the Ops and Steps chapter for a guide.

§Cgroup tooling at a glance

ktstr exposes the cgroup v2 surface across two layers: declarative steady state via CgroupDef (set at scenario-setup time and held for the cgroup’s lifetime) and imperative state transitions via Op (applied mid-step to describe changes over time):

| Knob | Layer | API entry | Underlying file | When to use |
|---|---|---|---|---|
| CPU affinity | setup | CgroupDef::cpuset | cpuset.cpus | Bind workers to a CPU subset for the whole run. |
| NUMA-mem affinity | setup | CgroupDef::cpuset_mems | cpuset.mems | Constrain allocations to specific NUMA nodes. |
| CPU bandwidth | setup | CgroupDef::cpu_quota_pct / CgroupDef::cpu_quota / CgroupDef::cpu_unlimited | cpu.max | Cap CPU time per period (1 CPU at 50% / 2 CPU at 100% / etc). |
| CPU share weight | setup | CgroupDef::cpu_weight | cpu.weight | Bias relative CPU share when siblings contend. |
| Memory ceiling | setup | CgroupDef::memory_max / CgroupDef::memory_unlimited | memory.max | Hard ceiling: exceeding it triggers cgroup OOM. |
| Memory throttle | setup | CgroupDef::memory_high | memory.high | Soft throttle: triggers reclaim, not OOM. |
| Memory protection | setup | CgroupDef::memory_low | memory.low | Soft protection: kernel reclaims from siblings first. |
| Swap cap | setup | CgroupDef::memory_swap_max / CgroupDef::memory_swap_unlimited | memory.swap.max | Cap how much memory can spill to swap (CONFIG_SWAP=y). |
| IO share | setup | CgroupDef::io_weight | io.weight | Bias relative IO share when siblings contend. |
| Task ceiling | setup | CgroupDef::pids_max / CgroupDef::pids_unlimited | pids.max | Cap process+thread count: fork/clone returns EAGAIN at the limit. |
| Mid-run cpuset rebind | mid-step | Op::set_cpuset / Op::clear_cpuset / Op::swap_cpusets | cpuset.cpus | Move cpuset on a live cgroup mid-scenario. |
| Mid-run task migration | mid-step | Op::move_all_tasks | cgroup.procs | Move workers from one cgroup to another. |
| Pause/resume | mid-step | Op::freeze_cgroup / Op::unfreeze_cgroup | cgroup.freeze | Suspend every task in the cgroup; resume later. |
| Add/remove cgroup | mid-step | Op::add_cgroup / Op::remove_cgroup / Op::stop_cgroup | (cgroupfs mkdir/rmdir) | Spawn / tear down a cgroup mid-scenario. |
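
To make the CPU bandwidth row concrete, here is a minimal sketch of how a percentage quota maps onto the `$MAX $PERIOD` line that cgroup v2 expects in cpu.max. This is not ktstr's implementation: `cpu_max_line` is a hypothetical helper, and treating the percentage as relative to one CPU over the kernel's default 100 ms period is an assumption made for illustration.

```rust
/// Kernel default period for cpu.max, in microseconds.
const DEFAULT_PERIOD_US: u64 = 100_000;

/// Hypothetical helper (not part of ktstr): render a cpu.max line.
/// `quota_pct` is assumed to be a percentage of ONE CPU, so 200 means
/// two full CPUs. `None` means "no cap".
fn cpu_max_line(quota_pct: Option<u64>, period_us: u64) -> String {
    match quota_pct {
        // Quota in microseconds of CPU time allowed per period.
        Some(pct) => format!("{} {}", period_us * pct / 100, period_us),
        // cgroup v2 spells "unlimited" as the literal string "max".
        None => format!("max {}", period_us),
    }
}

fn main() {
    println!("{}", cpu_max_line(Some(50), DEFAULT_PERIOD_US));
    println!("{}", cpu_max_line(Some(200), DEFAULT_PERIOD_US));
    println!("{}", cpu_max_line(None, DEFAULT_PERIOD_US));
}
```

Under these assumptions a 50% quota yields half a period of CPU time per period, and 200% yields two periods' worth, which the kernel can only satisfy by running on two CPUs at once.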

§Worked examples

§Implementation entry points

Every knob ends in crate::cgroup::CgroupOps (production: crate::cgroup::CgroupManager; tests: a recording MockCgroupOps double). apply_setup runs the CgroupDef passes; apply_ops dispatches the Op variants. Both share ctx.cgroups so a test that uses both layers writes through the same RAII teardown (crate::scenario::CgroupGroup::Drop).
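
The pattern described above, where a setup layer and a mid-step layer both write through one backend trait that tests replace with a recording double, can be sketched as follows. Everything here is hypothetical: `KnobWriter`, `RecordingWriter`, `Limits`, `apply_limits`, and `freeze` are illustrative names, and the real CgroupOps method set is not shown on this page.

```rust
use std::cell::RefCell;

/// Hypothetical stand-in for a cgroup backend trait; the real
/// `CgroupOps` methods are not documented here.
trait KnobWriter {
    fn write_knob(&self, cgroup: &str, file: &str, value: &str);
}

/// Test double that records every write instead of touching cgroupfs.
#[derive(Default)]
struct RecordingWriter {
    calls: RefCell<Vec<(String, String, String)>>,
}

impl KnobWriter for RecordingWriter {
    fn write_knob(&self, cgroup: &str, file: &str, value: &str) {
        self.calls
            .borrow_mut()
            .push((cgroup.to_string(), file.to_string(), value.to_string()));
    }
}

/// Hypothetical limits struct: `None` means "inherit, write nothing".
struct Limits {
    cpu_weight: Option<u32>,
    pids_max: Option<u64>,
}

/// Setup-layer sketch: write only the knobs that are `Some`.
fn apply_limits(w: &dyn KnobWriter, cgroup: &str, limits: &Limits) {
    if let Some(v) = limits.cpu_weight {
        w.write_knob(cgroup, "cpu.weight", &v.to_string());
    }
    if let Some(v) = limits.pids_max {
        w.write_knob(cgroup, "pids.max", &v.to_string());
    }
}

/// Mid-step-layer sketch: freezing writes "1" to cgroup.freeze.
fn freeze(w: &dyn KnobWriter, cgroup: &str) {
    w.write_knob(cgroup, "cgroup.freeze", "1");
}

fn main() {
    let w = RecordingWriter::default();
    let limits = Limits { cpu_weight: Some(200), pids_max: None };
    apply_limits(&w, "cg_0", &limits);
    freeze(&w, "cg_0");
    for call in w.calls.borrow().iter() {
        println!("{:?}", call);
    }
}
```

Because both layers take the same trait object, a test can assert on one ordered call log regardless of which layer issued a write.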

§File layout

types holds the data model: Op, CgroupDef, Step, HoldSpec, Setup, CpusetSpec, the per-controller limits structs, and every builder constructor. Re-exported from this module so external paths remain crate::scenario::ops::Op etc. The executor drives that model against crate::cgroup::CgroupOps via apply_setup (sibling setup module) and apply_ops (sibling dispatch module), and exposes the execute_steps / execute_scenario family of public entry points (this file).

Structs§

CgroupDef
Declarative cgroup definition: name + cpuset + synthetic WorkSpec groups + optional userspace Payload.
CpuLimits
CPU controller limits (cpu.max + cpu.weight) for a cgroup. All fields default to “inherit from parent” — the framework only writes each knob when its corresponding field is Some.
IoLimits
IO controller limits (io.weight). Per-device throughput caps (io.max) are intentionally not surfaced here — the per-device interface needs major:minor device-id lookup which has no in-tree consumer; surface it when a concrete use case lands.
MemoryLimits
Memory controller limits (memory.max / memory.high / memory.low / memory.swap.max). Each field is None by default (inherit from parent / no limit).
PidsLimits
Pids controller limits (pids.max). None is the default (inherit from parent — typically "max", no ceiling).
Step
A sequence of ops followed by a hold period.

Enums§

CpusetSpec
How to compute a cpuset from topology.
HoldSpec
How a step advances after its ops are applied. Frac and Fixed hold for a duration; Loop repeatedly re-applies Step::ops at a fixed interval instead of holding.
IrqSelector
Which IRQ Op::SteerIrq targets.
KernelTarget
Host-side write/read target for the kernel-memory ops (Op::WriteKernelHot / Op::WriteKernelCold / Op::ReadKernelHot / Op::ReadKernelCold).
KernelValue
Value payload for the kernel-memory write ops, and the result shape for the read ops.
KernelValueWidth
Width specifier for the Op::ReadKernelHot / Op::ReadKernelCold ops — picks which crate::monitor::guest::GuestKernel read_*_u32 / read_*_u64 / read_*_bytes family the host dispatcher invokes for the read. Mirrors KernelValue’s variant tags but without payload data (reads do not carry an outgoing value — only a width hint that the dispatcher uses to size the resulting crate::vmm::wire::KernelOpValue in the reply).
Op
Atomic operation on the cgroup topology.
OpKind
Auto-generated discriminant enum variants
Setup
How to produce the CgroupDefs for a step’s setup phase.
SpawnPlacement
Placement target for Op::Spawn.

Constants§

PLACEMENT_LOG_PATH
Path of the scenario placement-history log. Captures placement events from both apply_setup (worker / payload spawn) and apply_ops (Op::MoveAllTasks pre/post cgroup.procs snapshots). Tests that assert on per-cgroup worker placement can read this file on failure to see exactly which PIDs landed in which cgroup at spawn time AND how Op::MoveAllTasks migrated them. The file lives in tmpfs and survives the scenario teardown (only the cgroup directories are rmdir’d; this file stays put). The log is append-only and is never cleared between runs (the only writer, append_placement_log, opens with .append(true) and never truncates), so lines from prior runs accumulate — a reader that needs only the current run’s placement must filter accordingly.
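
Since the log accumulates across runs, one way for a reader to isolate the current run is to record the file length before the scenario starts and read only what was appended afterwards. The sketch below illustrates that idea with the standard library alone. The helper names, the temp-file path, and the line format are all hypothetical; the real log's line format is not described on this page.

```rust
use std::fs::{self, File, OpenOptions};
use std::io::{self, Read, Seek, SeekFrom, Write};
use std::path::Path;

/// Append one line without truncating, mirroring `.append(true)`.
fn append_line(path: &Path, line: &str) -> io::Result<()> {
    let mut f = OpenOptions::new().create(true).append(true).open(path)?;
    writeln!(f, "{line}")
}

/// Current log length in bytes; 0 if the file does not exist yet.
fn log_len(path: &Path) -> u64 {
    fs::metadata(path).map(|m| m.len()).unwrap_or(0)
}

/// Read everything appended at or after `offset`.
fn read_from(path: &Path, offset: u64) -> io::Result<String> {
    let mut f = File::open(path)?;
    f.seek(SeekFrom::Start(offset))?;
    let mut out = String::new();
    f.read_to_string(&mut out)?;
    Ok(out)
}

fn main() -> io::Result<()> {
    // Hypothetical scratch path, NOT the real PLACEMENT_LOG_PATH.
    let path = std::env::temp_dir()
        .join(format!("placement_sketch_{}.log", std::process::id()));
    let _ = fs::remove_file(&path);

    append_line(&path, "prior-run pid=100 cgroup=cg_0")?; // stale entry
    let start = log_len(&path); // snapshot before the "current" run
    append_line(&path, "this-run pid=200 cgroup=cg_1")?;

    print!("{}", read_from(&path, start)?);
    fs::remove_file(&path)
}
```

Snapshotting the offset avoids parsing the log at all, at the cost of assuming no other writer appends between the snapshot and the start of the run.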

Functions§

await_accessor_ready
Block until the host freeze coordinator has ADOPTED its kernel-symbol accessor — signalled via the SIGNAL_ACCESSOR_READY wake byte → accessor_ready_latch (set by hvc0_poll_loop). A failure dump captured by a stall AFTER this returns renders real BPF map values instead of placeholders, because the coordinator’s owned_accessor is adopted before the stall fires. Call this in a dump-asserting scenario before triggering its stall (e.g. before execute_steps with a --stall-after scheduler).
execute_defs
Execute a single step with CgroupDefs that hold for the full duration.
execute_scenario
Execute a Backdrop + Steps sequence against the given context.
execute_scenario_with
execute_scenario with an explicit Assert override — the Backdrop equivalent of execute_steps_with.
execute_steps
Execute a sequence of steps against the given context.
execute_steps_with
Execute steps with an explicit Assert for worker checks. When checks is Some, it overrides ctx.assert. When None, uses ctx.assert (the merged three-layer config).
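
The override rule for execute_steps_with (an explicit value wins, otherwise the context's merged config is used) can be sketched in a few lines. The types below are hypothetical stand-ins: the real Assert and context types carry fields not shown on this page, and `max_gap_ms` is an invented field used only to tell the two values apart.

```rust
/// Hypothetical stand-in for the real `Assert` config.
#[derive(Debug, PartialEq)]
struct Assert {
    max_gap_ms: u64, // invented field, for illustration only
}

/// Hypothetical stand-in for the scenario context.
struct Ctx {
    assert: Assert,
}

/// `Some(checks)` overrides the context; `None` falls back to it.
fn effective_assert<'a>(ctx: &'a Ctx, checks: Option<&'a Assert>) -> &'a Assert {
    checks.unwrap_or(&ctx.assert)
}

fn main() {
    let ctx = Ctx { assert: Assert { max_gap_ms: 100 } };
    let strict = Assert { max_gap_ms: 5 };

    println!("{:?}", effective_assert(&ctx, None));
    println!("{:?}", effective_assert(&ctx, Some(&strict)));
}
```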