Composable ops/steps system for dynamic cgroup topology changes.
Op is an atomic cgroup operation. Step sequences ops with a
hold period. CgroupDef bundles create + cpuset + spawn into a
single declaration. execute_steps() runs a step sequence with
scheduler liveness checks and stimulus event recording.
See the Ops and Steps chapter for a guide.
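The relationship between an op, a step, and the executor can be sketched with stand-in types. Everything named `Sketch*` below, along with `run_steps` and `demo_run`, is hypothetical and is not the crate's real definition; the real executor also performs scheduler liveness checks and records stimulus events, which this sketch leaves out.

```rust
use std::time::Duration;

// Hypothetical stand-ins: these are NOT the crate's real `Op` / `Step` types.
#[derive(Debug, Clone, PartialEq)]
enum SketchOp {
    AddCgroup(String),
    SetCpuset { cgroup: String, cpus: String },
    RemoveCgroup(String),
}

struct SketchStep {
    ops: Vec<SketchOp>,
    hold: Duration,
}

/// Apply each step's ops in order, then account for its hold period.
/// Returns the total hold time requested across all steps.
fn run_steps(steps: &[SketchStep], mut apply: impl FnMut(&SketchOp)) -> Duration {
    let mut total = Duration::ZERO;
    for step in steps {
        for op in &step.ops {
            apply(op);
        }
        // A real executor would hold here; the sketch only sums the time.
        total += step.hold;
    }
    total
}

/// Two steps: create and pin a cgroup, hold, then tear it down and hold again.
fn demo_run() -> (Vec<SketchOp>, Duration) {
    let steps = vec![
        SketchStep {
            ops: vec![
                SketchOp::AddCgroup("cg_0".to_string()),
                SketchOp::SetCpuset { cgroup: "cg_0".to_string(), cpus: "0-3".to_string() },
            ],
            hold: Duration::from_secs(2),
        },
        SketchStep {
            ops: vec![SketchOp::RemoveCgroup("cg_0".to_string())],
            hold: Duration::from_secs(1),
        },
    ];
    let mut applied = Vec::new();
    let total = run_steps(&steps, |op| applied.push(op.clone()));
    (applied, total)
}
```

Passing the `apply` callback in, rather than writing to cgroupfs directly, keeps the sequencing logic testable without a real cgroup hierarchy.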
§Cgroup tooling at a glance
ktstr exposes the cgroup v2 surface across two layers — declarative
steady-state via CgroupDef (set at scenario-setup time, holds
for the cgroup’s lifetime) and imperative state-transitions via
Op (applied mid-step, describe transitions over time):
| Knob | Layer | API entry | Underlying file | When to use |
|---|---|---|---|---|
| CPU affinity | setup | CgroupDef::cpuset | cpuset.cpus | Bind workers to a CPU subset for the whole run. |
| NUMA-mem affinity | setup | CgroupDef::cpuset_mems | cpuset.mems | Constrain allocations to specific NUMA nodes. |
| CPU bandwidth | setup | CgroupDef::cpu_quota_pct / CgroupDef::cpu_quota / CgroupDef::cpu_unlimited | cpu.max | Cap CPU time per period (e.g. 50% of one CPU, or 100% of two CPUs). |
| CPU share weight | setup | CgroupDef::cpu_weight | cpu.weight | Bias relative CPU share when siblings contend. |
| Memory ceiling | setup | CgroupDef::memory_max / CgroupDef::memory_unlimited | memory.max | Hard ceiling — exceeding triggers cgroup OOM. |
| Memory throttle | setup | CgroupDef::memory_high | memory.high | Soft throttle: triggers reclaim, not OOM. |
| Memory protection | setup | CgroupDef::memory_low | memory.low | Soft protection: kernel reclaims from siblings first. |
| Swap cap | setup | CgroupDef::memory_swap_max / CgroupDef::memory_swap_unlimited | memory.swap.max | Cap how much memory can spill to swap (CONFIG_SWAP=y). |
| IO share | setup | CgroupDef::io_weight | io.weight | Bias relative IO share when siblings contend. |
| Task ceiling | setup | CgroupDef::pids_max / CgroupDef::pids_unlimited | pids.max | Cap process+thread count — fork/clone returns EAGAIN at limit. |
| Mid-run cpuset rebind | mid-step | Op::set_cpuset / Op::clear_cpuset / Op::swap_cpusets | cpuset.cpus | Move cpuset on a live cgroup mid-scenario. |
| Mid-run task migration | mid-step | Op::move_all_tasks | cgroup.procs | Move workers from one cgroup to another. |
| Pause/resume | mid-step | Op::freeze_cgroup / Op::unfreeze_cgroup | cgroup.freeze | Suspend every task in the cgroup; resume later. |
| Add/remove cgroup | mid-step | Op::add_cgroup / Op::remove_cgroup / Op::stop_cgroup | (cgroupfs mkdir/rmdir) | Spawn / tear down a cgroup mid-scenario. |
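As an illustration of what the CPU bandwidth row ends up writing, the sketch below renders a `cpu.max` line in the cgroup v2 `$MAX $PERIOD` format (both in microseconds). The helper name `cpu_max_line`, and the assumption that a quota is given as a percentage of one CPU, are hypothetical; the crate's `cpu_quota_pct` may define its argument differently.

```rust
/// Default bandwidth period in microseconds (the kernel's cpu.max default).
const DEFAULT_PERIOD_US: u64 = 100_000;

/// Hypothetical helper: render a cpu.max line for a quota given as a
/// percentage of ONE CPU (so 200 means two full CPUs).
/// `None` renders the unlimited form.
fn cpu_max_line(quota_pct: Option<u64>, period_us: u64) -> String {
    match quota_pct {
        Some(pct) => format!("{} {}", period_us * pct / 100, period_us),
        None => format!("max {}", period_us),
    }
}

/// Half of one CPU at the default period.
fn half_cpu() -> String {
    cpu_max_line(Some(50), DEFAULT_PERIOD_US)
}
```

With the default period, half of one CPU renders as `50000 100000` and two full CPUs as `200000 100000`.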
§Worked examples
- Static topology (one cgroup, fixed cpuset, weight-biased compute): CgroupDef type-level docs.
- Suspend/resume (3-Step idiom: run, freeze, run again): Op::FreezeCgroup doc.
- Memory-cap teardown (rewind a base CgroupDef’s swap cap): CgroupDef::memory_swap_unlimited doc.
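The suspend/resume shape can be sketched independently of the crate. `FreezeOp`, `FreezeStep`, and the helper functions are hypothetical stand-ins; the only fact taken from cgroup v2 itself is that writing `1` to `cgroup.freeze` freezes a cgroup and writing `0` thaws it.

```rust
// Hypothetical stand-ins, not the crate's `Op::FreezeCgroup` / `Step`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum FreezeOp {
    Freeze,
    Unfreeze,
}

struct FreezeStep {
    ops: Vec<FreezeOp>,
    hold_secs: u64,
}

/// The three-step shape: run, freeze, run again.
fn suspend_resume(run_secs: u64, frozen_secs: u64) -> Vec<FreezeStep> {
    vec![
        FreezeStep { ops: vec![], hold_secs: run_secs },
        FreezeStep { ops: vec![FreezeOp::Freeze], hold_secs: frozen_secs },
        FreezeStep { ops: vec![FreezeOp::Unfreeze], hold_secs: run_secs },
    ]
}

/// Value a backend would write to cgroup.freeze for each op.
fn freeze_file_value(op: FreezeOp) -> &'static str {
    match op {
        FreezeOp::Freeze => "1",
        FreezeOp::Unfreeze => "0",
    }
}

/// Flatten a step sequence into the ordered cgroup.freeze writes.
fn freeze_writes(steps: &[FreezeStep]) -> Vec<&'static str> {
    steps
        .iter()
        .flat_map(|s| s.ops.iter().map(|op| freeze_file_value(*op)))
        .collect()
}
```

The first step carries no ops, so the workers simply run for its hold period before the freeze is applied.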
§Implementation entry points
Every knob ends in crate::cgroup::CgroupOps (production:
crate::cgroup::CgroupManager; tests: a recording MockCgroupOps
double). apply_setup runs the CgroupDef passes; apply_ops
dispatches the Op variants. Both share ctx.cgroups so a test
that uses both layers writes through the same RAII teardown
(crate::scenario::CgroupGroup::Drop).
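The pattern described here (one backend trait, a recording test double, and a shared RAII guard for teardown) might look roughly like the following. All names are hypothetical stand-ins rather than the crate's `CgroupOps`, `MockCgroupOps`, or `CgroupGroup`, and removing in reverse creation order is an assumption of this sketch.

```rust
use std::cell::RefCell;

/// Hypothetical stand-in for the backend trait.
trait SketchCgroupOps {
    fn create(&self, name: &str);
    fn remove(&self, name: &str);
}

/// Recording double: stores calls instead of touching cgroupfs.
#[derive(Default)]
struct RecordingOps {
    calls: RefCell<Vec<String>>,
}

impl SketchCgroupOps for RecordingOps {
    fn create(&self, name: &str) {
        self.calls.borrow_mut().push(format!("create {}", name));
    }
    fn remove(&self, name: &str) {
        self.calls.borrow_mut().push(format!("remove {}", name));
    }
}

/// RAII guard shared by both layers: every cgroup added through it is
/// removed on drop (newest first, an assumption of this sketch).
struct SketchGroup<'a> {
    ops: &'a dyn SketchCgroupOps,
    names: Vec<String>,
}

impl<'a> SketchGroup<'a> {
    fn new(ops: &'a dyn SketchCgroupOps) -> Self {
        Self { ops, names: Vec::new() }
    }
    fn add(&mut self, name: &str) {
        self.ops.create(name);
        self.names.push(name.to_string());
    }
}

impl Drop for SketchGroup<'_> {
    fn drop(&mut self) {
        for name in self.names.iter().rev() {
            self.ops.remove(name);
        }
    }
}

/// Create one cgroup "from setup" and one "from an op", then drop the guard.
fn demo_teardown() -> Vec<String> {
    let ops = RecordingOps::default();
    {
        let mut group = SketchGroup::new(&ops);
        group.add("cg_0"); // as the setup layer might
        group.add("cg_1"); // as the ops layer might
    } // guard dropped here: both cgroups are removed
    ops.calls.into_inner()
}
```

Because both layers register cgroups with the same guard, a test mixing them gets a single teardown path.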
§File layout
types holds the data model: Op, CgroupDef, Step,
HoldSpec, Setup, CpusetSpec, the per-controller limits
structs, and every builder constructor. Re-exported from this module
so external paths remain crate::scenario::ops::Op etc. The executor
drives that model against crate::cgroup::CgroupOps via apply_setup
(sibling setup module) and apply_ops (sibling dispatch module),
and exposes the execute_steps / execute_scenario family of
public entry points (this file).
Structs§
- CgroupDef: Declarative cgroup definition: name + cpuset + synthetic WorkSpec groups + optional userspace Payload.
- CpuLimits: CPU controller limits (cpu.max + cpu.weight) for a cgroup. All fields default to “inherit from parent”; the framework only writes each knob when its corresponding field is Some.
- IoLimits: IO controller limits (io.weight). Per-device throughput caps (io.max) are intentionally not surfaced here: the per-device interface needs major:minor device-id lookup, which has no in-tree consumer; surface it when a concrete use case lands.
- MemoryLimits: Memory controller limits (memory.max / memory.high / memory.low / memory.swap.max). Each field is None by default (inherit from parent / no limit).
- PidsLimits: Pids controller limits (pids.max). None is the default (inherit from parent, typically "max", no ceiling).
- Step: A sequence of ops followed by a hold period.
Enums§
- CpusetSpec: How to compute a cpuset from topology.
- HoldSpec: How a step advances after its ops are applied. Frac and Fixed hold for a duration; Loop repeatedly re-applies Step::ops at a fixed interval instead of holding.
- IrqSelector: Which IRQ Op::SteerIrq targets.
- KernelTarget: Host-side write/read target for the kernel-memory ops (Op::WriteKernelHot / Op::WriteKernelCold / Op::ReadKernelHot / Op::ReadKernelCold).
- KernelValue: Value payload for the kernel-memory write ops, and the result shape for the read ops.
- KernelValueWidth: Width specifier for the Op::ReadKernelHot / Op::ReadKernelCold ops; picks which crate::monitor::guest::GuestKernel read_*_u32 / read_*_u64 / read_*_bytes family the host dispatcher invokes for the read. Mirrors KernelValue’s variant tags but without payload data (reads do not carry an outgoing value, only a width hint that the dispatcher uses to size the resulting crate::vmm::wire::KernelOpValue in the reply).
- Op: Atomic operation on the cgroup topology.
- OpKind: Auto-generated discriminant enum variants
- Setup: How to produce the CgroupDefs for a step’s setup phase.
- SpawnPlacement: Placement target for Op::Spawn.
Constants§
- PLACEMENT_LOG_PATH: Path of the scenario placement-history log. Captures placement events from both apply_setup (worker / payload spawn) and apply_ops (Op::MoveAllTasks pre/post cgroup.procs snapshots). Tests that assert on per-cgroup worker placement can read this file on failure to see exactly which PIDs landed in which cgroup at spawn time AND how Op::MoveAllTasks migrated them. The file lives in tmpfs and survives the scenario teardown (only the cgroup directories are rmdir’d; this file stays put). The log is append-only and is never cleared between runs (the only writer, append_placement_log, opens with .append(true) and never truncates), so lines from prior runs accumulate; a reader that needs only the current run’s placement must filter accordingly.
Functions§
- await_accessor_ready: Block until the host freeze coordinator has ADOPTED its kernel-symbol accessor, signalled via the SIGNAL_ACCESSOR_READY wake byte → accessor_ready_latch (set by hvc0_poll_loop). A failure dump captured by a stall AFTER this returns renders real BPF map values instead of placeholders, because the coordinator’s owned_accessor is adopted before the stall fires. Call this in a dump-asserting scenario before triggering its stall (e.g. before execute_steps with a --stall-after scheduler).
- execute_defs: Execute a single step with CgroupDefs that hold for the full duration.
- execute_scenario: Execute a Backdrop + Steps sequence against the given context.
- execute_scenario_with: execute_scenario with an explicit Assert override; the Backdrop equivalent of execute_steps_with.
- execute_steps: Execute a sequence of steps against the given context.
- execute_steps_with: Execute steps with an explicit Assert for worker checks. When checks is Some, it overrides ctx.assert. When None, uses ctx.assert (the merged three-layer config).
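The override rule for `checks` versus `ctx.assert` reduces to an `Option` fallback. A minimal sketch, using hypothetical `SketchAssert` and `SketchCtx` types (the field `max_gap_ms` is invented for illustration and is not claimed to exist on the real `Assert`):

```rust
// Hypothetical stand-ins for the crate's `Assert` and context types.
#[derive(Debug, Clone, PartialEq)]
struct SketchAssert {
    max_gap_ms: u64, // invented field, for illustration only
}

struct SketchCtx {
    assert: SketchAssert,
}

/// `Some(checks)` overrides the context's assert; `None` falls back to it.
fn effective_assert<'a>(
    ctx: &'a SketchCtx,
    checks: Option<&'a SketchAssert>,
) -> &'a SketchAssert {
    checks.unwrap_or(&ctx.assert)
}

/// Returns the effective `max_gap_ms` with and without an override.
fn demo_override() -> (u64, u64) {
    let ctx = SketchCtx { assert: SketchAssert { max_gap_ms: 100 } };
    let strict = SketchAssert { max_gap_ms: 10 };
    (
        effective_assert(&ctx, Some(&strict)).max_gap_ms,
        effective_assert(&ctx, None).max_gap_ms,
    )
}
```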