pub struct CgroupManager { /* private fields */ }
RAII manager for cgroup v2 filesystem operations.
Creates, configures, and removes cgroups under a parent directory. Provides cpuset assignment and task migration.
§Outstanding-remove tracking
outstanding_removes counts cgroups whose
Self::remove_cgroup call failed (the directory still exists
in the cgroupfs tree). It increments on every removal failure,
decrements on every removal success, and gates further calls:
once the count exceeds MAX_OUTSTANDING_REMOVES,
Self::remove_cgroup returns Err without attempting the
underlying writes. The counter is AtomicUsize because
scenario code holds the manager behind &dyn CgroupOps and
shares it across threads via &self borrows.
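The counter behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the manager's code: `RemoveBudget`, `try_remove`, and the cap value `4` are hypothetical names and numbers, and the saturating decrement on success is an assumption.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical cap; the real MAX_OUTSTANDING_REMOVES value is not stated here.
const MAX_OUTSTANDING: usize = 4;

/// Illustrative stand-in for the manager's failure counter.
pub struct RemoveBudget {
    outstanding: AtomicUsize,
}

impl RemoveBudget {
    pub fn new() -> Self {
        Self { outstanding: AtomicUsize::new(0) }
    }

    pub fn outstanding(&self) -> usize {
        self.outstanding.load(Ordering::Relaxed)
    }

    /// Runs `attempt` unless the budget is exhausted. A failure increments
    /// the counter; a success decrements it, saturating at zero (assumption).
    pub fn try_remove<F>(&self, attempt: F) -> Result<(), String>
    where
        F: FnOnce() -> Result<(), String>,
    {
        let current = self.outstanding();
        if current > MAX_OUTSTANDING {
            // Gate: bail before touching the filesystem at all.
            return Err(format!("remove budget exhausted ({current} outstanding)"));
        }
        match attempt() {
            Ok(()) => {
                let _ = self.outstanding.fetch_update(
                    Ordering::Relaxed,
                    Ordering::Relaxed,
                    |n| Some(n.saturating_sub(1)),
                );
                Ok(())
            }
            Err(e) => {
                self.outstanding.fetch_add(1, Ordering::Relaxed);
                Err(e)
            }
        }
    }
}
```

Because every method takes `&self`, a value like this can sit behind a shared reference across threads, which is the reason the text gives for choosing an atomic over a plain integer.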
§Walk root
walk_root bounds the cgroup-fs walk for two operations:
- Self::setup walks every ancestor’s cgroup.subtree_control between walk_root and parent.
- Self::drain_tasks and cleanup_recursive drain pids into {walk_root}/cgroup.procs (the writable root exempt from the no-internal-process constraint).
Defaults to /sys/fs/cgroup in Self::new for Mode A (root-owned
cgroup tree). Override via Self::with_walk_root for cgroup-v2
user delegation (Mode B/C: systemd Delegate=yes, container
nsdelegate). The override is validated against parent at
construction — if parent is not at or below walk_root, the
chained call returns an error rather than letting the strip-prefix
walk fall through to an opaque cgroupfs EACCES at the delegation
boundary.
Implementations§
impl CgroupManager
pub fn new(parent: &str) -> Self
Create a manager rooted at the given cgroup v2 path.
The walk root defaults to /sys/fs/cgroup (Mode A: root-owned
cgroup tree). For cgroup-v2 user delegation (Mode B/C), chain
Self::with_walk_root before any Self::setup call.
pub fn with_walk_root(self, root: impl Into<PathBuf>) -> Result<Self>
Retarget the cgroup-fs walk root used by Self::setup and
Self::drain_tasks.
root becomes the upper bound of the
cgroup.subtree_control enable walk and the destination
{root}/cgroup.procs for pid drains. Use for cgroup-v2 user
delegation (Mode B/C) where the operator owns
subtree_control writes only inside the delegated subtree and
a blind walk from /sys/fs/cgroup would EACCES at the
user.slice / container-root boundary.
Returns an error when:
- Either parent or root contains a .. component: Path::starts_with is component-based and treats .. as a literal segment, so /sys/fs/cgroup/op/../escape would component-prefix /sys/fs/cgroup/op while the kernel resolves the path to /sys/fs/cgroup/escape (outside the delegation root). Rejecting .. upfront keeps the prefix invariant honest against canonical-vs-component drift.
- The manager’s parent is not at or below root: without the prefix invariant the Self::setup_under_root strip-prefix gate would silently skip the subtree_control walk and the caller would see downstream EACCES on the first set_* write. Surfaces the misconfiguration upfront with both paths in the error message.
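The two rejection rules can be reproduced with `std::path` alone. The sketch below is illustrative only: `validate_walk_root` and `has_parent_dir` are hypothetical names, and the error type is a plain `String` rather than whatever the crate's `Result` carries.

```rust
use std::path::{Component, Path};

/// Returns true when any component of `p` is `..`.
fn has_parent_dir(p: &Path) -> bool {
    p.components().any(|c| matches!(c, Component::ParentDir))
}

/// Hypothetical validator mirroring the two documented rejection rules.
pub fn validate_walk_root(parent: &Path, root: &Path) -> Result<(), String> {
    // Rule 1: `..` defeats the component-wise prefix test, so reject it first.
    if has_parent_dir(parent) || has_parent_dir(root) {
        return Err(format!(
            "`..` component not allowed: parent={}, root={}",
            parent.display(),
            root.display()
        ));
    }
    // Rule 2: parent must be at or below root (component-wise prefix).
    if !parent.starts_with(root) {
        return Err(format!(
            "parent {} is not at or below walk root {}",
            parent.display(),
            root.display()
        ));
    }
    Ok(())
}
```

Note that `Path::new("/sys/fs/cgroup/op/../escape").starts_with("/sys/fs/cgroup/op")` is `true`, which is exactly the hazard the first rule closes, while `/sys/fs/cgroup/opx` does not match `/sys/fs/cgroup/op` because the comparison works on whole components.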
pub fn parent_path(&self) -> &Path
Path to the parent cgroup directory.
pub fn walk_root(&self) -> &Path
Path to the cgroup-fs root Self::setup walks down from and
Self::drain_tasks drains pids to. See Self::with_walk_root.
pub fn outstanding_removes(&self) -> usize
Count of un-removed cgroups currently tracked by this
manager — incremented when Self::remove_cgroup fails,
decremented when it succeeds. Exposed for tests and for
callers that want to inspect the budget without forcing a
remove attempt.
pub fn setup(&self, controllers: &BTreeSet<Controller>) -> Result<()>
Create the parent directory and enable the requested cgroup
controllers in every ancestor cgroup.subtree_control between
self.walk_root (default /sys/fs/cgroup) and self.parent.
Pass the controllers the test actually needs — empty set means
“create the parent dir, write nothing to subtree_control”. The
scenario runtime computes the controller union from
CgroupDef declarations
(cpuset/cpuset_mems → Controller::Cpuset, cpu →
Controller::Cpu, memory → Controller::Memory, pids →
Controller::Pids, io → Controller::Io) so a test
that never sets a memory limit never enables +memory and
vice versa. cgroup.freeze and cgroup.procs are
cgroup-core, ungated by any controller, and need no entry.
§Walk root
The ancestor walk stops at self.walk_root so cgroup-v2 user
delegation (Mode B/C) does not attempt subtree_control writes
above the delegation boundary. Self::with_walk_root
retargets the walk and validates that self.parent is at or
below walk_root.
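As a rough sketch of the bounded walk, the function below lists the `cgroup.subtree_control` files from the walk root down to the parent. `subtree_control_paths` is a hypothetical helper, and including both endpoints in the list is an assumption about how the walk is bounded.

```rust
use std::path::{Path, PathBuf};

/// Lists subtree_control files top-down, from `walk_root` to `parent`.
/// Returns None when `parent` is not at or below `walk_root`.
pub fn subtree_control_paths(walk_root: &Path, parent: &Path) -> Option<Vec<PathBuf>> {
    // strip_prefix fails exactly when the prefix invariant does not hold.
    let rel = parent.strip_prefix(walk_root).ok()?;
    let mut dir = walk_root.to_path_buf();
    let mut out = vec![dir.join("cgroup.subtree_control")];
    for comp in rel.components() {
        dir.push(comp);
        out.push(dir.join("cgroup.subtree_control"));
    }
    Some(out)
}
```

The order is top-down because the kernel only accepts a controller in a cgroup's `subtree_control` once that cgroup's own parent has enabled it.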
§Availability check
Each requested controller is verified against
{walk_root}/cgroup.controllers before any write. A
requested controller missing from the kernel’s available set
surfaces as controller {ctrl} not available; cgroup.controllers = {available:?} rather than the bare ENOENT/EACCES the
downstream set_* write would otherwise emit.
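A minimal sketch of such a pre-write check, assuming the contents of `cgroup.controllers` have already been read into a string (the file is a single whitespace-separated line). `check_available` is a hypothetical name, and controllers are plain `&str` here instead of the crate's `Controller` type.

```rust
use std::collections::BTreeSet;

/// Verifies every requested controller appears in the `cgroup.controllers`
/// contents, returning a message shaped like the one quoted above.
pub fn check_available(
    requested: &BTreeSet<&str>,
    controllers_file: &str,
) -> Result<(), String> {
    let available: BTreeSet<&str> = controllers_file.split_whitespace().collect();
    for ctrl in requested {
        if !available.contains(ctrl) {
            return Err(format!(
                "controller {ctrl} not available; cgroup.controllers = {available:?}"
            ));
        }
    }
    Ok(())
}
```

An empty request set passes trivially, matching the "create the parent dir, write nothing" case.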
§Error propagation
All filesystem writes propagate via ?. A user inspecting
RUST_BACKTRACE=1 output sees the exact subtree_control path
that failed and the underlying errno, instead of a swallowed
tracing::warn! followed by a downstream EACCES at the
controller-knob write site.
pub fn create_cgroup(&self, name: &str) -> Result<()>
Create a child cgroup directory.
For nested paths (e.g. "cg_0/narrow"), enables +cpuset on
each intermediate cgroup’s subtree_control so the leaf has
cpuset.cpus / cpuset.mems files available. The kernel
requires each parent to have the controller in
subtree_control for its children to have the corresponding
files (cgroup_control() returns parent->subtree_control).
§Limitation: only +cpuset is propagated through nested
intermediates
Self::enable_subtree_cpuset writes ONLY +cpuset to each
intermediate’s cgroup.subtree_control; the +cpu /
+memory / +pids / +io controllers enabled by
Self::setup cover only the manager’s parent cgroup, not
arbitrary intermediate cgroups created via nested
create_cgroup calls. As a result, a nested leaf like
"cg_0/narrow" exposes cpuset.* knobs but NOT
memory.max / pids.max / io.weight. If a future
CgroupDef addresses such
a leaf with a memory/pids/io knob, the corresponding
set_* write will return ENOENT.
Today’s in-tree consumers (host topology cpuset locks,
BuildSandbox, scenario ops) only nest cgroups for cpuset
scoping, so this matches the actual surface the framework
exercises. Extending Self::enable_subtree_cpuset to
propagate the remaining controllers across intermediates is
straightforward (write the same controller list as
Self::setup uses) but is deferred until a use case
concretely needs it; without one, the wider write would
race against concurrent sibling cgroup creation under the
same intermediate without buying anything.
pub fn add_parent_subtree_controller(&self, controller: &str) -> Result<()>
Enable a controller on the parent cgroup’s cgroup.subtree_control.
Writes +{controller} to {parent}/cgroup.subtree_control so
children created under the parent inherit the controller and
expose the corresponding *.cpus, *.mems, etc. files. No-op
(returns Ok) when the subtree_control file does not exist —
callers treat that as “parent is not a cgroup v2 node” and
degrade elsewhere.
Unlike Self::setup and Self::enable_subtree_cpuset,
which swallow write failures via tracing::warn!, this method
propagates the underlying std::io::Error so callers can
classify errnos (EACCES/EPERM for permission, EBUSY for a
peer holding the subtree) via anyhow_first_io_errno and
map them to operator-facing degrade variants. Used by
crate::vmm::cgroup_sandbox::BuildSandbox::try_create under
the --cpu-cap hard-error contract.
pub fn remove_cgroup(&self, name: &str) -> Result<()>
Drain tasks from a child cgroup and remove it.
Auto-unfreezes the cgroup before draining: a frozen cgroup that
reaches teardown (e.g. a step body issues Op::FreezeCgroup and
never pairs it with Op::UnfreezeCgroup) would migrate its
frozen tasks to the cgroup root via drain_tasks and rely on
the kernel’s cgroup_freezer_migrate_task to clear the JOBCTL
freeze bit when the destination cgroup is unfrozen. The kernel
path is correct, but writing cgroup.freeze=0 first makes the
teardown deterministic regardless of who froze the cgroup and
when. Silently tolerates ENOENT on the freeze file (cgroup directory
already gone, or a kernel older than 5.2 that predates the cgroup v2
cgroup.freeze file); only non-ENOENT failures warn.
§Post-drain settle window
Between Self::drain_tasks and rmdir,
remove_cgroup_inner calls wait_for_cgroup_unpopulated with
a 1s budget. Writes to cgroup.procs queue the task move but
the source cgroup’s populated state only clears once the
per-task css_set switch completes — rmdir returns EBUSY
while the cgroup is still populated. Rather than a blind
sleep, the wait is event-driven: it blocks on an
inotify(IN_MODIFY) watch of the cgroup’s cgroup.events file
and returns as soon as that file reports populated 0, so it
wakes on the actual kernel state-transition write.
The wait falls through to rmdir on deadline (or when
cgroup.events is absent / inotify setup fails), so a
genuinely stuck-populated cgroup still surfaces the same
EBUSY error from the subsequent rmdir.
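The settle check reduces to parsing the `populated` key of `cgroup.events` and giving up at a deadline. The sketch below swaps the inotify watch for a plain polling loop so it stays within the standard library; that substitution, and the names `populated` and `wait_unpopulated`, are assumptions made for illustration only.

```rust
use std::time::{Duration, Instant};

/// Parses the `populated` key out of `cgroup.events` contents.
pub fn populated(events: &str) -> Option<bool> {
    events.lines().find_map(|line| {
        let mut it = line.split_whitespace();
        match (it.next(), it.next()) {
            (Some("populated"), Some("0")) => Some(false),
            (Some("populated"), Some("1")) => Some(true),
            _ => None,
        }
    })
}

/// Polling stand-in for the event-driven wait: returns true once the file
/// reports `populated 0`, false on deadline or when the file is unreadable.
pub fn wait_unpopulated<F>(mut read_events: F, budget: Duration) -> bool
where
    F: FnMut() -> Option<String>,
{
    let deadline = Instant::now() + budget;
    loop {
        match read_events() {
            None => return false, // file absent: let rmdir report the real state
            Some(s) if populated(&s) == Some(false) => return true,
            Some(_) => {}
        }
        if Instant::now() >= deadline {
            return false;
        }
        std::thread::sleep(Duration::from_millis(5));
    }
}
```

A `false` return is not an error in itself; as described above, the caller still attempts `rmdir` and surfaces `EBUSY` from there if the cgroup is genuinely stuck.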
§Outstanding-remove cap
A churn workload (rapid create→remove cycles) may legitimately
race freeze/drain and see EBUSY/ENOENT on individual remove
calls. Each failed remove increments
Self::outstanding_removes; once the counter exceeds
MAX_OUTSTANDING_REMOVES, the next call returns Err
without attempting any filesystem writes — bounding the peak
resident cgroup leak to that cap regardless of how long the
scenario runs. Successful removes decrement the counter, so a
transient stall that eventually clears (e.g. RCU drain
catches up between iterations) does not strand the manager
in the bailed state.
A name whose directory does not exist returns Ok(())
without touching the counter — the cgroup was already
reaped (e.g. by Self::cleanup_all or a prior remove),
so it is not “outstanding”.
pub fn set_cpuset(&self, name: &str, cpus: &BTreeSet<usize>) -> Result<()>
Write cpuset.cpus for a child cgroup.
On write failure, captures and emits a snapshot of the
cgroup-tree state at the moment of failure: the parent’s
cgroup.controllers (controllers AVAILABLE to children),
the parent’s cgroup.subtree_control (controllers ENABLED
for children), the child’s cgroup.controllers (the
inheritance ROOT for children of the child), the
cpuset.cpus file’s existence, and a directory listing of
the child cgroup’s knob files. The capture lets a kernel /
hierarchy-state bug surface as a focused diagnostic instead
of a bare EACCES at the write site.
pub fn clear_cpuset(&self, name: &str) -> Result<()>
Clear cpuset.cpus for a child cgroup (empty string = inherit parent).
pub fn set_cpuset_mems(&self, name: &str, nodes: &BTreeSet<usize>) -> Result<()>
Write cpuset.mems for a child cgroup. Constrains which NUMA
nodes the cgroup’s tasks can allocate memory on.
Shape mirrors set_cpuset exactly — TestTopology::cpuset_string
range-compact-formats the node set, write_with_timeout bounds
the filesystem-write at 2s. Used by BuildSandbox under the
--cpu-cap flow to bind build memory to the NUMA nodes hosting
the locked LLCs, avoiding cross-socket DRAM latency for gcc’s
symbol tables and linker working sets.
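For readers unfamiliar with the range-compact list format that `cpuset.cpus` and `cpuset.mems` accept (for example `0-3,8`), here is one way to produce it from a sorted set. This is a sketch of the general technique, not the body of `TestTopology::cpuset_string`; `compact_ranges` and `fmt_run` are hypothetical names.

```rust
use std::collections::BTreeSet;

/// Formats one inclusive run as "n" or "start-end".
fn fmt_run((start, end): (usize, usize)) -> String {
    if start == end {
        start.to_string()
    } else {
        format!("{start}-{end}")
    }
}

/// Collapses consecutive ids into ranges and joins the runs with commas.
pub fn compact_ranges(ids: &BTreeSet<usize>) -> String {
    let mut parts: Vec<String> = Vec::new();
    let mut run: Option<(usize, usize)> = None;
    // BTreeSet iterates in ascending order, so runs are found in one pass.
    for id in ids.iter().copied() {
        run = match run {
            Some((start, end)) if id == end + 1 => Some((start, id)),
            Some(done) => {
                parts.push(fmt_run(done));
                Some((id, id))
            }
            None => Some((id, id)),
        };
    }
    if let Some(done) = run {
        parts.push(fmt_run(done));
    }
    parts.join(",")
}
```

An empty set yields an empty string, which is also the value the clear_* methods write to fall back to inheriting from the parent.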
§Ordering contract
Caller MUST have already called Self::set_cpuset (or
equivalent direct write to cpuset.cpus) and — when running
under a parent that may narrow the set — MUST have read back
cpuset.cpus.effective to detect kernel-side narrowing
BEFORE invoking this method. The per-knob ordering is
load-bearing: crate::vmm::cgroup_sandbox::BuildSandbox
interleaves cpuset.cpus.effective readback between the
cpuset.cpus and cpuset.mems writes to abort on narrowing
under the --cpu-cap hard-error contract; folding the two
writes into a single helper would erase that gate.
A cgroup whose cpuset.cpus is set should also have a
non-empty cpuset.mems.effective before any task is migrated
into it: the half-configured shape (cpus set locally, no
nodemask anywhere up the hierarchy) is suspicious enough that
the framework refuses it. The kernel itself does NOT
SIGKILL on first allocation — guarantee_online_mems
(kernel/cgroup/cpuset.c) walks UP via parent_cs(cs) until
effective_mems intersects node_states[N_MEMORY], and the
top cpuset always has online memory, so the walk always finds
a non-empty mask. The actual kernel behavior under a fully
empty hierarchy is path-dependent (parent-walk fallback
generally succeeds; degenerate states without any online
memory may OOM). cgroup v2’s cpuset_can_attach_check only
rejects empty effective_cpus, not empty effective_mems.
In cgroup v2, the local cpuset.mems file is normally empty
(the cgroup inherits from its parent via effective_mems),
so reading the local file alone would falsely flag every
inheriting child. Self::move_task enforces the gate at
runtime by reading the cgroup’s cpuset.cpus and
cpuset.mems.effective files before each migration and
refusing the write if cpuset.cpus is non-empty while
cpuset.mems.effective is empty — surfacing a focused
error rather than letting a half-configured cgroup through
to the kernel’s path-dependent behavior.
pub fn clear_cpuset_mems(&self, name: &str) -> Result<()>
Clear cpuset.mems for a child cgroup (empty string = inherit parent).
Parallels clear_cpuset; callers use it only when tearing
down a cpuset-restricted cgroup that needs to accept a
fresh task binding with a different NUMA budget.
pub fn set_cpu_max(
    &self,
    name: &str,
    quota_us: Option<u64>,
    period_us: u64,
) -> Result<()>
Write cpu.max for a child cgroup. quota_us = None writes
"max <period_us>" (no upper bound — same as a freshly
created cgroup); Some(q) writes "<q> <period_us>".
Per the kernel’s cgroup v2 docs (“Documentation/admin-guide/
cgroup-v2.rst”, “CPU Interface Files”): each period the
cgroup gets quota microseconds of CPU time across its
CPUs, and is throttled until the next period boundary once
the quota is exhausted. quota MAY exceed period to let
the cgroup use multiple CPUs concurrently (e.g. quota
200_000 / period 100_000 = up to 2 CPUs of throughput).
Requires +cpu in the parent’s cgroup.subtree_control;
missing controller surfaces as ENOENT on the file (handled
generically by write_with_timeout’s error path with the
errno suffix).
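The value written to `cpu.max` and the ceiling it implies can be worked through in a few lines. `cpu_max_value` and `cpu_ceiling` are hypothetical helpers written for this illustration; they only mirror the string shapes and the quota/period arithmetic described above.

```rust
/// Builds the string written to cpu.max: "<quota> <period>" or "max <period>".
pub fn cpu_max_value(quota_us: Option<u64>, period_us: u64) -> String {
    match quota_us {
        Some(q) => format!("{q} {period_us}"),
        None => format!("max {period_us}"),
    }
}

/// CPUs' worth of throughput the quota allows per period; None when unlimited.
pub fn cpu_ceiling(quota_us: Option<u64>, period_us: u64) -> Option<f64> {
    quota_us.map(|q| q as f64 / period_us as f64)
}
```

With a quota of 200_000 and a period of 100_000 the ceiling is 2.0, the "up to 2 CPUs" case from the text; a quota of 50_000 over the same period caps the cgroup at half a CPU.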
pub fn set_cpu_weight(&self, name: &str, weight: u32) -> Result<()>
Write cpu.weight for a child cgroup (cgroup v2 weight,
range 1..=10000, default 100). Used together with sibling
cgroups to bias relative CPU share inside the parent’s
quota. Independent from cpu.max — weights govern share
when CPU is contended, max enforces an absolute ceiling.
Per “Documentation/admin-guide/cgroup-v2.rst”, cpu.weight.nice is an
alternative interface to the same weight, expressed as a nice value
(range -20..=19); the cgroup v1 counterpart is cpu.shares. This
method targets the canonical cpu.weight knob.
pub fn set_memory_max(&self, name: &str, bytes: Option<u64>) -> Result<()>
Write memory.max for a child cgroup. bytes = None writes
"max" (no hard limit). When the cgroup’s memory usage reaches
the limit and cannot be reduced by reclaim, the kernel invokes
the OOM killer inside the cgroup, per the documented
memory.max semantics. Requires +memory in the parent’s
cgroup.subtree_control.
pub fn set_memory_high(&self, name: &str, bytes: Option<u64>) -> Result<()>
Write memory.high for a child cgroup. bytes = None
writes "max" (no high-water mark). Crossing the high
threshold triggers reclaim throttling but NOT OOM-kill,
distinguishing it from memory.max.
pub fn set_memory_low(&self, name: &str, bytes: Option<u64>) -> Result<()>
Write memory.low for a child cgroup. bytes = None writes
"0" (no low-water protection). The kernel preferentially
reclaims FROM other cgroups before reclaiming this cgroup’s
memory below memory.low; not a hard reservation.
pub fn set_io_weight(&self, name: &str, weight: u16) -> Result<()>
Write io.weight for a child cgroup (cgroup v2 weight,
range 1..=10000, default 100). Biases relative IO share
across sibling cgroups when the io controller is enabled
in the parent’s cgroup.subtree_control. The kernel’s BFQ
or io.cost backend (whichever is active) applies the
weight when contending devices are saturated.
io.max (per-device throughput cap) is intentionally NOT
surfaced here — the per-device interface needs major:minor
device-id lookup which has no in-tree consumer; surface it
when a concrete use case lands.
pub fn set_freeze(&self, name: &str, frozen: bool) -> Result<()>
Write cgroup.freeze for a child cgroup. frozen = true writes
"1", frozen = false writes "0".
cgroup.freeze is a cgroup-core file exposed on every non-root
cgroup automatically — it is NOT gated by cgroup.subtree_control.
The kernel’s cgroup_freeze_write parses the value via
kstrtoint, rejects anything outside {0, 1} with -ERANGE,
and dispatches cgroup_freeze(cgrp, freeze). Writing 1 to a
cgroup containing tasks transitions every task in the subtree to
the frozen state; writing 0 releases. The transition is
asynchronous — cgroup.events’s frozen field reaches 1 once
every task has parked.
pub fn set_pids_max(&self, name: &str, max: Option<u64>) -> Result<()>
Write pids.max for a child cgroup. max = None writes "max"
(the kernel’s PIDS_MAX_STR sentinel for unlimited);
Some(n) writes the decimal n.
Per the kernel’s pids_max_write: the parser short-circuits to
the unlimited limit when buf == PIDS_MAX_STR; otherwise
kstrtoll(buf, 0, &limit) parses a signed integer and rejects
< 0 or >= PIDS_MAX with -EINVAL. The update is atomic
(atomic64_set(&pids->limit, limit)); existing tasks are NOT
killed when the limit lands below the current task count — only
future fork() / clone() calls are blocked.
Requires +pids in the parent’s cgroup.subtree_control;
Self::setup enables it only when Controller::Pids is in the
requested controller set. Without it the file is absent and the
write returns ENOENT.
pub fn set_memory_swap_max(&self, name: &str, bytes: Option<u64>) -> Result<()>
Write memory.swap.max for a child cgroup. bytes = None writes
"max" (no swap cap); Some(b) writes the decimal byte count.
Per the kernel’s swap_max_write: the value is parsed via
page_counter_memparse(buf, "max", &max), which accepts the
literal "max" token for unlimited or a numeric byte count.
The store is xchg(&memcg->swap.max, max) — atomic, with no
failure path beyond the parse.
Requires +memory in the parent’s cgroup.subtree_control;
Self::setup enables it only when Controller::Memory is in the
requested controller set.
Requires CONFIG_SWAP=y in the test kernel. The file does not exist on swapless builds; the write returns ENOENT.
pub fn move_task(&self, name: &str, pid: pid_t) -> Result<()>
Move a single task into a child cgroup via cgroup.procs.
move_task is host-side scenario orchestration, never
invoked from a vCPU thread, so the bare fs::read_to_string
reads in Self::check_cpuset_ordering are not bounded by
the freeze-rendezvous timeout. A wedged cgroupfs read here
would stall the orchestrator thread, not a vCPU.
§cpuset ordering gate
Before issuing the cgroup.procs write, the method reads the
destination’s cpuset.cpus (the local-write knob the caller
either set or did not) and cpuset.mems.effective (the
kernel’s effective view, inheritance-aware). The gate
refuses migrations into a cgroup whose cpuset.cpus is set
but cpuset.mems.effective reads empty — a half-configured
state we surface as a focused error rather than letting it
through to the kernel.
The kernel’s behavior on the half-configured shape is
path-dependent: guarantee_online_mems
(kernel/cgroup/cpuset.c) walks UP via parent_cs(cs)
until effective_mems intersects node_states[N_MEMORY],
and the top cpuset always has online memory, so the walk
generally succeeds; the empty-nodemask OOM path is reachable
only in degenerate hierarchies. cgroup v2’s
cpuset_can_attach_check rejects only empty effective_cpus
(not empty effective_mems), so a v2 attach into a cgroup
with empty effective_mems is not a hard kernel error
either. The framework refuses the migration anyway because
the half-configured shape almost always reflects a missing
Self::set_cpuset_mems call; surfacing it directly is
more debuggable than letting it become whatever the kernel
happens to do on this particular hierarchy.
§Why cpuset.mems.effective, not cpuset.mems
In cgroup v2, the local cpuset.mems file echoes
cs->mems_allowed — the LOCAL nodemask, which is empty by
default until the caller explicitly writes it. The kernel’s
allocation path uses cs->effective_mems instead, which
inherits from the parent when the local mask is empty (per
cpuset_common_seq_show’s FILE_EFFECTIVE_MEMLIST branch and
guarantee_online_mems’s parent_cs(cs) walk). A gate that
reads the local file would falsely flag every inheriting
child as half-configured even though the kernel sees a
perfectly valid effective_mems from the parent. The
effective view captures both “this cgroup wrote cpuset.mems
directly” and “this cgroup inherits a non-empty mask from
its parent” without false positives.
Both reads are best-effort — a cgroup without cpuset
controllers (cpuset.cpus does not exist) bypasses the
gate, matching the kernel’s “no cpuset constraints to
enforce” path. Read errors on either knob are absorbed: the
gate exists to catch the configured-but-half-configured
shape, not to fight cgroupfs read failures. If
cpuset.mems.effective cannot be read for any reason, the
gate degrades to “accept” — it cannot make a sound decision
without the kernel’s effective view.
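The gate's decision table is small enough to write out. In this sketch, `None` stands for "file missing or unreadable" and `Some(contents)` for a successful read; the `Gate` enum and `cpuset_ordering_gate` are hypothetical names, not the crate's `check_cpuset_ordering` internals.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Gate {
    Allow,
    Refuse,
}

/// Refuses only the half-configured shape: cpuset.cpus set locally while
/// cpuset.mems.effective was read successfully and is empty.
pub fn cpuset_ordering_gate(cpus: Option<&str>, mems_effective: Option<&str>) -> Gate {
    let cpus_set = matches!(cpus, Some(s) if !s.trim().is_empty());
    let mems_empty = matches!(mems_effective, Some(s) if s.trim().is_empty());
    if cpus_set && mems_empty {
        Gate::Refuse
    } else {
        // Covers: no cpuset controller, read failures, cpus unset,
        // and the healthy case where effective mems is non-empty.
        Gate::Allow
    }
}
```

Every read failure maps to `Allow`, which matches the best-effort behaviour described above: the gate only speaks up when it has both values in hand.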
pub fn place_task_during_handshake(
    &self,
    cgroup_name: &str,
    child_pid: pid_t,
) -> Result<()>
Write child_pid to <cgroup_name>/cgroup.procs during the
payload-spawn cgroup-sync handshake.
Distinct from Self::move_task: this is the
placement-before-exec write that runs while the child is
paused in pre_exec between fork(2) and execve(2). The
move_task cpuset-ordering gate does NOT apply here —
placement runs before cpuset is finalised at scenario setup
time, and the gate would reject otherwise-valid spawn
requests. Callers that need the gate (post-spawn migration)
invoke Self::move_task / Self::move_tasks instead.
Uses the same write_with_timeout shape as the other
cgroup.procs write sites so a wedged cgroupfs is bounded
to CGROUP_WRITE_TIMEOUT rather than blocking the parent
indefinitely.
pub fn move_tasks(&self, name: &str, pids: &[pid_t]) -> Result<()>
Move multiple tasks into a child cgroup by PID.
Tolerates per-pid ESRCH (a task that exited between the listing
snapshot and the migration write) and logs a warn for each
vanished pid — partial migration is a legitimate outcome when
one of N workers has voluntarily exited. Retries EBUSY up to
3 times with 100ms backoff for transient rejections from
sched_ext BPF cgroup_prep_move callbacks
(scx_cgroup_can_attach). Propagates EBUSY after retries
exhausted. Propagates all other errors immediately.
§All-vanished bail
When pids is non-empty AND every supplied pid ESRCH’d, this
fn bails with an actionable diagnostic rather than silently
returning Ok. The silent-Ok path violates the project’s
no-silent-drops rule (any data loss must fail loudly):
a downstream consumer reading the destination
cgroup.procs would see 0 pids and have no idea whether
the migration was supposed to move 0 or N — masking a real
test-setup regression (e.g. WorkloadHandle::spawn child
pre_exec init-panic cascade that killed every paused worker
before move_tasks ran) behind a downstream-state empty-read.
A test that LEGITIMATELY moves only already-exited workers
(post-Drop diagnostic, post-mortem capture) should pass an
empty pids slice rather than calling with non-empty + all
pre-vanished — the empty-slice path is the documented “no
move requested” form that returns Ok cleanly.
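The error policy above (skip ESRCH, retry EBUSY with backoff, propagate everything else, bail when every pid vanished) can be sketched against an injected write closure. `move_all` is a hypothetical function; the errno constants are the Linux values hardcoded to avoid a `libc` dependency, and reading "up to 3 times" as three retries after the first attempt is an assumption.

```rust
use std::io;
use std::thread::sleep;
use std::time::Duration;

// Linux errno values, hardcoded so the sketch stays std-only.
const ESRCH: i32 = 3;
const EBUSY: i32 = 16;

/// Applies the per-pid policy and returns how many pids were moved.
pub fn move_all<W>(
    pids: &[i32],
    mut write_pid: W,
    retries: u32,
    backoff: Duration,
) -> io::Result<usize>
where
    W: FnMut(i32) -> io::Result<()>,
{
    let mut moved = 0;
    for &pid in pids {
        let mut busy = 0;
        loop {
            match write_pid(pid) {
                Ok(()) => {
                    moved += 1;
                    break;
                }
                // Task exited before the write: tolerated, move on.
                Err(e) if e.raw_os_error() == Some(ESRCH) => break,
                // Transient rejection: back off and retry within budget.
                Err(e) if e.raw_os_error() == Some(EBUSY) && busy < retries => {
                    busy += 1;
                    sleep(backoff);
                }
                // Exhausted EBUSY or any other errno: propagate.
                Err(e) => return Err(e),
            }
        }
    }
    if !pids.is_empty() && moved == 0 {
        // Only reachable when every pid hit ESRCH.
        return Err(io::Error::new(
            io::ErrorKind::Other,
            format!("all {} pids vanished before migration", pids.len()),
        ));
    }
    Ok(moved)
}
```

Passing an empty slice returns `Ok(0)` without calling the closure, which corresponds to the "no move requested" form.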
pub fn clear_subtree_control(&self, name: &str) -> Result<()>
Clear subtree_control on a child cgroup by writing an empty
string. Disables all controllers for the cgroup’s children.
Required before moving tasks into a cgroup that has
subtree_control set: the kernel’s no-internal-process
constraint (cgroup_migrate_vet_dst) returns EBUSY when
tasks are written to cgroup.procs of a cgroup with
controllers in subtree_control.
pub fn drain_tasks(&self, name: &str) -> Result<()>
Move all tasks from a child cgroup to the walk-root cgroup.
Drains to {self.walk_root}/cgroup.procs instead of the
parent because the parent has subtree_control set (enabling
cpuset for children), and the kernel’s no-internal-process
constraint rejects writes to cgroup.procs when
subtree_control is active. The walk-root cgroup is the
uppermost cgroup the operator can write to without crossing
the delegation boundary; under Mode A it is the canonical
/sys/fs/cgroup root (exempt from the no-internal-process
constraint), under Mode B/C it is the delegated subtree root
(which also has procs-writability inside the delegation).
pub fn read_procs(&self, name: &str) -> Result<Vec<pid_t>>
Read cgroup.procs of name, returning the thread-group
leaders (PIDs) currently in the cgroup.
Distinct from Self::drain_tasks:
- drain_tasks MIGRATES tasks to the walk-root and treats a missing cgroup.procs file as a no-op (Ok(())) so best-effort teardown of an already-rmdir’d cgroup is safe.
- read_procs is a READ accessor for assertions (Op::CaptureCgroupProcs and direct callers). A missing cgroup.procs file is a real error (cgroup doesn’t exist, typo’d name, race with teardown); propagating it lets the caller distinguish “empty cgroup” from “no such cgroup.”
§Semantics
- Returns thread-group leaders (PIDs / TGIDs) as the kernel exposes them via cgroup_procs_show in kernel/cgroup/cgroup.c. For per-thread TIDs the kernel exposes cgroup.threads; this method reads ONLY cgroup.procs.
- Non-atomic snapshot as exposed by the kernel’s pidlist iteration (cgroup_procs_show / css_task_iter_next in kernel/cgroup/cgroup.c): the kernel walks the css_set’s task list one entry at a time, so a task that joins or exits mid-read can appear in the next read but not this one (or vice versa). The userspace fs::read_to_string here returns when seq_file signals EOF; the per-pid atomicity is a kernel property, not an impl one. Callers asserting on membership of a stable task set (e.g. SpinWait workers spawned in the prior op) are unaffected.
- Empty cgroup: returns Ok(Vec::new()) (kernel emits an empty file, not an error). Lets callers distinguish “no tasks” from “no such cgroup.”
- Malformed pid lines: skipped with a tracing::warn! naming the offending line, matching drain_pids_to_root’s tolerance. The kernel never emits such lines today; the tolerance exists so a future kernel gaining a header or comment line surfaces as a warn instead of an opaque parse error.
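The parsing half of these semantics fits in a short function. `parse_procs` is a hypothetical helper: it takes the file contents as a string, uses `i32` in place of `pid_t` (its width on Linux), and returns skipped lines to the caller where the real method is described as logging them.

```rust
/// Splits `cgroup.procs` contents into parsed pids and skipped lines.
pub fn parse_procs(contents: &str) -> (Vec<i32>, Vec<String>) {
    let mut pids = Vec::new();
    let mut skipped = Vec::new();
    for line in contents.lines() {
        let line = line.trim();
        if line.is_empty() {
            continue;
        }
        match line.parse::<i32>() {
            Ok(pid) => pids.push(pid),
            // Tolerated rather than fatal; the caller decides how to report it.
            Err(_) => skipped.push(line.to_string()),
        }
    }
    (pids, skipped)
}
```

An empty file produces two empty vectors, which is the "no tasks" case; the "no such cgroup" case never reaches this function because the read itself fails first.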
pub fn cleanup_all(&self) -> Result<()>
Remove all child cgroups under the parent (keeps the parent itself).
Returns Ok even when individual filesystem probes fail; callers
treat cleanup as best-effort teardown (see the runner’s warn-
and-continue in src/runner.rs). Per-entry read_dir /
DirEntry / file_type errors are surfaced via
tracing::warn! — mirrors CgroupGroup::drop so a failure
shows up in logs instead of silently leaving children behind.
§Outer-read_dir failure semantic
When read_dir(self.parent) itself fails — e.g. the parent
directory is unreadable, the cgroup mount has been unmounted
out from under us, or a stat-side IO error fires — the
failure is surfaced via tracing::warn! and the function
still returns Ok(()). The deliberate semantic here is
“teardown that observes a hostile filesystem state must
not block scenario completion”: a hard Err would propagate
up through the runner’s teardown and abort the whole test
run on a transient cgroupfs failure that the operator can
follow up on by reading the warn line.
Production callers (the runner’s drop path, scenario teardown)
already log-and-continue on cleanup_all errors, so the
always-Ok return is consistent with how every consumer
already treats the result. Operators who need to detect
teardown leakage should grep tracing output for
"cleanup_all: read_dir failed" rather than relying on a
non-zero exit; the warn includes both the offending path and
the underlying io::Error.
Trait Implementations§
impl CgroupOps for CgroupManager
fn parent_path(&self) -> &Path
See CgroupManager::parent_path.
fn setup(&self, controllers: &BTreeSet<Controller>) -> Result<()>
See CgroupManager::setup.
fn create_cgroup(&self, name: &str) -> Result<()>
See CgroupManager::create_cgroup.
fn remove_cgroup(&self, name: &str) -> Result<()>
See CgroupManager::remove_cgroup.
fn set_cpuset(&self, name: &str, cpus: &BTreeSet<usize>) -> Result<()>
Write cpuset.cpus. See CgroupManager::set_cpuset.
fn clear_cpuset(&self, name: &str) -> Result<()>
Clear cpuset.cpus (inherit from parent). See CgroupManager::clear_cpuset.
fn set_cpuset_mems(&self, name: &str, nodes: &BTreeSet<usize>) -> Result<()>
Write cpuset.mems. See CgroupManager::set_cpuset_mems.
fn clear_cpuset_mems(&self, name: &str) -> Result<()>
Clear cpuset.mems (inherit from parent). See CgroupManager::clear_cpuset_mems.
fn set_cpu_max(
    &self,
    name: &str,
    quota_us: Option<u64>,
    period_us: u64,
) -> Result<()>
Write cpu.max. See CgroupManager::set_cpu_max.
fn set_cpu_weight(&self, name: &str, weight: u32) -> Result<()>
Write cpu.weight. See CgroupManager::set_cpu_weight.
fn set_memory_max(&self, name: &str, bytes: Option<u64>) -> Result<()>
Write memory.max. See CgroupManager::set_memory_max.
fn set_memory_high(&self, name: &str, bytes: Option<u64>) -> Result<()>
Write memory.high. See CgroupManager::set_memory_high.
fn set_memory_low(&self, name: &str, bytes: Option<u64>) -> Result<()>
Write memory.low. See CgroupManager::set_memory_low.
fn set_io_weight(&self, name: &str, weight: u16) -> Result<()>
Write io.weight. See CgroupManager::set_io_weight.
fn set_freeze(&self, name: &str, frozen: bool) -> Result<()>
Write cgroup.freeze. See CgroupManager::set_freeze.
fn set_pids_max(&self, name: &str, max: Option<u64>) -> Result<()>
Write pids.max. See CgroupManager::set_pids_max.
fn set_memory_swap_max(&self, name: &str, bytes: Option<u64>) -> Result<()>
Write memory.swap.max. See CgroupManager::set_memory_swap_max.
fn move_task(&self, name: &str, pid: pid_t) -> Result<()>
Move a task via cgroup.procs. See CgroupManager::move_task.
fn move_tasks(&self, name: &str, pids: &[pid_t]) -> Result<()>
See CgroupManager::move_tasks.
fn place_task_during_handshake(
    &self,
    cgroup_name: &str,
    child_pid: pid_t,
) -> Result<()>
Write child_pid to cgroup.procs during the payload-spawn cgroup-sync handshake.
fn clear_subtree_control(&self, name: &str) -> Result<()>
Clear cgroup.subtree_control on a child. See CgroupManager::clear_subtree_control.
fn drain_tasks(&self, name: &str) -> Result<()>
See CgroupManager::drain_tasks.
fn read_procs(&self, name: &str) -> Result<Vec<pid_t>>
Read cgroup.procs of a child, returning thread-group leaders. See CgroupManager::read_procs.
fn cleanup_all(&self) -> Result<()>
See CgroupManager::cleanup_all.
Auto Trait Implementations§
impl !Freeze for CgroupManager
impl RefUnwindSafe for CgroupManager
impl Send for CgroupManager
impl Sync for CgroupManager
impl Unpin for CgroupManager
impl UnwindSafe for CgroupManager
Blanket Implementations§
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self>
if into_left is true.
Converts self into a Right variant of Either<Self, Self>
otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self>
if into_left(&self) returns true.
Converts self into a Right variant of Either<Self, Self>
otherwise.