pub struct WorkloadHandle { /* private fields */ }Expand description
Handle to spawned worker tasks. Workers block until
start() is called.
The CloneMode in the WorkloadConfig selects how each
worker is created. Within one WorkloadHandle every worker
uses the same mode, so exactly one of children or threads
is populated; the other is empty. This avoids per-worker mode
dispatch on the hot path and keeps each vec’s per-mode
invariants (pid-based vs JoinHandle-based reaping) cohesive.
CloneMode::Forkpopulateschildren— separate process per worker, reaped viawaitpid, signaled via SIGUSR1.CloneMode::Threadpopulatesthreads— separate kernel task in the parent’s thread group viastd::thread::spawn, joined viaJoinHandle. Workers share the parent’s tgid; per-worker cgroup placement requirescgroup.threads(cgroup v2 thread mode), which ktstr scenarios do not currently configure — Thread-mode workers inherit the parent’s cgroup.
Implementations§
Source§impl WorkloadHandle
impl WorkloadHandle
Sourcepub fn spawn_pcomm_cgroup(
pcomm: &str,
container_uid: Option<u32>,
container_gid: Option<u32>,
works: &[WorkSpec],
) -> Result<Self>
pub fn spawn_pcomm_cgroup( pcomm: &str, container_uid: Option<u32>, container_gid: Option<u32>, works: &[WorkSpec], ) -> Result<Self>
Fork one thread-group leader hosting every worker in works
as worker threads inside a single forked process. Used by
apply_setup when a CgroupDef declares pcomm — every
WorkSpec in the same CgroupDef is coalesced into one
thread-group leader whose task->comm carries pcomm
exactly (the WorkSpec::pcomm builder rejects > 15 bytes
— TASK_COMM_LEN - 1 — so the framework never feeds the
kernel a name __set_task_comm would truncate). Every
spawned thread reads its task->group_leader->comm as
pcomm for the leader’s lifetime.
works must already be fully resolved: each entry’s
num_workers must be Some(_) and affinity must be
non-topology-aware (Inherit / Exact / RandomSubset).
apply_setup runs the standard scenario-engine resolution
(resolve_num_workers + intent_for_spawn) before calling
in.
container_uid / container_gid apply to the leader’s
process credentials via setresuid / setresgid once,
inside the forked leader, before threads spawn. Each worker
thread additionally re-applies its merged uid/gid inside
worker_main at thread creation time; for the common case
of a single CgroupDef-level default flowing through
merged_works, the per-thread call is idempotent.
On works.is_empty() the function returns a handle with no
children — no fork, no resource allocation. Same for the
empty-thread case across ALL groups (every num_workers == 0): the leader is skipped because there is no work to host.
This matches the pcomm_zero_workers_no_container_spawn
contract.
Group-level admission rejects WorkType variants that conflict with the threaded shape (every worker shares the leader’s tgid):
WorkType::ForkExit: a fork from a thread of a multi-threaded process inherits all locks held by other threads at fork time, which the child cannot release; safe-but-degraded use cases would still needCloneMode::Forkfor clean lock-free child state.WorkType::CgroupChurn: writing the worker tid tocgroup.procsmigrates the entire leader tgid (every sibling thread) — the test loses control of its own cgroup placement.
Sourcepub fn spawn(config: &WorkloadConfig) -> Result<Self>
pub fn spawn(config: &WorkloadConfig) -> Result<Self>
Spawn worker tasks. Workers block until
start() is called, allowing the caller to
move fork-mode workers into cgroups first. The worker creation
primitive (fork or std::thread::spawn) is selected by
WorkloadConfig::clone_mode.
Sourcepub fn worker_pids(&self) -> Vec<pid_t> ⓘ
pub fn worker_pids(&self) -> Vec<pid_t> ⓘ
Kernel TIDs of all worker tasks, in spawn order.
Returned as libc::pid_t — the kernel’s native type — so
callers feed them directly into kill, waitpid,
Pid::from_raw, and sched_setaffinity writes without any
sign-cast at the libc boundary.
§WARNING — cgroup.procs for CloneMode::Thread
For CloneMode::Thread, passing these TIDs to a
cgroup.procs write migrates the ENTIRE test-runner process
into that cgroup: cgroup.procs writes are tgid-scoped, and
every Thread worker shares the test runner’s tgid. The first
such write moves the test harness, every parent thread, and
every sibling worker into the destination cgroup; subsequent
writes are no-ops because they all point at the same tgid.
Use cgroup v2 threaded-mode cgroups with cgroup.threads
for per-thread placement. CloneMode::Fork is the right
choice when each worker needs its own cgroup.
§Per-mode interpretation
CloneMode::Fork: each entry is the worker’s pid (== tgid == kernel tid because the worker is its own thread-group leader). Safe to feed intocgroup.procs.CloneMode::Thread: each entry is the worker’sgettid()value — distinct kernel tasks inside the parent’s tgid. Safe forsched_setaffinity(tid, ...); safe forcgroup.threadswrites under a threaded-mode cgroup; not safe forcgroup.procs(see warning above).
§Thread tid publish ordering
Thread workers publish their gettid() via an
Arc<AtomicI32> BEFORE the start handshake (the publish is
the first thing the worker closure does, before blocking on
start_rx). spawn_thread_worker blocks on a paired
rendezvous channel until the worker reaches the publish
point, so by the time WorkloadHandle::spawn returns,
every thread worker’s tid is non-zero (or spawn would have
bailed with a “failed to publish gettid() within 2 s”
error). Post-spawn worker_pids() is safe to call without
first calling Self::start. The publish uses Release;
this reader uses Acquire, pairing release-acquire so that
any reader observing a non-zero tid is also guaranteed to
observe the worker’s post-publish state.
§pcomm containers
Groups configured with WorkSpec::pcomm yield a single
container process whose tgid leader pid is what the parent
holds; the per-thread gettid() values of the workers running
inside the container are not exported across the process
boundary. Each pcomm group contributes ONE entry to the
returned vector — the container pid — regardless of how many
thread workers it hosts. This pid is correct for cgroup.procs
migration (the container’s whole tgid moves) but is NOT
suitable as a target for per-thread sched_setaffinity —
see Self::set_affinity for the matching error path.
Order: forked children (conventional workers + pcomm containers, in spawn order) followed by Thread-mode workers (in spawn order). A workload that mixes pcomm groups with non-pcomm Thread-mode groups produces both populated collections; a workload using only one dispatch path produces only one populated collection.
Sourcepub fn worker_pids_for_cgroup_procs(&self) -> Result<Vec<pid_t>>
pub fn worker_pids_for_cgroup_procs(&self) -> Result<Vec<pid_t>>
Worker pids suitable for cgroup.procs migration.
cgroup.procs is tgid-scoped in the kernel: writing a
tid migrates the entire thread group containing that tid
(kernel/cgroup/cgroup.c::__cgroup_procs_write resolves the
passed pid to its leader via find_lock_task_mm /
cgroup_procs_write_start). Under CloneMode::Thread
every worker shares the test harness’s tgid, so feeding
Self::worker_pids to cgroup.procs would migrate the
harness itself — catastrophic.
Returns the per-worker pids when the spawn used
CloneMode::Fork (each worker has its own tgid). Bails
for CloneMode::Thread with an actionable diagnostic
pointing at cgroup.threads (the thread-scoped sibling) as
the right migration sink for thread workers.
Callers that integrate with cgroup.procs writes — e.g.
crate::cgroup::CgroupManager::move_tasks — should call
this in place of Self::worker_pids so a misconfigured
Thread-mode test fails at the migration step rather than
silently moving the harness into the per-test cgroup.
Sourcepub fn start(&mut self)
pub fn start(&mut self)
Signal all workers to start working (after they’ve been placed in cgroups, if applicable).
Idempotent — subsequent calls after the first are no-ops.
Sourcepub fn set_affinity(&self, idx: usize, cpus: &BTreeSet<usize>) -> Result<()>
pub fn set_affinity(&self, idx: usize, cpus: &BTreeSet<usize>) -> Result<()>
Set CPU affinity for worker at idx.
For CloneMode::Fork the per-worker pid addresses a
distinct kernel task. For CloneMode::Thread the worker’s
gettid() is what sched_setaffinity(tid, ...) accepts;
this method reads the tid from the worker’s
Arc<AtomicI32> (with Acquire ordering, paired with the
Release publish on the worker thread). Returns an error
if the thread has not yet published its tid — call
start() first so the worker reaches its
gettid() publish before reading.
Bails for ForkedChildKind::PcommContainer entries: the
container’s tgid leader pid is the only kernel handle the
parent holds for that group, but sched_setaffinity against
that pid pins only the leader thread (a placeholder thread
that never enters the work loop), not the worker threads
running the workload. The container’s worker tids are not
published across the process boundary. Bake the affinity into
WorkSpec::affinity at spawn time instead — worker_main
applies it per-thread inside the container.
Index space: [0, children.len()) addresses forked children
(conventional workers and pcomm containers), and
[children.len(), children.len() + threads.len()) addresses
Thread-mode workers, matching the ordering of
Self::worker_pids.
Sourcepub fn snapshot_iterations(&self) -> Vec<u64>
pub fn snapshot_iterations(&self) -> Vec<u64>
Read all workers’ current iteration counts from shared memory.
Each element is the monotonically increasing iteration count for that worker, read with Relaxed ordering. Returns an empty vec if no workers were spawned.
§Ordering rationale — why Relaxed is sound
Every producer (the worker-side store at the
worker_main publish sites) writes its slot with Relaxed
ordering, and this reader loads with Relaxed too. No
happens-before edge is needed because no host-side consumer
pairs the iteration count with OTHER shared state: the
parent samples these counters to answer “is this worker
still making progress?” and feeds deltas into gap
detection, not into any data-dependent follow-up read from
a different shared memory location. A stale value on one
sample is self-correcting — the next snapshot picks up the
newer count without any cross-field invariant to break.
The per-slot single-producer / multi-sampler shape is inherently non-tearing on every supported target (AtomicU64 is architecture-primitive on x86_64 and aarch64 LSE with 8-byte alignment enforced by the type). The only question is ordering, and the audit above concludes Relaxed is load-bearingly correct — promoting either side to Acquire/Release would add a barrier with no corresponding paired operation to synchronise with.
Sourcepub fn stop_and_collect(self) -> Vec<WorkerReport>
pub fn stop_and_collect(self) -> Vec<WorkerReport>
Stop all workers, collect their reports, and wait for exit.
Auto-starts workers if start() was not called,
then waits on an event-driven barrier — each fork worker
writes a single b'r' byte to its report pipe immediately
after the start handshake completes, and the parent polls
every report fd for POLLIN with a 5 s deadline. The
barrier wakes the moment the slowest worker finishes its
post-fork init, replacing the prior unconditional 500 ms
sleep that under-waited under host CPU contention and
over-waited on idle hosts. Thread-mode workers are pre-
synchronised by start()’s mpsc::sync_channel(0) rendezvous,
so the barrier is a no-op when no fork children were spawned.
Consumes self – workers cannot be restarted.
Workers that fail to produce a report (died, timed out, or wrote
corrupt data) get a zeroed-out sentinel report with work_units: 0.
This ensures assert_not_starved catches dead workers as starvation
failures.
§Shutdown latency
Workers spend their steady-state time blocked inside a
futex_wait with timeout WORKER_STOP_POLL_NS (~100 ms).
The “stop signal” is a per-mode flag the worker checks on
every futex-wait wake; the wake interval bounds shutdown
latency.
Fork mode — stop_and_collect sends SIGUSR1 to each
worker pid; the per-process sigusr1_handler flips the
global STOP in that worker’s CoW address space, and the
worker observes it on the NEXT futex wake (partner-writes
or the 100 ms timeout, whichever comes first). The signal
handler is process-wide and reaches one worker per kill().
Thread mode — stop_and_collect calls
worker.stop.store(true, Relaxed) directly on each
worker’s Arc<AtomicBool>. SIGUSR1 is process-wide and
useless for per-thread stop control, so no signal is sent;
the worker observes the flag flip on its next futex-wait
wake at the same 100 ms cadence.
Callers that budget a graceful-shutdown window should
allow at least one WORKER_STOP_POLL_NS tick (~100 ms)
between flag flip and final collect, over and above any
report-flush / IO latency. Tighter windows can race the
worker’s pre-stop iteration and surface as a missing
report, which is then mapped to the sentinel path above.
§Exit-shape invariance
Collection discriminates purely on the presence and validity of
the worker’s pipe-delivered postcard payload — not on
waitpid exit status. Under panic = "unwind" (dev/test
profile) the worker’s
catch_unwind arm calls _exit(1) so the parent sees
WIFEXITED=true, WEXITSTATUS=1; under panic = "abort"
(release profile) the worker aborts with SIGABRT so the parent
sees WIFEXITED=false, WTERMSIG=6. Either way, a panicking
worker never finishes f.write_all(&bytes) on the report pipe,
so poll + read_to_end hands back an empty (or truncated)
buffer, postcard::from_bytes fails, and the
sentinel path fires. Partial writes from a panic between successful
write_all and _exit(0) are not reachable — the write is the
last non-trivial statement inside the catch_unwind closure.
The waitpid call later in this function exists solely for
reaping zombies; its return value feeds only the “still alive
→ SIGKILL escalate” branch and is never mapped to report
state (the sentinel path DOES now read it to populate
WorkerExitInfo on the attached diagnostic, but the
correctness discrimination — sentinel vs real report — still
happens purely on pipe payload presence).
Trait Implementations§
Source§impl Drop for WorkloadHandle
impl Drop for WorkloadHandle
impl Send for WorkloadHandle
impl Sync for WorkloadHandle
Auto Trait Implementations§
impl Freeze for WorkloadHandle
impl !RefUnwindSafe for WorkloadHandle
impl Unpin for WorkloadHandle
impl !UnwindSafe for WorkloadHandle
Blanket Implementations§
Source§impl<T> BorrowMut<T> for Twhere
T: ?Sized,
impl<T> BorrowMut<T> for Twhere
T: ?Sized,
Source§fn borrow_mut(&mut self) -> &mut T
fn borrow_mut(&mut self) -> &mut T
§impl<T> Instrument for T
impl<T> Instrument for T
§fn instrument(self, span: Span) -> Instrumented<Self>
fn instrument(self, span: Span) -> Instrumented<Self>
§fn in_current_span(self) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
Source§impl<T> IntoEither for T
impl<T> IntoEither for T
Source§fn into_either(self, into_left: bool) -> Either<Self, Self>
fn into_either(self, into_left: bool) -> Either<Self, Self>
self into a Left variant of Either<Self, Self>
if into_left is true.
Converts self into a Right variant of Either<Self, Self>
otherwise. Read moreSource§fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
self into a Left variant of Either<Self, Self>
if into_left(&self) returns true.
Converts self into a Right variant of Either<Self, Self>
otherwise. Read more