WorkloadHandle

Struct WorkloadHandle 

pub struct WorkloadHandle { /* private fields */ }

Handle to spawned worker tasks. Workers block until start() is called.

The CloneMode in the WorkloadConfig selects how each worker is created. Within one WorkloadHandle every worker uses the same mode, so exactly one of children or threads is populated; the other is empty. This avoids per-worker mode dispatch on the hot path and keeps each vec’s per-mode invariants (pid-based vs JoinHandle-based reaping) cohesive.

  • CloneMode::Fork populates children — separate process per worker, reaped via waitpid, signaled via SIGUSR1.
  • CloneMode::Thread populates threads — separate kernel task in the parent’s thread group via std::thread::spawn, joined via JoinHandle. Workers share the parent’s tgid; per-worker cgroup placement requires cgroup.threads (cgroup v2 thread mode), which ktstr scenarios do not currently configure — Thread-mode workers inherit the parent’s cgroup.

Implementations§

impl WorkloadHandle

pub fn spawn_pcomm_cgroup( pcomm: &str, container_uid: Option<u32>, container_gid: Option<u32>, works: &[WorkSpec], ) -> Result<Self>

Fork one thread-group leader hosting every worker in works as worker threads inside a single forked process. Used by apply_setup when a CgroupDef declares pcomm — every WorkSpec in the same CgroupDef is coalesced into one thread-group leader whose task->comm carries pcomm exactly (the WorkSpec::pcomm builder rejects > 15 bytes — TASK_COMM_LEN - 1 — so the framework never feeds the kernel a name __set_task_comm would truncate). Every spawned thread reads its task->group_leader->comm as pcomm for the leader’s lifetime.
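
The 15-byte limit described above can be sketched as a standalone check. This is a minimal illustration, not the crate’s code: `validate_pcomm` and its `String` error are hypothetical names, and the real WorkSpec::pcomm builder may report the failure differently.

```rust
/// Linux TASK_COMM_LEN is 16 bytes including the trailing NUL,
/// so a comm name can carry at most 15 bytes of payload.
const TASK_COMM_LEN: usize = 16;

/// Hypothetical validator (assumed name and error type).
/// Rejects names the kernel would have to truncate.
fn validate_pcomm(pcomm: &str) -> Result<&str, String> {
    let max = TASK_COMM_LEN - 1;
    // `str::len` counts bytes, not characters, which is the unit the kernel uses.
    if pcomm.len() > max {
        return Err(format!(
            "pcomm {:?} is {} bytes; the limit is {} bytes",
            pcomm,
            pcomm.len(),
            max
        ));
    }
    Ok(pcomm)
}
```

Because the check counts bytes, a name made of multi-byte UTF-8 characters can be rejected even when it has fewer than 16 characters.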

works must already be fully resolved: each entry’s num_workers must be Some(_) and affinity must be non-topology-aware (Inherit / Exact / RandomSubset). apply_setup runs the standard scenario-engine resolution (resolve_num_workers + intent_for_spawn) before calling in.

container_uid / container_gid apply to the leader’s process credentials via setresuid / setresgid once, inside the forked leader, before threads spawn. Each worker thread additionally re-applies its merged uid/gid inside worker_main at thread creation time; for the common case of a single CgroupDef-level default flowing through merged_works, the per-thread call is idempotent.

On works.is_empty() the function returns a handle with no children — no fork, no resource allocation. Same for the empty-thread case across ALL groups (every num_workers == 0): the leader is skipped because there is no work to host. This matches the pcomm_zero_workers_no_container_spawn contract.

Group-level admission rejects WorkType variants that conflict with the threaded shape (every worker shares the leader’s tgid):

  • WorkType::ForkExit: a fork from a thread of a multi-threaded process inherits all locks held by other threads at fork time, which the child cannot release; safe-but-degraded use cases would still need CloneMode::Fork for clean lock-free child state.
  • WorkType::CgroupChurn: writing the worker tid to cgroup.procs migrates the entire leader tgid (every sibling thread) — the test loses control of its own cgroup placement.

pub fn spawn(config: &WorkloadConfig) -> Result<Self>

Spawn worker tasks. Workers block until start() is called, allowing the caller to move fork-mode workers into cgroups first. The worker creation primitive (fork or std::thread::spawn) is selected by WorkloadConfig::clone_mode.

pub fn worker_pids(&self) -> Vec<pid_t>

Kernel TIDs of all worker tasks, in spawn order.

Returned as libc::pid_t — the kernel’s native type — so callers feed them directly into kill, waitpid, Pid::from_raw, and sched_setaffinity writes without any sign-cast at the libc boundary.

§WARNING — cgroup.procs for CloneMode::Thread

For CloneMode::Thread, passing these TIDs to a cgroup.procs write migrates the ENTIRE test-runner process into that cgroup: cgroup.procs writes are tgid-scoped, and every Thread worker shares the test runner’s tgid. The first such write moves the test harness, every parent thread, and every sibling worker into the destination cgroup; subsequent writes are no-ops because they all point at the same tgid. Use cgroup v2 threaded-mode cgroups with cgroup.threads for per-thread placement. CloneMode::Fork is the right choice when each worker needs its own cgroup.

§Per-mode interpretation

  • CloneMode::Fork: each entry is the worker’s pid (== tgid == kernel tid because the worker is its own thread-group leader). Safe to feed into cgroup.procs.
  • CloneMode::Thread: each entry is the worker’s gettid() value; Thread workers are distinct kernel tasks inside the parent’s tgid. Safe for sched_setaffinity(tid, ...); safe for cgroup.threads writes under a threaded-mode cgroup; not safe for cgroup.procs (see warning above).

§Thread tid publish ordering

Thread workers publish their gettid() via an Arc<AtomicI32> BEFORE the start handshake (the publish is the first thing the worker closure does, before blocking on start_rx). spawn_thread_worker blocks on a paired rendezvous channel until the worker reaches the publish point, so by the time WorkloadHandle::spawn returns, every thread worker’s tid is non-zero (or spawn would have bailed with a “failed to publish gettid() within 2 s” error). Post-spawn worker_pids() is safe to call without first calling Self::start. The publish uses Release; this reader uses Acquire, pairing release-acquire so that any reader observing a non-zero tid is also guaranteed to observe the worker’s post-publish state.
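
The publish-then-rendezvous ordering described above can be sketched with std primitives only. This is an illustrative model under stated assumptions: `ThreadWorker` and its fields are hypothetical, `fake_tid` stands in for the value a real worker would obtain from gettid(), and the real spawn_thread_worker may be structured differently.

```rust
use std::sync::atomic::{AtomicI32, Ordering};
use std::sync::mpsc;
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Hypothetical parent-side view of one thread worker.
struct ThreadWorker {
    tid: Arc<AtomicI32>,
    start_tx: mpsc::SyncSender<()>,
    join: thread::JoinHandle<i32>,
}

/// `fake_tid` is an assumption: it replaces the gettid() call of a real worker.
fn spawn_thread_worker(fake_tid: i32) -> Result<ThreadWorker, String> {
    let tid = Arc::new(AtomicI32::new(0));
    let (ready_tx, ready_rx) = mpsc::sync_channel::<()>(0);
    let (start_tx, start_rx) = mpsc::sync_channel::<()>(0);
    let worker_tid = Arc::clone(&tid);
    let join = thread::spawn(move || {
        // Publish first, with Release, before blocking on the start handshake.
        worker_tid.store(fake_tid, Ordering::Release);
        let _ = ready_tx.send(());
        // Block here until the parent signals start (or drops the sender).
        let _ = start_rx.recv();
        fake_tid
    });
    // Bounded wait for the worker to reach the publish point.
    ready_rx
        .recv_timeout(Duration::from_secs(2))
        .map_err(|_| "failed to publish tid within 2 s".to_string())?;
    Ok(ThreadWorker { tid, start_tx, join })
}
```

Once `spawn_thread_worker` returns `Ok`, an Acquire load of `tid` observes the non-zero value even though the start signal has not been sent yet.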

§pcomm containers

Groups configured with WorkSpec::pcomm yield a single container process whose tgid leader pid is what the parent holds; the per-thread gettid() values of the workers running inside the container are not exported across the process boundary. Each pcomm group contributes ONE entry to the returned vector — the container pid — regardless of how many thread workers it hosts. This pid is correct for cgroup.procs migration (the container’s whole tgid moves) but is NOT suitable as a target for per-thread sched_setaffinity — see Self::set_affinity for the matching error path.

Order: forked children (conventional workers + pcomm containers, in spawn order) followed by Thread-mode workers (in spawn order). A workload that mixes pcomm groups with non-pcomm Thread-mode groups produces both populated collections; a workload using only one dispatch path produces only one populated collection.

pub fn worker_pids_for_cgroup_procs(&self) -> Result<Vec<pid_t>>

Worker pids suitable for cgroup.procs migration.

cgroup.procs is tgid-scoped in the kernel: writing a tid migrates the entire thread group containing that tid (kernel/cgroup/cgroup.c::__cgroup_procs_write resolves the passed pid to its leader via find_lock_task_mm / cgroup_procs_write_start). Under CloneMode::Thread every worker shares the test harness’s tgid, so feeding Self::worker_pids to cgroup.procs would migrate the harness itself — catastrophic.

Returns the per-worker pids when the spawn used CloneMode::Fork (each worker has its own tgid). Bails for CloneMode::Thread with an actionable diagnostic pointing at cgroup.threads (the thread-scoped sibling) as the right migration sink for thread workers.

Callers that integrate with cgroup.procs writes — e.g. crate::cgroup::CgroupManager::move_tasks — should call this in place of Self::worker_pids so a misconfigured Thread-mode test fails at the migration step rather than silently moving the harness into the per-test cgroup.
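
The guard described above can be modelled in a few lines. This is a sketch, not the crate’s implementation: the `CloneMode` enum here is a local stand-in for the real type, and `pids_for_cgroup_procs` with its `String` error is a hypothetical free function.

```rust
/// Local stand-in for the crate's CloneMode (assumed to have these two variants).
#[derive(Clone, Copy, Debug, PartialEq)]
enum CloneMode {
    Fork,
    Thread,
}

/// Hypothetical guard: hands out pids only when each one is its own tgid.
fn pids_for_cgroup_procs(mode: CloneMode, pids: &[i32]) -> Result<Vec<i32>, String> {
    match mode {
        // Fork workers are thread-group leaders, so a cgroup.procs write moves only them.
        CloneMode::Fork => Ok(pids.to_vec()),
        // Thread workers share the harness tgid; a cgroup.procs write would move the harness.
        CloneMode::Thread => Err(
            "cgroup.procs is tgid-scoped; use cgroup.threads in a threaded-mode cgroup \
             for Thread workers"
                .to_string(),
        ),
    }
}
```

Failing at this step makes a misconfigured Thread-mode test surface an error at migration time, rather than silently relocating the harness.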

pub fn start(&mut self)

Signal all workers to start working (after they’ve been placed in cgroups, if applicable).

Idempotent — subsequent calls after the first are no-ops.
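
One way to get this idempotence is to consume the start channels on the first call. The sketch below is an assumption about shape, not the crate’s code: `StartGate` is a hypothetical type, it uses an unbounded std channel for simplicity, and it returns the number of workers signalled purely so the behaviour is observable (the real start() returns nothing).

```rust
use std::sync::mpsc::Sender;

/// Hypothetical holder of the not-yet-fired start signals.
struct StartGate {
    pending: Vec<Sender<()>>,
}

impl StartGate {
    fn new(pending: Vec<Sender<()>>) -> Self {
        Self { pending }
    }

    /// Signals every pending worker once and returns how many were signalled.
    /// `drain` leaves the vector empty, so any later call finds nothing to send.
    fn start(&mut self) -> usize {
        self.pending
            .drain(..)
            .filter(|tx| tx.send(()).is_ok())
            .count()
    }
}
```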

pub fn set_affinity(&self, idx: usize, cpus: &BTreeSet<usize>) -> Result<()>

Set CPU affinity for worker at idx.

For CloneMode::Fork the per-worker pid addresses a distinct kernel task. For CloneMode::Thread the worker’s gettid() is what sched_setaffinity(tid, ...) accepts; this method reads the tid from the worker’s Arc<AtomicI32> (with Acquire ordering, paired with the Release publish on the worker thread). Returns an error if the thread has not yet published its tid — call start() first so the worker reaches its gettid() publish before reading.

Bails for ForkedChildKind::PcommContainer entries: the container’s tgid leader pid is the only kernel handle the parent holds for that group, but sched_setaffinity against that pid pins only the leader thread (a placeholder thread that never enters the work loop), not the worker threads running the workload. The container’s worker tids are not published across the process boundary. Bake the affinity into WorkSpec::affinity at spawn time instead — worker_main applies it per-thread inside the container.

Index space: [0, children.len()) addresses forked children (conventional workers and pcomm containers), and [children.len(), children.len() + threads.len()) addresses Thread-mode workers, matching the ordering of Self::worker_pids.
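
The index space above maps onto the two collections as follows. This is a minimal sketch: `Slot` and `resolve_index` are hypothetical names used only to illustrate the arithmetic.

```rust
/// Hypothetical result of resolving a worker index.
#[derive(Debug, PartialEq)]
enum Slot {
    /// Position inside the forked-children collection.
    Child(usize),
    /// Position inside the Thread-mode collection.
    Thread(usize),
}

/// Maps a flat worker index onto (children, threads), children first.
fn resolve_index(idx: usize, children_len: usize, threads_len: usize) -> Option<Slot> {
    if idx < children_len {
        Some(Slot::Child(idx))
    } else if idx - children_len < threads_len {
        // idx >= children_len here, so the subtraction cannot underflow.
        Some(Slot::Thread(idx - children_len))
    } else {
        None
    }
}
```

With two children and three threads, index 2 resolves to the first thread worker and index 5 is out of range.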

pub fn snapshot_iterations(&self) -> Vec<u64>

Read all workers’ current iteration counts from shared memory.

Each element is the monotonically increasing iteration count for that worker, read with Relaxed ordering. Returns an empty vec if no workers were spawned.

§Ordering rationale — why Relaxed is sound

Every producer (the worker-side store at the worker_main publish sites) writes its slot with Relaxed ordering, and this reader loads with Relaxed too. No happens-before edge is needed because no host-side consumer pairs the iteration count with OTHER shared state: the parent samples these counters to answer “is this worker still making progress?” and feeds deltas into gap detection, not into any data-dependent follow-up read from a different shared memory location. A stale value on one sample is self-correcting — the next snapshot picks up the newer count without any cross-field invariant to break.

The per-slot single-producer / multi-sampler shape is inherently non-tearing on every supported target (AtomicU64 is architecture-primitive on x86_64 and aarch64 LSE with 8-byte alignment enforced by the type). The only question is ordering, and the reasoning above concludes that Relaxed is sufficient: promoting either side to Acquire/Release would add a barrier with no paired operation to synchronise with.
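
The single-producer, Relaxed-on-both-sides pattern can be sketched in-process. This is an illustration under assumptions: the real counters live in shared memory across processes, whereas this model uses threads and an `Arc<Vec<AtomicU64>>`; `run_workers` is a hypothetical driver, and the free function below only mirrors the method’s name.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Reads every slot with Relaxed ordering; a value may be stale but is never torn.
fn snapshot_iterations(slots: &[AtomicU64]) -> Vec<u64> {
    slots.iter().map(|slot| slot.load(Ordering::Relaxed)).collect()
}

/// Hypothetical driver: one producer thread per slot, each storing its own count.
fn run_workers(num_workers: usize, iterations: u64) -> Vec<u64> {
    let slots: Arc<Vec<AtomicU64>> =
        Arc::new((0..num_workers).map(|_| AtomicU64::new(0)).collect());
    let handles: Vec<thread::JoinHandle<()>> = (0..num_workers)
        .map(|i| {
            let slots = Arc::clone(&slots);
            thread::spawn(move || {
                for count in 1..=iterations {
                    // Only this thread ever writes slot i.
                    slots[i].store(count, Ordering::Relaxed);
                }
            })
        })
        .collect();
    // A mid-run sample may lag behind, but each counter only ever grows.
    let mid = snapshot_iterations(&slots);
    for handle in handles {
        handle.join().unwrap();
    }
    let fin = snapshot_iterations(&slots);
    assert!(mid.iter().zip(&fin).all(|(m, f)| m <= f));
    fin
}
```

The mid-run sample is used only for a monotonicity check, which mirrors the progress-detection use described above: no other shared state is read on the strength of the counter value.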

pub fn stop_and_collect(self) -> Vec<WorkerReport>

Stop all workers, collect their reports, and wait for exit.

Auto-starts workers if start() was not called, then waits on an event-driven barrier: each fork worker writes a single b'r' byte to its report pipe immediately after the start handshake completes, and the parent polls every report fd for POLLIN with a 5 s deadline. The barrier wakes the moment the slowest worker finishes its post-fork init, replacing the prior unconditional 500 ms sleep that under-waited under host CPU contention and over-waited on idle hosts. Thread-mode workers are pre-synchronised by start()’s mpsc::sync_channel(0) rendezvous, so the barrier is a no-op when no fork children were spawned. Consumes self, so workers cannot be restarted.

Workers that fail to produce a report (died, timed out, or wrote corrupt data) get a zeroed-out sentinel report with work_units: 0. This ensures assert_not_starved catches dead workers as starvation failures.
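
The sentinel fallback can be sketched as a decode step that never fails. This is a simplified model, not the crate’s code: the real payload is postcard-encoded, while this sketch assumes a stand-in wire format of exactly 8 little-endian bytes; the `sentinel` field and `decode_or_sentinel` are hypothetical, and only `work_units` comes from the text above.

```rust
/// Simplified stand-in for the crate's WorkerReport.
#[derive(Debug, PartialEq)]
struct WorkerReport {
    work_units: u64,
    /// Hypothetical marker so the two paths are distinguishable in this sketch.
    sentinel: bool,
}

/// Decodes a pipe payload, falling back to a zeroed sentinel on any failure.
/// Assumed wire format: exactly 8 little-endian bytes holding work_units.
fn decode_or_sentinel(payload: &[u8]) -> WorkerReport {
    match <[u8; 8]>::try_from(payload) {
        Ok(bytes) => WorkerReport {
            work_units: u64::from_le_bytes(bytes),
            sentinel: false,
        },
        // Empty, truncated, or oversized payloads all map to the same sentinel.
        Err(_) => WorkerReport {
            work_units: 0,
            sentinel: true,
        },
    }
}
```

Because the sentinel carries work_units: 0, a starvation check that looks only at work_units treats a dead worker the same as one that never ran.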

§Shutdown latency

Workers spend their steady-state time blocked inside a futex_wait with timeout WORKER_STOP_POLL_NS (~100 ms). The “stop signal” is a per-mode flag the worker checks on every futex-wait wake; the wake interval bounds shutdown latency.

Fork mode: stop_and_collect sends SIGUSR1 to each worker pid; the per-process sigusr1_handler flips the global STOP in that worker’s CoW address space, and the worker observes it on the NEXT futex wake (partner-writes or the 100 ms timeout, whichever comes first). The signal handler is process-wide and reaches one worker per kill().

Thread mode: stop_and_collect calls worker.stop.store(true, Relaxed) directly on each worker’s Arc<AtomicBool>. SIGUSR1 is process-wide and useless for per-thread stop control, so no signal is sent; the worker observes the flag flip on its next futex-wait wake at the same 100 ms cadence.

Callers that budget a graceful-shutdown window should allow at least one WORKER_STOP_POLL_NS tick (~100 ms) between flag flip and final collect, over and above any report-flush / IO latency. Tighter windows can race the worker’s pre-stop iteration and surface as a missing report, which is then mapped to the sentinel path above.
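
The Thread-mode stop path and its latency bound can be sketched as below. This is a model under assumptions: `thread::park_timeout` stands in for the futex_wait the real workers use, `STOP_POLL` mirrors the roughly 100 ms WORKER_STOP_POLL_NS value, and both function names are hypothetical.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

/// Assumed to mirror WORKER_STOP_POLL_NS (about 100 ms).
const STOP_POLL: Duration = Duration::from_millis(100);

/// Worker loop: checks the stop flag on every wake of a timed wait.
fn run_until_stopped(stop: Arc<AtomicBool>) -> thread::JoinHandle<u64> {
    thread::spawn(move || {
        let mut wakes = 0u64;
        while !stop.load(Ordering::Relaxed) {
            // Stand-in for futex_wait with a timeout; may also wake early.
            thread::park_timeout(STOP_POLL);
            wakes += 1;
        }
        wakes
    })
}

/// Flips the flag, joins the worker, and reports how long shutdown took.
fn stop_and_join(stop: &AtomicBool, handle: thread::JoinHandle<u64>) -> (u64, Duration) {
    let begun = Instant::now();
    stop.store(true, Ordering::Relaxed);
    let wakes = handle.join().unwrap();
    (wakes, begun.elapsed())
}
```

In this model the worker notices the flag at its next wake, so the measured shutdown time is bounded by roughly one poll interval plus scheduling delay, which is why a shutdown budget needs at least one such tick.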

§Exit-shape invariance

Collection discriminates purely on the presence and validity of the worker’s pipe-delivered postcard payload, not on the waitpid exit status. Under panic = "unwind" (dev/test profile) the worker’s catch_unwind arm calls _exit(1), so the parent sees WIFEXITED=true, WEXITSTATUS=1; under panic = "abort" (release profile) the worker aborts with SIGABRT, so the parent sees WIFEXITED=false, WTERMSIG=6. Either way, a panicking worker never finishes f.write_all(&bytes) on the report pipe, so poll + read_to_end hands back an empty (or truncated) buffer, postcard::from_bytes fails, and the sentinel path fires. Partial writes caused by a panic between a successful write_all and _exit(0) are not reachable, because the write is the last non-trivial statement inside the catch_unwind closure.

The waitpid call later in this function exists to reap zombies. Its return value feeds the “still alive → SIGKILL escalate” branch and, on the sentinel path, populates WorkerExitInfo on the attached diagnostic. It never decides whether a worker gets a sentinel or a real report; that discrimination happens purely on pipe payload presence.

Trait Implementations§

impl Drop for WorkloadHandle

fn drop(&mut self)

Executes the destructor for this type.

impl Send for WorkloadHandle

impl Sync for WorkloadHandle

Auto Trait Implementations§

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T> Instrument for T

fn instrument(self, span: Span) -> Instrumented<Self>

Instruments this type with the provided [Span], returning an Instrumented wrapper.

fn in_current_span(self) -> Instrumented<Self>

Instruments this type with the current Span, returning an Instrumented wrapper.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> IntoEither for T

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.

impl<T> Pointable for T

const ALIGN: usize

The alignment of pointer.

type Init = T

The type for initializers.

unsafe fn init(init: <T as Pointable>::Init) -> usize

Initializes a with the given initializer.

unsafe fn deref<'a>(ptr: usize) -> &'a T

Dereferences the given pointer.

unsafe fn deref_mut<'a>(ptr: usize) -> &'a mut T

Mutably dereferences the given pointer.

unsafe fn drop(ptr: usize)

Drops the object pointed to by the given pointer.

impl<T> PolicyExt for T
where T: ?Sized,

fn and<P, B, E>(self, other: P) -> And<T, P>
where T: Policy<B, E>, P: Policy<B, E>,

Create a new Policy that returns [Action::Follow] only if self and other return Action::Follow.

fn or<P, B, E>(self, other: P) -> Or<T, P>
where T: Policy<B, E>, P: Policy<B, E>,

Create a new Policy that returns [Action::Follow] if either self or other returns Action::Follow.

impl<T> Same for T

type Output = T

Should always be Self

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

fn vzip(self) -> V

impl<T> WithSubscriber for T

fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self>
where S: Into<Dispatch>,

Attaches the provided Subscriber to this type, returning a [WithDispatch] wrapper.

fn with_current_subscriber(self) -> WithDispatch<Self>

Attaches the current default Subscriber to this type, returning a [WithDispatch] wrapper.

impl<T> MaybeSend for T
where T: Send,