pub struct WorkerReport {
pub tid: i32,
pub work_units: u64,
pub cpu_time_ns: u64,
pub wall_time_ns: u64,
pub off_cpu_ns: u64,
pub migration_count: u64,
pub cpus_used: BTreeSet<usize>,
pub migrations: Vec<Migration>,
pub max_gap_ms: u64,
pub max_gap_cpu: usize,
pub max_gap_at_ms: u64,
pub wake_latencies_ns: Vec<u64>,
pub wake_sample_total: u64,
pub iteration_costs_ns: Vec<u64>,
pub iteration_cost_sample_total: u64,
pub timer_latencies_ns: Vec<u64>,
pub timer_sample_total: u64,
pub iterations: u64,
pub schedstat_run_delay_ns: u64,
pub schedstat_run_count: u64,
pub schedstat_cpu_time_ns: u64,
pub completed: bool,
pub numa_pages: BTreeMap<usize, u64>,
pub vmstat_numa_pages_migrated: u64,
pub exit_info: Option<WorkerExitInfo>,
pub is_messenger: bool,
pub group_idx: usize,
pub affinity_error: Option<String>,
pub sched_policy_error: Option<String>,
pub phase_slices: Vec<PhaseSlice>,
pub taobench_whole: Option<TaobenchStats>,
}
Telemetry collected from a worker process after it stops.
Normal reports: every field is populated by the worker itself
(inside the VM) and serialized over a pipe to the parent process.
Sentinel reports: synthesized by WorkloadHandle::stop_and_collect
when a worker exits without emitting a report. They carry
parent-populated exit_info, with the remaining fields at their
Default values (except group_idx, which identifies the replaced
worker). The worker never wrote to the pipe, so the parent is the
sole source of truth for the surfaced outcome.
Default trade-off
Default produces a zero/empty report. The trade-off:
- Pro: sentinel and test code can spread ..WorkerReport::default(),
  so adding a field does not require touching every sentinel site.
- Con: zero-valued fields are valid report outputs (e.g. a worker that
  never blocked has wake_latencies_ns: vec![]), so a missing field
  cannot be distinguished from a real zero at the reader. Consumers that
  need to know whether a field was actually set must track presence
  out-of-band (e.g. whether the work type populates the field, per
  wake_latencies_ns’s doc).
Decision: keep the Default impl. Sentinel ergonomics outweigh
the distinguishability cost — every real consumer already knows
which fields a given WorkType populates, and the alternative
(removing Default and hand-listing every field at sentinel
sites) introduces a worse drift problem that silently skips new
telemetry instead of reporting it as zero.
Callers building a sentinel report should spread
..WorkerReport::default() rather than listing every field by hand;
a hand-listed sentinel drifts silently when a field is added.
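A minimal sketch of the spread pattern, assuming a hypothetical cut-down stand-in (MiniReport and ExitInfo are illustrative names, not the real WorkerReport / WorkerExitInfo). The sentinel sets only what the parent knows and lets Default fill in the rest:

```rust
/// Hypothetical stand-in for WorkerExitInfo (illustrative variants only).
#[derive(Clone, Debug, PartialEq)]
enum ExitInfo {
    TimedOut,
    Signaled(i32),
}

/// Hypothetical cut-down WorkerReport: derive(Default) yields zero/empty values.
#[derive(Clone, Debug, Default, PartialEq)]
struct MiniReport {
    group_idx: usize,
    iterations: u64,
    completed: bool,
    wake_latencies_ns: Vec<u64>,
    exit_info: Option<ExitInfo>,
}

/// Build a sentinel: set only the parent-known fields and spread Default
/// over the rest, so adding a field to MiniReport needs no change here.
fn sentinel(group_idx: usize, info: ExitInfo) -> MiniReport {
    MiniReport {
        group_idx,
        exit_info: Some(info),
        ..Default::default()
    }
}
```

Every defaulted field reads as a plausible real value (iterations: 0, an empty vector). That is exactly the distinguishability cost described in the trade-off.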
Fields
tid: i32
Kernel TID from gettid(2). For CloneMode::Fork each
worker is its own thread-group leader, so gettid() == getpid() == tgid and the report’s tid is interchangeable with the
worker’s pid in libc / cgroup APIs. For CloneMode::Thread
every worker shares the parent’s tgid and gettid() is the
only identifier that discriminates per-task identity, so the
report’s tid is what feeds sched_setaffinity(tid, ...) and
cgroup.threads writes (NOT cgroup.procs; see the warning
on WorkloadHandle::worker_pids). Stored as pid_t (i32)
to match the kernel’s native type and avoid the silent
u32→i32 sign-cast wraparound at libc boundaries
(kill/waitpid/Pid::from_raw).
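The wraparound hazard is easy to reproduce in isolation: casting u32::MAX with `as` yields -1, and kill(2) treats a pid of -1 as “every process the caller may signal”. A sketch of a checked conversion follows. to_pid is a hypothetical helper, not part of this crate:

```rust
use std::convert::TryFrom;

/// Hypothetical helper: convert a raw u32 id into a pid_t-sized i32,
/// refusing values that an `as` cast would silently wrap negative.
fn to_pid(raw: u32) -> Option<i32> {
    i32::try_from(raw).ok()
}
```

Storing the tid as i32 from the start sidesteps the conversion entirely. The checked form only matters when a u32 arrives from elsewhere.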
work_units: u64
Cumulative work iterations (incremented by spin_burst or I/O loops).
Read by the fairness/starvation gate (assert_not_starved /
min_work_units) and assert_throughput_parity; NOT summed into
CgroupStats::total_iterations, which reads iterations.
A Custom worker that wants throughput assertions must also populate
iterations.
cpu_time_ns: u64
Thread CPU time from CLOCK_THREAD_CPUTIME_ID (ns).
wall_time_ns: u64
Wall-clock time from worker start to stop flag (ns).
Measured from the worker’s first Instant::now() in
worker_main (immediately after the start handshake) to the
outer-loop exit (when the per-worker stop flag is observed
true); covers both Fork-mode workers (signal-driven flag)
and Thread-mode workers (parent-driven flag).
off_cpu_ns: u64
wall_time_ns - cpu_time_ns: total off-CPU time (ns).
Includes all time the worker was not executing on a CPU: runnable queue wait, voluntary sleep, I/O wait, futex wait, etc.
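As a worked example of the formula: a worker with wall_time_ns = 1_000 and cpu_time_ns = 400 spent 600 ns off-CPU, which is 60% of its run. The sketch below uses hypothetical helper names. Its saturating_sub floor is an assumption that guards against clock skew and is not documented behavior:

```rust
/// off_cpu_ns as documented: wall-clock time minus thread CPU time.
/// saturating_sub (an assumption of this sketch) floors the result at 0
/// if clock granularity ever makes cpu_time_ns exceed wall_time_ns.
fn off_cpu_ns(wall_time_ns: u64, cpu_time_ns: u64) -> u64 {
    wall_time_ns.saturating_sub(cpu_time_ns)
}

/// Fraction of the run spent off-CPU, in [0.0, 1.0]; 0.0 for an empty run.
fn off_cpu_fraction(wall_time_ns: u64, cpu_time_ns: u64) -> f64 {
    if wall_time_ns == 0 {
        return 0.0;
    }
    off_cpu_ns(wall_time_ns, cpu_time_ns) as f64 / wall_time_ns as f64
}
```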
migration_count: u64
Number of observed CPU migrations (checked every 1024 work units).
cpus_used: BTreeSet<usize>
Set of all CPUs this worker ran on.
migrations: Vec<Migration>
Ordered list of CPU migration events with timestamps.
max_gap_ms: u64
Longest wall-clock gap observed at 1024-work-unit checkpoints (ms). High values indicate the task was preempted or descheduled near a checkpoint boundary.
max_gap_cpu: usize
CPU where the longest gap happened.
max_gap_at_ms: u64
When the longest gap happened (ms from start).
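The checkpoint mechanics behind migration_count, cpus_used, and the three max_gap_* fields can be sketched as a small state machine. This illustrates the idea under stated assumptions and is not the worker's actual code. The Checkpoints type is hypothetical, and the caller supplies the current CPU and a millisecond timestamp. max_gap_cpu / max_gap_at_ms are assumed to record the CPU and time observed at the checkpoint that ends the gap:

```rust
use std::collections::BTreeSet;

/// Hypothetical per-worker checkpoint state.
#[derive(Debug, Default)]
struct Checkpoints {
    last: Option<(usize, u64)>, // (cpu, ms since start) at the previous checkpoint
    migration_count: u64,
    cpus_used: BTreeSet<usize>,
    max_gap_ms: u64,
    max_gap_cpu: usize,
    max_gap_at_ms: u64,
}

impl Checkpoints {
    const INTERVAL: u64 = 1024;

    /// Called after every work unit; acts only on checkpoint boundaries.
    /// `cpu` and `now_ms` come from the caller so the logic stays testable.
    fn on_work_unit(&mut self, work_units: u64, cpu: usize, now_ms: u64) {
        if work_units % Self::INTERVAL != 0 {
            return;
        }
        self.cpus_used.insert(cpu);
        if let Some((prev_cpu, prev_ms)) = self.last {
            if cpu != prev_cpu {
                self.migration_count += 1;
            }
            let gap = now_ms.saturating_sub(prev_ms);
            if gap > self.max_gap_ms {
                self.max_gap_ms = gap;
                self.max_gap_cpu = cpu; // assumption: CPU seen at the gap's end
                self.max_gap_at_ms = now_ms;
            }
        }
        self.last = Some((cpu, now_ms));
    }
}
```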
wake_latencies_ns: Vec<u64>
Per-wakeup latency samples (ns). Measures off-CPU time
between the call that blocks (any blocking primitive — pipe
read, futex wait, poll, sched_yield, nanosleep, etc.)
and the wakeup that resumes execution; not a yield-specific
measure.
Populated for blocking work types: Bursty, PipeIo, FutexPingPong,
FutexFanOut, FanOutCompute, CacheYield, CachePipe, IoSyncWrite,
IoRandRead, IoConvoy, NiceSweep,
AffinityChurn, PolicyChurn, MutexContention, ForkExit (parent’s
waitpid wait), Sequence with Sleep/Yield/Io phases.
Distinct from iteration_costs_ns:
this field measures the OFF-CPU gap from blocking to wakeup (scheduler
wake latency); iteration_costs_ns measures the wall-clock
duration of a single compute iteration. The three pure-compute
variants that populate iteration_costs_ns (WorkType::AluHot,
WorkType::SmtSiblingSpin, and WorkType::IpcVariance) never block and report
wake_latencies_ns: vec![]. Other compute variants
(e.g. SpinWait, YieldHeavy, Mixed) populate neither
reservoir.
wake_sample_total: u64
Total number of wake-latency observations the worker
recorded, INCLUDING any that were dropped by the reservoir
sampler. wake_latencies_ns is reservoir-clamped to at
most MAX_WAKE_SAMPLES (100_000) entries; on a long run
that accumulates more than that many wake events, the
vector stays at its cap while this counter keeps climbing.
Host-side consumers that want to report “total wakeups
observed” (vs. “entries in the sample”) read this field;
percentile / CV computations read wake_latencies_ns.
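The reservoir-plus-counter split can be sketched as follows. The sketch assumes an Algorithm-R sampler, the algorithm named in the iteration_costs_ns entry, and uses a hand-rolled xorshift64 generator to stay dependency-free. Reservoir and its methods are illustrative names. The cap is a parameter so that a test can use a value far smaller than MAX_WAKE_SAMPLES:

```rust
/// Algorithm R reservoir: keeps a uniform random sample of at most `cap`
/// observations while `total` counts every observation offered.
struct Reservoir {
    cap: usize,
    samples: Vec<u64>,
    total: u64,
    rng: u64, // xorshift64 state; must be non-zero
}

impl Reservoir {
    fn new(cap: usize, seed: u64) -> Self {
        Reservoir { cap, samples: Vec::new(), total: 0, rng: seed | 1 }
    }

    /// xorshift64 (shifts 13, 7, 17): a tiny PRNG, adequate for a sketch.
    fn next_rand(&mut self) -> u64 {
        let mut x = self.rng;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.rng = x;
        x
    }

    fn offer(&mut self, value: u64) {
        self.total += 1;
        if self.samples.len() < self.cap {
            self.samples.push(value);
        } else {
            // Pick j uniformly in [0, total); replace slot j if it lands
            // inside the reservoir, i.e. with probability cap / total.
            let j = self.next_rand() % self.total;
            if (j as usize) < self.cap {
                self.samples[j as usize] = value;
            }
        }
    }
}
```

Once the cap is reached, samples.len() stays fixed while total keeps climbing, which mirrors the wake_latencies_ns / wake_sample_total relationship. Each observation ends up in the final sample with probability cap / total.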
iteration_costs_ns: Vec<u64>
Wall-clock duration of each compute iteration (ns),
including any scheduler preemption. Measured via
Instant::now() (CLOCK_MONOTONIC), so a sample includes any
off-CPU time the kernel inserted mid-iteration. The variance
across iterations is the load-bearing scheduler signal —
preemption inflates samples and that inflation is the
observable.
Reservoir-sampled at the same cap (MAX_WAKE_SAMPLES =
100_000) as wake_latencies_ns,
using the same Algorithm-R sampler.
Populated for pure compute work types where the worker
never blocks: WorkType::AluHot, WorkType::SmtSiblingSpin,
and WorkType::IpcVariance. Each sample is the elapsed
time from the start to the end of one outer-loop iteration’s
compute burst.
Distinct from wake_latencies_ns:
the wake-latency reservoir captures off-CPU time (futex /
pipe / nanosleep wakeups); this reservoir captures the
wall-clock duration of one compute iteration (which
includes any scheduler preemption inside the iteration).
The two are NOT comparable across variants — a
scheduler-A/B test that wants iteration cost for a compute
variant reads this field; a test that wants wake latency
for a blocking variant reads wake_latencies_ns.
iteration_cost_sample_total: u64
Total number of iteration-cost observations the worker
recorded, INCLUDING any that were dropped by the reservoir
sampler. Mirrors wake_sample_total
but for iteration_costs_ns:
host-side consumers that want “total compute iterations
observed” read this field; distribution computations read
iteration_costs_ns directly.
timer_latencies_ns: Vec<u64>
Per-timer-cycle latency samples (ns) for
crate::workload::WorkType::TimerLatency: the observed
clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME) wake time minus the
absolute deadline, floored at 0. Reservoir-clamped to at most
MAX_WAKE_SAMPLES and kept separate from wake_latencies_ns so cyclictest-style
timer latency does not blur with the blocking variants’ wake latency.
vec![] for every non-TimerLatency variant.
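The latency formula is a floored subtraction. In the sketch below, the helper names are hypothetical, and the deadline and observed wake times are plain nanosecond values rather than clock reads:

```rust
/// Latency of one timer cycle: observed wake time minus the absolute
/// deadline, floored at 0 so an on-time (or early) wake records 0.
fn timer_latency_ns(deadline_ns: u64, woke_ns: u64) -> u64 {
    woke_ns.saturating_sub(deadline_ns)
}

/// Pair each absolute deadline with its observed wake time.
fn timer_latencies(deadlines_ns: &[u64], wakes_ns: &[u64]) -> Vec<u64> {
    deadlines_ns
        .iter()
        .zip(wakes_ns)
        .map(|(&d, &w)| timer_latency_ns(d, w))
        .collect()
}
```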
timer_sample_total: u64
Total timer-cycle observations, INCLUDING any the reservoir dropped:
the true population for unbiased cross-phase weighting. Mirrors
wake_sample_total for timer_latencies_ns.
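Here is why the true population matters for weighting. Suppose phase A recorded 1_000_000 timer cycles but its reservoir kept only 100_000, while phase B recorded 1_000. Weighting by reservoir length gives B about 1% of the combined mean, whereas weighting by the totals gives it about 0.1%, its real share. The sketch below shows population weighting. PhaseSummary and weighted_mean are hypothetical, and how the host actually combines phases is not shown here:

```rust
/// Per-phase summary: mean of the reservoir plus the true observation count.
struct PhaseSummary {
    reservoir_mean_ns: f64,
    sample_total: u64,
}

/// Combine per-phase means, weighting each phase by its true population
/// (sample_total) rather than by how many samples survived the reservoir.
fn weighted_mean(phases: &[PhaseSummary]) -> Option<f64> {
    let total: u64 = phases.iter().map(|p| p.sample_total).sum();
    if total == 0 {
        return None;
    }
    let weighted: f64 = phases
        .iter()
        .map(|p| p.reservoir_mean_ns * p.sample_total as f64)
        .sum();
    Some(weighted / total as f64)
}
```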
iterations: u64
Outer-loop iteration count. This is the value CgroupStats::total_iterations sums
and the basis of the derived throughput rates (iterations_per_worker /
iterations_per_cpu_sec) and migration_ratio. It is NOT read by
the starvation gate, which reads work_units. A
Custom worker that wants the starvation / min_work_units gate
honored must also populate work_units.
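The split between the two counters matters for Custom workers. The sketch below uses hypothetical stand-ins for the two readers, not the crate's real assert_not_starved or CgroupStats code. A worker that bumps only iterations contributes to the throughput total but still looks starved to the work_units gate:

```rust
/// Hypothetical minimal view of the two counters.
struct Counters {
    work_units: u64,
    iterations: u64,
}

/// Stand-in for the starvation gate: reads work_units only.
fn not_starved(workers: &[Counters], min_work_units: u64) -> bool {
    workers.iter().all(|w| w.work_units >= min_work_units)
}

/// Stand-in for CgroupStats::total_iterations: sums iterations only.
fn total_iterations(workers: &[Counters]) -> u64 {
    workers.iter().map(|w| w.iterations).sum()
}
```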
schedstat_run_delay_ns: u64
Delta of /proc/self/schedstat field 2 (run_delay) over the work loop.
schedstat_run_count: u64
Delta of /proc/self/schedstat field 3 (pcount: the number of
times the task was scheduled in) over the work loop. This is
NOT a context-switch count; /proc/<pid>/status’s
voluntary_ctxt_switches / nonvoluntary_ctxt_switches are
the true context-switch counters and are not read here.
schedstat_cpu_time_ns: u64
Delta of /proc/self/schedstat field 1 (cpu_time) over the work loop.
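The sketch below shows the delta computation. It assumes the file holds the three whitespace-separated counters in the field order given above. SchedStat, parse_schedstat, and delta are illustrative names. On Linux, the input would come from std::fs::read_to_string("/proc/self/schedstat"), read once before and once after the work loop:

```rust
/// The three schedstat counters, in file order.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
struct SchedStat {
    cpu_time_ns: u64,  // field 1
    run_delay_ns: u64, // field 2
    run_count: u64,    // field 3 (pcount)
}

/// Parse "cpu_time run_delay pcount"; None if a field is missing or malformed.
fn parse_schedstat(text: &str) -> Option<SchedStat> {
    let mut it = text.split_whitespace().map(|f| f.parse::<u64>().ok());
    Some(SchedStat {
        cpu_time_ns: it.next()??,
        run_delay_ns: it.next()??,
        run_count: it.next()??,
    })
}

/// Delta over the work loop: after - before, field by field.
fn delta(before: SchedStat, after: SchedStat) -> SchedStat {
    SchedStat {
        cpu_time_ns: after.cpu_time_ns.saturating_sub(before.cpu_time_ns),
        run_delay_ns: after.run_delay_ns.saturating_sub(before.run_delay_ns),
        run_count: after.run_count.saturating_sub(before.run_count),
    }
}
```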
completed: bool
true when the worker reached its natural end: either the
outer work loop observed STOP and exited cleanly, or a
custom-closure payload returned from its run function. A
sentinel report synthesized by
WorkloadHandle::stop_and_collect’s decode-failure
fallback (see exit_info below) carries false. This lets downstream
consumers distinguish “worker ran to completion and
observed zero iterations” (completed: true, iterations: 0,
legitimate for pathologically short test windows) from
“worker died / timed out before recording anything”
(completed: false, iterations: 0, the sentinel shape).
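The completed / iterations combinations map onto three outcomes. The classifier below is hypothetical, and Outcome and classify are illustrative names:

```rust
/// The outcomes the completed / iterations pair distinguishes.
#[derive(Debug, PartialEq)]
enum Outcome {
    /// Ran to its natural end and recorded work.
    Completed,
    /// Ran to its natural end but observed zero iterations (very short window).
    CompletedIdle,
    /// Did not reach its natural end: the sentinel shape; exit_info says why.
    NotCompleted,
}

fn classify(completed: bool, iterations: u64) -> Outcome {
    match (completed, iterations) {
        (false, _) => Outcome::NotCompleted,
        (true, 0) => Outcome::CompletedIdle,
        (true, _) => Outcome::Completed,
    }
}
```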
numa_pages: BTreeMap<usize, u64>
Per-NUMA-node page counts from /proc/self/numa_maps after the workload,
keyed by node ID. Empty when numa_maps is unavailable. numa_maps reports
the per-node RESIDENT pages of the calling task’s mm. For
CloneMode::Fork workers (the scenario-engine default) each worker has
a disjoint mm, so SUMming across workers gives the true cgroup page total;
for CloneMode::Thread siblings share one mm and each reports the SAME
residency, so a SUM counts shared pages once per thread. Consumers
(cgroup_stats / phase_cgroup_stats) SUM this field, which is correct for the
Fork default; both reducers inherit the Thread-mode caveat identically
(no per-phase divergence).
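The sketch below builds the per-node map from numa_maps text, assuming the kernel's N<node>=<pages> tokens on each mapping line. It also performs the per-worker SUM that the reducers use. parse_numa_maps and sum_across_workers are hypothetical helpers. As noted above, the SUM is a true total only for Fork-mode workers:

```rust
use std::collections::BTreeMap;

/// Sum the N<node>=<pages> tokens of every numa_maps line into a
/// per-node page count (the shape of numa_pages).
fn parse_numa_maps(text: &str) -> BTreeMap<usize, u64> {
    let mut pages: BTreeMap<usize, u64> = BTreeMap::new();
    for token in text.split_whitespace() {
        if let Some(rest) = token.strip_prefix('N') {
            if let Some((node_s, count_s)) = rest.split_once('=') {
                if let (Ok(node), Ok(count)) = (node_s.parse::<usize>(), count_s.parse::<u64>()) {
                    *pages.entry(node).or_insert(0) += count;
                }
            }
        }
    }
    pages
}

/// Per-node SUM across workers: correct when each worker has its own mm
/// (Fork mode); Thread-mode siblings would be counted once per thread.
fn sum_across_workers(per_worker: &[BTreeMap<usize, u64>]) -> BTreeMap<usize, u64> {
    let mut total = BTreeMap::new();
    for map in per_worker {
        for (&node, &n) in map {
            *total.entry(node).or_insert(0) += n;
        }
    }
    total
}
```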
vmstat_numa_pages_migrated: u64
Delta of /proc/vmstat numa_pages_migrated over the work loop. This is
the SYSTEM-WIDE vmstat NUMA_PAGE_MIGRATE vm_event (summed across all
CPUs), NOT the per-task task_struct field of the same name surfaced in
/proc/PID/sched. Because every worker reads the same system-wide
counter, consumers fold it as MAX across workers (a SUM would inflate it
by the worker count); see PhaseCgroupStats::cross_node_migrated. The
vm_event is bumped only by NUMA balancing (the source of NUMA page
migrations), so the delta is 0 on kernels/configs without it (a
measurement-availability caveat, not a wrong value).
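The SUM-versus-MAX folding rule is sketched below with a hypothetical two-field view. NumaView and fold are illustrative names, not the PhaseCgroupStats code. The per-worker vmstat deltas in the test differ slightly, as they might when workers sample the shared counter at their own loop boundaries:

```rust
/// Hypothetical per-worker view of the two NUMA fields.
struct NumaView {
    resident_pages: u64,        // sum of this worker's numa_pages values
    vmstat_pages_migrated: u64, // delta of the shared, system-wide counter
}

/// Per-task residency is SUMmed (Fork-mode workers have disjoint mms);
/// the system-wide vmstat delta is folded with MAX so it is not counted
/// once per worker. Returns (pages, migrated).
fn fold(workers: &[NumaView]) -> (u64, u64) {
    let pages: u64 = workers.iter().map(|w| w.resident_pages).sum();
    let migrated = workers
        .iter()
        .map(|w| w.vmstat_pages_migrated)
        .max()
        .unwrap_or(0);
    (pages, migrated)
}
```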
exit_info: Option<WorkerExitInfo>
Diagnostic attached only to sentinel reports: populated when
stop_and_collect synthesized the entry because no (or an
unparseable) postcard payload came back on the report pipe.
None on every real worker-produced report. Lets operators
distinguish the four failure shapes that all collapse to
“empty pipe + no report”:
- WorkerExitInfo::Exited with a non-zero code: the worker reached
  _exit(code) without writing the report, typically via the
  catch_unwind Err arm in the worker-child closure (panic under
  panic = "unwind") or the 30s poll-start timeout’s early _exit(1).
- WorkerExitInfo::Signaled: the worker was killed by a signal, e.g.
  SIGABRT under panic = "abort", SIGKILL from the still-alive
  escalation in stop_and_collect, or an external signal (OOM killer,
  operator SIGKILL).
- WorkerExitInfo::TimedOut: the worker never exited within the 5s
  collection deadline and the WNOHANG reap observed StillAlive; it was
  escalated via SIGKILL + waitpid(None).
- WorkerExitInfo::WaitFailed: waitpid itself returned an error
  (ECHILD / EINTR). Typically a plumbing bug: the child was reaped by
  an external signal handler, a double-reap regression, or the pid was
  recycled.
No skip_serializing_if: postcard is a positional, schemaless
format, so every Serialize call must emit every field in the
same order or the decoder reads the next field’s bytes off
the wire (silent data corruption). The Option<…> tag itself
(one byte) is the only overhead on the live-worker path.
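The hazard can be demonstrated with a hand-rolled positional encoding that mimics a one-byte Option tag followed by the value. This is a stand-in for illustration, not the postcard crate. Skipping the None tag on one record shifts every later field, and the first record still decodes “successfully” with wrong values:

```rust
/// Toy record: fields are written in declaration order, with no names or
/// lengths on the wire, as in a positional format.
#[derive(Debug, PartialEq)]
struct Rec {
    exit_code: Option<u8>,
    group_idx: u8,
    tid: u8,
}

/// Encode with a 0/1 Option tag; `skip_none` imitates skip_serializing_if.
fn encode(r: &Rec, skip_none: bool, out: &mut Vec<u8>) {
    match r.exit_code {
        Some(c) => out.extend_from_slice(&[1, c]),
        None if skip_none => {} // tag omitted: the decoder cannot know
        None => out.push(0),
    }
    out.push(r.group_idx);
    out.push(r.tid);
}

/// Decode one record and return it with the unconsumed bytes.
fn decode(bytes: &[u8]) -> Option<(Rec, &[u8])> {
    let (&tag, rest) = bytes.split_first()?;
    let (exit_code, rest) = match tag {
        0 => (None, rest),
        1 => {
            let (&c, rest) = rest.split_first()?;
            (Some(c), rest)
        }
        _ => return None,
    };
    let (&group_idx, rest) = rest.split_first()?;
    let (&tid, rest) = rest.split_first()?;
    Some((Rec { exit_code, group_idx, tid }, rest))
}
```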
is_messenger: bool
true when this worker served as the messenger for a
wake-fanout work type (WorkType::FutexFanOut or
WorkType::FanOutCompute): the single writer that
advances the shared generation and issues futex_wake for
its group. false for receivers and for every non-fanout
work type.
Populated from the is_messenger flag on the
futex: Option<(*mut u32, bool)> parameter threaded into
worker_main. A sentinel report synthesized by the
decode-failure fallback in
WorkloadHandle::stop_and_collect carries false via
Default, matching its completed: false shape.
Enables per-worker latency-participation assertions in
tests — a receiver worker produces wake_latencies_ns
entries while its messenger pair records wake-side work but
no wake latency. Without this field, tests had to
cross-reference per-group indexing or guess from the empty
vector — ambiguous on groups where the messenger legitimately
exits before producing a report.
group_idx: usize
Index of the worker group this report belongs to.
0 denotes the primary group described by
WorkloadConfig’s top-level work_type / num_workers /
affinity / sched_policy fields. 1..=N denotes
composed groups in the order they appear in
WorkloadConfig::composed. Reports collected by
WorkloadHandle::stop_and_collect are tagged with the
group_idx of the spawning WorkSpec (or 0 for the
primary), so per-group filtering in test assertions can
cleanly partition the vector.
Sentinel reports (synthesized on missing or undecodable
payload / panic / timeout) carry the group_idx of the
worker whose pid the sentinel replaces, so a “this composed
group failed” assertion still works on an outright crash.
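Per-group partitioning that combines the two fields might look like the sketch below. View and receiver_samples are hypothetical names. The rule of excluding messengers follows the is_messenger entry above:

```rust
/// Hypothetical slice of a report: just the fields this filter reads.
struct View {
    group_idx: usize,
    is_messenger: bool,
    wake_latencies_ns: Vec<u64>,
}

/// Wake-latency samples from the receivers of one group; messengers are
/// excluded because they record wake-side work but no wake latency.
fn receiver_samples(reports: &[View], group: usize) -> Vec<u64> {
    reports
        .iter()
        .filter(|r| r.group_idx == group && !r.is_messenger)
        .flat_map(|r| r.wake_latencies_ns.iter().copied())
        .collect()
}
```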
affinity_error: Option<String>
Rendered error from the worker’s set_thread_affinity
call, or None when affinity setup succeeded (or the
worker had no affinity to apply). Populated by
worker_main when the pre-loop
set_thread_affinity(tid, cpus) returns Err. The
worker continues with the inherited (or kernel-default)
cpumask so the test still produces an observable outcome,
but the failure is surfaced in the report rather than
silently dropped via let _ = …. The expected
failure shape is EINVAL when no requested cpu is both online
and inside the cpuset cgroup’s cpus.allowed mask; EPERM is
reachable when a more privileged tracer set the worker’s
cpus_allowed and a container policy denies further widening.
Sentinel reports synthesized by
WorkloadHandle::stop_and_collect leave this field at
its default None: a worker that died before
worker_main ran has no affinity-error observation.
No skip_serializing_if: postcard is positional and
schemaless, so every Serialize call must emit every field
in the same order — skipping a field shifts the decoder
onto the next field’s bytes (silent corruption). The
Option<…> tag (one byte) is the only overhead on the
success path.
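The “surface instead of drop” pattern amounts to converting the Result into the report's Option<String> instead of discarding it. The sketch below uses a hypothetical fake_set_affinity in place of the real set_thread_affinity. Like sched_setaffinity(2), it fails with EINVAL only when no requested CPU is permitted:

```rust
use std::fmt::Display;
use std::io;

/// Convert a setup Result into the report's Option<String> rather than
/// discarding it with `let _ = ...`; the worker keeps running either way.
fn capture_error<E: Display>(result: Result<(), E>) -> Option<String> {
    result.err().map(|e| e.to_string())
}

/// Hypothetical stand-in for set_thread_affinity: succeeds if at least one
/// requested CPU is allowed, otherwise fails with EINVAL (22 on Linux).
fn fake_set_affinity(cpus: &[usize], allowed: &[usize]) -> Result<(), io::Error> {
    if cpus.iter().any(|c| allowed.contains(c)) {
        Ok(())
    } else {
        Err(io::Error::from_raw_os_error(22))
    }
}
```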
sched_policy_error: Option<String>
set_sched_policy failure text ({e:#}), or None when the
per-worker scheduling-policy set succeeded (or the policy was the
Normal no-op). Load-bearing for the verifier dispatch probe: a
probe worker configured with SchedPolicy::Ext whose
sched_setattr(SCHED_EXT) was rejected (e.g. the scheduler set
scx.disallow on it) stays SCHED_OTHER, so its iteration
progress does NOT prove the BPF scheduler dispatched it;
run_and_confirm_dispatch excludes any worker with a
sched_policy_error from the dispatch proof. Sentinel reports
synthesized by WorkloadHandle::stop_and_collect leave this
None (no policy set was attempted). Same positional-serde
reasoning as Self::affinity_error: no skip_serializing_if.
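A sketch of the exclusion rule follows, with hypothetical names (ProbeView, dispatch_evidence). Treating iterations > 0 as “progress” is an assumption of this sketch, and the real run_and_confirm_dispatch criteria may differ:

```rust
/// Hypothetical view of the fields the dispatch proof cares about.
struct ProbeView {
    iterations: u64,
    sched_policy_error: Option<String>,
}

/// Count workers whose progress may serve as dispatch evidence: the
/// policy was applied (no sched_policy_error) and the worker made progress.
fn dispatch_evidence(reports: &[ProbeView]) -> usize {
    reports
        .iter()
        .filter(|r| r.sched_policy_error.is_none() && r.iterations > 0)
        .count()
}
```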
phase_slices: Vec<PhaseSlice>
Per-phase telemetry slices for a backdrop (persistent) worker
that spanned multiple scenario steps. EMPTY for step-local
workers and for any backdrop worker that observed no phase
boundary: the worker pushes a PhaseSlice only when the
parent-driven phase_epoch actually changes, so a worker whose
epoch never moved (step-local pools are never bumped) ships none,
keeping the wire empty on the common path. Each slice carries
the per-phase subset of the whole-run telemetry above, scoped to
one phase’s hold window; the host expands these into per-epoch
PhaseBucket entries, providing the per-phase attribution a backdrop
worker otherwise lacks, since it is collected once with
step_index = None.
Appended LAST so the positional postcard decode order of every
prior field is unchanged. #[claim(skip)]: there is no
test-author claim surface for the raw wire slices; assertions
run against the host-expanded PhaseCgroupStats carriers, which
carry their own #[derive(Claim)].
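The push-only-on-change rule can be sketched with an atomic epoch that the parent bumps and a worker-side tracker. PhaseTracker, Slice, the AtomicU64 representation of phase_epoch, and the per-work-unit check are all assumptions. The sketch also leaves open whether the last open phase is flushed at stop, which this entry does not specify:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical per-phase slice: a work counter and the epoch it covers.
#[derive(Debug, PartialEq)]
struct Slice {
    epoch: u64,
    work_units: u64,
}

/// Worker-side accumulator: pushes a slice only when the parent-driven
/// epoch changes, so a worker whose epoch never moves ships none.
struct PhaseTracker {
    seen_epoch: u64,
    phase_work: u64,
    slices: Vec<Slice>,
}

impl PhaseTracker {
    fn new(epoch: &AtomicU64) -> Self {
        PhaseTracker {
            seen_epoch: epoch.load(Ordering::Acquire),
            phase_work: 0,
            slices: Vec::new(),
        }
    }

    fn on_work_unit(&mut self, epoch: &AtomicU64) {
        let now = epoch.load(Ordering::Acquire);
        if now != self.seen_epoch {
            // Close the finished phase before counting work in the new one.
            self.slices.push(Slice { epoch: self.seen_epoch, work_units: self.phase_work });
            self.seen_epoch = now;
            self.phase_work = 0;
        }
        self.phase_work += 1;
    }
}
```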
taobench_whole: Option<TaobenchStats>
Whole-run taobench COUNTER aggregate: Some only for a Taobench worker,
None otherwise. Shipped so the host can derive run-level qps/hit Rate keys
(taobench_*_ops_per_sec / taobench_hit_fraction /
taobench_command_hit_rate) for --noise-adjust spread analysis. The
per-phase PhaseSlice::taobench carriers feed the per-phase metrics and the
serve-latency distribution; this whole-run carrier holds COUNTERS only
(TaobenchStats), because the serve histogram is per-phase data. The
per-phase elapsed_ns is MAX-merged across concurrent threads, so summing
phase windows gives the wrong qps denominator; hence the engine’s
authoritative whole-run counter aggregate is shipped directly. Appended
LAST (after phase_slices) to keep the positional postcard decode order of
every prior field unchanged. The field is pub like every other
WorkerReport field, so external custom workers built via
crate::workload::WorkType::custom can construct the report with a struct
literal; a non-taobench worker sets None. #[claim(skip)]: framework wire
plumbing, not a test-author claim surface; assertions run against the
host-derived run-level taobench_* Rate metrics.
Trait Implementations
impl Clone for WorkerReport
fn clone(&self) -> WorkerReport
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.
impl Debug for WorkerReport
impl Default for WorkerReport
fn default() -> WorkerReport
impl<'de> Deserialize<'de> for WorkerReport
fn deserialize<__D>(__deserializer: __D) -> Result<Self, __D::Error>
where
    __D: Deserializer<'de>,
impl Serialize for WorkerReport
impl WorkerReportClaim for WorkerReport
fn claim_tid<'a>(&'a self, verdict: &'a mut Verdict) -> ClaimBuilder<'a, i32>
fn claim_work_units<'a>(&'a self, verdict: &'a mut Verdict) -> ClaimBuilder<'a, u64>
fn claim_cpu_time_ns<'a>(&'a self, verdict: &'a mut Verdict) -> ClaimBuilder<'a, u64>
fn claim_wall_time_ns<'a>(&'a self, verdict: &'a mut Verdict) -> ClaimBuilder<'a, u64>
fn claim_off_cpu_ns<'a>(&'a self, verdict: &'a mut Verdict) -> ClaimBuilder<'a, u64>
fn claim_migration_count<'a>(&'a self, verdict: &'a mut Verdict) -> ClaimBuilder<'a, u64>
fn claim_cpus_used<'a>(&'a self, verdict: &'a mut Verdict) -> SetClaim<'a, usize>
fn claim_migrations<'a>(&'a self, verdict: &'a mut Verdict) -> SeqClaim<'a, Migration>
fn claim_max_gap_ms<'a>(&'a self, verdict: &'a mut Verdict) -> ClaimBuilder<'a, u64>
fn claim_max_gap_cpu<'a>(&'a self, verdict: &'a mut Verdict) -> ClaimBuilder<'a, usize>
fn claim_max_gap_at_ms<'a>(&'a self, verdict: &'a mut Verdict) -> ClaimBuilder<'a, u64>
fn claim_wake_latencies_ns<'a>(&'a self, verdict: &'a mut Verdict) -> SeqClaim<'a, u64>
fn claim_wake_sample_total<'a>(&'a self, verdict: &'a mut Verdict) -> ClaimBuilder<'a, u64>
fn claim_iteration_costs_ns<'a>(&'a self, verdict: &'a mut Verdict) -> SeqClaim<'a, u64>
fn claim_iteration_cost_sample_total<'a>(&'a self, verdict: &'a mut Verdict) -> ClaimBuilder<'a, u64>
fn claim_timer_latencies_ns<'a>(&'a self, verdict: &'a mut Verdict) -> SeqClaim<'a, u64>
fn claim_timer_sample_total<'a>(&'a self, verdict: &'a mut Verdict) -> ClaimBuilder<'a, u64>
fn claim_iterations<'a>(&'a self, verdict: &'a mut Verdict) -> ClaimBuilder<'a, u64>
fn claim_schedstat_run_delay_ns<'a>(&'a self, verdict: &'a mut Verdict) -> ClaimBuilder<'a, u64>
fn claim_schedstat_run_count<'a>(&'a self, verdict: &'a mut Verdict) -> ClaimBuilder<'a, u64>
fn claim_schedstat_cpu_time_ns<'a>(&'a self, verdict: &'a mut Verdict) -> ClaimBuilder<'a, u64>
fn claim_completed<'a>(&'a self, verdict: &'a mut Verdict) -> ClaimBuilder<'a, bool>
fn claim_vmstat_numa_pages_migrated<'a>(&'a self, verdict: &'a mut Verdict) -> ClaimBuilder<'a, u64>
fn claim_exit_info<'a>(&'a self, verdict: &'a mut Verdict) -> ClaimBuilder<'a, Option<WorkerExitInfo>>
fn claim_is_messenger<'a>(&'a self, verdict: &'a mut Verdict) -> ClaimBuilder<'a, bool>
fn claim_group_idx<'a>(&'a self, verdict: &'a mut Verdict) -> ClaimBuilder<'a, usize>
fn claim_affinity_error<'a>(&'a self, verdict: &'a mut Verdict) -> ClaimBuilder<'a, Option<String>>
fn claim_sched_policy_error<'a>(&'a self, verdict: &'a mut Verdict) -> ClaimBuilder<'a, Option<String>>
Auto Trait Implementations
impl Freeze for WorkerReport
impl RefUnwindSafe for WorkerReport
impl Send for WorkerReport
impl Sync for WorkerReport
impl Unpin for WorkerReport
impl UnwindSafe for WorkerReport
Blanket Implementations
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> CloneToUninit for T
where
    T: Clone,
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self>
if into_left is true.
Converts self into a Right variant of Either<Self, Self>
otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self>
if into_left(&self) returns true.
Converts self into a Right variant of Either<Self, Self>
otherwise.