pub struct PhaseCgroupStats {
pub num_workers: usize,
pub cpus_used: BTreeSet<usize>,
pub wake_latencies_ns: Vec<u64>,
pub wake_sample_total: u64,
pub timer_latencies_ns: Vec<u64>,
pub timer_sample_total: u64,
pub run_delays_ns: Vec<u64>,
pub off_cpu_pcts: Vec<f64>,
pub total_migrations: u64,
pub total_iterations: u64,
pub total_cpu_time_ns: u64,
pub numa_pages_local: u64,
pub numa_pages_total: u64,
pub cross_node_migrated: u64,
pub max_gap_ms: u64,
pub max_gap_cpu: usize,
pub stripped: bool,
pub metrics: BTreeMap<String, f64>,
/* private fields */
}
Per-phase per-cgroup raw telemetry components: the per-phase analogue of
CgroupStats. Holds RAW components (sample vectors + counters), NOT the
reduced ratios/percentiles CgroupStats computes, so whole-run and
cross-run aggregates RE-POOL from the components at every level. This is the
per-phase telemetry thesis: an aggregate is recomputed over the pooled
components, never averaged from ready-made per-phase reductions, because a
percentile or weighted ratio cannot be recovered from per-phase scalars.
Covers every TYPED CgroupStats reduction: avg/min/max off-CPU% and
spread from off_cpu_pcts; p99/median/CV wake latency from
wake_latencies_ns; mean/worst run-delay from run_delays_ns;
migration_ratio, iterations_per_cpu_sec, iterations_per_worker,
page_locality, cross_node_migration_ratio from their counter components;
the COUPLED worst gap (ms + the CPU that owned it) from max_gap_ms /
max_gap_cpu; cpus_used / num_cpus from cpus_used. EXCLUDES
CgroupStats::ext_metrics (the generic extensible map — a per-phase
per-cgroup custom metric is a future extension, not part of the typed
carrier). Lives in PhaseBucket::per_cgroup, keyed by cgroup name. The
structural carrier is empty until a capture path populates it per phase.
Fields
num_workers: usize
Worker count in this cgroup for the phase: the denominator for the
re-pooled per-worker iteration rate (iterations_per_worker =
total_iterations / this). This is a set CARDINALITY (reports.len()),
not a kernel counter, but it SUMs in merge because a single cgroup name
can emit MULTIPLE carriers in one step — collect_handles builds one per
WorkloadHandle, and a CgroupDef with several WorkSpec entries
(.work(..).work(..)) spawns one handle per WorkSpec under the same
name (apply_setup). Those carriers cover DISJOINT worker subsets, so the
cardinality of their union is the SUM (4 + 2 → 6), matching cgroup_stats
over the pooled reports (reports.len()); a MAX would understate the count
and inflate iterations_per_worker. (The disjointness is the real
justification — were carriers ever to overlap, the SUM would over-count.)
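
The SUM-versus-MAX argument above can be checked with a minimal sketch. `Carrier`, `merge`, and `iterations_per_worker` are hypothetical stand-ins that hold only the two fields involved; they are not the crate's `PhaseCgroupStats::merge`.

```rust
/// Hypothetical two-field stand-in for a per-phase carrier (not the real type).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Carrier {
    num_workers: usize,
    total_iterations: u64,
}

/// Merge two same-name carriers that cover DISJOINT worker subsets:
/// both the worker count and the iteration counter add.
fn merge(a: Carrier, b: Carrier) -> Carrier {
    Carrier {
        num_workers: a.num_workers + b.num_workers,
        total_iterations: a.total_iterations + b.total_iterations,
    }
}

/// Re-pooled per-worker rate; None when the carrier has no workers.
fn iterations_per_worker(c: Carrier) -> Option<f64> {
    if c.num_workers == 0 {
        None
    } else {
        Some(c.total_iterations as f64 / c.num_workers as f64)
    }
}
```

Merging a 4-worker carrier (400 iterations) with a 2-worker carrier (200 iterations) gives 6 workers and a rate of 100 per worker. A MAX fold would keep 4 workers and report 600 / 4 = 150, which is the inflation the text warns about.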
cpus_used: BTreeSet<usize>
Distinct CPUs the cgroup’s workers ran on in the phase (union of each
worker’s cpus_used). Re-pools CgroupStats::cpus_used / num_cpus
(= the set / its length) via a set UNION.
wake_latencies_ns: Vec<u64>
Pooled per-wakeup latency samples (ns) across the cgroup’s workers in
the phase, un-reduced so p99 / median / CV re-pool over the combined set.
The POOL is reservoir-capped at MAX_WAKE_SAMPLES (the per-worker bound,
re-applied when same-name carriers merge so the carrier payload stays
bounded on the size-limited guest bulk port — without it the pool would be
workers × MAX_WAKE_SAMPLES); wake_sample_total carries the true
pre-cap population. The CARRIER-level reductions divide by
wake_latencies_ns.len() (this capped pool size), NOT by
wake_sample_total: Self::wake_summary takes p99 / median over len,
and cgroup_stats computes cv = stddev/mean with
n = all_latencies.len(). The RUN-level cross-phase re-pool
(populate_run_distribution_metrics) instead population-WEIGHTS (see
the PARITY CONTRACT below): its CV / mean divide by Σ per-sample weights
(the reconstructed true population), which equals len only below the cap.
PARITY CONTRACT (the one component whose parity is size-dependent): for
pools ≤ MAX_WAKE_SAMPLES the reservoir IS the full concatenation, so the
p99 / median / CV re-pool reproduces cgroup_stats VALUE-FOR-VALUE.
Above the cap the carrier holds a distribution-preserving reservoir
SUBSAMPLE while cgroup_stats reduces over the full per-worker concat,
so the re-pool is DISTRIBUTION-EQUIVALENT, not byte-identical (the bounded
bulk-port frame forbids carrying the full pool; staged reservoirs cannot be
byte-identical to a single full-pool reduction). This is BY DESIGN:
cgroup_stats stays the uncapped run-level authority (capping it to match
the carrier would discard most of a multi-worker cgroup’s samples to chase
a sub-display-precision artifact). Both layers de-skew the cap. The
carrier’s >cap MERGE is WEIGHTED by wake_sample_total
(Self::weighted_merge_reservoirs), so the subsample is an UNBIASED sample
of the combined population, with no smaller-population skew. The
cross-PHASE run-level pool in populate_run_distribution_metrics weights
each phase carrier’s samples by wake_sample_total / wake_latencies_ns.len()
(so a phase that exceeded the cap contributes by true population, not
capped length) and reduces with the weighted percentile / moments; the
prior length-weighted concat is gone. Below the cap every weight is 1.0,
so the weighted P99 / median / mean / worst are BYTE-identical to the
unweighted concat; the weighted CV matches only within ~1e-9 (it sums in
f64 where the unweighted path sums the mean in u64, and a weighted variance
cannot keep the u64 sum).
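
The population weighting described above can be illustrated with a minimal sketch. It assumes a weighted percentile defined as "the smallest value whose cumulative weight reaches p × Σw"; `phase_weight`, `pool`, and `weighted_percentile` are hypothetical names, and the crate's own reducers in populate_run_distribution_metrics may use a different rank convention.

```rust
/// Weight applied to every sample of one phase carrier: true population over
/// capped pool length. Equals 1.0 while the phase stayed below the cap.
fn phase_weight(sample_total: u64, capped_len: usize) -> f64 {
    if capped_len == 0 {
        0.0
    } else {
        sample_total as f64 / capped_len as f64
    }
}

/// Pool (samples, sample_total) pairs from several phases into
/// (value, weight) pairs.
fn pool(phases: &[(&[u64], u64)]) -> Vec<(u64, f64)> {
    let mut out = Vec::new();
    for &(samples, total) in phases {
        let w = phase_weight(total, samples.len());
        out.extend(samples.iter().map(|&v| (v, w)));
    }
    out
}

/// Weighted percentile (assumed convention): the smallest value whose
/// cumulative weight reaches p * total weight. Sorts the slice in place.
fn weighted_percentile(pooled: &mut [(u64, f64)], p: f64) -> Option<u64> {
    if pooled.is_empty() {
        return None;
    }
    pooled.sort_by_key(|&(v, _)| v);
    let total: f64 = pooled.iter().map(|&(_, w)| w).sum();
    let target = p * total;
    let mut cumulative = 0.0;
    for &(v, w) in pooled.iter() {
        cumulative += w;
        if cumulative >= target {
            return Some(v);
        }
    }
    // Floating-point shortfall guard: fall back to the largest value.
    pooled.last().map(|&(v, _)| v)
}
```

With one phase holding samples [10, 20] for a true population of 2 and another holding [100, 200] for a true population of 200, the weighted median is 100, whereas treating all four samples equally yields 20. When every weight is 1.0 the function reduces to the unweighted case.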
wake_sample_total: u64
True wakeup count before reservoir clamping (wake_latencies_ns is
capped), so the re-pool can report the real population size. An
intentional ADDITION over CgroupStats (which has no such field), NOT
a mirrored reduction — do not strip it in a strict-parity audit; it is
the only source of the true wakeup population once wake_latencies_ns is
reservoir-clamped, and it is for REPORTING, not the CV denominator.
timer_latencies_ns: Vec<u64>
Pooled per-timer-cycle latency samples (ns) across the cgroup’s
crate::workload::WorkType::TimerLatency workers in the phase,
un-reduced so median / p99 / p999 / worst re-pool over the combined set.
Reservoir-capped at MAX_WAKE_SAMPLES on same-name-carrier merge exactly
like wake_latencies_ns (population-weighted >cap); timer_sample_total
carries the true pre-cap population. Distinct carrier from
wake_latencies_ns.
timer_sample_total: u64
True timer-cycle count before reservoir clamping (timer_latencies_ns is
capped), so the re-pool reports the real population. Mirrors
wake_sample_total for the timer carrier.
run_delays_ns: Vec<u64>
Pooled per-worker schedstat run-delay samples (RAW ns) for the phase,
un-reduced so mean / worst run-delay re-pool over the combined set; the
re-pool converts ns → µs to match CgroupStats’s run-delay-µs fields.
Stored as raw kernel ns (like wake_latencies_ns), not pre-converted,
per the raw-component thesis. GRANULARITY: unlike wake_latencies_ns
(one per WAKEUP), each entry here is ONE per-worker value — that
worker’s sched_info.run_delay delta over the carrier’s window: the
whole-run schedstat_run_delay_ns (end−start) for the step-local
carrier, or the per-phase delta for the backdrop slice carrier. So the
pool size is the worker count, the mean is the average per-worker total
queued-to-run delay, and
worst_run_delay_us selects the single worker with the largest total
queued-to-run delay (NOT the worst single dispatch).
off_cpu_pcts: Vec<f64>
Per-worker off-CPU% samples for the phase, un-reduced. Carried for the
per-phase per-cgroup off-CPU% RENDER — the avg / min / max /
spread of the combined set. NOT consumed by the run-level
distributional re-pool: off-CPU% has no run-level Distribution metric
(off-CPU%/spread is intrinsically per-cgroup, so the run-level
worst_spread stays the cross-cgroup max of per-cgroup
CgroupStats::spread via the typed AssertResult::merge fold, not a
pooled distribution). An EMPTY vec is the not-measured state (no worker
with positive wall time), preserving the not-measured vs measured-zero
distinction CgroupStats keeps. Stored as raw samples, not pre-reduced
extremes, because the mean is unrecoverable from min/max alone for >2
workers. Each sample is off_cpu_ns / wall_time_ns * 100, where off_cpu_ns = wall_time_ns - cpu_time_ns and
cpu_time_ns is the CLOCK_THREAD_CPUTIME_ID thread on-CPU time
(workload/worker off_cpu_ns at report build). total_cpu_time_ns is a
DISTINCT on-CPU measurement (schedstat_cpu_time_ns, the /proc
schedstat se.sum_exec_runtime): both ultimately track on-CPU runtime but
are sampled at different points (the CLOCK_THREAD_CPUTIME_ID read folds
the in-flight delta; the schedstat field reads the stored value), so the
two need not be byte-identical and must not be cross-wired in a re-pool.
total_migrations: u64
Sum of per-worker CPU-migration counts in the phase (Counter).
total_iterations: u64
Sum of per-worker iteration counts in the phase (Counter).
total_cpu_time_ns: u64
Sum of per-worker on-CPU time (ns) in the phase: the
overcommit-invariant rate denominator (Counter). Sourced from
schedstat_cpu_time_ns (the /proc schedstat se.sum_exec_runtime,
rq-charged on-CPU ns) — a DISTINCT on-CPU-time sample from the
CLOCK_THREAD_CPUTIME_ID time behind off_cpu_pcts (different sample
point; not byte-identical), so do not cross-wire the two in a re-pool.
numa_pages_local: u64
Pages on the expected NUMA node(s): the page-locality numerator. A per-task
/proc/self/numa_maps residency GAUGE (current snapshot of the task’s mm,
recomputed each read — the kernel zeroes and re-walks the page tables),
SPATIALLY summed across the cgroup’s workers within a phase: disjoint-mm
under the CloneMode::Fork default (the true cgroup total), but
CloneMode::Thread siblings share one mm and the SUM over-counts shared
pages once per thread (caveat inherited from WorkerReport::numa_pages).
The CROSS-PHASE fold takes the LATEST measured snapshot (see
numa_agg_per_cgroup), never a sum (summing residency across phases
over-counts by the phase count).
numa_pages_total: u64
Total allocated pages: the SHARED denominator for BOTH page_locality
(numa_pages_local / this) AND cross_node_migration_ratio
(cross_node_migrated / this). A per-task /proc/self/numa_maps residency
GAUGE (same class/folds as numa_pages_local: within-phase SUM across
workers — disjoint-mm under the CloneMode::Fork default, Thread-mode
over-count caveat inherited from WorkerReport::numa_pages — cross-phase
LATEST snapshot); the kernel computes both ratios over the identical page
total, so one field serves both — a separate cross_node_total would invite
a silent desync.
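
A minimal sketch of the cross-phase LATEST-snapshot fold and the shared-denominator ratio follows. `NumaSnapshot`, `latest_measured`, and `page_locality` are hypothetical names, and modelling "not measured" as `None` is an assumption made for illustration; it is not the crate's numa_agg_per_cgroup.

```rust
/// Hypothetical per-phase residency gauge: pages on the expected node(s)
/// and total allocated pages.
#[derive(Debug, Clone, Copy, PartialEq)]
struct NumaSnapshot {
    local: u64,
    total: u64,
}

/// Cross-phase fold for a gauge: keep the LATEST measured snapshot,
/// never a sum. `None` entries stand for phases with no measurement.
fn latest_measured(phases: &[Option<NumaSnapshot>]) -> Option<NumaSnapshot> {
    phases.iter().rev().flatten().next().copied()
}

/// page_locality = local / total, over the one shared denominator.
fn page_locality(s: NumaSnapshot) -> Option<f64> {
    if s.total == 0 {
        None
    } else {
        Some(s.local as f64 / s.total as f64)
    }
}
```

Summing the two measured phases in the test below would report 140 local pages out of 200, double-counting pages that were resident in both snapshots; the latest-snapshot fold reports 90 out of 100.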
cross_node_migrated: u64
Cross-node migrated pages: the cross_node_migration_ratio numerator
(denominator is numa_pages_total). A SYSTEM-WIDE
/proc/vmstat numa_pages_migrated monotonic-COUNTER delta each worker
observes redundantly, so the within-phase fold is MAX across
workers/sources (summing would inflate it by the worker count — mirrors
CgroupStats’s deliberate max-fold); the CROSS-PHASE fold SUMs the
per-phase deltas over disjoint intervals to the run total.
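
The two folds described above (MAX within a phase, SUM across phases) can be sketched as follows. The function names are hypothetical and the per-worker deltas are passed as plain slices for illustration.

```rust
/// Within one phase every worker observes the same system-wide counter
/// delta, so the phase value is the MAX across workers
/// (None when no worker reported).
fn phase_cross_node_migrated(per_worker: &[u64]) -> Option<u64> {
    per_worker.iter().copied().max()
}

/// Across phases the per-phase deltas cover disjoint intervals, so they SUM.
fn run_cross_node_migrated(phases: &[Vec<u64>]) -> u64 {
    phases
        .iter()
        .filter_map(|p| phase_cross_node_migrated(p))
        .sum()
}
```

For three workers that each observe a delta of 5 in the first phase and two workers that each observe 3 in the second, the run total is 5 + 3 = 8. Summing within the phase would instead give 15 + 6 = 21, inflated by the worker count.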
max_gap_ms: u64
Longest scheduling gap (ms) across the cgroup’s workers in the phase,
coupled with max_gap_cpu. A Peak folded as an ARGMAX of the (ms, cpu)
pair so the worst gap and its CPU survive together — mirrors
CgroupStats’s max_gap_ms / max_gap_cpu coupling (a bare
independent max would desync the gap from its CPU).
max_gap_cpu: usize
CPU that owned the worst scheduling gap: max_gap_ms’s argmax
companion. Folded together with max_gap_ms, never independently.
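
The coupled fold can be sketched as an argmax over (gap, CPU) pairs. `merge_max_gap` and `independent_max` are hypothetical helpers; keeping the left-hand pair on a tie is an assumption, since the tie rule is not stated here.

```rust
/// Fold two (max_gap_ms, max_gap_cpu) pairs, keeping the pair with the
/// larger gap so the gap and its CPU stay together.
/// Ties keep the left-hand pair (assumed tie rule).
fn merge_max_gap(a: (u64, usize), b: (u64, usize)) -> (u64, usize) {
    if b.0 > a.0 {
        b
    } else {
        a
    }
}

/// The fold to avoid, shown for contrast: maxing each field independently
/// can pair the worst gap with a CPU that never owned it.
fn independent_max(a: (u64, usize), b: (u64, usize)) -> (u64, usize) {
    (a.0.max(b.0), a.1.max(b.1))
}
```

Merging a 40 ms gap on CPU 1 with a 12 ms gap on CPU 7 yields (40, 1) under the argmax fold, while the independent fold yields (40, 7), attributing the gap to the wrong CPU.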
stripped: bool
True when this carrier’s raw sample vectors (wake_latencies_ns /
timer_latencies_ns / run_delays_ns / off_cpu_pcts, plus the
schbench histograms) were dropped by
AssertResult::strip_phase_cgroup_samples to fit the size-limited guest
bulk frame — distinct from a carrier that genuinely measured no samples.
The reduced counters survive; only the per-phase distribution render
loses its source, so the render shows “samples stripped” rather
than the not-measured “n/a”. Defaults to false (not stripped) and is set
only on a carrier that actually HAD samples to drop; ORs across merge so
a merged carrier is stripped if either input was.
metrics: BTreeMap<String, f64>
Per-cgroup DERIVED scalar metrics for this (phase, cgroup), keyed by
crate::stats::MetricDef name — the per-cgroup analog of
PhaseBucket::metrics (which is the pooled-across-cgroups set). Populated
post-fold by derive_phase_metrics (both the schbench per-phase family AND
the non-schbench families — wake p99/median/cv, mean/max run-delay,
avg/min/max/spread off-CPU%, and the migration / iterations / locality
ratios) from the SAME reducers that fill the pooled map, so a test can
query “metric M of cgroup C in phase P” as readily as the phase
aggregate (N cgroups -> N queryable sets + the pooled aggregate). DERIVED, not a
raw component: PhaseCgroupStats::merge leaves it empty and it is (re)derived
POST-merge, exactly as the pooled map skips is_derived keys in
merge_matched_phase_buckets. ALWAYS serialized (no skip_serializing_if):
PhaseCgroupStats rides the postcard bulk-TLV port, a NON-self-describing
POSITIONAL format — a conditionally-omitted field desyncs the byte stream and
corrupts the fields after it (here schbench), so the field must always be
present. No serde(default) either: pre-1.0, old sidecar/cache data is
disposable and regenerates (no compat shim). Read via Self::get; the
crate::Claim derive skips a BTreeMap field (matching
PhaseBucket::metrics, which has no Claim accessor either).
Implementations
impl PhaseCgroupStats
pub fn get(&self, metric_name: &str) -> Option<f64>
Look up this cgroup’s per-phase DERIVED value for metric_name — the
per-cgroup analog of PhaseBucket::get (see Self::metrics). None
when this cgroup carried no finite samples for the metric (the ABSENT
discipline), distinct from Some(0.0) (a real reducer zero).
pub fn expect_metric(&self, metric_name: &str) -> f64
Like Self::get but panics citing the metric keys actually present when
the metric is absent — use when the caller knows this cgroup MUST carry the
metric in the phase.
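
The absent-versus-zero discipline of these two accessors can be sketched with a plain map lookup. `CarrierMetrics` is a hypothetical stand-in with only the `metrics` field, the key `wake_p99_us` is an illustrative name rather than a confirmed MetricDef name, and the panic message wording is an assumption.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in holding only the derived-metrics map.
#[derive(Default)]
struct CarrierMetrics {
    metrics: BTreeMap<String, f64>,
}

impl CarrierMetrics {
    /// None means the metric is ABSENT; Some(0.0) is a real reducer zero.
    fn get(&self, metric_name: &str) -> Option<f64> {
        self.metrics.get(metric_name).copied()
    }

    /// Like `get`, but panics and lists the keys actually present.
    fn expect_metric(&self, metric_name: &str) -> f64 {
        self.get(metric_name).unwrap_or_else(|| {
            let present: Vec<&str> = self.metrics.keys().map(String::as_str).collect();
            panic!("metric {:?} absent; present keys: {:?}", metric_name, present)
        })
    }
}
```

Returning `Option<f64>` rather than a 0.0 sentinel keeps a measured zero distinguishable from a metric that was never derived for this cgroup.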
pub fn cgroup_counter(&self, name: &str) -> Option<f64>
The per-cgroup analog of PhaseBucket::cgroup_counter_total for the
three per-cgroup Counters that live ONLY on the carrier (no derived
metrics entry): total_migrations, total_iterations, and
total_cpu_time_ns. Lets
crate::vmm::VmResult::phase_cgroup_metric expose them by metric name
symmetrically with the pooled crate::vmm::VmResult::phase_metric
cgroup_counter_total fallback (the pooled _total suffix marks the
cross-cgroup SUM; this per-cgroup form returns one cgroup’s value). None
for any other name (derived metrics are read via Self::get).
pub fn off_cpu_summary(&self) -> Option<(f64, f64, f64, f64)>
Off-CPU% reduction for the per-phase per-cgroup render:
(avg, min, max, spread) over Self::off_cpu_pcts, or None when
the vec is empty — the NOT-measured state (no worker had positive wall
time). Reduces the SAME per-worker pcts cgroup_stats reduces
(off_cpu_ns / wall_time_ns × 100), so for a phase spanning the whole run
it reproduces that whole-run reduction; spread = max − min.
Some((0.0, ..)) is a MEASURED zero (distinct from the None
not-measured state), preserving the discipline the empty-vec contract on
off_cpu_pcts keeps. Display-only: never written back into a re-pool.
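
A minimal sketch of this reduction over a plain slice follows. The free functions are hypothetical stand-ins for the method, and the saturating subtraction in `off_cpu_pct` is an added guard, not something the documentation above specifies.

```rust
/// One worker's off-CPU% sample: off_cpu_ns / wall_time_ns * 100, with
/// off_cpu_ns = wall_time_ns - cpu_time_ns. None when wall time is zero
/// (the not-measured state). saturating_sub is an assumed guard.
fn off_cpu_pct(wall_time_ns: u64, cpu_time_ns: u64) -> Option<f64> {
    if wall_time_ns == 0 {
        return None;
    }
    let off_cpu_ns = wall_time_ns.saturating_sub(cpu_time_ns);
    Some(off_cpu_ns as f64 / wall_time_ns as f64 * 100.0)
}

/// (avg, min, max, spread) over the per-worker samples; None when empty.
fn off_cpu_summary(pcts: &[f64]) -> Option<(f64, f64, f64, f64)> {
    if pcts.is_empty() {
        return None;
    }
    let sum: f64 = pcts.iter().sum();
    let avg = sum / pcts.len() as f64;
    let min = pcts.iter().copied().fold(f64::INFINITY, f64::min);
    let max = pcts.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    Some((avg, min, max, max - min))
}
```

A single worker that was never off-CPU produces `Some((0.0, 0.0, 0.0, 0.0))`, a measured zero, while an empty slice produces `None`.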
pub fn wake_summary(&self) -> Option<(f64, f64)>
Wake-latency reduction for the per-phase render:
(p99_us, median_us) over the pooled Self::wake_latencies_ns, or
None when the pool is empty. Nearest-rank percentile via percentile
(ns→µs once), reproducing cgroup_stats’s p99/median value-for-value
for the ≤cap pool (and the run-level re-pool’s reduce_sorted_distribution).
Above MAX_WAKE_SAMPLES the pool is a distribution-preserving reservoir
subsample (see Self::wake_latencies_ns), so p99/median is then
distribution-equivalent, NOT byte-identical, to the full-pool reduction —
the rendered tail stays accurate, only exact parity is size-bounded.
None-on-empty omits the wake segment from the render rather than
painting a misleading 0µs (the display analogue of cgroup_stats’s
0.0-sentinel, which has no Option to carry not-measured).
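
The nearest-rank reduction can be sketched as below. The rank convention (rank = ceil(n × p), 1-based) is the textbook nearest-rank definition and is assumed here; the crate's `percentile` helper may differ in detail. `nearest_rank` and the free-function `wake_summary` are hypothetical stand-ins.

```rust
/// Nearest-rank percentile over an ascending-sorted slice, with the
/// percentile given as the fraction num/den. rank = ceil(n * num / den),
/// clamped to 1..=n; integer arithmetic avoids float rounding in the rank.
fn nearest_rank(sorted: &[u64], num: usize, den: usize) -> Option<u64> {
    if sorted.is_empty() || den == 0 {
        return None;
    }
    let rank = ((sorted.len() * num + den - 1) / den).clamp(1, sorted.len());
    Some(sorted[rank - 1])
}

/// (p99_us, median_us) over raw ns samples; None when the pool is empty.
/// The ns -> µs conversion happens once, on the selected sample.
fn wake_summary(wake_latencies_ns: &[u64]) -> Option<(f64, f64)> {
    let mut sorted = wake_latencies_ns.to_vec();
    sorted.sort_unstable();
    let p99_ns = nearest_rank(&sorted, 99, 100)?;
    let median_ns = nearest_rank(&sorted, 50, 100)?;
    Some((p99_ns as f64 / 1000.0, median_ns as f64 / 1000.0))
}
```

Because a nearest-rank percentile always returns an actual sample, the result on a pool at or below the cap depends only on the sorted sample set, which is why value-for-value parity is possible there.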
pub fn timer_summary(&self) -> Option<(f64, f64, f64)>
Timer-latency reduction for the per-phase metric emission:
(median_us, p99_us, p999_us) over the pooled
Self::timer_latencies_ns (nearest-rank, ns→µs), or None when the
pool is empty (the not-measured state — write_carrier_scalars omits all
three timer keys together). Reproduces the per-cgroup
crate::assert::cgroup_stats timer reductions value-for-value below the
reservoir cap (distribution-equivalent above it), exactly like
Self::wake_summary. The run-level WORST (max) is NOT here — it comes
from the cross-phase re-pool (populate_run_distribution_metrics).
pub fn wake_cv(&self) -> Option<f64>
Wake-latency coefficient of variation (stddev / mean) over the pooled
Self::wake_latencies_ns (n = len), or None when the pool is
empty (the not-measured state — so the run_metrics carrier writer
omits all three wake keys together). Some(0.0) when
the mean is zero is a MEASURED zero. Reproduces cgroup_stats’s
wake_latency_cv value-for-value for the ≤cap pool (Σv in u64,
same accumulation); above MAX_WAKE_SAMPLES the pool is a
distribution-preserving reservoir subsample, so the CV is then
distribution-equivalent, not byte-identical.
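
A minimal sketch of the CV reduction follows. It assumes a population variance (divide by n = len), which is how the n = len wording above is read here; the free function `wake_cv` is a hypothetical stand-in for the method.

```rust
/// Coefficient of variation (stddev / mean) over raw ns samples.
/// None when empty; Some(0.0) when the mean is zero (a measured zero).
/// Assumes a population variance (divide by n), per the n = len wording.
fn wake_cv(wake_latencies_ns: &[u64]) -> Option<f64> {
    if wake_latencies_ns.is_empty() {
        return None;
    }
    let n = wake_latencies_ns.len() as f64;
    // The mean's numerator is accumulated in u64, as described above.
    let sum: u64 = wake_latencies_ns.iter().sum();
    let mean = sum as f64 / n;
    if mean == 0.0 {
        return Some(0.0);
    }
    let variance = wake_latencies_ns
        .iter()
        .map(|&v| {
            let d = v as f64 - mean;
            d * d
        })
        .sum::<f64>()
        / n;
    Some(variance.sqrt() / mean)
}
```

For the samples [2, 4, 4, 4, 5, 5, 7, 9] the mean is 5 and the population standard deviation is 2, so the CV is 0.4. The u64 sum in this sketch would overflow on extremely large pools; it is kept only to mirror the accumulation the text describes.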
pub fn run_delay_summary(&self) -> Option<(f64, f64)>
Run-delay reduction for the per-phase render:
(mean_us, worst_us) over the per-worker Self::run_delays_ns (raw
ns), or None when empty. Divides ns→µs ONCE on the summed / maxed ns.
worst reproduces cgroup_stats’s value-for-value (max(ns)/1000 == max(ns/1000), division is monotone). mean reproduces it to f64 ULP,
not bit-exactly: this f64-sums then divides once (Σns/n/1000), while
cgroup_stats divides each worker’s ns by 1000 first then sums
(Σ(ns/1000)/n): the same value reassociated, differing only at
sub-display precision (a divergent-input parity test bounds it at 1e-9).
Each sample is one worker’s whole-phase cumulative sched_info.run_delay
delta, so
mean is the average per-worker total queued-to-run delay and worst
the largest. None-on-empty omits the segment.
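
The two orders of operation compared above can be sketched side by side. Both free functions are hypothetical stand-ins: `run_delay_summary` divides once on the summed ns, and `mean_us_divide_first` mirrors the divide-then-sum order attributed to cgroup_stats.

```rust
/// (mean_us, worst_us) over per-worker run-delay totals in ns; None when
/// empty. Sums in f64, then converts ns -> µs once.
fn run_delay_summary(run_delays_ns: &[u64]) -> Option<(f64, f64)> {
    let worst_ns = *run_delays_ns.iter().max()?;
    let sum_ns: f64 = run_delays_ns.iter().map(|&v| v as f64).sum();
    let mean_us = sum_ns / run_delays_ns.len() as f64 / 1000.0;
    Some((mean_us, worst_ns as f64 / 1000.0))
}

/// The reassociated form: convert each worker's ns to µs first, then average.
fn mean_us_divide_first(run_delays_ns: &[u64]) -> Option<f64> {
    if run_delays_ns.is_empty() {
        return None;
    }
    let sum_us: f64 = run_delays_ns.iter().map(|&v| v as f64 / 1000.0).sum();
    Some(sum_us / run_delays_ns.len() as f64)
}
```

The two means agree up to floating-point rounding, which is why a parity test would compare them within a tolerance rather than bit-for-bit, while the worst value is a single selected sample and matches exactly.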
Trait Implementations
impl Clone for PhaseCgroupStats
fn clone(&self) -> PhaseCgroupStats
fn clone_from(&mut self, source: &Self)
impl Debug for PhaseCgroupStats
impl Default for PhaseCgroupStats
fn default() -> PhaseCgroupStats
impl<'de> Deserialize<'de> for PhaseCgroupStats
fn deserialize<__D>(__deserializer: __D) -> Result<Self, __D::Error>
where
    __D: Deserializer<'de>,
impl PartialEq for PhaseCgroupStats
impl PhaseCgroupStatsClaim for PhaseCgroupStats
fn claim_num_workers<'a>(&'a self, verdict: &'a mut Verdict) -> ClaimBuilder<'a, usize>
fn claim_cpus_used<'a>(&'a self, verdict: &'a mut Verdict) -> SetClaim<'a, usize>
fn claim_wake_latencies_ns<'a>(&'a self, verdict: &'a mut Verdict) -> SeqClaim<'a, u64>
fn claim_wake_sample_total<'a>(&'a self, verdict: &'a mut Verdict) -> ClaimBuilder<'a, u64>
fn claim_timer_latencies_ns<'a>(&'a self, verdict: &'a mut Verdict) -> SeqClaim<'a, u64>
fn claim_timer_sample_total<'a>(&'a self, verdict: &'a mut Verdict) -> ClaimBuilder<'a, u64>
fn claim_run_delays_ns<'a>(&'a self, verdict: &'a mut Verdict) -> SeqClaim<'a, u64>
fn claim_off_cpu_pcts<'a>(&'a self, verdict: &'a mut Verdict) -> SeqClaim<'a, f64>
fn claim_total_migrations<'a>(&'a self, verdict: &'a mut Verdict) -> ClaimBuilder<'a, u64>
fn claim_total_iterations<'a>(&'a self, verdict: &'a mut Verdict) -> ClaimBuilder<'a, u64>
fn claim_total_cpu_time_ns<'a>(&'a self, verdict: &'a mut Verdict) -> ClaimBuilder<'a, u64>
fn claim_numa_pages_local<'a>(&'a self, verdict: &'a mut Verdict) -> ClaimBuilder<'a, u64>
fn claim_numa_pages_total<'a>(&'a self, verdict: &'a mut Verdict) -> ClaimBuilder<'a, u64>
fn claim_cross_node_migrated<'a>(&'a self, verdict: &'a mut Verdict) -> ClaimBuilder<'a, u64>
fn claim_max_gap_ms<'a>(&'a self, verdict: &'a mut Verdict) -> ClaimBuilder<'a, u64>
fn claim_max_gap_cpu<'a>(&'a self, verdict: &'a mut Verdict) -> ClaimBuilder<'a, usize>
fn claim_stripped<'a>(&'a self, verdict: &'a mut Verdict) -> ClaimBuilder<'a, bool>
impl Serialize for PhaseCgroupStats
impl StructuralPartialEq for PhaseCgroupStats
Auto Trait Implementations
impl Freeze for PhaseCgroupStats
impl RefUnwindSafe for PhaseCgroupStats
impl Send for PhaseCgroupStats
impl Sync for PhaseCgroupStats
impl Unpin for PhaseCgroupStats
impl UnwindSafe for PhaseCgroupStats
Blanket Implementations
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> CloneToUninit for T
where
    T: Clone,
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true.
Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true.
Converts self into a Right variant of Either<Self, Self> otherwise.