Struct PhaseCgroupStats

pub struct PhaseCgroupStats {
    pub num_workers: usize,
    pub cpus_used: BTreeSet<usize>,
    pub wake_latencies_ns: Vec<u64>,
    pub wake_sample_total: u64,
    pub timer_latencies_ns: Vec<u64>,
    pub timer_sample_total: u64,
    pub run_delays_ns: Vec<u64>,
    pub off_cpu_pcts: Vec<f64>,
    pub total_migrations: u64,
    pub total_iterations: u64,
    pub total_cpu_time_ns: u64,
    pub numa_pages_local: u64,
    pub numa_pages_total: u64,
    pub cross_node_migrated: u64,
    pub max_gap_ms: u64,
    pub max_gap_cpu: usize,
    pub stripped: bool,
    pub metrics: BTreeMap<String, f64>,
    /* private fields */
}

Per-phase, per-cgroup raw telemetry components: the per-phase analogue of CgroupStats. Holds RAW components (sample vectors + counters), NOT the reduced ratios/percentiles CgroupStats computes, so whole-run and cross-run aggregates RE-POOL from the components at every level. This is the per-phase telemetry thesis: an aggregate is recomputed over the pooled components, never averaged from ready-made per-phase reductions, because a percentile or weighted ratio cannot be recovered from per-phase scalars.

Covers every TYPED CgroupStats reduction: avg/min/max off-CPU% and spread from off_cpu_pcts; p99/median/CV wake latency from wake_latencies_ns; mean/worst run-delay from run_delays_ns; migration_ratio, iterations_per_cpu_sec, iterations_per_worker, page_locality, and cross_node_migration_ratio from their counter components; the COUPLED worst gap (ms + the CPU that owned it) from max_gap_ms / max_gap_cpu; cpus_used / num_cpus from cpus_used. EXCLUDES CgroupStats::ext_metrics, the generic extensible map: a per-phase per-cgroup custom metric is a future extension, not part of the typed carrier.

Lives in PhaseBucket::per_cgroup, keyed by cgroup name. The structural carrier is empty until a capture path populates it per phase.
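The re-pooling thesis can be illustrated with a small sketch. The `percentile` helper below is hypothetical (a nearest-rank percentile, which may differ from the reducer the crate actually uses); it only shows that averaging per-phase medians and taking the median of the pooled samples can disagree.

```rust
/// Hypothetical nearest-rank percentile over an unsorted sample set.
/// The crate's real reducer may use a different percentile definition.
fn percentile(samples: &[u64], pct: f64) -> Option<u64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // Nearest rank: the smallest value with at least pct% of samples at or below it.
    let rank = ((pct / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.clamp(1, sorted.len()) - 1])
}

fn main() {
    // Two phases with very different sample counts.
    let phase_a: Vec<u64> = vec![10, 10, 10, 10];
    let phase_b: Vec<u64> = vec![100];

    // Averaging ready-made per-phase medians gives each phase an equal say.
    let avg_of_medians =
        (percentile(&phase_a, 50.0).unwrap() + percentile(&phase_b, 50.0).unwrap()) / 2;

    // Re-pooling the raw components weights every sample equally.
    let pooled: Vec<u64> = phase_a.iter().chain(phase_b.iter()).copied().collect();
    let pooled_median = percentile(&pooled, 50.0).unwrap();

    println!("average of medians = {avg_of_medians}, pooled median = {pooled_median}");
}
```

The two numbers differ because the single-sample phase dominates the average of medians but contributes only one of five samples to the pool.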

Fields§

§num_workers: usize

Worker count in this cgroup for the phase: the denominator for the re-pooled per-worker iteration rate (iterations_per_worker = total_iterations / num_workers). This is a set CARDINALITY (reports.len()), not a kernel counter, but it SUMs in merge because a single cgroup name can emit MULTIPLE carriers in one step: collect_handles builds one carrier per WorkloadHandle, and a CgroupDef with several WorkSpec entries (.work(..).work(..)) spawns one handle per WorkSpec under the same name (apply_setup). Those carriers cover DISJOINT worker subsets, so the cardinality of their union is the SUM (4 + 2 → 6), matching cgroup_stats over the pooled reports (reports.len()). A MAX would understate the count and inflate iterations_per_worker. The disjointness is the real justification: if carriers ever overlapped, the SUM would over-count.
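A minimal sketch of the SUM-versus-MAX argument, using a hypothetical two-field `Carrier` rather than the real struct. Returning `None` for zero workers is an assumption made for the sketch, not a documented behavior.

```rust
/// Hypothetical stand-in for the two fields involved in iterations_per_worker.
#[derive(Debug, Default, Clone, PartialEq)]
struct Carrier {
    num_workers: usize,
    total_iterations: u64,
}

impl Carrier {
    /// Same-name carriers cover disjoint worker subsets, so both fields sum.
    fn merge(&mut self, other: &Carrier) {
        self.num_workers += other.num_workers;
        self.total_iterations += other.total_iterations;
    }

    /// Re-pooled per-worker rate; None when no worker was recorded (assumption).
    fn iterations_per_worker(&self) -> Option<f64> {
        if self.num_workers == 0 {
            None
        } else {
            Some(self.total_iterations as f64 / self.num_workers as f64)
        }
    }
}

fn main() {
    // One cgroup name, two WorkSpec entries: 4 workers and 2 workers.
    let mut merged = Carrier { num_workers: 4, total_iterations: 400 };
    let second = Carrier { num_workers: 2, total_iterations: 200 };
    merged.merge(&second);

    // SUM gives 6 workers and 600 / 6 iterations per worker.
    println!("sum-fold rate: {:?}", merged.iterations_per_worker());

    // A MAX fold would keep 4 workers and inflate the rate to 600 / 4.
    let inflated = merged.total_iterations as f64 / 4.0;
    println!("max-fold rate: {inflated}");
}
```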

§cpus_used: BTreeSet<usize>

Distinct CPUs the cgroup’s workers ran on in the phase (union of each worker’s cpus_used). Re-pools CgroupStats::cpus_used / num_cpus (= the set / its length) via a set UNION.
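The union fold can be sketched with the standard library alone. The function name `merge_cpus_used` is hypothetical; only the `BTreeSet<usize>` field type comes from the struct definition.

```rust
use std::collections::BTreeSet;

/// Hypothetical merge helper: the re-pooled set is the union of both inputs.
fn merge_cpus_used(into: &mut BTreeSet<usize>, other: &BTreeSet<usize>) {
    into.extend(other.iter().copied());
}

fn main() {
    let mut a = BTreeSet::from([0usize, 1, 2]);
    let b = BTreeSet::from([2usize, 3]);
    merge_cpus_used(&mut a, &b);

    // num_cpus is the cardinality of the union, so CPU 2 is counted once.
    let num_cpus = a.len();
    println!("cpus_used = {a:?}, num_cpus = {num_cpus}");
}
```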

§wake_latencies_ns: Vec<u64>

Pooled per-wakeup latency samples (ns) across the cgroup’s workers in the phase, un-reduced so p99 / median / CV re-pool over the combined set. The POOL is reservoir-capped at MAX_WAKE_SAMPLES, the per-worker bound, which is re-applied when same-name carriers merge so that the carrier payload stays bounded on the size-limited guest bulk port; without the cap the pool would be workers × MAX_WAKE_SAMPLES. wake_sample_total carries the true pre-cap population.

The CARRIER-level reductions divide by wake_latencies_ns.len() (the capped pool size), NOT by wake_sample_total: Self::wake_summary takes p99 / median over len, and cgroup_stats computes cv = stddev/mean with n = all_latencies.len(). The RUN-level cross-phase re-pool (populate_run_distribution_metrics) instead population-WEIGHTS its samples (see the PARITY CONTRACT below): its CV / mean divide by Σ per-sample weights (the reconstructed true population), which equals len only while the pool is within the cap.

PARITY CONTRACT (wake latency is the one component whose parity is size-dependent): for pools ≤ MAX_WAKE_SAMPLES the reservoir IS the full concatenation, so the p99 / median / CV re-pool reproduces cgroup_stats VALUE-FOR-VALUE. Above the cap the carrier holds a distribution-preserving reservoir SUBSAMPLE while cgroup_stats reduces over the full per-worker concat, so the re-pool is DISTRIBUTION-EQUIVALENT, not byte-identical: the bounded bulk-port frame forbids carrying the full pool, and staged reservoirs cannot be byte-identical to a single full-pool reduction.

This is BY DESIGN. cgroup_stats stays the uncapped run-level authority, because capping it to match the carrier would discard most of a multi-worker cgroup’s samples to chase a sub-display-precision artifact. Both layers de-skew the cap. The carrier MERGE above the cap is WEIGHTED by wake_sample_total (Self::weighted_merge_reservoirs), so the subsample is an UNBIASED sample of the combined population with no smaller-population skew. The cross-PHASE run-level pool in populate_run_distribution_metrics weights each phase carrier’s samples by wake_sample_total / wake_latencies_ns.len(), so a phase that exceeded the cap contributes by true population rather than capped length, and reduces with the weighted percentile / moments; the prior length-weighted concat is gone.

Below the cap every weight is 1.0, so the weighted P99 / median / mean / worst are BYTE-identical to the unweighted concat. The weighted CV matches only within ~1e-9: it sums in f64 where the unweighted path sums the mean in u64, and a weighted variance cannot keep the u64 sum.
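The population weighting described above can be sketched as follows. `PhasePool` and `weighted_percentile` are hypothetical names, and the cumulative-weight percentile rule is an assumption; the only detail taken from the text is the per-sample weight wake_sample_total / wake_latencies_ns.len().

```rust
/// Hypothetical view of one phase carrier: the capped pool plus its true population.
struct PhasePool {
    samples: Vec<u64>,
    sample_total: u64,
}

/// Weighted percentile: each sample stands in for sample_total / len real samples.
fn weighted_percentile(pools: &[PhasePool], pct: f64) -> Option<u64> {
    let mut weighted: Vec<(u64, f64)> = Vec::new();
    for pool in pools {
        if pool.samples.is_empty() {
            continue;
        }
        let weight = pool.sample_total as f64 / pool.samples.len() as f64;
        weighted.extend(pool.samples.iter().map(|&s| (s, weight)));
    }
    if weighted.is_empty() {
        return None;
    }
    weighted.sort_by_key(|&(sample, _)| sample);
    let total_weight: f64 = weighted.iter().map(|&(_, w)| w).sum();
    let target = pct / 100.0 * total_weight;
    // Walk the sorted samples until the cumulative weight reaches the target.
    let mut cumulative = 0.0;
    for &(sample, weight) in &weighted {
        cumulative += weight;
        if cumulative >= target {
            return Some(sample);
        }
    }
    weighted.last().map(|&(sample, _)| sample)
}

fn main() {
    // Phase A stayed under the cap; phase B kept 2 of its 200 samples.
    let pools = [
        PhasePool { samples: vec![10, 20], sample_total: 2 },
        PhasePool { samples: vec![100, 200], sample_total: 200 },
    ];
    println!("weighted median = {:?}", weighted_percentile(&pools, 50.0));
}
```

With every weight equal to 1.0 the walk reduces to an ordinary nearest-rank percentile over the concatenation, which is the below-the-cap case the contract describes.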

§wake_sample_total: u64

True wakeup count before reservoir clamping (wake_latencies_ns is capped), so the re-pool can report the real population size. An intentional ADDITION over CgroupStats (which has no such field), NOT a mirrored reduction — do not strip it in a strict-parity audit; it is the only source of the true wakeup population once wake_latencies_ns is reservoir-clamped, and it is for REPORTING, not the CV denominator.

§timer_latencies_ns: Vec<u64>

Pooled per-timer-cycle latency samples (ns) across the cgroup’s crate::workload::WorkType::TimerLatency workers in the phase, un-reduced so median / p99 / p999 / worst re-pool over the combined set. Reservoir-capped at MAX_WAKE_SAMPLES on same-name-carrier merge exactly like wake_latencies_ns (population-weighted >cap); timer_sample_total carries the true pre-cap population. Distinct carrier from wake_latencies_ns.

§timer_sample_total: u64

True timer-cycle count before reservoir clamping (timer_latencies_ns is capped), so the re-pool reports the real population. Mirrors wake_sample_total for the timer carrier.

§run_delays_ns: Vec<u64>

Pooled per-worker schedstat run-delay samples (RAW ns) for the phase, un-reduced so mean / worst run-delay re-pool over the combined set; the re-pool converts ns → µs to match CgroupStats’s run-delay-µs fields. Stored as raw kernel ns (like wake_latencies_ns), not pre-converted, per the raw-component thesis. GRANULARITY: unlike wake_latencies_ns (one per WAKEUP), each entry here is ONE per-worker value — that worker’s sched_info.run_delay delta over the carrier’s window: the whole-run schedstat_run_delay_ns (end−start) for the step-local carrier, or the per-phase delta for the backdrop slice carrier. So the pool size is the worker count, the mean is the average per-worker total queued-to-run delay, and worst_run_delay_us selects the single worker with the largest total queued-to-run delay (NOT the worst single dispatch).
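A sketch of the granularity point, assuming a hypothetical `run_delay_summary_us` reducer that returns (mean, worst) in µs. The real Self::run_delay_summary has the same return shape, but its tuple order and empty-pool behavior are not stated here.

```rust
/// Hypothetical reducer: mean and worst per-worker run delay, converted ns -> µs.
/// Each input entry is one worker's total queued-to-run delay over the window.
fn run_delay_summary_us(run_delays_ns: &[u64]) -> Option<(f64, f64)> {
    // An empty pool means nothing was measured.
    let worst_ns = *run_delays_ns.iter().max()?;
    // Sum in u128 so a large pool of large values cannot overflow.
    let sum_ns: u128 = run_delays_ns.iter().map(|&ns| ns as u128).sum();
    let mean_us = sum_ns as f64 / run_delays_ns.len() as f64 / 1_000.0;
    Some((mean_us, worst_ns as f64 / 1_000.0))
}

fn main() {
    // Three workers, so the pool size is 3 regardless of how often each was dispatched.
    let run_delays_ns = [2_000u64, 4_000, 12_000];
    if let Some((mean_us, worst_us)) = run_delay_summary_us(&run_delays_ns) {
        // worst_us identifies the worker with the largest total delay,
        // not the single slowest dispatch.
        println!("mean = {mean_us} µs, worst = {worst_us} µs");
    }
}
```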

§off_cpu_pcts: Vec<f64>

Per-worker off-CPU% samples for the phase, un-reduced. Carried for the per-phase per-cgroup off-CPU% RENDER: the avg / min / max / spread of the combined set. NOT consumed by the run-level distributional re-pool, because off-CPU% has no run-level Distribution metric: off-CPU% / spread is intrinsically per-cgroup, so the run-level worst_spread stays the cross-cgroup max of per-cgroup CgroupStats::spread via the typed AssertResult::merge fold, not a pooled distribution.

An EMPTY vec is the not-measured state (no worker with positive wall time), which preserves the not-measured vs measured-zero distinction CgroupStats keeps. Samples are stored raw, not as pre-reduced extremes, because the mean is unrecoverable from min/max alone for >2 workers.

Each sample is off_cpu_ns / wall_time_ns * 100, where off_cpu_ns = wall_time_ns - cpu_time_ns and cpu_time_ns is the CLOCK_THREAD_CPUTIME_ID thread on-CPU time (workload/worker off_cpu_ns at report build). total_cpu_time_ns is a DISTINCT on-CPU measurement (schedstat_cpu_time_ns, the /proc schedstat se.sum_exec_runtime). Both ultimately track on-CPU runtime but are sampled at different points (the CLOCK_THREAD_CPUTIME_ID read folds in the in-flight delta; the schedstat field reads the stored value), so the two need not be byte-identical and must not be cross-wired in a re-pool.
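The sample formula and the reason for keeping raw samples can be sketched as below. Both function names are hypothetical, the (avg, min, max, spread) tuple order is an assumption, and spread is taken to be max - min for illustration only.

```rust
/// Hypothetical per-worker sample: off_cpu_ns / wall_time_ns * 100.
/// Returns None when the worker has no positive wall time (not measured).
fn off_cpu_pct(wall_time_ns: u64, cpu_time_ns: u64) -> Option<f64> {
    if wall_time_ns == 0 {
        return None;
    }
    let off_cpu_ns = wall_time_ns.saturating_sub(cpu_time_ns);
    Some(off_cpu_ns as f64 / wall_time_ns as f64 * 100.0)
}

/// Hypothetical reducer over the pooled samples: (avg, min, max, spread).
/// An empty pool stays None so "not measured" is distinct from "measured zero".
fn summarize_off_cpu(pcts: &[f64]) -> Option<(f64, f64, f64, f64)> {
    if pcts.is_empty() {
        return None;
    }
    let min = pcts.iter().copied().fold(f64::INFINITY, f64::min);
    let max = pcts.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    let avg = pcts.iter().sum::<f64>() / pcts.len() as f64;
    Some((avg, min, max, max - min))
}

fn main() {
    // (wall_time_ns, cpu_time_ns) per worker; the last worker never ran.
    let workers = [(1_000u64, 800u64), (1_000, 200), (1_000, 800), (0, 0)];
    let pcts: Vec<f64> = workers
        .iter()
        .filter_map(|&(wall, cpu)| off_cpu_pct(wall, cpu))
        .collect();

    // The average is not the midpoint of min and max, so the extremes
    // alone could not have reproduced it.
    println!("{:?}", summarize_off_cpu(&pcts));
}
```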

§total_migrations: u64

Sum of per-worker CPU-migration counts in the phase (Counter).

§total_iterations: u64

Sum of per-worker iteration counts in the phase (Counter).

§total_cpu_time_ns: u64

Sum of per-worker on-CPU time (ns) in the phase — the overcommit-invariant rate denominator (Counter). Sourced from schedstat_cpu_time_ns (the /proc schedstat se.sum_exec_runtime, rq-charged on-CPU ns) — a DISTINCT on-CPU-time sample from the CLOCK_THREAD_CPUTIME_ID time behind off_cpu_pcts (different sample point; not byte-identical), so do not cross-wire the two in a re-pool.

§numa_pages_local: u64

Pages on the expected NUMA node(s) — page-locality numerator. A per-task /proc/self/numa_maps residency GAUGE (current snapshot of the task’s mm, recomputed each read — the kernel zeroes and re-walks the page tables), SPATIALLY summed across the cgroup’s workers within a phase: disjoint-mm under the CloneMode::Fork default (the true cgroup total), but CloneMode::Thread siblings share one mm and the SUM over-counts shared pages once per thread (caveat inherited from WorkerReport::numa_pages). The CROSS-PHASE fold takes the LATEST measured snapshot (see numa_agg_per_cgroup), never a sum (summing residency across phases over-counts by the phase count).

§numa_pages_total: u64

Total allocated pages — the SHARED denominator for BOTH page_locality (numa_pages_local / this) AND cross_node_migration_ratio (cross_node_migrated / this). A per-task /proc/self/numa_maps residency GAUGE (same class/folds as numa_pages_local: within-phase SUM across workers — disjoint-mm under the CloneMode::Fork default, Thread-mode over-count caveat inherited from WorkerReport::numa_pages — cross-phase LATEST snapshot); the kernel computes both ratios over the identical page total, so one field serves both — a separate cross_node_total would invite a silent desync.
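A sketch of the shared-denominator design, using a hypothetical `NumaCounters` struct that mirrors the three field names. Returning `None` for a zero page total is an assumption made so the sketch never divides by zero.

```rust
/// Hypothetical stand-in for the three NUMA fields of the carrier.
struct NumaCounters {
    numa_pages_local: u64,
    numa_pages_total: u64,
    cross_node_migrated: u64,
}

impl NumaCounters {
    /// Both ratios go through the same denominator, so they cannot desync.
    fn over_total(&self, numerator: u64) -> Option<f64> {
        if self.numa_pages_total == 0 {
            None
        } else {
            Some(numerator as f64 / self.numa_pages_total as f64)
        }
    }

    fn page_locality(&self) -> Option<f64> {
        self.over_total(self.numa_pages_local)
    }

    fn cross_node_migration_ratio(&self) -> Option<f64> {
        self.over_total(self.cross_node_migrated)
    }
}

fn main() {
    let numa = NumaCounters {
        numa_pages_local: 750,
        numa_pages_total: 1_000,
        cross_node_migrated: 250,
    };
    println!("page_locality = {:?}", numa.page_locality());
    println!("cross_node_migration_ratio = {:?}", numa.cross_node_migration_ratio());
}
```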

§cross_node_migrated: u64

Cross-node migrated pages — cross_node_migration_ratio numerator (denominator is numa_pages_total). A SYSTEM-WIDE /proc/vmstat numa_pages_migrated monotonic-COUNTER delta each worker observes redundantly, so the within-phase fold is MAX across workers/sources (summing would inflate it by the worker count — mirrors CgroupStats’s deliberate max-fold); the CROSS-PHASE fold SUMs the per-phase deltas over disjoint intervals to the run total.
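The two folds differ because of what each axis measures, which a short sketch makes concrete. Both helper names are hypothetical, and treating an empty phase as 0 is an assumption.

```rust
/// Within a phase every worker observes the same system-wide delta,
/// so the fold is MAX (a sum would multiply it by the worker count).
fn fold_within_phase(per_worker_deltas: &[u64]) -> u64 {
    per_worker_deltas.iter().copied().max().unwrap_or(0)
}

/// Across phases the deltas cover disjoint intervals, so they SUM to the run total.
fn fold_across_phases(per_phase_deltas: &[u64]) -> u64 {
    per_phase_deltas.iter().sum()
}

fn main() {
    // Phase 1: three workers each see the same 30 migrated pages.
    let phase_1 = fold_within_phase(&[30, 30, 30]);
    // Phase 2: two workers each see 12 more.
    let phase_2 = fold_within_phase(&[12, 12]);

    let run_total = fold_across_phases(&[phase_1, phase_2]);
    println!("phase 1 = {phase_1}, phase 2 = {phase_2}, run total = {run_total}");
}
```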

§max_gap_ms: u64

Longest scheduling gap (ms) across the cgroup’s workers in the phase, coupled with max_gap_cpu. A Peak folded as an ARGMAX of the (ms, cpu) pair so the worst gap and its CPU survive together — mirrors CgroupStats’s max_gap_ms / max_gap_cpu coupling (a bare independent max would desync the gap from its CPU).

§max_gap_cpu: usize

CPU that owned the worst scheduling gap — max_gap_ms’s argmax companion. Folded together with max_gap_ms, never independently.
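The coupled fold can be sketched with a hypothetical `Gap` pair. Keeping the existing pair on a tie is an assumption; the text only requires that the two fields move together.

```rust
/// Hypothetical stand-in for the coupled (max_gap_ms, max_gap_cpu) pair.
#[derive(Debug, Clone, Copy, Default, PartialEq)]
struct Gap {
    max_gap_ms: u64,
    max_gap_cpu: usize,
}

impl Gap {
    /// ARGMAX fold: replace the whole pair, never one field on its own.
    fn merge(&mut self, other: Gap) {
        if other.max_gap_ms > self.max_gap_ms {
            *self = other;
        }
    }
}

fn main() {
    let mut worst = Gap { max_gap_ms: 25, max_gap_cpu: 7 };
    worst.merge(Gap { max_gap_ms: 40, max_gap_cpu: 3 });

    // Independent maxima would report (40 ms, CPU 7), a pair no worker observed.
    let desynced = Gap { max_gap_ms: 40.max(25), max_gap_cpu: 7.max(3) };
    println!("argmax = {worst:?}, independent max = {desynced:?}");
}
```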

§stripped: bool

True when this carrier’s raw sample vectors (wake_latencies_ns / timer_latencies_ns / run_delays_ns / off_cpu_pcts, plus the schbench histograms) were dropped by AssertResult::strip_phase_cgroup_samples to fit the size-limited guest bulk frame — distinct from a carrier that genuinely measured no samples. The reduced counters survive; only the per-phase distribution render loses its source, so the render shows “samples stripped” rather than the not-measured “n/a”. Defaults to false (not stripped) and is set only on a carrier that actually HAD samples to drop; ORs across merge so a merged carrier is stripped if either input was.
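A sketch of the three render states, using a hypothetical `Carrier` with a single sample vector. The method names and render strings other than “samples stripped” and “n/a” are illustrative, not the crate's API.

```rust
/// Hypothetical carrier with one sample vector and one surviving counter.
#[derive(Debug, Default)]
struct Carrier {
    wake_latencies_ns: Vec<u64>,
    total_iterations: u64,
    stripped: bool,
}

impl Carrier {
    /// Drop the samples; mark stripped only if there was something to drop.
    fn strip_samples(&mut self) {
        if !self.wake_latencies_ns.is_empty() {
            self.wake_latencies_ns.clear();
            self.stripped = true;
        }
    }

    /// The flag ORs across merge: stripped if either input was.
    fn merge(&mut self, other: Carrier) {
        self.wake_latencies_ns.extend(other.wake_latencies_ns);
        self.total_iterations += other.total_iterations;
        self.stripped |= other.stripped;
    }

    fn render_wake(&self) -> &'static str {
        if self.stripped {
            "samples stripped"
        } else if self.wake_latencies_ns.is_empty() {
            "n/a"
        } else {
            "measured"
        }
    }
}

fn main() {
    let mut had_samples = Carrier { wake_latencies_ns: vec![5, 7], total_iterations: 10, stripped: false };
    had_samples.strip_samples();

    let mut never_measured = Carrier::default();
    never_measured.strip_samples();

    println!("{} / {}", had_samples.render_wake(), never_measured.render_wake());

    never_measured.merge(had_samples);
    println!("after merge: {}", never_measured.render_wake());
}
```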

§metrics: BTreeMap<String, f64>

Per-cgroup DERIVED scalar metrics for this (phase, cgroup), keyed by crate::stats::MetricDef name: the per-cgroup analog of PhaseBucket::metrics, which is the pooled-across-cgroups set. Populated post-fold by derive_phase_metrics, covering both the schbench per-phase family AND the non-schbench families (wake p99/median/cv, mean/max run-delay, avg/min/max/spread off-CPU%, and the migration / iterations / locality ratios), from the SAME reducers that fill the pooled map. A test can therefore query “metric M of cgroup C in phase P” as readily as the phase aggregate (N cgroups -> N queryable sets + the pooled aggregate).

DERIVED, not a raw component: PhaseCgroupStats::merge leaves it empty and it is (re)derived POST-merge, exactly as the pooled map skips is_derived keys in merge_matched_phase_buckets.

ALWAYS serialized (no skip_serializing_if): PhaseCgroupStats rides the postcard bulk-TLV port, a NON-self-describing POSITIONAL format, where a conditionally-omitted field desyncs the byte stream and corrupts the fields after it (here schbench), so the field must always be present. No serde(default) either: pre-1.0, old sidecar/cache data is disposable and regenerates, so there is no compat shim.

Read via Self::get; the crate::Claim derive skips a BTreeMap field (matching PhaseBucket::metrics, which has no Claim accessor either).

Implementations§

impl PhaseCgroupStats

pub fn get(&self, metric_name: &str) -> Option<f64>

pub fn expect_metric(&self, metric_name: &str) -> f64

pub fn cgroup_counter(&self, name: &str) -> Option<f64>

pub fn off_cpu_summary(&self) -> Option<(f64, f64, f64, f64)>

pub fn wake_summary(&self) -> Option<(f64, f64)>

pub fn timer_summary(&self) -> Option<(f64, f64, f64)>

pub fn wake_cv(&self) -> Option<f64>

pub fn run_delay_summary(&self) -> Option<(f64, f64)>

Trait Implementations§

impl Clone for PhaseCgroupStats

fn clone(&self) -> PhaseCgroupStats

fn clone_from(&mut self, source: &Self)

impl Debug for PhaseCgroupStats

fn fmt(&self, f: &mut Formatter<'_>) -> Result

impl Default for PhaseCgroupStats

fn default() -> PhaseCgroupStats

impl<'de> Deserialize<'de> for PhaseCgroupStats

fn deserialize<__D>(__deserializer: __D) -> Result<Self, __D::Error> where __D: Deserializer<'de>,

impl PartialEq for PhaseCgroupStats

fn eq(&self, other: &PhaseCgroupStats) -> bool

fn ne(&self, other: &Rhs) -> bool

impl PhaseCgroupStatsClaim for PhaseCgroupStats

fn claim_num_workers<'a>( &'a self, verdict: &'a mut Verdict, ) -> ClaimBuilder<'a, usize>

fn claim_cpus_used<'a>( &'a self, verdict: &'a mut Verdict, ) -> SetClaim<'a, usize>

fn claim_wake_latencies_ns<'a>( &'a self, verdict: &'a mut Verdict, ) -> SeqClaim<'a, u64>

fn claim_wake_sample_total<'a>( &'a self, verdict: &'a mut Verdict, ) -> ClaimBuilder<'a, u64>

fn claim_timer_latencies_ns<'a>( &'a self, verdict: &'a mut Verdict, ) -> SeqClaim<'a, u64>

fn claim_timer_sample_total<'a>( &'a self, verdict: &'a mut Verdict, ) -> ClaimBuilder<'a, u64>

fn claim_run_delays_ns<'a>( &'a self, verdict: &'a mut Verdict, ) -> SeqClaim<'a, u64>

fn claim_off_cpu_pcts<'a>( &'a self, verdict: &'a mut Verdict, ) -> SeqClaim<'a, f64>

fn claim_total_migrations<'a>( &'a self, verdict: &'a mut Verdict, ) -> ClaimBuilder<'a, u64>

fn claim_total_iterations<'a>( &'a self, verdict: &'a mut Verdict, ) -> ClaimBuilder<'a, u64>

fn claim_total_cpu_time_ns<'a>( &'a self, verdict: &'a mut Verdict, ) -> ClaimBuilder<'a, u64>

fn claim_numa_pages_local<'a>( &'a self, verdict: &'a mut Verdict, ) -> ClaimBuilder<'a, u64>

fn claim_numa_pages_total<'a>( &'a self, verdict: &'a mut Verdict, ) -> ClaimBuilder<'a, u64>

fn claim_cross_node_migrated<'a>( &'a self, verdict: &'a mut Verdict, ) -> ClaimBuilder<'a, u64>

fn claim_max_gap_ms<'a>( &'a self, verdict: &'a mut Verdict, ) -> ClaimBuilder<'a, u64>

fn claim_max_gap_cpu<'a>( &'a self, verdict: &'a mut Verdict, ) -> ClaimBuilder<'a, usize>

fn claim_stripped<'a>( &'a self, verdict: &'a mut Verdict, ) -> ClaimBuilder<'a, bool>

impl Serialize for PhaseCgroupStats

fn serialize<__S>(&self, __serializer: __S) -> Result<__S::Ok, __S::Error> where __S: Serializer,

impl StructuralPartialEq for PhaseCgroupStats

Auto Trait Implementations§

impl Freeze for PhaseCgroupStats

impl RefUnwindSafe for PhaseCgroupStats

impl Send for PhaseCgroupStats

impl Sync for PhaseCgroupStats

impl Unpin for PhaseCgroupStats

impl UnwindSafe for PhaseCgroupStats

Blanket Implementations§

impl<T> Any for T where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

impl<T> Borrow<T> for T where T: ?Sized,

fn borrow(&self) -> &T

impl<T> BorrowMut<T> for T where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

impl<T> CloneToUninit for T where T: Clone,

unsafe fn clone_to_uninit(&self, dest: *mut u8)

impl<T> From<T> for T

fn from(t: T) -> T

impl<T> Instrument for T

fn instrument(self, span: Span) -> Instrumented<Self>

fn in_current_span(self) -> Instrumented<Self>

impl<T, U> Into<U> for T where U: From<T>,

fn into(self) -> U

impl<T> IntoEither for T

fn into_either(self, into_left: bool) -> Either<Self, Self>

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self> where F: FnOnce(&Self) -> bool,

impl<T> Pointable for T

const ALIGN: usize

type Init = T

unsafe fn init(init: <T as Pointable>::Init) -> usize

unsafe fn deref<'a>(ptr: usize) -> &'a T

unsafe fn deref_mut<'a>(ptr: usize) -> &'a mut T

unsafe fn drop(ptr: usize)

impl<T> PolicyExt for T where T: ?Sized,

fn and<P, B, E>(self, other: P) -> And<T, P>
where T: Policy<B, E>, P: Policy<B, E>,

fn or<P, B, E>(self, other: P) -> Or<T, P>
where T: Policy<B, E>, P: Policy<B, E>,

impl<T> ToOwned for T
where T: Clone,

impl<T, U> TryFrom<U> for T
where U: Into<T>,

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self>
where S: Into<Dispatch>,

impl<T> DeserializeOwned for T
where T: for<'de> Deserialize<'de>,

impl<T> MaybeSend for T
where T: Send,