ktstr/ctprof/mod.rs
1//! Per-thread ctprof (cgroup/thread profiler) data model + capture layer.
2//!
3//! [`CtprofSnapshot`] is the serialized container for a single
4//! host-wide per-thread profile. Capture produces one via the
5//! `ktstr ctprof capture -o snapshot.ctprof.zst` subcommand;
6//! comparison reads two and joins them on the selected grouping
7//! axis (pcomm, cgroup, or comm).
8//!
9//! Field families and probe-timing invariance:
10//!
11//! - **Cumulative counters and totals** (the majority): wakeups,
12//! migrations, csw, run/wait/sleep/block/iowait time, schedstat
13//! counts, page-fault counters, syscall counters, byte counters,
14//! the taskstats per-bucket `*_count` and `*_delay_total_ns`,
15//! the jemalloc per-thread allocated/deallocated TSD counters,
16//! etc. Sampled twice at different instants the value increases
17//! monotonically; probe-attach latency does not alter the
18//! reading.
19//! - **Lifetime high-water peaks**: schedstat `*_max` family
20//! (`wait_max`, `sleep_max`, `block_max`, `exec_max`,
21//! `slice_max`), every taskstats `*_delay_max_ns` /
22//! `*_delay_min_ns`, and the memory watermarks
23//! (`hiwater_rss_bytes`, `hiwater_vm_bytes`). These are
24//! non-decreasing-over-time but per-event extrema rather than
25//! sums, so they are non-summable across threads (the registry
26//! reduces them via `MaxPeak` / `MaxPeakBytes`). Same
27//! probe-timing invariance as the cumulative counters.
28//! - **Instantaneous gauges** (sensitive to probe timing):
29//! [`ThreadState::nr_threads`] (signal_struct->nr_threads
30//! snapshot), [`ThreadState::fair_slice_ns`] (instantaneous
31//! `p->se.slice`), and [`ThreadState::state`]
32//! (task_state_array letter). Sampled at capture time and can
33//! genuinely differ between two probes of the same thread.
34//! The registry pairs them with `MaxGaugeCount` /
35//! `MaxGaugeNs` / `ModeChar` reductions rather than the
36//! `Sum*` rules used for cumulative counters.
37//! - **Categorical / ordinal scalars** (point-in-time
38//! snapshots): `policy`, `nice`, `priority`, `processor`,
39//! `rt_priority`, plus the identity strings (`pcomm`, `comm`,
40//! `cgroup`) and the [`crate::metric_types::CpuSet`]
41//! `cpu_affinity`. These are sampled at capture time and can
42//! change at runtime (e.g. `sched_setaffinity` mid-run flips
43//! `processor` and `cpu_affinity`), so they share the
44//! gauge family's probe-timing sensitivity. The registry
45//! reduces them via `Mode*` / `Range*` / `Affinity` rather
46//! than `Sum*`.
47//!
48//! The jemalloc per-thread TSD counters
49//! (`tsd_s.thread_allocated` / `thread_deallocated`) jemalloc
50//! maintains unconditionally on its alloc/dalloc fast and slow
51//! paths, so the ptrace-based attach this layer performs does
52//! not perturb them; counters previously accumulated remain
53//! valid across the brief stop the attach induces. Metrics not
54//! derivable from cumulative state (e.g. perf_event_open
55//! counters that reset on attachment) are intentionally absent
56//! from this capture layer.
57//!
58//! # Capture model
59//!
60//! [`capture`] walks `/proc` for every live tgid, enumerates its
61//! threads, and populates each [`ThreadState`] from a handful of
62//! procfs sources: `stat`, `schedstat`, `status`, `io`, `sched`,
63//! `comm`, `cgroup`. The procfs walk runs sequentially per tid in
64//! `capture_with` phase 2. Phase 1 attaches the jemalloc TSD
65//! probe in parallel across tgids when `use_syscall_affinity` is
66//! `true` (the production path); under `use_syscall_affinity =
67//! false` (the synthetic-tree test path), phase 1 is skipped
68//! entirely — the per-tgid probe map starts and stays empty, and
69//! phase 2's per-tid lookup falls through to the absent-counter
70//! default of zero. See "Probe wiring" below for the per-tgid
71//! mechanics.
72//!
73//! ## Probe wiring (most-expensive step)
74//!
75//! For every tgid the walk reaches, the capture pipeline calls
76//! the `pub(crate)` `host_thread_probe::attach_jemalloc_at` (or
77//! its default-root `attach_jemalloc` wrapper) to resolve the
78//! target's jemalloc TLS symbol + per-`tsd_s` field offsets via
79//! an ELF parse and DWARF walk; per-thread counter reads then
80//! dispatch through `host_thread_probe::probe_thread` for one
81//! ptrace cycle: seize → interrupt → waitpid → getregset →
82//! `process_vm_readv` → detach (the detach happens automatically
83//! via the `ScopeDetach` Drop guard, so any fallible step still
84//! leaves the target unstuck). The remote read pulls a
85//! contiguous 24-byte counter span — the canonical jemalloc
86//! `TSD_DATA_FAST` layout (allocated, fast-event slot,
87//! deallocated) — but the byte count is computed dynamically by
88//! `combined_read_span` from the DWARF-resolved field offsets, so
89//! a future jemalloc layout change is absorbed. This is the
90//! dominant wall-clock cost of a snapshot:
91//! O(unique-exe-inode tgids) ELF parses + O(jemalloc-linked
92//! tgids) DWARF walks + O(threads of jemalloc-linked tgids)
93//! ptrace cycles. The first term covers non-jemalloc tgids: each
94//! distinct `/proc/<pid>/exe` inode still costs one ELF parse to
95//! discover absence (the inode-keyed cache below collapses
96//! repeats). `attach_jemalloc_at` is the sole detection gate —
97//! tgids that attach successfully populate `allocated_bytes` /
98//! `deallocated_bytes`; tgids that fail attach (not jemalloc-
99//! linked, stripped binary, ptrace denied, arch mismatch — see
100//! `host_thread_probe::AttachError`) land their threads at the
101//! absent-counter default of zero.
102//!
103//! Phase 1 parallelism is gated by host CPU headroom (read from
104//! `<proc_root>/loadavg`, clamped to `[1, num_cpus/2 + 1]`) so the
105//! capture cannot drown a hot host with concurrent ELF reads.
106//! Per-tgid attach results are inode-keyed cached so a fork-bombed
107//! tgid family resolves DWARF once. The per-tgid wrapper
108//! `try_attach_probe_for_tgid_at` records every outcome in a single
109//! `ProbeSummary` tally; `emit_probe_summary` surfaces a single
110//! info-level line per snapshot summarising tgids walked, jemalloc
111//! detected, probed OK, failed, plus the dominant actionable
112//! failure tag and an EPERM remediation hint when ptrace-attach
113//! failures dominate.
114//!
115//! Each internal procfs reader returns `Option` (graceful on
116//! missing/unreadable — a kernel without `CONFIG_SCHED_INFO` (the
117//! `schedstat` file) or `CONFIG_TASK_IO_ACCOUNTING` (the `io` file)
118//! makes that file absent, so its reader yields `None` without
119//! failing the rest of the thread). The assembled
120//! [`ThreadState`] treats `None` as "absent at capture" via the
121//! field type — counters collapse to `0`, identity strings
122//! collapse to empty, affinity collapses to an empty vec. A
123//! missing reading is therefore indistinguishable from a genuine
124//! zero in the serialized output; the capture contract is
125//! best-effort, never-fail-the-snapshot. Tests that need stronger
126//! guarantees inspect the underlying readers directly (they remain
127//! `Option`-shaped, unit-tested in this module).
128//!
129//! # Privilege
130//!
131//! Pulling the jemalloc per-thread TSD counters requires
132//! `ptrace(PTRACE_SEIZE)` against the target. Under
133//! `kernel.yama.ptrace_scope=0` any same-uid process attaches.
134//! Under `=1` (Debian/Ubuntu host default) the tracer must be an
135//! ancestor of the target or carry `CAP_SYS_PTRACE`; `=2` and `=3`
136//! raise the bar further. When attach fails, the per-thread
137//! `allocated_bytes` / `deallocated_bytes` collapse to 0 per the
138//! best-effort contract — the rest of the snapshot still
139//! populates from procfs.
140
141use std::collections::BTreeMap;
142use std::fs;
143use std::path::{Path, PathBuf};
144
145use crate::sync::MutexExt;
146use anyhow::Result;
147
148/// Top-level serialized artifact produced by `ktstr ctprof`.
149///
150/// The file layout on disk is zstd-compressed JSON of this struct.
151/// Extension `.ctprof.zst` is conventional; nothing in the loader
152/// depends on the extension beyond being passed a path that
153/// resolves to a readable file.
154#[derive(Debug, Clone, Default, serde::Serialize, serde::Deserialize)]
155#[non_exhaustive]
156pub struct CtprofSnapshot {
157 /// Wall-clock time at capture, nanoseconds since the Unix
158 /// epoch. Useful as a tie-breaker when comparing two snapshots
159 /// that originate from the same host — the newer one is
160 /// candidate by default — but carries no load-bearing role in
161 /// any grouping axis.
162 pub captured_at_unix_ns: u64,
163
164 /// Host context snapshot (kernel, CPU, memory, tunables).
165 /// Optional because older tools or synthetic fixtures may
166 /// omit it; comparison degrades to a "host context unavailable"
167 /// line rather than failing the whole compare when either
168 /// side is missing.
169 pub host: Option<crate::host_context::HostContext>,
170
171 /// One entry per observed thread on the host at capture time.
172 /// Order is not load-bearing; the comparison pipeline groups
173 /// by `pcomm` / `cgroup` / `comm` depending on `--group-by`.
174 pub threads: Vec<ThreadState>,
175
176 /// Enrichment metadata for every cgroup that at least one
177 /// sampled thread resides in. Keyed by the cgroup path
178 /// relative to the v2 mount (e.g.
179 /// `/kubepods/burstable/pod-<id>/container`). Populated from
180 /// the cgroup filesystem, not the per-thread sample, because
181 /// cpu.stat / memory.current describe the cgroup's aggregate
182 /// state, not per-thread contribution.
183 pub cgroup_stats: BTreeMap<String, CgroupStats>,
184
185 /// Probe outcome statistics for the snapshot, when the probe
186 /// pass ran. `None` indicates the snapshot was assembled
187 /// without the per-tgid jemalloc probe walk (synthetic-tree
188 /// tests pass `use_syscall_affinity=false` to skip it).
189 /// `Some(_)` carries the per-snapshot tally — see
190 /// [`CtprofProbeSummary`] for the curated field set.
191 pub probe_summary: Option<CtprofProbeSummary>,
192
193 /// Procfs-read failure statistics for the snapshot, when the
194 /// capture pass ran in production mode. Mirrors the
195 /// `probe_summary` discipline: `None` indicates synthetic-tree
196 /// tests skipped it (`use_syscall_affinity=false`); `Some(_)`
197 /// carries the per-snapshot read-level failure tally — see
198 /// [`CtprofParseSummary`].
199 pub parse_summary: Option<CtprofParseSummary>,
200
201 /// Per-snapshot taskstats genetlink query outcome tally,
202 /// populated when the capture pass ran in production mode.
203 /// `None` mirrors `probe_summary` / `parse_summary`:
204 /// synthetic-tree tests pass `use_syscall_affinity=false`
205 /// which skips the netlink path entirely. `Some(_)` carries
206 /// the per-snapshot ok/eperm/esrch/other counts so an operator
207 /// can distinguish "no taskstats data because every tid raced
208 /// exit" (high `esrch_count`) from "no taskstats data because
209 /// the kernel was built without `CONFIG_TASKSTATS`" (the
210 /// netlink open failed up-front so every counter is zero)
211 /// from "no taskstats data because `CAP_NET_ADMIN` is missing"
212 /// (high `eperm_count`). See `crate::taskstats::TaskstatsSummary`
213 /// for the per-counter semantics and remediation guidance.
214 pub taskstats_summary: Option<crate::taskstats::TaskstatsSummary>,
215
216 /// Host-level Pressure Stall Information, populated from
217 /// `<proc_root>/pressure/{cpu,memory,io,irq}`. Captures
218 /// system-wide stall pressure across the four kernel-exposed
219 /// resources. Defaults to all-zero when the kernel has
220 /// CONFIG_PSI off or when individual resource files are
221 /// absent. See [`Psi`] for the per-resource shape and the
222 /// system-level cpu.full / irq.some caveats.
223 pub psi: Psi,
224
225 /// Global sched_ext sysfs state from `/sys/kernel/sched_ext/`.
226 /// `None` when CONFIG_SCHED_CLASS_EXT is not built (no
227 /// `sched_ext` sysfs directory exists), or when the
228 /// directory itself is unreadable. See [`SchedExtSysfs`]
229 /// for the per-field shape and kernel cites. Populated
230 /// during the same capture pass as PSI.
231 pub sched_ext: Option<SchedExtSysfs>,
232}
233
234/// Per-snapshot probe outcome statistics. Curated projection of
235/// the capture pipeline's internal probe tally — exposes the
236/// counters, the dominant failure tag, and a `privilege_dominant`
237/// boolean a downstream consumer needs to decide whether the
238/// snapshot's `allocated_bytes` / `deallocated_bytes` fields are
239/// trustworthy on a given host without parsing the operator-
240/// facing tracing line.
241///
242/// The internal probe taxonomy (the per-variant
243/// `host_thread_probe::AttachError` and `ProbeError` enums) is
244/// deliberately NOT mirrored here — it is implementation
245/// detail that may change shape without breaking this contract.
246/// `dominant_failure` carries the operator-facing tag string
247/// (e.g. `"ptrace-seize"`, `"dwarf-parse-failure"`) that the
248/// capture pipeline already surfaces in its tracing summary; the
249/// stable token format is documented in the `ktstr ctprof
250/// capture` CLI help. `privilege_dominant` mirrors the same gate
251/// that prints the EPERM remediation hint — true when ≥ 50% of
252/// `failed` is `ptrace-seize` or `ptrace-interrupt`.
253///
254/// The four counters are zero when the probe pass reached zero
255/// tgids (e.g. an empty `proc_root`); `dominant_failure` is
256/// `None` when no actionable failures landed; `privilege_dominant`
257/// is `false` when there are no failures or when ptrace failures
258/// are strictly less than half of `failed` (the `>= 50%` gate
259/// accepts equality at the boundary).
260///
261/// # Examples
262///
263/// ```no_run
264/// let snap = ktstr::ctprof::capture();
265/// if let Some(ps) = &snap.probe_summary {
266/// if let Some(hint) = ps.remediation_hint() {
267/// eprintln!("{hint}");
268/// }
269/// if let Some(tag) = &ps.dominant_failure {
270/// eprintln!("dominant failure: {tag}");
271/// }
272/// }
273/// ```
274#[derive(Debug, Clone, Default, serde::Serialize, serde::Deserialize)]
275#[non_exhaustive]
276pub struct CtprofProbeSummary {
277 /// Total tgids the probe pass walked. Equals the number of
278 /// `/proc/<pid>` directories the capture saw, minus the
279 /// calling process's own tgid (which is skipped because
280 /// `PTRACE_SEIZE` rejects self-attach).
281 pub tgids_walked: u64,
282 /// Tgids whose `attach_jemalloc_at` call succeeded — i.e.
283 /// the target was identified as jemalloc-linked, the TSD
284 /// symbol resolved, and the per-`tsd_s` field offsets came
285 /// out of the DWARF walk. A subset of `tgids_walked`.
286 pub jemalloc_detected: u64,
287 /// Per-thread probe reads that returned a counter pair.
288 /// Bounded above by the sum of thread counts across all
289 /// `jemalloc_detected` tgids; per-thread failures (target
290 /// thread exited mid-attach, EPERM, etc.) reduce this count
291 /// below the upper bound.
292 pub probed_ok: u64,
293 /// Attach-or-probe failures whose tag is classified
294 /// ACTIONABLE — see the `ktstr ctprof capture` CLI help
295 /// for the full filter rule and tag taxonomy. Routine
296 /// non-actionable outcomes (target not jemalloc-linked,
297 /// `readlink` race-with-exit) do NOT contribute to this
298 /// count.
299 pub failed: u64,
300 /// Tag string for the most-frequent actionable failure across
301 /// all attach-and-probe failures. `None` when `failed == 0`.
302 /// Stable single-word identifiers — the wire contract that
303 /// downstream consumers match against. The full taxonomy is
304 /// documented in the `ktstr ctprof capture` CLI help.
305 /// Examples: `"ptrace-seize"`, `"dwarf-parse-failure"`,
306 /// `"jemalloc-in-dso"`.
307 pub dominant_failure: Option<String>,
308 /// `true` when the ptrace failure share crosses the
309 /// hint-trigger threshold (≥ 50% of `failed` is `ptrace-seize`
310 /// or `ptrace-interrupt`). Mirrors the same gate that prints
311 /// the EPERM remediation hint in the operator-facing tracing
312 /// summary, so a downstream consumer can reproduce that
313 /// signal without parsing the log line. When `true`,
314 /// rerunning the capture binary with `CAP_SYS_PTRACE`
315 /// (e.g. `sudo setcap cap_sys_ptrace+eip $(which ktstr)`,
316 /// or run as root, or `sysctl kernel.yama.ptrace_scope=0`)
317 /// resolves most attach failures so jemalloc TSD attach
318 /// succeeds across foreign tgids. `false` when
319 /// `failed == 0` (no failures to dominate) or when ptrace
320 /// failures are strictly less than half of `failed` (the
321 /// `>= 50%` gate accepts equality at the boundary).
322 ///
323 /// Independent of [`Self::dominant_failure`]: ptrace failures
324 /// are tallied across both `ptrace-seize` and
325 /// `ptrace-interrupt` for the threshold, while
326 /// `dominant_failure` reports a single per-tag plurality.
327 /// When ptrace counts split across the two tags,
328 /// `privilege_dominant` may be `true` while
329 /// `dominant_failure` names a non-ptrace tag that won the
330 /// single-tag plurality. Conversely, `dominant_failure` may
331 /// name a ptrace tag while `privilege_dominant` is `false`
332 /// when ptrace failures are below the 50% threshold.
333 pub privilege_dominant: bool,
334}
335
336impl CtprofProbeSummary {
337 /// Operator-facing remediation hint when ptrace failures
338 /// dominate the snapshot. Returns `Some(&'static str)` —
339 /// the same `PTRACE_EPERM_HINT` constant the capture
340 /// pipeline embeds in its tracing summary line (a one-liner
341 /// naming `cap_sys_ptrace` — the `setcap`-form spelling of
342 /// the capability — and `kernel.yama.ptrace_scope`), or
343 /// `None` when [`Self::privilege_dominant`] is false. Lets a
344 /// downstream consumer surface the same fix-it message
345 /// without parsing the log line or hand-rolling the gate.
346 pub fn remediation_hint(&self) -> Option<&'static str> {
347 if self.privilege_dominant {
348 Some(PTRACE_EPERM_HINT)
349 } else {
350 None
351 }
352 }
353}
354
355/// Per-snapshot procfs read-failure statistics. Curated projection
356/// of the capture pipeline's internal read-tally — exposes per-file
357/// counters and a dominant-failure tag a downstream consumer needs
358/// to decide whether the snapshot's procfs-derived fields (CSW,
359/// schedstats, IO, etc.) are trustworthy on a given host without
360/// scanning every thread for default values.
361///
362/// The read-failure tally ([`Self::read_failures`] /
363/// [`Self::read_failures_by_file`]) is read-level only — it
364/// counts failures of `fs::read_to_string` against
365/// `/proc/<tgid>/task/<tid>/<file>`, not per-field parse failures
366/// inside an otherwise-readable file.
367/// A present-but-malformed file (e.g. a corrupt `stat` whose
368/// `parse_stat` returns all-`None`) does NOT count: the file read
369/// succeeded so the tally stays at zero for that category, even
370/// though the per-field parsers fold every value to its absent-
371/// counter default. Read failures correspond to the kernel never
372/// having written the file (ENOENT / kernel without
373/// `CONFIG_SCHED_INFO`), the file disappearing mid-capture (race),
374/// or any other I/O-level error from the procfs reader. A snapshot
375/// with 1 K schedstat failures across 1 K tids implies a kernel
376/// build without `CONFIG_SCHED_INFO`; 47 stat failures across 1 K
377/// tids implies mid-capture races.
378///
379/// One parse-level signal IS surfaced separately:
380/// [`Self::negative_dotted_values`] counts the per-line cases in
381/// `/proc/<tid>/sched` where the kernel's PN_SCHEDSTAT format
382/// emitted a leading `-` — a rare but observable clock-skew /
383/// suspend-resume artifact that the parser otherwise folds
384/// silently to zero. Other forms of per-field corruption (
385/// non-numeric fractional, malformed key, …) stay outside this
386/// summary's scope and surface as zero values on the affected
387/// `ThreadState` fields.
388///
389/// Per-file tokens in [`Self::read_failures_by_file`] are stable
390/// kebab-case identifiers downstream consumers match against. The
391/// recognized set: `"stat"`, `"schedstat"`, `"io"`, `"status"`,
392/// `"sched"`, `"cgroup"`, `"smaps_rollup"`. Adding a new procfs
393/// file to the capture adds a new key; the wire shape carries
394/// any token the capture emitted, so a consumer that only knows
395/// the existing set absorbs new keys without breaking.
396///
397/// Ghost-filtered tids do NOT contribute to `read_failures` /
398/// `read_failures_by_file` — their pending failure bumps are
399/// unwound via `discard_pending` when a thread ends up filtered
400/// out of `threads` (empty comm + zero start_time), so a busy
401/// host with mid-capture exits doesn't inflate the failure tallies
402/// with counts that would correspond to threads the snapshot
403/// doesn't even contain. `tids_walked` still counts every walk
404/// attempt regardless of the ghost filter outcome.
405///
406/// # Examples
407///
408/// ```no_run
409/// let snap = ktstr::ctprof::capture();
410/// if let Some(ps) = &snap.parse_summary
411/// && let Some(hint) = ps.kernel_config_hint()
412/// {
413/// eprintln!("{hint}");
414/// }
415/// ```
416#[derive(Debug, Clone, Default, serde::Serialize, serde::Deserialize)]
417#[non_exhaustive]
418pub struct CtprofParseSummary {
419 /// Total tids the capture pass attempted to read across every
420 /// tgid. Non-zero whenever the capture walked any tid; the
421 /// denominator a downstream consumer uses to compute "what
422 /// fraction of reads failed" without parsing the operator-
423 /// facing tracing line.
424 pub tids_walked: u64,
425 /// Total file-level read failures across all categories. Sum
426 /// of [`Self::read_failures_by_file`] values.
427 pub read_failures: u64,
428 /// Per-file-kind failure tally, keyed by stable kebab tokens
429 /// (`"stat"`, `"schedstat"`, `"io"`, `"status"`, `"sched"`,
430 /// `"cgroup"`, `"smaps_rollup"`). Empty map when the capture
431 /// saw zero failures. Keys present in the map have non-zero
432 /// counts; absent keys imply zero failures for that category,
433 /// NOT "category unknown".
434 pub read_failures_by_file: BTreeMap<String, u64>,
435 /// Tag string for the file kind with the most read failures
436 /// across the snapshot. `None` when `read_failures == 0`.
437 /// Stable kebab tokens (the same vocabulary
438 /// [`Self::read_failures_by_file`] keys against). Ties resolve
439 /// REVERSE-alphabetically so the output is deterministic — the
440 /// alphabetically-earlier tag wins (e.g. `"io"` beats
441 /// `"status"` when both count equal).
442 pub dominant_read_failure: Option<String>,
443 /// `true` when ≥ 50% of `read_failures` are concentrated in
444 /// kernel-config-gated files (`"schedstat"`, `"io"`). These
445 /// two files are absent on kernels built without
446 /// `CONFIG_SCHED_INFO` / `CONFIG_TASK_IO_ACCOUNTING`
447 /// respectively, so a dominance signal here points the
448 /// operator at a kernel build/config issue rather than a
449 /// transient race or permission problem. `false` when
450 /// `read_failures == 0` or when failures are spread across
451 /// non-kconfig files.
452 pub kernel_config_dominant: bool,
453 /// Number of `/proc/<tid>/sched` PN_SCHEDSTAT dotted-ns
454 /// values whose integer part read as negative (kernel emitted
455 /// a leading `-`, e.g. `-5.000000`). The capture-side parser
456 /// (`parsed_ns_from_dotted`) rejects negative integer parts —
457 /// a `u64` parse cannot accept the sign — and the call site
458 /// then `unwrap_or(0)`s the resulting `None` per the
459 /// best-effort capture contract. Without this counter the
460 /// silent fold to zero leaves operators with no visibility
461 /// into the rate at which schedstat values were silently
462 /// truncated.
463 ///
464 /// Counts per-field-occurrence, NOT per-thread: a single
465 /// tid that exposed five negative dotted fields contributes
466 /// `5` to this counter (e.g. one tid with negative `wait_sum`,
467 /// `sleep_max`, `block_sum`, `iowait_sum`, and `exec_max`
468 /// adds 5). The denominator for "fraction of tids affected"
469 /// is therefore NOT this field — pair with
470 /// [`Self::tids_walked`] only as an upper bound on
471 /// affected-tid count.
472 ///
473 /// Distinct from [`Self::read_failures`]: a negative dotted
474 /// value comes from a `sched` file that READ successfully —
475 /// it is a parse-level signal, not a read-level signal. The
476 /// field stays at zero on a clean host because the kernel
477 /// emits non-negative values on every well-behaved schedstat
478 /// path; non-zero values are most commonly the result of
479 /// clock-skew on suspend/resume where a `delta` calculation
480 /// against a stale baseline lands negative.
481 ///
482 /// Ghost-filter discipline: per-tid bumps are held pending
483 /// (alongside the read-failure bumps in
484 /// [`crate::ctprof`]'s capture-side `ParseTally`), and
485 /// unwound via `discard_pending` when the surrounding tid is
486 /// rejected by the empty-comm + zero-start ghost filter so a
487 /// busy host with mid-capture exits doesn't inflate this
488 /// counter with bumps that correspond to threads the snapshot
489 /// doesn't even contain.
490 pub negative_dotted_values: u64,
491}
492
493impl CtprofParseSummary {
494 /// Operator-facing hint when kernel-config-gated file failures
495 /// dominate the snapshot. Returns `Some(&'static str)` naming
496 /// the two `CONFIG_*` knobs that gate the affected files
497 /// (`CONFIG_SCHED_INFO` for `schedstat`, `CONFIG_TASK_IO_ACCOUNTING`
498 /// for `io`), or `None` when [`Self::kernel_config_dominant`]
499 /// is `false`. Lets a downstream consumer surface a remediation
500 /// pointer without parsing the log line or hand-rolling the
501 /// gate, mirroring the [`CtprofProbeSummary::remediation_hint`]
502 /// pattern.
503 pub fn kernel_config_hint(&self) -> Option<&'static str> {
504 if self.kernel_config_dominant {
505 Some(PARSE_KCONFIG_HINT)
506 } else {
507 None
508 }
509 }
510}
511
512/// Stable kernel-config remediation hint for parse summaries.
513/// Names the two procfs files that disappear on kernels built
514/// without the corresponding `CONFIG_*` knobs.
515const PARSE_KCONFIG_HINT: &str = "hint: schedstat / io read failures dominate — \
516 kernel may be built without CONFIG_SCHED_INFO \
517 and/or CONFIG_TASK_IO_ACCOUNTING";
518
519/// Absent-value sentinel for [`ThreadState::state`]. Used by both
520/// the manual [`Default`] impl on [`ThreadState`] and the
521/// `serde(default = ...)` attribute on the field so the absent
522/// state is `'~'` regardless of how a [`ThreadState`] gets
523/// constructed (default-built test fixture, partial JSON
524/// deserialize, capture-time `unwrap_or` fallback).
525///
526/// `'~'` (U+007E = 126) is chosen specifically because it sorts
527/// strictly AFTER every entry in `fs/proc/array.c::task_state_array`
528/// — `R` (82), `S` (83), `D` (68), `T` (84), `t` (116), `X`
529/// (88), `Z` (90), `P` (80), `I` (73) all have lower codepoints.
530/// [`crate::ctprof_compare::aggregate`] breaks the
531/// categorical-mode count-ties (rules
532/// [`crate::ctprof_compare::AggRule::Mode`] /
533/// [`crate::ctprof_compare::AggRule::ModeChar`] /
534/// [`crate::ctprof_compare::AggRule::ModeBool`]) toward the
535/// LEX-SMALLEST candidate (the closure
536/// `a.1.cmp(&b.1).then(b.0.cmp(&a.0))` inside the
537/// `Modeable::mode_across` reduction), so a sentinel smaller
538/// than the real letters would HIJACK the tiebreak whenever a
539/// default-built thread sat alongside a real one in the same
540/// group. `'~'` is larger than all of them, so the real kernel
541/// letter always wins the tie.
542///
543/// `'?'` (U+003F = 63) was the obvious-looking pick but is
544/// numerically SMALLER than every state letter the kernel
545/// emits, which would make it a tiebreak hijacker rather than
546/// a safe sentinel. Avoid.
547fn default_state_char() -> char {
548 '~'
549}
550
551/// Per-thread resource profile.
552///
553/// Populated by the capture layer from `/proc/<tid>/{sched,status,
554/// io,stat,comm,cgroup}`, `sched_getaffinity`, the taskstats
555/// genetlink path (delay-accounting + memory-watermark fields),
556/// and (for jemalloc-linked processes only, via ptrace +
557/// `process_vm_readv`) the per-thread `tsd_s.thread_allocated` /
558/// `thread_deallocated` TLS counters.
559///
560/// Field families (mirrors the module-level breakdown, with
561/// the registry-pairing reductions named):
562///
563/// - **Cumulative counters and totals** (the majority): wakeups,
564/// migrations, csw, run/wait/sleep/block/iowait time,
565/// schedstat counts, page-fault counters, syscall counters,
566/// byte counters, the taskstats per-bucket `*_count` and
567/// `*_delay_total_ns`, and the jemalloc per-thread
568/// allocated/deallocated TSD counters. Probe-timing invariant
569/// modulo monotonic forward progress; reduced via the
570/// `Sum*` rules.
571/// - **Lifetime high-water peaks**: schedstat `*_max` family,
572/// every taskstats `*_delay_max_ns` / `*_delay_min_ns`, and
573/// the memory watermarks ([`Self::hiwater_rss_bytes`],
574/// [`Self::hiwater_vm_bytes`]). Non-decreasing-over-time but
575/// per-event extrema, so non-summable across threads; the
576/// registry reduces them via `MaxPeak` / `MaxPeakBytes`.
577/// - **Instantaneous gauges** (sensitive to probe timing):
578/// [`Self::nr_threads`] (signal_struct->nr_threads snapshot),
579/// [`Self::fair_slice_ns`] (instantaneous `p->se.slice`),
580/// and [`Self::state`] (task_state_array letter). Two probes
581/// of the same thread at different instants can legitimately
582/// produce different values. Reduced via `MaxGaugeCount` /
583/// `MaxGaugeNs` / `ModeChar`.
584/// - **Categorical / ordinal scalars** (point-in-time
585/// snapshots): [`Self::policy`], [`Self::nice`],
586/// [`Self::priority`], [`Self::processor`],
587/// [`Self::rt_priority`], plus the identity strings
588/// ([`Self::pcomm`], [`Self::comm`], [`Self::cgroup`]) and
589/// the [`crate::metric_types::CpuSet`]
590/// [`Self::cpu_affinity`]. Sampled at capture time and can
591/// change at runtime (e.g. `sched_setaffinity` mid-run flips
592/// `processor` and `cpu_affinity`); reduced via `Mode*` /
593/// `Range*` / `Affinity`.
594///
595/// Same family taxonomy as the module-level block at the top of
596/// the file; the per-field docs flag the family on each entry
597/// and the registry's [`AggRule`] pairing makes the
598/// "category-mismatched aggregation is a compile error"
599/// invariant load-bearing.
600///
601/// [`AggRule`]: crate::ctprof_compare::AggRule
602///
603/// `Default` is implemented manually rather than derived because
604/// the [`Self::state`] field needs `'~'` (the absent-value
605/// sentinel) instead of `'\0'` (the `char` Default). See the
606/// field doc on [`Self::state`] for why: `'\0'` lex-compares
607/// SMALLER than every real kernel state letter, which would
608/// poison [`crate::ctprof_compare::AggRule::ModeChar`]
609/// tie-breaks toward "absent" whenever a default-constructed
610/// thread sat alongside a real one in a group.
611#[derive(Debug, Clone, serde::Serialize, serde::Deserialize)]
612#[non_exhaustive]
613pub struct ThreadState {
614 // -- identity --
615 /// Kernel task id. Ephemeral across runs; not used as a
616 /// grouping axis.
617 pub tid: u32,
618 /// Thread group id (process id). Ephemeral across runs.
619 pub tgid: u32,
620 /// Process name, read from `/proc/<tgid>/comm`. Stable across
621 /// runs on the same build. Feeds the grouping key under
622 /// `--group-by pcomm` (default), where it flows through the
623 /// token-based [`crate::ctprof_compare::pattern_key`]
624 /// normalizer so ephemeral worker pools (`worker-0`,
625 /// `worker-1`, ...) collapse into a single `worker-{N}`
626 /// bucket; pass `--no-thread-normalize` to group by literal
627 /// pcomm. Also feeds the smaps_rollup join key (with the same
628 /// normalization rules) so per-process memory rows survive
629 /// PID churn across snapshots.
630 pub pcomm: String,
631 /// Thread name, read from `/proc/<tid>/comm`. Stable when the
632 /// runtime assigns deterministic names (worker pools, async
633 /// runtimes). Feeds the grouping key under `--group-by comm`,
634 /// where it flows through the token-based
635 /// [`crate::ctprof_compare::pattern_key`] normalizer (same
636 /// rules as pcomm). Pass `--no-thread-normalize` to group by
637 /// literal comm, or `--group-by comm-exact` for the same
638 /// effect on this axis only (smaps still normalizes).
639 pub comm: String,
640 /// Cgroup v2 path.
641 ///
642 /// # Namespace semantics
643 ///
644 /// The path is read verbatim from `/proc/<tid>/cgroup` and
645 /// is therefore relative to the CGROUP NAMESPACE ROOT the
646 /// capturing process sees — NOT relative to the
647 /// system-global v2 mount root. A process outside the
648 /// capturing namespace would see the same cgroup under a
649 /// different path (prefixed with the namespace-root ancestors
650 /// the inner view hides); a process inside a nested cgroup
651 /// namespace sees a truncated path. Cross-namespace
652 /// comparison requires external canonicalization (e.g.
653 /// resolving via `cgroup.procs` inode chains or walking
654 /// `/proc/<tid>/ns/cgroup` to the common root) — the
655 /// capture layer deliberately does NOT attempt this because
656 /// the resolution depends on capture-site privilege and
657 /// namespace visibility that varies per caller.
658 ///
659 /// Kept as `cgroup` (not renamed to `cgroup_ns_relative`)
660 /// for consistency with `GroupBy::Cgroup`,
661 /// `cgroup_flatten`, `cgroup_stats`, and every CLI flag
662 /// that threads the same concept through the comparison
663 /// layer; a rename would cascade through every pinned
664 /// string in the compare pipeline without improving the
665 /// semantic guarantee. This doc is the canonical
666 /// documentation of the namespace-relative contract.
667 pub cgroup: String,
668 /// `/proc/<tid>/stat` field 22 (`start_time`) in USER_HZ
669 /// clock ticks since system boot. The kernel exports this
670 /// field in USER_HZ units (defined in
671 /// `include/asm-generic/param.h` as `USER_HZ == 100` on
672 /// every architecture the capture layer targets — x86_64
673 /// and aarch64) — NOT raw internal jiffies, which scale
674 /// with CONFIG_HZ. Cross-host comparison between x86_64 and
675 /// aarch64 is meaningful because USER_HZ is the same 100 on
676 /// both, so a diff between two hosts on different CONFIG_HZ
677 /// settings still compares correctly. Seconds-since-boot
678 /// is simply `start_time_clock_ticks / 100` on those
679 /// architectures. Other in-tree architectures carry
680 /// different USER_HZ (alpha defines 1024, for instance);
681 /// a future port must either restate the divisor or
682 /// normalise at capture time. `fs/proc/array.c::do_task_stat`
683 /// is where the kernel writes the field to procfs.
684 ///
685 /// Stored as raw `u64`, NOT wrapped in
686 /// [`crate::metric_types::ClockTicks`], because this field
687 /// is an identity / ghost-thread sentinel rather than a
688 /// metric that flows through the aggregation pipeline. The
689 /// ghost-filter in `capture_with` / `capture_pid_with`
690 /// keys on `start_time_clock_ticks == 0` (alongside an
691 /// empty `comm`) to drop `ThreadState`s assembled from a
692 /// tid that exited mid-capture, which is cleaner against a
693 /// raw `u64` than against a wrapped sentinel.
694 pub start_time_clock_ticks: u64,
695 /// Scheduling policy (SCHED_OTHER, SCHED_FIFO, SCHED_RR,
696 /// SCHED_BATCH, SCHED_IDLE, SCHED_DEADLINE, SCHED_EXT). Stored
697 /// as the canonical name string rather than the kernel
698 /// integer so comparison output is human-readable without a
699 /// reverse-lookup table. Wrapped in
700 /// [`crate::metric_types::CategoricalString`] so the
701 /// aggregation pipeline reduces by mode (most-frequent value)
702 /// rather than a category-mismatched sum or max.
703 pub policy: crate::metric_types::CategoricalString,
704 /// Nice value in the standard [-20, 19] range. Signed i32
705 /// because the range includes negative values and
706 /// `parse_stat` extracts the field via `get_i32` on
707 /// procfs's decimal text — the inner type matches the
708 /// extraction path and the kernel-visible range without
709 /// coercion. Wrapped in [`crate::metric_types::OrdinalI32`]
710 /// so the aggregation pipeline reduces by `[min, max]` range
711 /// rather than sum.
712 pub nice: crate::metric_types::OrdinalI32,
713 /// Allowed CPU set from `sched_getaffinity`. Sorted ascending.
714 /// Comparison aggregates via union across the group and
715 /// renders as "N cpus (range)" or "mixed" for heterogeneous
716 /// sets — see [`crate::ctprof_compare::AffinitySummary`].
717 /// Wrapped in [`crate::metric_types::CpuSet`] so the
718 /// aggregation pipeline routes through the dedicated
719 /// affinity-summary reduction rather than a numeric path.
720 pub cpu_affinity: crate::metric_types::CpuSet,
721
722 // -- task state (last-CPU, run-state) --
723 /// Last CPU the thread executed on. `/proc/<tid>/stat` field
724 /// 39 (`task_cpu(task)` in `fs/proc/array.c::do_task_stat`,
725 /// emitted via `seq_put_decimal_ll`). Signed for symmetry
726 /// with [`Self::nice`]; the kernel emits non-negative values
727 /// only — `task_cpu` (defined `unsigned int` in
728 /// `include/linux/sched.h`) zero-extends through the
729 /// `seq_put_decimal_ll` widening to `s64`. `0` is the
730 /// absent-value default (collisions with a legitimate CPU 0
731 /// are distinguished by inspecting `cpu_affinity`).
732 /// Wrapped in [`crate::metric_types::OrdinalI32`] so the
733 /// aggregation pipeline reduces by `[min, max]` range across
734 /// the group.
735 pub processor: crate::metric_types::OrdinalI32,
736 /// Single-letter task state from `/proc/<tid>/status` `State:`
737 /// line. Real kernel chars are `R`, `S`, `D`, `T`, `t`, `X`,
738 /// `Z`, `P`, `I` (see `fs/proc/array.c::task_state_array`,
739 /// emitted via `get_task_state`). `'~'` is the absent-value
740 /// sentinel — visually distinct from every real kernel char
741 /// so a downstream consumer can distinguish "no state read"
742 /// from a real value. When `'~'` appears in compare output,
743 /// the `/proc/<tid>/status` read failed (thread likely
744 /// exited mid-capture).
745 ///
746 /// `ThreadState::default()`, the capture-time
747 /// `unwrap_or_else(default_state_char)` fallback, and
748 /// `serde(default)` deserialize of a partial JSON record all
749 /// produce `'~'` (NOT `'\0'`, the bare `char` Default). The
750 /// manual `Default` impl on `ThreadState`, the
751 /// `unwrap_or_else` site in `capture_thread_at_with_tally`,
752 /// and the `serde(default = ...)` attribute on this field
753 /// are paired specifically so the absent-value sentinel is
754 /// the same byte everywhere.
755 ///
756 /// `'~'` (U+007E = 126) is chosen so it sorts AFTER every
757 /// real kernel state letter — `R` (82), `S` (83), `D` (68),
758 /// `T` (84), `t` (116), `X` (88), `Z` (90), `P` (80), `I`
759 /// (73). [`crate::ctprof_compare::AggRule::ModeChar`]
760 /// breaks count-ties toward the LEX-SMALLEST candidate, so
761 /// a sentinel smaller than the real letters would silently
762 /// elect "absent" whenever a default-built thread sat
763 /// alongside a real one in the same group. `'~'` being
764 /// larger than all of them lets the real letter win the
765 /// tie. The earlier `'?'` (U+003F = 63) sentinel was
766 /// numerically smaller than every real state letter — a
767 /// tiebreak hijacker; do not return to it.
768 #[serde(default = "default_state_char")]
769 pub state: char,
770
771 // -- scheduling (cumulative + lifetime peaks; /proc/<tid>/sched schedstat fields, need CONFIG_SCHEDSTATS) --
772 // -- (sched_ext gate: ext.enabled requires CONFIG_SCHED_CLASS_EXT) --
773 /// `true` when the task is currently scheduled by sched_ext —
774 /// `/proc/<tid>/sched` `ext.enabled` line. The kernel emits
775 /// the literal key `ext.enabled` only when
776 /// `CONFIG_SCHED_CLASS_EXT` is enabled; on kernels without it
777 /// the field is absent and lands at the default `false`. When
778 /// `false` on a task expected under sched_ext, the task may
779 /// have been ejected (sched_ext fall-back to CFS on BPF error)
780 /// or never enrolled.
781 ///
782 /// Stays a bare `bool` — not wrapped in a categorical newtype
783 /// — because it is the only bool-valued metric in the
784 /// registry. The
785 /// [`crate::ctprof_compare::AggRule::ModeBool`] dispatch
786 /// coerces it to a `String` via `to_string()`/`Display` at
787 /// the call site (see the
788 /// [`crate::metric_types::CategoricalString`] doc note: if a
789 /// second bool-valued metric appears, promote both to a
790 /// dedicated `CategoricalBool` wrapper rather than keeping
791 /// the ad-hoc coercion).
792 pub ext_enabled: bool,
793 /// Cumulative on-CPU time, ns; `/proc/<tid>/schedstat`
794 /// field 1. `MonotonicNs` per the lifetime-accumulator
795 /// contract.
796 pub run_time_ns: crate::metric_types::MonotonicNs,
797 /// Cumulative time waiting on the runqueue, ns;
798 /// `/proc/<tid>/schedstat` field 2. `MonotonicNs`.
799 pub wait_time_ns: crate::metric_types::MonotonicNs,
800 /// Number of times the task was scheduled onto a CPU;
801 /// `/proc/<tid>/schedstat` field 3. `MonotonicCount`.
802 pub timeslices: crate::metric_types::MonotonicCount,
803 /// Voluntary context switches — task gave up the CPU itself;
804 /// `/proc/<tid>/status` `voluntary_ctxt_switches`.
805 /// `MonotonicCount`.
806 pub voluntary_csw: crate::metric_types::MonotonicCount,
807 /// Involuntary context switches — task was preempted;
808 /// `/proc/<tid>/status` `nonvoluntary_ctxt_switches`.
809 /// `MonotonicCount`.
810 pub nonvoluntary_csw: crate::metric_types::MonotonicCount,
811 /// Total wakeups via `try_to_wake_up()`; `/proc/<tid>/sched`
812 /// `nr_wakeups`. `MonotonicCount`.
813 pub nr_wakeups: crate::metric_types::MonotonicCount,
814 /// Wakeups landed on the same CPU as the waker;
815 /// `/proc/<tid>/sched` `nr_wakeups_local`. `MonotonicCount`.
816 pub nr_wakeups_local: crate::metric_types::MonotonicCount,
817 /// Wakeups landed on a different CPU than the waker;
818 /// `/proc/<tid>/sched` `nr_wakeups_remote`. `MonotonicCount`.
819 pub nr_wakeups_remote: crate::metric_types::MonotonicCount,
820 /// `WF_SYNC` synchronous-wakeup hint count;
821 /// `/proc/<tid>/sched` `nr_wakeups_sync`. `MonotonicCount`.
822 pub nr_wakeups_sync: crate::metric_types::MonotonicCount,
823 /// Wakeups where the task migrated to a different CPU than
824 /// its prior one (`WF_MIGRATED`); `/proc/<tid>/sched`
825 /// `nr_wakeups_migrate`. Distinct from `nr_wakeups_remote`
826 /// (waker CPU != target CPU). `MonotonicCount`.
827 pub nr_wakeups_migrate: crate::metric_types::MonotonicCount,
828 /// Wakeups onto this CPU (cache-affine wakeup
829 /// fast-path). `/proc/<tid>/sched` `nr_wakeups_affine`,
830 /// emitted via `P_SCHEDSTAT`. Plain u64. Zero on kernels
831 /// without `CONFIG_SCHEDSTATS`. Zero under sched_ext:
832 /// `wake_affine` is a CFS-only path.
833 pub nr_wakeups_affine: crate::metric_types::MonotonicCount,
834 /// Total invocations of the cache-affine wakeup heuristic
835 /// `wake_affine()` — denominator for the affine-wake success
836 /// ratio (`nr_wakeups_affine / nr_wakeups_affine_attempts`).
837 /// `/proc/<tid>/sched` `nr_wakeups_affine_attempts`, emitted
838 /// via `P_SCHEDSTAT` (plain u64). The kernel increments this
839 /// counter unconditionally on every `wake_affine()` call in
840 /// `kernel/sched/fair.c::wake_affine`, then increments
841 /// `nr_wakeups_affine` only when the heuristic chose this
842 /// CPU — so the ratio is the success rate of the cache-
843 /// affine fast-path. Zero on kernels without
844 /// `CONFIG_SCHEDSTATS`. Zero under sched_ext: `wake_affine`
845 /// is a CFS-only path and `kernel/sched/ext.c` does not
846 /// increment this counter.
847 pub nr_wakeups_affine_attempts: crate::metric_types::MonotonicCount,
848 /// Total cross-CPU migrations of the task. Incremented
849 /// unconditionally in `kernel/sched/core.c` (`p->se.nr_migrations++`)
850 /// — no schedstat macro, no class gating. Always populated
851 /// regardless of `CONFIG_SCHEDSTATS` or scheduling class.
852 /// `MonotonicCount`.
853 pub nr_migrations: crate::metric_types::MonotonicCount,
854 /// Migrations forced by load balance (the load balancer
855 /// migrated the task even though the local heuristic would
856 /// have skipped it). `/proc/<tid>/sched` `nr_forced_migrations`,
857 /// plain u64 via `P_SCHEDSTAT`. Zero on kernels without
858 /// `CONFIG_SCHEDSTATS`.
859 pub nr_forced_migrations: crate::metric_types::MonotonicCount,
860 /// Failed migrations attributed to affinity mismatch — the
861 /// destination CPU was not in `cpus_allowed`. `/proc/<tid>/sched`
862 /// `nr_failed_migrations_affine`, plain u64 via `P_SCHEDSTAT`.
863 /// Zero on kernels without `CONFIG_SCHEDSTATS`.
864 pub nr_failed_migrations_affine: crate::metric_types::MonotonicCount,
865 /// Failed migrations attributed to the task being currently
866 /// running on the source CPU. `/proc/<tid>/sched`
867 /// `nr_failed_migrations_running`, plain u64 via `P_SCHEDSTAT`.
868 /// Zero on kernels without `CONFIG_SCHEDSTATS`.
869 pub nr_failed_migrations_running: crate::metric_types::MonotonicCount,
870 /// Failed migrations attributed to cache-hot heuristic — the
871 /// source CPU's cache was too hot to leave. `/proc/<tid>/sched`
872 /// `nr_failed_migrations_hot`, plain u64 via `P_SCHEDSTAT`.
873 /// Zero on kernels without `CONFIG_SCHEDSTATS`.
874 pub nr_failed_migrations_hot: crate::metric_types::MonotonicCount,
875 /// Total nanoseconds the task spent on the runqueue waiting
876 /// to be picked. Populated from `/proc/<tid>/sched`'s
877 /// `wait_sum` key — kernel emits via `PN_SCHEDSTAT` as
878 /// `ms.ns_remainder`, reconstructed by the parser to full ns.
879 /// Zero on kernels without `CONFIG_SCHEDSTATS`. Zero under
880 /// sched_ext: the kernel updates this counter via
881 /// `__update_stats_wait_end` (`kernel/sched/stats.c`), called
882 /// from CFS/RT/DL paths only — `kernel/sched/ext.c` does not
883 /// call that helper.
884 pub wait_sum: crate::metric_types::MonotonicNs,
885 /// Number of runqueue-wait windows the task accumulated —
886 /// the per-event tally that pairs with [`Self::wait_sum`].
887 /// Populated from `/proc/<tid>/sched`'s `wait_count` key
888 /// (kernel emits as `P_SCHEDSTAT`, plain u64). Zero on
889 /// kernels without `CONFIG_SCHEDSTATS`. Same write path as
890 /// `wait_sum` (`__update_stats_wait_end` in
891 /// `kernel/sched/stats.c`), so the same sched_ext caveat
892 /// applies: zero under sched_ext.
893 pub wait_count: crate::metric_types::MonotonicCount,
894 /// Longest single runqueue-wait window the task ever
895 /// experienced, in nanoseconds. `/proc/<tid>/sched` `wait_max`
896 /// emitted via `PN_SCHEDSTAT` (`ms.ns_remainder`,
897 /// reconstructed to full ns by the parser). Tail-latency
898 /// signal that pairs with the `wait_sum` average. Zero on
899 /// kernels without `CONFIG_SCHEDSTATS`. Zero under sched_ext:
900 /// the kernel sets this counter via
901 /// `__update_stats_wait_end` from CFS/RT/DL paths only —
902 /// `kernel/sched/ext.c` does not call that helper, so
903 /// sched_ext-managed tasks never accumulate wait_max.
904 pub wait_max: crate::metric_types::PeakNs,
905 /// Pure voluntary sleep time, nanoseconds — `TASK_INTERRUPTIBLE`
906 /// off-CPU windows only, with the involuntary-block
907 /// component already subtracted at capture.
908 ///
909 /// Computed at capture as `sum_sleep_runtime - sum_block_runtime`
910 /// (saturating; the read-skew window where block briefly
911 /// exceeds sleep collapses to zero). The kernel's
912 /// `sum_sleep_runtime` key (read via `PN_SCHEDSTAT` in
913 /// `/proc/<tid>/sched`) is the FULL off-CPU total because
914 /// `__update_stats_enqueue_sleeper` (`kernel/sched/stats.c`)
915 /// charges every sleeper window regardless of which sleep
916 /// state the task was in — voluntary sleep AND involuntary
917 /// block both contribute. Subtracting `sum_block_runtime`
918 /// at capture leaves the voluntary-sleep residual, which
919 /// is the operationally useful signal for "how much time
920 /// did this task spend on a syscall wait that wasn't a
921 /// kernel block."
922 ///
923 /// Capture-side normalization (rather than a derived
924 /// metric at compare time) means every consumer sees the
925 /// pre-normalized value without re-deriving — and the raw
926 /// kernel reading is intentionally NOT preserved in the
927 /// snapshot per the project's pre-1.0 disposable-sidecar
928 /// policy.
929 ///
930 /// There is no `voluntary_sleep_count` counterpart: the
931 /// kernel does not emit one — the scheduler records the
932 /// aggregate runtime but not the sleep-event count
933 /// separately from `nr_wakeups`, which already covers the
934 /// wake-side tally.
935 /// Zero on kernels without `CONFIG_SCHEDSTATS`. Zero under
936 /// sched_ext: `__update_stats_enqueue_sleeper` is called
937 /// from CFS/RT/DL paths only. Also zero when either
938 /// `sum_sleep_runtime` or `sum_block_runtime` fails to parse
939 /// from `/proc/<tid>/sched`: the residual is uncomputable
940 /// without both halves, and falling back to the unsubtracted
941 /// `sum_sleep_runtime` would mislabel involuntary block as
942 /// voluntary sleep.
943 pub voluntary_sleep_ns: crate::metric_types::MonotonicNs,
944 /// Longest single sleep window in nanoseconds.
945 /// `/proc/<tid>/sched` `sleep_max` emitted via `PN_SCHEDSTAT`
946 /// (`ms.ns_remainder`, reconstructed by the parser). Zero on
947 /// kernels without `CONFIG_SCHEDSTATS`. Zero under sched_ext:
948 /// the kernel sets this counter via
949 /// `__update_stats_enqueue_sleeper` from CFS/RT/DL paths
950 /// only.
951 pub sleep_max: crate::metric_types::PeakNs,
952 /// Total nanoseconds blocked in the scheduler — every path
953 /// that puts the task into `TASK_UNINTERRUPTIBLE` contributes:
954 /// swap-in, page-fault resolution, disk I/O, plus
955 /// mutex/rwsem/completion waits inside kernel code that
956 /// hold the task off the runqueue. Populated from
957 /// `/proc/<tid>/sched`'s `sum_block_runtime` key (kernel
958 /// emits `ms.ns_remainder` via `PN_SCHEDSTAT`; the parser
959 /// reconstructs full ns). `block_sum - iowait_sum` is
960 /// therefore an UPPER BOUND on non-iowait involuntary-block
961 /// time — swap/zswap decompression contributes, but so do
962 /// the lock-family waits, so the delta cannot be read as
963 /// swap latency without further attribution. There is no
964 /// `block_count` counterpart: the kernel does not emit one.
965 /// Zero on kernels without `CONFIG_SCHEDSTATS`. Zero under
966 /// sched_ext: the kernel updates this counter via
967 /// `__update_stats_enqueue_sleeper` (`kernel/sched/stats.c`),
968 /// called from CFS/RT/DL paths only.
969 pub block_sum: crate::metric_types::MonotonicNs,
970 /// Longest single block window in nanoseconds.
971 /// `/proc/<tid>/sched` `block_max` emitted via `PN_SCHEDSTAT`
972 /// (`ms.ns_remainder`, reconstructed by the parser). Tail-
973 /// latency signal that pairs with the `block_sum` average.
974 /// Zero on kernels without `CONFIG_SCHEDSTATS`. Zero under
975 /// sched_ext: the kernel sets this counter via
976 /// `__update_stats_enqueue_sleeper` from CFS/RT/DL paths
977 /// only.
978 pub block_max: crate::metric_types::PeakNs,
979 /// Total nanoseconds in I/O wait specifically (subset of
980 /// `block_sum`). Distinguishes disk-backed I/O delay from
981 /// the full involuntary-block total — callers that want
982 /// disk latency alone read this field, callers that want
983 /// every blocked window read `block_sum`. Populated from
984 /// `/proc/<tid>/sched`'s `iowait_sum` key (kernel emits
985 /// `ms.ns_remainder` via `PN_SCHEDSTAT`; the parser
986 /// reconstructs full ns). Zero on kernels without
987 /// `CONFIG_SCHEDSTATS`. Zero under sched_ext: the kernel
988 /// updates this counter via `__update_stats_enqueue_sleeper`
989 /// (`kernel/sched/stats.c`), called from CFS/RT/DL paths
990 /// only.
991 pub iowait_sum: crate::metric_types::MonotonicNs,
992 /// Number of I/O-wait windows the task accumulated — the
993 /// per-event tally that pairs with [`Self::iowait_sum`].
994 /// Populated from `/proc/<tid>/sched`'s `iowait_count` key
995 /// (kernel emits as `P_SCHEDSTAT`, plain u64). Zero on
996 /// kernels without `CONFIG_SCHEDSTATS`. Same write path as
997 /// `iowait_sum` (`__update_stats_enqueue_sleeper` in
998 /// `kernel/sched/stats.c`), so the same sched_ext caveat
999 /// applies: zero under sched_ext.
1000 pub iowait_count: crate::metric_types::MonotonicCount,
1001 /// Longest single CPU-burst (run-without-preempt window) in
1002 /// nanoseconds. `/proc/<tid>/sched` `exec_max` emitted via
1003 /// `PN_SCHEDSTAT` (`ms.ns_remainder`, reconstructed by the
1004 /// parser). Zero on kernels without `CONFIG_SCHEDSTATS`.
1005 /// Updated for sched_ext tasks too: the kernel sets it in
1006 /// `update_se` (`kernel/sched/fair.c`), which sched_ext
1007 /// reaches via `update_curr_scx` → `update_curr_common`.
1008 pub exec_max: crate::metric_types::PeakNs,
1009 /// Longest scheduling slice the task got before being
1010 /// preempted, in nanoseconds. `/proc/<tid>/sched` `slice_max`
1011 /// emitted via `PN_SCHEDSTAT` (`ms.ns_remainder`,
1012 /// reconstructed by the parser). Zero on kernels without
1013 /// `CONFIG_SCHEDSTATS`. Zero under sched_ext: the kernel sets
1014 /// this counter only in `set_next_entity`
1015 /// (`kernel/sched/fair.c`), a CFS-only path —
1016 /// sched_ext-managed tasks never accumulate slice_max even
1017 /// when CONFIG_SCHEDSTATS is enabled.
1018 pub slice_max: crate::metric_types::PeakNs,
1019
1020 // -- jemalloc per-thread TSD counters (tsd_s.thread_allocated / thread_deallocated, via ptrace) --
1021 /// Bytes allocated by this thread over its lifetime — read
1022 /// directly from jemalloc's per-thread TSD u64 counter
1023 /// (`tsd_s.thread_allocated`) via ptrace + `process_vm_readv`.
1024 /// Cumulative-from-thread-creation; jemalloc updates the
1025 /// per-thread TSD counters unconditionally on its alloc fast
1026 /// and slow paths, so attaching the probe late does not lose
1027 /// data.
1028 ///
1029 /// Distinct from [`crate::host_heap::HostHeapState::allocated_bytes`],
1030 /// which is the runner process's own
1031 /// `tikv_jemalloc_ctl::stats::allocated` reading — a global
1032 /// arena counter for the calling process. This field is the
1033 /// per-thread TSD counter for an arbitrary target thread the
1034 /// probe attached to.
1035 ///
1036 /// Zero when the capture layer could not pull the counter:
1037 /// (a) the target process is not linked against jemalloc,
1038 /// (b) the probe attach failed for any other reason (DWARF
1039 /// missing, jemalloc in a DSO rather than the main
1040 /// executable, arch mismatch),
1041 /// (c) the per-thread ptrace step failed (tid exited
1042 /// mid-capture, EPERM under YAMA scope=1 without
1043 /// `CAP_SYS_PTRACE`),
1044 /// or (d) the thread is in the calling process's own tgid
1045 /// (PTRACE_SEIZE rejects self-attach). All four collapse to
1046 /// zero per the best-effort "absent = 0" capture contract.
1047 /// Snapshot-level diagnosis lives on
1048 /// [`CtprofProbeSummary::dominant_failure`] (the per-tag
1049 /// plurality) and
1050 /// [`CtprofProbeSummary::privilege_dominant`] (the EPERM
1051 /// remediation gate, true when ptrace tags account for ≥ 50%
1052 /// of `failed`), reachable via
1053 /// [`CtprofSnapshot::probe_summary`]; the per-tag taxonomy
1054 /// is documented in the `ktstr ctprof capture` CLI help.
1055 pub allocated_bytes: crate::metric_types::Bytes,
1056 /// Bytes freed by this thread over its lifetime — read from
1057 /// jemalloc's per-thread TSD u64 counter
1058 /// (`tsd_s.thread_deallocated`) via the same probe path that
1059 /// populates [`Self::allocated_bytes`].
1060 /// `allocated_bytes - deallocated_bytes` is a thread-local
1061 /// estimate of currently-held bytes; the difference races
1062 /// any in-flight allocator activity since the two counters
1063 /// are sampled in one `process_vm_readv` over a 24-byte span
1064 /// the target may continue to mutate during the read.
1065 pub deallocated_bytes: crate::metric_types::Bytes,
1066
1067 // -- procfs /proc/<tid>/stat: page faults + CPU time (fields 10, 12, 14, 15) --
1068 /// Minor faults (no disk I/O). `/proc/<tid>/stat` field 10.
1069 pub minflt: crate::metric_types::MonotonicCount,
1070 /// Major faults (backed by disk). `/proc/<tid>/stat` field 12.
1071 pub majflt: crate::metric_types::MonotonicCount,
1072 /// User-mode CPU time in USER_HZ clock ticks since thread
1073 /// start. `/proc/<tid>/stat` field 14
1074 /// (`nsec_to_clock_t(utime)` in `fs/proc/array.c::do_task_stat`).
1075 /// USER_HZ-scaled like [`Self::start_time_clock_ticks`] —
1076 /// cross-host comparison between x86_64 and aarch64 is
1077 /// meaningful because USER_HZ is 100 on both, independent of
1078 /// CONFIG_HZ. Suffix `_clock_ticks` mirrors the existing
1079 /// `start_time_clock_ticks` precedent.
1080 pub utime_clock_ticks: crate::metric_types::ClockTicks,
1081 /// Kernel-mode CPU time in USER_HZ clock ticks since thread
1082 /// start. `/proc/<tid>/stat` field 15
1083 /// (`nsec_to_clock_t(stime)` in `fs/proc/array.c::do_task_stat`).
1084 /// Same USER_HZ scaling and `_clock_ticks` suffix convention as
1085 /// [`Self::utime_clock_ticks`].
1086 pub stime_clock_ticks: crate::metric_types::ClockTicks,
1087 /// Kernel-internal scheduler priority (signed). Distinct
1088 /// from [`Self::nice`] — `priority` is the post-bias
1089 /// scheduling priority (`task_prio(task)`) the scheduler
1090 /// uses for ordering, while `nice` is the
1091 /// userspace-presentable [-20, 19] preference.
1092 /// `/proc/<tid>/stat` field 18, emitted via
1093 /// `seq_put_decimal_ll(m, " ", priority)` (the local `priority`
1094 /// = `task_prio(task)`) in `do_task_stat()` (`fs/proc/array.c`).
1095 /// Range per `task_prio()` (`kernel/sched/syscalls.c`):
1096 /// CFS / SCHED_OTHER tasks see `[0..39]` (nice [-20..19]
1097 /// translated by `task_prio()` returning
1098 /// `p->prio - MAX_RT_PRIO`); SCHED_FIFO / SCHED_RR tasks
1099 /// see `[-2..-100]`; SCHED_DEADLINE tasks land at `-101`.
1100 /// Default 0 when the stat read fails — collides with the
1101 /// CFS nice-0 case, so a CFS task at default nice and an
1102 /// absent stat line both render 0. Wrapped in
1103 /// [`crate::metric_types::OrdinalI32`] for the
1104 /// `[min, max]` range reduction across a group.
1105 pub priority: crate::metric_types::OrdinalI32,
1106 /// Real-time scheduler priority. `/proc/<tid>/stat` field
1107 /// 40, emitted via `seq_put_decimal_ull(m, " ", task->rt_priority)`
1108 /// in `do_task_stat()` (`fs/proc/array.c`). Non-zero only when the task
1109 /// runs SCHED_FIFO or SCHED_RR; CFS / SCHED_OTHER tasks
1110 /// land at zero. Useful as a post-hoc filter to identify
1111 /// real-time threads in a snapshot. Wrapped in
1112 /// [`crate::metric_types::OrdinalU32`] for the
1113 /// `[min, max]` range reduction across a group; the inner
1114 /// `u32` matches the kernel's
1115 /// `unsigned int task_struct::rt_priority` declaration
1116 /// (`include/linux/sched.h`) exactly. Practical range is
1117 /// bounded `0..99` regardless of the type width.
1118 pub rt_priority: crate::metric_types::OrdinalU32,
1119
1120 // -- /proc/<tid>/sched additions (counters + ordinal + slice gauge) --
1121 /// Cumulative time this task forced its SMT sibling idle for
1122 /// core-scheduling, in nanoseconds. `/proc/<tid>/sched`
1123 /// `core_forceidle_sum`, dotted ms.ns format via
1124 /// `PN_SCHEDSTAT` in `proc_sched_show_task()` (`kernel/sched/debug.c`).
1125 /// Reconstructed to full ns via the same
1126 /// `parsed_ns_from_dotted` helper as `wait_sum` /
1127 /// `block_sum`.
1128 ///
1129 /// Increment occurs in `__account_forceidle_time()`
1130 /// (`kernel/sched/cputime.c`), called from
1131 /// `__sched_core_account_forceidle()`
1132 /// (`kernel/sched/core_sched.c`). The increment body is a plain
1133 /// `__schedstat_add(p->stats.core_forceidle_sum, delta)` —
1134 /// it is CLASS-AGNOSTIC. The caller iterates
1135 /// `for_each_cpu(i, smt_mask)` and picks
1136 /// `p = rq_i->core_pick ?: rq_i->curr` on each SMT sibling,
1137 /// charging whichever task is running there regardless of
1138 /// scheduling class. So a SCHED_EXT / DEADLINE / RR / FIFO
1139 /// task on a core-scheduled SMT cohort CAN accrue forceidle
1140 /// time the same way a CFS task can.
1141 ///
1142 /// Real gating is at the rq/build level, not per-task, and
1143 /// the runtime gates apply IN SERIES rather than equating —
1144 /// `sched_core_enabled(rq)` and `core_forceidle_count` are
1145 /// independent conditions that BOTH have to fire:
1146 ///
1147 /// - **Build:** `CONFIG_SCHED_CORE` (file-level `#ifdef` in
1148 /// `kernel/sched/cputime.c` and
1149 /// `kernel/sched/core_sched.c`).
1150 /// - **Build:** `CONFIG_SCHEDSTATS` (the caller's own
1151 /// `#ifdef CONFIG_SCHEDSTATS` in `__sched_core_account_forceidle()`).
1152 /// - **Runtime, scheduler-class entry:**
1153 /// `sched_core_enabled(rq)` is the FIRST gate — checked
1154 /// at `pick_next_task()` entry (`kernel/sched/core.c`)
1155 /// with an early `__pick_next_task()` return when false.
1156 /// No core-wide selection runs without this.
1157 /// - **Runtime, transient counter:**
1158 /// `rq->core->core_forceidle_count > 0` is a SEPARATE
1159 /// subsequent gate — `pick_next_task()` only invokes
1160 /// `sched_core_account_forceidle(rq)` when this counter is
1161 /// non-zero (`kernel/sched/core.c`); the
1162 /// `WARN_ON_ONCE(!rq->core->core_forceidle_count)` inside
1163 /// `__sched_core_account_forceidle()`
1164 /// (`kernel/sched/core_sched.c`) reasserts the same
1165 /// precondition. The early-return in the same function
1166 /// on `core_forceidle_start == 0` is then a third
1167 /// transient guard against accounting before
1168 /// forceidle has begun.
1169 /// - **Runtime, occupancy:** non-zero
1170 /// `core_forceidle_occupation` (the `WARN_ON_ONCE` in
1171 /// `__sched_core_account_forceidle()`).
1172 ///
1173 /// Kernels that fail any build gate, or rqs that fail any
1174 /// runtime gate, see this counter at zero for every task.
1175 /// Hosts where no SMT cohort has ever accumulated forceidle
1176 /// also see zero across the board.
1177 pub core_forceidle_sum: crate::metric_types::MonotonicNs,
1178 /// Per-thread `se.slice` in nanoseconds. For fair-class
1179 /// tasks (SCHED_NORMAL / SCHED_BATCH) this is the
1180 /// instantaneous slice CFS is currently running the task
1181 /// with. For SCHED_EXT tasks the line is still emitted but
1182 /// reflects stale `p->se.slice` state — ext-class
1183 /// schedulers maintain slice in `p->scx.slice` and do not
1184 /// update `p->se.slice`. Field name `fair_slice_ns` mirrors
1185 /// the kernel emission gate `fair_policy(p->policy)`, not a
1186 /// guarantee about which class actually populated the value.
1187 ///
1188 /// `/proc/<tid>/sched` `se.slice`, plain integer via
1189 /// `P(se.slice)` in `proc_sched_show_task()`
1190 /// (`kernel/sched/debug.c`), gated by `fair_policy(p->policy)`
1191 /// in the same function. `fair_policy()` is defined in
1192 /// `kernel/sched/sched.h` as
1193 /// `normal_policy(policy) || policy == SCHED_BATCH`, and
1194 /// `normal_policy()` (`sched.h`) returns true for
1195 /// SCHED_NORMAL AND, when `CONFIG_SCHED_CLASS_EXT` is
1196 /// built, for SCHED_EXT. So the line IS emitted for
1197 /// SCHED_EXT tasks on a sched_ext-enabled kernel — but the
1198 /// value carries the staleness caveat above. The parser
1199 /// cannot distinguish "ext-class hasn't refreshed
1200 /// `p->se.slice` since the task left the fair class" from
1201 /// "CFS task with a current slice that happens to equal the
1202 /// last value": that ambiguity is the user's to resolve via
1203 /// `policy` (also captured per-thread). Tasks under
1204 /// SCHED_DEADLINE / SCHED_RR / SCHED_FIFO / SCHED_IDLE land
1205 /// at the absent-line default of 0.
1206 ///
1207 /// This is a GAUGE (instantaneous current value), not a
1208 /// counter or high-water mark. Distinct from
1209 /// [`Self::slice_max`] which IS the schedstat lifetime
1210 /// high-water — a thread that hasn't run for a long time
1211 /// can have a stale `fair_slice_ns` value while `slice_max`
1212 /// continues to reflect the historical worst. Aggregation
1213 /// across a group uses `Max` so the rendered cell shows the
1214 /// longest current slice any thread in the group is running
1215 /// with — Sum would multiply a near-identical instantaneous
1216 /// value across the group and obscure the signal (and would
1217 /// also be semantically meaningless: instantaneous gauges
1218 /// do not add).
1219 pub fair_slice_ns: crate::metric_types::GaugeNs,
1220
1221 // -- /proc/<tid>/status (process-wide tgid count) --
1222 /// Total threads in this task's tgid (process-wide thread
1223 /// count, the `signal_struct->nr_threads` snapshot). Field
1224 /// name mirrors the kernel struct member to avoid collision
1225 /// with [`CtprofSnapshot::threads`] (the snapshot's own
1226 /// `Vec<ThreadState>`). `/proc/<pid>/status` `Threads:` line
1227 /// emitted in `task_sig()` (`fs/proc/array.c`) via
1228 /// `seq_put_decimal_ull(m, "Threads:\t", num_threads)`.
1229 /// Identical for every thread of the same tgid.
1230 ///
1231 /// Capture-side dedup: the field is populated ONLY on the
1232 /// thread leader (tid == tgid) and zero for non-leader
1233 /// threads of the same process. The registry pairs this with
1234 /// [`crate::ctprof_compare::AggRule::MaxGaugeCount`] (not
1235 /// Sum) so the rendered cell surfaces "the largest process
1236 /// represented in this bucket" regardless of grouping axis.
1237 /// Sum would be wrong under `--group-by comm` and
1238 /// `--group-by cgroup` because non-leader buckets get a 0
1239 /// contribution from every member — a bucket whose leader
1240 /// thread did NOT match the grouping
1241 /// would render 0 even though processes are represented.
1242 /// Wrapped in [`crate::metric_types::GaugeCount`] so the
1243 /// type system rejects sum-style aggregation: a bucket with
1244 /// N threads sharing a tgid would over-count the parent
1245 /// process N-fold under Sum, while Max is well-defined
1246 /// (largest current count any contributor reported).
1247 pub nr_threads: crate::metric_types::GaugeCount,
1248
1249 // -- /proc/<tid>/smaps_rollup (per-MM memory breakdown) --
1250 /// Per-process memory breakdown from
1251 /// `/proc/<tid>/smaps_rollup`, parsed as a key-value map
1252 /// with values in kilobytes (the kernel's native unit on
1253 /// this file — `__show_smap()` (`fs/proc/task_mmu.c`)
1254 /// emits every line as `Name: NN kB`).
1255 ///
1256 /// Stored as a [`BTreeMap`] for forward-compat with the
1257 /// open key set: rollup mode (gated in `__show_smap()`)
1258 /// emits 22 keys on a recent kernel — Rss, Pss, Pss_Dirty,
1259 /// Pss_Anon, Pss_File, Pss_Shmem, Shared_Clean,
1260 /// Shared_Dirty, Private_Clean, Private_Dirty, Referenced,
1261 /// Anonymous, KSM, LazyFree, AnonHugePages,
1262 /// ShmemPmdMapped, FilePmdMapped, Shared_Hugetlb,
1263 /// Private_Hugetlb, Swap, SwapPss, Locked, plus the
1264 /// `[rollup]` header which the parser elides. The map
1265 /// preserves any future-kernel keys without a schema bump.
1266 /// Pss is the most operationally valuable: proportional
1267 /// share of shared pages — distinguishes "sole owner" from
1268 /// "one of N sharing".
1269 ///
1270 /// Per-MM, not per-thread: every thread of the same tgid
1271 /// shares one mm_struct, so all threads expose identical
1272 /// values. Capture-side dedup populates ONLY the thread
1273 /// leader (tid == tgid) and leaves non-leader threads at
1274 /// the empty map. Mirrors [`Self::nr_threads`]'s
1275 /// leader-dedup discipline. The capture cost is one
1276 /// `read_to_string` per tgid (NOT per-tid) because
1277 /// non-leaders short-circuit before opening the file.
1278 ///
1279 /// Empty when smaps_rollup is absent (older kernels
1280 /// without `/proc/<pid>/smaps_rollup` support — added
1281 /// upstream in 4.14) or unreadable (typical
1282 /// permission-denied for /proc/1/smaps_rollup outside
1283 /// CAP_SYS_PTRACE).
1284 pub smaps_rollup_kib: BTreeMap<String, u64>,
1285
1286 // -- I/O (/proc/<tid>/io) --
1287 //
1288 // The whole file is emitted by `do_io_accounting`
1289 // (`fs/proc/base.c`) under a single `CONFIG_TASK_IO_ACCOUNTING`
1290 // gate, and `CONFIG_TASK_IO_ACCOUNTING` `depends on`
1291 // `CONFIG_TASK_XACCT` in `init/Kconfig` — so from the
1292 // procfs-reader perspective the file either appears with all
1293 // 7 fields or doesn't appear at all. The XACCT split that
1294 // sometimes shows up in kernel commentary describes the
1295 // increment-side path, not the procfs surface; for the
1296 // capture pipeline the relevant gate is `CONFIG_TASK_IO_ACCOUNTING`
1297 // for every field below.
1298 /// Bytes read at the read syscall layer (incl. cached /
1299 /// pagecache hits). Gated by `CONFIG_TASK_IO_ACCOUNTING`.
1300 pub rchar: crate::metric_types::Bytes,
1301 /// Bytes written at the write syscall layer (incl.
1302 /// pagecache / writeback). Gated by `CONFIG_TASK_IO_ACCOUNTING`.
1303 pub wchar: crate::metric_types::Bytes,
1304 /// Number of read syscalls. Gated by `CONFIG_TASK_IO_ACCOUNTING`.
1305 pub syscr: crate::metric_types::MonotonicCount,
1306 /// Number of write syscalls. Gated by `CONFIG_TASK_IO_ACCOUNTING`.
1307 pub syscw: crate::metric_types::MonotonicCount,
1308 /// Bytes that hit the storage device on read (excludes
1309 /// pagecache hits). Gated by `CONFIG_TASK_IO_ACCOUNTING`.
1310 pub read_bytes: crate::metric_types::Bytes,
1311 /// Bytes that hit the storage device on write
1312 /// (post-writeback). Gated by `CONFIG_TASK_IO_ACCOUNTING`.
1313 pub write_bytes: crate::metric_types::Bytes,
1314 /// Bytes the kernel deaccounted from a prior dirty-write
1315 /// because the page was reclaimed without writeback (truncate,
1316 /// inode invalidation). `/proc/<tid>/io` 7th line, gated by
1317 /// `CONFIG_TASK_IO_ACCOUNTING`.
1318 ///
1319 /// `include/linux/task_io_accounting_ops.h`
1320 /// (`task_io_account_cancelled_write`) increments
1321 /// `current->ioac.cancelled_write_bytes` — i.e. the value
1322 /// records on the task that triggers the deaccount
1323 /// (the truncating / unmapping task), NOT the original
1324 /// writer. Sole call site is `folio_account_cleaned`
1325 /// (`mm/page-writeback.c`), invoked when a dirty folio
1326 /// is reclaimed without going through writeback.
1327 ///
1328 /// Operationally this is a "negative write" signal — bytes
1329 /// the kernel previously charged to a thread's `wchar`
1330 /// pipeline that never ended up on disk. Higher values mean
1331 /// more wasted writeback intent. Per-thread interpretation
1332 /// is asymmetric vs. [`Self::write_bytes`]: a thread's
1333 /// `cancelled_write_bytes` does NOT correspond to its own
1334 /// `write_bytes` — the writer and the canceller may be
1335 /// distinct tasks. Group-level Sum across a registry-grouped
1336 /// bucket is therefore meaningful (total bytes the bucket's
1337 /// threads cancelled), but per-thread `actual_write_bytes
1338 /// = write_bytes - cancelled_write_bytes` is NOT defined for
1339 /// that reason — the two counters track different parties.
1340 pub cancelled_write_bytes: crate::metric_types::Bytes,
1341
1342 // -- taskstats delay accounting + memory watermarks (genetlink TASKSTATS family) --
1343 //
1344 // Per-tid records captured via the kernel's taskstats
1345 // genetlink interface (NOT exposed in /proc/<tid>/sched or
1346 // /proc/<tid>/stat). Two field families:
1347 //
1348 // 1. Delay accounting — eight categories (cpu/blkio/swapin/
1349 // freepages/thrashing/compact/wpcopy/irq), each carrying
1350 // count (number of events), delay_total_ns (cumulative
1351 // ns of delay), delay_max_ns (longest single window),
1352 // delay_min_ns (shortest non-zero window observed;
1353 // sentinel 0 means "no events"). Gated on
1354 // `CONFIG_TASKSTATS` + `CONFIG_TASK_DELAY_ACCT` plus the
1355 // runtime `delayacct=on` toggle (sysctl
1356 // `kernel.task_delayacct` or boot param `delayacct`).
1357 //
1358 // 2. Memory watermarks — `hiwater_rss_bytes` and
1359 // `hiwater_vm_bytes`. Gated on `CONFIG_TASKSTATS` +
1360 // `CONFIG_TASK_XACCT` (NOT `CONFIG_TASK_DELAY_ACCT`).
1361 // Populated from the shared `mm_struct` so sibling tgid
1362 // threads report identical values, and kernel threads
1363 // (mm == NULL) leave the field at zero — see the
1364 // per-field doc on `hiwater_rss_bytes`.
1365 //
1366 // Capture path is the [`crate::taskstats`] module —
1367 // best-effort, all fields collapse to zero when:
1368 // - the kernel was built without `CONFIG_TASKSTATS`,
1369 // - the relevant per-family kconfig is off (DELAY_ACCT or
1370 // XACCT, depending on the field),
1371 // - the runtime `delayacct=on` toggle is off (delay-family
1372 // fields only — XACCT does not gate on the toggle),
1373 // - the calling process lacks `CAP_NET_ADMIN`,
1374 // - the per-tid query races a task exit (ESRCH).
1375 //
1376 // CAVEATS:
1377 // - cpu_delay is RACY (sched_info path, no lock) — count and
1378 // delay_total are not updated atomically.
1379 // - swapin and thrashing OVERLAP — a thrashing event is also
1380 // a swapin event from the syscall layer; do not sum.
1381 // - delay_min == 0 means "no events observed", NOT "saw a
1382 // zero-ns event". Compare against the matching count.
1383 // - hiwater_* values are per-mm, not per-thread; sibling
1384 // tgid threads report identical values, kernel threads
1385 // (mm == NULL) report zero. See the per-field doc.
1386 /// Number of off-CPU windows the task waited for the runqueue
1387 /// to schedule it. Source: taskstats `cpu_count`, populated at
1388 /// query time from `tsk->sched_info.pcount` (incremented by
1389 /// `sched_info_arrive` in `kernel/sched/stats.h`, line 282).
1390 /// `delayacct_add_tsk` (`kernel/delayacct.c::delayacct_add_tsk`,
1391 /// line 175) snapshots the value into the reply via
1392 /// `d->cpu_count += t1` where `t1 = tsk->sched_info.pcount`.
1393 pub cpu_delay_count: crate::metric_types::MonotonicCount,
1394 /// Cumulative ns the task spent waiting on the runqueue.
1395 /// Source: taskstats `cpu_delay_total`. RACY: count and total
1396 /// are not updated atomically (sched_info path, no lock); a
1397 /// concurrent reader may observe count or total advance ahead
1398 /// of the other.
1399 pub cpu_delay_total_ns: crate::metric_types::MonotonicNs,
1400 /// Longest single CPU-wait window, ns. Source: taskstats
1401 /// `cpu_delay_max`. Same lifetime-watermark semantics as
1402 /// `wait_max` / `block_max` — `MaxPeak` aggregation surfaces
1403 /// the worst single window any thread in the group ever
1404 /// experienced.
1405 pub cpu_delay_max_ns: crate::metric_types::PeakNs,
1406 /// Shortest non-zero CPU-wait window, ns. Source: taskstats
1407 /// `cpu_delay_min`. Sentinel 0 means "no events observed":
1408 /// the kernel writes the field on every event, so 0 is
1409 /// distinguishable from a genuine zero-ns event by checking
1410 /// `cpu_delay_count == 0`. `PeakNs` aggregation surfaces "the
1411 /// largest minimum any thread reported" across the group.
1412 pub cpu_delay_min_ns: crate::metric_types::PeakNs,
1413 /// Number of block-I/O wait windows. Source: taskstats
1414 /// `blkio_count`. Updates from `delayacct_blkio_start/end` in
1415 /// `kernel/delayacct.c`.
1416 pub blkio_delay_count: crate::metric_types::MonotonicCount,
1417 /// Cumulative ns the task waited on synchronous block I/O.
1418 /// Source: taskstats `blkio_delay_total`. Distinct from
1419 /// `iowait_sum` (schedstat) which counts a different bucket;
1420 /// the delayacct path is the canonical block-I/O delay
1421 /// accounting.
1422 pub blkio_delay_total_ns: crate::metric_types::MonotonicNs,
1423 /// Longest single block-I/O wait window, ns. Source: taskstats
1424 /// `blkio_delay_max`.
1425 pub blkio_delay_max_ns: crate::metric_types::PeakNs,
1426 /// Shortest non-zero block-I/O wait window, ns. Source:
1427 /// taskstats `blkio_delay_min`. Sentinel-0 caveat per
1428 /// `cpu_delay_min_ns`.
1429 pub blkio_delay_min_ns: crate::metric_types::PeakNs,
1430 /// Number of swap-in wait windows. Source: taskstats
1431 /// `swapin_count`. NOTE: overlaps with `thrashing_count` —
1432 /// every thrashing event is also a swapin event from the
1433 /// syscall layer; do not sum.
1434 pub swapin_delay_count: crate::metric_types::MonotonicCount,
1435 /// Cumulative ns waiting for swap-in to complete. Source:
1436 /// taskstats `swapin_delay_total`.
1437 pub swapin_delay_total_ns: crate::metric_types::MonotonicNs,
1438 /// Longest single swap-in wait, ns. Source: taskstats
1439 /// `swapin_delay_max`.
1440 pub swapin_delay_max_ns: crate::metric_types::PeakNs,
1441 /// Shortest non-zero swap-in wait, ns. Sentinel-0 caveat per
1442 /// `cpu_delay_min_ns`.
1443 pub swapin_delay_min_ns: crate::metric_types::PeakNs,
1444 /// Number of direct-reclaim (free-pages) wait windows. Source:
1445 /// taskstats `freepages_count`. Updates from
1446 /// `delayacct_freepages_start/end` (mm/page_alloc.c).
1447 pub freepages_delay_count: crate::metric_types::MonotonicCount,
1448 /// Cumulative ns waiting in direct memory reclaim. Source:
1449 /// taskstats `freepages_delay_total`.
1450 pub freepages_delay_total_ns: crate::metric_types::MonotonicNs,
1451 /// Longest single direct-reclaim wait, ns. Source: taskstats
1452 /// `freepages_delay_max`.
1453 pub freepages_delay_max_ns: crate::metric_types::PeakNs,
1454 /// Shortest non-zero direct-reclaim wait, ns. Sentinel-0 caveat
1455 /// per `cpu_delay_min_ns`.
1456 pub freepages_delay_min_ns: crate::metric_types::PeakNs,
1457 /// Number of thrashing wait windows. Source: taskstats
1458 /// `thrashing_count`. OVERLAPS with `swapin_*`: thrashing
1459 /// detection is a refinement of swapin tracking
1460 /// (mm/workingset.c).
1461 pub thrashing_delay_count: crate::metric_types::MonotonicCount,
1462 /// Cumulative ns waiting under thrashing pressure. Source:
1463 /// taskstats `thrashing_delay_total`.
1464 pub thrashing_delay_total_ns: crate::metric_types::MonotonicNs,
1465 /// Longest single thrashing wait, ns. Source: taskstats
1466 /// `thrashing_delay_max`.
1467 pub thrashing_delay_max_ns: crate::metric_types::PeakNs,
1468 /// Shortest non-zero thrashing wait, ns. Sentinel-0 caveat per
1469 /// `cpu_delay_min_ns`.
1470 pub thrashing_delay_min_ns: crate::metric_types::PeakNs,
1471 /// Number of memory-compaction wait windows. Source: taskstats
1472 /// `compact_count`. Updates from `delayacct_compact_start/end`
1473 /// (mm/compaction.c).
1474 pub compact_delay_count: crate::metric_types::MonotonicCount,
1475 /// Cumulative ns waiting on memory compaction. Source:
1476 /// taskstats `compact_delay_total`.
1477 pub compact_delay_total_ns: crate::metric_types::MonotonicNs,
1478 /// Longest single compaction wait, ns. Source: taskstats
1479 /// `compact_delay_max`.
1480 pub compact_delay_max_ns: crate::metric_types::PeakNs,
1481 /// Shortest non-zero compaction wait, ns. Sentinel-0 caveat
1482 /// per `cpu_delay_min_ns`.
1483 pub compact_delay_min_ns: crate::metric_types::PeakNs,
1484 /// Number of write-protect-copy (CoW) fault wait windows.
1485 /// Source: taskstats `wpcopy_count`. Updates from
1486 /// `delayacct_wpcopy_start/end` (mm/memory.c).
1487 pub wpcopy_delay_count: crate::metric_types::MonotonicCount,
1488 /// Cumulative ns waiting on write-protect-copy faults. Source:
1489 /// taskstats `wpcopy_delay_total`.
1490 pub wpcopy_delay_total_ns: crate::metric_types::MonotonicNs,
1491 /// Longest single wpcopy wait, ns. Source: taskstats
1492 /// `wpcopy_delay_max`.
1493 pub wpcopy_delay_max_ns: crate::metric_types::PeakNs,
1494 /// Shortest non-zero wpcopy wait, ns. Sentinel-0 caveat per
1495 /// `cpu_delay_min_ns`.
1496 pub wpcopy_delay_min_ns: crate::metric_types::PeakNs,
1497 /// Number of IRQ-handler windows the task delegated. Source:
1498 /// taskstats `irq_count`. Updates from `delayacct_irq` in
1499 /// `kernel/delayacct.c` — counts kernel-IRQ time charged to
1500 /// the task by the IRQ accounting subsystem.
1501 pub irq_delay_count: crate::metric_types::MonotonicCount,
1502 /// Cumulative ns of IRQ handling time charged to the task.
1503 /// Source: taskstats `irq_delay_total`.
1504 pub irq_delay_total_ns: crate::metric_types::MonotonicNs,
1505 /// Longest single IRQ-handler window, ns. Source: taskstats
1506 /// `irq_delay_max`.
1507 pub irq_delay_max_ns: crate::metric_types::PeakNs,
1508 /// Shortest non-zero IRQ-handler window, ns. Sentinel-0 caveat
1509 /// per `cpu_delay_min_ns`.
1510 pub irq_delay_min_ns: crate::metric_types::PeakNs,
1511 /// Lifetime high-watermark of resident-set size, bytes. Source:
1512 /// taskstats `hiwater_rss` (kB), converted at parse time via
1513 /// `saturating_mul(1024)`. Updates from `xacct_add_tsk` in
1514 /// `kernel/tsacct.c::xacct_add_tsk`. Distinct from
1515 /// `smaps_rollup_kib["Rss"]` which is the CURRENT RSS —
1516 /// this field is the lifetime peak.
1517 ///
1518 /// **Kernel threads read zero**: `xacct_add_tsk`
1519 /// (`kernel/tsacct.c`) calls `mm = get_task_mm(p)` and the
1520 /// hiwater assignments are guarded by
1521 /// `if (mm)`. Kernel threads (`PF_KTHREAD`, `tsk->mm == NULL`)
1522 /// skip the assignment entirely, so the field stays at the
1523 /// kernel-side zero default.
1524 ///
1525 /// **Sibling threads of the same tgid see the same value**:
1526 /// `get_mm_hiwater_rss(mm)` reads from the shared
1527 /// `mm_struct`, so every thread of a process reports the same
1528 /// hiwater value. The registry's `MaxPeakBytes` aggregation
1529 /// behaves as a per-process selector when buckets span
1530 /// multiple tgids: cross-tgid Max picks the largest
1531 /// per-process watermark in the bucket; intra-tgid Max is a
1532 /// no-op (every sibling reports the same number).
1533 pub hiwater_rss_bytes: crate::metric_types::PeakBytes,
1534 /// Lifetime high-watermark of virtual-memory size, bytes.
1535 /// Source: taskstats `hiwater_vm` (kB), converted at parse
1536 /// time. Same kernel write path as `hiwater_rss_bytes` —
1537 /// inherits the same kernel-thread zero and same sibling-tid
1538 /// shared-mm caveats; see [`Self::hiwater_rss_bytes`].
1539 pub hiwater_vm_bytes: crate::metric_types::PeakBytes,
1540 /// Whether this thread's taskstats genetlink query succeeded and populated
1541 /// the payload — `true` iff `apply_delay_stats` ran on an `Ok` query. This
1542 /// is the capture-mechanism flag for the WHOLE taskstats payload: one query
1543 /// (`fill_stats`) fills BOTH the delay-accounting family (cpu/blkio/... delay
1544 /// counters) AND the xacct memory watermarks (`hiwater_rss_bytes` /
1545 /// `hiwater_vm_bytes`) together, so they share this one flag. `false` when
1546 /// the query could not capture (CONFIG_TASKSTATS off, no CAP_NET_ADMIN, or
1547 /// the query raced task exit), leaving the absent-counter zero defaults. The
1548 /// group aggregation reads this to distinguish a captured (measured) zero
1549 /// from a never-captured payload — without it both read as a sentinel `0`
1550 /// and a derived metric like `total_offcpu_delay_ns` renders "0" instead of
1551 /// "-". A whole group with no captured thread aggregates to
1552 /// [`crate::ctprof_compare::Aggregated::Absent`].
1553 ///
1554 /// QUERY-level: `true` means THIS thread's taskstats query succeeded. Per
1555 /// sub-family ENABLEMENT is carried separately by [`Self::cpu_delay_active`] /
1556 /// [`Self::delay_block_active`] / [`Self::xacct_active`] (baked at capture from
1557 /// the host `/proc/sys/kernel/task_delayacct` + `/proc/config.gz` probes). The
1558 /// group measured predicate ANDs this query-Ok flag with the relevant
1559 /// sub-family active flag, so a sub-family disabled while the query still
1560 /// succeeds (`CONFIG_TASK_XACCT` off, or the `kernel.task_delayacct` sysctl
1561 /// off, with the other family on) now renders "-" not "0". On ktstr's own
1562 /// kernel (all configs `=y`, delayacct booted on) every sub-family is active,
1563 /// so the gating is an in-VM no-op; it only changes host-facing `ctprof
1564 /// capture` against single-family kernels.
1565 pub taskstats_measured: bool,
1566 /// Host-wide enablement of the `cpu_delay_*` sub-family (sched_info-sourced,
1567 /// filled unconditionally by `delayacct_add_tsk`): CONFIG_TASK_DELAY_ACCT is
1568 /// built in — survives the runtime `task_delayacct` toggle. Baked at capture
1569 /// from `host_context::probe_taskstats_active`; AND-ed with
1570 /// [`Self::taskstats_measured`] by the group measured predicate. No
1571 /// `serde(default)` (matching the sibling capture flags): a sidecar predating
1572 /// this field fails to deserialize and is regenerated by re-running, per the
1573 /// disposable-sidecar policy.
1574 pub cpu_delay_active: bool,
1575 /// Host-wide enablement of the delayacct resource-wait sub-family (`blkio` /
1576 /// `swapin` / `freepages` / `thrashing` / `compact` / `wpcopy` / `irq`): the
1577 /// runtime `task_delayacct` toggle is ON (these are gated by `tsk->delays`,
1578 /// allocated at fork only when on). Baked at capture; AND-ed with
1579 /// [`Self::taskstats_measured`].
1580 pub delay_block_active: bool,
1581 /// Host-wide enablement of the xacct watermark sub-family
1582 /// (`hiwater_rss_bytes`, `hiwater_vm_bytes`): CONFIG_TASK_XACCT is built in (no
1583 /// runtime toggle); an unknown host config (`/proc/config.gz` not exposed) is
1584 /// treated as active to avoid a false absent. Baked at capture; AND-ed with
1585 /// [`Self::taskstats_measured`].
1586 pub xacct_active: bool,
1587 /// Whether this thread's jemalloc `allocated_bytes` / `deallocated_bytes`
1588 /// were captured from a successful per-thread TSD probe read, versus left at
1589 /// the absent-as-`0` default (process not jemalloc-linked, the probe could
1590 /// not attach, or the per-thread read failed). Set from the per-thread read
1591 /// outcome — NOT the per-tgid attach — mirroring `taskstats_measured`'s
1592 /// per-thread Ok-gating: a failed read is not a measurement. Same
1593 /// measured-vs-zero discipline as [`Self::taskstats_measured`], for the
1594 /// `live_heap_estimate` derived metric.
1595 pub jemalloc_measured: bool,
1596}
1597
1598impl Default for ThreadState {
1599 /// Zero-valued sentinel — tid=0/tgid=0/empty strings are the
1600 /// "no thread observed yet" placeholder that ctprof inserts
1601 /// into HashMap entries before the /proc walk populates them
1602 /// from the live kernel state. Default-constructed ThreadState
1603 /// values are NOT visible to operator-facing output: the
1604 /// capture path in `capture_thread_at_with_tally`
1605 /// (which delegates to the per-file `/proc` read helpers in
1606 /// `parse`) overwrites each field from
1607 /// `/proc/<pid>/task/<tid>/{stat,status,schedstat,cgroup}` before
1608 /// the entry is read for rendering. The `state` char uses the
1609 /// `'~'` absent-value sentinel rather than the bare `char`
1610 /// Default `'\0'` because '\0' would print as an empty cell in
1611 /// the ctprof table and the absent-value glyph is operator-
1612 /// readable.
1613 fn default() -> Self {
1614 Self {
1615 tid: 0,
1616 tgid: 0,
1617 pcomm: String::new(),
1618 comm: String::new(),
1619 cgroup: String::new(),
1620 start_time_clock_ticks: 0,
1621 policy: Default::default(),
1622 nice: crate::metric_types::OrdinalI32(0),
1623 cpu_affinity: Default::default(),
1624 processor: Default::default(),
1625 // `'~'` (the absent-value sentinel) instead of the
1626 // bare `char` Default `'\0'`; see [`Self::state`].
1627 state: default_state_char(),
1628 ext_enabled: false,
1629 run_time_ns: Default::default(),
1630 wait_time_ns: Default::default(),
1631 timeslices: Default::default(),
1632 voluntary_csw: Default::default(),
1633 nonvoluntary_csw: Default::default(),
1634 nr_wakeups: Default::default(),
1635 nr_wakeups_local: Default::default(),
1636 nr_wakeups_remote: Default::default(),
1637 nr_wakeups_sync: Default::default(),
1638 nr_wakeups_migrate: Default::default(),
1639 nr_wakeups_affine: Default::default(),
1640 nr_wakeups_affine_attempts: Default::default(),
1641 nr_migrations: Default::default(),
1642 nr_forced_migrations: Default::default(),
1643 nr_failed_migrations_affine: Default::default(),
1644 nr_failed_migrations_running: Default::default(),
1645 nr_failed_migrations_hot: Default::default(),
1646 wait_sum: Default::default(),
1647 wait_count: Default::default(),
1648 wait_max: Default::default(),
1649 voluntary_sleep_ns: Default::default(),
1650 sleep_max: Default::default(),
1651 block_sum: Default::default(),
1652 block_max: Default::default(),
1653 iowait_sum: Default::default(),
1654 iowait_count: Default::default(),
1655 exec_max: Default::default(),
1656 slice_max: Default::default(),
1657 allocated_bytes: Default::default(),
1658 deallocated_bytes: Default::default(),
1659 minflt: Default::default(),
1660 majflt: Default::default(),
1661 utime_clock_ticks: Default::default(),
1662 stime_clock_ticks: Default::default(),
1663 priority: Default::default(),
1664 rt_priority: Default::default(),
1665 core_forceidle_sum: Default::default(),
1666 fair_slice_ns: Default::default(),
1667 nr_threads: Default::default(),
1668 smaps_rollup_kib: BTreeMap::new(),
1669 rchar: Default::default(),
1670 wchar: Default::default(),
1671 syscr: Default::default(),
1672 syscw: Default::default(),
1673 read_bytes: Default::default(),
1674 write_bytes: Default::default(),
1675 cancelled_write_bytes: Default::default(),
1676 cpu_delay_count: Default::default(),
1677 cpu_delay_total_ns: Default::default(),
1678 cpu_delay_max_ns: Default::default(),
1679 cpu_delay_min_ns: Default::default(),
1680 blkio_delay_count: Default::default(),
1681 blkio_delay_total_ns: Default::default(),
1682 blkio_delay_max_ns: Default::default(),
1683 blkio_delay_min_ns: Default::default(),
1684 swapin_delay_count: Default::default(),
1685 swapin_delay_total_ns: Default::default(),
1686 swapin_delay_max_ns: Default::default(),
1687 swapin_delay_min_ns: Default::default(),
1688 freepages_delay_count: Default::default(),
1689 freepages_delay_total_ns: Default::default(),
1690 freepages_delay_max_ns: Default::default(),
1691 freepages_delay_min_ns: Default::default(),
1692 thrashing_delay_count: Default::default(),
1693 thrashing_delay_total_ns: Default::default(),
1694 thrashing_delay_max_ns: Default::default(),
1695 thrashing_delay_min_ns: Default::default(),
1696 compact_delay_count: Default::default(),
1697 compact_delay_total_ns: Default::default(),
1698 compact_delay_max_ns: Default::default(),
1699 compact_delay_min_ns: Default::default(),
1700 wpcopy_delay_count: Default::default(),
1701 wpcopy_delay_total_ns: Default::default(),
1702 wpcopy_delay_max_ns: Default::default(),
1703 wpcopy_delay_min_ns: Default::default(),
1704 irq_delay_count: Default::default(),
1705 irq_delay_total_ns: Default::default(),
1706 irq_delay_max_ns: Default::default(),
1707 irq_delay_min_ns: Default::default(),
1708 hiwater_rss_bytes: Default::default(),
1709 hiwater_vm_bytes: Default::default(),
1710 // Absent until a successful capture sets them (apply_delay_stats /
1711 // the jemalloc probe assignment); a default ThreadState is the
1712 // not-yet-captured placeholder.
1713 taskstats_measured: false,
1714 cpu_delay_active: false,
1715 delay_block_active: false,
1716 xacct_active: false,
1717 jemalloc_measured: false,
1718 }
1719 }
1720}
1721
1722impl ThreadState {
1723 /// Overwrite the taskstats-sourced delay-accounting fields
1724 /// from a `DelayStats` payload. Called by `capture_with` /
1725 /// `capture_pid_with` after a successful per-tid
1726 /// [`crate::taskstats::TaskstatsClient::query_tid`] call;
1727 /// query failures leave the fields at the absent-counter
1728 /// default of zero installed in `capture_thread_at_with_tally`.
1729 pub(crate) fn apply_delay_stats(
1730 &mut self,
1731 ds: &crate::taskstats::DelayStats,
1732 active: crate::host_context::TaskstatsActive,
1733 ) {
1734 use crate::metric_types::{MonotonicCount, MonotonicNs, PeakBytes, PeakNs};
1735 self.cpu_delay_count = MonotonicCount(ds.cpu_count);
1736 self.cpu_delay_total_ns = MonotonicNs(ds.cpu_delay_total_ns);
1737 self.cpu_delay_max_ns = PeakNs(ds.cpu_delay_max_ns);
1738 self.cpu_delay_min_ns = PeakNs(ds.cpu_delay_min_ns);
1739 self.blkio_delay_count = MonotonicCount(ds.blkio_count);
1740 self.blkio_delay_total_ns = MonotonicNs(ds.blkio_delay_total_ns);
1741 self.blkio_delay_max_ns = PeakNs(ds.blkio_delay_max_ns);
1742 self.blkio_delay_min_ns = PeakNs(ds.blkio_delay_min_ns);
1743 self.swapin_delay_count = MonotonicCount(ds.swapin_count);
1744 self.swapin_delay_total_ns = MonotonicNs(ds.swapin_delay_total_ns);
1745 self.swapin_delay_max_ns = PeakNs(ds.swapin_delay_max_ns);
1746 self.swapin_delay_min_ns = PeakNs(ds.swapin_delay_min_ns);
1747 self.freepages_delay_count = MonotonicCount(ds.freepages_count);
1748 self.freepages_delay_total_ns = MonotonicNs(ds.freepages_delay_total_ns);
1749 self.freepages_delay_max_ns = PeakNs(ds.freepages_delay_max_ns);
1750 self.freepages_delay_min_ns = PeakNs(ds.freepages_delay_min_ns);
1751 self.thrashing_delay_count = MonotonicCount(ds.thrashing_count);
1752 self.thrashing_delay_total_ns = MonotonicNs(ds.thrashing_delay_total_ns);
1753 self.thrashing_delay_max_ns = PeakNs(ds.thrashing_delay_max_ns);
1754 self.thrashing_delay_min_ns = PeakNs(ds.thrashing_delay_min_ns);
1755 self.compact_delay_count = MonotonicCount(ds.compact_count);
1756 self.compact_delay_total_ns = MonotonicNs(ds.compact_delay_total_ns);
1757 self.compact_delay_max_ns = PeakNs(ds.compact_delay_max_ns);
1758 self.compact_delay_min_ns = PeakNs(ds.compact_delay_min_ns);
1759 self.wpcopy_delay_count = MonotonicCount(ds.wpcopy_count);
1760 self.wpcopy_delay_total_ns = MonotonicNs(ds.wpcopy_delay_total_ns);
1761 self.wpcopy_delay_max_ns = PeakNs(ds.wpcopy_delay_max_ns);
1762 self.wpcopy_delay_min_ns = PeakNs(ds.wpcopy_delay_min_ns);
1763 self.irq_delay_count = MonotonicCount(ds.irq_count);
1764 self.irq_delay_total_ns = MonotonicNs(ds.irq_delay_total_ns);
1765 self.irq_delay_max_ns = PeakNs(ds.irq_delay_max_ns);
1766 self.irq_delay_min_ns = PeakNs(ds.irq_delay_min_ns);
1767 self.hiwater_rss_bytes = PeakBytes(ds.hiwater_rss_bytes);
1768 self.hiwater_vm_bytes = PeakBytes(ds.hiwater_vm_bytes);
1769 // The only site that overwrites the absent-counter zero defaults from a
1770 // real taskstats payload: mark the payload measured (query Ok) so the
1771 // group aggregation (`ctprof_compare::groups::measured_predicate`) can
1772 // distinguish a measured zero from a never-captured payload
1773 // (CONFIG_TASKSTATS off / no CAP_NET_ADMIN / query raced exit), which
1774 // otherwise both read as 0. The per-sub-family enablement (host-global,
1775 // probed once per snapshot) is baked in alongside; the predicate ANDs the
1776 // query-Ok flag with the matching sub-family flag, so a sub-family
1777 // disabled while the query succeeds renders "-" not "0" (see the field
1778 // docs).
1779 self.taskstats_measured = true;
1780 self.cpu_delay_active = active.cpu_delay;
1781 self.delay_block_active = active.delay_block;
1782 self.xacct_active = active.xacct;
1783 }
1784
1785 /// Iterate over [`Self::smaps_rollup_kib`] with values
1786 /// converted from kilobytes to bytes via `saturating_mul(1024)`.
1787 /// The kernel emits smaps_rollup values in kB; the
1788 /// project's display layer auto-scales bytes via the
1789 /// existing "B" → KiB → MiB → GiB ladder, so a single
1790 /// helper centralizes the unit conversion at every render
1791 /// site (write_show + write_diff). Saturating multiply
1792 /// guards against pathological input from a malformed
1793 /// snapshot file. Wrapped in
1794 /// [`crate::metric_types::Bytes`] so the byte-typed value
1795 /// flows through the same auto-scale path as the rest of
1796 /// the byte-tagged registry metrics.
1797 pub fn smaps_rollup_bytes(
1798 &self,
1799 ) -> impl Iterator<Item = (&String, crate::metric_types::Bytes)> {
1800 self.smaps_rollup_kib
1801 .iter()
1802 .map(|(k, v)| (k, crate::metric_types::Bytes(v.saturating_mul(1024))))
1803 }
1804}
1805
1806/// Per-cgroup enrichment record attached to [`CtprofSnapshot`].
1807///
1808/// Populated from the cgroup v2 filesystem at capture time. The
1809/// shape mirrors the kernel's per-controller file layout:
1810/// [`CgroupCpuStats`] holds the `cpu.*` files,
1811/// [`CgroupMemoryStats`] holds the `memory.*` files,
1812/// [`CgroupPidsStats`] holds the `pids.*` files, and [`Psi`]
1813/// holds the `<resource>.pressure` files. These are
1814/// aggregate-over-the-cgroup values — NOT summable from
1815/// per-thread data — so the capture layer reads them directly
1816/// from cgroupfs rather than deriving.
1817///
1818/// Nested-struct shape (rather than a flat ~50-field struct)
1819/// mirrors the kernel's controller-by-controller exposure: a
1820/// reader who knows the kernel layout can map directly between
1821/// cgroupfs files and Rust fields, and the merge policy in
1822/// [`crate::ctprof_compare::flatten_cgroup_stats`] applies
1823/// per-domain (max for limits, min for floors, saturating_add
1824/// for counters) without conflating across domains.
1825///
1826/// Schema note: the previous flat shape (4 fields:
1827/// `cpu_usage_usec`, `nr_throttled`, `throttled_usec`,
1828/// `memory_current`) is gone. Snapshots written by older
1829/// versions deserialize via serde's defaulting — old fields
1830/// land on the new nested fields' zero defaults rather than
1831/// migrating, so a baseline-vs-candidate compare against an
1832/// old snapshot produces "every counter went from N to 0".
1833/// Re-capture both sides with the current build to compare
1834/// faithfully. Per the project's pre-1.0 disposable-sidecar
1835/// policy this is intentional.
1836#[derive(Debug, Clone, Default, serde::Serialize, serde::Deserialize)]
1837#[non_exhaustive]
1838pub struct CgroupStats {
1839 pub cpu: CgroupCpuStats,
1840 pub memory: CgroupMemoryStats,
1841 pub pids: CgroupPidsStats,
1842 /// Pressure Stall Information for this cgroup, per resource.
1843 /// Populated from `<cgroup>/cpu.pressure`,
1844 /// `<cgroup>/memory.pressure`, `<cgroup>/io.pressure`, and
1845 /// `<cgroup>/irq.pressure` (cgroup v2 files declared in
1846 /// `cgroup_psi_files[]` (`kernel/cgroup/cgroup.c`)). Defaults to all-zero
1847 /// when the kernel has CONFIG_PSI off, when PSI is disabled
1848 /// at runtime via the `psi=0` boot param, or when individual
1849 /// resource files are absent (older kernels missing
1850 /// irq.pressure).
1851 pub psi: Psi,
1852}
1853
1854/// CPU controller state for one cgroup. Fields mirror the
1855/// `cpu.*` cgroup v2 files exposed under
1856/// `<cgroup>/cpu.stat`, `<cgroup>/cpu.max`,
1857/// `<cgroup>/cpu.weight`, and `<cgroup>/cpu.weight.nice`.
1858#[derive(Debug, Clone, Default, serde::Serialize, serde::Deserialize)]
1859#[non_exhaustive]
1860pub struct CgroupCpuStats {
1861 /// `usage_usec` from `cpu.stat`. Cumulative CPU time consumed
1862 /// by tasks in this cgroup, in microseconds.
1863 pub usage_usec: u64,
1864 /// `nr_throttled` from `cpu.stat`. Cumulative count of
1865 /// CFS-bandwidth throttling events that paused this cgroup.
1866 pub nr_throttled: u64,
1867 /// `throttled_usec` from `cpu.stat`. Cumulative wall-clock
1868 /// time the cgroup spent throttled by CFS bandwidth.
1869 pub throttled_usec: u64,
1870 /// `cpu.max` quota in microseconds. `None` when the file is
1871 /// absent (root cgroup) OR when the kernel emits the literal
1872 /// "max" token (no CFS bandwidth cap configured for this
1873 /// cgroup).
1874 pub max_quota_us: Option<u64>,
1875 /// `cpu.max` period in microseconds. Default 100_000 (100ms)
1876 /// per the kernel default. Always present alongside the
1877 /// quota half on a child cgroup; defaults to 100_000 when
1878 /// the file is absent (root cgroup).
1879 pub max_period_us: u64,
1880 /// `cpu.weight` (1..=10_000, default 100). `None` when the
1881 /// file is absent (root cgroup); the kernel does not allow
1882 /// 0 as a value, so the absent-vs-zero distinction is
1883 /// load-bearing.
1884 pub weight: Option<u64>,
1885 /// `cpu.weight.nice` (-20..=19, default 0). `None` when the
1886 /// file is absent. Alias-domain for [`Self::weight`] —
1887 /// the kernel writes both files in lockstep but they're
1888 /// captured independently to surface any
1889 /// kernel-version-specific divergence.
1890 pub weight_nice: Option<i32>,
1891}
1892
1893/// Memory controller state for one cgroup. Fields mirror the
1894/// `memory.*` cgroup v2 files. `stat` and `events` are
1895/// captured as flat key-value maps so the data model
1896/// auto-extends when the kernel adds new keys (memory.stat
1897/// has 71 keys on a recent kernel; the explicit list is
1898/// scheduler-correctness-relevant but the map preserves
1899/// regression-detection on lesser-known counters).
1900#[derive(Debug, Clone, Default, serde::Serialize, serde::Deserialize)]
1901#[non_exhaustive]
1902pub struct CgroupMemoryStats {
1903 /// `memory.current`, instantaneous RSS of the cgroup in
1904 /// bytes.
1905 pub current: u64,
1906 /// `memory.max`, hard memory limit in bytes. `None` when
1907 /// the file is absent (root cgroup) OR when the kernel
1908 /// emits the literal "max" token (no hard cap).
1909 pub max: Option<u64>,
1910 /// `memory.high`, soft pressure limit in bytes. `None` when
1911 /// absent or unlimited (same "max"-token semantics as
1912 /// [`Self::max`]).
1913 pub high: Option<u64>,
1914 /// `memory.low`, best-effort protection floor in bytes.
1915 /// `None` when the file is absent (no protection
1916 /// configured); `Some(u64::MAX)` when the kernel emits the
1917 /// literal `max` token (request maximum protection — every
1918 /// byte under the cgroup is protected). Per the kernel's
1919 /// cgroup v2 docs, memory under `low` is protected from
1920 /// reclaim unless no unprotected memory remains. Note the
1921 /// asymmetry vs. limits: `None` means "no floor" (semantic
1922 /// opposite of "max"-as-no-cap on the limit fields above).
1923 pub low: Option<u64>,
1924 /// `memory.min`, hard protection floor in bytes. `None`
1925 /// when absent (no floor). `Some(u64::MAX)` when the kernel
1926 /// emits `max` (full protection). Stronger than `low` —
1927 /// memory under `min` is never reclaimed even under
1928 /// memory pressure.
1929 pub min: Option<u64>,
1930 /// `memory.stat` parsed as a key-value map. Keys mirror the
1931 /// kernel-emitted strings (e.g. `anon`, `file`,
1932 /// `workingset_refault_anon`, `pgfault`, `pgmajfault`,
1933 /// `slab`, the active/inactive variants, etc.). Empty when
1934 /// the file is absent.
1935 pub stat: BTreeMap<String, u64>,
1936 /// `memory.events` parsed as a key-value map. Typical keys:
1937 /// `low`, `high`, `max`, `oom`, `oom_kill`,
1938 /// `oom_group_kill`, `sock_throttled` (subset varies by
1939 /// kernel version). Empty when the file is absent.
1940 pub events: BTreeMap<String, u64>,
1941}
1942
1943/// PIDs controller state for one cgroup. Fields mirror the
1944/// `pids.*` cgroup v2 files. The pids controller is optional
1945/// (must be enabled in `cgroup.subtree_control`); on hosts that
1946/// don't enable it, both fields are `None`.
1947#[derive(Debug, Clone, Default, serde::Serialize, serde::Deserialize)]
1948#[non_exhaustive]
1949pub struct CgroupPidsStats {
1950 /// `pids.current`, current task count in this cgroup.
1951 /// `None` when the file is absent (pids controller not
1952 /// enabled).
1953 pub current: Option<u64>,
1954 /// `pids.max`, hard task-count limit. `None` when the file
1955 /// is absent OR when the kernel emits the literal "max"
1956 /// token (no cap).
1957 pub max: Option<u64>,
1958}
1959
1960/// One Pressure Stall Information half-line: either the `some`
1961/// or `full` row for one resource. Mirrors the kernel emission
1962/// format `%s avg10=%lu.%02lu avg60=%lu.%02lu avg300=%lu.%02lu total=%llu`
1963/// in `psi_show()` (`kernel/sched/psi.c`).
1964///
1965/// `avg10/60/300` are stored as **centi-percent** (lossless
1966/// fixed-point) — the kernel writes `LOAD_INT(avg).LOAD_FRAC(avg)`
1967/// as a 2-decimal-digit percentage in `psi_show()`. The integer
1968/// expansion is `int * 100 + frac`, giving a numerical range of
1969/// `0..=10099`. The upper bound is `100.99` (not `100.00`)
1970/// because the kernel's EWMA helper `calc_load()`
1971/// (`include/linux/sched/loadavg.h`) rounds via `newload +=
1972/// FIXED_1 - 1` before the final `>> FSHIFT`, so a fully-loaded
1973/// group can land just over `100.0` for one sample. This avoids
1974/// serde JSON float-roundtrip drift that would manifest as
1975/// spurious non-zero deltas in compare output.
1976///
1977/// `total_usec` is microseconds (kernel
1978/// `div_u64(total_ns, NSEC_PER_USEC)` in `psi_show()`). Same unit
1979/// as [`CgroupCpuStats::usage_usec`], so the existing
1980/// auto_scale "µs" ladder applies.
1981///
1982/// "some" semantics: at least one task is stalled on this
1983/// resource. "full" semantics: every runnable task is stalled.
1984/// At the SYSTEM level (`/proc/pressure/cpu`), `cpu.full` is
1985/// always zero by kernel design — the explicit gate
1986/// `if (!(group == &psi_system && res == PSI_CPU && full))` in
1987/// `psi_show()` (`kernel/sched/psi.c`) skips the avg/total
1988/// computation, but the `seq_printf` in `psi_show()` still emits
1989/// the structurally-present line. Per-cgroup `cpu.full` (under
1990/// `<cgroup>/cpu.pressure`) IS meaningful and computed
1991/// normally. `irq` is full-only (kernel `only_full = res == PSI_IRQ`
1992/// in `psi_show()`), so [`PsiResource::some`] for irq always reads
1993/// zero.
1994#[derive(Debug, Clone, Copy, Default, serde::Serialize, serde::Deserialize)]
1995#[non_exhaustive]
1996pub struct PsiHalf {
1997 /// 10-second running average of pressure %, scaled by 100
1998 /// (so 0..=10099 covers 0.00..=100.99 — see the EWMA-rounding
1999 /// note on the struct doc).
2000 pub avg10: u16,
2001 /// 60-second running average of pressure %, same scaling.
2002 pub avg60: u16,
2003 /// 300-second running average of pressure %, same scaling.
2004 pub avg300: u16,
2005 /// Cumulative total stalled time in microseconds.
2006 pub total_usec: u64,
2007}
2008
2009impl PsiHalf {
2010 /// Convert the centi-percent `avg10` value to a percentage
2011 /// `f64`. Returns `0.0..=100.99` per the kernel's EWMA
2012 /// rounding (see struct-level doc).
2013 pub fn avg10_percent(&self) -> f64 {
2014 self.avg10 as f64 / 100.0
2015 }
2016
2017 /// Convert the centi-percent `avg60` value to a percentage
2018 /// `f64`. Same range as [`Self::avg10_percent`].
2019 pub fn avg60_percent(&self) -> f64 {
2020 self.avg60 as f64 / 100.0
2021 }
2022
2023 /// Convert the centi-percent `avg300` value to a percentage
2024 /// `f64`. Same range as [`Self::avg10_percent`].
2025 pub fn avg300_percent(&self) -> f64 {
2026 self.avg300 as f64 / 100.0
2027 }
2028}
2029
2030/// Pressure Stall Information for one resource (cpu / memory /
2031/// io / irq), bundling the `some` and `full` halves.
2032#[derive(Debug, Clone, Copy, Default, serde::Serialize, serde::Deserialize)]
2033#[non_exhaustive]
2034pub struct PsiResource {
2035 pub some: PsiHalf,
2036 pub full: PsiHalf,
2037}
2038
2039/// Bundle of [`PsiResource`] for the four kernel-exposed
2040/// resources. Same shape used at both system level
2041/// ([`CtprofSnapshot::psi`]) and per-cgroup
2042/// ([`CgroupStats::psi`]) — the data source differs but the
2043/// kernel emits the same format and field set in both places.
2044#[derive(Debug, Clone, Copy, Default, serde::Serialize, serde::Deserialize)]
2045#[non_exhaustive]
2046pub struct Psi {
2047 pub cpu: PsiResource,
2048 pub memory: PsiResource,
2049 pub io: PsiResource,
2050 /// IRQ pressure. Only the `full` half is populated by the
2051 /// kernel (`psi_show()` sets `only_full = res == PSI_IRQ`);
2052 /// `irq.some` is structurally present but always zero.
2053 /// Requires both `CONFIG_IRQ_TIME_ACCOUNTING` at build AND
2054 /// `irqtime_enabled()` at runtime (`/proc/pressure/irq` returns
2055 /// `-EOPNOTSUPP` per `psi_show()` (`kernel/sched/psi.c`) otherwise);
2056 /// runtime irqtime is gated by the `tsc=...` boot param /
2057 /// `irqtime_enabled` static branch — when off, the file open
2058 /// fails and the parser leaves this resource at the default
2059 /// all-zero value.
2060 pub irq: PsiResource,
2061}
2062
2063/// Global sched_ext sysfs state, captured from
2064/// `/sys/kernel/sched_ext/`. The kernel registers exactly five
2065/// global attributes via `scx_global_attrs[]`
2066/// (`kernel/sched/ext.c`); this struct mirrors them
2067/// 1-to-1.
2068///
2069/// Per-scheduler attrs (`/sys/kernel/sched_ext/root/...`) are
2070/// out of scope: those are scheduler-specific internals
2071/// (queued/dispatched/ops-name) that come and go as schedulers
2072/// load and unload, and answer different questions than the
2073/// global counters here.
2074#[derive(Debug, Clone, Default, serde::Serialize, serde::Deserialize)]
2075#[non_exhaustive]
2076pub struct SchedExtSysfs {
2077 /// `state` — sched_ext class enable state. One of
2078 /// `enabling`, `enabled`, `disabling`, `disabled` per
2079 /// `scx_enable_state_str[]`
2080 /// (`kernel/sched/ext_internal.h`). Emitted by
2081 /// `scx_attr_state_show()`
2082 /// (`kernel/sched/ext.c`). Defaults to empty string
2083 /// when the file is unreadable; `disabled` when no scx
2084 /// scheduler is currently loaded. The "is sched_ext active
2085 /// during this capture?" answer.
2086 pub state: String,
2087
2088 /// `switch_all` — boolean (rendered as 0/1) indicating
2089 /// whether ALL scheduling classes have been switched to
2090 /// scx (vs. only those tasks the BPF scheduler claims via
2091 /// the per-task selection path). Emitted by
2092 /// `scx_attr_switch_all_show()`
2093 /// (`kernel/sched/ext.c`) via
2094 /// `READ_ONCE(scx_switching_all)`.
2095 pub switch_all: u64,
2096
2097 /// `nr_rejected` — count of tasks rejected from
2098 /// SCHED_EXT during init when `ops.init_task()` set
2099 /// `p->disallow`. Increment in `__scx_init_task()`
2100 /// (`kernel/sched/ext.c`): when a task entering
2101 /// SCHED_EXT has its policy reverted to SCHED_NORMAL
2102 /// because the BPF scheduler asked the kernel to disallow
2103 /// it, `atomic_long_inc(&scx_nr_rejected)` fires.
2104 /// `atomic_long_read(&scx_nr_rejected)` is emitted by
2105 /// `scx_attr_nr_rejected_show()`
2106 /// (`kernel/sched/ext.c`).
2107 ///
2108 /// Resets to 0 on every scheduler load: `scx_root_enable_workfn()`
2109 /// (`kernel/sched/ext.c`) does
2110 /// `atomic_long_set(&scx_nr_rejected, 0)` before bringing
2111 /// the new scheduler online. To detect a reload-driven
2112 /// reset rather than a genuine cumulative drop, pair the
2113 /// nr_rejected delta with [`Self::enable_seq`] — any
2114 /// enable_seq movement across two snapshots invalidates
2115 /// nr_rejected as a monotonic counter.
2116 ///
2117 /// Does NOT count runtime dispatch errors. The "did the
2118 /// scheduler reject a dispatch operation at runtime?"
2119 /// question is answered by per-scheduler debug data
2120 /// (`/sys/kernel/sched_ext/root/...`), out of scope for
2121 /// this global-attrs struct.
2122 pub nr_rejected: u64,
2123
2124 /// `hotplug_seq` — per-CPU-hotplug-event sequence counter.
2125 /// Atomic long incremented every time the kernel observes a
2126 /// hotplug transition. Emitted by
2127 /// `scx_attr_hotplug_seq_show()`
2128 /// (`kernel/sched/ext.c`). Comparing two snapshots:
2129 /// any delta indicates that a CPU online/offline event
2130 /// happened during the interval, which can confound
2131 /// per-CPU statistics.
2132 pub hotplug_seq: u64,
2133
2134 /// `enable_seq` — per-scheduler-load sequence counter.
2135 /// Atomic long incremented in `scx_root_enable_workfn()`
2136 /// (`kernel/sched/ext.c`, `atomic_long_inc(&scx_enable_seq)`)
2137 /// each time a scx scheduler is enabled. Comparing two
2138 /// snapshots: any delta indicates a scheduler reload
2139 /// happened during the interval — counter resets on the
2140 /// scx side will surface here even if the per-thread data
2141 /// looks continuous.
2142 pub enable_seq: u64,
2143}
2144
2145// Parsers + tallying readers live in `parse.rs` to keep this
2146// production file under the per-file line budget. The `use
2147// parse::*;` glob keeps existing call sites unchanged.
2148mod parse;
2149use parse::*;
2150
2151/// Capture one thread's procfs-derived profile under an arbitrary
2152/// procfs root. Each procfs reader returns `Option`; the assembled
2153/// [`ThreadState`] coerces `None` to the field's default per the
2154/// module-level capture contract. The jemalloc per-thread TSD
2155/// counters (`allocated_bytes` / `deallocated_bytes`) are NOT
2156/// populated by this function — they require a tgid-scoped probe
2157/// attach that the caller owns ([`capture_with`] /
2158/// [`capture_pid_with`] do this and write the counters directly
2159/// onto the returned `ThreadState`). On the returned struct, both
2160/// fields therefore land at the absent-counter default of zero
2161/// unless the caller overwrites them.
2162///
2163/// `comm` is the thread name the caller has already read from
2164/// `<proc_root>/<tgid>/task/<tid>/comm` (typically via
2165/// [`read_thread_comm_at`]). Passing it in — symmetric with the
2166/// pre-existing `pcomm` parameter — lets the caller share one
2167/// procfs read with the per-tid probe-recording path
2168/// (`probe_thread_recording`), which needs the thread name for
2169/// tracing on probe failures: a hot loop that re-reads the file
2170/// inside this fn would double the comm syscalls per tid on hosts
2171/// with thousands of threads.
2172///
2173/// Pass empty string for the absent-comm default; the ghost
2174/// filter in [`capture_with`] / [`capture_pid_with`] keys on
2175/// `ThreadState::comm.is_empty()` to drop a tid that exited
2176/// between `iter_task_ids_at` and this call, so an empty `comm`
2177/// is the correct shape for that path.
2178///
2179/// `use_syscall_affinity` gates the `sched_getaffinity(2)` path —
2180/// tests staging a synthetic `/proc` pass `false` so the syscall
2181/// does not read the REAL affinity of the test process; production
2182/// passes `true` and falls back to `Cpus_allowed_list:` when the
2183/// syscall returns EPERM.
2184#[cfg(test)]
2185fn capture_thread_at(
2186 proc_root: &Path,
2187 tgid: i32,
2188 tid: i32,
2189 pcomm: &str,
2190 comm: &str,
2191 use_syscall_affinity: bool,
2192) -> ThreadState {
2193 capture_thread_at_with_tally(
2194 proc_root,
2195 tgid,
2196 tid,
2197 pcomm,
2198 comm,
2199 use_syscall_affinity,
2200 &mut None,
2201 )
2202}
2203
2204/// Per-tid procfs walk. Threads a `&mut ParseTally` through every
2205/// per-file reader so per-tid read failures land in the
2206/// per-snapshot [`CtprofParseSummary`] when the capture
2207/// pipeline runs in production mode (`use_syscall_affinity=true`).
2208/// Synthetic-tree tests typically pass `&mut None` for the tally,
2209/// matching the pre-tally shape.
2210fn capture_thread_at_with_tally(
2211 proc_root: &Path,
2212 tgid: i32,
2213 tid: i32,
2214 pcomm: &str,
2215 comm: &str,
2216 use_syscall_affinity: bool,
2217 tally: &mut Option<&mut ParseTally>,
2218) -> ThreadState {
2219 let cgroup = read_cgroup_at_with_tally(proc_root, tgid, tid, tally).unwrap_or_default();
2220 let stat = read_stat_at_with_tally(proc_root, tgid, tid, tally);
2221 let (run_time_ns, wait_time_ns, timeslices) =
2222 read_schedstat_at_with_tally(proc_root, tgid, tid, tally);
2223 let io = read_io_at_with_tally(proc_root, tgid, tid, tally);
2224 let status = read_status_at_with_tally(proc_root, tgid, tid, tally);
2225 let sched = read_sched_at_with_tally(proc_root, tgid, tid, tally);
2226 let smaps_rollup_kib = read_smaps_rollup_at_with_tally(proc_root, tgid, tid, tally);
2227 let cpu_affinity = if use_syscall_affinity {
2228 crate::cpu_util::read_affinity(tid)
2229 .or(status.cpus_allowed)
2230 .unwrap_or_default()
2231 } else {
2232 status.cpus_allowed.unwrap_or_default()
2233 };
2234 use crate::metric_types::{
2235 Bytes, CategoricalString, ClockTicks, CpuSet, GaugeCount, GaugeNs, MonotonicCount,
2236 MonotonicNs, OrdinalI32, OrdinalU32, PeakNs,
2237 };
2238 ThreadState {
2239 tid: tid as u32,
2240 tgid: tgid as u32,
2241 pcomm: pcomm.to_string(),
2242 comm: comm.to_string(),
2243 cgroup,
2244 start_time_clock_ticks: stat.start_time_clock_ticks.unwrap_or(0),
2245 policy: CategoricalString(stat.policy.map(policy_name).unwrap_or_default()),
2246 nice: OrdinalI32(stat.nice.unwrap_or(0)),
2247 cpu_affinity: CpuSet(cpu_affinity),
2248 processor: OrdinalI32(stat.processor.unwrap_or(0)),
2249 state: status.state.unwrap_or_else(default_state_char),
2250 ext_enabled: sched.ext_enabled.unwrap_or(false),
2251 run_time_ns: MonotonicNs(run_time_ns.unwrap_or(0)),
2252 wait_time_ns: MonotonicNs(wait_time_ns.unwrap_or(0)),
2253 timeslices: MonotonicCount(timeslices.unwrap_or(0)),
2254 voluntary_csw: MonotonicCount(status.voluntary_csw.unwrap_or(0)),
2255 nonvoluntary_csw: MonotonicCount(status.nonvoluntary_csw.unwrap_or(0)),
2256 nr_wakeups: MonotonicCount(sched.nr_wakeups.unwrap_or(0)),
2257 nr_wakeups_local: MonotonicCount(sched.nr_wakeups_local.unwrap_or(0)),
2258 nr_wakeups_remote: MonotonicCount(sched.nr_wakeups_remote.unwrap_or(0)),
2259 nr_wakeups_sync: MonotonicCount(sched.nr_wakeups_sync.unwrap_or(0)),
2260 nr_wakeups_migrate: MonotonicCount(sched.nr_wakeups_migrate.unwrap_or(0)),
2261 nr_wakeups_affine: MonotonicCount(sched.nr_wakeups_affine.unwrap_or(0)),
2262 nr_wakeups_affine_attempts: MonotonicCount(sched.nr_wakeups_affine_attempts.unwrap_or(0)),
2263 nr_migrations: MonotonicCount(sched.nr_migrations.unwrap_or(0)),
2264 nr_forced_migrations: MonotonicCount(sched.nr_forced_migrations.unwrap_or(0)),
2265 nr_failed_migrations_affine: MonotonicCount(sched.nr_failed_migrations_affine.unwrap_or(0)),
2266 nr_failed_migrations_running: MonotonicCount(
2267 sched.nr_failed_migrations_running.unwrap_or(0),
2268 ),
2269 nr_failed_migrations_hot: MonotonicCount(sched.nr_failed_migrations_hot.unwrap_or(0)),
2270 wait_sum: MonotonicNs(sched.wait_sum.unwrap_or(0)),
2271 wait_count: MonotonicCount(sched.wait_count.unwrap_or(0)),
2272 wait_max: PeakNs(sched.wait_max.unwrap_or(0)),
2273 // Capture-time normalization: kernel's `sum_sleep_runtime`
2274 // counts BOTH voluntary sleep AND involuntary block (see
2275 // `__update_stats_enqueue_sleeper` at kernel/sched/stats.c).
2276 // Subtracting the block component leaves pure voluntary
2277 // sleep — the operationally useful signal — and avoids the
2278 // need for a derived metric at compare time.
2279 //
2280 // The subtraction is only meaningful when BOTH halves
2281 // parsed successfully. If `sum_block_runtime` is missing,
2282 // an `unwrap_or(0)` fallback would yield
2283 // `sum_sleep_runtime - 0 = full_sleep_total`, mislabelling
2284 // the involuntary-block component as voluntary sleep and
2285 // breaking the field-doc contract ("voluntary only"). If
2286 // `sum_sleep_runtime` is missing, the fallback would yield
2287 // `0 - block`, which `saturating_sub` collapses to 0 but
2288 // also discards any real voluntary signal that might have
2289 // been recorded if the kernel had emitted both. Either
2290 // half-missing case means the value is uncomputable, so
2291 // it falls through to 0 — matching the "absent data → 0"
2292 // convention used by every sibling field at this site
2293 // (e.g. `wait_sum`, `sleep_max`, `block_sum`) and
2294 // co-locating with the existing `block_sum: 0` that the
2295 // same parse miss already produces below.
2296 //
2297 // `saturating_sub` remains in the both-Some path as
2298 // defense against the kernel-ordering edge case:
2299 // `__update_stats_enqueue_sleeper` adds to
2300 // `sum_sleep_runtime` BEFORE adding the same delta to
2301 // `sum_block_runtime`, so a sample read between those
2302 // writes can transiently yield `block > sleep` even
2303 // though every in-tree path eventually settles to
2304 // `block <= sleep`.
2305 voluntary_sleep_ns: MonotonicNs(match (sched.sleep_sum, sched.block_sum) {
2306 (Some(sleep), Some(block)) => sleep.saturating_sub(block),
2307 _ => 0,
2308 }),
2309 sleep_max: PeakNs(sched.sleep_max.unwrap_or(0)),
2310 block_sum: MonotonicNs(sched.block_sum.unwrap_or(0)),
2311 block_max: PeakNs(sched.block_max.unwrap_or(0)),
2312 iowait_sum: MonotonicNs(sched.iowait_sum.unwrap_or(0)),
2313 iowait_count: MonotonicCount(sched.iowait_count.unwrap_or(0)),
2314 exec_max: PeakNs(sched.exec_max.unwrap_or(0)),
2315 slice_max: PeakNs(sched.slice_max.unwrap_or(0)),
2316 allocated_bytes: Bytes(0),
2317 deallocated_bytes: Bytes(0),
2318 minflt: MonotonicCount(stat.minflt.unwrap_or(0)),
2319 majflt: MonotonicCount(stat.majflt.unwrap_or(0)),
2320 utime_clock_ticks: ClockTicks(stat.utime_clock_ticks.unwrap_or(0)),
2321 stime_clock_ticks: ClockTicks(stat.stime_clock_ticks.unwrap_or(0)),
2322 priority: OrdinalI32(stat.priority.unwrap_or(0)),
2323 rt_priority: OrdinalU32(stat.rt_priority.unwrap_or(0)),
2324 core_forceidle_sum: MonotonicNs(sched.core_forceidle_sum.unwrap_or(0)),
2325 fair_slice_ns: GaugeNs(sched.fair_slice_ns.unwrap_or(0)),
2326 // Dedup `nr_threads` to only the thread leader. Every
2327 // thread of the same tgid sees the same kernel-emitted
2328 // value; populating it on every thread would let any
2329 // Sum-style aggregator multiply the count by itself
2330 // across the group. Leader-only population means the
2331 // registry's `AggRule::MaxGaugeCount` surfaces the
2332 // largest process represented in the bucket — reading
2333 // "the biggest process in this group" rather than "how
2334 // many threads the kernel believes this group contains"
2335 // (which is already covered by the row count).
2336 nr_threads: GaugeCount(if tid == tgid {
2337 status.nr_threads.unwrap_or(0)
2338 } else {
2339 0
2340 }),
2341 smaps_rollup_kib,
2342 rchar: Bytes(io.rchar.unwrap_or(0)),
2343 wchar: Bytes(io.wchar.unwrap_or(0)),
2344 syscr: MonotonicCount(io.syscr.unwrap_or(0)),
2345 syscw: MonotonicCount(io.syscw.unwrap_or(0)),
2346 read_bytes: Bytes(io.read_bytes.unwrap_or(0)),
2347 write_bytes: Bytes(io.write_bytes.unwrap_or(0)),
2348 cancelled_write_bytes: Bytes(io.cancelled_write_bytes.unwrap_or(0)),
2349 // Taskstats fields land here at zero defaults; the caller
2350 // (`capture_with` / `capture_pid_with`) overwrites them
2351 // after the per-tid `TaskstatsClient::query_tid` call,
2352 // mirroring how `allocated_bytes` / `deallocated_bytes`
2353 // are placeholdered above and then filled by the jemalloc
2354 // probe path. Zero defaults are correct for the
2355 // best-effort contract: a kernel without `CONFIG_TASKSTATS`
2356 // / `CONFIG_TASK_DELAY_ACCT`, a host with `delayacct=off`
2357 // at runtime, a process without `CAP_NET_ADMIN`, or a tid
2358 // that exited before `query_tid` succeeded all collapse
2359 // to zero per-field.
2360 cpu_delay_count: MonotonicCount(0),
2361 cpu_delay_total_ns: MonotonicNs(0),
2362 cpu_delay_max_ns: PeakNs(0),
2363 cpu_delay_min_ns: PeakNs(0),
2364 blkio_delay_count: MonotonicCount(0),
2365 blkio_delay_total_ns: MonotonicNs(0),
2366 blkio_delay_max_ns: PeakNs(0),
2367 blkio_delay_min_ns: PeakNs(0),
2368 swapin_delay_count: MonotonicCount(0),
2369 swapin_delay_total_ns: MonotonicNs(0),
2370 swapin_delay_max_ns: PeakNs(0),
2371 swapin_delay_min_ns: PeakNs(0),
2372 freepages_delay_count: MonotonicCount(0),
2373 freepages_delay_total_ns: MonotonicNs(0),
2374 freepages_delay_max_ns: PeakNs(0),
2375 freepages_delay_min_ns: PeakNs(0),
2376 thrashing_delay_count: MonotonicCount(0),
2377 thrashing_delay_total_ns: MonotonicNs(0),
2378 thrashing_delay_max_ns: PeakNs(0),
2379 thrashing_delay_min_ns: PeakNs(0),
2380 compact_delay_count: MonotonicCount(0),
2381 compact_delay_total_ns: MonotonicNs(0),
2382 compact_delay_max_ns: PeakNs(0),
2383 compact_delay_min_ns: PeakNs(0),
2384 wpcopy_delay_count: MonotonicCount(0),
2385 wpcopy_delay_total_ns: MonotonicNs(0),
2386 wpcopy_delay_max_ns: PeakNs(0),
2387 wpcopy_delay_min_ns: PeakNs(0),
2388 irq_delay_count: MonotonicCount(0),
2389 irq_delay_total_ns: MonotonicNs(0),
2390 irq_delay_max_ns: PeakNs(0),
2391 irq_delay_min_ns: PeakNs(0),
2392 hiwater_rss_bytes: crate::metric_types::PeakBytes(0),
2393 hiwater_vm_bytes: crate::metric_types::PeakBytes(0),
2394 // Not-yet-captured here: the delay family's zero defaults above are
2395 // overwritten (and this flag set true) by apply_delay_stats on a
2396 // successful taskstats query; jemalloc_measured is set true where the
2397 // probe assigns allocated_bytes/deallocated_bytes. Both stay false when
2398 // the respective capture did not run.
2399 taskstats_measured: false,
2400 cpu_delay_active: false,
2401 delay_block_active: false,
2402 xacct_active: false,
2403 jemalloc_measured: false,
2404 }
2405}
2406
2407#[cfg(test)]
2408fn capture_thread(tgid: i32, tid: i32, pcomm: &str) -> ThreadState {
2409 let proc_root = Path::new(DEFAULT_PROC_ROOT);
2410 let comm = read_thread_comm_at(proc_root, tgid, tid).unwrap_or_default();
2411 capture_thread_at(proc_root, tgid, tid, pcomm, &comm, true)
2412}
2413
2414/// Running tally for the per-snapshot jemalloc-probe summary line
2415/// emitted by [`capture_with`] and [`capture_pid_with`]. The
2416/// dominant `AttachError` tag and `ProbeError` tag are tracked so
2417/// the summary can surface a remediation hint when one error class
2418/// dominates (e.g. EPERM under YAMA).
2419#[derive(Debug, Default)]
2420struct ProbeSummary {
2421 tgids_walked: u64,
2422 jemalloc_detected: u64,
2423 probed_ok: u64,
2424 failed: u64,
2425 attach_tag_counts: BTreeMap<&'static str, u64>,
2426 probe_tag_counts: BTreeMap<&'static str, u64>,
2427}
2428
2429impl ProbeSummary {
2430 /// Pick the most frequent ACTIONABLE error tag (across attach
2431 /// and probe failures) for the summary line. Ties resolve to
2432 /// REVERSE-alphabetical order so the output is deterministic:
2433 /// the comparator's secondary key is `b.0.cmp(a.0)` (note the
2434 /// argument flip), so when two tags share a count, the
2435 /// alphabetically-EARLIER tag wins (e.g. `dwarf-parse-failure`
2436 /// beats `ptrace-seize`).
2437 ///
2438 /// `jemalloc-not-found` and `readlink-failure` are filtered out
2439 /// of the attach side: both are the expected outcome on the bulk
2440 /// of system processes (most tgids are not jemalloc-linked, and
2441 /// short-lived ones routinely fail readlink mid-walk), so
2442 /// surfacing them as the operator-facing "dominant failure tag"
2443 /// would drown the actionable signal (privilege drops, stripped
2444 /// debuginfo, arch mismatch) under known-benign noise on every
2445 /// snapshot. The filter is the same matches! arm
2446 /// `try_attach_probe_for_tgid_at` uses to route those two tags
2447 /// to debug-level tracing rather than warn-level — the
2448 /// dominant-tag summary mirrors the same actionable/non-actionable
2449 /// cut. Probe tags are not filtered: every `ProbeError` variant
2450 /// is actionable.
2451 fn dominant_tag(&self) -> Option<&'static str> {
2452 self.attach_tag_counts
2453 .iter()
2454 .filter(|(t, _)| !matches!(**t, "jemalloc-not-found" | "readlink-failure"))
2455 .chain(self.probe_tag_counts.iter())
2456 .max_by(|a, b| a.1.cmp(b.1).then_with(|| b.0.cmp(a.0)))
2457 .map(|(tag, _)| *tag)
2458 }
2459
2460 /// True when `ptrace-seize` (or `ptrace-interrupt`) failures
2461 /// dominate, signalling a privilege issue. Used to gate the
2462 /// EPERM remediation hint.
2463 fn ptrace_dominates(&self) -> bool {
2464 let total_ptrace: u64 = self
2465 .probe_tag_counts
2466 .iter()
2467 .filter(|(t, _)| matches!(**t, "ptrace-seize" | "ptrace-interrupt"))
2468 .map(|(_, n)| *n)
2469 .sum();
2470 // Half of failures or more attributable to ptrace
2471 // privilege (ptrace-seize or ptrace-interrupt) — high
2472 // enough that the hint is useful, low enough that a few
2473 // EPERMs in an otherwise-clean run don't drown the
2474 // summary.
2475 self.failed > 0 && total_ptrace * 2 >= self.failed
2476 }
2477
2478 /// Project the internal tally to the curated public surface.
2479 /// Drops the per-tag `attach_tag_counts` / `probe_tag_counts`
2480 /// maps (implementation detail) and surfaces only the
2481 /// counters + dominant tag string + privilege-dominant
2482 /// signal. Mirrors the actionable/non-actionable cut
2483 /// [`Self::dominant_tag`] uses, so `dominant_failure` is
2484 /// `None` exactly when the snapshot has zero actionable
2485 /// failures. `privilege_dominant` mirrors
2486 /// [`Self::ptrace_dominates`] so a downstream consumer can
2487 /// reproduce the EPERM-hint trigger condition without
2488 /// parsing the operator-facing tracing line.
2489 fn to_public(&self) -> CtprofProbeSummary {
2490 CtprofProbeSummary {
2491 tgids_walked: self.tgids_walked,
2492 jemalloc_detected: self.jemalloc_detected,
2493 probed_ok: self.probed_ok,
2494 failed: self.failed,
2495 dominant_failure: self.dominant_tag().map(|t| t.to_string()),
2496 privilege_dominant: self.ptrace_dominates(),
2497 }
2498 }
2499}
2500
2501/// Internal tally of procfs read-level failures, threaded through
2502/// [`capture_thread_at_with_tally`] and projected to the public
2503/// surface via [`Self::to_public`]. Mirrors the [`ProbeSummary`] /
2504/// [`CtprofProbeSummary`] split: tracks per-tid context plus a
2505/// per-file-kind failure map, then drops the implementation-detail
2506/// shape (here the `&'static str` keys vs the public surface's
2507/// `String` keys, which serde-derive cleanly).
2508///
2509/// `tids_walked` is incremented once per tid the capture pass
2510/// attempts, regardless of whether the tid lands in the snapshot —
2511/// the bump happens at the call site (before invoking
2512/// `capture_thread_at_with_tally`), so a ghost-filtered tid still
2513/// counts as walked. The per-tid `pending_failures` set lets the
2514/// caller unwind a ghost-filtered tid's read-failure contributions
2515/// before the summary is finalized — see [`Self::commit_pending`] /
2516/// [`Self::discard_pending`].
2517#[derive(Debug, Default)]
2518struct ParseTally {
2519 tids_walked: u64,
2520 failures_by_file: BTreeMap<&'static str, u64>,
2521 /// Per-tid pending bumps held until the caller commits or
2522 /// discards based on the ghost filter. Cleared between tids.
2523 pending_failures: Vec<&'static str>,
2524 /// Committed total of negative dotted-ns values seen across
2525 /// the snapshot. The kernel's PN_SCHEDSTAT path (`%Ld.%06ld`
2526 /// in `kernel/sched/debug.c`) emits a leading `-` when a
2527 /// schedstat field carries a negative integer part — rare but
2528 /// observable on clock-skew / suspend-resume hosts. The
2529 /// capture-side parser previously folded these into the
2530 /// absent-counter zero silently; this tally surfaces the
2531 /// rate so an operator can spot a host whose schedstat values
2532 /// are routinely negative-and-zeroed.
2533 negative_dotted_values: u64,
2534 /// Per-tid pending negative-dotted bumps held until
2535 /// commit / discard, mirroring [`Self::pending_failures`].
2536 pending_negative_dotted: u64,
2537}
2538
2539impl ParseTally {
2540 /// Record a per-file read failure for the current tid. Held
2541 /// pending until [`Self::commit_pending`] or
2542 /// [`Self::discard_pending`] resolves the tid's outcome.
2543 fn record_failure(&mut self, file_kind: &'static str) {
2544 self.pending_failures.push(file_kind);
2545 }
2546
2547 /// Record a negative dotted-ns value seen during sched parse
2548 /// for the current tid. Held pending until
2549 /// [`Self::commit_pending`] / [`Self::discard_pending`].
2550 fn record_negative_dotted(&mut self) {
2551 self.pending_negative_dotted = self.pending_negative_dotted.saturating_add(1);
2552 }
2553
2554 /// Commit the current tid's pending failures to the per-snapshot
2555 /// tally. Called when the tid lands in the snapshot.
2556 fn commit_pending(&mut self) {
2557 for kind in self.pending_failures.drain(..) {
2558 *self.failures_by_file.entry(kind).or_insert(0) += 1;
2559 }
2560 self.negative_dotted_values = self
2561 .negative_dotted_values
2562 .saturating_add(self.pending_negative_dotted);
2563 self.pending_negative_dotted = 0;
2564 }
2565
2566 /// Discard the current tid's pending failures. Called when the
2567 /// ghost filter rejects the tid — the bumps would correspond to
2568 /// a thread the snapshot doesn't include, so they must not
2569 /// inflate the summary.
2570 fn discard_pending(&mut self) {
2571 self.pending_failures.clear();
2572 self.pending_negative_dotted = 0;
2573 }
2574
2575 /// Total failures across every file kind. Read-side mirror of
2576 /// the public surface's `read_failures` field.
2577 fn total_failures(&self) -> u64 {
2578 self.failures_by_file.values().sum()
2579 }
2580
2581 /// Pick the file kind with the most failures. Ties resolve to
2582 /// REVERSE-alphabetical order for determinism — the
2583 /// alphabetically-EARLIER tag wins (mirrors
2584 /// [`ProbeSummary::dominant_tag`]'s comparator).
2585 fn dominant_file(&self) -> Option<&'static str> {
2586 self.failures_by_file
2587 .iter()
2588 .max_by(|a, b| a.1.cmp(b.1).then_with(|| b.0.cmp(a.0)))
2589 .map(|(tag, _)| *tag)
2590 }
2591
2592 /// True when ≥ 50% of failures are in `schedstat` or `io` —
2593 /// the two procfs files gated by `CONFIG_SCHED_INFO` /
2594 /// `CONFIG_TASK_IO_ACCOUNTING`. Mirrors
2595 /// [`ProbeSummary::ptrace_dominates`]'s shape: dominance gate
2596 /// at half-or-more, false when total is zero.
2597 fn kernel_config_dominates(&self) -> bool {
2598 let total = self.total_failures();
2599 if total == 0 {
2600 return false;
2601 }
2602 let kconfig: u64 = self
2603 .failures_by_file
2604 .iter()
2605 .filter(|(t, _)| matches!(**t, "schedstat" | "io"))
2606 .map(|(_, n)| *n)
2607 .sum();
2608 kconfig * 2 >= total
2609 }
2610
2611 /// Project the internal tally to the curated public surface.
2612 fn to_public(&self) -> CtprofParseSummary {
2613 let read_failures = self.total_failures();
2614 let mut by_file = BTreeMap::new();
2615 for (k, v) in &self.failures_by_file {
2616 by_file.insert((*k).to_string(), *v);
2617 }
2618 CtprofParseSummary {
2619 tids_walked: self.tids_walked,
2620 read_failures,
2621 read_failures_by_file: by_file,
2622 dominant_read_failure: self.dominant_file().map(|t| t.to_string()),
2623 kernel_config_dominant: self.kernel_config_dominates(),
2624 negative_dotted_values: self.negative_dotted_values,
2625 }
2626 }
2627}
2628
2629/// Stable EPERM remediation hint for the capture summary. References
2630/// `$(which ktstr)` rather than a hardcoded path so the suggestion
2631/// works regardless of where the binary is installed.
2632const PTRACE_EPERM_HINT: &str = "hint: re-run as root, or sudo setcap cap_sys_ptrace+eip $(which ktstr), or set kernel.yama.ptrace_scope=0";
2633
2634/// Result of the stateless attach pass for a single tgid:
2635/// the procfs-derived `pcomm` (for tracing) plus the underlying
2636/// `attach_jemalloc_at` outcome. Carries no shared state, so it
2637/// can be assembled by rayon workers in parallel without locking.
2638struct AttachOutcome {
2639 pcomm: String,
2640 result: std::result::Result<
2641 crate::host_thread_probe::JemallocProbe,
2642 crate::host_thread_probe::AttachError,
2643 >,
2644}
2645
2646/// Cache value for the per-`(dev, ino)` probe cache in
2647/// [`capture_with`]'s parallel probe phase. Captures BOTH the
2648/// `JemallocProbe` (for the success path) and the
2649/// `AttachError::tag()` string (for the failure path) so a
2650/// cache hit can re-apply the same `attach_tag_counts` /
2651/// `failed` bumps that the original miss applied via
2652/// [`record_attach_outcome`]. Without `failed_tag`, repeat
2653/// hits on a failed binary would credit only `tgids_walked` —
2654/// the actionable failure (and its dominant-tag accounting)
2655/// would be silently undercounted relative to actual attach
2656/// volume.
2657#[derive(Clone)]
2658struct CachedAttachResult {
2659 probe: Option<crate::host_thread_probe::JemallocProbe>,
2660 /// `None` for the success path (`probe.is_some()`); `Some`
2661 /// for every failure path, even non-actionable tags
2662 /// (`jemalloc-not-found`, `readlink-failure`) — the
2663 /// dominant-tag filter in [`ProbeSummary::dominant_tag`]
2664 /// excludes those, but `attach_tag_counts` itself records
2665 /// every tag for diagnostic completeness.
2666 failed_tag: Option<&'static str>,
2667}
2668
2669/// Stateless half of the per-tgid attach: read `pcomm` and run
2670/// `attach_jemalloc_at` (the expensive ELF parse + DWARF walk).
2671/// No summary mutation — the result is paired with `pcomm` and
2672/// returned to the caller for application via
2673/// [`record_attach_outcome`]. Splitting attach from the summary
2674/// update lets the parallel probe phase in [`capture_with`] hold
2675/// the `summary_mutex` only for the cheap counter+tracing step,
2676/// rather than serialising every rayon worker on the slowest
2677/// call in the pipeline.
2678fn attach_probe_for_tgid_at(proc_root: &Path, tgid: i32) -> AttachOutcome {
2679 #[cfg(test)]
2680 {
2681 // Panic-injection seam: a test sets `PANIC_INJECT_TGID` to
2682 // a sentinel tgid value before calling `capture_with`. When
2683 // the rayon worker for that tgid enters this function, we
2684 // panic to model the failure mode where the ELF parse / DWARF
2685 // walk panics under fd exhaustion or OOM. The
2686 // `catch_unwind` wrapper in `capture_with`'s phase 1 must
2687 // absorb this and surface it through the summary as a
2688 // `worker-panic` attach tag without crashing the snapshot.
2689 let injected = PANIC_INJECT_TGID.load(std::sync::atomic::Ordering::Acquire);
2690 if injected != 0 && injected == tgid {
2691 // Non-string payload variant: when the bool seam is
2692 // armed, panic with a typed payload so the
2693 // `downcast_ref::<&str>` and `downcast_ref::<String>`
2694 // arms in `capture_with` both miss and the
2695 // `unwrap_or("<non-string panic payload>")` fallback
2696 // arm fires. Pinned by
2697 // `capture_with_rayon_worker_panic_non_string_payload_falls_back`.
2698 if PANIC_INJECT_NON_STRING.load(std::sync::atomic::Ordering::Acquire) {
2699 // u64 is `'static + Send`, so it satisfies the
2700 // `Box<dyn Any + Send>` payload bound but neither
2701 // downcasts to `&str` nor `String` — exactly the
2702 // shape the fallback arm guards against.
2703 std::panic::panic_any(0xDEADBEEFu64);
2704 }
2705 panic!("test: injected attach worker panic for tgid {tgid}");
2706 }
2707 }
2708 let pcomm = read_process_comm_at(proc_root, tgid).unwrap_or_default();
2709 let result = crate::host_thread_probe::attach_jemalloc_at(proc_root, tgid);
2710 AttachOutcome { pcomm, result }
2711}
2712
2713/// Test-only seam for the panic-injection harness consumed by
2714/// [`attach_probe_for_tgid_at`]. Set to a non-zero tgid to make
2715/// the next attach call for that tgid panic; reset to 0 to
2716/// disable. The check fires on the rayon worker thread, so the
2717/// `catch_unwind` wrapper in [`capture_with`] is the only thing
2718/// that prevents the panic from propagating out of `pool.install`.
2719/// `cfg(test)` only — production builds carry no injection
2720/// surface.
2721#[cfg(test)]
2722static PANIC_INJECT_TGID: std::sync::atomic::AtomicI32 = std::sync::atomic::AtomicI32::new(0);
2723
2724/// Test-only companion seam for [`PANIC_INJECT_TGID`]: when
2725/// armed (`true`) before calling `capture_with`, the injected
2726/// panic uses a typed non-string payload (`std::panic::panic_any`
2727/// over a `u64`) instead of the default formatted-message
2728/// `panic!`. Lets a test exercise the
2729/// `unwrap_or("<non-string panic payload>")` fallback arm in
2730/// `capture_with`'s panic-handling block — the `downcast_ref`
2731/// chain misses both `&str` and `String` for non-string
2732/// payloads and must fall back rather than panicking on
2733/// `unwrap()`.
2734#[cfg(test)]
2735static PANIC_INJECT_NON_STRING: std::sync::atomic::AtomicBool =
2736 std::sync::atomic::AtomicBool::new(false);
2737
2738/// Stateful half of the per-tgid attach: apply `outcome` to
2739/// `summary` and emit one tracing event. Two attach-error tags
2740/// log at `debug` rather than `warn`: `jemalloc-not-found` (the
2741/// bulk of system processes are not jemalloc-linked, so this is
2742/// the dominant non-actionable outcome on a busy host) and
2743/// `readlink-failure` (a tgid that exited between the procfs
2744/// walk and `readlink(/proc/<pid>/exe)` is also routine — race-
2745/// with-exit on short-lived helpers). Every other variant logs
2746/// at `warn` because a jemalloc-linked target failing to attach
2747/// is actionable (privilege drop, stripped binary, …). The
2748/// matches! arm here is the same one [`ProbeSummary::dominant_tag`]
2749/// uses to filter the operator-facing summary, so the level
2750/// routing and the dominance ranking surface the same
2751/// actionable/non-actionable cut. No I/O — safe to call under a
2752/// short-held mutex from the parallel probe phase.
2753fn record_attach_outcome(
2754 tgid: i32,
2755 outcome: AttachOutcome,
2756 summary: &mut ProbeSummary,
2757) -> CachedAttachResult {
2758 summary.tgids_walked += 1;
2759 let AttachOutcome { pcomm, result } = outcome;
2760 match result {
2761 Ok(probe) => {
2762 summary.jemalloc_detected += 1;
2763 tracing::debug!(tgid, %pcomm, "ctprof probe: jemalloc detected");
2764 CachedAttachResult {
2765 probe: Some(probe),
2766 failed_tag: None,
2767 }
2768 }
2769 Err(err) => {
2770 let tag = err.tag();
2771 *summary.attach_tag_counts.entry(tag).or_insert(0) += 1;
2772 if matches!(tag, "jemalloc-not-found" | "readlink-failure") {
2773 tracing::debug!(tgid, %pcomm, tag, err = %err, "ctprof probe: attach skipped");
2774 } else {
2775 summary.failed += 1;
2776 tracing::warn!(tgid, %pcomm, tag, err = %err, "ctprof probe: attach failed");
2777 }
2778 CachedAttachResult {
2779 probe: None,
2780 failed_tag: Some(tag),
2781 }
2782 }
2783 }
2784}
2785
2786/// Single-call wrapper around [`attach_probe_for_tgid_at`] +
2787/// [`record_attach_outcome`] for sequential callers (tests + the
2788/// per-pid `capture_pid_with` path) that don't need the
2789/// stateless/stateful split. The parallel probe phase in
2790/// [`capture_with`] calls the two halves separately so the
2791/// expensive attach runs outside the summary mutex.
2792fn try_attach_probe_for_tgid_at(
2793 proc_root: &Path,
2794 tgid: i32,
2795 summary: &mut ProbeSummary,
2796) -> Option<crate::host_thread_probe::JemallocProbe> {
2797 let outcome = attach_probe_for_tgid_at(proc_root, tgid);
2798 record_attach_outcome(tgid, outcome, summary).probe
2799}
2800
2801/// Pull `(allocated_bytes, deallocated_bytes)` for one tid via the
2802/// pre-attached probe, recording the outcome in `summary` and
2803/// emitting a `tracing::warn!` once per failed tgid (the engine
2804/// shares the same `AttachError`/`ProbeError` taxonomy across every
2805/// tid of a tgid, so logging each tid would spam the operator).
2806///
2807/// Returns `Some((allocated, deallocated))` on a successful per-thread
2808/// read, `None` on a read failure. The caller uses `None` to leave
2809/// `jemalloc_measured` false — a failed read is not a measurement, so
2810/// the absent-as-0 default must fold to `Aggregated::Absent`, not a
2811/// sentinel `Sum(0)`. Mirrors the taskstats path's per-thread Ok-gating.
2812fn probe_thread_recording(
2813 probe: &crate::host_thread_probe::JemallocProbe,
2814 tid: i32,
2815 tgid: i32,
2816 pcomm: &str,
2817 comm: &str,
2818 summary: &mut ProbeSummary,
2819 failed_tgids_logged: &mut std::collections::BTreeSet<i32>,
2820) -> Option<(u64, u64)> {
2821 match crate::host_thread_probe::probe_thread(probe, tid) {
2822 Ok(c) => {
2823 summary.probed_ok += 1;
2824 Some((c.allocated_bytes, c.deallocated_bytes))
2825 }
2826 Err(err) => {
2827 let tag = err.tag();
2828 *summary.probe_tag_counts.entry(tag).or_insert(0) += 1;
2829 summary.failed += 1;
2830 if failed_tgids_logged.insert(tgid) {
2831 tracing::warn!(
2832 tgid,
2833 tid,
2834 %pcomm,
2835 %comm,
2836 tag,
2837 err = %err,
2838 "ctprof probe: probe_thread failed",
2839 );
2840 }
2841 None
2842 }
2843 }
2844}
2845
2846/// Emit the once-per-snapshot parse-summary line. Mirrors the
2847/// [`emit_probe_summary`] discipline: one info-level line with the
2848/// per-snapshot tally counts. Includes the dominant failure file
2849/// kind when any read failures landed, the kernel-config
2850/// remediation hint when `schedstat` / `io` dominate, and the
2851/// negative-dotted-value count when the parser saw any
2852/// schedstat fields with a leading `-`. The clauses are
2853/// suppressed when their underlying signal is zero so a clean
2854/// host emits a single short line.
2855fn emit_parse_summary(tally: &ParseTally) {
2856 let tids_walked = tally.tids_walked;
2857 let read_failures = tally.total_failures();
2858 let negative_dotted = tally.negative_dotted_values;
2859 let dominant_clause = tally
2860 .dominant_file()
2861 .map(|tag| format!(" (dominant: {tag})"))
2862 .unwrap_or_default();
2863 let kconfig_clause = if tally.kernel_config_dominates() {
2864 format!("; {PARSE_KCONFIG_HINT}")
2865 } else {
2866 String::new()
2867 };
2868 let negative_clause = if negative_dotted > 0 {
2869 format!(", {negative_dotted} negative-dotted values")
2870 } else {
2871 String::new()
2872 };
2873 tracing::info!(
2874 "ctprof parse: {tids_walked} tids walked, \
2875 {read_failures} read failures{negative_clause}\
2876 {dominant_clause}{kconfig_clause}",
2877 );
2878}
2879
2880/// Emit the once-per-snapshot summary line. Includes the dominant
2881/// failure tag when any failures landed and an EPERM remediation
2882/// hint when ptrace privilege failures dominate.
2883fn emit_probe_summary(summary: &ProbeSummary) {
2884 let tgids_walked = summary.tgids_walked;
2885 let jemalloc_detected = summary.jemalloc_detected;
2886 let probed_ok = summary.probed_ok;
2887 let failed = summary.failed;
2888 if failed > 0 {
2889 let dominant = summary.dominant_tag().unwrap_or("?");
2890 if summary.ptrace_dominates() {
2891 tracing::info!(
2892 "ctprof probe: {tgids_walked} tgids walked, \
2893 {jemalloc_detected} jemalloc detected, \
2894 {probed_ok} probed OK, {failed} failed \
2895 (dominant: {dominant}; {})",
2896 PTRACE_EPERM_HINT,
2897 );
2898 } else {
2899 tracing::info!(
2900 "ctprof probe: {tgids_walked} tgids walked, \
2901 {jemalloc_detected} jemalloc detected, \
2902 {probed_ok} probed OK, {failed} failed \
2903 (dominant: {dominant})",
2904 );
2905 }
2906 } else {
2907 tracing::info!(
2908 "ctprof probe: {tgids_walked} tgids walked, \
2909 {jemalloc_detected} jemalloc detected, \
2910 {probed_ok} probed OK, {failed} failed",
2911 );
2912 }
2913}
2914
2915/// Capture a complete host-wide snapshot under arbitrary procfs
2916/// and cgroup roots. Walks `<proc_root>` for every live tgid,
2917/// enumerates its threads, and assembles a [`CtprofSnapshot`]
2918/// with per-cgroup enrichment populated once per distinct cgroup
2919/// path (many threads share a cgroup; keep the walk
2920/// O(cgroups) rather than O(threads)). The default-roots
2921/// production entry point is [`capture`]; tests pass a tempdir
2922/// to exercise the walk against a synthetic tree.
2923///
2924/// `use_syscall_affinity` gates four real-host touchpoints —
2925/// (a) the [`crate::host_context::collect_host_context`] sweep
2926/// (kernel/CPU/memory/tunables read from the live host); (b)
2927/// phase 1, the parallel jemalloc-probe attach pass that walks
2928/// every tgid's `/proc/<pid>/exe` for ELF + DWARF metadata; (c)
2929/// `sched_getaffinity(2)` inside per-thread capture, with
2930/// fall-back to `Cpus_allowed_list:` on syscall failure;
2931/// (d) `emit_probe_summary` plus the [`CtprofProbeSummary`]
2932/// surfaced on the snapshot, both of which are skipped when
2933/// `use_syscall_affinity` is `false`: `emit_probe_summary` is
2934/// not called and `probe_summary` is `None`. Synthetic-tree
2935/// tests pass `false` so the staged procfs is read in isolation
2936/// (no `sched_getaffinity`, no ELF parses, no `host` block, no
2937/// `probe_summary`); production passes `true`.
2938///
2939/// Self-skip: the caller's own tgid is excluded from the per-tgid
2940/// probe-attach loop because `PTRACE_SEIZE` rejects self-attach
2941/// (the rayon `.filter(|&tgid| tgid != self_pid)` drops self
2942/// before the attach call). Phase 2 still iterates the full tgid
2943/// list including self_pid, and the per-tid lookup
2944/// `probe_map.get(&tgid).and_then(|p| p.as_ref())` returns `None`
2945/// for self_pid because phase 1 never inserted an entry; the
2946/// closure short-circuits via `.map(...).unwrap_or((0, 0))`,
2947/// leaving the jemalloc fields at the absent-counter default.
2948/// Every other procfs-derived
2949/// field populates normally — `capture_thread_at` runs
2950/// unconditionally per tid regardless of probe outcome.
2951fn capture_with(
2952 proc_root: &Path,
2953 cgroup_root: &Path,
2954 sys_root: &Path,
2955 use_syscall_affinity: bool,
2956) -> CtprofSnapshot {
2957 let captured_at_unix_ns = std::time::SystemTime::now()
2958 .duration_since(std::time::UNIX_EPOCH)
2959 .map(|d| d.as_nanos() as u64)
2960 .unwrap_or(0);
2961 let host = if use_syscall_affinity {
2962 Some(crate::host_context::collect_host_context())
2963 } else {
2964 None
2965 };
2966 // Linux pid_max is bounded above by 2^22 (PID_MAX_LIMIT,
2967 // defined in include/linux/threads.h; kernel/pid.c clamps
2968 // pid_max to it via pid_max_max) on every supported
2969 // architecture, well inside i32::MAX, so the u32 → i32 cast
2970 // cannot wrap.
2971 let self_pid = std::process::id() as i32;
2972 let mut threads: Vec<ThreadState> = Vec::new();
2973 let mut failed_tgids_logged: std::collections::BTreeSet<i32> =
2974 std::collections::BTreeSet::new();
2975
2976 // Phase 1: resolve probes in parallel via rayon. The expensive
2977 // ELF parse + DWARF walk runs concurrently across tgids, with
2978 // an inode cache (Mutex-wrapped) so duplicate binaries are
2979 // resolved only once. The result is a map of tgid → probe.
2980 //
2981 // Cache key shape: `(st_dev, st_ino)` of `/proc/<tgid>/exe`'s
2982 // metadata. Two tgids whose exes resolve to the same
2983 // `(dev, ino)` share a cache entry.
2984 //
2985 // Overlay-fs / container collision note: the kernel exposes
2986 // overlayfs files with the OVERLAY MOUNT's superblock device
2987 // (`dentry->d_sb->s_dev`, the overlayfs `struct super_block`
2988 // anonymous device number) and a synthetic `st_ino`. The
2989 // mapping happens in `fs/overlayfs/inode.c::ovl_map_dev_ino`
2990 // — both fields come from the overlay superblock's view, NOT
2991 // the underlying upper or lower layer. Two unrelated mounts
2992 // produce DIFFERENT `s_dev` values (each gets its own
2993 // anonymous bdev); two containers sharing a single mount of
2994 // the same lower-layer image-store path see the same `s_dev`
2995 // and the same hashed `st_ino` for that file. In the
2996 // shared-mount case a cached jemalloc attach result is
2997 // reused across containers — BENIGN, because the cached value
2998 // records "is this binary jemalloc-linked, and at what TSD
2999 // offset", which is a property of the ELF bytes and identical
3000 // across container instances of the same image. Mutable-
3001 // overlay writes (an upper-layer write that copies-up the
3002 // lower ELF) produce a NEW `(s_dev, st_ino)` pair within the
3003 // SAME overlay mount — `ovl_map_dev_ino` rehashes against
3004 // the new upper-layer inode — so the cache misses correctly
3005 // and re-resolves the rewritten binary.
3006 let tgids = iter_tgids_at(proc_root);
3007 let probe_cache: std::sync::Mutex<std::collections::HashMap<(u64, u64), CachedAttachResult>> =
3008 std::sync::Mutex::new(std::collections::HashMap::new());
3009 let summary_mutex = std::sync::Mutex::new(ProbeSummary::default());
3010
3011 let probe_map: std::collections::HashMap<i32, Option<crate::host_thread_probe::JemallocProbe>> =
3012 if use_syscall_affinity {
3013 use rayon::prelude::*;
3014 // Scale parallelism by available CPU headroom: read
3015 // `<proc_root>/loadavg`, subtract from online CPU count,
3016 // clamp to [1, num_cpus/2 + 1]. Avoids drowning a hot
3017 // host. Routing the read through `proc_root` (rather
3018 // than `/proc` directly) keeps the parameterised-root
3019 // contract intact so synthetic-tree tests can stage
3020 // their own loadavg shape.
3021 let max_threads = {
3022 let num_cpus = std::thread::available_parallelism()
3023 .map(|n| n.get())
3024 .unwrap_or(4);
3025 let load = std::fs::read_to_string(proc_root.join("loadavg"))
3026 .ok()
3027 .and_then(|s| s.split_whitespace().next()?.parse::<f64>().ok())
3028 .unwrap_or(0.0);
3029 let headroom = (num_cpus as f64 - load).max(1.0) as usize;
3030 headroom.clamp(1, num_cpus / 2 + 1)
3031 };
3032 // ThreadPoolBuilder::build can fail when the OS rejects
3033 // the per-thread `pthread_create` (RLIMIT_NPROC, kernel
3034 // task table at PID_MAX). Fall back to the global rayon
3035 // pool on Err — capture still completes, only loses the
3036 // bounded-headroom guarantee.
3037 let pool_result = rayon::ThreadPoolBuilder::new()
3038 .num_threads(max_threads)
3039 .build();
3040 let work = || {
3041 tgids
3042 .par_iter()
3043 .copied()
3044 .filter(|&tgid| tgid != self_pid)
3045 .map(|tgid| {
3046 // Catch panics from the per-tgid attach pipeline so
3047 // a single rogue worker (fd exhaustion, OOM during
3048 // DWARF parse, or any panic-on-bug under
3049 // `attach_jemalloc_at`) cannot tear down
3050 // `pool.install` and the surrounding capture call.
3051 // Without this guard, `rayon::ThreadPool::install`
3052 // re-throws worker panics into the calling thread,
3053 // collapsing the entire snapshot into an unwind on
3054 // a single tgid's failure. On panic we record a
3055 // `worker-panic` attach tag against the summary
3056 // (counted under `failed`, surfaced in
3057 // `dominant_failure` when it dominates) and return
3058 // `(tgid, None)` so phase 2 still walks the tgid's
3059 // threads with the absent-counter default. The tag
3060 // is treated as actionable — a panicking attach is
3061 // a bug or resource-exhaustion signal, distinct
3062 // from the benign `jemalloc-not-found` /
3063 // `readlink-failure` outcomes the dominant-tag
3064 // filter suppresses.
3065 let result = std::panic::catch_unwind(std::panic::AssertUnwindSafe(|| {
3066 let cache_key =
3067 std::fs::metadata(proc_root.join(tgid.to_string()).join("exe"))
3068 .ok()
3069 .map(|m| {
3070 use std::os::unix::fs::MetadataExt;
3071 (m.dev(), m.ino())
3072 });
3073
3074 if let Some(key) = cache_key {
3075 // `unwrap_or_else(into_inner)` on every
3076 // shared-mutex lock so a prior worker
3077 // panic that poisoned a lock cannot
3078 // cascade-poison every subsequent worker
3079 // — the catch_unwind arm below records
3080 // the failure as a `worker-panic`
3081 // attach-tag bump, and surviving workers
3082 // should still make progress on the
3083 // partially-mutated state rather than
3084 // re-panicking out of `pool.install` and
3085 // collapsing the snapshot.
3086 let cached = probe_cache.lock_unpoisoned().get(&key).cloned();
3087 if let Some(cached_result) = cached {
3088 let mut s = summary_mutex.lock_unpoisoned();
3089 s.tgids_walked += 1;
3090 match &cached_result.failed_tag {
3091 None => {
3092 // Success path — original miss
3093 // already credited
3094 // `jemalloc_detected`. Re-apply
3095 // here so cache hits stay
3096 // symmetric with cache misses;
3097 // without this, only the first
3098 // sharer of a `(dev, ino)`
3099 // would count toward
3100 // `jemalloc_detected` and
3101 // every subsequent reuse
3102 // would silently undercount.
3103 s.jemalloc_detected += 1;
3104 tracing::debug!(
3105 tgid,
3106 "ctprof probe: cache hit (jemalloc)"
3107 );
3108 }
3109 Some(tag) => {
3110 // Failure path — re-apply the
3111 // SAME bookkeeping
3112 // [`record_attach_outcome`]
3113 // applied on the original
3114 // miss: bump
3115 // `attach_tag_counts[tag]`
3116 // unconditionally, and
3117 // `failed` for actionable
3118 // tags only (matching the
3119 // dominant-tag filter in
3120 // [`ProbeSummary::dominant_tag`]).
3121 // Without this, repeat hits
3122 // on a failed binary would
3123 // credit only `tgids_walked`
3124 // and the dominant-failure
3125 // signal would degrade as
3126 // shared-inode reuse climbs.
3127 // Logging stays at debug level
3128 // — the original miss already
3129 // emitted the warn-level event
3130 // for actionable tags; spamming
3131 // a warn per cache hit would
3132 // drown the operator log.
3133 *s.attach_tag_counts.entry(tag).or_insert(0) += 1;
3134 if !matches!(
3135 *tag,
3136 "jemalloc-not-found" | "readlink-failure"
3137 ) {
3138 s.failed += 1;
3139 }
3140 tracing::debug!(
3141 tgid,
3142 tag,
3143 "ctprof probe: cache hit (prior failure)"
3144 );
3145 }
3146 }
3147 cached_result.probe
3148 } else {
3149 // Stateless attach (the expensive ELF parse +
3150 // DWARF walk) runs OUTSIDE the summary mutex
3151 // so rayon workers parallelise it. The lock
3152 // is only held for the cheap counter +
3153 // tracing application via `record_attach_outcome`.
3154 //
3155 // Shared-inode cache misses can produce
3156 // duplicate parses when N workers enter
3157 // simultaneously — all run the attach before
3158 // any inserts. The cache fully amortises
3159 // subsequent lookups; the duplicate work is
3160 // bounded by the rayon pool size.
3161 let outcome = attach_probe_for_tgid_at(proc_root, tgid);
3162 let mut s = summary_mutex.lock_unpoisoned();
3163 let res = record_attach_outcome(tgid, outcome, &mut s);
3164 drop(s);
3165 let probe = res.probe.clone();
3166 probe_cache.lock_unpoisoned().insert(key, res);
3167 probe
3168 }
3169 } else {
3170 // No cache key — exe symlink unreadable. Same
3171 // attach-outside-lock pattern as the cache-miss
3172 // branch above; result is not cached because
3173 // there's no key to file it under.
3174 let outcome = attach_probe_for_tgid_at(proc_root, tgid);
3175 let mut s = summary_mutex.lock_unpoisoned();
3176 record_attach_outcome(tgid, outcome, &mut s).probe
3177 }
3178 }));
3179 let probe = match result {
3180 Ok(p) => p,
3181 Err(panic_payload) => {
3182 // Recover the panic message string for
3183 // the operator log. The payload is a
3184 // `Box<dyn Any + Send>` whose runtime
3185 // type is `&'static str` for `panic!("…")`
3186 // with a literal and `String` for
3187 // `panic!("{…}", …)` with formatted args.
3188 // Both of `attach_jemalloc_at`'s likely
3189 // panic sites (and the test seam in
3190 // `attach_probe_for_tgid_at`) panic with
3191 // a formatted message → `String`. Other
3192 // panic types (typed values, custom
3193 // payloads) collapse to a placeholder so
3194 // the log line still surfaces the tgid.
3195 let panic_msg = panic_payload
3196 .downcast_ref::<&str>()
3197 .copied()
3198 .or_else(|| {
3199 panic_payload.downcast_ref::<String>().map(|s| s.as_str())
3200 })
3201 .unwrap_or("<non-string panic payload>");
3202 // Bump counters to mirror what
3203 // `record_attach_outcome` would have done
3204 // for an attach error: tgids_walked++,
3205 // worker-panic tag++, failed++. The lock
3206 // may be poisoned if the inner panic
3207 // happened mid-update of the summary, so
3208 // recover via
3209 // [`crate::sync::MutexExt::lock_unpoisoned`]
3210 // rather than `.unwrap()` — bumping a
3211 // counter on partially-mutated state is
3212 // strictly less bad than re-panicking out
3213 // of the worker and tearing down
3214 // `pool.install`.
3215 let mut s = summary_mutex.lock_unpoisoned();
3216 s.tgids_walked += 1;
3217 *s.attach_tag_counts.entry("worker-panic").or_insert(0) += 1;
3218 s.failed += 1;
3219 tracing::error!(
3220 tgid,
3221 panic_msg,
3222 "ctprof probe: attach worker panicked; tgid skipped",
3223 );
3224 None
3225 }
3226 };
3227 (tgid, probe)
3228 })
3229 .collect()
3230 };
3231 match pool_result {
3232 Ok(pool) => pool.install(work),
3233 Err(e) => {
3234 tracing::warn!(
3235 error = %e,
3236 max_threads,
3237 "rayon ThreadPoolBuilder failed; falling back to global pool"
3238 );
3239 work()
3240 }
3241 }
3242 } else {
3243 std::collections::HashMap::new()
3244 };
3245
3246 // `mut` is required because phase 2 below threads `&mut
3247 // summary` into `probe_thread_recording`.
3248 let mut summary = summary_mutex.into_inner_unpoisoned();
3249 // Tally for procfs read-level failures, surfaced as
3250 // `parse_summary` when the production path runs. Tests that
3251 // pass `use_syscall_affinity=false` skip the assignment so
3252 // the public field stays `None` — same discipline as
3253 // `probe_summary`.
3254 let mut parse_tally = ParseTally::default();
3255 let mut tally_opt: Option<&mut ParseTally> = if use_syscall_affinity {
3256 Some(&mut parse_tally)
3257 } else {
3258 None
3259 };
3260
3261 // Open a single taskstats genetlink socket for the snapshot.
3262 // Best-effort: a kernel without `CONFIG_TASKSTATS`, a process
3263 // without `CAP_NET_ADMIN`, or any other open failure collapses
3264 // to `None` and every per-tid `query_tid` call short-circuits
3265 // through the absent-default zeros installed in
3266 // `capture_thread_at_with_tally`. Synthetic-tree tests pass
3267 // `use_syscall_affinity=false`, so the socket is never opened
3268 // — same discipline as the host-context / probe pass.
3269 // Per-sub-family enablement, probed once per snapshot from the host
3270 // (`/proc/sys/kernel/task_delayacct` + `/proc/config.gz`); baked into each
3271 // captured thread by `apply_delay_stats`, AND-ed with the per-thread query-Ok
3272 // in the group measured predicate. Only consumed on the live capture path.
3273 // Read independently of `collect_host_context`'s host-field probe (two small
3274 // /proc reads per infrequent snapshot) so the gating does not depend on
3275 // host-context collection running on this path.
3276 let taskstats_active = crate::host_context::probe_taskstats_active();
3277 let taskstats_client = if use_syscall_affinity {
3278 match crate::taskstats::TaskstatsClient::open() {
3279 Ok(c) => Some(c),
3280 Err(e) => {
3281 tracing::warn!(
3282 error = %e,
3283 "ctprof taskstats: open failed; delay-accounting and memory-watermark \
3284 fields will be zero. Ensure the kernel was built with CONFIG_TASKSTATS \
3285 (plus CONFIG_TASK_DELAY_ACCT for delay fields and CONFIG_TASK_XACCT for \
3286 hiwater fields), the process holds CAP_NET_ADMIN, and the kernel was \
3287 booted with `delayacct=on` (or sysctl `kernel.task_delayacct=1`)"
3288 );
3289 None
3290 }
3291 }
3292 } else {
3293 None
3294 };
3295 // Per-snapshot tally of `query_tid` outcomes. Allocated only
3296 // when the production-mode capture path runs (`use_syscall_affinity`
3297 // is true) — synthetic-tree tests skip it the same way they
3298 // skip `parse_summary` and `probe_summary`. Counters bump
3299 // even when `taskstats_client.is_none()` happened (open
3300 // failed) — the per-tid loop simply never reaches
3301 // `record_result` in that case, so every counter stays zero
3302 // and the operator sees a tally of all-zeros pointing at the
3303 // open-time tracing warning.
3304 let mut taskstats_tally: Option<crate::taskstats::TaskstatsSummary> = if use_syscall_affinity {
3305 Some(crate::taskstats::TaskstatsSummary::default())
3306 } else {
3307 None
3308 };
3309
3310 // Phase 2: sequential per-tid walk + ptrace reads.
3311 for tgid in &tgids {
3312 let tgid = *tgid;
3313 let pcomm = read_process_comm_at(proc_root, tgid).unwrap_or_default();
3314 let probe: Option<&crate::host_thread_probe::JemallocProbe> = probe_map
3315 .get(&tgid)
3316 .and_then(|p: &Option<crate::host_thread_probe::JemallocProbe>| p.as_ref());
3317 for tid in iter_task_ids_at(proc_root, tgid) {
3318 if let Some(t) = tally_opt.as_mut() {
3319 t.tids_walked += 1;
3320 }
3321 let comm = read_thread_comm_at(proc_root, tgid, tid).unwrap_or_default();
3322 let probe_read = probe.and_then(|p| {
3323 probe_thread_recording(
3324 p,
3325 tid,
3326 tgid,
3327 &pcomm,
3328 &comm,
3329 &mut summary,
3330 &mut failed_tgids_logged,
3331 )
3332 });
3333 let (allocated_bytes, deallocated_bytes) = probe_read.unwrap_or((0, 0));
3334 let mut t = capture_thread_at_with_tally(
3335 proc_root,
3336 tgid,
3337 tid,
3338 &pcomm,
3339 &comm,
3340 use_syscall_affinity,
3341 &mut tally_opt,
3342 );
3343 t.allocated_bytes = crate::metric_types::Bytes(allocated_bytes);
3344 t.deallocated_bytes = crate::metric_types::Bytes(deallocated_bytes);
3345 // jemalloc is MEASURED iff the per-thread probe READ succeeded
3346 // (`probe_read.is_some()`): a non-jemalloc tgid has `probe == None`,
3347 // and an attached tgid whose per-thread read failed yields `None`
3348 // too — both leave the absent-as-0 defaults and `jemalloc_measured`
3349 // false so the group folds to Absent, not a sentinel Sum(0). Mirrors
3350 // the taskstats Ok-gating below (a failed read is not a measurement).
3351 t.jemalloc_measured = probe_read.is_some();
3352 // Best-effort taskstats query for delay-accounting +
3353 // hiwater memory watermarks. tid > 0 invariant is
3354 // guaranteed by `iter_task_ids_at`'s `> 0` filter; the
3355 // u32 cast is therefore safe. Failures (the kernel
3356 // doesn't support taskstats, the tid raced exit, the
3357 // socket was never opened) fall through to the zero
3358 // defaults already installed in
3359 // `capture_thread_at_with_tally`. Each query result —
3360 // success or failure — feeds the per-snapshot tally
3361 // so the operator can distinguish "every tid raced
3362 // exit" from "CAP_NET_ADMIN missing" from "kernel
3363 // built without CONFIG_TASKSTATS" without parsing the
3364 // tracing log.
3365 if let Some(client) = taskstats_client.as_ref() {
3366 let result = client.query_tid(tid as u32);
3367 if let Some(tally) = taskstats_tally.as_mut() {
3368 tally.record_result(&result);
3369 }
3370 if let Ok(ds) = result {
3371 t.apply_delay_stats(&ds, taskstats_active);
3372 }
3373 }
3374 // Ghost-thread filter: a tid that exited between the
3375 // `iter_task_ids_at` readdir and our per-file reads
3376 // produces an all-Default `ThreadState` — empty comm
3377 // and zero start_time_clock_ticks, because every
3378 // procfs file read bailed with ENOENT mid-capture.
3379 // Including these entries pollutes the comparison: a
3380 // baseline run might capture 1000 such ghosts and a
3381 // candidate 500, producing a spurious "500 ghost
3382 // threads vanished" diff signal in every report. A
3383 // legitimate thread under a real kernel always
3384 // carries at least one of these fields — kernel
3385 // threads have a non-empty comm at creation, user
3386 // threads inherit one from their parent — so an
3387 // entry with BOTH empty implies mid-capture exit.
3388 // The filter preserves the "captures-what-existed"
3389 // intent without softening the "captures every live
3390 // thread" invariant.
3391 if t.comm.is_empty() && t.start_time_clock_ticks == 0 {
3392 if let Some(t) = tally_opt.as_mut() {
3393 t.discard_pending();
3394 }
3395 continue;
3396 }
3397 if let Some(t) = tally_opt.as_mut() {
3398 t.commit_pending();
3399 }
3400 threads.push(t);
3401 }
3402 }
3403 let probe_summary = if use_syscall_affinity {
3404 emit_probe_summary(&summary);
3405 Some(summary.to_public())
3406 } else {
3407 None
3408 };
3409 let parse_summary = if use_syscall_affinity {
3410 emit_parse_summary(&parse_tally);
3411 Some(parse_tally.to_public())
3412 } else {
3413 None
3414 };
3415 let mut cgroup_stats: BTreeMap<String, CgroupStats> = BTreeMap::new();
3416 for t in &threads {
3417 if !t.cgroup.is_empty() && !cgroup_stats.contains_key(&t.cgroup) {
3418 cgroup_stats.insert(
3419 t.cgroup.clone(),
3420 read_cgroup_stats_at(cgroup_root, &t.cgroup),
3421 );
3422 }
3423 }
3424 let psi = read_host_psi_at(proc_root);
3425 let sched_ext = read_sched_ext_sysfs_at(sys_root);
3426 CtprofSnapshot {
3427 captured_at_unix_ns,
3428 host,
3429 threads,
3430 cgroup_stats,
3431 probe_summary,
3432 parse_summary,
3433 taskstats_summary: taskstats_tally,
3434 psi,
3435 sched_ext,
3436 }
3437}
3438
3439/// Capture a complete host-wide snapshot against the default
3440/// procfs and cgroup roots (`/proc` and `/sys/fs/cgroup`).
3441/// Probes every jemalloc-linked tgid the walk reaches and
3442/// populates per-thread `allocated_bytes` / `deallocated_bytes`
3443/// from the jemalloc TSD counters; tgids the probe cannot attach
3444/// against (ptrace denied, not jemalloc-linked, stripped binary)
3445/// land their threads at the absent-counter default of 0 per the
3446/// best-effort capture contract.
3447///
3448/// # Cost
3449///
3450/// O(threads-on-host) for the procfs walk; additionally one ELF
3451/// open + DWARF parse for every tgid `attach_jemalloc` resolves
3452/// successfully, plus a ptrace seize/interrupt/waitpid/detach
3453/// round-trip per thread of those tgids. On a host with many
3454/// jemalloc-linked daemons (database / browser / runtime
3455/// processes) the probe path dominates the wall-clock cost.
3456/// Callers that need only one tgid's data should use
3457/// [`capture_pid`] to scope the walk.
3458pub fn capture() -> CtprofSnapshot {
3459 capture_with(
3460 Path::new(DEFAULT_PROC_ROOT),
3461 Path::new(DEFAULT_CGROUP_ROOT),
3462 Path::new(DEFAULT_SYS_ROOT),
3463 true,
3464 )
3465}
3466
3467/// Capture a ctprof snapshot scoped to a single tgid.
3468///
3469/// Walks `/proc/<pid>/task` for thread enumeration but skips every
3470/// other tgid on the host, sidestepping the wall-clock cost (and
3471/// blast-radius) of the global probe pass that [`capture`] runs.
3472/// Probes the target tgid's jemalloc TSD counters when it is
3473/// jemalloc-linked and not the calling process; otherwise the
3474/// per-thread allocated / deallocated fields land at zero per the
3475/// best-effort capture contract.
3476///
3477/// Useful for tests and tools that already know which process they
3478/// care about — the resulting snapshot's `threads` vec only carries
3479/// entries for `pid`'s tgid (one entry per thread of that process).
3480/// `host` and `cgroup_stats` populate normally so the snapshot
3481/// stays self-describing.
3482pub fn capture_pid(pid: i32) -> CtprofSnapshot {
3483 capture_pid_with(
3484 Path::new(DEFAULT_PROC_ROOT),
3485 Path::new(DEFAULT_CGROUP_ROOT),
3486 Path::new(DEFAULT_SYS_ROOT),
3487 pid,
3488 true,
3489 )
3490}
3491
3492/// `proc_root` + `cgroup_root` parameterised variant of
3493/// [`capture_pid`]. Lets tests stage a synthetic procfs / cgroupfs
3494/// for the capture walk without touching the real host.
3495///
3496/// `use_syscall_affinity` gates the same four real-host
3497/// touchpoints as [`capture_with`] — host-context collection,
3498/// the jemalloc probe attach (here scoped to the single target
3499/// `pid` rather than a phase-1 sweep across every tgid),
3500/// `sched_getaffinity(2)` inside per-thread capture, and
3501/// `emit_probe_summary` plus the [`CtprofProbeSummary`] on the
3502/// snapshot. Synthetic-tree tests pass `false` because the
3503/// staged procfs has no real ELF behind `/proc/<pid>/exe`;
3504/// production passes `true`. Self-skip parallels the global path:
3505/// when `pid == self_pid`, the `probe` binding is `None` (the
3506/// `&& pid != self_pid` guard skips the attach), and each tid's
3507/// `probe.as_ref().map(...).unwrap_or((0, 0))` short-circuits to
3508/// the absent-counter default for the jemalloc fields, with every
3509/// other procfs-derived field populated normally.
3510fn capture_pid_with(
3511 proc_root: &Path,
3512 cgroup_root: &Path,
3513 sys_root: &Path,
3514 pid: i32,
3515 use_syscall_affinity: bool,
3516) -> CtprofSnapshot {
3517 let captured_at_unix_ns = std::time::SystemTime::now()
3518 .duration_since(std::time::UNIX_EPOCH)
3519 .map(|d| d.as_nanos() as u64)
3520 .unwrap_or(0);
3521 let host = if use_syscall_affinity {
3522 Some(crate::host_context::collect_host_context())
3523 } else {
3524 None
3525 };
3526 // Linux pid_max is bounded above by 2^22 (PID_MAX_LIMIT,
3527 // defined in include/linux/threads.h; kernel/pid.c clamps
3528 // pid_max to it via pid_max_max) on every supported
3529 // architecture, well inside i32::MAX, so the u32 → i32 cast
3530 // cannot wrap.
3531 let self_pid = std::process::id() as i32;
3532 let pcomm = read_process_comm_at(proc_root, pid).unwrap_or_default();
3533 let mut summary = ProbeSummary::default();
3534 let mut failed_tgids_logged: std::collections::BTreeSet<i32> =
3535 std::collections::BTreeSet::new();
3536 let probe = if use_syscall_affinity && pid != self_pid {
3537 try_attach_probe_for_tgid_at(proc_root, pid, &mut summary)
3538 } else {
3539 None
3540 };
3541 let mut threads: Vec<ThreadState> = Vec::new();
3542 let mut parse_tally = ParseTally::default();
3543 let mut tally_opt: Option<&mut ParseTally> = if use_syscall_affinity {
3544 Some(&mut parse_tally)
3545 } else {
3546 None
3547 };
3548 // Per-sub-family enablement, probed once per snapshot (see `capture_with`).
3549 let taskstats_active = crate::host_context::probe_taskstats_active();
3550 // Best-effort taskstats client — same discipline as `capture_with`.
3551 let taskstats_client = if use_syscall_affinity {
3552 match crate::taskstats::TaskstatsClient::open() {
3553 Ok(c) => Some(c),
3554 Err(e) => {
3555 tracing::warn!(
3556 error = %e,
3557 "ctprof taskstats: open failed; delay-accounting and memory-watermark \
3558 fields will be zero. Ensure the kernel was built with CONFIG_TASKSTATS \
3559 (plus CONFIG_TASK_DELAY_ACCT for delay fields and CONFIG_TASK_XACCT for \
3560 hiwater fields), the process holds CAP_NET_ADMIN, and the kernel was \
3561 booted with `delayacct=on` (or sysctl `kernel.task_delayacct=1`)"
3562 );
3563 None
3564 }
3565 }
3566 } else {
3567 None
3568 };
3569 // Per-snapshot tally — mirrors the `capture_with` discipline.
3570 // Allocated only under `use_syscall_affinity` so the
3571 // synthetic-tree code path keeps `taskstats_summary: None` on
3572 // the resulting snapshot, identical to `parse_summary` /
3573 // `probe_summary`.
3574 let mut taskstats_tally: Option<crate::taskstats::TaskstatsSummary> = if use_syscall_affinity {
3575 Some(crate::taskstats::TaskstatsSummary::default())
3576 } else {
3577 None
3578 };
3579 for tid in iter_task_ids_at(proc_root, pid) {
3580 if let Some(t) = tally_opt.as_mut() {
3581 t.tids_walked += 1;
3582 }
3583 let comm = read_thread_comm_at(proc_root, pid, tid).unwrap_or_default();
3584 let probe_read = probe.as_ref().and_then(|p| {
3585 probe_thread_recording(
3586 p,
3587 tid,
3588 pid,
3589 &pcomm,
3590 &comm,
3591 &mut summary,
3592 &mut failed_tgids_logged,
3593 )
3594 });
3595 let (allocated_bytes, deallocated_bytes) = probe_read.unwrap_or((0, 0));
3596 let mut t = capture_thread_at_with_tally(
3597 proc_root,
3598 pid,
3599 tid,
3600 &pcomm,
3601 &comm,
3602 use_syscall_affinity,
3603 &mut tally_opt,
3604 );
3605 t.allocated_bytes = crate::metric_types::Bytes(allocated_bytes);
3606 t.deallocated_bytes = crate::metric_types::Bytes(deallocated_bytes);
3607 // jemalloc measured iff the per-thread probe READ succeeded
3608 // (`probe_read.is_some()`); see the capture_with twin. A non-jemalloc
3609 // tgid (probe None) or an attached tgid whose per-thread read failed both
3610 // leave jemalloc_measured false so the group folds to Absent instead of a
3611 // sentinel Sum(0).
3612 t.jemalloc_measured = probe_read.is_some();
3613 if let Some(client) = taskstats_client.as_ref() {
3614 let result = client.query_tid(tid as u32);
3615 if let Some(tally) = taskstats_tally.as_mut() {
3616 tally.record_result(&result);
3617 }
3618 if let Ok(ds) = result {
3619 t.apply_delay_stats(&ds, taskstats_active);
3620 }
3621 }
3622 if t.comm.is_empty() && t.start_time_clock_ticks == 0 {
3623 if let Some(t) = tally_opt.as_mut() {
3624 t.discard_pending();
3625 }
3626 continue;
3627 }
3628 if let Some(t) = tally_opt.as_mut() {
3629 t.commit_pending();
3630 }
3631 threads.push(t);
3632 }
3633 let probe_summary = if use_syscall_affinity {
3634 emit_probe_summary(&summary);
3635 Some(summary.to_public())
3636 } else {
3637 None
3638 };
3639 let parse_summary = if use_syscall_affinity {
3640 emit_parse_summary(&parse_tally);
3641 Some(parse_tally.to_public())
3642 } else {
3643 None
3644 };
3645 let mut cgroup_stats: BTreeMap<String, CgroupStats> = BTreeMap::new();
3646 for t in &threads {
3647 if !t.cgroup.is_empty() && !cgroup_stats.contains_key(&t.cgroup) {
3648 cgroup_stats.insert(
3649 t.cgroup.clone(),
3650 read_cgroup_stats_at(cgroup_root, &t.cgroup),
3651 );
3652 }
3653 }
3654 let psi = read_host_psi_at(proc_root);
3655 let sched_ext = read_sched_ext_sysfs_at(sys_root);
3656 CtprofSnapshot {
3657 captured_at_unix_ns,
3658 host,
3659 threads,
3660 cgroup_stats,
3661 probe_summary,
3662 parse_summary,
3663 taskstats_summary: taskstats_tally,
3664 psi,
3665 sched_ext,
3666 }
3667}
3668
3669/// Capture a snapshot and write it to `path` in the canonical
3670/// zstd+JSON format. Wrapper over [`capture`] +
3671/// [`CtprofSnapshot::write`] so CLI code can stay a single
3672/// call.
3673pub fn capture_to(path: &Path) -> Result<()> {
3674 capture().write(path)
3675}
3676
3677// Test modules — alphabetized.
3678#[cfg(test)]
3679mod tests_capture;
3680#[cfg(test)]
3681mod tests_cgroup;
3682#[cfg(test)]
3683mod tests_helpers;
3684#[cfg(test)]
3685mod tests_parse;
3686#[cfg(test)]
3687mod tests_parse_summary;
3688#[cfg(test)]
3689mod tests_probe;
3690#[cfg(test)]
3691mod tests_snapshot;
3692#[cfg(test)]
3693mod tests_thread_state;