ktstr/ctprof/mod.rs

1//! Per-thread ctprof (cgroup/thread profiler) data model + capture layer.
2//!
3//! [`CtprofSnapshot`] is the serialized container for a single
4//! host-wide per-thread profile. Capture produces one via the
5//! `ktstr ctprof capture -o snapshot.ctprof.zst` subcommand;
6//! comparison reads two and joins them on the selected grouping
7//! axis (pcomm, cgroup, or comm).
8//!
9//! Field families and probe-timing invariance:
10//!
11//! - **Cumulative counters and totals** (the majority): wakeups,
12//!   migrations, csw, run/wait/sleep/block/iowait time, schedstat
13//!   counts, page-fault counters, syscall counters, byte counters,
14//!   the taskstats per-bucket `*_count` and `*_delay_total_ns`,
15//!   the jemalloc per-thread allocated/deallocated TSD counters,
//!   etc. Sampled twice at different instants, the value never
//!   decreases; probe-attach latency does not alter the
//!   reading.
19//! - **Lifetime high-water peaks**: schedstat `*_max` family
20//!   (`wait_max`, `sleep_max`, `block_max`, `exec_max`,
21//!   `slice_max`), every taskstats `*_delay_max_ns` /
22//!   `*_delay_min_ns`, and the memory watermarks
23//!   (`hiwater_rss_bytes`, `hiwater_vm_bytes`). These are
24//!   non-decreasing-over-time but per-event extrema rather than
25//!   sums, so they are non-summable across threads (the registry
26//!   reduces them via `MaxPeak` / `MaxPeakBytes`). Same
27//!   probe-timing invariance as the cumulative counters.
28//! - **Instantaneous gauges** (sensitive to probe timing):
29//!   [`ThreadState::nr_threads`] (signal_struct->nr_threads
30//!   snapshot), [`ThreadState::fair_slice_ns`] (instantaneous
31//!   `p->se.slice`), and [`ThreadState::state`]
32//!   (task_state_array letter). Sampled at capture time and can
33//!   genuinely differ between two probes of the same thread.
34//!   The registry pairs them with `MaxGaugeCount` /
35//!   `MaxGaugeNs` / `ModeChar` reductions rather than the
36//!   `Sum*` rules used for cumulative counters.
37//! - **Categorical / ordinal scalars** (point-in-time
38//!   snapshots): `policy`, `nice`, `priority`, `processor`,
39//!   `rt_priority`, plus the identity strings (`pcomm`, `comm`,
40//!   `cgroup`) and the [`crate::metric_types::CpuSet`]
41//!   `cpu_affinity`. These are sampled at capture time and can
42//!   change at runtime (e.g. `sched_setaffinity` mid-run flips
43//!   `processor` and `cpu_affinity`), so they share the
44//!   gauge family's probe-timing sensitivity. The registry
45//!   reduces them via `Mode*` / `Range*` / `Affinity` rather
46//!   than `Sum*`.
47//!
//! jemalloc maintains its per-thread TSD counters
//! (`tsd_s.thread_allocated` / `thread_deallocated`)
//! unconditionally on its alloc/dalloc fast and slow paths, so
//! the ptrace-based attach this layer performs does
52//! not perturb them; counters previously accumulated remain
53//! valid across the brief stop the attach induces. Metrics not
54//! derivable from cumulative state (e.g. perf_event_open
55//! counters that reset on attachment) are intentionally absent
56//! from this capture layer.
57//!
58//! # Capture model
59//!
60//! [`capture`] walks `/proc` for every live tgid, enumerates its
61//! threads, and populates each [`ThreadState`] from a handful of
62//! procfs sources: `stat`, `schedstat`, `status`, `io`, `sched`,
63//! `comm`, `cgroup`. The procfs walk runs sequentially per tid in
64//! `capture_with` phase 2. Phase 1 attaches the jemalloc TSD
65//! probe in parallel across tgids when `use_syscall_affinity` is
66//! `true` (the production path); under `use_syscall_affinity =
67//! false` (the synthetic-tree test path), phase 1 is skipped
68//! entirely — the per-tgid probe map starts and stays empty, and
69//! phase 2's per-tid lookup falls through to the absent-counter
70//! default of zero. See "Probe wiring" below for the per-tgid
71//! mechanics.
72//!
73//! ## Probe wiring (most-expensive step)
74//!
75//! For every tgid the walk reaches, the capture pipeline calls
76//! the `pub(crate)` `host_thread_probe::attach_jemalloc_at` (or
77//! its default-root `attach_jemalloc` wrapper) to resolve the
78//! target's jemalloc TLS symbol + per-`tsd_s` field offsets via
79//! an ELF parse and DWARF walk; per-thread counter reads then
80//! dispatch through `host_thread_probe::probe_thread` for one
81//! ptrace cycle: seize → interrupt → waitpid → getregset →
82//! `process_vm_readv` → detach (the detach happens automatically
83//! via the `ScopeDetach` Drop guard, so any fallible step still
84//! leaves the target unstuck). The remote read pulls a
85//! contiguous 24-byte counter span — the canonical jemalloc
86//! `TSD_DATA_FAST` layout (allocated, fast-event slot,
87//! deallocated) — but the byte count is computed dynamically by
88//! `combined_read_span` from the DWARF-resolved field offsets, so
89//! a future jemalloc layout change is absorbed. This is the
90//! dominant wall-clock cost of a snapshot:
91//! O(unique-exe-inode tgids) ELF parses + O(jemalloc-linked
92//! tgids) DWARF walks + O(threads of jemalloc-linked tgids)
93//! ptrace cycles. The first term covers non-jemalloc tgids: each
94//! distinct `/proc/<pid>/exe` inode still costs one ELF parse to
95//! discover absence (the inode-keyed cache below collapses
96//! repeats). `attach_jemalloc_at` is the sole detection gate —
97//! tgids that attach successfully populate `allocated_bytes` /
98//! `deallocated_bytes`; tgids that fail attach (not jemalloc-
99//! linked, stripped binary, ptrace denied, arch mismatch — see
100//! `host_thread_probe::AttachError`) land their threads at the
101//! absent-counter default of zero.
102//!
103//! Phase 1 parallelism is gated by host CPU headroom (read from
104//! `<proc_root>/loadavg`, clamped to `[1, num_cpus/2 + 1]`) so the
105//! capture cannot drown a hot host with concurrent ELF reads.
//! Per-tgid attach results are cached by exe inode, so a fork-bombed
//! tgid family resolves DWARF once. The per-tgid wrapper
108//! `try_attach_probe_for_tgid_at` records every outcome in a single
109//! `ProbeSummary` tally; `emit_probe_summary` surfaces a single
110//! info-level line per snapshot summarising tgids walked, jemalloc
111//! detected, probed OK, failed, plus the dominant actionable
112//! failure tag and an EPERM remediation hint when ptrace-attach
113//! failures dominate.
114//!
115//! Each internal procfs reader returns `Option` (graceful on
116//! missing/unreadable — a kernel without `CONFIG_SCHED_INFO` (the
117//! `schedstat` file) or `CONFIG_TASK_IO_ACCOUNTING` (the `io` file)
118//! makes that file absent, so its reader yields `None` without
119//! failing the rest of the thread). The assembled
120//! [`ThreadState`] treats `None` as "absent at capture" via the
121//! field type — counters collapse to `0`, identity strings
122//! collapse to empty, affinity collapses to an empty vec. A
123//! missing reading is therefore indistinguishable from a genuine
124//! zero in the serialized output; the capture contract is
125//! best-effort, never-fail-the-snapshot. Tests that need stronger
126//! guarantees inspect the underlying readers directly (they remain
127//! `Option`-shaped, unit-tested in this module).
128//!
129//! # Privilege
130//!
131//! Pulling the jemalloc per-thread TSD counters requires
132//! `ptrace(PTRACE_SEIZE)` against the target. Under
133//! `kernel.yama.ptrace_scope=0` any same-uid process attaches.
134//! Under `=1` (Debian/Ubuntu host default) the tracer must be an
135//! ancestor of the target or carry `CAP_SYS_PTRACE`; `=2` and `=3`
136//! raise the bar further. When attach fails, the per-thread
137//! `allocated_bytes` / `deallocated_bytes` collapse to 0 per the
138//! best-effort contract — the rest of the snapshot still
139//! populates from procfs.
140
141use std::collections::BTreeMap;
142use std::fs;
143use std::path::{Path, PathBuf};
144
145use crate::sync::MutexExt;
146use anyhow::Result;
147
148/// Top-level serialized artifact produced by `ktstr ctprof`.
149///
150/// The file layout on disk is zstd-compressed JSON of this struct.
/// Extension `.ctprof.zst` is conventional; the loader does not
/// inspect the extension and only needs a path that resolves to
/// a readable file.
154#[derive(Debug, Clone, Default, serde::Serialize, serde::Deserialize)]
155#[non_exhaustive]
156pub struct CtprofSnapshot {
    /// Wall-clock time at capture, nanoseconds since the Unix
    /// epoch. Useful as a tie-breaker when comparing two snapshots
    /// that originate from the same host (the newer one is treated
    /// as the candidate by default), but it plays no load-bearing
    /// role in any grouping axis.
162    pub captured_at_unix_ns: u64,
163
164    /// Host context snapshot (kernel, CPU, memory, tunables).
165    /// Optional because older tools or synthetic fixtures may
166    /// omit it; comparison degrades to a "host context unavailable"
167    /// line rather than failing the whole compare when either
168    /// side is missing.
169    pub host: Option<crate::host_context::HostContext>,
170
171    /// One entry per observed thread on the host at capture time.
172    /// Order is not load-bearing; the comparison pipeline groups
173    /// by `pcomm` / `cgroup` / `comm` depending on `--group-by`.
174    pub threads: Vec<ThreadState>,
175
176    /// Enrichment metadata for every cgroup that at least one
177    /// sampled thread resides in. Keyed by the cgroup path
178    /// relative to the v2 mount (e.g.
179    /// `/kubepods/burstable/pod-<id>/container`). Populated from
180    /// the cgroup filesystem, not the per-thread sample, because
181    /// cpu.stat / memory.current describe the cgroup's aggregate
182    /// state, not per-thread contribution.
183    pub cgroup_stats: BTreeMap<String, CgroupStats>,
184
185    /// Probe outcome statistics for the snapshot, when the probe
186    /// pass ran. `None` indicates the snapshot was assembled
187    /// without the per-tgid jemalloc probe walk (synthetic-tree
188    /// tests pass `use_syscall_affinity=false` to skip it).
189    /// `Some(_)` carries the per-snapshot tally — see
190    /// [`CtprofProbeSummary`] for the curated field set.
191    pub probe_summary: Option<CtprofProbeSummary>,
192
193    /// Procfs-read failure statistics for the snapshot, when the
194    /// capture pass ran in production mode. Mirrors the
195    /// `probe_summary` discipline: `None` indicates synthetic-tree
196    /// tests skipped it (`use_syscall_affinity=false`); `Some(_)`
197    /// carries the per-snapshot read-level failure tally — see
198    /// [`CtprofParseSummary`].
199    pub parse_summary: Option<CtprofParseSummary>,
200
201    /// Per-snapshot taskstats genetlink query outcome tally,
202    /// populated when the capture pass ran in production mode.
203    /// `None` mirrors `probe_summary` / `parse_summary`:
204    /// synthetic-tree tests pass `use_syscall_affinity=false`
205    /// which skips the netlink path entirely. `Some(_)` carries
206    /// the per-snapshot ok/eperm/esrch/other counts so an operator
207    /// can distinguish "no taskstats data because every tid raced
208    /// exit" (high `esrch_count`) from "no taskstats data because
209    /// the kernel was built without `CONFIG_TASKSTATS`" (the
210    /// netlink open failed up-front so every counter is zero)
211    /// from "no taskstats data because `CAP_NET_ADMIN` is missing"
212    /// (high `eperm_count`). See `crate::taskstats::TaskstatsSummary`
213    /// for the per-counter semantics and remediation guidance.
214    pub taskstats_summary: Option<crate::taskstats::TaskstatsSummary>,
215
216    /// Host-level Pressure Stall Information, populated from
217    /// `<proc_root>/pressure/{cpu,memory,io,irq}`. Captures
218    /// system-wide stall pressure across the four kernel-exposed
219    /// resources. Defaults to all-zero when the kernel has
220    /// CONFIG_PSI off or when individual resource files are
221    /// absent. See [`Psi`] for the per-resource shape and the
222    /// system-level cpu.full / irq.some caveats.
223    pub psi: Psi,
224
225    /// Global sched_ext sysfs state from `/sys/kernel/sched_ext/`.
226    /// `None` when CONFIG_SCHED_CLASS_EXT is not built (no
227    /// `sched_ext` sysfs directory exists), or when the
228    /// directory itself is unreadable. See [`SchedExtSysfs`]
229    /// for the per-field shape and kernel cites. Populated
230    /// during the same capture pass as PSI.
231    pub sched_ext: Option<SchedExtSysfs>,
232}
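
// Editorial sketch, not part of the original module: one way a consumer
// might triage the three "no taskstats data" cases described on
// `CtprofSnapshot::taskstats_summary`. `SketchTaskstatsCounts` is a
// hypothetical stand-in for `crate::taskstats::TaskstatsSummary`; only
// `eperm_count` and `esrch_count` are named in the docs above. The other
// field names and the "largest failure bucket wins" rule are assumptions.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, Default)]
struct SketchTaskstatsCounts {
    ok_count: u64,    // assumed name
    eperm_count: u64,
    esrch_count: u64,
    other_count: u64, // assumed name
}

#[allow(dead_code)]
fn sketch_taskstats_diagnosis(c: SketchTaskstatsCounts) -> &'static str {
    if c.ok_count > 0 {
        return "taskstats data present";
    }
    if c.eperm_count == 0 && c.esrch_count == 0 && c.other_count == 0 {
        // Every counter is zero: the netlink open failed up-front.
        return "netlink unavailable (kernel may lack CONFIG_TASKSTATS)";
    }
    // No successful query: report the largest failure bucket.
    if c.eperm_count >= c.esrch_count && c.eperm_count >= c.other_count {
        "mostly EPERM (CAP_NET_ADMIN may be missing)"
    } else if c.esrch_count >= c.other_count {
        "mostly ESRCH (tids raced exit)"
    } else {
        "mostly other errors"
    }
}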
233
234/// Per-snapshot probe outcome statistics. Curated projection of
235/// the capture pipeline's internal probe tally — exposes the
236/// counters, the dominant failure tag, and a `privilege_dominant`
237/// boolean a downstream consumer needs to decide whether the
238/// snapshot's `allocated_bytes` / `deallocated_bytes` fields are
239/// trustworthy on a given host without parsing the operator-
240/// facing tracing line.
241///
242/// The internal probe taxonomy (the per-variant
243/// `host_thread_probe::AttachError` and `ProbeError` enums) is
244/// deliberately NOT mirrored here — it is implementation
245/// detail that may change shape without breaking this contract.
246/// `dominant_failure` carries the operator-facing tag string
247/// (e.g. `"ptrace-seize"`, `"dwarf-parse-failure"`) that the
248/// capture pipeline already surfaces in its tracing summary; the
249/// stable token format is documented in the `ktstr ctprof
250/// capture` CLI help. `privilege_dominant` mirrors the same gate
251/// that prints the EPERM remediation hint — true when ≥ 50% of
252/// `failed` is `ptrace-seize` or `ptrace-interrupt`.
253///
254/// The four counters are zero when the probe pass reached zero
255/// tgids (e.g. an empty `proc_root`); `dominant_failure` is
256/// `None` when no actionable failures landed; `privilege_dominant`
257/// is `false` when there are no failures or when ptrace failures
258/// are strictly less than half of `failed` (the `>= 50%` gate
259/// accepts equality at the boundary).
260///
261/// # Examples
262///
263/// ```no_run
264/// let snap = ktstr::ctprof::capture();
265/// if let Some(ps) = &snap.probe_summary {
266///     if let Some(hint) = ps.remediation_hint() {
267///         eprintln!("{hint}");
268///     }
269///     if let Some(tag) = &ps.dominant_failure {
270///         eprintln!("dominant failure: {tag}");
271///     }
272/// }
273/// ```
274#[derive(Debug, Clone, Default, serde::Serialize, serde::Deserialize)]
275#[non_exhaustive]
276pub struct CtprofProbeSummary {
277    /// Total tgids the probe pass walked. Equals the number of
278    /// `/proc/<pid>` directories the capture saw, minus the
279    /// calling process's own tgid (which is skipped because
280    /// `PTRACE_SEIZE` rejects self-attach).
281    pub tgids_walked: u64,
282    /// Tgids whose `attach_jemalloc_at` call succeeded — i.e.
283    /// the target was identified as jemalloc-linked, the TSD
284    /// symbol resolved, and the per-`tsd_s` field offsets came
285    /// out of the DWARF walk. A subset of `tgids_walked`.
286    pub jemalloc_detected: u64,
287    /// Per-thread probe reads that returned a counter pair.
288    /// Bounded above by the sum of thread counts across all
289    /// `jemalloc_detected` tgids; per-thread failures (target
290    /// thread exited mid-attach, EPERM, etc.) reduce this count
291    /// below the upper bound.
292    pub probed_ok: u64,
293    /// Attach-or-probe failures whose tag is classified
294    /// ACTIONABLE — see the `ktstr ctprof capture` CLI help
295    /// for the full filter rule and tag taxonomy. Routine
296    /// non-actionable outcomes (target not jemalloc-linked,
297    /// `readlink` race-with-exit) do NOT contribute to this
298    /// count.
299    pub failed: u64,
300    /// Tag string for the most-frequent actionable failure across
301    /// all attach-and-probe failures. `None` when `failed == 0`.
302    /// Stable single-word identifiers — the wire contract that
303    /// downstream consumers match against. The full taxonomy is
304    /// documented in the `ktstr ctprof capture` CLI help.
305    /// Examples: `"ptrace-seize"`, `"dwarf-parse-failure"`,
306    /// `"jemalloc-in-dso"`.
307    pub dominant_failure: Option<String>,
308    /// `true` when the ptrace failure share crosses the
309    /// hint-trigger threshold (≥ 50% of `failed` is `ptrace-seize`
310    /// or `ptrace-interrupt`). Mirrors the same gate that prints
311    /// the EPERM remediation hint in the operator-facing tracing
312    /// summary, so a downstream consumer can reproduce that
313    /// signal without parsing the log line. When `true`,
314    /// rerunning the capture binary with `CAP_SYS_PTRACE`
315    /// (e.g. `sudo setcap cap_sys_ptrace+eip $(which ktstr)`,
316    /// or run as root, or `sysctl kernel.yama.ptrace_scope=0`)
317    /// resolves most attach failures so jemalloc TSD attach
318    /// succeeds across foreign tgids. `false` when
319    /// `failed == 0` (no failures to dominate) or when ptrace
320    /// failures are strictly less than half of `failed` (the
321    /// `>= 50%` gate accepts equality at the boundary).
322    ///
323    /// Independent of [`Self::dominant_failure`]: ptrace failures
324    /// are tallied across both `ptrace-seize` and
325    /// `ptrace-interrupt` for the threshold, while
326    /// `dominant_failure` reports a single per-tag plurality.
327    /// When ptrace counts split across the two tags,
328    /// `privilege_dominant` may be `true` while
329    /// `dominant_failure` names a non-ptrace tag that won the
330    /// single-tag plurality. Conversely, `dominant_failure` may
331    /// name a ptrace tag while `privilege_dominant` is `false`
332    /// when ptrace failures are below the 50% threshold.
333    pub privilege_dominant: bool,
334}
335
336impl CtprofProbeSummary {
337    /// Operator-facing remediation hint when ptrace failures
338    /// dominate the snapshot. Returns `Some(&'static str)` —
339    /// the same `PTRACE_EPERM_HINT` constant the capture
340    /// pipeline embeds in its tracing summary line (a one-liner
341    /// naming `cap_sys_ptrace` — the `setcap`-form spelling of
342    /// the capability — and `kernel.yama.ptrace_scope`), or
343    /// `None` when [`Self::privilege_dominant`] is false. Lets a
344    /// downstream consumer surface the same fix-it message
345    /// without parsing the log line or hand-rolling the gate.
346    pub fn remediation_hint(&self) -> Option<&'static str> {
347        if self.privilege_dominant {
348            Some(PTRACE_EPERM_HINT)
349        } else {
350            None
351        }
352    }
353}
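
// Editorial sketch, not the capture pipeline's actual code: it illustrates
// why `privilege_dominant` and `dominant_failure` are independent, as the
// field docs on `CtprofProbeSummary` describe. `sketch_probe_verdict` is a
// hypothetical helper. It assumes the tally holds only actionable failure
// tags, and it breaks plurality ties toward the alphabetically-earlier tag,
// which is an assumption the probe docs do not state.
#[allow(dead_code)]
fn sketch_probe_verdict(tally: &[(&str, u64)]) -> (Option<String>, bool) {
    let failed: u64 = tally.iter().map(|&(_, n)| n).sum();
    // The threshold pools both ptrace tags.
    let ptrace: u64 = tally
        .iter()
        .filter(|&&(tag, _)| tag == "ptrace-seize" || tag == "ptrace-interrupt")
        .map(|&(_, n)| n)
        .sum();
    // `>= 50%` in integer arithmetic; equality at the boundary passes.
    let privilege_dominant = failed > 0 && ptrace.saturating_mul(2) >= failed;
    // The dominant tag is a single-tag plurality, computed per tag.
    let mut dominant: Option<(&str, u64)> = None;
    for &(tag, n) in tally {
        if n == 0 {
            continue;
        }
        let better = match dominant {
            None => true,
            Some((best_tag, best_n)) => n > best_n || (n == best_n && tag < best_tag),
        };
        if better {
            dominant = Some((tag, n));
        }
    }
    (dominant.map(|(tag, _)| tag.to_string()), privilege_dominant)
}
// With ptrace-seize=2, ptrace-interrupt=2, dwarf-parse-failure=3 the pooled
// ptrace share is 4 of 7, so the gate fires while the plurality tag is
// `dwarf-parse-failure`.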
354
355/// Per-snapshot procfs read-failure statistics. Curated projection
356/// of the capture pipeline's internal read-tally — exposes per-file
357/// counters and a dominant-failure tag a downstream consumer needs
358/// to decide whether the snapshot's procfs-derived fields (CSW,
359/// schedstats, IO, etc.) are trustworthy on a given host without
360/// scanning every thread for default values.
361///
362/// The read-failure tally ([`Self::read_failures`] /
363/// [`Self::read_failures_by_file`]) is read-level only — it
364/// counts failures of `fs::read_to_string` against
365/// `/proc/<tgid>/task/<tid>/<file>`, not per-field parse failures
366/// inside an otherwise-readable file.
367/// A present-but-malformed file (e.g. a corrupt `stat` whose
368/// `parse_stat` returns all-`None`) does NOT count: the file read
369/// succeeded so the tally stays at zero for that category, even
370/// though the per-field parsers fold every value to its absent-
371/// counter default. Read failures correspond to the kernel never
372/// having written the file (ENOENT / kernel without
373/// `CONFIG_SCHED_INFO`), the file disappearing mid-capture (race),
374/// or any other I/O-level error from the procfs reader. A snapshot
375/// with 1 K schedstat failures across 1 K tids implies a kernel
376/// build without `CONFIG_SCHED_INFO`; 47 stat failures across 1 K
377/// tids implies mid-capture races.
378///
379/// One parse-level signal IS surfaced separately:
380/// [`Self::negative_dotted_values`] counts the per-line cases in
381/// `/proc/<tid>/sched` where the kernel's PN_SCHEDSTAT format
382/// emitted a leading `-` — a rare but observable clock-skew /
383/// suspend-resume artifact that the parser otherwise folds
/// silently to zero. Other forms of per-field corruption
/// (non-numeric fractional, malformed key, …) stay outside this
386/// summary's scope and surface as zero values on the affected
387/// `ThreadState` fields.
388///
389/// Per-file tokens in [`Self::read_failures_by_file`] are stable
390/// kebab-case identifiers downstream consumers match against. The
391/// recognized set: `"stat"`, `"schedstat"`, `"io"`, `"status"`,
392/// `"sched"`, `"cgroup"`, `"smaps_rollup"`. Adding a new procfs
393/// file to the capture adds a new key; the wire shape carries
394/// any token the capture emitted, so a consumer that only knows
395/// the existing set absorbs new keys without breaking.
396///
397/// Ghost-filtered tids do NOT contribute to `read_failures` /
398/// `read_failures_by_file` — their pending failure bumps are
399/// unwound via `discard_pending` when a thread ends up filtered
400/// out of `threads` (empty comm + zero start_time), so a busy
401/// host with mid-capture exits doesn't inflate the failure tallies
402/// with counts that would correspond to threads the snapshot
403/// doesn't even contain. `tids_walked` still counts every walk
404/// attempt regardless of the ghost filter outcome.
405///
406/// # Examples
407///
408/// ```no_run
409/// let snap = ktstr::ctprof::capture();
410/// if let Some(ps) = &snap.parse_summary
411///     && let Some(hint) = ps.kernel_config_hint()
412/// {
413///     eprintln!("{hint}");
414/// }
415/// ```
416#[derive(Debug, Clone, Default, serde::Serialize, serde::Deserialize)]
417#[non_exhaustive]
418pub struct CtprofParseSummary {
419    /// Total tids the capture pass attempted to read across every
420    /// tgid. Non-zero whenever the capture walked any tid; the
421    /// denominator a downstream consumer uses to compute "what
422    /// fraction of reads failed" without parsing the operator-
423    /// facing tracing line.
424    pub tids_walked: u64,
425    /// Total file-level read failures across all categories. Sum
426    /// of [`Self::read_failures_by_file`] values.
427    pub read_failures: u64,
428    /// Per-file-kind failure tally, keyed by stable kebab tokens
429    /// (`"stat"`, `"schedstat"`, `"io"`, `"status"`, `"sched"`,
430    /// `"cgroup"`, `"smaps_rollup"`). Empty map when the capture
431    /// saw zero failures. Keys present in the map have non-zero
432    /// counts; absent keys imply zero failures for that category,
433    /// NOT "category unknown".
434    pub read_failures_by_file: BTreeMap<String, u64>,
435    /// Tag string for the file kind with the most read failures
436    /// across the snapshot. `None` when `read_failures == 0`.
437    /// Stable kebab tokens (the same vocabulary
438    /// [`Self::read_failures_by_file`] keys against). Ties resolve
    /// alphabetically so the output is deterministic: the
    /// alphabetically-earlier tag wins (e.g. `"io"` beats
    /// `"status"` when both counts are equal).
442    pub dominant_read_failure: Option<String>,
443    /// `true` when ≥ 50% of `read_failures` are concentrated in
444    /// kernel-config-gated files (`"schedstat"`, `"io"`). These
445    /// two files are absent on kernels built without
446    /// `CONFIG_SCHED_INFO` / `CONFIG_TASK_IO_ACCOUNTING`
447    /// respectively, so a dominance signal here points the
448    /// operator at a kernel build/config issue rather than a
449    /// transient race or permission problem. `false` when
450    /// `read_failures == 0` or when failures are spread across
451    /// non-kconfig files.
452    pub kernel_config_dominant: bool,
453    /// Number of `/proc/<tid>/sched` PN_SCHEDSTAT dotted-ns
454    /// values whose integer part read as negative (kernel emitted
455    /// a leading `-`, e.g. `-5.000000`). The capture-side parser
456    /// (`parsed_ns_from_dotted`) rejects negative integer parts —
457    /// a `u64` parse cannot accept the sign — and the call site
458    /// then `unwrap_or(0)`s the resulting `None` per the
459    /// best-effort capture contract. Without this counter the
460    /// silent fold to zero leaves operators with no visibility
461    /// into the rate at which schedstat values were silently
462    /// truncated.
463    ///
464    /// Counts per-field-occurrence, NOT per-thread: a single
465    /// tid that exposed five negative dotted fields contributes
466    /// `5` to this counter (e.g. one tid with negative `wait_sum`,
467    /// `sleep_max`, `block_sum`, `iowait_sum`, and `exec_max`
468    /// adds 5). The denominator for "fraction of tids affected"
469    /// is therefore NOT this field — pair with
470    /// [`Self::tids_walked`] only as an upper bound on
471    /// affected-tid count.
472    ///
473    /// Distinct from [`Self::read_failures`]: a negative dotted
474    /// value comes from a `sched` file that READ successfully —
475    /// it is a parse-level signal, not a read-level signal. The
476    /// field stays at zero on a clean host because the kernel
477    /// emits non-negative values on every well-behaved schedstat
478    /// path; non-zero values are most commonly the result of
479    /// clock-skew on suspend/resume where a `delta` calculation
480    /// against a stale baseline lands negative.
481    ///
482    /// Ghost-filter discipline: per-tid bumps are held pending
483    /// (alongside the read-failure bumps in
484    /// [`crate::ctprof`]'s capture-side `ParseTally`), and
485    /// unwound via `discard_pending` when the surrounding tid is
486    /// rejected by the empty-comm + zero-start ghost filter so a
487    /// busy host with mid-capture exits doesn't inflate this
488    /// counter with bumps that correspond to threads the snapshot
489    /// doesn't even contain.
490    pub negative_dotted_values: u64,
491}
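
// Editorial sketch, not the module's `parsed_ns_from_dotted`: it shows how a
// negative dotted value can fold to zero while still being counted, as the
// `negative_dotted_values` docs describe. Both helpers are hypothetical.
// Assumption: the dotted form is `<milliseconds>.<6-digit ns remainder>`,
// matching the kernel's `%Ld.%06ld` schedstat formatting.
#[allow(dead_code)]
fn sketch_parse_dotted_ns(field: &str) -> Option<u64> {
    let (int_part, frac_part) = field.trim().split_once('.')?;
    if frac_part.len() != 6 || !frac_part.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    // A leading '-' makes the `u64` parse fail, which rejects the value.
    let ms: u64 = int_part.parse().ok()?;
    let rem_ns: u64 = frac_part.parse().ok()?;
    ms.checked_mul(1_000_000)?.checked_add(rem_ns)
}

// Folds every unparsable field to 0 (best-effort contract) and counts the
// fields that carried a leading '-', one bump per field occurrence.
#[allow(dead_code)]
fn sketch_fold_dotted(fields: &[&str]) -> (Vec<u64>, u64) {
    let mut negative = 0u64;
    let values: Vec<u64> = fields
        .iter()
        .map(|f| {
            if f.trim_start().starts_with('-') {
                negative += 1;
            }
            sketch_parse_dotted_ns(f).unwrap_or(0)
        })
        .collect();
    (values, negative)
}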
492
493impl CtprofParseSummary {
494    /// Operator-facing hint when kernel-config-gated file failures
495    /// dominate the snapshot. Returns `Some(&'static str)` naming
496    /// the two `CONFIG_*` knobs that gate the affected files
497    /// (`CONFIG_SCHED_INFO` for `schedstat`, `CONFIG_TASK_IO_ACCOUNTING`
498    /// for `io`), or `None` when [`Self::kernel_config_dominant`]
499    /// is `false`. Lets a downstream consumer surface a remediation
500    /// pointer without parsing the log line or hand-rolling the
501    /// gate, mirroring the [`CtprofProbeSummary::remediation_hint`]
502    /// pattern.
503    pub fn kernel_config_hint(&self) -> Option<&'static str> {
504        if self.kernel_config_dominant {
505            Some(PARSE_KCONFIG_HINT)
506        } else {
507            None
508        }
509    }
510}
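
// Editorial sketch, not the capture pipeline's actual code: it derives the
// `dominant_read_failure` and `kernel_config_dominant` signals from a
// per-file tally, following the rules stated on `CtprofParseSummary`
// (alphabetically-earlier tag wins ties; `>= 50%` of failures in
// `schedstat` / `io` trips the kernel-config gate). The helper name is
// hypothetical.
#[allow(dead_code)]
fn sketch_read_failure_verdict(
    by_file: &std::collections::BTreeMap<String, u64>,
) -> (Option<String>, bool) {
    let total: u64 = by_file.values().sum();
    let mut dominant: Option<(&str, u64)> = None;
    for (file, &n) in by_file {
        // BTreeMap iterates keys in ascending order, so a strict `>` keeps
        // the alphabetically-earlier key when two counts are equal.
        if n > 0 && dominant.map_or(true, |(_, best)| n > best) {
            dominant = Some((file.as_str(), n));
        }
    }
    let kconfig: u64 = ["schedstat", "io"]
        .iter()
        .filter_map(|k| by_file.get(*k))
        .sum();
    let kernel_config_dominant = total > 0 && kconfig.saturating_mul(2) >= total;
    (dominant.map(|(file, _)| file.to_string()), kernel_config_dominant)
}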
511
512/// Stable kernel-config remediation hint for parse summaries.
513/// Names the two procfs files that disappear on kernels built
514/// without the corresponding `CONFIG_*` knobs.
515const PARSE_KCONFIG_HINT: &str = "hint: schedstat / io read failures dominate — \
516                                  kernel may be built without CONFIG_SCHED_INFO \
517                                  and/or CONFIG_TASK_IO_ACCOUNTING";
518
519/// Absent-value sentinel for [`ThreadState::state`]. Used by both
520/// the manual [`Default`] impl on [`ThreadState`] and the
521/// `serde(default = ...)` attribute on the field so the absent
522/// state is `'~'` regardless of how a [`ThreadState`] gets
523/// constructed (default-built test fixture, partial JSON
524/// deserialize, capture-time `unwrap_or` fallback).
525///
526/// `'~'` (U+007E = 126) is chosen specifically because it sorts
527/// strictly AFTER every entry in `fs/proc/array.c::task_state_array`
528/// — `R` (82), `S` (83), `D` (68), `T` (84), `t` (116), `X`
529/// (88), `Z` (90), `P` (80), `I` (73) all have lower codepoints.
530/// [`crate::ctprof_compare::aggregate`] breaks the
531/// categorical-mode count-ties (rules
532/// [`crate::ctprof_compare::AggRule::Mode`] /
533/// [`crate::ctprof_compare::AggRule::ModeChar`] /
534/// [`crate::ctprof_compare::AggRule::ModeBool`]) toward the
535/// LEX-SMALLEST candidate (the closure
536/// `a.1.cmp(&b.1).then(b.0.cmp(&a.0))` inside the
537/// `Modeable::mode_across` reduction), so a sentinel smaller
538/// than the real letters would HIJACK the tiebreak whenever a
539/// default-built thread sat alongside a real one in the same
540/// group. `'~'` is larger than all of them, so the real kernel
541/// letter always wins the tie.
542///
543/// `'?'` (U+003F = 63) was the obvious-looking pick but is
544/// numerically SMALLER than every state letter the kernel
545/// emits, which would make it a tiebreak hijacker rather than
546/// a safe sentinel. Avoid.
547fn default_state_char() -> char {
548    '~'
549}
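
// Editorial sketch, not the crate's `Modeable::mode_across`: it reuses the
// comparator quoted in the doc comment above to show why `'~'` is a safe
// absent-state sentinel and `'?'` is not. `sketch_mode_char` is a
// hypothetical helper; only the comparator comes from the source.
#[allow(dead_code)]
fn sketch_mode_char(samples: &[char]) -> Option<char> {
    use std::collections::BTreeMap;
    let mut counts: BTreeMap<char, usize> = BTreeMap::new();
    for &c in samples {
        *counts.entry(c).or_insert(0) += 1;
    }
    counts
        .into_iter()
        // Higher count wins; on a count tie the reversed key comparison
        // ranks the smaller char as "greater", so the lex-smallest wins.
        .max_by(|a, b| a.1.cmp(&b.1).then(b.0.cmp(&a.0)))
        .map(|(c, _)| c)
}
// One real `'S'` tied with one `'~'` resolves to `'S'`; the same tie
// against `'?'` resolves to `'?'`, which is the hijack described above.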
550
551/// Per-thread resource profile.
552///
/// Populated by the capture layer from `/proc/<tid>/{sched,
/// schedstat,status,io,stat,comm,cgroup}`, `sched_getaffinity`,
/// the taskstats
555/// genetlink path (delay-accounting + memory-watermark fields),
556/// and (for jemalloc-linked processes only, via ptrace +
557/// `process_vm_readv`) the per-thread `tsd_s.thread_allocated` /
558/// `thread_deallocated` TLS counters.
559///
560/// Field families (mirrors the module-level breakdown, with
561/// the registry-pairing reductions named):
562///
563/// - **Cumulative counters and totals** (the majority): wakeups,
564///   migrations, csw, run/wait/sleep/block/iowait time,
565///   schedstat counts, page-fault counters, syscall counters,
566///   byte counters, the taskstats per-bucket `*_count` and
567///   `*_delay_total_ns`, and the jemalloc per-thread
568///   allocated/deallocated TSD counters. Probe-timing invariant
569///   modulo monotonic forward progress; reduced via the
570///   `Sum*` rules.
571/// - **Lifetime high-water peaks**: schedstat `*_max` family,
572///   every taskstats `*_delay_max_ns` / `*_delay_min_ns`, and
573///   the memory watermarks ([`Self::hiwater_rss_bytes`],
574///   [`Self::hiwater_vm_bytes`]). Non-decreasing-over-time but
575///   per-event extrema, so non-summable across threads; the
576///   registry reduces them via `MaxPeak` / `MaxPeakBytes`.
577/// - **Instantaneous gauges** (sensitive to probe timing):
578///   [`Self::nr_threads`] (signal_struct->nr_threads snapshot),
579///   [`Self::fair_slice_ns`] (instantaneous `p->se.slice`),
580///   and [`Self::state`] (task_state_array letter). Two probes
581///   of the same thread at different instants can legitimately
582///   produce different values. Reduced via `MaxGaugeCount` /
583///   `MaxGaugeNs` / `ModeChar`.
584/// - **Categorical / ordinal scalars** (point-in-time
585///   snapshots): [`Self::policy`], [`Self::nice`],
586///   [`Self::priority`], [`Self::processor`],
587///   [`Self::rt_priority`], plus the identity strings
588///   ([`Self::pcomm`], [`Self::comm`], [`Self::cgroup`]) and
589///   the [`crate::metric_types::CpuSet`]
590///   [`Self::cpu_affinity`]. Sampled at capture time and can
591///   change at runtime (e.g. `sched_setaffinity` mid-run flips
592///   `processor` and `cpu_affinity`); reduced via `Mode*` /
593///   `Range*` / `Affinity`.
///
/// Same family taxonomy as the module-level block at the top of
/// the file; the per-field docs flag the family on each entry
/// and the registry's [`AggRule`] pairing makes the
/// "category-mismatched aggregation is a compile error"
/// invariant load-bearing.
///
/// [`AggRule`]: crate::ctprof_compare::AggRule
///
/// `Default` is implemented manually rather than derived because
/// the [`Self::state`] field needs `'~'` (the absent-value
/// sentinel) instead of `'\0'` (the `char` Default). See the
/// field doc on [`Self::state`] for why: `'\0'` lex-compares
/// SMALLER than every real kernel state letter, which would
/// poison [`crate::ctprof_compare::AggRule::ModeChar`]
/// tie-breaks toward "absent" whenever a default-constructed
/// thread sat alongside a real one in a group.
#[derive(Debug, Clone, serde::Serialize, serde::Deserialize)]
#[non_exhaustive]
pub struct ThreadState {
    // -- identity --
    /// Kernel task id. Ephemeral across runs; not used as a
    /// grouping axis.
    pub tid: u32,
    /// Thread group id (process id). Ephemeral across runs.
    pub tgid: u32,
    /// Process name, read from `/proc/<tgid>/comm`. Stable across
    /// runs on the same build. Feeds the grouping key under
    /// `--group-by pcomm` (default), where it flows through the
    /// token-based [`crate::ctprof_compare::pattern_key`]
    /// normalizer so ephemeral worker pools (`worker-0`,
    /// `worker-1`, ...) collapse into a single `worker-{N}`
    /// bucket; pass `--no-thread-normalize` to group by literal
    /// pcomm. Also feeds the smaps_rollup join key (with the same
    /// normalization rules) so per-process memory rows survive
    /// PID churn across snapshots.
    pub pcomm: String,
    /// Thread name, read from `/proc/<tid>/comm`. Stable when the
    /// runtime assigns deterministic names (worker pools, async
    /// runtimes). Feeds the grouping key under `--group-by comm`,
    /// where it flows through the token-based
    /// [`crate::ctprof_compare::pattern_key`] normalizer (same
    /// rules as pcomm). Pass `--no-thread-normalize` to group by
    /// literal comm, or `--group-by comm-exact` for the same
    /// effect on this axis only (smaps still normalizes).
    pub comm: String,
    /// Cgroup v2 path.
    ///
    /// # Namespace semantics
    ///
    /// The path is read verbatim from `/proc/<tid>/cgroup` and
    /// is therefore relative to the CGROUP NAMESPACE ROOT the
    /// capturing process sees — NOT relative to the
    /// system-global v2 mount root. A process outside the
    /// capturing namespace would see the same cgroup under a
    /// different path (prefixed with the namespace-root ancestors
    /// the inner view hides); a process inside a nested cgroup
    /// namespace sees a truncated path. Cross-namespace
    /// comparison requires external canonicalization (e.g.
    /// resolving via `cgroup.procs` inode chains or walking
    /// `/proc/<tid>/ns/cgroup` to the common root) — the
    /// capture layer deliberately does NOT attempt this because
    /// the resolution depends on capture-site privilege and
    /// namespace visibility that varies per caller.
    ///
    /// Kept as `cgroup` (not renamed to `cgroup_ns_relative`)
    /// for consistency with `GroupBy::Cgroup`,
    /// `cgroup_flatten`, `cgroup_stats`, and every CLI flag
    /// that threads the same concept through the comparison
    /// layer; a rename would cascade through every pinned
    /// string in the compare pipeline without improving the
    /// semantic guarantee. This doc is the canonical
    /// documentation of the namespace-relative contract.
    pub cgroup: String,
    /// `/proc/<tid>/stat` field 22 (`start_time`) in USER_HZ
    /// clock ticks since system boot. The kernel exports this
    /// field in USER_HZ units (defined in
    /// `include/asm-generic/param.h` as `USER_HZ == 100` on
    /// every architecture the capture layer targets — x86_64
    /// and aarch64) — NOT raw internal jiffies, which scale
    /// with CONFIG_HZ. Cross-host comparison between x86_64 and
    /// aarch64 is meaningful because USER_HZ is the same 100 on
    /// both, so a diff between two hosts on different CONFIG_HZ
    /// settings still compares correctly. Seconds-since-boot
    /// is simply `start_time_clock_ticks / 100` on those
    /// architectures. Other in-tree architectures carry
    /// different USER_HZ (alpha defines 1024, for instance);
    /// a future port must either restate the divisor or
    /// normalise at capture time. `fs/proc/array.c::do_task_stat`
    /// is where the kernel writes the field to procfs.
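    ///
    /// A minimal sketch of that conversion, assuming the
    /// `USER_HZ == 100` divisor stated above (the `USER_HZ`
    /// constant and the sample tick value are illustrative, not
    /// part of the capture API):
    ///
    /// ```rust
    /// // Assumption: USER_HZ is 100, as on x86_64 and aarch64.
    /// const USER_HZ: u64 = 100;
    /// let start_time_clock_ticks: u64 = 123_456;
    /// // Integer division truncates the sub-second remainder.
    /// let secs_since_boot = start_time_clock_ticks / USER_HZ;
    /// assert_eq!(secs_since_boot, 1_234);
    /// ```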
    ///
    /// Stored as raw `u64`, NOT wrapped in
    /// [`crate::metric_types::ClockTicks`], because this field
    /// is an identity / ghost-thread sentinel rather than a
    /// metric that flows through the aggregation pipeline. The
    /// ghost-filter in `capture_with` / `capture_pid_with`
    /// keys on `start_time_clock_ticks == 0` (alongside an
    /// empty `comm`) to drop `ThreadState`s assembled from a
    /// tid that exited mid-capture, which is cleaner against a
    /// raw `u64` than against a wrapped sentinel.
    pub start_time_clock_ticks: u64,
    /// Scheduling policy (SCHED_OTHER, SCHED_FIFO, SCHED_RR,
    /// SCHED_BATCH, SCHED_IDLE, SCHED_DEADLINE, SCHED_EXT). Stored
    /// as the canonical name string rather than the kernel
    /// integer so comparison output is human-readable without a
    /// reverse-lookup table. Wrapped in
    /// [`crate::metric_types::CategoricalString`] so the
    /// aggregation pipeline reduces by mode (most-frequent value)
    /// rather than a category-mismatched sum or max.
    pub policy: crate::metric_types::CategoricalString,
    /// Nice value in the standard [-20, 19] range. Signed i32
    /// because the range includes negative values and
    /// `parse_stat` extracts the field via `get_i32` on
    /// procfs's decimal text — the inner type matches the
    /// extraction path and the kernel-visible range without
    /// coercion. Wrapped in [`crate::metric_types::OrdinalI32`]
    /// so the aggregation pipeline reduces by `[min, max]` range
    /// rather than sum.
    pub nice: crate::metric_types::OrdinalI32,
    /// Allowed CPU set from `sched_getaffinity`. Sorted ascending.
    /// Comparison aggregates via union across the group and
    /// renders as "N cpus (range)" or "mixed" for heterogeneous
    /// sets — see [`crate::ctprof_compare::AffinitySummary`].
    /// Wrapped in [`crate::metric_types::CpuSet`] so the
    /// aggregation pipeline routes through the dedicated
    /// affinity-summary reduction rather than a numeric path.
    pub cpu_affinity: crate::metric_types::CpuSet,

    // -- task state (last-CPU, run-state) --
    /// Last CPU the thread executed on. `/proc/<tid>/stat` field
    /// 39 (`task_cpu(task)` in `fs/proc/array.c::do_task_stat`,
    /// emitted via `seq_put_decimal_ll`). Signed for symmetry
    /// with [`Self::nice`]; the kernel emits non-negative values
    /// only — `task_cpu` (defined `unsigned int` in
    /// `include/linux/sched.h`) zero-extends through the
    /// `seq_put_decimal_ll` widening to `s64`. `0` is the
    /// absent-value default (collisions with a legitimate CPU 0
    /// are distinguished by inspecting `cpu_affinity`).
    /// Wrapped in [`crate::metric_types::OrdinalI32`] so the
    /// aggregation pipeline reduces by `[min, max]` range across
    /// the group.
    pub processor: crate::metric_types::OrdinalI32,
    /// Single-letter task state from `/proc/<tid>/status` `State:`
    /// line. Real kernel chars are `R`, `S`, `D`, `T`, `t`, `X`,
    /// `Z`, `P`, `I` (see `fs/proc/array.c::task_state_array`,
    /// emitted via `get_task_state`). `'~'` is the absent-value
    /// sentinel — visually distinct from every real kernel char
    /// so a downstream consumer can distinguish "no state read"
    /// from a real value. When `'~'` appears in compare output,
    /// the `/proc/<tid>/status` read failed (thread likely
    /// exited mid-capture).
    ///
    /// `ThreadState::default()`, the capture-time
    /// `unwrap_or_else(default_state_char)` fallback, and
    /// `serde(default)` deserialize of a partial JSON record all
    /// produce `'~'` (NOT `'\0'`, the bare `char` Default). The
    /// manual `Default` impl on `ThreadState`, the
    /// `unwrap_or_else` site in `capture_thread_at_with_tally`,
    /// and the `serde(default = ...)` attribute on this field
    /// are paired specifically so the absent-value sentinel is
    /// the same byte everywhere.
    ///
    /// `'~'` (U+007E = 126) is chosen so it sorts AFTER every
    /// real kernel state letter — `R` (82), `S` (83), `D` (68),
    /// `T` (84), `t` (116), `X` (88), `Z` (90), `P` (80), `I`
    /// (73). [`crate::ctprof_compare::AggRule::ModeChar`]
    /// breaks count-ties toward the LEX-SMALLEST candidate, so
    /// a sentinel smaller than the real letters would silently
    /// elect "absent" whenever a default-built thread sat
    /// alongside a real one in the same group. `'~'` being
    /// larger than all of them lets the real letter win the
    /// tie. The earlier `'?'` (U+003F = 63) sentinel was
    /// numerically smaller than every real state letter — a
    /// tiebreak hijacker; do not return to it.
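    ///
    /// The ordering argument can be checked directly. A minimal
    /// sketch using only `char` comparison (the `real` array is
    /// illustrative, not a crate constant):
    ///
    /// ```rust
    /// let real = ['R', 'S', 'D', 'T', 't', 'X', 'Z', 'P', 'I'];
    /// // '~' sorts after every real state letter, so a
    /// // lex-smallest tie-break never elects the sentinel.
    /// assert!(real.iter().all(|&c| c < '~'));
    /// // The rejected '?' and the bare char default sort first.
    /// assert!(real.iter().all(|&c| c > '?' && c > '\0'));
    /// ```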
    #[serde(default = "default_state_char")]
    pub state: char,

    // -- scheduling (cumulative + lifetime peaks; /proc/<tid>/sched schedstat fields, need CONFIG_SCHEDSTATS) --
    // -- (sched_ext gate: ext.enabled requires CONFIG_SCHED_CLASS_EXT) --
    /// `true` when the task is currently scheduled by sched_ext —
    /// `/proc/<tid>/sched` `ext.enabled` line. The kernel emits
    /// the literal key `ext.enabled` only when
    /// `CONFIG_SCHED_CLASS_EXT` is enabled; on kernels without it
    /// the field is absent and lands at the default `false`. When
    /// `false` on a task expected under sched_ext, the task may
    /// have been ejected (sched_ext fall-back to CFS on BPF error)
    /// or never enrolled.
    ///
    /// Stays a bare `bool` — not wrapped in a categorical newtype
    /// — because it is the only bool-valued metric in the
    /// registry. The
    /// [`crate::ctprof_compare::AggRule::ModeBool`] dispatch
    /// coerces it to a `String` via `to_string()`/`Display` at
    /// the call site (see the
    /// [`crate::metric_types::CategoricalString`] doc note: if a
    /// second bool-valued metric appears, promote both to a
    /// dedicated `CategoricalBool` wrapper rather than keeping
    /// the ad-hoc coercion).
    pub ext_enabled: bool,
    /// Cumulative on-CPU time, ns; `/proc/<tid>/schedstat`
    /// field 1. `MonotonicNs` per the lifetime-accumulator
    /// contract.
    pub run_time_ns: crate::metric_types::MonotonicNs,
    /// Cumulative time waiting on the runqueue, ns;
    /// `/proc/<tid>/schedstat` field 2. `MonotonicNs`.
    pub wait_time_ns: crate::metric_types::MonotonicNs,
    /// Number of times the task was scheduled onto a CPU;
    /// `/proc/<tid>/schedstat` field 3. `MonotonicCount`.
    pub timeslices: crate::metric_types::MonotonicCount,
    /// Voluntary context switches — task gave up the CPU itself;
    /// `/proc/<tid>/status` `voluntary_ctxt_switches`.
    /// `MonotonicCount`.
    pub voluntary_csw: crate::metric_types::MonotonicCount,
    /// Involuntary context switches — task was preempted;
    /// `/proc/<tid>/status` `nonvoluntary_ctxt_switches`.
    /// `MonotonicCount`.
    pub nonvoluntary_csw: crate::metric_types::MonotonicCount,
    /// Total wakeups via `try_to_wake_up()`; `/proc/<tid>/sched`
    /// `nr_wakeups`. `MonotonicCount`.
    pub nr_wakeups: crate::metric_types::MonotonicCount,
    /// Wakeups landed on the same CPU as the waker;
    /// `/proc/<tid>/sched` `nr_wakeups_local`. `MonotonicCount`.
    pub nr_wakeups_local: crate::metric_types::MonotonicCount,
    /// Wakeups landed on a different CPU than the waker;
    /// `/proc/<tid>/sched` `nr_wakeups_remote`. `MonotonicCount`.
    pub nr_wakeups_remote: crate::metric_types::MonotonicCount,
    /// `WF_SYNC` synchronous-wakeup hint count;
    /// `/proc/<tid>/sched` `nr_wakeups_sync`. `MonotonicCount`.
    pub nr_wakeups_sync: crate::metric_types::MonotonicCount,
    /// Wakeups where the task migrated to a different CPU than
    /// its prior one (`WF_MIGRATED`); `/proc/<tid>/sched`
    /// `nr_wakeups_migrate`. Distinct from `nr_wakeups_remote`
    /// (waker CPU != target CPU). `MonotonicCount`.
    pub nr_wakeups_migrate: crate::metric_types::MonotonicCount,
    /// Wakeups onto this CPU (cache-affine wakeup
    /// fast-path). `/proc/<tid>/sched` `nr_wakeups_affine`,
    /// emitted via `P_SCHEDSTAT`. Plain u64. Zero on kernels
    /// without `CONFIG_SCHEDSTATS`. Zero under sched_ext:
    /// `wake_affine` is a CFS-only path.
    pub nr_wakeups_affine: crate::metric_types::MonotonicCount,
    /// Total invocations of the cache-affine wakeup heuristic
    /// `wake_affine()` — denominator for the affine-wake success
    /// ratio (`nr_wakeups_affine / nr_wakeups_affine_attempts`).
    /// `/proc/<tid>/sched` `nr_wakeups_affine_attempts`, emitted
    /// via `P_SCHEDSTAT` (plain u64). The kernel increments this
    /// counter unconditionally on every `wake_affine()` call in
    /// `kernel/sched/fair.c::wake_affine`, then increments
    /// `nr_wakeups_affine` only when the heuristic chose this
    /// CPU — so the ratio is the success rate of the cache-
    /// affine fast-path. Zero on kernels without
    /// `CONFIG_SCHEDSTATS`. Zero under sched_ext: `wake_affine`
    /// is a CFS-only path and `kernel/sched/ext.c` does not
    /// increment this counter.
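    ///
    /// A minimal sketch of the ratio, guarding the zero
    /// denominator that appears without `CONFIG_SCHEDSTATS` or
    /// under sched_ext (`affine_success_ratio` is a hypothetical
    /// helper, not a crate function, and takes plain `u64` values
    /// rather than the `MonotonicCount` wrapper):
    ///
    /// ```rust
    /// fn affine_success_ratio(affine: u64, attempts: u64) -> Option<f64> {
    ///     // No attempts recorded: the ratio is undefined, not zero.
    ///     if attempts == 0 {
    ///         return None;
    ///     }
    ///     Some(affine as f64 / attempts as f64)
    /// }
    /// assert_eq!(affine_success_ratio(3, 4), Some(0.75));
    /// assert_eq!(affine_success_ratio(0, 0), None);
    /// ```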
    pub nr_wakeups_affine_attempts: crate::metric_types::MonotonicCount,
    /// Total cross-CPU migrations of the task. Incremented
    /// unconditionally in `kernel/sched/core.c` (`p->se.nr_migrations++`)
    /// — no schedstat macro, no class gating. Always populated
    /// regardless of `CONFIG_SCHEDSTATS` or scheduling class.
    /// `MonotonicCount`.
    pub nr_migrations: crate::metric_types::MonotonicCount,
    /// Migrations forced by load balance (the load balancer
    /// migrated the task even though the local heuristic would
    /// have skipped it). `/proc/<tid>/sched` `nr_forced_migrations`,
    /// plain u64 via `P_SCHEDSTAT`. Zero on kernels without
    /// `CONFIG_SCHEDSTATS`.
    pub nr_forced_migrations: crate::metric_types::MonotonicCount,
    /// Failed migrations attributed to affinity mismatch — the
    /// destination CPU was not in `cpus_allowed`. `/proc/<tid>/sched`
    /// `nr_failed_migrations_affine`, plain u64 via `P_SCHEDSTAT`.
    /// Zero on kernels without `CONFIG_SCHEDSTATS`.
    pub nr_failed_migrations_affine: crate::metric_types::MonotonicCount,
    /// Failed migrations attributed to the task being currently
    /// running on the source CPU. `/proc/<tid>/sched`
    /// `nr_failed_migrations_running`, plain u64 via `P_SCHEDSTAT`.
    /// Zero on kernels without `CONFIG_SCHEDSTATS`.
    pub nr_failed_migrations_running: crate::metric_types::MonotonicCount,
    /// Failed migrations attributed to cache-hot heuristic — the
    /// source CPU's cache was too hot to leave. `/proc/<tid>/sched`
    /// `nr_failed_migrations_hot`, plain u64 via `P_SCHEDSTAT`.
    /// Zero on kernels without `CONFIG_SCHEDSTATS`.
    pub nr_failed_migrations_hot: crate::metric_types::MonotonicCount,
    /// Total nanoseconds the task spent on the runqueue waiting
    /// to be picked. Populated from `/proc/<tid>/sched`'s
    /// `wait_sum` key — kernel emits via `PN_SCHEDSTAT` as
    /// `ms.ns_remainder`, reconstructed by the parser to full ns.
    /// Zero on kernels without `CONFIG_SCHEDSTATS`. Zero under
    /// sched_ext: the kernel updates this counter via
    /// `__update_stats_wait_end` (`kernel/sched/stats.c`), called
    /// from CFS/RT/DL paths only — `kernel/sched/ext.c` does not
    /// call that helper.
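    ///
    /// A minimal sketch of that reconstruction, assuming the
    /// dotted value is `<ms>.<ns_remainder>` with a six-digit
    /// zero-padded remainder (`ns_from_dotted` is a hypothetical
    /// stand-in, not the crate's parser):
    ///
    /// ```rust
    /// fn ns_from_dotted(s: &str) -> Option<u64> {
    ///     let (ms, rem) = s.trim().split_once('.')?;
    ///     // Assumption: the remainder is exactly six digits.
    ///     if rem.len() != 6 {
    ///         return None;
    ///     }
    ///     let ms: u64 = ms.parse().ok()?;
    ///     let rem: u64 = rem.parse().ok()?;
    ///     // Checked arithmetic so an oversized input yields None.
    ///     ms.checked_mul(1_000_000)?.checked_add(rem)
    /// }
    /// assert_eq!(ns_from_dotted("12.000345"), Some(12_000_345));
    /// assert_eq!(ns_from_dotted("12"), None);
    /// ```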
    pub wait_sum: crate::metric_types::MonotonicNs,
    /// Number of runqueue-wait windows the task accumulated —
    /// the per-event tally that pairs with [`Self::wait_sum`].
    /// Populated from `/proc/<tid>/sched`'s `wait_count` key
    /// (kernel emits as `P_SCHEDSTAT`, plain u64). Zero on
    /// kernels without `CONFIG_SCHEDSTATS`. Same write path as
    /// `wait_sum` (`__update_stats_wait_end` in
    /// `kernel/sched/stats.c`), so the same sched_ext caveat
    /// applies: zero under sched_ext.
    pub wait_count: crate::metric_types::MonotonicCount,
    /// Longest single runqueue-wait window the task ever
    /// experienced, in nanoseconds. `/proc/<tid>/sched` `wait_max`
    /// emitted via `PN_SCHEDSTAT` (`ms.ns_remainder`,
    /// reconstructed to full ns by the parser). Tail-latency
    /// signal that pairs with the mean wait derived from
    /// `wait_sum / wait_count`. Zero on kernels without
    /// `CONFIG_SCHEDSTATS`. Zero under sched_ext: the kernel sets
    /// this peak via `__update_stats_wait_end` from CFS/RT/DL
    /// paths only — `kernel/sched/ext.c` does not call that
    /// helper, so sched_ext-managed tasks never accumulate
    /// wait_max.
    pub wait_max: crate::metric_types::PeakNs,
    /// Pure voluntary sleep time, nanoseconds — `TASK_INTERRUPTIBLE`
    /// off-CPU windows only, with the involuntary-block
    /// component already subtracted at capture.
    ///
    /// Computed at capture as `sum_sleep_runtime - sum_block_runtime`
    /// (saturating; the read-skew window where block briefly
    /// exceeds sleep collapses to zero). The kernel's
    /// `sum_sleep_runtime` key (read via `PN_SCHEDSTAT` in
    /// `/proc/<tid>/sched`) is the FULL off-CPU total because
    /// `__update_stats_enqueue_sleeper` (`kernel/sched/stats.c`)
    /// charges every sleeper window regardless of which sleep
    /// state the task was in — voluntary sleep AND involuntary
    /// block both contribute. Subtracting `sum_block_runtime`
    /// at capture leaves the voluntary-sleep residual, which
    /// is the operationally useful signal for "how much time
    /// did this task spend on a syscall wait that wasn't a
    /// kernel block."
    ///
    /// Capture-side normalization (rather than a derived
    /// metric at compare time) means every consumer sees the
    /// pre-normalized value without re-deriving — and the raw
    /// kernel reading is intentionally NOT preserved in the
    /// snapshot per the project's pre-1.0 disposable-sidecar
    /// policy.
    ///
    /// There is no `voluntary_sleep_count` counterpart: the
    /// kernel does not emit one — the scheduler records the
    /// aggregate runtime but not the sleep-event count
    /// separately from `nr_wakeups`, which already covers the
    /// wake-side tally.
    ///
    /// Zero on kernels without `CONFIG_SCHEDSTATS`. Zero under
    /// sched_ext: `__update_stats_enqueue_sleeper` is called
    /// from CFS/RT/DL paths only. Also zero when either
    /// `sum_sleep_runtime` or `sum_block_runtime` fails to parse
    /// from `/proc/<tid>/sched`: the residual is uncomputable
    /// without both halves, and falling back to the unsubtracted
    /// `sum_sleep_runtime` would mislabel involuntary block as
    /// voluntary sleep.
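    ///
    /// A minimal sketch of the residual computation described
    /// above, on plain `u64` nanosecond values (`voluntary_sleep`
    /// is a hypothetical helper, not the capture-layer function):
    ///
    /// ```rust
    /// fn voluntary_sleep(sum_sleep: Option<u64>, sum_block: Option<u64>) -> u64 {
    ///     match (sum_sleep, sum_block) {
    ///         // Saturate so read skew (block > sleep) collapses to zero.
    ///         (Some(sleep), Some(block)) => sleep.saturating_sub(block),
    ///         // Either half missing: the residual is uncomputable.
    ///         _ => 0,
    ///     }
    /// }
    /// assert_eq!(voluntary_sleep(Some(900), Some(250)), 650);
    /// assert_eq!(voluntary_sleep(Some(100), Some(120)), 0);
    /// assert_eq!(voluntary_sleep(Some(900), None), 0);
    /// ```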
    pub voluntary_sleep_ns: crate::metric_types::MonotonicNs,
    /// Longest single sleep window in nanoseconds.
    /// `/proc/<tid>/sched` `sleep_max` emitted via `PN_SCHEDSTAT`
    /// (`ms.ns_remainder`, reconstructed by the parser). Zero on
    /// kernels without `CONFIG_SCHEDSTATS`. Zero under sched_ext:
    /// the kernel sets this counter via
    /// `__update_stats_enqueue_sleeper` from CFS/RT/DL paths
    /// only.
    pub sleep_max: crate::metric_types::PeakNs,
    /// Total nanoseconds blocked in the scheduler — every path
    /// that puts the task into `TASK_UNINTERRUPTIBLE` contributes:
    /// swap-in, page-fault resolution, disk I/O, plus
    /// mutex/rwsem/completion waits inside kernel code that
    /// hold the task off the runqueue. Populated from
    /// `/proc/<tid>/sched`'s `sum_block_runtime` key (kernel
    /// emits `ms.ns_remainder` via `PN_SCHEDSTAT`; the parser
    /// reconstructs full ns). `block_sum - iowait_sum` is
    /// therefore an UPPER BOUND on non-iowait involuntary-block
    /// time — swap/zswap decompression contributes, but so do
    /// the lock-family waits, so the delta cannot be read as
    /// swap latency without further attribution. There is no
    /// `block_count` counterpart: the kernel does not emit one.
    /// Zero on kernels without `CONFIG_SCHEDSTATS`. Zero under
    /// sched_ext: the kernel updates this counter via
    /// `__update_stats_enqueue_sleeper` (`kernel/sched/stats.c`),
    /// called from CFS/RT/DL paths only.
    pub block_sum: crate::metric_types::MonotonicNs,
    /// Longest single block window in nanoseconds.
    /// `/proc/<tid>/sched` `block_max` emitted via `PN_SCHEDSTAT`
    /// (`ms.ns_remainder`, reconstructed by the parser). Tail-
    /// latency signal that complements the `block_sum` total.
    /// Zero on kernels without `CONFIG_SCHEDSTATS`. Zero under
    /// sched_ext: the kernel sets this peak via
    /// `__update_stats_enqueue_sleeper` from CFS/RT/DL paths
    /// only.
    pub block_max: crate::metric_types::PeakNs,
    /// Total nanoseconds in I/O wait specifically (subset of
    /// `block_sum`). Distinguishes disk-backed I/O delay from
    /// the full involuntary-block total — callers that want
    /// disk latency alone read this field, callers that want
    /// every blocked window read `block_sum`. Populated from
    /// `/proc/<tid>/sched`'s `iowait_sum` key (kernel emits
    /// `ms.ns_remainder` via `PN_SCHEDSTAT`; the parser
    /// reconstructs full ns). Zero on kernels without
    /// `CONFIG_SCHEDSTATS`. Zero under sched_ext: the kernel
    /// updates this counter via `__update_stats_enqueue_sleeper`
    /// (`kernel/sched/stats.c`), called from CFS/RT/DL paths
    /// only.
    pub iowait_sum: crate::metric_types::MonotonicNs,
    /// Number of I/O-wait windows the task accumulated — the
    /// per-event tally that pairs with [`Self::iowait_sum`].
    /// Populated from `/proc/<tid>/sched`'s `iowait_count` key
    /// (kernel emits as `P_SCHEDSTAT`, plain u64). Zero on
    /// kernels without `CONFIG_SCHEDSTATS`. Same write path as
    /// `iowait_sum` (`__update_stats_enqueue_sleeper` in
    /// `kernel/sched/stats.c`), so the same sched_ext caveat
    /// applies: zero under sched_ext.
    pub iowait_count: crate::metric_types::MonotonicCount,
    /// Longest single CPU-burst (run-without-preempt window) in
    /// nanoseconds. `/proc/<tid>/sched` `exec_max` emitted via
    /// `PN_SCHEDSTAT` (`ms.ns_remainder`, reconstructed by the
    /// parser). Zero on kernels without `CONFIG_SCHEDSTATS`.
    /// Updated for sched_ext tasks too: the kernel sets it in
    /// `update_se` (`kernel/sched/fair.c`), which sched_ext
    /// reaches via `update_curr_scx` → `update_curr_common`.
    pub exec_max: crate::metric_types::PeakNs,
    /// Longest scheduling slice the task got before being
    /// preempted, in nanoseconds. `/proc/<tid>/sched` `slice_max`
    /// emitted via `PN_SCHEDSTAT` (`ms.ns_remainder`,
    /// reconstructed by the parser). Zero on kernels without
    /// `CONFIG_SCHEDSTATS`. Zero under sched_ext: the kernel sets
    /// this counter only in `set_next_entity`
    /// (`kernel/sched/fair.c`), a CFS-only path —
    /// sched_ext-managed tasks never accumulate slice_max even
    /// when CONFIG_SCHEDSTATS is enabled.
    pub slice_max: crate::metric_types::PeakNs,

    // -- jemalloc per-thread TSD counters (tsd_s.thread_allocated / thread_deallocated, via ptrace) --
    /// Bytes allocated by this thread over its lifetime — read
    /// directly from jemalloc's per-thread TSD u64 counter
    /// (`tsd_s.thread_allocated`) via ptrace + `process_vm_readv`.
    /// Cumulative-from-thread-creation; jemalloc updates the
    /// per-thread TSD counters unconditionally on its alloc fast
    /// and slow paths, so attaching the probe late does not lose
    /// data.
    ///
    /// Distinct from [`crate::host_heap::HostHeapState::allocated_bytes`],
    /// which is the runner process's own
    /// `tikv_jemalloc_ctl::stats::allocated` reading — a global
    /// arena counter for the calling process. This field is the
    /// per-thread TSD counter for an arbitrary target thread the
    /// probe attached to.
    ///
    /// Zero when the capture layer could not pull the counter:
    /// (a) the target process is not linked against jemalloc,
    /// (b) the probe attach failed for any other reason (DWARF
    /// missing, jemalloc in a DSO rather than the main
    /// executable, arch mismatch),
    /// (c) the per-thread ptrace step failed (tid exited
    /// mid-capture, EPERM under YAMA scope=1 without
    /// `CAP_SYS_PTRACE`),
    /// or (d) the thread is in the calling process's own tgid
    /// (PTRACE_SEIZE rejects self-attach). All four collapse to
    /// zero per the best-effort "absent = 0" capture contract.
    /// Snapshot-level diagnosis lives on
    /// [`CtprofProbeSummary::dominant_failure`] (the per-tag
    /// plurality) and
    /// [`CtprofProbeSummary::privilege_dominant`] (the EPERM
    /// remediation gate, true when ptrace tags account for ≥ 50%
    /// of `failed`), reachable via
    /// [`CtprofSnapshot::probe_summary`]; the per-tag taxonomy
    /// is documented in the `ktstr ctprof capture` CLI help.
    pub allocated_bytes: crate::metric_types::Bytes,
    /// Bytes freed by this thread over its lifetime — read from
    /// jemalloc's per-thread TSD u64 counter
    /// (`tsd_s.thread_deallocated`) via the same probe path that
    /// populates [`Self::allocated_bytes`].
    /// `allocated_bytes - deallocated_bytes` is a thread-local
    /// estimate of currently-held bytes. The difference can race
    /// in-flight allocator activity: both counters are sampled in
    /// one `process_vm_readv` over a 24-byte span, and the target
    /// may keep mutating that span during the read.
    pub deallocated_bytes: crate::metric_types::Bytes,

    // -- procfs /proc/<tid>/stat: page faults + CPU time (fields 10, 12, 14, 15) --
    /// Minor faults (no disk I/O). `/proc/<tid>/stat` field 10.
    pub minflt: crate::metric_types::MonotonicCount,
    /// Major faults (backed by disk). `/proc/<tid>/stat` field 12.
    pub majflt: crate::metric_types::MonotonicCount,
    /// User-mode CPU time in USER_HZ clock ticks since thread
    /// start. `/proc/<tid>/stat` field 14
    /// (`nsec_to_clock_t(utime)` in `fs/proc/array.c::do_task_stat`).
    /// USER_HZ-scaled like [`Self::start_time_clock_ticks`] —
    /// cross-host comparison between x86_64 and aarch64 is
    /// meaningful because USER_HZ is 100 on both, independent of
    /// CONFIG_HZ. Suffix `_clock_ticks` mirrors the existing
    /// `start_time_clock_ticks` precedent.
    pub utime_clock_ticks: crate::metric_types::ClockTicks,
    /// Kernel-mode CPU time in USER_HZ clock ticks since thread
    /// start. `/proc/<tid>/stat` field 15
    /// (`nsec_to_clock_t(stime)` in `fs/proc/array.c::do_task_stat`).
    /// Same USER_HZ scaling and `_clock_ticks` suffix convention as
    /// [`Self::utime_clock_ticks`].
    pub stime_clock_ticks: crate::metric_types::ClockTicks,
    /// Kernel-internal scheduler priority (signed). Distinct
    /// from [`Self::nice`] — `priority` is the post-bias
    /// scheduling priority (`task_prio(task)`) the scheduler
    /// uses for ordering, while `nice` is the
    /// userspace-presentable [-20, 19] preference.
    /// `/proc/<tid>/stat` field 18, emitted via
    /// `seq_put_decimal_ll(m, " ", priority)` (the local `priority`
    /// = `task_prio(task)`) in `do_task_stat()` (`fs/proc/array.c`).
    /// Range per `task_prio()` (`kernel/sched/syscalls.c`):
    /// CFS / SCHED_OTHER tasks see `[0..39]` (nice [-20..19]
    /// translated by `task_prio()` returning
    /// `p->prio - MAX_RT_PRIO`); SCHED_FIFO / SCHED_RR tasks
    /// see `[-2..-100]`; SCHED_DEADLINE tasks land at `-101`.
    /// Default 0 when the stat read fails. This collides with the
    /// CFS nice -20 case (`task_prio()` yields `nice + 20` for
    /// fair tasks), so a CFS task at nice -20 and an absent stat
    /// line both render 0. Wrapped in
    /// [`crate::metric_types::OrdinalI32`] for the
    /// `[min, max]` range reduction across a group.
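    ///
    /// A minimal sketch of the fair-class mapping, assuming the
    /// kernel's `MAX_RT_PRIO == 100` and `prio == nice + 120` for
    /// fair tasks (`fair_task_prio` is a hypothetical helper, not
    /// a crate or kernel function):
    ///
    /// ```rust
    /// fn fair_task_prio(nice: i32) -> i32 {
    ///     const MAX_RT_PRIO: i32 = 100;
    ///     // Assumption: nice [-20, 19] maps to prio [100, 139].
    ///     let prio = nice + 120;
    ///     prio - MAX_RT_PRIO
    /// }
    /// assert_eq!(fair_task_prio(-20), 0);
    /// assert_eq!(fair_task_prio(0), 20);
    /// assert_eq!(fair_task_prio(19), 39);
    /// ```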
    pub priority: crate::metric_types::OrdinalI32,
    /// Real-time scheduler priority. `/proc/<tid>/stat` field
    /// 40, emitted via `seq_put_decimal_ull(m, " ", task->rt_priority)`
    /// in `do_task_stat()` (`fs/proc/array.c`). Non-zero only when the task
    /// runs SCHED_FIFO or SCHED_RR; CFS / SCHED_OTHER tasks
    /// land at zero. Useful as a post-hoc filter to identify
    /// real-time threads in a snapshot. Wrapped in
    /// [`crate::metric_types::OrdinalU32`] for the
    /// `[min, max]` range reduction across a group; the inner
    /// `u32` matches the kernel's
    /// `unsigned int task_struct::rt_priority` declaration
    /// (`include/linux/sched.h`) exactly. Practical range is
    /// bounded `0..99` regardless of the type width.
    pub rt_priority: crate::metric_types::OrdinalU32,

    // -- /proc/<tid>/sched additions (counters + ordinal + slice gauge) --
    /// Cumulative time this task forced its SMT sibling idle for
    /// core-scheduling, in nanoseconds. `/proc/<tid>/sched`
    /// `core_forceidle_sum`, dotted ms.ns format via
    /// `PN_SCHEDSTAT` in `proc_sched_show_task()` (`kernel/sched/debug.c`).
    /// Reconstructed to full ns via the same
    /// `parsed_ns_from_dotted` helper as `wait_sum` /
    /// `block_sum`.
    ///
    /// Increment occurs in `__account_forceidle_time()`
    /// (`kernel/sched/cputime.c`), called from
    /// `__sched_core_account_forceidle()`
    /// (`kernel/sched/core_sched.c`). The increment body is a plain
    /// `__schedstat_add(p->stats.core_forceidle_sum, delta)` —
    /// it is CLASS-AGNOSTIC. The caller iterates
    /// `for_each_cpu(i, smt_mask)` and picks
    /// `p = rq_i->core_pick ?: rq_i->curr` on each SMT sibling,
    /// charging whichever task is running there regardless of
    /// scheduling class. So a SCHED_EXT / DEADLINE / RR / FIFO
    /// task on a core-scheduled SMT cohort CAN accrue forceidle
    /// time the same way a CFS task can.
    ///
1142    /// Real gating is at the rq/build level, not per-task, and
1143    /// the runtime gates apply IN SERIES rather than equating —
1144    /// `sched_core_enabled(rq)` and `core_forceidle_count` are
1145    /// independent conditions that BOTH have to fire:
1146    ///
1147    /// - **Build:** `CONFIG_SCHED_CORE` (file-level `#ifdef` in
1148    ///   `kernel/sched/cputime.c` and
1149    ///   `kernel/sched/core_sched.c`).
1150    /// - **Build:** `CONFIG_SCHEDSTATS` (the caller's own
1151    ///   `#ifdef CONFIG_SCHEDSTATS` in `__sched_core_account_forceidle()`).
1152    /// - **Runtime, scheduler-class entry:**
1153    ///   `sched_core_enabled(rq)` is the FIRST gate — checked
1154    ///   at `pick_next_task()` entry (`kernel/sched/core.c`)
1155    ///   with an early `__pick_next_task()` return when false.
1156    ///   No core-wide selection runs without this.
1157    /// - **Runtime, transient counter:**
1158    ///   `rq->core->core_forceidle_count > 0` is a SEPARATE
1159    ///   subsequent gate — `pick_next_task()` only invokes
1160    ///   `sched_core_account_forceidle(rq)` when this counter is
1161    ///   non-zero (`kernel/sched/core.c`); the
1162    ///   `WARN_ON_ONCE(!rq->core->core_forceidle_count)` inside
1163    ///   `__sched_core_account_forceidle()`
1164    ///   (`kernel/sched/core_sched.c`) reasserts the same
1165    ///   precondition. The early-return in the same function
1166    ///   on `core_forceidle_start == 0` is then a third
1167    ///   transient guard against accounting before
1168    ///   forceidle has begun.
1169    /// - **Runtime, occupancy:** non-zero
1170    ///   `core_forceidle_occupation` (the `WARN_ON_ONCE` in
1171    ///   `__sched_core_account_forceidle()`).
1172    ///
1173    /// Kernels that fail any build gate, or rqs that fail any
1174    /// runtime gate, see this counter at zero for every task.
1175    /// Hosts where no SMT cohort has ever accumulated forceidle
1176    /// also see zero across the board.
1177    pub core_forceidle_sum: crate::metric_types::MonotonicNs,
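    /*
    A minimal sketch of the dotted ms.ns reconstruction described above,
    assuming the field text is `<milliseconds>.<six-digit remainder>` (the
    split that `PN_SCHEDSTAT` prints). `dotted_ms_to_ns` is a hypothetical
    name used for illustration only; it is not the crate's
    `parsed_ns_from_dotted`, whose body is not shown here.

```rust
/// Hypothetical helper: rebuild full nanoseconds from a dotted
/// `<ms>.<remainder>` field, where the remainder is the value
/// modulo 1_000_000 ns, zero-padded to six digits.
fn dotted_ms_to_ns(field: &str) -> Option<u64> {
    let (ms, rem) = field.trim().split_once('.')?;
    // Assumption: exactly six ASCII digits after the dot.
    if rem.len() != 6 || !rem.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    let ms: u64 = ms.parse().ok()?;
    let rem: u64 = rem.parse().ok()?;
    // ms * 1_000_000 + remainder, refusing to wrap on overflow.
    ms.checked_mul(1_000_000)?.checked_add(rem)
}
```

    Under these assumptions `dotted_ms_to_ns("12.000345")` gives
    `Some(12_000_345)`, and a malformed field such as `"12.5"` gives `None`
    so the caller can fall back to the absent-counter default.
    */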
1178    /// Per-thread `se.slice` in nanoseconds. For fair-class
1179    /// tasks (SCHED_NORMAL / SCHED_BATCH) this is the
1180    /// instantaneous slice CFS is currently running the task
1181    /// with. For SCHED_EXT tasks the line is still emitted but
1182    /// reflects stale `p->se.slice` state — ext-class
1183    /// schedulers maintain slice in `p->scx.slice` and do not
1184    /// update `p->se.slice`. Field name `fair_slice_ns` mirrors
1185    /// the kernel emission gate `fair_policy(p->policy)`, not a
1186    /// guarantee about which class actually populated the value.
1187    ///
1188    /// `/proc/<tid>/sched` `se.slice`, plain integer via
1189    /// `P(se.slice)` in `proc_sched_show_task()`
1190    /// (`kernel/sched/debug.c`), gated by `fair_policy(p->policy)`
1191    /// in the same function. `fair_policy()` is defined in
1192    /// `kernel/sched/sched.h` as
1193    /// `normal_policy(policy) || policy == SCHED_BATCH`, and
1194    /// `normal_policy()` (`sched.h`) returns true for
1195    /// SCHED_NORMAL AND, when `CONFIG_SCHED_CLASS_EXT` is
1196    /// built, for SCHED_EXT. So the line IS emitted for
1197    /// SCHED_EXT tasks on a sched_ext-enabled kernel — but the
1198    /// value carries the staleness caveat above. The parser
1199    /// cannot distinguish "ext-class hasn't refreshed
1200    /// `p->se.slice` since the task left the fair class" from
1201    /// "CFS task with a current slice that happens to equal the
1202    /// last value": that ambiguity is the user's to resolve via
1203    /// `policy` (also captured per-thread). Tasks under
1204    /// SCHED_DEADLINE / SCHED_RR / SCHED_FIFO / SCHED_IDLE land
1205    /// at the absent-line default of 0.
1206    ///
1207    /// This is a GAUGE (instantaneous current value), not a
1208    /// counter or high-water mark. Distinct from
1209    /// [`Self::slice_max`] which IS the schedstat lifetime
1210    /// high-water — a thread that hasn't run for a long time
1211    /// can have a stale `fair_slice_ns` value while `slice_max`
    /// continues to reflect the historical worst. Aggregation
    /// across a group uses the `MaxGaugeNs` rule, so the
    /// rendered cell shows the longest current slice any thread
    /// in the group is running with. Sum would multiply a
    /// near-identical instantaneous value across the group and
    /// obscure the signal (and would also be semantically
    /// meaningless: instantaneous gauges do not add).
1219    pub fair_slice_ns: crate::metric_types::GaugeNs,
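    /*
    The emission gate described above can be sketched as a pair of
    predicates. This is an illustrative userspace model, not kernel code:
    the `ext_built` parameter stands in for `CONFIG_SCHED_CLASS_EXT`, and
    the policy constants are the uapi `SCHED_*` values.

```rust
const SCHED_NORMAL: u32 = 0;
const SCHED_FIFO: u32 = 1;
const SCHED_RR: u32 = 2;
const SCHED_BATCH: u32 = 3;
const SCHED_IDLE: u32 = 5;
const SCHED_DEADLINE: u32 = 6;
const SCHED_EXT: u32 = 7;

/// Model of `normal_policy()`: SCHED_NORMAL, plus SCHED_EXT when the
/// ext class is built (`ext_built` is an assumption of this sketch).
fn normal_policy(policy: u32, ext_built: bool) -> bool {
    (ext_built && policy == SCHED_EXT) || policy == SCHED_NORMAL
}

/// Model of `fair_policy()`: decides whether the `se.slice` line is
/// emitted at all. It says nothing about whether the value is fresh.
fn fair_policy(policy: u32, ext_built: bool) -> bool {
    normal_policy(policy, ext_built) || policy == SCHED_BATCH
}
```

    With `ext_built == true`, `fair_policy(SCHED_EXT, true)` holds, which is
    why the line appears for SCHED_EXT tasks; FIFO, RR, DEADLINE and IDLE
    fail the predicate and land at the absent-line default of 0.
    */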
1220
1221    // -- /proc/<tid>/status (process-wide tgid count) --
1222    /// Total threads in this task's tgid (process-wide thread
1223    /// count, the `signal_struct->nr_threads` snapshot). Field
1224    /// name mirrors the kernel struct member to avoid collision
1225    /// with [`CtprofSnapshot::threads`] (the snapshot's own
1226    /// `Vec<ThreadState>`). `/proc/<pid>/status` `Threads:` line
1227    /// emitted in `task_sig()` (`fs/proc/array.c`) via
1228    /// `seq_put_decimal_ull(m, "Threads:\t", num_threads)`.
1229    /// Identical for every thread of the same tgid.
1230    ///
1231    /// Capture-side dedup: the field is populated ONLY on the
1232    /// thread leader (tid == tgid) and zero for non-leader
1233    /// threads of the same process. The registry pairs this with
1234    /// [`crate::ctprof_compare::AggRule::MaxGaugeCount`] (not
1235    /// Sum) so the rendered cell surfaces "the largest process
1236    /// represented in this bucket" regardless of grouping axis.
    /// Sum would be wrong under `--group-by comm` and
    /// `--group-by cgroup` because non-leader buckets get a 0
    /// contribution from every member: a bucket whose leader
    /// thread did NOT match the grouping would render 0 even
    /// though processes are represented.
1242    /// Wrapped in [`crate::metric_types::GaugeCount`] so the
1243    /// type system rejects sum-style aggregation: a bucket with
1244    /// N threads sharing a tgid would over-count the parent
1245    /// process N-fold under Sum, while Max is well-defined
1246    /// (largest current count any contributor reported).
1247    pub nr_threads: crate::metric_types::GaugeCount,
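    /*
    A small sketch of the leader-only population and the max-style
    reduction described above. `leader_only` and `max_gauge_count` are
    hypothetical helpers working on plain `u64` values; the real field is
    wrapped in `GaugeCount` and reduced by the registry's `MaxGaugeCount`
    rule, neither of which is reproduced here.

```rust
/// Capture-side dedup: only the thread leader (tid == tgid) keeps the
/// `Threads:` value; every other thread of the process stores 0.
fn leader_only(tid: u32, tgid: u32, threads_line: u64) -> u64 {
    if tid == tgid {
        threads_line
    } else {
        0
    }
}

/// Max-style gauge reduction over a bucket; an empty bucket yields 0.
fn max_gauge_count(values: &[u64]) -> u64 {
    values.iter().copied().max().unwrap_or(0)
}
```

    For a bucket holding the leader and one sibling of a 3-thread process
    plus the leader of a 2-thread process, the stored values are
    `[3, 0, 2]` and the reduction reports 3, the largest process
    represented in the bucket.
    */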
1248
1249    // -- /proc/<tid>/smaps_rollup (per-MM memory breakdown) --
1250    /// Per-process memory breakdown from
1251    /// `/proc/<tid>/smaps_rollup`, parsed as a key-value map
1252    /// with values in kilobytes (the kernel's native unit on
1253    /// this file — `__show_smap()` (`fs/proc/task_mmu.c`)
1254    /// emits every line as `Name: NN kB`).
1255    ///
1256    /// Stored as a [`BTreeMap`] for forward-compat with the
1257    /// open key set: rollup mode (gated in `__show_smap()`)
1258    /// emits 22 keys on a recent kernel — Rss, Pss, Pss_Dirty,
1259    /// Pss_Anon, Pss_File, Pss_Shmem, Shared_Clean,
1260    /// Shared_Dirty, Private_Clean, Private_Dirty, Referenced,
1261    /// Anonymous, KSM, LazyFree, AnonHugePages,
1262    /// ShmemPmdMapped, FilePmdMapped, Shared_Hugetlb,
1263    /// Private_Hugetlb, Swap, SwapPss, Locked, plus the
1264    /// `[rollup]` header which the parser elides. The map
1265    /// preserves any future-kernel keys without a schema bump.
1266    /// Pss is the most operationally valuable: proportional
1267    /// share of shared pages — distinguishes "sole owner" from
1268    /// "one of N sharing".
1269    ///
1270    /// Per-MM, not per-thread: every thread of the same tgid
1271    /// shares one mm_struct, so all threads expose identical
1272    /// values. Capture-side dedup populates ONLY the thread
1273    /// leader (tid == tgid) and leaves non-leader threads at
1274    /// the empty map. Mirrors [`Self::nr_threads`]'s
1275    /// leader-dedup discipline. The capture cost is one
1276    /// `read_to_string` per tgid (NOT per-tid) because
1277    /// non-leaders short-circuit before opening the file.
1278    ///
    /// Empty when smaps_rollup is absent (kernels older than
    /// 4.14, which added `/proc/<pid>/smaps_rollup` upstream)
    /// or unreadable (typically permission-denied, e.g. reading
    /// `/proc/1/smaps_rollup` without `CAP_SYS_PTRACE`).
1284    pub smaps_rollup_kib: BTreeMap<String, u64>,
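    /*
    A hedged sketch of how the `Name: NN kB` lines could be folded into the
    open-keyed map described above. `parse_smaps_rollup` is a hypothetical
    function for illustration; the crate's actual parser is not shown in
    this section and may differ.

```rust
use std::collections::BTreeMap;

/// Parse smaps_rollup text into key -> kB. Lines that do not have the
/// exact `Name: <number> kB` shape (including the `[rollup]` header)
/// are skipped, so unknown future keys are kept and junk is dropped.
fn parse_smaps_rollup(text: &str) -> BTreeMap<String, u64> {
    let mut out = BTreeMap::new();
    for line in text.lines() {
        let Some((key, rest)) = line.split_once(':') else {
            continue;
        };
        let mut parts = rest.split_whitespace();
        // Require exactly two tokens after the colon: a number and "kB".
        let (Some(num), Some("kB"), None) = (parts.next(), parts.next(), parts.next()) else {
            continue;
        };
        if let Ok(kb) = num.parse::<u64>() {
            out.insert(key.trim().to_string(), kb);
        }
    }
    out
}
```

    Because the key set is not hard-coded, a key added by a later kernel
    lands in the map without any change to this function, which is the
    forward-compatibility property the `BTreeMap` storage is chosen for.
    */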
1285
1286    // -- I/O (/proc/<tid>/io) --
1287    //
1288    // The whole file is emitted by `do_io_accounting`
1289    // (`fs/proc/base.c`) under a single `CONFIG_TASK_IO_ACCOUNTING`
1290    // gate, and `CONFIG_TASK_IO_ACCOUNTING` `depends on`
1291    // `CONFIG_TASK_XACCT` in `init/Kconfig` — so from the
1292    // procfs-reader perspective the file either appears with all
1293    // 7 fields or doesn't appear at all. The XACCT split that
1294    // sometimes shows up in kernel commentary describes the
1295    // increment-side path, not the procfs surface; for the
1296    // capture pipeline the relevant gate is `CONFIG_TASK_IO_ACCOUNTING`
1297    // for every field below.
1298    /// Bytes read at the read syscall layer (incl. cached /
1299    /// pagecache hits). Gated by `CONFIG_TASK_IO_ACCOUNTING`.
1300    pub rchar: crate::metric_types::Bytes,
1301    /// Bytes written at the write syscall layer (incl.
1302    /// pagecache / writeback). Gated by `CONFIG_TASK_IO_ACCOUNTING`.
1303    pub wchar: crate::metric_types::Bytes,
1304    /// Number of read syscalls. Gated by `CONFIG_TASK_IO_ACCOUNTING`.
1305    pub syscr: crate::metric_types::MonotonicCount,
1306    /// Number of write syscalls. Gated by `CONFIG_TASK_IO_ACCOUNTING`.
1307    pub syscw: crate::metric_types::MonotonicCount,
1308    /// Bytes that hit the storage device on read (excludes
1309    /// pagecache hits). Gated by `CONFIG_TASK_IO_ACCOUNTING`.
1310    pub read_bytes: crate::metric_types::Bytes,
    /// Bytes the task caused to be sent to the storage layer.
    /// For buffered writes the kernel accounts this when the
    /// page is dirtied, not when writeback completes. Gated by
    /// `CONFIG_TASK_IO_ACCOUNTING`.
1313    pub write_bytes: crate::metric_types::Bytes,
1314    /// Bytes the kernel deaccounted from a prior dirty-write
1315    /// because the page was reclaimed without writeback (truncate,
1316    /// inode invalidation). `/proc/<tid>/io` 7th line, gated by
1317    /// `CONFIG_TASK_IO_ACCOUNTING`.
1318    ///
1319    /// `include/linux/task_io_accounting_ops.h`
1320    /// (`task_io_account_cancelled_write`) increments
1321    /// `current->ioac.cancelled_write_bytes` — i.e. the value
1322    /// records on the task that triggers the deaccount
1323    /// (the truncating / unmapping task), NOT the original
1324    /// writer. Sole call site is `folio_account_cleaned`
1325    /// (`mm/page-writeback.c`), invoked when a dirty folio
1326    /// is reclaimed without going through writeback.
1327    ///
    /// Operationally this is a "negative write" signal: bytes
    /// the kernel previously charged to some task's
    /// `write_bytes` at page-dirtying time that never ended up
    /// on disk. Higher values mean more wasted writeback
    /// intent. Per-thread interpretation
1332    /// is asymmetric vs. [`Self::write_bytes`]: a thread's
1333    /// `cancelled_write_bytes` does NOT correspond to its own
1334    /// `write_bytes` — the writer and the canceller may be
1335    /// distinct tasks. Group-level Sum across a registry-grouped
1336    /// bucket is therefore meaningful (total bytes the bucket's
1337    /// threads cancelled), but per-thread `actual_write_bytes
1338    /// = write_bytes - cancelled_write_bytes` is NOT defined for
1339    /// that reason — the two counters track different parties.
1340    pub cancelled_write_bytes: crate::metric_types::Bytes,
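    /*
    Since the procfs file appears with all seven fields or not at all, a
    reader can treat a partial parse as a failure. The sketch below
    illustrates that all-or-nothing rule; `IoCounters` and `parse_proc_io`
    are hypothetical names using plain `u64` fields rather than the
    crate's `Bytes` / `MonotonicCount` wrappers.

```rust
#[derive(Debug, Default, PartialEq)]
struct IoCounters {
    rchar: u64,
    wchar: u64,
    syscr: u64,
    syscw: u64,
    read_bytes: u64,
    write_bytes: u64,
    cancelled_write_bytes: u64,
}

/// Parse `/proc/<tid>/io` text; returns `None` unless all seven
/// known fields were seen with a numeric value.
fn parse_proc_io(text: &str) -> Option<IoCounters> {
    let mut io = IoCounters::default();
    let mut seen = 0;
    for line in text.lines() {
        let Some((key, val)) = line.split_once(':') else {
            continue;
        };
        let val: u64 = val.trim().parse().ok()?;
        let slot = match key.trim() {
            "rchar" => &mut io.rchar,
            "wchar" => &mut io.wchar,
            "syscr" => &mut io.syscr,
            "syscw" => &mut io.syscw,
            "read_bytes" => &mut io.read_bytes,
            "write_bytes" => &mut io.write_bytes,
            "cancelled_write_bytes" => &mut io.cancelled_write_bytes,
            _ => continue, // ignore keys this sketch does not know
        };
        *slot = val;
        seen += 1;
    }
    (seen == 7).then_some(io)
}
```

    Note that the sketch deliberately does not compute
    `write_bytes - cancelled_write_bytes`: as explained above, the two
    counters may belong to different tasks.
    */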
1341
1342    // -- taskstats delay accounting + memory watermarks (genetlink TASKSTATS family) --
1343    //
1344    // Per-tid records captured via the kernel's taskstats
1345    // genetlink interface (NOT exposed in /proc/<tid>/sched or
1346    // /proc/<tid>/stat). Two field families:
1347    //
1348    //   1. Delay accounting — eight categories (cpu/blkio/swapin/
1349    //      freepages/thrashing/compact/wpcopy/irq), each carrying
1350    //      count (number of events), delay_total_ns (cumulative
1351    //      ns of delay), delay_max_ns (longest single window),
1352    //      delay_min_ns (shortest non-zero window observed;
1353    //      sentinel 0 means "no events"). Gated on
1354    //      `CONFIG_TASKSTATS` + `CONFIG_TASK_DELAY_ACCT` plus the
1355    //      runtime `delayacct=on` toggle (sysctl
1356    //      `kernel.task_delayacct` or boot param `delayacct`).
1357    //
1358    //   2. Memory watermarks — `hiwater_rss_bytes` and
1359    //      `hiwater_vm_bytes`. Gated on `CONFIG_TASKSTATS` +
1360    //      `CONFIG_TASK_XACCT` (NOT `CONFIG_TASK_DELAY_ACCT`).
1361    //      Populated from the shared `mm_struct` so sibling tgid
1362    //      threads report identical values, and kernel threads
1363    //      (mm == NULL) leave the field at zero — see the
1364    //      per-field doc on `hiwater_rss_bytes`.
1365    //
1366    // Capture path is the [`crate::taskstats`] module —
1367    // best-effort, all fields collapse to zero when:
1368    //   - the kernel was built without `CONFIG_TASKSTATS`,
1369    //   - the relevant per-family kconfig is off (DELAY_ACCT or
1370    //     XACCT, depending on the field),
1371    //   - the runtime `delayacct=on` toggle is off (delay-family
1372    //     fields only — XACCT does not gate on the toggle),
1373    //   - the calling process lacks `CAP_NET_ADMIN`,
1374    //   - the per-tid query races a task exit (ESRCH).
1375    //
1376    // CAVEATS:
1377    //   - cpu_delay is RACY (sched_info path, no lock) — count and
1378    //     delay_total are not updated atomically.
1379    //   - swapin and thrashing OVERLAP — a thrashing event is also
1380    //     a swapin event from the syscall layer; do not sum.
1381    //   - delay_min == 0 means "no events observed", NOT "saw a
1382    //     zero-ns event". Compare against the matching count.
1383    //   - hiwater_* values are per-mm, not per-thread; sibling
1384    //     tgid threads report identical values, kernel threads
1385    //     (mm == NULL) report zero. See the per-field doc.
    /// Number of windows the task spent runnable but off-CPU,
    /// waiting on the runqueue. Source: taskstats `cpu_count`,
    /// populated at query time from `tsk->sched_info.pcount`
    /// (incremented by `sched_info_arrive` in
    /// `kernel/sched/stats.h`, line 282). `delayacct_add_tsk`
    /// (`kernel/delayacct.c`, line 175) snapshots the value into
    /// the reply via `d->cpu_count += t1` where
    /// `t1 = tsk->sched_info.pcount`.
1393    pub cpu_delay_count: crate::metric_types::MonotonicCount,
1394    /// Cumulative ns the task spent waiting on the runqueue.
1395    /// Source: taskstats `cpu_delay_total`. RACY: count and total
1396    /// are not updated atomically (sched_info path, no lock); a
1397    /// concurrent reader may observe count or total advance ahead
1398    /// of the other.
1399    pub cpu_delay_total_ns: crate::metric_types::MonotonicNs,
1400    /// Longest single CPU-wait window, ns. Source: taskstats
1401    /// `cpu_delay_max`. Same lifetime-watermark semantics as
1402    /// `wait_max` / `block_max` — `MaxPeak` aggregation surfaces
1403    /// the worst single window any thread in the group ever
1404    /// experienced.
1405    pub cpu_delay_max_ns: crate::metric_types::PeakNs,
1406    /// Shortest non-zero CPU-wait window, ns. Source: taskstats
1407    /// `cpu_delay_min`. Sentinel 0 means "no events observed":
1408    /// the kernel writes the field on every event, so 0 is
1409    /// distinguishable from a genuine zero-ns event by checking
1410    /// `cpu_delay_count == 0`. `PeakNs` aggregation surfaces "the
1411    /// largest minimum any thread reported" across the group.
1412    pub cpu_delay_min_ns: crate::metric_types::PeakNs,
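    /*
    The sentinel-0 rule above (and in the CAVEATS list) amounts to reading
    the minimum only when the matching count is non-zero. A minimal
    sketch, with `observed_min_ns` as a hypothetical helper over plain
    `u64` values:

```rust
/// Interpret a `*_delay_min_ns` reading together with its `*_count`.
/// A zero count means no event was ever recorded, so the minimum is
/// absent rather than "0 ns".
fn observed_min_ns(count: u64, delay_min_ns: u64) -> Option<u64> {
    if count == 0 {
        None
    } else {
        Some(delay_min_ns)
    }
}
```

    The same check applies to every `*_delay_min_ns` field in this family,
    each paired with its own `*_delay_count`.
    */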
1413    /// Number of block-I/O wait windows. Source: taskstats
1414    /// `blkio_count`. Updates from `delayacct_blkio_start/end` in
1415    /// `kernel/delayacct.c`.
1416    pub blkio_delay_count: crate::metric_types::MonotonicCount,
1417    /// Cumulative ns the task waited on synchronous block I/O.
1418    /// Source: taskstats `blkio_delay_total`. Distinct from
1419    /// `iowait_sum` (schedstat) which counts a different bucket;
1420    /// the delayacct path is the canonical block-I/O delay
1421    /// accounting.
1422    pub blkio_delay_total_ns: crate::metric_types::MonotonicNs,
1423    /// Longest single block-I/O wait window, ns. Source: taskstats
1424    /// `blkio_delay_max`.
1425    pub blkio_delay_max_ns: crate::metric_types::PeakNs,
1426    /// Shortest non-zero block-I/O wait window, ns. Source:
1427    /// taskstats `blkio_delay_min`. Sentinel-0 caveat per
1428    /// `cpu_delay_min_ns`.
1429    pub blkio_delay_min_ns: crate::metric_types::PeakNs,
1430    /// Number of swap-in wait windows. Source: taskstats
1431    /// `swapin_count`. NOTE: overlaps with `thrashing_count` —
1432    /// every thrashing event is also a swapin event from the
1433    /// syscall layer; do not sum.
1434    pub swapin_delay_count: crate::metric_types::MonotonicCount,
1435    /// Cumulative ns waiting for swap-in to complete. Source:
1436    /// taskstats `swapin_delay_total`.
1437    pub swapin_delay_total_ns: crate::metric_types::MonotonicNs,
1438    /// Longest single swap-in wait, ns. Source: taskstats
1439    /// `swapin_delay_max`.
1440    pub swapin_delay_max_ns: crate::metric_types::PeakNs,
1441    /// Shortest non-zero swap-in wait, ns. Sentinel-0 caveat per
1442    /// `cpu_delay_min_ns`.
1443    pub swapin_delay_min_ns: crate::metric_types::PeakNs,
1444    /// Number of direct-reclaim (free-pages) wait windows. Source:
1445    /// taskstats `freepages_count`. Updates from
1446    /// `delayacct_freepages_start/end` (mm/page_alloc.c).
1447    pub freepages_delay_count: crate::metric_types::MonotonicCount,
1448    /// Cumulative ns waiting in direct memory reclaim. Source:
1449    /// taskstats `freepages_delay_total`.
1450    pub freepages_delay_total_ns: crate::metric_types::MonotonicNs,
1451    /// Longest single direct-reclaim wait, ns. Source: taskstats
1452    /// `freepages_delay_max`.
1453    pub freepages_delay_max_ns: crate::metric_types::PeakNs,
1454    /// Shortest non-zero direct-reclaim wait, ns. Sentinel-0 caveat
1455    /// per `cpu_delay_min_ns`.
1456    pub freepages_delay_min_ns: crate::metric_types::PeakNs,
1457    /// Number of thrashing wait windows. Source: taskstats
1458    /// `thrashing_count`. OVERLAPS with `swapin_*`: thrashing
1459    /// detection is a refinement of swapin tracking
1460    /// (mm/workingset.c).
1461    pub thrashing_delay_count: crate::metric_types::MonotonicCount,
1462    /// Cumulative ns waiting under thrashing pressure. Source:
1463    /// taskstats `thrashing_delay_total`.
1464    pub thrashing_delay_total_ns: crate::metric_types::MonotonicNs,
1465    /// Longest single thrashing wait, ns. Source: taskstats
1466    /// `thrashing_delay_max`.
1467    pub thrashing_delay_max_ns: crate::metric_types::PeakNs,
1468    /// Shortest non-zero thrashing wait, ns. Sentinel-0 caveat per
1469    /// `cpu_delay_min_ns`.
1470    pub thrashing_delay_min_ns: crate::metric_types::PeakNs,
1471    /// Number of memory-compaction wait windows. Source: taskstats
1472    /// `compact_count`. Updates from `delayacct_compact_start/end`
1473    /// (mm/compaction.c).
1474    pub compact_delay_count: crate::metric_types::MonotonicCount,
1475    /// Cumulative ns waiting on memory compaction. Source:
1476    /// taskstats `compact_delay_total`.
1477    pub compact_delay_total_ns: crate::metric_types::MonotonicNs,
1478    /// Longest single compaction wait, ns. Source: taskstats
1479    /// `compact_delay_max`.
1480    pub compact_delay_max_ns: crate::metric_types::PeakNs,
1481    /// Shortest non-zero compaction wait, ns. Sentinel-0 caveat
1482    /// per `cpu_delay_min_ns`.
1483    pub compact_delay_min_ns: crate::metric_types::PeakNs,
1484    /// Number of write-protect-copy (CoW) fault wait windows.
1485    /// Source: taskstats `wpcopy_count`. Updates from
1486    /// `delayacct_wpcopy_start/end` (mm/memory.c).
1487    pub wpcopy_delay_count: crate::metric_types::MonotonicCount,
1488    /// Cumulative ns waiting on write-protect-copy faults. Source:
1489    /// taskstats `wpcopy_delay_total`.
1490    pub wpcopy_delay_total_ns: crate::metric_types::MonotonicNs,
1491    /// Longest single wpcopy wait, ns. Source: taskstats
1492    /// `wpcopy_delay_max`.
1493    pub wpcopy_delay_max_ns: crate::metric_types::PeakNs,
1494    /// Shortest non-zero wpcopy wait, ns. Sentinel-0 caveat per
1495    /// `cpu_delay_min_ns`.
1496    pub wpcopy_delay_min_ns: crate::metric_types::PeakNs,
    /// Number of IRQ-handling windows charged to the task.
    /// Source: taskstats `irq_count`. Updated by `delayacct_irq`
    /// in `kernel/delayacct.c`, which records kernel IRQ time
    /// charged to the task by the IRQ accounting subsystem.
1501    pub irq_delay_count: crate::metric_types::MonotonicCount,
1502    /// Cumulative ns of IRQ handling time charged to the task.
1503    /// Source: taskstats `irq_delay_total`.
1504    pub irq_delay_total_ns: crate::metric_types::MonotonicNs,
1505    /// Longest single IRQ-handler window, ns. Source: taskstats
1506    /// `irq_delay_max`.
1507    pub irq_delay_max_ns: crate::metric_types::PeakNs,
1508    /// Shortest non-zero IRQ-handler window, ns. Sentinel-0 caveat
1509    /// per `cpu_delay_min_ns`.
1510    pub irq_delay_min_ns: crate::metric_types::PeakNs,
    /// Lifetime high-watermark of resident-set size, bytes. Source:
    /// taskstats `hiwater_rss` (kB), converted at parse time via
    /// `saturating_mul(1024)`. Populated by `xacct_add_tsk`
    /// (`kernel/tsacct.c`). Distinct from
    /// `smaps_rollup_kib["Rss"]`, which is the CURRENT RSS;
    /// this field is the lifetime peak.
1517    ///
1518    /// **Kernel threads read zero**: `xacct_add_tsk`
1519    /// (`kernel/tsacct.c`) calls `mm = get_task_mm(p)` and the
1520    /// hiwater assignments are guarded by
1521    /// `if (mm)`. Kernel threads (`PF_KTHREAD`, `tsk->mm == NULL`)
1522    /// skip the assignment entirely, so the field stays at the
1523    /// kernel-side zero default.
1524    ///
1525    /// **Sibling threads of the same tgid see the same value**:
1526    /// `get_mm_hiwater_rss(mm)` reads from the shared
1527    /// `mm_struct`, so every thread of a process reports the same
1528    /// hiwater value. The registry's `MaxPeakBytes` aggregation
1529    /// behaves as a per-process selector when buckets span
1530    /// multiple tgids: cross-tgid Max picks the largest
1531    /// per-process watermark in the bucket; intra-tgid Max is a
1532    /// no-op (every sibling reports the same number).
1533    pub hiwater_rss_bytes: crate::metric_types::PeakBytes,
1534    /// Lifetime high-watermark of virtual-memory size, bytes.
1535    /// Source: taskstats `hiwater_vm` (kB), converted at parse
1536    /// time. Same kernel write path as `hiwater_rss_bytes` —
1537    /// inherits the same kernel-thread zero and same sibling-tid
1538    /// shared-mm caveats; see [`Self::hiwater_rss_bytes`].
1539    pub hiwater_vm_bytes: crate::metric_types::PeakBytes,
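    /*
    A short sketch of the kB-to-bytes conversion and of why a max-style
    reduction acts as a per-process selector for these watermarks. Both
    helpers are hypothetical and operate on plain `u64` values instead of
    the crate's `PeakBytes` wrapper and `MaxPeakBytes` rule.

```rust
/// Convert a taskstats watermark reported in kB to bytes, clamping at
/// `u64::MAX` instead of wrapping.
fn kb_to_bytes(kb: u64) -> u64 {
    kb.saturating_mul(1024)
}

/// Max-reduce per-thread watermarks (already in bytes) across a bucket.
/// Siblings of one tgid contribute the same number, so the result is the
/// largest per-process watermark present.
fn max_peak_bytes(values: &[u64]) -> u64 {
    values.iter().copied().max().unwrap_or(0)
}
```

    With two sibling threads reporting 8 kB each, a thread of another
    process reporting 4 kB, and a kernel thread reporting 0, the bucket
    reduces to 8192 bytes.
    */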
1540    /// Whether this thread's taskstats genetlink query succeeded and populated
1541    /// the payload — `true` iff `apply_delay_stats` ran on an `Ok` query. This
1542    /// is the capture-mechanism flag for the WHOLE taskstats payload: one query
1543    /// (`fill_stats`) fills BOTH the delay-accounting family (cpu/blkio/... delay
1544    /// counters) AND the xacct memory watermarks (`hiwater_rss_bytes` /
1545    /// `hiwater_vm_bytes`) together, so they share this one flag. `false` when
1546    /// the query could not capture (CONFIG_TASKSTATS off, no CAP_NET_ADMIN, or
1547    /// the query raced task exit), leaving the absent-counter zero defaults. The
1548    /// group aggregation reads this to distinguish a captured (measured) zero
1549    /// from a never-captured payload — without it both read as a sentinel `0`
1550    /// and a derived metric like `total_offcpu_delay_ns` renders "0" instead of
1551    /// "-". A whole group with no captured thread aggregates to
1552    /// [`crate::ctprof_compare::Aggregated::Absent`].
1553    ///
1554    /// QUERY-level: `true` means THIS thread's taskstats query succeeded. Per
1555    /// sub-family ENABLEMENT is carried separately by [`Self::cpu_delay_active`] /
1556    /// [`Self::delay_block_active`] / [`Self::xacct_active`] (baked at capture from
1557    /// the host `/proc/sys/kernel/task_delayacct` + `/proc/config.gz` probes). The
1558    /// group measured predicate ANDs this query-Ok flag with the relevant
1559    /// sub-family active flag, so a sub-family disabled while the query still
1560    /// succeeds (`CONFIG_TASK_XACCT` off, or the `kernel.task_delayacct` sysctl
    /// off, with the other family on) renders "-" rather than "0". On ktstr's
    /// own kernel (all configs `=y`, delayacct booted on) every sub-family is
    /// active, so the gating is an in-VM no-op; it only changes host-facing
    /// `ctprof capture` runs against single-family kernels.
1565    pub taskstats_measured: bool,
    /// Host-wide enablement of the `cpu_delay_*` sub-family (sched_info-sourced,
    /// filled unconditionally by `delayacct_add_tsk`): `true` when
    /// `CONFIG_TASK_DELAY_ACCT` is built in, independent of the runtime
    /// `task_delayacct` toggle. Baked at capture
1569    /// from `host_context::probe_taskstats_active`; AND-ed with
1570    /// [`Self::taskstats_measured`] by the group measured predicate. No
1571    /// `serde(default)` (matching the sibling capture flags): a sidecar predating
1572    /// this field fails to deserialize and is regenerated by re-running, per the
1573    /// disposable-sidecar policy.
1574    pub cpu_delay_active: bool,
1575    /// Host-wide enablement of the delayacct resource-wait sub-family (`blkio` /
1576    /// `swapin` / `freepages` / `thrashing` / `compact` / `wpcopy` / `irq`): the
1577    /// runtime `task_delayacct` toggle is ON (these are gated by `tsk->delays`,
1578    /// allocated at fork only when on). Baked at capture; AND-ed with
1579    /// [`Self::taskstats_measured`].
1580    pub delay_block_active: bool,
1581    /// Host-wide enablement of the xacct watermark sub-family
1582    /// (`hiwater_rss_bytes`, `hiwater_vm_bytes`): CONFIG_TASK_XACCT is built in (no
1583    /// runtime toggle); an unknown host config (`/proc/config.gz` not exposed) is
1584    /// treated as active to avoid a false absent. Baked at capture; AND-ed with
1585    /// [`Self::taskstats_measured`].
1586    pub xacct_active: bool,
1587    /// Whether this thread's jemalloc `allocated_bytes` / `deallocated_bytes`
1588    /// were captured from a successful per-thread TSD probe read, versus left at
1589    /// the absent-as-`0` default (process not jemalloc-linked, the probe could
1590    /// not attach, or the per-thread read failed). Set from the per-thread read
1591    /// outcome — NOT the per-tgid attach — mirroring `taskstats_measured`'s
1592    /// per-thread Ok-gating: a failed read is not a measurement. Same
1593    /// measured-vs-zero discipline as [`Self::taskstats_measured`], for the
1594    /// `live_heap_estimate` derived metric.
1595    pub jemalloc_measured: bool,
1596}
1597
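/*
The measured-versus-zero discipline documented on `taskstats_measured` and
the `*_active` flags can be sketched as follows. This is an illustration
under stated assumptions, not the crate's aggregation code: `Cell` stands
in for `crate::ctprof_compare::Aggregated` (whose definition is not shown
here), and `Sample` carries only the three fields the sketch needs.

```rust
/// Rendered result for one group: nothing measured, or a value.
#[derive(Debug, PartialEq)]
enum Cell {
    Absent,
    Value(u64),
}

/// Minimal stand-in for the relevant `ThreadState` fields.
struct Sample {
    taskstats_measured: bool,
    xacct_active: bool,
    hiwater_rss_bytes: u64,
}

/// Max-reduce the watermark over threads whose query succeeded AND whose
/// sub-family is enabled; a group with no such thread is `Absent`, so a
/// never-captured payload is not confused with a measured zero.
fn aggregate_hiwater_rss(group: &[Sample]) -> Cell {
    group
        .iter()
        .filter(|s| s.taskstats_measured && s.xacct_active)
        .map(|s| s.hiwater_rss_bytes)
        .max()
        .map_or(Cell::Absent, Cell::Value)
}
```

A group whose only thread has `xacct_active == false` aggregates to
`Cell::Absent` (rendered "-"), while a group with at least one measured
thread reporting zero aggregates to `Cell::Value(0)` (rendered "0").
*/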
1598impl Default for ThreadState {
1599    /// Zero-valued sentinel — tid=0/tgid=0/empty strings are the
1600    /// "no thread observed yet" placeholder that ctprof inserts
1601    /// into HashMap entries before the /proc walk populates them
1602    /// from the live kernel state. Default-constructed ThreadState
1603    /// values are NOT visible to operator-facing output: the
1604    /// capture path in `capture_thread_at_with_tally`
1605    /// (which delegates to the per-file `/proc` read helpers in
1606    /// `parse`) overwrites each field from
1607    /// `/proc/<pid>/task/<tid>/{stat,status,schedstat,cgroup}` before
1608    /// the entry is read for rendering. The `state` char uses the
1609    /// `'~'` absent-value sentinel rather than the bare `char`
    /// Default `'\0'` because `'\0'` would print as an empty cell in
    /// the ctprof table, whereas the absent-value glyph is
    /// operator-readable.
1613    fn default() -> Self {
1614        Self {
1615            tid: 0,
1616            tgid: 0,
1617            pcomm: String::new(),
1618            comm: String::new(),
1619            cgroup: String::new(),
1620            start_time_clock_ticks: 0,
1621            policy: Default::default(),
1622            nice: crate::metric_types::OrdinalI32(0),
1623            cpu_affinity: Default::default(),
1624            processor: Default::default(),
1625            // `'~'` (the absent-value sentinel) instead of the
1626            // bare `char` Default `'\0'`; see [`Self::state`].
1627            state: default_state_char(),
1628            ext_enabled: false,
1629            run_time_ns: Default::default(),
1630            wait_time_ns: Default::default(),
1631            timeslices: Default::default(),
1632            voluntary_csw: Default::default(),
1633            nonvoluntary_csw: Default::default(),
1634            nr_wakeups: Default::default(),
1635            nr_wakeups_local: Default::default(),
1636            nr_wakeups_remote: Default::default(),
1637            nr_wakeups_sync: Default::default(),
1638            nr_wakeups_migrate: Default::default(),
1639            nr_wakeups_affine: Default::default(),
1640            nr_wakeups_affine_attempts: Default::default(),
1641            nr_migrations: Default::default(),
1642            nr_forced_migrations: Default::default(),
1643            nr_failed_migrations_affine: Default::default(),
1644            nr_failed_migrations_running: Default::default(),
1645            nr_failed_migrations_hot: Default::default(),
1646            wait_sum: Default::default(),
1647            wait_count: Default::default(),
1648            wait_max: Default::default(),
1649            voluntary_sleep_ns: Default::default(),
1650            sleep_max: Default::default(),
1651            block_sum: Default::default(),
1652            block_max: Default::default(),
            iowait_sum: Default::default(),
            iowait_count: Default::default(),
            exec_max: Default::default(),
            slice_max: Default::default(),
            allocated_bytes: Default::default(),
            deallocated_bytes: Default::default(),
            minflt: Default::default(),
            majflt: Default::default(),
            utime_clock_ticks: Default::default(),
            stime_clock_ticks: Default::default(),
            priority: Default::default(),
            rt_priority: Default::default(),
            core_forceidle_sum: Default::default(),
            fair_slice_ns: Default::default(),
            nr_threads: Default::default(),
            smaps_rollup_kib: BTreeMap::new(),
            rchar: Default::default(),
            wchar: Default::default(),
            syscr: Default::default(),
            syscw: Default::default(),
            read_bytes: Default::default(),
            write_bytes: Default::default(),
            cancelled_write_bytes: Default::default(),
            cpu_delay_count: Default::default(),
            cpu_delay_total_ns: Default::default(),
            cpu_delay_max_ns: Default::default(),
            cpu_delay_min_ns: Default::default(),
            blkio_delay_count: Default::default(),
            blkio_delay_total_ns: Default::default(),
            blkio_delay_max_ns: Default::default(),
            blkio_delay_min_ns: Default::default(),
            swapin_delay_count: Default::default(),
            swapin_delay_total_ns: Default::default(),
            swapin_delay_max_ns: Default::default(),
            swapin_delay_min_ns: Default::default(),
            freepages_delay_count: Default::default(),
            freepages_delay_total_ns: Default::default(),
            freepages_delay_max_ns: Default::default(),
            freepages_delay_min_ns: Default::default(),
            thrashing_delay_count: Default::default(),
            thrashing_delay_total_ns: Default::default(),
            thrashing_delay_max_ns: Default::default(),
            thrashing_delay_min_ns: Default::default(),
            compact_delay_count: Default::default(),
            compact_delay_total_ns: Default::default(),
            compact_delay_max_ns: Default::default(),
            compact_delay_min_ns: Default::default(),
            wpcopy_delay_count: Default::default(),
            wpcopy_delay_total_ns: Default::default(),
            wpcopy_delay_max_ns: Default::default(),
            wpcopy_delay_min_ns: Default::default(),
            irq_delay_count: Default::default(),
            irq_delay_total_ns: Default::default(),
            irq_delay_max_ns: Default::default(),
            irq_delay_min_ns: Default::default(),
            hiwater_rss_bytes: Default::default(),
            hiwater_vm_bytes: Default::default(),
            // Absent until a successful capture sets them (apply_delay_stats /
            // the jemalloc probe assignment); a default ThreadState is the
            // not-yet-captured placeholder.
            taskstats_measured: false,
            cpu_delay_active: false,
            delay_block_active: false,
            xacct_active: false,
            jemalloc_measured: false,
        }
    }
}

impl ThreadState {
    /// Overwrite the taskstats-sourced delay-accounting fields
    /// from a `DelayStats` payload. Called by `capture_with` /
    /// `capture_pid_with` after a successful per-tid
    /// [`crate::taskstats::TaskstatsClient::query_tid`] call;
    /// query failures leave the fields at the absent-counter
    /// default of zero installed in `capture_thread_at_with_tally`.
    pub(crate) fn apply_delay_stats(
        &mut self,
        ds: &crate::taskstats::DelayStats,
        active: crate::host_context::TaskstatsActive,
    ) {
        use crate::metric_types::{MonotonicCount, MonotonicNs, PeakBytes, PeakNs};
        self.cpu_delay_count = MonotonicCount(ds.cpu_count);
        self.cpu_delay_total_ns = MonotonicNs(ds.cpu_delay_total_ns);
        self.cpu_delay_max_ns = PeakNs(ds.cpu_delay_max_ns);
        self.cpu_delay_min_ns = PeakNs(ds.cpu_delay_min_ns);
        self.blkio_delay_count = MonotonicCount(ds.blkio_count);
        self.blkio_delay_total_ns = MonotonicNs(ds.blkio_delay_total_ns);
        self.blkio_delay_max_ns = PeakNs(ds.blkio_delay_max_ns);
        self.blkio_delay_min_ns = PeakNs(ds.blkio_delay_min_ns);
        self.swapin_delay_count = MonotonicCount(ds.swapin_count);
        self.swapin_delay_total_ns = MonotonicNs(ds.swapin_delay_total_ns);
        self.swapin_delay_max_ns = PeakNs(ds.swapin_delay_max_ns);
        self.swapin_delay_min_ns = PeakNs(ds.swapin_delay_min_ns);
        self.freepages_delay_count = MonotonicCount(ds.freepages_count);
        self.freepages_delay_total_ns = MonotonicNs(ds.freepages_delay_total_ns);
        self.freepages_delay_max_ns = PeakNs(ds.freepages_delay_max_ns);
        self.freepages_delay_min_ns = PeakNs(ds.freepages_delay_min_ns);
        self.thrashing_delay_count = MonotonicCount(ds.thrashing_count);
        self.thrashing_delay_total_ns = MonotonicNs(ds.thrashing_delay_total_ns);
        self.thrashing_delay_max_ns = PeakNs(ds.thrashing_delay_max_ns);
        self.thrashing_delay_min_ns = PeakNs(ds.thrashing_delay_min_ns);
        self.compact_delay_count = MonotonicCount(ds.compact_count);
        self.compact_delay_total_ns = MonotonicNs(ds.compact_delay_total_ns);
        self.compact_delay_max_ns = PeakNs(ds.compact_delay_max_ns);
        self.compact_delay_min_ns = PeakNs(ds.compact_delay_min_ns);
        self.wpcopy_delay_count = MonotonicCount(ds.wpcopy_count);
        self.wpcopy_delay_total_ns = MonotonicNs(ds.wpcopy_delay_total_ns);
        self.wpcopy_delay_max_ns = PeakNs(ds.wpcopy_delay_max_ns);
        self.wpcopy_delay_min_ns = PeakNs(ds.wpcopy_delay_min_ns);
        self.irq_delay_count = MonotonicCount(ds.irq_count);
        self.irq_delay_total_ns = MonotonicNs(ds.irq_delay_total_ns);
        self.irq_delay_max_ns = PeakNs(ds.irq_delay_max_ns);
        self.irq_delay_min_ns = PeakNs(ds.irq_delay_min_ns);
        self.hiwater_rss_bytes = PeakBytes(ds.hiwater_rss_bytes);
        self.hiwater_vm_bytes = PeakBytes(ds.hiwater_vm_bytes);
        // The only site that overwrites the absent-counter zero defaults from a
        // real taskstats payload: mark the payload measured (query Ok) so the
        // group aggregation (`ctprof_compare::groups::measured_predicate`) can
        // distinguish a measured zero from a never-captured payload
        // (CONFIG_TASKSTATS off / no CAP_NET_ADMIN / query raced exit), which
        // otherwise both read as 0. The per-sub-family enablement (host-global,
        // probed once per snapshot) is baked in alongside; the predicate ANDs the
        // query-Ok flag with the matching sub-family flag, so a sub-family
        // disabled while the query succeeds renders "-" not "0" (see the field
        // docs).
        self.taskstats_measured = true;
        self.cpu_delay_active = active.cpu_delay;
        self.delay_block_active = active.delay_block;
        self.xacct_active = active.xacct;
    }
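/*
Illustration only (kept in a block comment, so it is not compiled). A minimal
sketch of the measured-vs-absent distinction that `apply_delay_stats` records,
assuming a renderer that prints "-" whenever either gate is false.
`render_counter` and its parameters are hypothetical names; the real predicate
lives in `ctprof_compare::groups` and may be shaped differently.

```rust
/// Hypothetical renderer: a counter is shown only when the taskstats query
/// succeeded AND the matching sub-family was enabled on the host.
fn render_counter(value: u64, query_ok: bool, sub_family_active: bool) -> String {
    if query_ok && sub_family_active {
        // Measured value, including a genuine zero.
        value.to_string()
    } else {
        // Never captured, or the kernel was not accounting this family.
        "-".to_string()
    }
}
```

With both gates set, a stored zero renders as a real "0"; with either gate
clear, the same stored zero renders as "-".
*/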

    /// Iterate over [`Self::smaps_rollup_kib`] with values
    /// converted from KiB to bytes via `saturating_mul(1024)`.
    /// The kernel labels smaps_rollup values `kB`, meaning
    /// 1024-byte units; the project's display layer auto-scales
    /// bytes via the existing "B" → KiB → MiB → GiB ladder, so a
    /// single helper centralizes the unit conversion at every
    /// render site (write_show + write_diff). The saturating
    /// multiply guards against pathological input from a
    /// malformed snapshot file. The result is wrapped in
    /// [`crate::metric_types::Bytes`] so the byte-typed value
    /// flows through the same auto-scale path as the rest of
    /// the byte-tagged registry metrics.
    pub fn smaps_rollup_bytes(
        &self,
    ) -> impl Iterator<Item = (&String, crate::metric_types::Bytes)> {
        self.smaps_rollup_kib
            .iter()
            .map(|(k, v)| (k, crate::metric_types::Bytes(v.saturating_mul(1024))))
    }
}
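/*
Illustration only (kept in a block comment, so it is not compiled). A minimal
sketch of the KiB-to-bytes conversion performed by `smaps_rollup_bytes`, written
as a free function over plain `u64` values. `kib_map_to_bytes` is a hypothetical
name, and the `Bytes` wrapper is left out to keep the sketch self-contained.

```rust
use std::collections::BTreeMap;

/// Hypothetical free-function analogue: borrow a KiB-valued map and yield
/// byte values lazily, clamping at `u64::MAX` instead of overflowing.
fn kib_map_to_bytes(kib: &BTreeMap<String, u64>) -> impl Iterator<Item = (&String, u64)> {
    kib.iter().map(|(k, v)| (k, v.saturating_mul(1024)))
}
```

A malformed entry near `u64::MAX` saturates instead of wrapping, so a corrupt
snapshot cannot produce a small, plausible-looking byte count.
*/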

/// Per-cgroup enrichment record attached to [`CtprofSnapshot`].
///
/// Populated from the cgroup v2 filesystem at capture time. The
/// shape mirrors the kernel's per-controller file layout:
/// [`CgroupCpuStats`] holds the `cpu.*` files,
/// [`CgroupMemoryStats`] holds the `memory.*` files,
/// [`CgroupPidsStats`] holds the `pids.*` files, and [`Psi`]
/// holds the `<resource>.pressure` files. These are
/// aggregate-over-the-cgroup values — NOT summable from
/// per-thread data — so the capture layer reads them directly
/// from cgroupfs rather than deriving.
///
/// Nested-struct shape (rather than a flat ~50-field struct)
/// mirrors the kernel's controller-by-controller exposure: a
/// reader who knows the kernel layout can map directly between
/// cgroupfs files and Rust fields, and the merge policy in
/// [`crate::ctprof_compare::flatten_cgroup_stats`] applies
/// per-domain (max for limits, min for floors, saturating_add
/// for counters) without conflating across domains.
///
/// Schema note: the previous flat shape (4 fields:
/// `cpu_usage_usec`, `nr_throttled`, `throttled_usec`,
/// `memory_current`) is gone. Snapshots written by older
/// versions still deserialize via serde's defaulting, but the
/// old flat fields are not migrated: the new nested fields
/// take their zero defaults, so a baseline-vs-candidate
/// compare against an old snapshot produces "every counter
/// went from N to 0". Re-capture both sides with the current
/// build to compare faithfully. Per the project's pre-1.0
/// disposable-sidecar policy this is intentional.
#[derive(Debug, Clone, Default, serde::Serialize, serde::Deserialize)]
#[non_exhaustive]
pub struct CgroupStats {
    pub cpu: CgroupCpuStats,
    pub memory: CgroupMemoryStats,
    pub pids: CgroupPidsStats,
    /// Pressure Stall Information for this cgroup, per resource.
    /// Populated from `<cgroup>/cpu.pressure`,
    /// `<cgroup>/memory.pressure`, `<cgroup>/io.pressure`, and
    /// `<cgroup>/irq.pressure` (cgroup v2 files declared in
    /// `cgroup_psi_files[]` (`kernel/cgroup/cgroup.c`)). Defaults to all-zero
    /// when the kernel has CONFIG_PSI off, when PSI is disabled
    /// at runtime via the `psi=0` boot param, or when individual
    /// resource files are absent (older kernels missing
    /// irq.pressure).
    pub psi: Psi,
}

/// CPU controller state for one cgroup. Fields mirror the
/// `cpu.*` cgroup v2 files exposed under
/// `<cgroup>/cpu.stat`, `<cgroup>/cpu.max`,
/// `<cgroup>/cpu.weight`, and `<cgroup>/cpu.weight.nice`.
#[derive(Debug, Clone, Default, serde::Serialize, serde::Deserialize)]
#[non_exhaustive]
pub struct CgroupCpuStats {
    /// `usage_usec` from `cpu.stat`. Cumulative CPU time consumed
    /// by tasks in this cgroup, in microseconds.
    pub usage_usec: u64,
    /// `nr_throttled` from `cpu.stat`. Cumulative count of
    /// CFS-bandwidth throttling events that paused this cgroup.
    pub nr_throttled: u64,
    /// `throttled_usec` from `cpu.stat`. Cumulative wall-clock
    /// time the cgroup spent throttled by CFS bandwidth.
    pub throttled_usec: u64,
    /// `cpu.max` quota in microseconds. `None` when the file is
    /// absent (root cgroup) OR when the kernel emits the literal
    /// "max" token (no CFS bandwidth cap configured for this
    /// cgroup).
    pub max_quota_us: Option<u64>,
    /// `cpu.max` period in microseconds. The kernel default is
    /// 100_000 (100 ms). Always present alongside the quota
    /// half on a child cgroup; set to 100_000 when the file is
    /// absent (root cgroup).
    pub max_period_us: u64,
    /// `cpu.weight` (1..=10_000, default 100). `None` when the
    /// file is absent (root cgroup); the kernel does not allow
    /// 0 as a value, so the absent-vs-zero distinction is
    /// load-bearing.
    pub weight: Option<u64>,
    /// `cpu.weight.nice` (-20..=19, default 0). `None` when the
    /// file is absent. Alias-domain for [`Self::weight`] —
    /// the kernel writes both files in lockstep but they're
    /// captured independently to surface any
    /// kernel-version-specific divergence.
    pub weight_nice: Option<i32>,
}
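/*
Illustration only (kept in a block comment, so it is not compiled). A minimal
sketch of how the single-line `cpu.max` format ("<quota or max> <period>") could
map onto `max_quota_us` / `max_period_us`. `parse_cpu_max` is a hypothetical
helper, not the parser in `parse.rs`, and it assumes both tokens are present.

```rust
/// Hypothetical parser. Returns `(quota, period)`; a `None` quota means the
/// kernel emitted the literal `max` token (no bandwidth cap).
fn parse_cpu_max(line: &str) -> Option<(Option<u64>, u64)> {
    let mut parts = line.split_whitespace();
    let quota = match parts.next()? {
        "max" => None,
        n => Some(n.parse::<u64>().ok()?),
    };
    // Assumption: a line without a period token is treated as malformed.
    let period = parts.next()?.parse::<u64>().ok()?;
    Some((quota, period))
}
```

The outer `Option` separates "malformed line" from "valid line with no cap",
which keeps the absent-vs-unlimited cases distinguishable for the caller.
*/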

/// Memory controller state for one cgroup. Fields mirror the
/// `memory.*` cgroup v2 files. `stat` and `events` are
/// captured as flat key-value maps so the data model
/// auto-extends when the kernel adds new keys (memory.stat
/// has 71 keys on a recent kernel; the explicit list is
/// scheduler-correctness-relevant but the map preserves
/// regression-detection on lesser-known counters).
#[derive(Debug, Clone, Default, serde::Serialize, serde::Deserialize)]
#[non_exhaustive]
pub struct CgroupMemoryStats {
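    /// `memory.current`, the total memory currently charged to
    /// the cgroup and its descendants, in bytes. An
    /// instantaneous gauge that covers more than RSS: page
    /// cache and kernel memory are included as well.
    pub current: u64,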
    /// `memory.max`, hard memory limit in bytes. `None` when
    /// the file is absent (root cgroup) OR when the kernel
    /// emits the literal "max" token (no hard cap).
    pub max: Option<u64>,
    /// `memory.high`, soft pressure limit in bytes. `None` when
    /// absent or unlimited (same "max"-token semantics as
    /// [`Self::max`]).
    pub high: Option<u64>,
    /// `memory.low`, best-effort protection floor in bytes.
    /// `None` when the file is absent (no protection
    /// configured); `Some(u64::MAX)` when the kernel emits the
    /// literal `max` token (request maximum protection — every
    /// byte under the cgroup is protected). Per the kernel's
    /// cgroup v2 docs, memory under `low` is protected from
    /// reclaim unless no unprotected memory remains. Note the
    /// asymmetry vs. limits: `None` means "no floor" (semantic
    /// opposite of "max"-as-no-cap on the limit fields above).
    pub low: Option<u64>,
    /// `memory.min`, hard protection floor in bytes. `None`
    /// when absent (no floor). `Some(u64::MAX)` when the kernel
    /// emits `max` (full protection). Stronger than `low` —
    /// memory under `min` is never reclaimed even under
    /// memory pressure.
    pub min: Option<u64>,
    /// `memory.stat` parsed as a key-value map. Keys mirror the
    /// kernel-emitted strings (e.g. `anon`, `file`,
    /// `workingset_refault_anon`, `pgfault`, `pgmajfault`,
    /// `slab`, the active/inactive variants, etc.). Empty when
    /// the file is absent.
    pub stat: BTreeMap<String, u64>,
    /// `memory.events` parsed as a key-value map. Typical keys:
    /// `low`, `high`, `max`, `oom`, `oom_kill`,
    /// `oom_group_kill`, `sock_throttled` (subset varies by
    /// kernel version). Empty when the file is absent.
    pub events: BTreeMap<String, u64>,
}
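/*
Illustration only (kept in a block comment, so it is not compiled). A minimal
sketch of the limit-vs-floor asymmetry for the literal `max` token described in
the field docs above. `parse_limit` and `parse_floor` are hypothetical helpers;
the absent-file case is assumed to be handled by the caller before these run.

```rust
/// Hypothetical parser for limit files (`memory.max`, `memory.high`):
/// `max` means "no cap", so it maps to `None`.
fn parse_limit(raw: &str) -> Option<u64> {
    match raw.trim() {
        "max" => None,
        n => n.parse().ok(),
    }
}

/// Hypothetical parser for floor files (`memory.low`, `memory.min`):
/// `max` means "protect everything", so it maps to `Some(u64::MAX)`.
fn parse_floor(raw: &str) -> Option<u64> {
    match raw.trim() {
        "max" => Some(u64::MAX),
        n => n.parse().ok(),
    }
}
```

Keeping the two parsers separate makes the opposite meanings of `None`
explicit at the call site instead of hiding them behind a shared helper.
*/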

/// PIDs controller state for one cgroup. Fields mirror the
/// `pids.*` cgroup v2 files. The pids controller is optional
/// (must be enabled in `cgroup.subtree_control`); on hosts that
/// don't enable it, both fields are `None`.
#[derive(Debug, Clone, Default, serde::Serialize, serde::Deserialize)]
#[non_exhaustive]
pub struct CgroupPidsStats {
    /// `pids.current`, current task count in this cgroup.
    /// `None` when the file is absent (pids controller not
    /// enabled).
    pub current: Option<u64>,
    /// `pids.max`, hard task-count limit. `None` when the file
    /// is absent OR when the kernel emits the literal "max"
    /// token (no cap).
    pub max: Option<u64>,
}

/// One Pressure Stall Information half-line: either the `some`
/// or `full` row for one resource. Mirrors the kernel emission
/// format `%s avg10=%lu.%02lu avg60=%lu.%02lu avg300=%lu.%02lu total=%llu`
/// in `psi_show()` (`kernel/sched/psi.c`).
///
/// `avg10/60/300` are stored as **centi-percent** (lossless
/// fixed-point): the kernel writes `LOAD_INT(avg).LOAD_FRAC(avg)`
/// as a 2-decimal-digit percentage in `psi_show()`. The integer
/// expansion is `int * 100 + frac`, giving a numerical range of
/// `0..=10099`. The upper bound is `100.99` (not `100.00`)
/// because the kernel's EWMA helper `calc_load()`
/// (`include/linux/sched/loadavg.h`) rounds via `newload +=
/// FIXED_1 - 1` before the final `>> FSHIFT`, so a fully-loaded
/// group can land just over `100.0` for one sample. Storing
/// integers rather than `f64` avoids the serde JSON
/// float-roundtrip drift that would otherwise show up as
/// spurious non-zero deltas in compare output.
///
/// `total_usec` is microseconds (kernel
/// `div_u64(total_ns, NSEC_PER_USEC)` in `psi_show()`). Same unit
/// as [`CgroupCpuStats::usage_usec`], so the existing
/// auto_scale "µs" ladder applies.
///
/// "some" semantics: at least one task is stalled on this
/// resource. "full" semantics: every runnable task is stalled.
/// At the SYSTEM level (`/proc/pressure/cpu`), `cpu.full` is
/// always zero by kernel design — the explicit gate
/// `if (!(group == &psi_system && res == PSI_CPU && full))` in
/// `psi_show()` (`kernel/sched/psi.c`) skips the avg/total
/// computation, but the `seq_printf` in `psi_show()` still emits
/// the structurally-present line. Per-cgroup `cpu.full` (under
/// `<cgroup>/cpu.pressure`) IS meaningful and computed
/// normally. `irq` is full-only (kernel `only_full = res == PSI_IRQ`
/// in `psi_show()`), so [`PsiResource::some`] for irq always reads
/// zero.
#[derive(Debug, Clone, Copy, Default, serde::Serialize, serde::Deserialize)]
#[non_exhaustive]
pub struct PsiHalf {
    /// 10-second running average of pressure %, scaled by 100
    /// (so 0..=10099 covers 0.00..=100.99 — see the EWMA-rounding
    /// note on the struct doc).
    pub avg10: u16,
    /// 60-second running average of pressure %, same scaling.
    pub avg60: u16,
    /// 300-second running average of pressure %, same scaling.
    pub avg300: u16,
    /// Cumulative total stalled time in microseconds.
    pub total_usec: u64,
}

impl PsiHalf {
    /// Convert the centi-percent `avg10` value to a percentage
    /// `f64`. Returns `0.0..=100.99` per the kernel's EWMA
    /// rounding (see struct-level doc).
    pub fn avg10_percent(&self) -> f64 {
        self.avg10 as f64 / 100.0
    }

    /// Convert the centi-percent `avg60` value to a percentage
    /// `f64`. Same range as [`Self::avg10_percent`].
    pub fn avg60_percent(&self) -> f64 {
        self.avg60 as f64 / 100.0
    }

    /// Convert the centi-percent `avg300` value to a percentage
    /// `f64`. Same range as [`Self::avg10_percent`].
    pub fn avg300_percent(&self) -> f64 {
        self.avg300 as f64 / 100.0
    }
}
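/*
Illustration only (kept in a block comment, so it is not compiled). A minimal
sketch of turning one PSI row into centi-percent integers without passing
through floating point. `PsiHalfSketch`, `parse_centi_percent`, and
`parse_psi_line` are hypothetical names, and skipping unknown keys is an
assumption of this sketch rather than documented parser behavior.

```rust
/// Hypothetical stand-in for `PsiHalf`, redefined so the sketch is self-contained.
#[derive(Debug, Default, PartialEq)]
struct PsiHalfSketch {
    avg10: u16,
    avg60: u16,
    avg300: u16,
    total_usec: u64,
}

/// Parse `"<int>.<2 digits>"` into centi-percent (`int * 100 + frac`).
fn parse_centi_percent(field: &str) -> Option<u16> {
    let (int, frac) = field.split_once('.')?;
    if frac.len() != 2 || !frac.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    let int: u16 = int.parse().ok()?;
    let frac: u16 = frac.parse().ok()?;
    int.checked_mul(100)?.checked_add(frac)
}

/// Parse one `some ...` / `full ...` row into its label and values.
fn parse_psi_line(line: &str) -> Option<(String, PsiHalfSketch)> {
    let mut tokens = line.split_whitespace();
    let label = tokens.next()?.to_string();
    let mut half = PsiHalfSketch::default();
    for token in tokens {
        let (key, value) = token.split_once('=')?;
        match key {
            "avg10" => half.avg10 = parse_centi_percent(value)?,
            "avg60" => half.avg60 = parse_centi_percent(value)?,
            "avg300" => half.avg300 = parse_centi_percent(value)?,
            "total" => half.total_usec = value.parse().ok()?,
            // Assumption: unknown keys are skipped rather than rejected.
            _ => {}
        }
    }
    Some((label, half))
}
```

Because the values stay integral from parse to serialization, two captures of
an unchanged average compare as exactly equal.
*/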

/// Pressure Stall Information for one resource (cpu / memory /
/// io / irq), bundling the `some` and `full` halves.
#[derive(Debug, Clone, Copy, Default, serde::Serialize, serde::Deserialize)]
#[non_exhaustive]
pub struct PsiResource {
    pub some: PsiHalf,
    pub full: PsiHalf,
}

/// Bundle of [`PsiResource`] for the four kernel-exposed
/// resources. Same shape used at both system level
/// ([`CtprofSnapshot::psi`]) and per-cgroup
/// ([`CgroupStats::psi`]) — the data source differs but the
/// kernel emits the same format and field set in both places.
#[derive(Debug, Clone, Copy, Default, serde::Serialize, serde::Deserialize)]
#[non_exhaustive]
pub struct Psi {
    pub cpu: PsiResource,
    pub memory: PsiResource,
    pub io: PsiResource,
    /// IRQ pressure. Only the `full` half is populated by the
    /// kernel (`psi_show()` sets `only_full = res == PSI_IRQ`);
    /// `irq.some` is structurally present but always zero.
    /// Requires both `CONFIG_IRQ_TIME_ACCOUNTING` at build AND
    /// `irqtime_enabled()` at runtime (`/proc/pressure/irq` returns
    /// `-EOPNOTSUPP` per `psi_show()` (`kernel/sched/psi.c`) otherwise);
    /// runtime irqtime is gated by the `tsc=...` boot param /
    /// `irqtime_enabled` static branch — when off, the file open
    /// fails and the parser leaves this resource at the default
    /// all-zero value.
    pub irq: PsiResource,
}

/// Global sched_ext sysfs state, captured from
/// `/sys/kernel/sched_ext/`. The kernel registers exactly five
/// global attributes via `scx_global_attrs[]`
/// (`kernel/sched/ext.c`); this struct mirrors them
/// 1-to-1.
///
/// Per-scheduler attrs (`/sys/kernel/sched_ext/root/...`) are
/// out of scope: those are scheduler-specific internals
/// (queued/dispatched/ops-name) that come and go as schedulers
/// load and unload, and answer different questions than the
/// global counters here.
#[derive(Debug, Clone, Default, serde::Serialize, serde::Deserialize)]
#[non_exhaustive]
pub struct SchedExtSysfs {
    /// `state` — sched_ext class enable state. One of
    /// `enabling`, `enabled`, `disabling`, `disabled` per
    /// `scx_enable_state_str[]`
    /// (`kernel/sched/ext_internal.h`). Emitted by
    /// `scx_attr_state_show()`
    /// (`kernel/sched/ext.c`). Defaults to empty string
    /// when the file is unreadable; `disabled` when no scx
    /// scheduler is currently loaded. The "is sched_ext active
    /// during this capture?" answer.
    pub state: String,

    /// `switch_all` — boolean (rendered as 0/1) indicating
    /// whether ALL scheduling classes have been switched to
    /// scx (vs. only those tasks the BPF scheduler claims via
    /// the per-task selection path). Emitted by
    /// `scx_attr_switch_all_show()`
    /// (`kernel/sched/ext.c`) via
    /// `READ_ONCE(scx_switching_all)`.
    pub switch_all: u64,

    /// `nr_rejected` — count of tasks rejected from
    /// SCHED_EXT during init when `ops.init_task()` set
    /// `p->disallow`. Increment in `__scx_init_task()`
    /// (`kernel/sched/ext.c`): when a task entering
    /// SCHED_EXT has its policy reverted to SCHED_NORMAL
    /// because the BPF scheduler asked the kernel to disallow
    /// it, `atomic_long_inc(&scx_nr_rejected)` fires.
    /// `atomic_long_read(&scx_nr_rejected)` is emitted by
    /// `scx_attr_nr_rejected_show()`
    /// (`kernel/sched/ext.c`).
    ///
    /// Resets to 0 on every scheduler load: `scx_root_enable_workfn()`
    /// (`kernel/sched/ext.c`) does
    /// `atomic_long_set(&scx_nr_rejected, 0)` before bringing
    /// the new scheduler online. To detect a reload-driven
    /// reset rather than a genuine cumulative drop, pair the
    /// nr_rejected delta with [`Self::enable_seq`] — any
    /// enable_seq movement across two snapshots invalidates
    /// nr_rejected as a monotonic counter.
    ///
    /// Does NOT count runtime dispatch errors. The "did the
    /// scheduler reject a dispatch operation at runtime?"
    /// question is answered by per-scheduler debug data
    /// (`/sys/kernel/sched_ext/root/...`), out of scope for
    /// this global-attrs struct.
    pub nr_rejected: u64,

    /// `hotplug_seq` — per-CPU-hotplug-event sequence counter.
    /// Atomic long incremented every time the kernel observes a
    /// hotplug transition. Emitted by
    /// `scx_attr_hotplug_seq_show()`
    /// (`kernel/sched/ext.c`). Comparing two snapshots:
    /// any delta indicates that a CPU online/offline event
    /// happened during the interval, which can confound
    /// per-CPU statistics.
    pub hotplug_seq: u64,

    /// `enable_seq` — per-scheduler-load sequence counter.
    /// Atomic long incremented in `scx_root_enable_workfn()`
    /// (`kernel/sched/ext.c`, `atomic_long_inc(&scx_enable_seq)`)
    /// each time a scx scheduler is enabled. Comparing two
    /// snapshots: any delta indicates a scheduler reload
    /// happened during the interval — counter resets on the
    /// scx side will surface here even if the per-thread data
    /// looks continuous.
    pub enable_seq: u64,
}
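/*
Illustration only (kept in a block comment, so it is not compiled). A minimal
sketch of pairing the `nr_rejected` delta with `enable_seq`, as the field docs
above recommend. `ScxSeqSketch` and `nr_rejected_delta` are hypothetical names;
how the compare layer actually reports an invalidated delta is not shown in
this file.

```rust
/// Hypothetical two-field view of the sched_ext sysfs state.
struct ScxSeqSketch {
    nr_rejected: u64,
    enable_seq: u64,
}

/// Return the `nr_rejected` delta, or `None` when a scheduler reload between
/// the two snapshots reset the counter and made the subtraction meaningless.
fn nr_rejected_delta(before: &ScxSeqSketch, after: &ScxSeqSketch) -> Option<u64> {
    if before.enable_seq != after.enable_seq {
        return None;
    }
    // Same scheduler load: the counter only grows, so this cannot underflow
    // on well-formed input; `checked_sub` guards malformed snapshots.
    after.nr_rejected.checked_sub(before.nr_rejected)
}
```

Returning `None` rather than a wrapped or clamped number keeps a reload from
being misread as a drop in rejections.
*/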

// Parsers + tallying readers live in `parse.rs` to keep this
// production file under the per-file line budget. The `use
// parse::*;` glob keeps existing call sites unchanged.
mod parse;
use parse::*;

/// Capture one thread's procfs-derived profile under an arbitrary
/// procfs root. Each procfs reader returns `Option`; the assembled
/// [`ThreadState`] coerces `None` to the field's default per the
/// module-level capture contract. The jemalloc per-thread TSD
/// counters (`allocated_bytes` / `deallocated_bytes`) are NOT
/// populated by this function — they require a tgid-scoped probe
/// attach that the caller owns ([`capture_with`] /
/// [`capture_pid_with`] do this and write the counters directly
/// onto the returned `ThreadState`). On the returned struct, both
/// fields therefore land at the absent-counter default of zero
/// unless the caller overwrites them.
///
/// `comm` is the thread name the caller has already read from
/// `<proc_root>/<tgid>/task/<tid>/comm` (typically via
/// [`read_thread_comm_at`]). Passing it in — symmetric with the
/// pre-existing `pcomm` parameter — lets the caller share one
/// procfs read with the per-tid probe-recording path
/// (`probe_thread_recording`), which needs the thread name for
/// tracing on probe failures: a hot loop that re-reads the file
/// inside this fn would double the comm syscalls per tid on hosts
/// with thousands of threads.
///
/// Pass an empty string for the absent-comm default; the ghost
/// filter in [`capture_with`] / [`capture_pid_with`] keys on
/// `ThreadState::comm.is_empty()` to drop a tid that exited
/// between `iter_task_ids_at` and this call, so an empty `comm`
/// is the correct shape for that path.
///
/// `use_syscall_affinity` gates the `sched_getaffinity(2)` path —
/// tests staging a synthetic `/proc` pass `false` so the syscall
/// does not read the REAL affinity of the test process; production
/// passes `true` and falls back to `Cpus_allowed_list:` when the
/// syscall returns EPERM.
#[cfg(test)]
fn capture_thread_at(
    proc_root: &Path,
    tgid: i32,
    tid: i32,
    pcomm: &str,
    comm: &str,
    use_syscall_affinity: bool,
) -> ThreadState {
    capture_thread_at_with_tally(
        proc_root,
        tgid,
        tid,
        pcomm,
        comm,
        use_syscall_affinity,
        &mut None,
    )
}
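/*
Illustration only (kept in a block comment, so it is not compiled). A minimal
sketch of the affinity selection that `use_syscall_affinity` gates, with both
sources modelled as plain `Option<Vec<u32>>` values. `resolve_affinity` and its
parameter names are hypothetical; the real code reads the syscall result and
the parsed status field directly.

```rust
/// Hypothetical stand-in: `from_syscall` models the `sched_getaffinity(2)`
/// result, `from_status` the parsed `Cpus_allowed_list:` value.
fn resolve_affinity(
    from_syscall: Option<Vec<u32>>,
    from_status: Option<Vec<u32>>,
    use_syscall_affinity: bool,
) -> Vec<u32> {
    if use_syscall_affinity {
        // Prefer the syscall; fall back to procfs, then to an empty set.
        from_syscall.or(from_status).unwrap_or_default()
    } else {
        // Synthetic-tree tests never consult the real syscall.
        from_status.unwrap_or_default()
    }
}
```

With the flag off, a staged `/proc` tree fully determines the result, which
keeps tests independent of the affinity of the process running them.
*/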

/// Per-tid procfs walk. Threads a `&mut ParseTally` through every
/// per-file reader so per-tid read failures land in the
/// per-snapshot [`CtprofParseSummary`] when the capture
/// pipeline runs in production mode (`use_syscall_affinity=true`).
/// Synthetic-tree tests typically pass `&mut None` for the tally,
/// matching the pre-tally shape.
fn capture_thread_at_with_tally(
    proc_root: &Path,
    tgid: i32,
    tid: i32,
    pcomm: &str,
    comm: &str,
    use_syscall_affinity: bool,
    tally: &mut Option<&mut ParseTally>,
) -> ThreadState {
    let cgroup = read_cgroup_at_with_tally(proc_root, tgid, tid, tally).unwrap_or_default();
    let stat = read_stat_at_with_tally(proc_root, tgid, tid, tally);
    let (run_time_ns, wait_time_ns, timeslices) =
        read_schedstat_at_with_tally(proc_root, tgid, tid, tally);
    let io = read_io_at_with_tally(proc_root, tgid, tid, tally);
    let status = read_status_at_with_tally(proc_root, tgid, tid, tally);
    let sched = read_sched_at_with_tally(proc_root, tgid, tid, tally);
    let smaps_rollup_kib = read_smaps_rollup_at_with_tally(proc_root, tgid, tid, tally);
    let cpu_affinity = if use_syscall_affinity {
        crate::cpu_util::read_affinity(tid)
            .or(status.cpus_allowed)
            .unwrap_or_default()
    } else {
        status.cpus_allowed.unwrap_or_default()
    };
    use crate::metric_types::{
        Bytes, CategoricalString, ClockTicks, CpuSet, GaugeCount, GaugeNs, MonotonicCount,
        MonotonicNs, OrdinalI32, OrdinalU32, PeakNs,
    };
    ThreadState {
        tid: tid as u32,
        tgid: tgid as u32,
        pcomm: pcomm.to_string(),
        comm: comm.to_string(),
        cgroup,
        start_time_clock_ticks: stat.start_time_clock_ticks.unwrap_or(0),
        policy: CategoricalString(stat.policy.map(policy_name).unwrap_or_default()),
        nice: OrdinalI32(stat.nice.unwrap_or(0)),
        cpu_affinity: CpuSet(cpu_affinity),
        processor: OrdinalI32(stat.processor.unwrap_or(0)),
        state: status.state.unwrap_or_else(default_state_char),
        ext_enabled: sched.ext_enabled.unwrap_or(false),
        run_time_ns: MonotonicNs(run_time_ns.unwrap_or(0)),
        wait_time_ns: MonotonicNs(wait_time_ns.unwrap_or(0)),
        timeslices: MonotonicCount(timeslices.unwrap_or(0)),
        voluntary_csw: MonotonicCount(status.voluntary_csw.unwrap_or(0)),
        nonvoluntary_csw: MonotonicCount(status.nonvoluntary_csw.unwrap_or(0)),
        nr_wakeups: MonotonicCount(sched.nr_wakeups.unwrap_or(0)),
        nr_wakeups_local: MonotonicCount(sched.nr_wakeups_local.unwrap_or(0)),
        nr_wakeups_remote: MonotonicCount(sched.nr_wakeups_remote.unwrap_or(0)),
        nr_wakeups_sync: MonotonicCount(sched.nr_wakeups_sync.unwrap_or(0)),
        nr_wakeups_migrate: MonotonicCount(sched.nr_wakeups_migrate.unwrap_or(0)),
        nr_wakeups_affine: MonotonicCount(sched.nr_wakeups_affine.unwrap_or(0)),
        nr_wakeups_affine_attempts: MonotonicCount(sched.nr_wakeups_affine_attempts.unwrap_or(0)),
        nr_migrations: MonotonicCount(sched.nr_migrations.unwrap_or(0)),
        nr_forced_migrations: MonotonicCount(sched.nr_forced_migrations.unwrap_or(0)),
        nr_failed_migrations_affine: MonotonicCount(sched.nr_failed_migrations_affine.unwrap_or(0)),
        nr_failed_migrations_running: MonotonicCount(
            sched.nr_failed_migrations_running.unwrap_or(0),
        ),
        nr_failed_migrations_hot: MonotonicCount(sched.nr_failed_migrations_hot.unwrap_or(0)),
        wait_sum: MonotonicNs(sched.wait_sum.unwrap_or(0)),
        wait_count: MonotonicCount(sched.wait_count.unwrap_or(0)),
        wait_max: PeakNs(sched.wait_max.unwrap_or(0)),
        // Capture-time normalization: kernel's `sum_sleep_runtime`
        // counts BOTH voluntary sleep AND involuntary block (see
        // `__update_stats_enqueue_sleeper` at kernel/sched/stats.c).
        // Subtracting the block component leaves pure voluntary
        // sleep — the operationally useful signal — and avoids the
        // need for a derived metric at compare time.
        //
        // The subtraction is only meaningful when BOTH halves
        // parsed successfully. If `sum_block_runtime` is missing,
        // an `unwrap_or(0)` fallback would yield
        // `sum_sleep_runtime - 0 = full_sleep_total`, mislabelling
        // the involuntary-block component as voluntary sleep and
        // breaking the field-doc contract ("voluntary only"). If
        // `sum_sleep_runtime` is missing, the fallback would yield
        // `0 - block`, which `saturating_sub` collapses to 0 but
        // also discards any real voluntary signal that might have
        // been recorded if the kernel had emitted both. Either
        // half-missing case means the value is uncomputable, so
        // it falls through to 0 — matching the "absent data → 0"
        // convention used by every sibling field at this site
        // (e.g. `wait_sum`, `sleep_max`, `block_sum`) and
        // co-locating with the existing `block_sum: 0` that the
        // same parse miss already produces below.
        //
        // `saturating_sub` remains in the both-Some path as
        // defense against the kernel-ordering edge case:
        // `__update_stats_enqueue_sleeper` adds to
        // `sum_sleep_runtime` BEFORE adding the same delta to
        // `sum_block_runtime`, so a sample read between those
        // writes can transiently yield `block > sleep` even
        // though every in-tree path eventually settles to
        // `block <= sleep`.
        voluntary_sleep_ns: MonotonicNs(match (sched.sleep_sum, sched.block_sum) {
            (Some(sleep), Some(block)) => sleep.saturating_sub(block),
            _ => 0,
        }),
        sleep_max: PeakNs(sched.sleep_max.unwrap_or(0)),
        block_sum: MonotonicNs(sched.block_sum.unwrap_or(0)),
        block_max: PeakNs(sched.block_max.unwrap_or(0)),
        iowait_sum: MonotonicNs(sched.iowait_sum.unwrap_or(0)),
        iowait_count: MonotonicCount(sched.iowait_count.unwrap_or(0)),
        exec_max: PeakNs(sched.exec_max.unwrap_or(0)),
        slice_max: PeakNs(sched.slice_max.unwrap_or(0)),
        allocated_bytes: Bytes(0),
        deallocated_bytes: Bytes(0),
        minflt: MonotonicCount(stat.minflt.unwrap_or(0)),
        majflt: MonotonicCount(stat.majflt.unwrap_or(0)),
        utime_clock_ticks: ClockTicks(stat.utime_clock_ticks.unwrap_or(0)),
        stime_clock_ticks: ClockTicks(stat.stime_clock_ticks.unwrap_or(0)),
        priority: OrdinalI32(stat.priority.unwrap_or(0)),
        rt_priority: OrdinalU32(stat.rt_priority.unwrap_or(0)),
        core_forceidle_sum: MonotonicNs(sched.core_forceidle_sum.unwrap_or(0)),
        fair_slice_ns: GaugeNs(sched.fair_slice_ns.unwrap_or(0)),
        // Dedup `nr_threads` to only the thread leader. Every
        // thread of the same tgid sees the same kernel-emitted
        // value; populating it on every thread would let any
        // Sum-style aggregator multiply the count by itself
        // across the group. Leader-only population means the
        // registry's `AggRule::MaxGaugeCount` surfaces the
        // largest process represented in the bucket — reading
        // "the biggest process in this group" rather than "how
        // many threads the kernel believes this group contains"
        // (which is already covered by the row count).
        nr_threads: GaugeCount(if tid == tgid {
            status.nr_threads.unwrap_or(0)
        } else {
            0
2340        }),
2341        smaps_rollup_kib,
2342        rchar: Bytes(io.rchar.unwrap_or(0)),
2343        wchar: Bytes(io.wchar.unwrap_or(0)),
2344        syscr: MonotonicCount(io.syscr.unwrap_or(0)),
2345        syscw: MonotonicCount(io.syscw.unwrap_or(0)),
2346        read_bytes: Bytes(io.read_bytes.unwrap_or(0)),
2347        write_bytes: Bytes(io.write_bytes.unwrap_or(0)),
2348        cancelled_write_bytes: Bytes(io.cancelled_write_bytes.unwrap_or(0)),
        // Taskstats fields start at zero here; the caller
        // (`capture_with` / `capture_pid_with`) overwrites them
        // after the per-tid `TaskstatsClient::query_tid` call,
        // mirroring how `allocated_bytes` / `deallocated_bytes`
        // are zero-initialized above and then filled by the
        // jemalloc probe path. Zero defaults are correct for the
        // best-effort contract: a kernel without `CONFIG_TASKSTATS`
        // / `CONFIG_TASK_DELAY_ACCT`, a host with `delayacct=off`
        // at runtime, a process without `CAP_NET_ADMIN`, or a tid
        // that exited before `query_tid` succeeded all collapse
        // to zero per-field.
2360        cpu_delay_count: MonotonicCount(0),
2361        cpu_delay_total_ns: MonotonicNs(0),
2362        cpu_delay_max_ns: PeakNs(0),
2363        cpu_delay_min_ns: PeakNs(0),
2364        blkio_delay_count: MonotonicCount(0),
2365        blkio_delay_total_ns: MonotonicNs(0),
2366        blkio_delay_max_ns: PeakNs(0),
2367        blkio_delay_min_ns: PeakNs(0),
2368        swapin_delay_count: MonotonicCount(0),
2369        swapin_delay_total_ns: MonotonicNs(0),
2370        swapin_delay_max_ns: PeakNs(0),
2371        swapin_delay_min_ns: PeakNs(0),
2372        freepages_delay_count: MonotonicCount(0),
2373        freepages_delay_total_ns: MonotonicNs(0),
2374        freepages_delay_max_ns: PeakNs(0),
2375        freepages_delay_min_ns: PeakNs(0),
2376        thrashing_delay_count: MonotonicCount(0),
2377        thrashing_delay_total_ns: MonotonicNs(0),
2378        thrashing_delay_max_ns: PeakNs(0),
2379        thrashing_delay_min_ns: PeakNs(0),
2380        compact_delay_count: MonotonicCount(0),
2381        compact_delay_total_ns: MonotonicNs(0),
2382        compact_delay_max_ns: PeakNs(0),
2383        compact_delay_min_ns: PeakNs(0),
2384        wpcopy_delay_count: MonotonicCount(0),
2385        wpcopy_delay_total_ns: MonotonicNs(0),
2386        wpcopy_delay_max_ns: PeakNs(0),
2387        wpcopy_delay_min_ns: PeakNs(0),
2388        irq_delay_count: MonotonicCount(0),
2389        irq_delay_total_ns: MonotonicNs(0),
2390        irq_delay_max_ns: PeakNs(0),
2391        irq_delay_min_ns: PeakNs(0),
2392        hiwater_rss_bytes: crate::metric_types::PeakBytes(0),
2393        hiwater_vm_bytes: crate::metric_types::PeakBytes(0),
        // Not yet captured here. The delay family's zero defaults above
        // are overwritten, and `taskstats_measured` set true, by
        // `apply_delay_stats` on a successful taskstats query;
        // `jemalloc_measured` is set true where the probe assigns
        // `allocated_bytes` / `deallocated_bytes`. Both stay false when
        // the respective capture did not run.
2399        taskstats_measured: false,
2400        cpu_delay_active: false,
2401        delay_block_active: false,
2402        xacct_active: false,
2403        jemalloc_measured: false,
2404    }
2405}
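/*
Illustrative sketch, not compiled into this module: the
both-halves-or-zero rule applied to `voluntary_sleep_ns` above,
pulled out as a free function. The name `sketch_voluntary_sleep_ns`
is hypothetical and exists only for this example.

```rust
// Hypothetical helper: derive voluntary sleep from the two parsed
// schedstat sums, following the rule described in the comment on
// `voluntary_sleep_ns`.
fn sketch_voluntary_sleep_ns(sleep_sum: Option<u64>, block_sum: Option<u64>) -> u64 {
    match (sleep_sum, block_sum) {
        // Both halves parsed: subtract, saturating so that a
        // transient `block > sleep` read clamps to 0 instead of
        // wrapping around.
        (Some(sleep), Some(block)) => sleep.saturating_sub(block),
        // Either half missing: the value is uncomputable, report 0.
        _ => 0,
    }
}
```

Under these assumptions, `sketch_voluntary_sleep_ns(Some(10), Some(3))`
is 7, while a missing half or a transient `block > sleep` read both
give 0.
*/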
2406
2407#[cfg(test)]
2408fn capture_thread(tgid: i32, tid: i32, pcomm: &str) -> ThreadState {
2409    let proc_root = Path::new(DEFAULT_PROC_ROOT);
2410    let comm = read_thread_comm_at(proc_root, tgid, tid).unwrap_or_default();
2411    capture_thread_at(proc_root, tgid, tid, pcomm, &comm, true)
2412}
2413
2414/// Running tally for the per-snapshot jemalloc-probe summary line
2415/// emitted by [`capture_with`] and [`capture_pid_with`]. The
2416/// dominant `AttachError` tag and `ProbeError` tag are tracked so
2417/// the summary can surface a remediation hint when one error class
2418/// dominates (e.g. EPERM under YAMA).
2419#[derive(Debug, Default)]
2420struct ProbeSummary {
2421    tgids_walked: u64,
2422    jemalloc_detected: u64,
2423    probed_ok: u64,
2424    failed: u64,
2425    attach_tag_counts: BTreeMap<&'static str, u64>,
2426    probe_tag_counts: BTreeMap<&'static str, u64>,
2427}
2428
2429impl ProbeSummary {
    /// Pick the most frequent ACTIONABLE error tag (across attach
    /// and probe failures) for the summary line. Ties resolve
    /// deterministically to the alphabetically-EARLIER tag: the
    /// comparator's secondary key is `b.0.cmp(a.0)` (note the
    /// argument flip), which reverses the name ordering so that
    /// `max_by` selects the smallest name among tags sharing a
    /// count (e.g. `dwarf-parse-failure` beats `ptrace-seize`).
2437    ///
2438    /// `jemalloc-not-found` and `readlink-failure` are filtered out
2439    /// of the attach side: both are the expected outcome on the bulk
2440    /// of system processes (most tgids are not jemalloc-linked, and
2441    /// short-lived ones routinely fail readlink mid-walk), so
2442    /// surfacing them as the operator-facing "dominant failure tag"
2443    /// would drown the actionable signal (privilege drops, stripped
2444    /// debuginfo, arch mismatch) under known-benign noise on every
2445    /// snapshot. The filter is the same matches! arm
2446    /// `try_attach_probe_for_tgid_at` uses to route those two tags
2447    /// to debug-level tracing rather than warn-level — the
2448    /// dominant-tag summary mirrors the same actionable/non-actionable
2449    /// cut. Probe tags are not filtered: every `ProbeError` variant
2450    /// is actionable.
2451    fn dominant_tag(&self) -> Option<&'static str> {
2452        self.attach_tag_counts
2453            .iter()
2454            .filter(|(t, _)| !matches!(**t, "jemalloc-not-found" | "readlink-failure"))
2455            .chain(self.probe_tag_counts.iter())
2456            .max_by(|a, b| a.1.cmp(b.1).then_with(|| b.0.cmp(a.0)))
2457            .map(|(tag, _)| *tag)
2458    }
2459
    /// True when `ptrace-seize` and `ptrace-interrupt` failures
    /// together account for at least half of `failed`, signalling
    /// a privilege issue. Used to gate the EPERM remediation hint.
2463    fn ptrace_dominates(&self) -> bool {
2464        let total_ptrace: u64 = self
2465            .probe_tag_counts
2466            .iter()
2467            .filter(|(t, _)| matches!(**t, "ptrace-seize" | "ptrace-interrupt"))
2468            .map(|(_, n)| *n)
2469            .sum();
        // Fire when at least half of all failures are ptrace
        // privilege failures (ptrace-seize or ptrace-interrupt):
        // at that share the hint is likely the right remediation,
        // while a few EPERMs among many unrelated failures do not
        // trigger it.
2475        self.failed > 0 && total_ptrace * 2 >= self.failed
2476    }
2477
2478    /// Project the internal tally to the curated public surface.
2479    /// Drops the per-tag `attach_tag_counts` / `probe_tag_counts`
2480    /// maps (implementation detail) and surfaces only the
2481    /// counters + dominant tag string + privilege-dominant
2482    /// signal. Mirrors the actionable/non-actionable cut
2483    /// [`Self::dominant_tag`] uses, so `dominant_failure` is
2484    /// `None` exactly when the snapshot has zero actionable
2485    /// failures. `privilege_dominant` mirrors
2486    /// [`Self::ptrace_dominates`] so a downstream consumer can
2487    /// reproduce the EPERM-hint trigger condition without
2488    /// parsing the operator-facing tracing line.
2489    fn to_public(&self) -> CtprofProbeSummary {
2490        CtprofProbeSummary {
2491            tgids_walked: self.tgids_walked,
2492            jemalloc_detected: self.jemalloc_detected,
2493            probed_ok: self.probed_ok,
2494            failed: self.failed,
2495            dominant_failure: self.dominant_tag().map(|t| t.to_string()),
2496            privilege_dominant: self.ptrace_dominates(),
2497        }
2498    }
2499}
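/*
Illustrative sketch, not compiled into this module: the tie-break
comparator and the half-or-more gate used by `ProbeSummary`,
reduced to two free functions. `sketch_dominant` and
`sketch_half_or_more` are hypothetical names introduced only here.

```rust
use std::collections::BTreeMap;

// Hypothetical helper mirroring the comparator described above:
// the highest count wins, and on equal counts the reversed name
// comparison makes the alphabetically-earlier tag the maximum.
fn sketch_dominant(counts: &BTreeMap<&'static str, u64>) -> Option<&'static str> {
    counts
        .iter()
        .max_by(|a, b| a.1.cmp(b.1).then_with(|| b.0.cmp(a.0)))
        .map(|(tag, _)| *tag)
}

// Hypothetical helper for the dominance gate: false when there are
// no failures at all, true when `part` is at least half of `total`.
fn sketch_half_or_more(part: u64, total: u64) -> bool {
    total > 0 && part * 2 >= total
}
```

With `dwarf-parse-failure` and `ptrace-seize` both at a count of 2,
`sketch_dominant` picks `dwarf-parse-failure`; raising
`ptrace-seize` to 3 makes it the winner on count alone.
*/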
2500
/// Internal tally of procfs read-level failures, threaded through
/// [`capture_thread_at_with_tally`] and projected to the public
/// surface via [`Self::to_public`]. Mirrors the [`ProbeSummary`] /
/// [`CtprofProbeSummary`] split: tracks per-tid context plus a
/// per-file-kind failure map, then converts the
/// implementation-detail shape at projection time (here the
/// `&'static str` keys become the public surface's `String`
/// keys, which serde-derive cleanly).
2508///
2509/// `tids_walked` is incremented once per tid the capture pass
2510/// attempts, regardless of whether the tid lands in the snapshot —
2511/// the bump happens at the call site (before invoking
2512/// `capture_thread_at_with_tally`), so a ghost-filtered tid still
2513/// counts as walked. The per-tid `pending_failures` set lets the
2514/// caller unwind a ghost-filtered tid's read-failure contributions
2515/// before the summary is finalized — see [`Self::commit_pending`] /
2516/// [`Self::discard_pending`].
2517#[derive(Debug, Default)]
2518struct ParseTally {
2519    tids_walked: u64,
2520    failures_by_file: BTreeMap<&'static str, u64>,
2521    /// Per-tid pending bumps held until the caller commits or
2522    /// discards based on the ghost filter. Cleared between tids.
2523    pending_failures: Vec<&'static str>,
2524    /// Committed total of negative dotted-ns values seen across
2525    /// the snapshot. The kernel's PN_SCHEDSTAT path (`%Ld.%06ld`
2526    /// in `kernel/sched/debug.c`) emits a leading `-` when a
2527    /// schedstat field carries a negative integer part — rare but
2528    /// observable on clock-skew / suspend-resume hosts. The
2529    /// capture-side parser previously folded these into the
2530    /// absent-counter zero silently; this tally surfaces the
2531    /// rate so an operator can spot a host whose schedstat values
2532    /// are routinely negative-and-zeroed.
2533    negative_dotted_values: u64,
2534    /// Per-tid pending negative-dotted bumps held until
2535    /// commit / discard, mirroring [`Self::pending_failures`].
2536    pending_negative_dotted: u64,
2537}
2538
2539impl ParseTally {
2540    /// Record a per-file read failure for the current tid. Held
2541    /// pending until [`Self::commit_pending`] or
2542    /// [`Self::discard_pending`] resolves the tid's outcome.
2543    fn record_failure(&mut self, file_kind: &'static str) {
2544        self.pending_failures.push(file_kind);
2545    }
2546
2547    /// Record a negative dotted-ns value seen during sched parse
2548    /// for the current tid. Held pending until
2549    /// [`Self::commit_pending`] / [`Self::discard_pending`].
2550    fn record_negative_dotted(&mut self) {
2551        self.pending_negative_dotted = self.pending_negative_dotted.saturating_add(1);
2552    }
2553
    /// Commit the current tid's pending failures and pending
    /// negative-dotted bumps to the per-snapshot tally. Called
    /// when the tid lands in the snapshot.
2556    fn commit_pending(&mut self) {
2557        for kind in self.pending_failures.drain(..) {
2558            *self.failures_by_file.entry(kind).or_insert(0) += 1;
2559        }
2560        self.negative_dotted_values = self
2561            .negative_dotted_values
2562            .saturating_add(self.pending_negative_dotted);
2563        self.pending_negative_dotted = 0;
2564    }
2565
    /// Discard the current tid's pending failures and pending
    /// negative-dotted bumps. Called when the ghost filter rejects
    /// the tid: the bumps would correspond to a thread the snapshot
    /// doesn't include, so they must not inflate the summary.
2570    fn discard_pending(&mut self) {
2571        self.pending_failures.clear();
2572        self.pending_negative_dotted = 0;
2573    }
2574
2575    /// Total failures across every file kind. Read-side mirror of
2576    /// the public surface's `read_failures` field.
2577    fn total_failures(&self) -> u64 {
2578        self.failures_by_file.values().sum()
2579    }
2580
    /// Pick the file kind with the most failures. Ties resolve
    /// deterministically to the alphabetically-EARLIER kind: the
    /// reversed secondary key mirrors
    /// [`ProbeSummary::dominant_tag`]'s comparator.
2585    fn dominant_file(&self) -> Option<&'static str> {
2586        self.failures_by_file
2587            .iter()
2588            .max_by(|a, b| a.1.cmp(b.1).then_with(|| b.0.cmp(a.0)))
2589            .map(|(tag, _)| *tag)
2590    }
2591
2592    /// True when ≥ 50% of failures are in `schedstat` or `io` —
2593    /// the two procfs files gated by `CONFIG_SCHED_INFO` /
2594    /// `CONFIG_TASK_IO_ACCOUNTING`. Mirrors
2595    /// [`ProbeSummary::ptrace_dominates`]'s shape: dominance gate
2596    /// at half-or-more, false when total is zero.
2597    fn kernel_config_dominates(&self) -> bool {
2598        let total = self.total_failures();
2599        if total == 0 {
2600            return false;
2601        }
2602        let kconfig: u64 = self
2603            .failures_by_file
2604            .iter()
2605            .filter(|(t, _)| matches!(**t, "schedstat" | "io"))
2606            .map(|(_, n)| *n)
2607            .sum();
2608        kconfig * 2 >= total
2609    }
2610
2611    /// Project the internal tally to the curated public surface.
2612    fn to_public(&self) -> CtprofParseSummary {
2613        let read_failures = self.total_failures();
2614        let mut by_file = BTreeMap::new();
2615        for (k, v) in &self.failures_by_file {
2616            by_file.insert((*k).to_string(), *v);
2617        }
2618        CtprofParseSummary {
2619            tids_walked: self.tids_walked,
2620            read_failures,
2621            read_failures_by_file: by_file,
2622            dominant_read_failure: self.dominant_file().map(|t| t.to_string()),
2623            kernel_config_dominant: self.kernel_config_dominates(),
2624            negative_dotted_values: self.negative_dotted_values,
2625        }
2626    }
2627}
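/*
Illustrative sketch, not compiled into this module: the
record / commit / discard discipline of `ParseTally`, cut down to
the failure map alone. `SketchTally` is a hypothetical stand-in and
omits `tids_walked` and the negative-dotted counters.

```rust
use std::collections::BTreeMap;

// Hypothetical miniature of the tally: failures recorded for the
// current tid stay pending until the caller decides its fate.
#[derive(Debug, Default)]
struct SketchTally {
    failures_by_file: BTreeMap<&'static str, u64>,
    pending: Vec<&'static str>,
}

impl SketchTally {
    // Queue a failure for the tid currently being captured.
    fn record_failure(&mut self, file_kind: &'static str) {
        self.pending.push(file_kind);
    }

    // The tid landed in the snapshot: fold its pending bumps in.
    fn commit_pending(&mut self) {
        for kind in self.pending.drain(..) {
            *self.failures_by_file.entry(kind).or_insert(0) += 1;
        }
    }

    // The tid was filtered out: drop its bumps without counting.
    fn discard_pending(&mut self) {
        self.pending.clear();
    }
}
```

A caller would record failures while reading one tid, then call
exactly one of `commit_pending` or `discard_pending` before moving
to the next tid, so the pending list never leaks across tids.
*/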
2628
2629/// Stable EPERM remediation hint for the capture summary. References
2630/// `$(which ktstr)` rather than a hardcoded path so the suggestion
2631/// works regardless of where the binary is installed.
2632const PTRACE_EPERM_HINT: &str = "hint: re-run as root, or sudo setcap cap_sys_ptrace+eip $(which ktstr), or set kernel.yama.ptrace_scope=0";
2633
2634/// Result of the stateless attach pass for a single tgid:
2635/// the procfs-derived `pcomm` (for tracing) plus the underlying
2636/// `attach_jemalloc_at` outcome. Carries no shared state, so it
2637/// can be assembled by rayon workers in parallel without locking.
2638struct AttachOutcome {
2639    pcomm: String,
2640    result: std::result::Result<
2641        crate::host_thread_probe::JemallocProbe,
2642        crate::host_thread_probe::AttachError,
2643    >,
2644}
2645
2646/// Cache value for the per-`(dev, ino)` probe cache in
2647/// [`capture_with`]'s parallel probe phase. Captures BOTH the
2648/// `JemallocProbe` (for the success path) and the
2649/// `AttachError::tag()` string (for the failure path) so a
2650/// cache hit can re-apply the same `attach_tag_counts` /
2651/// `failed` bumps that the original miss applied via
2652/// [`record_attach_outcome`]. Without `failed_tag`, repeat
2653/// hits on a failed binary would credit only `tgids_walked` —
2654/// the actionable failure (and its dominant-tag accounting)
2655/// would be silently undercounted relative to actual attach
2656/// volume.
2657#[derive(Clone)]
2658struct CachedAttachResult {
2659    probe: Option<crate::host_thread_probe::JemallocProbe>,
2660    /// `None` for the success path (`probe.is_some()`); `Some`
2661    /// for every failure path, even non-actionable tags
2662    /// (`jemalloc-not-found`, `readlink-failure`) — the
2663    /// dominant-tag filter in [`ProbeSummary::dominant_tag`]
2664    /// excludes those, but `attach_tag_counts` itself records
2665    /// every tag for diagnostic completeness.
2666    failed_tag: Option<&'static str>,
2667}
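/*
Illustrative sketch, not compiled into this module: one way a cache
hit might replay the failure bumps carried by `failed_tag`, ASSUMING
the replay mirrors the counter updates in `record_attach_outcome`.
`SketchSummary` and `sketch_replay_cache_hit` are hypothetical, and
the success-path counters are deliberately left out.

```rust
use std::collections::BTreeMap;

// Hypothetical subset of the summary counters touched on replay.
#[derive(Debug, Default)]
struct SketchSummary {
    tgids_walked: u64,
    failed: u64,
    attach_tag_counts: BTreeMap<&'static str, u64>,
}

// Assumed replay rule: every hit counts as a walked tgid; a cached
// failure tag is re-counted, and only actionable tags bump `failed`.
fn sketch_replay_cache_hit(failed_tag: Option<&'static str>, summary: &mut SketchSummary) {
    summary.tgids_walked += 1;
    if let Some(tag) = failed_tag {
        *summary.attach_tag_counts.entry(tag).or_insert(0) += 1;
        if !matches!(tag, "jemalloc-not-found" | "readlink-failure") {
            summary.failed += 1;
        }
    }
}
```

Without the cached tag, the second and later tgids sharing a failed
binary would only advance `tgids_walked`, which is the undercount
the doc comment above warns about.
*/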
2668
2669/// Stateless half of the per-tgid attach: read `pcomm` and run
2670/// `attach_jemalloc_at` (the expensive ELF parse + DWARF walk).
2671/// No summary mutation — the result is paired with `pcomm` and
2672/// returned to the caller for application via
2673/// [`record_attach_outcome`]. Splitting attach from the summary
2674/// update lets the parallel probe phase in [`capture_with`] hold
2675/// the `summary_mutex` only for the cheap counter+tracing step,
2676/// rather than serialising every rayon worker on the slowest
2677/// call in the pipeline.
2678fn attach_probe_for_tgid_at(proc_root: &Path, tgid: i32) -> AttachOutcome {
2679    #[cfg(test)]
2680    {
2681        // Panic-injection seam: a test sets `PANIC_INJECT_TGID` to
2682        // a sentinel tgid value before calling `capture_with`. When
2683        // the rayon worker for that tgid enters this function, we
2684        // panic to model the failure mode where the ELF parse / DWARF
2685        // walk panics under fd exhaustion or OOM. The
2686        // `catch_unwind` wrapper in `capture_with`'s phase 1 must
2687        // absorb this and surface it through the summary as a
2688        // `worker-panic` attach tag without crashing the snapshot.
2689        let injected = PANIC_INJECT_TGID.load(std::sync::atomic::Ordering::Acquire);
2690        if injected != 0 && injected == tgid {
2691            // Non-string payload variant: when the bool seam is
2692            // armed, panic with a typed payload so the
2693            // `downcast_ref::<&str>` and `downcast_ref::<String>`
2694            // arms in `capture_with` both miss and the
2695            // `unwrap_or("<non-string panic payload>")` fallback
2696            // arm fires. Pinned by
2697            // `capture_with_rayon_worker_panic_non_string_payload_falls_back`.
2698            if PANIC_INJECT_NON_STRING.load(std::sync::atomic::Ordering::Acquire) {
2699                // u64 is `'static + Send`, so it satisfies the
2700                // `Box<dyn Any + Send>` payload bound but neither
2701                // downcasts to `&str` nor `String` — exactly the
2702                // shape the fallback arm guards against.
2703                std::panic::panic_any(0xDEADBEEFu64);
2704            }
2705            panic!("test: injected attach worker panic for tgid {tgid}");
2706        }
2707    }
2708    let pcomm = read_process_comm_at(proc_root, tgid).unwrap_or_default();
2709    let result = crate::host_thread_probe::attach_jemalloc_at(proc_root, tgid);
2710    AttachOutcome { pcomm, result }
2711}
2712
2713/// Test-only seam for the panic-injection harness consumed by
2714/// [`attach_probe_for_tgid_at`]. Set to a non-zero tgid to make
2715/// the next attach call for that tgid panic; reset to 0 to
2716/// disable. The check fires on the rayon worker thread, so the
2717/// `catch_unwind` wrapper in [`capture_with`] is the only thing
2718/// that prevents the panic from propagating out of `pool.install`.
2719/// `cfg(test)` only — production builds carry no injection
2720/// surface.
2721#[cfg(test)]
2722static PANIC_INJECT_TGID: std::sync::atomic::AtomicI32 = std::sync::atomic::AtomicI32::new(0);
2723
2724/// Test-only companion seam for [`PANIC_INJECT_TGID`]: when
2725/// armed (`true`) before calling `capture_with`, the injected
2726/// panic uses a typed non-string payload (`std::panic::panic_any`
2727/// over a `u64`) instead of the default formatted-message
2728/// `panic!`. Lets a test exercise the
2729/// `unwrap_or("<non-string panic payload>")` fallback arm in
2730/// `capture_with`'s panic-handling block — the `downcast_ref`
2731/// chain misses both `&str` and `String` for non-string
2732/// payloads and must fall back rather than panicking on
2733/// `unwrap()`.
2734#[cfg(test)]
2735static PANIC_INJECT_NON_STRING: std::sync::atomic::AtomicBool =
2736    std::sync::atomic::AtomicBool::new(false);
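/*
Illustrative sketch, not compiled into this module: a `downcast_ref`
chain with a string fallback, of the kind the two seams above are
meant to exercise. `sketch_panic_message` is a hypothetical name;
the actual handling block in `capture_with` is not shown here and
may be shaped differently.

```rust
use std::any::Any;

// Hypothetical helper: recover a printable message from a panic
// payload, falling back to a fixed string for typed payloads.
fn sketch_panic_message(payload: &(dyn Any + Send)) -> &str {
    if let Some(s) = payload.downcast_ref::<&str>() {
        // `panic!("literal")` carries a `&'static str`.
        *s
    } else if let Some(s) = payload.downcast_ref::<String>() {
        // `panic!("{x}")` with formatting carries a `String`.
        s.as_str()
    } else {
        // `panic_any(0xDEADBEEFu64)` and similar land here.
        "<non-string panic payload>"
    }
}
```

The payload handed back by `std::panic::catch_unwind` is a
`Box<dyn Any + Send>`, so a caller would pass `&*payload`.
*/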
2737
2738/// Stateful half of the per-tgid attach: apply `outcome` to
2739/// `summary` and emit one tracing event. Two attach-error tags
2740/// log at `debug` rather than `warn`: `jemalloc-not-found` (the
2741/// bulk of system processes are not jemalloc-linked, so this is
2742/// the dominant non-actionable outcome on a busy host) and
2743/// `readlink-failure` (a tgid that exited between the procfs
2744/// walk and `readlink(/proc/<pid>/exe)` is also routine — race-
2745/// with-exit on short-lived helpers). Every other variant logs
2746/// at `warn` because a jemalloc-linked target failing to attach
2747/// is actionable (privilege drop, stripped binary, …). The
2748/// matches! arm here is the same one [`ProbeSummary::dominant_tag`]
2749/// uses to filter the operator-facing summary, so the level
2750/// routing and the dominance ranking surface the same
2751/// actionable/non-actionable cut. No I/O — safe to call under a
2752/// short-held mutex from the parallel probe phase.
2753fn record_attach_outcome(
2754    tgid: i32,
2755    outcome: AttachOutcome,
2756    summary: &mut ProbeSummary,
2757) -> CachedAttachResult {
2758    summary.tgids_walked += 1;
2759    let AttachOutcome { pcomm, result } = outcome;
2760    match result {
2761        Ok(probe) => {
2762            summary.jemalloc_detected += 1;
2763            tracing::debug!(tgid, %pcomm, "ctprof probe: jemalloc detected");
2764            CachedAttachResult {
2765                probe: Some(probe),
2766                failed_tag: None,
2767            }
2768        }
2769        Err(err) => {
2770            let tag = err.tag();
2771            *summary.attach_tag_counts.entry(tag).or_insert(0) += 1;
2772            if matches!(tag, "jemalloc-not-found" | "readlink-failure") {
2773                tracing::debug!(tgid, %pcomm, tag, err = %err, "ctprof probe: attach skipped");
2774            } else {
2775                summary.failed += 1;
2776                tracing::warn!(tgid, %pcomm, tag, err = %err, "ctprof probe: attach failed");
2777            }
2778            CachedAttachResult {
2779                probe: None,
2780                failed_tag: Some(tag),
2781            }
2782        }
2783    }
2784}
2785
2786/// Single-call wrapper around [`attach_probe_for_tgid_at`] +
2787/// [`record_attach_outcome`] for sequential callers (tests + the
2788/// per-pid `capture_pid_with` path) that don't need the
2789/// stateless/stateful split. The parallel probe phase in
2790/// [`capture_with`] calls the two halves separately so the
2791/// expensive attach runs outside the summary mutex.
2792fn try_attach_probe_for_tgid_at(
2793    proc_root: &Path,
2794    tgid: i32,
2795    summary: &mut ProbeSummary,
2796) -> Option<crate::host_thread_probe::JemallocProbe> {
2797    let outcome = attach_probe_for_tgid_at(proc_root, tgid);
2798    record_attach_outcome(tgid, outcome, summary).probe
2799}
2800
2801/// Pull `(allocated_bytes, deallocated_bytes)` for one tid via the
2802/// pre-attached probe, recording the outcome in `summary` and
2803/// emitting a `tracing::warn!` once per failed tgid (the engine
2804/// shares the same `AttachError`/`ProbeError` taxonomy across every
2805/// tid of a tgid, so logging each tid would spam the operator).
2806///
2807/// Returns `Some((allocated, deallocated))` on a successful per-thread
2808/// read, `None` on a read failure. The caller uses `None` to leave
2809/// `jemalloc_measured` false — a failed read is not a measurement, so
2810/// the absent-as-0 default must fold to `Aggregated::Absent`, not a
2811/// sentinel `Sum(0)`. Mirrors the taskstats path's per-thread Ok-gating.
2812fn probe_thread_recording(
2813    probe: &crate::host_thread_probe::JemallocProbe,
2814    tid: i32,
2815    tgid: i32,
2816    pcomm: &str,
2817    comm: &str,
2818    summary: &mut ProbeSummary,
2819    failed_tgids_logged: &mut std::collections::BTreeSet<i32>,
2820) -> Option<(u64, u64)> {
2821    match crate::host_thread_probe::probe_thread(probe, tid) {
2822        Ok(c) => {
2823            summary.probed_ok += 1;
2824            Some((c.allocated_bytes, c.deallocated_bytes))
2825        }
2826        Err(err) => {
2827            let tag = err.tag();
2828            *summary.probe_tag_counts.entry(tag).or_insert(0) += 1;
2829            summary.failed += 1;
2830            if failed_tgids_logged.insert(tgid) {
2831                tracing::warn!(
2832                    tgid,
2833                    tid,
2834                    %pcomm,
2835                    %comm,
2836                    tag,
2837                    err = %err,
2838                    "ctprof probe: probe_thread failed",
2839                );
2840            }
2841            None
2842        }
2843    }
2844}
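/*
Illustrative sketch, not compiled into this module: the
once-per-tgid log gate used above relies on `BTreeSet::insert`
returning `true` only when the value was not already present.
`sketch_should_log` is a hypothetical name for this example.

```rust
use std::collections::BTreeSet;

// Hypothetical helper: true the first time a tgid is seen, false
// on every later call with the same tgid.
fn sketch_should_log(tgid: i32, logged: &mut BTreeSet<i32>) -> bool {
    logged.insert(tgid)
}
```

Every failing tid still bumps the counters; only the warning is
limited to the first failing tid of each tgid.
*/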
2845
2846/// Emit the once-per-snapshot parse-summary line. Mirrors the
2847/// [`emit_probe_summary`] discipline: one info-level line with the
2848/// per-snapshot tally counts. Includes the dominant failure file
2849/// kind when any read failures landed, the kernel-config
2850/// remediation hint when `schedstat` / `io` dominate, and the
2851/// negative-dotted-value count when the parser saw any
2852/// schedstat fields with a leading `-`. The clauses are
2853/// suppressed when their underlying signal is zero so a clean
2854/// host emits a single short line.
2855fn emit_parse_summary(tally: &ParseTally) {
2856    let tids_walked = tally.tids_walked;
2857    let read_failures = tally.total_failures();
2858    let negative_dotted = tally.negative_dotted_values;
2859    let dominant_clause = tally
2860        .dominant_file()
2861        .map(|tag| format!(" (dominant: {tag})"))
2862        .unwrap_or_default();
2863    let kconfig_clause = if tally.kernel_config_dominates() {
2864        format!("; {PARSE_KCONFIG_HINT}")
2865    } else {
2866        String::new()
2867    };
2868    let negative_clause = if negative_dotted > 0 {
2869        format!(", {negative_dotted} negative-dotted values")
2870    } else {
2871        String::new()
2872    };
2873    tracing::info!(
2874        "ctprof parse: {tids_walked} tids walked, \
2875         {read_failures} read failures{negative_clause}\
2876         {dominant_clause}{kconfig_clause}",
2877    );
2878}
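/*
Illustrative sketch, not compiled into this module: the clause
suppression in `emit_parse_summary`, rewritten as a function that
returns the line instead of logging it. The function name and its
parameters are hypothetical; the clause texts follow the format
string above.

```rust
// Hypothetical helper: assemble the parse-summary line, leaving out
// each optional clause when its underlying signal is absent.
fn sketch_parse_summary_line(
    tids_walked: u64,
    read_failures: u64,
    negative_dotted: u64,
    dominant: Option<&str>,
    kconfig_hint: Option<&str>,
) -> String {
    let negative_clause = if negative_dotted > 0 {
        format!(", {negative_dotted} negative-dotted values")
    } else {
        String::new()
    };
    let dominant_clause = dominant
        .map(|tag| format!(" (dominant: {tag})"))
        .unwrap_or_default();
    let kconfig_clause = kconfig_hint
        .map(|hint| format!("; {hint}"))
        .unwrap_or_default();
    format!(
        "ctprof parse: {tids_walked} tids walked, \
         {read_failures} read failures{negative_clause}\
         {dominant_clause}{kconfig_clause}"
    )
}
```

On a clean host every clause is empty, so
`sketch_parse_summary_line(5, 0, 0, None, None)` yields
`ctprof parse: 5 tids walked, 0 read failures`.
*/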
2879
/// Emit the once-per-snapshot probe-summary line. Includes the
/// dominant failure tag when any failures landed and an EPERM
/// remediation hint when ptrace privilege failures dominate.
fn emit_probe_summary(summary: &ProbeSummary) {
    let tgids_walked = summary.tgids_walked;
    let jemalloc_detected = summary.jemalloc_detected;
    let probed_ok = summary.probed_ok;
    let failed = summary.failed;
    // Build the optional trailing clause once so the line itself
    // is formatted in a single place.
    let failure_clause = if failed == 0 {
        String::new()
    } else {
        let dominant = summary.dominant_tag().unwrap_or("?");
        if summary.ptrace_dominates() {
            format!(" (dominant: {dominant}; {PTRACE_EPERM_HINT})")
        } else {
            format!(" (dominant: {dominant})")
        }
    };
    tracing::info!(
        "ctprof probe: {tgids_walked} tgids walked, \
         {jemalloc_detected} jemalloc detected, \
         {probed_ok} probed OK, {failed} failed{failure_clause}",
    );
}
2914
2915/// Capture a complete host-wide snapshot under arbitrary procfs
2916/// and cgroup roots. Walks `<proc_root>` for every live tgid,
2917/// enumerates its threads, and assembles a [`CtprofSnapshot`]
2918/// with per-cgroup enrichment populated once per distinct cgroup
2919/// path (many threads share a cgroup; keep the walk
2920/// O(cgroups) rather than O(threads)). The default-roots
2921/// production entry point is [`capture`]; tests pass a tempdir
2922/// to exercise the walk against a synthetic tree.
2923///
/// `use_syscall_affinity` gates four real-host touchpoints:
///
/// - (a) the [`crate::host_context::collect_host_context`] sweep
///   (kernel/CPU/memory/tunables read from the live host);
/// - (b) phase 1, the parallel jemalloc-probe attach pass that
///   walks every tgid's `/proc/<pid>/exe` for ELF + DWARF
///   metadata;
/// - (c) `sched_getaffinity(2)` inside per-thread capture, with
///   fall-back to `Cpus_allowed_list:` on syscall failure;
/// - (d) `emit_probe_summary` plus the [`CtprofProbeSummary`]
///   surfaced on the snapshot: when `use_syscall_affinity` is
///   `false`, `emit_probe_summary` is not called and
///   `probe_summary` is `None`.
///
/// Synthetic-tree tests pass `false` so the staged procfs is read
/// in isolation (no `sched_getaffinity`, no ELF parses, no `host`
/// block, no `probe_summary`); production passes `true`.
2938///
/// Self-skip: the caller's own tgid is excluded from the per-tgid
/// probe-attach loop because `PTRACE_SEIZE` rejects self-attach
/// (the rayon `.filter(|&tgid| tgid != self_pid)` drops self
/// before the attach call). Phase 2 still iterates the full tgid
/// list including `self_pid`, and the per-tid lookup
/// `probe_map.get(&tgid).and_then(|p| p.as_ref())` returns `None`
/// for `self_pid` because phase 1 never inserted an entry; the
/// closure short-circuits via `.map(...).unwrap_or((0, 0))`,
/// leaving the jemalloc fields at the absent-counter default.
/// Every other procfs-derived field populates normally:
/// `capture_thread_at` runs unconditionally per tid regardless
/// of probe outcome.
fn capture_with(
    proc_root: &Path,
    cgroup_root: &Path,
    sys_root: &Path,
    use_syscall_affinity: bool,
) -> CtprofSnapshot {
    let captured_at_unix_ns = std::time::SystemTime::now()
        .duration_since(std::time::UNIX_EPOCH)
        .map(|d| d.as_nanos() as u64)
        .unwrap_or(0);
    let host = if use_syscall_affinity {
        Some(crate::host_context::collect_host_context())
    } else {
        None
    };
    // Linux pid_max is bounded above by 2^22 (PID_MAX_LIMIT,
    // defined in include/linux/threads.h; kernel/pid.c clamps
    // pid_max to it via pid_max_max) on every supported
    // architecture, well inside i32::MAX, so the u32 → i32 cast
    // cannot wrap.
    let self_pid = std::process::id() as i32;
    let mut threads: Vec<ThreadState> = Vec::new();
    let mut failed_tgids_logged: std::collections::BTreeSet<i32> =
        std::collections::BTreeSet::new();

    // Phase 1: resolve probes in parallel via rayon. The expensive
    // ELF parse + DWARF walk runs concurrently across tgids, with
    // an inode cache (Mutex-wrapped) so duplicate binaries are
    // resolved only once. The result is a map of tgid → probe.
    //
    // Cache key shape: `(st_dev, st_ino)` of `/proc/<tgid>/exe`'s
    // metadata. Two tgids whose exes resolve to the same
    // `(dev, ino)` share a cache entry.
    //
    // Overlay-fs / container collision note: the kernel exposes
    // overlayfs files with the OVERLAY MOUNT's superblock device
    // (`dentry->d_sb->s_dev`, the overlayfs `struct super_block`
    // anonymous device number) and a synthetic `st_ino`. The
    // mapping happens in `fs/overlayfs/inode.c::ovl_map_dev_ino`
    // — both fields come from the overlay superblock's view, NOT
    // the underlying upper or lower layer. Two unrelated mounts
    // produce DIFFERENT `s_dev` values (each gets its own
    // anonymous bdev); two containers sharing a single mount of
    // the same lower-layer image-store path see the same `s_dev`
    // and the same hashed `st_ino` for that file. In the
    // shared-mount case a cached jemalloc attach result is
    // reused across containers — BENIGN, because the cached value
    // records "is this binary jemalloc-linked, and at what TSD
    // offset", which is a property of the ELF bytes and identical
    // across container instances of the same image. Mutable-
    // overlay writes (an upper-layer write that copies-up the
    // lower ELF) produce a NEW `(s_dev, st_ino)` pair within the
    // SAME overlay mount — `ovl_map_dev_ino` rehashes against
    // the new upper-layer inode — so the cache misses correctly
    // and re-resolves the rewritten binary.
    let tgids = iter_tgids_at(proc_root);
    let probe_cache: std::sync::Mutex<std::collections::HashMap<(u64, u64), CachedAttachResult>> =
        std::sync::Mutex::new(std::collections::HashMap::new());
    let summary_mutex = std::sync::Mutex::new(ProbeSummary::default());

    let probe_map: std::collections::HashMap<i32, Option<crate::host_thread_probe::JemallocProbe>> =
        if use_syscall_affinity {
            use rayon::prelude::*;
            // Scale parallelism by available CPU headroom: read
            // `<proc_root>/loadavg`, subtract from online CPU count,
            // clamp to [1, num_cpus/2 + 1]. Avoids drowning a hot
            // host. Routing the read through `proc_root` (rather
            // than `/proc` directly) keeps the parameterised-root
            // contract intact so synthetic-tree tests can stage
            // their own loadavg shape.
            let max_threads = {
                let num_cpus = std::thread::available_parallelism()
                    .map(|n| n.get())
                    .unwrap_or(4);
                let load = std::fs::read_to_string(proc_root.join("loadavg"))
                    .ok()
                    .and_then(|s| s.split_whitespace().next()?.parse::<f64>().ok())
                    .unwrap_or(0.0);
                let headroom = (num_cpus as f64 - load).max(1.0) as usize;
                headroom.clamp(1, num_cpus / 2 + 1)
            };
            // ThreadPoolBuilder::build can fail when the OS rejects
            // the per-thread `pthread_create` (RLIMIT_NPROC, kernel
            // task table at PID_MAX). Fall back to the global rayon
            // pool on Err — capture still completes, only loses the
            // bounded-headroom guarantee.
            let pool_result = rayon::ThreadPoolBuilder::new()
                .num_threads(max_threads)
                .build();
            let work = || {
                tgids
                    .par_iter()
                    .copied()
                    .filter(|&tgid| tgid != self_pid)
                    .map(|tgid| {
                        // Catch panics from the per-tgid attach pipeline so
                        // a single rogue worker (fd exhaustion, OOM during
                        // DWARF parse, or any panic-on-bug under
                        // `attach_jemalloc_at`) cannot tear down
                        // `pool.install` and the surrounding capture call.
                        // Without this guard, `rayon::ThreadPool::install`
                        // re-throws worker panics into the calling thread,
                        // collapsing the entire snapshot into an unwind on
                        // a single tgid's failure. On panic we record a
                        // `worker-panic` attach tag against the summary
                        // (counted under `failed`, surfaced in
                        // `dominant_failure` when it dominates) and return
                        // `(tgid, None)` so phase 2 still walks the tgid's
                        // threads with the absent-counter default. The tag
                        // is treated as actionable — a panicking attach is
                        // a bug or resource-exhaustion signal, distinct
                        // from the benign `jemalloc-not-found` /
                        // `readlink-failure` outcomes the dominant-tag
                        // filter suppresses.
                        let result = std::panic::catch_unwind(std::panic::AssertUnwindSafe(|| {
                            let cache_key =
                                std::fs::metadata(proc_root.join(tgid.to_string()).join("exe"))
                                    .ok()
                                    .map(|m| {
                                        use std::os::unix::fs::MetadataExt;
                                        (m.dev(), m.ino())
                                    });

                            if let Some(key) = cache_key {
                                // Poison-tolerant `lock_unpoisoned()` on
                                // every shared-mutex lock so a prior worker
                                // panic that poisoned a lock cannot
                                // cascade-poison every subsequent worker.
                                // The `catch_unwind` `Err` arm below
                                // records the failure as a `worker-panic`
                                // attach-tag bump, and surviving workers
                                // should still make progress on the
                                // partially-mutated state rather than
                                // re-panicking out of `pool.install` and
                                // collapsing the snapshot.
                                let cached = probe_cache.lock_unpoisoned().get(&key).cloned();
                                if let Some(cached_result) = cached {
                                    let mut s = summary_mutex.lock_unpoisoned();
                                    s.tgids_walked += 1;
                                    match &cached_result.failed_tag {
                                        None => {
                                            // Success path — original miss
                                            // already credited
                                            // `jemalloc_detected`. Re-apply
                                            // here so cache hits stay
                                            // symmetric with cache misses;
                                            // without this, only the first
                                            // sharer of a `(dev, ino)`
                                            // would count toward
                                            // `jemalloc_detected` and
                                            // every subsequent reuse
                                            // would silently undercount.
                                            s.jemalloc_detected += 1;
                                            tracing::debug!(
                                                tgid,
                                                "ctprof probe: cache hit (jemalloc)"
                                            );
                                        }
                                        Some(tag) => {
                                            // Failure path — re-apply the
                                            // SAME bookkeeping
                                            // [`record_attach_outcome`]
                                            // applied on the original
                                            // miss: bump
                                            // `attach_tag_counts[tag]`
                                            // unconditionally, and
                                            // `failed` for actionable
                                            // tags only (matching the
                                            // dominant-tag filter in
                                            // [`ProbeSummary::dominant_tag`]).
                                            // Without this, repeat hits
                                            // on a failed binary would
                                            // credit only `tgids_walked`
                                            // and the dominant-failure
                                            // signal would degrade as
                                            // shared-inode reuse climbs.
                                            // Logging stays at debug level
                                            // — the original miss already
                                            // emitted the warn-level event
                                            // for actionable tags; spamming
                                            // a warn per cache hit would
                                            // drown the operator log.
                                            *s.attach_tag_counts.entry(tag).or_insert(0) += 1;
                                            if !matches!(
                                                *tag,
                                                "jemalloc-not-found" | "readlink-failure"
                                            ) {
                                                s.failed += 1;
                                            }
                                            tracing::debug!(
                                                tgid,
                                                tag,
                                                "ctprof probe: cache hit (prior failure)"
                                            );
                                        }
                                    }
                                    cached_result.probe
                                } else {
                                    // Stateless attach (the expensive ELF parse +
                                    // DWARF walk) runs OUTSIDE the summary mutex
                                    // so rayon workers parallelise it. The lock
                                    // is only held for the cheap counter +
                                    // tracing application via `record_attach_outcome`.
                                    //
                                    // Shared-inode cache misses can produce
                                    // duplicate parses when N workers enter
                                    // simultaneously — all run the attach before
                                    // any inserts. The cache fully amortises
                                    // subsequent lookups; the duplicate work is
                                    // bounded by the rayon pool size.
                                    let outcome = attach_probe_for_tgid_at(proc_root, tgid);
                                    let mut s = summary_mutex.lock_unpoisoned();
                                    let res = record_attach_outcome(tgid, outcome, &mut s);
                                    drop(s);
                                    let probe = res.probe.clone();
                                    probe_cache.lock_unpoisoned().insert(key, res);
                                    probe
                                }
                            } else {
                                // No cache key — exe symlink unreadable. Same
                                // attach-outside-lock pattern as the cache-miss
                                // branch above; result is not cached because
                                // there's no key to file it under.
                                let outcome = attach_probe_for_tgid_at(proc_root, tgid);
                                let mut s = summary_mutex.lock_unpoisoned();
                                record_attach_outcome(tgid, outcome, &mut s).probe
                            }
                        }));
                        let probe = match result {
                            Ok(p) => p,
                            Err(panic_payload) => {
                                // Recover the panic message string for
                                // the operator log. The payload is a
                                // `Box<dyn Any + Send>` whose runtime
                                // type is `&'static str` for `panic!("…")`
                                // with a literal and `String` for
                                // `panic!("{…}", …)` with formatted args.
                                // Both of `attach_jemalloc_at`'s likely
                                // panic sites (and the test seam in
                                // `attach_probe_for_tgid_at`) panic with
                                // a formatted message → `String`. Other
                                // panic types (typed values, custom
                                // payloads) collapse to a placeholder so
                                // the log line still surfaces the tgid.
                                let panic_msg = panic_payload
                                    .downcast_ref::<&str>()
                                    .copied()
                                    .or_else(|| {
                                        panic_payload.downcast_ref::<String>().map(|s| s.as_str())
                                    })
                                    .unwrap_or("<non-string panic payload>");
                                // Bump counters to mirror what
                                // `record_attach_outcome` would have done
                                // for an attach error: tgids_walked++,
                                // worker-panic tag++, failed++. The lock
                                // may be poisoned if the inner panic
                                // happened mid-update of the summary, so
                                // recover via
                                // [`crate::sync::MutexExt::lock_unpoisoned`]
                                // rather than `.unwrap()` — bumping a
                                // counter on partially-mutated state is
                                // strictly less bad than re-panicking out
                                // of the worker and tearing down
                                // `pool.install`.
                                let mut s = summary_mutex.lock_unpoisoned();
                                s.tgids_walked += 1;
                                *s.attach_tag_counts.entry("worker-panic").or_insert(0) += 1;
                                s.failed += 1;
                                tracing::error!(
                                    tgid,
                                    panic_msg,
                                    "ctprof probe: attach worker panicked; tgid skipped",
                                );
                                None
                            }
                        };
                        (tgid, probe)
                    })
                    .collect()
            };
            match pool_result {
                Ok(pool) => pool.install(work),
                Err(e) => {
                    tracing::warn!(
                        error = %e,
                        max_threads,
                        "rayon ThreadPoolBuilder failed; falling back to global pool"
                    );
                    work()
                }
            }
        } else {
            std::collections::HashMap::new()
        };

    // `mut` is required because phase 2 below threads `&mut
    // summary` into `probe_thread_recording`.
    let mut summary = summary_mutex.into_inner_unpoisoned();
    // Tally for procfs read-level failures, surfaced as
    // `parse_summary` when the production path runs. Tests that
    // pass `use_syscall_affinity=false` skip the assignment so
    // the public field stays `None` — same discipline as
    // `probe_summary`.
    let mut parse_tally = ParseTally::default();
    let mut tally_opt: Option<&mut ParseTally> = if use_syscall_affinity {
        Some(&mut parse_tally)
    } else {
        None
    };

    // Open a single taskstats genetlink socket for the snapshot
    // (`taskstats_client`, below).
    // Best-effort: a kernel without `CONFIG_TASKSTATS`, a process
    // without `CAP_NET_ADMIN`, or any other open failure collapses
    // to `None` and every per-tid `query_tid` call short-circuits
    // through the absent-default zeros installed in
    // `capture_thread_at_with_tally`. Synthetic-tree tests pass
    // `use_syscall_affinity=false`, so the socket is never opened
    // — same discipline as the host-context / probe pass.
    //
    // Per-sub-family enablement, probed once per snapshot from the host
    // (`/proc/sys/kernel/task_delayacct` + `/proc/config.gz`); baked into each
    // captured thread by `apply_delay_stats`, AND-ed with the per-thread query-Ok
    // in the group measured predicate. Only consumed on the live capture path.
    // Read independently of `collect_host_context`'s host-field probe (two small
    // /proc reads per infrequent snapshot) so the gating does not depend on
    // host-context collection running on this path.
    let taskstats_active = crate::host_context::probe_taskstats_active();
    let taskstats_client = if use_syscall_affinity {
        match crate::taskstats::TaskstatsClient::open() {
            Ok(c) => Some(c),
            Err(e) => {
                tracing::warn!(
                    error = %e,
                    "ctprof taskstats: open failed; delay-accounting and memory-watermark \
                     fields will be zero. Ensure the kernel was built with CONFIG_TASKSTATS \
                     (plus CONFIG_TASK_DELAY_ACCT for delay fields and CONFIG_TASK_XACCT for \
                     hiwater fields), the process holds CAP_NET_ADMIN, and the kernel was \
                     booted with `delayacct=on` (or sysctl `kernel.task_delayacct=1`)"
                );
                None
            }
        }
    } else {
        None
    };
    // Per-snapshot tally of `query_tid` outcomes. Allocated only
    // when the production-mode capture path runs (`use_syscall_affinity`
    // is true) — synthetic-tree tests skip it the same way they
    // skip `parse_summary` and `probe_summary`. The tally is still
    // allocated when the socket open failed (`taskstats_client` is
    // `None`): the per-tid loop never reaches `record_result` in
    // that case, so every counter stays zero and the operator sees
    // an all-zero tally pointing at the open-time tracing warning.
    let mut taskstats_tally: Option<crate::taskstats::TaskstatsSummary> = if use_syscall_affinity {
        Some(crate::taskstats::TaskstatsSummary::default())
    } else {
        None
    };

    // Phase 2: sequential per-tid walk + ptrace reads.
    for tgid in &tgids {
        let tgid = *tgid;
        let pcomm = read_process_comm_at(proc_root, tgid).unwrap_or_default();
        let probe: Option<&crate::host_thread_probe::JemallocProbe> = probe_map
            .get(&tgid)
            .and_then(|p: &Option<crate::host_thread_probe::JemallocProbe>| p.as_ref());
        for tid in iter_task_ids_at(proc_root, tgid) {
            if let Some(t) = tally_opt.as_mut() {
                t.tids_walked += 1;
            }
            let comm = read_thread_comm_at(proc_root, tgid, tid).unwrap_or_default();
            let probe_read = probe.and_then(|p| {
                probe_thread_recording(
                    p,
                    tid,
                    tgid,
                    &pcomm,
                    &comm,
                    &mut summary,
                    &mut failed_tgids_logged,
                )
            });
            let (allocated_bytes, deallocated_bytes) = probe_read.unwrap_or((0, 0));
            let mut t = capture_thread_at_with_tally(
                proc_root,
                tgid,
                tid,
                &pcomm,
                &comm,
                use_syscall_affinity,
                &mut tally_opt,
            );
            t.allocated_bytes = crate::metric_types::Bytes(allocated_bytes);
            t.deallocated_bytes = crate::metric_types::Bytes(deallocated_bytes);
            // jemalloc is MEASURED iff the per-thread probe READ succeeded
            // (`probe_read.is_some()`): a non-jemalloc tgid has `probe == None`,
            // and an attached tgid whose per-thread read failed yields `None`
            // too — both leave the absent-as-0 defaults and `jemalloc_measured`
            // false so the group folds to Absent, not a sentinel Sum(0). Mirrors
            // the taskstats Ok-gating below (a failed read is not a measurement).
            t.jemalloc_measured = probe_read.is_some();
            // Best-effort taskstats query for delay-accounting +
            // hiwater memory watermarks. tid > 0 invariant is
            // guaranteed by `iter_task_ids_at`'s `> 0` filter; the
            // u32 cast is therefore safe. Failures (the kernel
            // doesn't support taskstats, the tid raced exit, the
            // socket was never opened) fall through to the zero
            // defaults already installed in
            // `capture_thread_at_with_tally`. Each query result —
            // success or failure — feeds the per-snapshot tally
            // so the operator can distinguish "every tid raced
            // exit" from "CAP_NET_ADMIN missing" from "kernel
            // built without CONFIG_TASKSTATS" without parsing the
            // tracing log.
            if let Some(client) = taskstats_client.as_ref() {
                let result = client.query_tid(tid as u32);
                if let Some(tally) = taskstats_tally.as_mut() {
                    tally.record_result(&result);
                }
                if let Ok(ds) = result {
                    t.apply_delay_stats(&ds, taskstats_active);
                }
            }
            // Ghost-thread filter: a tid that exited between the
            // `iter_task_ids_at` readdir and our per-file reads
            // produces an all-Default `ThreadState` — empty comm
            // and zero start_time_clock_ticks, because every
            // procfs file read bailed with ENOENT mid-capture.
            // Including these entries pollutes the comparison: a
            // baseline run might capture 1000 such ghosts and a
            // candidate 500, producing a spurious "500 ghost
            // threads vanished" diff signal in every report. A
            // legitimate thread under a real kernel always
            // carries at least one of these fields — kernel
            // threads have a non-empty comm at creation, user
            // threads inherit one from their parent — so an
            // entry with BOTH empty implies mid-capture exit.
            // The filter preserves the "captures-what-existed"
            // intent without softening the "captures every live
            // thread" invariant.
            if t.comm.is_empty() && t.start_time_clock_ticks == 0 {
                if let Some(t) = tally_opt.as_mut() {
                    t.discard_pending();
                }
                continue;
            }
            if let Some(t) = tally_opt.as_mut() {
                t.commit_pending();
            }
            threads.push(t);
        }
    }
    let probe_summary = if use_syscall_affinity {
        emit_probe_summary(&summary);
        Some(summary.to_public())
    } else {
        None
    };
    let parse_summary = if use_syscall_affinity {
        emit_parse_summary(&parse_tally);
        Some(parse_tally.to_public())
    } else {
        None
    };
    let mut cgroup_stats: BTreeMap<String, CgroupStats> = BTreeMap::new();
    for t in &threads {
        if !t.cgroup.is_empty() && !cgroup_stats.contains_key(&t.cgroup) {
            cgroup_stats.insert(
                t.cgroup.clone(),
                read_cgroup_stats_at(cgroup_root, &t.cgroup),
            );
        }
    }
    let psi = read_host_psi_at(proc_root);
    let sched_ext = read_sched_ext_sysfs_at(sys_root);
    CtprofSnapshot {
        captured_at_unix_ns,
        host,
        threads,
        cgroup_stats,
        probe_summary,
        parse_summary,
        taskstats_summary: taskstats_tally,
        psi,
        sched_ext,
    }
}
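
// Illustrative sketch, not part of the capture path. The worker-panic
// guard that phase 1 of `capture_with` wraps around each per-tgid attach
// can be reduced to two std-only helpers: one recovers a printable message
// from a panic payload, the other runs a step under `catch_unwind` and
// bumps a failure tally through a poison-tolerant lock. The names
// `sketch_panic_message` and `sketch_guarded_step`, and the plain `u32`
// tally, are assumptions made for this example; the real code records the
// outcome on `ProbeSummary` and locks via `lock_unpoisoned`.
#[allow(dead_code)]
fn sketch_panic_message(payload: &(dyn std::any::Any + Send)) -> &str {
    // `panic!("literal")` carries `&'static str`; `panic!("{}", x)` carries
    // `String`. Anything else collapses to a placeholder.
    payload
        .downcast_ref::<&str>()
        .copied()
        .or_else(|| payload.downcast_ref::<String>().map(|s| s.as_str()))
        .unwrap_or("<non-string panic payload>")
}

#[allow(dead_code)]
fn sketch_guarded_step(
    failed: &std::sync::Mutex<u32>,
    step: impl FnOnce() -> u32,
) -> Option<u32> {
    match std::panic::catch_unwind(std::panic::AssertUnwindSafe(step)) {
        Ok(value) => Some(value),
        Err(payload) => {
            // Recover the guard even if `step` panicked while holding the
            // lock, so one failure cannot cascade into every later caller.
            let mut count = failed
                .lock()
                .unwrap_or_else(std::sync::PoisonError::into_inner);
            *count += 1;
            eprintln!("step panicked: {}", sketch_panic_message(&*payload));
            None
        }
    }
}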

/// Capture a complete host-wide snapshot against the default
/// procfs, cgroupfs, and sysfs roots (`/proc`, `/sys/fs/cgroup`,
/// and `DEFAULT_SYS_ROOT`).
/// Probes every jemalloc-linked tgid the walk reaches and
/// populates per-thread `allocated_bytes` / `deallocated_bytes`
/// from the jemalloc TSD counters; tgids the probe cannot attach
/// against (ptrace denied, not jemalloc-linked, stripped binary)
/// land their threads at the absent-counter default of 0 per the
/// best-effort capture contract.
///
/// # Cost
///
/// O(threads-on-host) for the procfs walk; additionally one ELF
/// open + DWARF parse for every tgid `attach_jemalloc` resolves
/// successfully, plus a ptrace seize/interrupt/waitpid/detach
/// round-trip per thread of those tgids. On a host with many
/// jemalloc-linked daemons (database / browser / runtime
/// processes) the probe path dominates the wall-clock cost.
/// Callers that need only one tgid's data should use
/// [`capture_pid`] to scope the walk.
pub fn capture() -> CtprofSnapshot {
    capture_with(
        Path::new(DEFAULT_PROC_ROOT),
        Path::new(DEFAULT_CGROUP_ROOT),
        Path::new(DEFAULT_SYS_ROOT),
        true,
    )
}
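
// Illustrative sketch, not part of the capture path. The probe pass that
// dominates the cost described above is bounded by a headroom clamp in
// `capture_with`: online CPUs minus the 1-minute load average, clamped to
// [1, num_cpus / 2 + 1]. The helper below restates that arithmetic on
// plain inputs so it can be checked in isolation. The name
// `sketch_probe_parallelism` and the choice to pass the loadavg text in as
// a string are assumptions made for this example.
#[allow(dead_code)]
fn sketch_probe_parallelism(num_cpus: usize, loadavg_text: &str) -> usize {
    // First whitespace-separated field of `loadavg` is the 1-minute
    // average; an unreadable or unparsable value is treated as zero load.
    let load = loadavg_text
        .split_whitespace()
        .next()
        .and_then(|field| field.parse::<f64>().ok())
        .unwrap_or(0.0);
    // Never drop below one worker, never exceed roughly half the CPUs.
    let headroom = (num_cpus as f64 - load).max(1.0) as usize;
    headroom.clamp(1, num_cpus / 2 + 1)
}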

/// Capture a ctprof snapshot scoped to a single tgid.
///
/// Walks `/proc/<pid>/task` for thread enumeration but skips every
/// other tgid on the host, sidestepping the wall-clock cost (and
/// blast-radius) of the global probe pass that [`capture`] runs.
/// Probes the target tgid's jemalloc TSD counters when it is
/// jemalloc-linked and not the calling process; otherwise the
/// per-thread allocated / deallocated fields land at zero per the
/// best-effort capture contract.
///
/// Useful for tests and tools that already know which process they
/// care about — the resulting snapshot's `threads` vec only carries
/// entries for `pid`'s tgid (one entry per thread of that process).
/// `host` and `cgroup_stats` populate normally so the snapshot
/// stays self-describing.
pub fn capture_pid(pid: i32) -> CtprofSnapshot {
    capture_pid_with(
        Path::new(DEFAULT_PROC_ROOT),
        Path::new(DEFAULT_CGROUP_ROOT),
        Path::new(DEFAULT_SYS_ROOT),
        pid,
        true,
    )
}
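
// Illustrative sketch, not part of the capture path. The best-effort
// contract above (jemalloc fields land at zero when the tgid is not
// probed or the per-thread read fails) follows the `Option` chain used in
// `capture_with`: an absent probe or a failed read both fold to `(0, 0)`
// with the measured flag cleared, so a zero is never mistaken for a real
// reading. The name `sketch_jemalloc_fields` and the generic probe type
// `P` are assumptions made for this example.
#[allow(dead_code)]
fn sketch_jemalloc_fields<P>(
    probe: Option<&P>,
    read: impl FnOnce(&P) -> Option<(u64, u64)>,
) -> (u64, u64, bool) {
    // `None` probe: the read is never attempted. `Some` probe: the read
    // itself may still fail and return `None`.
    let probe_read = probe.and_then(read);
    let (allocated, deallocated) = probe_read.unwrap_or((0, 0));
    (allocated, deallocated, probe_read.is_some())
}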

/// `proc_root` / `cgroup_root` / `sys_root` parameterised variant
/// of [`capture_pid`]. Lets tests stage a synthetic procfs /
/// cgroupfs / sysfs for the capture walk without touching the
/// real host.
///
/// `use_syscall_affinity` gates the same four real-host
/// touchpoints as [`capture_with`] — host-context collection,
/// the jemalloc probe attach (here scoped to the single target
/// `pid` rather than a phase-1 sweep across every tgid),
/// `sched_getaffinity(2)` inside per-thread capture, and
/// `emit_probe_summary` plus the [`CtprofProbeSummary`] on the
/// snapshot. Synthetic-tree tests pass `false` because the
/// staged procfs has no real ELF behind `/proc/<pid>/exe`;
/// production passes `true`. Self-skip parallels the global path:
/// when `pid == self_pid`, the `probe` binding is `None` (the
/// `&& pid != self_pid` guard skips the attach), and each tid's
/// `probe.as_ref().map(...).unwrap_or((0, 0))` short-circuits to
/// the absent-counter default for the jemalloc fields, with every
/// other procfs-derived field populated normally.
fn capture_pid_with(
    proc_root: &Path,
    cgroup_root: &Path,
    sys_root: &Path,
    pid: i32,
    use_syscall_affinity: bool,
) -> CtprofSnapshot {
    let captured_at_unix_ns = std::time::SystemTime::now()
        .duration_since(std::time::UNIX_EPOCH)
        .map(|d| d.as_nanos() as u64)
        .unwrap_or(0);
    let host = if use_syscall_affinity {
        Some(crate::host_context::collect_host_context())
    } else {
        None
    };
    // Linux pid_max is bounded above by 2^22 (PID_MAX_LIMIT,
    // defined in include/linux/threads.h; kernel/pid.c clamps
    // pid_max to it via pid_max_max) on every supported
    // architecture, well inside i32::MAX, so the u32 → i32 cast
    // cannot wrap.
    let self_pid = std::process::id() as i32;
    let pcomm = read_process_comm_at(proc_root, pid).unwrap_or_default();
    let mut summary = ProbeSummary::default();
    let mut failed_tgids_logged: std::collections::BTreeSet<i32> =
        std::collections::BTreeSet::new();
    let probe = if use_syscall_affinity && pid != self_pid {
        try_attach_probe_for_tgid_at(proc_root, pid, &mut summary)
    } else {
        None
    };
    let mut threads: Vec<ThreadState> = Vec::new();
    let mut parse_tally = ParseTally::default();
    let mut tally_opt: Option<&mut ParseTally> = if use_syscall_affinity {
        Some(&mut parse_tally)
    } else {
        None
    };
    // Per-sub-family enablement, probed once per snapshot (see `capture_with`).
    let taskstats_active = crate::host_context::probe_taskstats_active();
    // Best-effort taskstats client; same discipline as `capture_with`.
    let taskstats_client = if use_syscall_affinity {
        match crate::taskstats::TaskstatsClient::open() {
            Ok(c) => Some(c),
            Err(e) => {
                tracing::warn!(
                    error = %e,
                    "ctprof taskstats: open failed; delay-accounting and memory-watermark \
                     fields will be zero. Ensure the kernel was built with CONFIG_TASKSTATS \
                     (plus CONFIG_TASK_DELAY_ACCT for delay fields and CONFIG_TASK_XACCT for \
                     hiwater fields), the process holds CAP_NET_ADMIN, and the kernel was \
                     booted with `delayacct=on` (or sysctl `kernel.task_delayacct=1`)"
                );
                None
            }
        }
    } else {
        None
    };
    // Per-snapshot tally; mirrors the `capture_with` discipline.
    // Allocated only under `use_syscall_affinity` so the
    // synthetic-tree code path keeps `taskstats_summary: None` on
    // the resulting snapshot, identical to `parse_summary` /
    // `probe_summary`.
    let mut taskstats_tally: Option<crate::taskstats::TaskstatsSummary> = if use_syscall_affinity {
        Some(crate::taskstats::TaskstatsSummary::default())
    } else {
        None
    };
    for tid in iter_task_ids_at(proc_root, pid) {
        if let Some(tally) = tally_opt.as_mut() {
            tally.tids_walked += 1;
        }
        let comm = read_thread_comm_at(proc_root, pid, tid).unwrap_or_default();
        let probe_read = probe.as_ref().and_then(|p| {
            probe_thread_recording(
                p,
                tid,
                pid,
                &pcomm,
                &comm,
                &mut summary,
                &mut failed_tgids_logged,
            )
        });
        let (allocated_bytes, deallocated_bytes) = probe_read.unwrap_or((0, 0));
        let mut t = capture_thread_at_with_tally(
            proc_root,
            pid,
            tid,
            &pcomm,
            &comm,
            use_syscall_affinity,
            &mut tally_opt,
        );
        t.allocated_bytes = crate::metric_types::Bytes(allocated_bytes);
        t.deallocated_bytes = crate::metric_types::Bytes(deallocated_bytes);
        // jemalloc measured iff the per-thread probe READ succeeded
        // (`probe_read.is_some()`); see the capture_with twin. A non-jemalloc
        // tgid (probe None) or an attached tgid whose per-thread read failed both
        // leave jemalloc_measured false so the group folds to Absent instead of a
        // sentinel Sum(0).
        t.jemalloc_measured = probe_read.is_some();
        if let Some(client) = taskstats_client.as_ref() {
            let result = client.query_tid(tid as u32);
            if let Some(tally) = taskstats_tally.as_mut() {
                tally.record_result(&result);
            }
            if let Ok(ds) = result {
                t.apply_delay_stats(&ds, taskstats_active);
            }
        }
        // A thread with neither a comm nor a start time yielded no usable
        // data: drop its pending parse tally and skip it.
        if t.comm.is_empty() && t.start_time_clock_ticks == 0 {
            if let Some(tally) = tally_opt.as_mut() {
                tally.discard_pending();
            }
            continue;
        }
        if let Some(tally) = tally_opt.as_mut() {
            tally.commit_pending();
        }
        threads.push(t);
    }
    let probe_summary = if use_syscall_affinity {
        emit_probe_summary(&summary);
        Some(summary.to_public())
    } else {
        None
    };
    let parse_summary = if use_syscall_affinity {
        emit_parse_summary(&parse_tally);
        Some(parse_tally.to_public())
    } else {
        None
    };
    let mut cgroup_stats: BTreeMap<String, CgroupStats> = BTreeMap::new();
    for t in &threads {
        if !t.cgroup.is_empty() && !cgroup_stats.contains_key(&t.cgroup) {
            cgroup_stats.insert(
                t.cgroup.clone(),
                read_cgroup_stats_at(cgroup_root, &t.cgroup),
            );
        }
    }
    let psi = read_host_psi_at(proc_root);
    let sched_ext = read_sched_ext_sysfs_at(sys_root);
    CtprofSnapshot {
        captured_at_unix_ns,
        host,
        threads,
        cgroup_stats,
        probe_summary,
        parse_summary,
        taskstats_summary: taskstats_tally,
        psi,
        sched_ext,
    }
}

/// Capture a snapshot and write it to `path` in the canonical
/// zstd+JSON format. Wrapper over [`capture`] +
/// [`CtprofSnapshot::write`] so CLI code can stay a single
/// call.
pub fn capture_to(path: &Path) -> Result<()> {
    capture().write(path)
}

// Test modules, alphabetized.
#[cfg(test)]
mod tests_capture;
#[cfg(test)]
mod tests_cgroup;
#[cfg(test)]
mod tests_helpers;
#[cfg(test)]
mod tests_parse;
#[cfg(test)]
mod tests_parse_summary;
#[cfg(test)]
mod tests_probe;
#[cfg(test)]
mod tests_snapshot;
#[cfg(test)]
mod tests_thread_state;