Per-thread ctprof (cgroup/thread profiler) data model + capture layer.
CtprofSnapshot is the serialized container for a single
host-wide per-thread profile. Capture produces one via the
ktstr ctprof capture -o snapshot.ctprof.zst subcommand;
comparison reads two and joins them on the selected grouping
axis (pcomm, cgroup, or comm).
Field families and probe-timing invariance:
- Cumulative counters and totals (the majority): wakeups,
  migrations, csw, run/wait/sleep/block/iowait time, schedstat
  counts, page-fault counters, syscall counters, byte counters,
  the taskstats per-bucket *_count and *_delay_total_ns, the
  jemalloc per-thread allocated/deallocated TSD counters, etc.
  Sampled twice at different instants the value increases
  monotonically; probe-attach latency does not alter the reading.
- Lifetime high-water peaks: schedstat *_max family (wait_max,
  sleep_max, block_max, exec_max, slice_max), every taskstats
  *_delay_max_ns / *_delay_min_ns, and the memory watermarks
  (hiwater_rss_bytes, hiwater_vm_bytes). These are
  non-decreasing-over-time but per-event extrema rather than sums,
  so they are non-summable across threads (the registry reduces
  them via MaxPeak/MaxPeakBytes). Same probe-timing invariance as
  the cumulative counters.
- Instantaneous gauges (sensitive to probe timing):
  ThreadState::nr_threads (signal_struct->nr_threads snapshot),
  ThreadState::fair_slice_ns (instantaneous p->se.slice), and
  ThreadState::state (task_state_array letter). Sampled at capture
  time and can genuinely differ between two probes of the same
  thread. The registry pairs them with
  MaxGaugeCount/MaxGaugeNs/ModeChar reductions rather than the
  Sum* rules used for cumulative counters.
- Categorical / ordinal scalars (point-in-time snapshots): policy,
  nice, priority, processor, rt_priority, plus the identity strings
  (pcomm, comm, cgroup) and the crate::metric_types::CpuSet
  cpu_affinity. These are sampled at capture time and can change at
  runtime (e.g. sched_setaffinity mid-run flips processor and
  cpu_affinity), so they share the gauge family’s probe-timing
  sensitivity. The registry reduces them via Mode*/Range*/Affinity
  rather than Sum*.
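To make the difference between the reduction families concrete, here is a minimal sketch of how a sum, a peak, and a mode reduction treat the same per-thread values. The Reduction enum and the reduce function are hypothetical names chosen for illustration; they are not the registry's actual types, and the tie-breaking rule for the mode is an assumption.

```rust
use std::collections::BTreeMap;

/// Hypothetical reduction kinds, loosely modelled on the families above.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Reduction {
    /// Cumulative counters: totals add across threads.
    Sum,
    /// Lifetime peaks: keep the largest per-event extreme.
    MaxPeak,
    /// Categorical scalars: keep the most frequent value.
    Mode,
}

/// Reduce a group of per-thread values under one rule.
/// Returns None for an empty group.
fn reduce(rule: Reduction, values: &[u64]) -> Option<u64> {
    if values.is_empty() {
        return None;
    }
    match rule {
        Reduction::Sum => Some(values.iter().fold(0u64, |acc, &v| acc.saturating_add(v))),
        Reduction::MaxPeak => values.iter().copied().max(),
        Reduction::Mode => {
            let mut counts: BTreeMap<u64, usize> = BTreeMap::new();
            for &v in values {
                *counts.entry(v).or_insert(0) += 1;
            }
            // Iterate in ascending key order and replace only on a strictly
            // higher count, so ties resolve to the smallest value (assumed rule).
            let mut best: Option<(u64, usize)> = None;
            for (value, count) in counts {
                if best.map_or(true, |(_, best_count)| count > best_count) {
                    best = Some((value, count));
                }
            }
            best.map(|(value, _)| value)
        }
    }
}
```

Summing wait_max across threads would report a stall that no single thread ever experienced, which is why the peak families need their own rule.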
jemalloc maintains the per-thread TSD counters
(tsd_s.thread_allocated / thread_deallocated) unconditionally
on its alloc/dalloc fast and slow paths, so the ptrace-based
attach this layer performs does not perturb them; counters
accumulated before the attach remain valid across the brief
stop it induces. Metrics not derivable from cumulative state
(e.g. perf_event_open counters that reset on attachment) are
intentionally absent from this capture layer.
§Capture model
capture walks /proc for every live tgid, enumerates its
threads, and populates each ThreadState from a handful of
procfs sources: stat, schedstat, status, io, sched,
comm, cgroup. The procfs walk runs sequentially per tid in
capture_with phase 2. Phase 1 attaches the jemalloc TSD
probe in parallel across tgids when use_syscall_affinity is
true (the production path); under use_syscall_affinity = false (the synthetic-tree test path), phase 1 is skipped
entirely — the per-tgid probe map starts and stays empty, and
phase 2’s per-tid lookup falls through to the absent-counter
default of zero. See “Probe wiring” below for the per-tgid
mechanics.
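The tgid and thread enumeration described above can be sketched with nothing but std::fs. This is an illustrative sketch, not the crate's capture_with: the function names are hypothetical, and taking the proc root as a parameter is an assumption made so the walk can run against a synthetic tree as well as /proc.

```rust
use std::fs;
use std::path::Path;

/// Numeric directory-entry names under `dir`, sorted ascending.
/// An unreadable directory yields an empty list rather than an error.
fn numeric_entries(dir: &Path) -> Vec<u32> {
    let mut ids: Vec<u32> = match fs::read_dir(dir) {
        Ok(entries) => entries
            .filter_map(|entry| entry.ok())
            .filter_map(|entry| {
                entry
                    .file_name()
                    .to_str()
                    .and_then(|name| name.parse::<u32>().ok())
            })
            .collect(),
        Err(_) => Vec::new(),
    };
    ids.sort_unstable();
    ids
}

/// Every (tgid, tid) pair found under `<proc_root>/<tgid>/task/<tid>`.
fn walk_threads(proc_root: &Path) -> Vec<(u32, u32)> {
    let mut pairs = Vec::new();
    for tgid in numeric_entries(proc_root) {
        let task_dir = proc_root.join(tgid.to_string()).join("task");
        // A tgid that exits mid-walk simply contributes no threads.
        for tid in numeric_entries(&task_dir) {
            pairs.push((tgid, tid));
        }
    }
    pairs
}
```

Calling walk_threads(Path::new("/proc")) on a live host lists every visible thread; non-numeric entries such as self are skipped by the parse.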
§Probe wiring (most-expensive step)
For every tgid the walk reaches, the capture pipeline calls
the pub(crate) host_thread_probe::attach_jemalloc_at (or
its default-root attach_jemalloc wrapper) to resolve the
target’s jemalloc TLS symbol + per-tsd_s field offsets via
an ELF parse and DWARF walk; per-thread counter reads then
dispatch through host_thread_probe::probe_thread for one
ptrace cycle: seize → interrupt → waitpid → getregset →
process_vm_readv → detach (the detach happens automatically
via the ScopeDetach Drop guard, so any fallible step still
leaves the target unstuck). The remote read pulls a
contiguous 24-byte counter span — the canonical jemalloc
TSD_DATA_FAST layout (allocated, fast-event slot,
deallocated) — but the byte count is computed dynamically by
combined_read_span from the DWARF-resolved field offsets, so
a future jemalloc layout change is absorbed. This is the
dominant wall-clock cost of a snapshot:
O(unique-exe-inode tgids) ELF parses + O(jemalloc-linked
tgids) DWARF walks + O(threads of jemalloc-linked tgids)
ptrace cycles. The first term covers non-jemalloc tgids: each
distinct /proc/<pid>/exe inode still costs one ELF parse to
discover absence (the inode-keyed cache below collapses
repeats). attach_jemalloc_at is the sole detection gate —
tgids that attach successfully populate allocated_bytes /
deallocated_bytes; tgids that fail attach (not
jemalloc-linked, stripped binary, ptrace denied, arch mismatch; see
host_thread_probe::AttachError) land their threads at the
absent-counter default of zero.
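As an illustration of why the read length follows the resolved offsets rather than a hard-coded 24, the sketch below computes the smallest span covering a set of (offset, size) fields and decodes one counter out of the resulting buffer. Both function names are hypothetical and are not the crate's combined_read_span; the offsets in the usage note are placeholders, and little-endian decoding is an assumption.

```rust
/// Smallest contiguous span covering every (offset, size) field.
/// Returns (start_offset, byte_len), or None for an empty field list.
fn span_covering(fields: &[(usize, usize)]) -> Option<(usize, usize)> {
    let start = fields.iter().map(|&(off, _)| off).min()?;
    let end = fields.iter().map(|&(off, size)| off + size).max()?;
    Some((start, end - start))
}

/// Decode one little-endian u64 field from a buffer that was read
/// starting at `span_start`. Returns None if the field lies outside it.
fn field_u64(buf: &[u8], span_start: usize, field_off: usize) -> Option<u64> {
    let rel = field_off.checked_sub(span_start)?;
    let end = rel.checked_add(8)?;
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(buf.get(rel..end)?);
    Some(u64::from_le_bytes(bytes))
}
```

With two 8-byte counters at placeholder offsets 840 and 856, span_covering returns (840, 24), matching the three-slot layout described above. If a layout change moved the fields further apart, the span would grow without any change to the reader.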
Phase 1 parallelism is gated by host CPU headroom (read from
<proc_root>/loadavg, clamped to [1, num_cpus/2 + 1]) so the
capture cannot drown a hot host with concurrent ELF reads.
Per-tgid attach results are cached by exe inode, so a fork-bombed
tgid family resolves DWARF once. The per-tgid wrapper
try_attach_probe_for_tgid_at records every outcome in a single
ProbeSummary tally; emit_probe_summary surfaces a single
info-level line per snapshot summarising tgids walked, jemalloc
detected, probed OK, failed, plus the dominant actionable
failure tag and an EPERM remediation hint when ptrace-attach
failures dominate.
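The headroom gate can be pictured with the following sketch. The text above only states the input (loadavg) and the clamp range; the idle estimate used here (CPU count minus the 1-minute load average, rounded down) is an assumption for illustration, and probe_parallelism is a hypothetical name.

```rust
/// Pick a worker count from the text of a loadavg file and the CPU count.
/// The result always lies in [1, num_cpus / 2 + 1].
fn probe_parallelism(loadavg: &str, num_cpus: usize) -> usize {
    let ceiling = num_cpus / 2 + 1;
    // First whitespace-separated field is the 1-minute load average.
    let load1 = loadavg
        .split_whitespace()
        .next()
        .and_then(|field| field.parse::<f64>().ok());
    // Assumed headroom estimate: CPUs not accounted for by runnable load.
    let idle = match load1 {
        Some(load) => (num_cpus as f64 - load).floor().max(0.0) as usize,
        // Unreadable or malformed loadavg: fall back to the floor of 1.
        None => 0,
    };
    idle.clamp(1, ceiling)
}
```

On an idle 8-CPU host this yields 5 workers, while a host whose load already exceeds its CPU count is held to a single worker.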
Each internal procfs reader returns Option and degrades
gracefully when a file is missing or unreadable: a kernel built
without CONFIG_SCHED_INFO (the schedstat file) or
CONFIG_TASK_IO_ACCOUNTING (the io file) leaves that file absent,
so its reader yields None without failing the rest of the
thread. The assembled
ThreadState treats None as “absent at capture” via the
field type — counters collapse to 0, identity strings
collapse to empty, affinity collapses to an empty vec. A
missing reading is therefore indistinguishable from a genuine
zero in the serialized output; the capture contract is
best-effort, never-fail-the-snapshot. Tests that need stronger
guarantees inspect the underlying readers directly (they remain
Option-shaped, unit-tested in this module).
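The Option-then-collapse pattern can be sketched for the schedstat file, whose single line carries three numbers (on-CPU time in ns, run-queue wait time in ns, and a timeslice count). The struct and function names are hypothetical and do not reflect the crate's internal readers.

```rust
use std::fs;
use std::path::Path;

/// Hypothetical holder for the three schedstat fields.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
struct SchedStat {
    run_ns: u64,
    wait_ns: u64,
    timeslices: u64,
}

/// Parse the text of a schedstat file. Any missing or non-numeric
/// field yields None instead of a partial reading.
fn parse_schedstat(text: &str) -> Option<SchedStat> {
    let mut fields = text.split_whitespace().map(|f| f.parse::<u64>().ok());
    Some(SchedStat {
        run_ns: fields.next()??,
        wait_ns: fields.next()??,
        timeslices: fields.next()??,
    })
}

/// Read and parse one schedstat file. A missing or unreadable file
/// is None, not an error.
fn read_schedstat(path: &Path) -> Option<SchedStat> {
    parse_schedstat(&fs::read_to_string(path).ok()?)
}
```

Calling unwrap_or_default() on the reader's result produces the serialized shape: a None reading and a genuine all-zero reading become the same SchedStat, which is the indistinguishability noted above.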
§Privilege
Pulling the jemalloc per-thread TSD counters requires
ptrace(PTRACE_SEIZE) against the target. Under
kernel.yama.ptrace_scope=0 any same-uid process can attach.
Under =1 (Debian/Ubuntu host default) the tracer must be an
ancestor of the target or carry CAP_SYS_PTRACE; =2 restricts
attach to CAP_SYS_PTRACE holders and =3 disables attach
entirely. When attach fails, the per-thread
allocated_bytes / deallocated_bytes collapse to 0 per the
best-effort contract — the rest of the snapshot still
populates from procfs.
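A capture front end could check the Yama setting before attempting any attach. The sketch below maps the contents of /proc/sys/kernel/yama/ptrace_scope to the four documented levels; the enum, the function names, and the hint strings are hypothetical and not part of this module.

```rust
/// The four Yama ptrace_scope levels.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PtraceScope {
    Classic,    // 0
    Restricted, // 1
    AdminOnly,  // 2
    NoAttach,   // 3
}

/// Parse the text of the ptrace_scope sysctl file.
/// Unknown values yield None.
fn parse_ptrace_scope(raw: &str) -> Option<PtraceScope> {
    match raw.trim() {
        "0" => Some(PtraceScope::Classic),
        "1" => Some(PtraceScope::Restricted),
        "2" => Some(PtraceScope::AdminOnly),
        "3" => Some(PtraceScope::NoAttach),
        _ => None,
    }
}

/// One-line operator hint for each level (wording is illustrative).
fn attach_hint(scope: PtraceScope) -> &'static str {
    match scope {
        PtraceScope::Classic => "same-uid attach is permitted",
        PtraceScope::Restricted => "attach needs an ancestor tracer or CAP_SYS_PTRACE",
        PtraceScope::AdminOnly => "attach needs CAP_SYS_PTRACE",
        PtraceScope::NoAttach => "attach is disabled until reboot",
    }
}
```

Reading the file with std::fs::read_to_string and passing the text to parse_ptrace_scope lets a caller warn up front that allocated_bytes / deallocated_bytes are likely to come back as 0.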
Structs§
- CgroupCpuStats
  CPU controller state for one cgroup. Fields mirror the cpu.*
  cgroup v2 files exposed under <cgroup>/cpu.stat,
  <cgroup>/cpu.max, <cgroup>/cpu.weight, and
  <cgroup>/cpu.weight.nice.
- CgroupMemoryStats
  Memory controller state for one cgroup. Fields mirror the
  memory.* cgroup v2 files. stat and events are captured as flat
  key-value maps so the data model auto-extends when the kernel
  adds new keys (memory.stat has 71 keys on a recent kernel; the
  explicit list is scheduler-correctness-relevant but the map
  preserves regression-detection on lesser-known counters).
- CgroupPidsStats
  PIDs controller state for one cgroup. Fields mirror the pids.*
  cgroup v2 files. The pids controller is optional (must be
  enabled in cgroup.subtree_control); on hosts that don’t enable
  it, both fields are None.
- CgroupStats
  Per-cgroup enrichment record attached to CtprofSnapshot.
- CtprofParseSummary
  Per-snapshot procfs read-failure statistics. Curated projection
  of the capture pipeline’s internal read-tally: it exposes
  per-file counters and a dominant-failure tag a downstream
  consumer needs to decide whether the snapshot’s procfs-derived
  fields (CSW, schedstats, IO, etc.) are trustworthy on a given
  host without scanning every thread for default values.
- CtprofProbeSummary
  Per-snapshot probe outcome statistics. Curated projection of
  the capture pipeline’s internal probe tally: it exposes the
  counters, the dominant failure tag, and a privilege_dominant
  boolean a downstream consumer needs to decide whether the
  snapshot’s allocated_bytes / deallocated_bytes fields are
  trustworthy on a given host without parsing the
  operator-facing tracing line.
- CtprofSnapshot
  Top-level serialized artifact produced by ktstr ctprof.
- Psi
  Bundle of PsiResource for the four kernel-exposed resources.
  Same shape used at both system level (CtprofSnapshot::psi) and
  per-cgroup (CgroupStats::psi); the data source differs but the
  kernel emits the same format and field set in both places.
- PsiHalf
  One Pressure Stall Information half-line: either the some or
  full row for one resource. Mirrors the kernel emission format
  %s avg10=%lu.%02lu avg60=%lu.%02lu avg300=%lu.%02lu total=%llu
  in psi_show() (kernel/sched/psi.c).
- PsiResource
  Pressure Stall Information for one resource (cpu / memory /
  io / irq), bundling the some and full halves.
- SchedExtSysfs
  Global sched_ext sysfs state, captured from
  /sys/kernel/sched_ext/. The kernel registers exactly five
  global attributes via scx_global_attrs[] (kernel/sched/ext.c);
  this struct mirrors them 1-to-1.
- ThreadState
  Per-thread resource profile.
Functions§
- capture
  Capture a complete host-wide snapshot against the default
  procfs and cgroup roots (/proc and /sys/fs/cgroup). Probes
  every jemalloc-linked tgid the walk reaches and populates
  per-thread allocated_bytes / deallocated_bytes from the
  jemalloc TSD counters; tgids the probe cannot attach against
  (ptrace denied, not jemalloc-linked, stripped binary) land
  their threads at the absent-counter default of 0 per the
  best-effort capture contract.
- capture_pid
  Capture a ctprof snapshot scoped to a single tgid.
- capture_to
  Capture a snapshot and write it to path in the canonical
  zstd+JSON format. Wrapper over capture +
  CtprofSnapshot::write so CLI code can stay a single call.