Module ctprof


Per-thread ctprof (cgroup/thread profiler) data model + capture layer.

CtprofSnapshot is the serialized container for a single host-wide per-thread profile. Capture produces one via the ktstr ctprof capture -o snapshot.ctprof.zst subcommand; comparison reads two and joins them on the selected grouping axis (pcomm, cgroup, or comm).
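The comparison step can be pictured as a group-then-join over two snapshots. The sketch below is illustrative only: `Row`, `sum_by_pcomm`, and `diff_groups` are hypothetical names that are not part of this module, a single counter stands in for the full field set, and treating a group missing from one side as zero is an assumption rather than documented behaviour.

```rust
use std::collections::BTreeMap;

/// Hypothetical, trimmed-down thread record: one grouping key, one counter.
struct Row {
    pcomm: String,
    wakeups: u64,
}

/// Collapse per-thread rows into per-group sums, keyed on `pcomm`.
fn sum_by_pcomm(rows: &[Row]) -> BTreeMap<String, u64> {
    let mut out = BTreeMap::new();
    for r in rows {
        *out.entry(r.pcomm.clone()).or_insert(0u64) += r.wakeups;
    }
    out
}

/// Outer-join two grouped snapshots and report `after - before` per group.
/// Assumption: a group absent from one side contributes 0 on that side.
fn diff_groups(
    before: &BTreeMap<String, u64>,
    after: &BTreeMap<String, u64>,
) -> BTreeMap<String, i128> {
    let mut out = BTreeMap::new();
    for key in before.keys().chain(after.keys()) {
        let b = before.get(key).copied().unwrap_or(0) as i128;
        let a = after.get(key).copied().unwrap_or(0) as i128;
        out.insert(key.clone(), a - b);
    }
    out
}
```

Grouping on cgroup or comm instead would only change the key extracted from each row.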

Field families and probe-timing invariance:

  • Cumulative counters and totals (the majority): wakeups, migrations, csw, run/wait/sleep/block/iowait time, schedstat counts, page-fault counters, syscall counters, byte counters, the taskstats per-bucket *_count and *_delay_total_ns, the jemalloc per-thread allocated/deallocated TSD counters, etc. Sampled at two different instants, the value never decreases; probe-attach latency does not alter the reading.
  • Lifetime high-water peaks: schedstat *_max family (wait_max, sleep_max, block_max, exec_max, slice_max), every taskstats *_delay_max_ns / *_delay_min_ns, and the memory watermarks (hiwater_rss_bytes, hiwater_vm_bytes). These are non-decreasing over time but are per-event extrema rather than sums, so they cannot be summed across threads (the registry reduces them via MaxPeak / MaxPeakBytes). They share the probe-timing invariance of the cumulative counters.
  • Instantaneous gauges (sensitive to probe timing): ThreadState::nr_threads (signal_struct->nr_threads snapshot), ThreadState::fair_slice_ns (instantaneous p->se.slice), and ThreadState::state (task_state_array letter). Sampled at capture time and can genuinely differ between two probes of the same thread. The registry pairs them with MaxGaugeCount / MaxGaugeNs / ModeChar reductions rather than the Sum* rules used for cumulative counters.
  • Categorical / ordinal scalars (point-in-time snapshots): policy, nice, priority, processor, rt_priority, plus the identity strings (pcomm, comm, cgroup) and the crate::metric_types::CpuSet cpu_affinity. These are sampled at capture time and can change at runtime (e.g. sched_setaffinity mid-run flips processor and cpu_affinity), so they share the gauge family’s probe-timing sensitivity. The registry reduces them via Mode* / Range* / Affinity rather than Sum*.
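The practical consequence of the four families is that each one needs a different cross-thread reduction. The following is a minimal sketch of that split, not the registry itself: the `Reduction` enum and both functions are hypothetical, and they only loosely echo the Sum* / Max* / Mode* rule names mentioned above.

```rust
use std::collections::HashMap;

/// Hypothetical reduction rule for numeric fields.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Reduction {
    /// Cumulative counters: totals add up across threads.
    Sum,
    /// Lifetime peaks and instantaneous gauges: keep the largest reading.
    Max,
}

/// Reduce one numeric field across the threads of a group.
fn reduce(rule: Reduction, values: &[u64]) -> u64 {
    match rule {
        Reduction::Sum => values.iter().copied().fold(0u64, u64::saturating_add),
        Reduction::Max => values.iter().copied().max().unwrap_or(0),
    }
}

/// Mode reduction for a categorical scalar such as the state letter.
/// Ties resolve to the smallest character so the result is deterministic
/// (an assumption made for this sketch).
fn mode_char(values: &[char]) -> Option<char> {
    let mut counts: HashMap<char, usize> = HashMap::new();
    for &c in values {
        *counts.entry(c).or_insert(0) += 1;
    }
    counts
        .into_iter()
        .max_by(|a, b| a.1.cmp(&b.1).then(b.0.cmp(&a.0)))
        .map(|(c, _)| c)
}
```

Summing a peak such as wait_max across threads would produce a number that no single thread ever observed, which is why the peak and gauge families avoid the sum rule.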

jemalloc maintains its per-thread TSD counters (tsd_s.thread_allocated / thread_deallocated) unconditionally on its alloc/dalloc fast and slow paths, so the ptrace-based attach this layer performs does not perturb them; counters accumulated before the attach remain valid across the brief stop it induces. Metrics not derivable from cumulative state (e.g. perf_event_open counters that reset on attachment) are intentionally absent from this capture layer.

§Capture model

capture walks /proc for every live tgid, enumerates its threads, and populates each ThreadState from a handful of procfs sources: stat, schedstat, status, io, sched, comm, cgroup. The procfs walk runs sequentially per tid in capture_with phase 2. Phase 1 attaches the jemalloc TSD probe in parallel across tgids when use_syscall_affinity is true (the production path); under use_syscall_affinity = false (the synthetic-tree test path), phase 1 is skipped entirely — the per-tgid probe map starts and stays empty, and phase 2’s per-tid lookup falls through to the absent-counter default of zero. See “Probe wiring” below for the per-tgid mechanics.
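To make the per-tid procfs step concrete, here is a hedged sketch of one reader. It parses the three whitespace-separated fields of /proc/<pid>/task/<tid>/schedstat (on-CPU time in ns, runqueue wait time in ns, timeslice count). `SchedstatLine`, `parse_schedstat`, and `read_schedstat` are hypothetical names chosen for illustration; the module's real readers may be shaped differently.

```rust
use std::path::Path;

/// Hypothetical holder for the three schedstat fields of one thread.
#[derive(Debug, PartialEq, Eq)]
struct SchedstatLine {
    run_ns: u64,
    wait_ns: u64,
    timeslices: u64,
}

/// Parse the file body; any missing or non-numeric field yields None.
fn parse_schedstat(text: &str) -> Option<SchedstatLine> {
    let mut fields = text.split_whitespace().map(|f| f.parse::<u64>().ok());
    Some(SchedstatLine {
        run_ns: fields.next()??,
        wait_ns: fields.next()??,
        timeslices: fields.next()??,
    })
}

/// Read one thread's schedstat under an arbitrary procfs root.
/// An absent or unreadable file yields None instead of an error.
fn read_schedstat(proc_root: &Path, tgid: u32, tid: u32) -> Option<SchedstatLine> {
    let path = proc_root
        .join(tgid.to_string())
        .join("task")
        .join(tid.to_string())
        .join("schedstat");
    parse_schedstat(&std::fs::read_to_string(path).ok()?)
}
```

Taking the procfs root as a parameter is what lets a synthetic directory tree stand in for /proc in tests.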

§Probe wiring (most-expensive step)

For every tgid the walk reaches, the capture pipeline calls the pub(crate) host_thread_probe::attach_jemalloc_at (or its default-root attach_jemalloc wrapper) to resolve the target’s jemalloc TLS symbol and per-tsd_s field offsets via an ELF parse and DWARF walk. Per-thread counter reads then dispatch through host_thread_probe::probe_thread for one ptrace cycle: seize → interrupt → waitpid → getregset → process_vm_readv → detach. The detach happens automatically via the ScopeDetach Drop guard, so a failure at any fallible step still leaves the target unstuck. The remote read pulls a contiguous 24-byte counter span, the canonical jemalloc TSD_DATA_FAST layout (allocated, fast-event slot, deallocated), but the byte count is computed dynamically by combined_read_span from the DWARF-resolved field offsets, so a future jemalloc layout change is absorbed. This is the dominant wall-clock cost of a snapshot: O(unique-exe-inode tgids) ELF parses + O(jemalloc-linked tgids) DWARF walks + O(threads of jemalloc-linked tgids) ptrace cycles. The first term covers non-jemalloc tgids: each distinct /proc/<pid>/exe inode still costs one ELF parse to discover absence (the inode-keyed cache below collapses repeats). attach_jemalloc_at is the sole detection gate: tgids that attach successfully populate allocated_bytes / deallocated_bytes; tgids that fail attach (not jemalloc-linked, stripped binary, ptrace denied, arch mismatch; see host_thread_probe::AttachError) land their threads at the absent-counter default of zero.
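The arithmetic behind the 24-byte span is small enough to sketch. The code below is an illustration, not the crate's combined_read_span: `read_span` and `counter_at` are hypothetical helpers, and the 8-byte counter width is an assumption consistent with the (allocated, fast-event slot, deallocated) layout described above.

```rust
/// Assumed width of one jemalloc per-thread counter (a 64-bit integer).
const COUNTER_BYTES: u64 = 8;

/// Hypothetical span computation: (start offset, length) of the smallest
/// contiguous range that covers both counters, whatever their order.
fn read_span(allocated_off: u64, deallocated_off: u64) -> (u64, u64) {
    let start = allocated_off.min(deallocated_off);
    let end = allocated_off.max(deallocated_off) + COUNTER_BYTES;
    (start, end - start)
}

/// Extract one counter from the bytes returned by the remote read.
/// Returns None if the field does not lie fully inside the buffer.
fn counter_at(buf: &[u8], span_start: u64, field_off: u64) -> Option<u64> {
    let rel = field_off.checked_sub(span_start)? as usize;
    let bytes = buf.get(rel..rel.checked_add(8)?)?;
    let mut raw = [0u8; 8];
    raw.copy_from_slice(bytes);
    // Tracer and target share a host, so native byte order is assumed.
    Some(u64::from_ne_bytes(raw))
}
```

With allocated at offset 0 and deallocated at offset 16, `read_span` yields a 24-byte read; if a layout change moved the fields, the same computation would produce a different length without any hard-coded constant changing.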

Phase 1 parallelism is gated by host CPU headroom (read from <proc_root>/loadavg, clamped to [1, num_cpus/2 + 1]) so the capture cannot drown a hot host with concurrent ELF reads. Per-tgid attach results are inode-keyed cached so a fork-bombed tgid family resolves DWARF once. The per-tgid wrapper try_attach_probe_for_tgid_at records every outcome in a single ProbeSummary tally; emit_probe_summary surfaces a single info-level line per snapshot summarising tgids walked, jemalloc detected, probed OK, failed, plus the dominant actionable failure tag and an EPERM remediation hint when ptrace-attach failures dominate.
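The clamp described above can be sketched as follows. Only the [1, num_cpus/2 + 1] bounds come from the text; the headroom formula (CPU count minus the rounded-up 1-minute load average), the conservative fallback on an unreadable file, and the names `load1` and `probe_parallelism` are assumptions made for illustration.

```rust
/// Parse the 1-minute load average: the first whitespace-separated
/// field of the loadavg file.
fn load1(loadavg: &str) -> Option<f64> {
    loadavg.split_whitespace().next()?.parse().ok()
}

/// Hypothetical worker-count policy for the attach phase.
/// Assumed headroom: num_cpus - ceil(load1), clamped to [1, num_cpus / 2 + 1].
fn probe_parallelism(loadavg: &str, num_cpus: usize) -> usize {
    let upper = num_cpus / 2 + 1;
    let busy = match load1(loadavg) {
        Some(l) if l.is_finite() && l >= 0.0 => l.ceil() as usize,
        // Unreadable or nonsensical loadavg: fall back to a single worker.
        _ => return 1,
    };
    num_cpus.saturating_sub(busy).clamp(1, upper)
}
```

On an idle 8-CPU host this sketch returns 5 workers, and on a saturated one it returns 1, so the capture never takes more than roughly half the machine.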

Each internal procfs reader returns Option and degrades gracefully when a file is missing or unreadable: a kernel built without CONFIG_SCHED_INFO (the schedstat file) or CONFIG_TASK_IO_ACCOUNTING (the io file) omits that file, so its reader yields None without failing the rest of the thread. The assembled ThreadState treats None as “absent at capture” via the field type: counters collapse to 0, identity strings collapse to empty, and affinity collapses to an empty vec. A missing reading is therefore indistinguishable from a genuine zero in the serialized output; the capture contract is best-effort and never fails the snapshot. Tests that need stronger guarantees inspect the underlying readers directly (they remain Option-shaped and are unit-tested in this module).
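The collapse from Option-shaped readers to plain fields can be shown in a few lines. `ThreadRow` and `assemble` are hypothetical stand-ins with three fields, one per family named above; the real ThreadState carries many more.

```rust
/// Hypothetical, trimmed-down thread record.
#[derive(Debug, Default, PartialEq, Eq)]
struct ThreadRow {
    comm: String,
    run_ns: u64,
    cpu_affinity: Vec<u32>,
}

/// Fold reader outputs into the record; None becomes the type's default
/// (empty string, 0, empty vec), so assembly itself cannot fail.
fn assemble(
    comm: Option<String>,
    run_ns: Option<u64>,
    cpu_affinity: Option<Vec<u32>>,
) -> ThreadRow {
    ThreadRow {
        comm: comm.unwrap_or_default(),
        run_ns: run_ns.unwrap_or_default(),
        cpu_affinity: cpu_affinity.unwrap_or_default(),
    }
}
```

A reader that returned `Some(0)` and one that returned `None` produce the same row, which is the ambiguity the paragraph above calls out.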

§Privilege

Pulling the jemalloc per-thread TSD counters requires ptrace(PTRACE_SEIZE) against the target. Under kernel.yama.ptrace_scope=0 any same-uid process can attach. Under ptrace_scope=1 (Debian/Ubuntu host default) the tracer must be an ancestor of the target or carry CAP_SYS_PTRACE; ptrace_scope=2 and ptrace_scope=3 raise the bar further. When attach fails, the per-thread allocated_bytes / deallocated_bytes collapse to 0 per the best-effort contract, while the rest of the snapshot still populates from procfs.
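An operator who wants to know in advance whether attach is likely to succeed can read the Yama setting directly. This is a sketch under stated assumptions: the `PtraceScope` enum and both functions are hypothetical and not part of this module; the sysctl path and the four levels are the ones the Yama LSM documents.

```rust
/// The four Yama ptrace_scope levels.
#[derive(Debug, PartialEq, Eq)]
enum PtraceScope {
    Classic,    // 0: any same-uid process can attach
    Restricted, // 1: ancestors of the target, or CAP_SYS_PTRACE
    AdminOnly,  // 2: CAP_SYS_PTRACE required
    NoAttach,   // 3: attach disabled entirely
}

/// Parse the sysctl file body; unknown values yield None.
fn parse_ptrace_scope(text: &str) -> Option<PtraceScope> {
    match text.trim().parse::<u8>().ok()? {
        0 => Some(PtraceScope::Classic),
        1 => Some(PtraceScope::Restricted),
        2 => Some(PtraceScope::AdminOnly),
        3 => Some(PtraceScope::NoAttach),
        _ => None,
    }
}

/// Read the live setting; None when Yama is not built in or not readable.
fn read_ptrace_scope() -> Option<PtraceScope> {
    let text = std::fs::read_to_string("/proc/sys/kernel/yama/ptrace_scope").ok()?;
    parse_ptrace_scope(&text)
}
```

Anything other than `Classic` suggests running the capture with CAP_SYS_PTRACE if the jemalloc counters matter for the comparison.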

Structs§

CgroupCpuStats
CPU controller state for one cgroup. Fields mirror the cpu.* cgroup v2 files exposed under <cgroup>/cpu.stat, <cgroup>/cpu.max, <cgroup>/cpu.weight, and <cgroup>/cpu.weight.nice.
CgroupMemoryStats
Memory controller state for one cgroup. Fields mirror the memory.* cgroup v2 files. stat and events are captured as flat key-value maps so the data model auto-extends when the kernel adds new keys (memory.stat has 71 keys on a recent kernel; the explicit list is scheduler-correctness-relevant but the map preserves regression-detection on lesser-known counters).
CgroupPidsStats
PIDs controller state for one cgroup. Fields mirror the pids.* cgroup v2 files. The pids controller is optional (must be enabled in cgroup.subtree_control); on hosts that don’t enable it, both fields are None.
CgroupStats
Per-cgroup enrichment record attached to CtprofSnapshot.
CtprofParseSummary
Per-snapshot procfs read-failure statistics. Curated projection of the capture pipeline’s internal read-tally — exposes per-file counters and a dominant-failure tag a downstream consumer needs to decide whether the snapshot’s procfs-derived fields (CSW, schedstats, IO, etc.) are trustworthy on a given host without scanning every thread for default values.
CtprofProbeSummary
Per-snapshot probe outcome statistics. Curated projection of the capture pipeline’s internal probe tally: it exposes the counters, the dominant failure tag, and a privilege_dominant boolean, which together let a downstream consumer decide whether the snapshot’s allocated_bytes / deallocated_bytes fields are trustworthy on a given host without parsing the operator-facing tracing line.
CtprofSnapshot
Top-level serialized artifact produced by ktstr ctprof.
Psi
Bundle of PsiResource for the four kernel-exposed resources. Same shape used at both system level (CtprofSnapshot::psi) and per-cgroup (CgroupStats::psi) — the data source differs but the kernel emits the same format and field set in both places.
PsiHalf
One Pressure Stall Information half-line: either the some or full row for one resource. Mirrors the kernel emission format %s avg10=%lu.%02lu avg60=%lu.%02lu avg300=%lu.%02lu total=%llu in psi_show() (kernel/sched/psi.c).
PsiResource
Pressure Stall Information for one resource (cpu / memory / io / irq), bundling the some and full halves.
SchedExtSysfs
Global sched_ext sysfs state, captured from /sys/kernel/sched_ext/. The kernel registers exactly five global attributes via scx_global_attrs[] (kernel/sched/ext.c); this struct mirrors them 1-to-1.
ThreadState
Per-thread resource profile.
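As an illustration of the line format that PsiHalf mirrors, the sketch below parses one row such as `some avg10=0.00 avg60=0.00 avg300=0.00 total=0`. `PsiRow` and `parse_psi_row` are hypothetical names, not the module's types; the `total_us` field name reflects that the kernel reports the total in microseconds.

```rust
/// Hypothetical parsed form of one PSI row.
#[derive(Debug, PartialEq)]
struct PsiRow {
    kind: String, // "some" or "full"
    avg10: f64,
    avg60: f64,
    avg300: f64,
    total_us: u64,
}

/// Parse one PSI line; a malformed or incomplete line yields None.
fn parse_psi_row(line: &str) -> Option<PsiRow> {
    let mut fields = line.split_whitespace();
    let kind = fields.next()?.to_string();
    let mut avg10: Option<f64> = None;
    let mut avg60: Option<f64> = None;
    let mut avg300: Option<f64> = None;
    let mut total: Option<u64> = None;
    for field in fields {
        let (key, value) = field.split_once('=')?;
        match key {
            "avg10" => avg10 = value.parse().ok(),
            "avg60" => avg60 = value.parse().ok(),
            "avg300" => avg300 = value.parse().ok(),
            "total" => total = value.parse().ok(),
            _ => {} // tolerate keys this sketch does not know about
        }
    }
    Some(PsiRow {
        kind,
        avg10: avg10?,
        avg60: avg60?,
        avg300: avg300?,
        total_us: total?,
    })
}
```

Because the system-level and per-cgroup pressure files share this format, one parser of this shape could serve both sources.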

Functions§

capture
Capture a complete host-wide snapshot against the default procfs and cgroup roots (/proc and /sys/fs/cgroup). Probes every jemalloc-linked tgid the walk reaches and populates per-thread allocated_bytes / deallocated_bytes from the jemalloc TSD counters; tgids the probe cannot attach against (ptrace denied, not jemalloc-linked, stripped binary) land their threads at the absent-counter default of 0 per the best-effort capture contract.
capture_pid
Capture a ctprof snapshot scoped to a single tgid.
capture_to
Capture a snapshot and write it to path in the canonical zstd+JSON format. Wrapper over capture + CtprofSnapshot::write so CLI code can stay a single call.