ktstr/workload/mod.rs
//! Worker process management and telemetry.
//!
//! Workers are `fork()`ed processes by default ([`CloneMode::Fork`],
//! the `#[default]`) so each can be placed in its own cgroup;
//! [`CloneMode::Thread`] instead uses [`std::thread::spawn`], so those
//! workers share the parent's `tgid`, address space, and signal-handler
//! table. Key types:
//! - [`WorkType`] -- what each worker does
//! - [`WorkloadConfig`] -- spawn configuration (count, affinity, work type, policy)
//! - [`WorkloadHandle`] -- RAII handle to spawned workers
//! - [`WorkerReport`] -- per-worker telemetry collected after stop
//! - [`AffinityIntent`] -- per-worker affinity intent (Inherit, LlcAligned, Exact, etc.)
//! - [`ResolvedAffinity`] -- resolved CPU affinity for workers
//! - [`WorkSpec`] -- workload definition for a single group of workers within a cgroup
//! - [`WorkPhase`] -- a single phase in a [`WorkType::Sequence`] compound work pattern
//! - [`SchedPolicy`] -- Linux scheduling policy for a worker process
//! - [`MemPolicy`] -- NUMA memory placement policy for worker processes
//!
//! See the [WorkSpec Types](https://ktstr.dev/guide/concepts/work-types.html)
//! and [Worker Processes](https://ktstr.dev/guide/architecture/workers.html)
//! chapters of the guide.
//!
//! # Module layout
//!
//! - `affinity` — [`AffinityIntent`] / [`ResolvedAffinity`] +
//!   the resolver and `sched_setaffinity` wrapper.
//! - `config` — declarative test-author input
//!   ([`WorkloadConfig`], [`WorkSpec`], [`SchedPolicy`],
//!   [`MemPolicy`], [`MpolFlags`], [`CloneMode`],
//!   [`FutexLockMode`], [`WakeMechanism`], [`AluWidth`]) and
//!   the `humantime_serde_helper` shared by every `Duration`
//!   field.
//! - `types` — [`WorkType`] / [`WorkPhase`] /
//!   [`WorkTypeValidationError`] and the WorkType naming
//!   surface (`from_name`, `suggest`, `ALL_NAMES`).
//! - `spawn` — runtime spawn pipeline: [`WorkloadHandle`],
//!   `SpawnGuard`, [`Migration`], [`WorkerReport`],
//!   [`WorkerExitInfo`], `build_nodemask`,
//!   `apply_mempolicy_with_flags`, `apply_nice`. Tests are
//!   co-located in `spawn/tests_*.rs` siblings with shared
//!   fixtures in `spawn/testing.rs`.
//! - `worker` — `worker_main` and the per-WorkType bodies.
//!   `worker/io.rs` holds the IO-backing RAII wrappers and
//!   `worker/sched.rs` holds the scheduler/clock/metric
//!   helpers (incl. `set_sched_policy`).
//!
//! # Naming conventions
//!
//! ## "Intent" vs "Resolved" naming
//!
//! Types named with an `Intent` suffix carry **test-author intent**
//! (the input to the workload pipeline). Types named with a
//! `Resolved` prefix carry **runtime-resolved configuration** (the
//! output of intent + topology + cgroup state). [`AffinityIntent`]
//! resolves to [`ResolvedAffinity`] at spawn time via
//! [`resolve_affinity_for_cgroup`](crate::scenario::resolve_affinity_for_cgroup).
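//!
//! A minimal sketch of the two-stage shape, not this crate's code:
//! `SketchIntent`, `SketchResolved`, and `resolve_sketch` are
//! hypothetical names, topology is ignored, and the rule that `Exact`
//! is intersected with the cgroup's CPUs is an assumption made purely
//! for illustration.
//!
//! ```rust
//! use std::collections::BTreeSet;
//!
//! /// Hypothetical `Intent`-style type: what the test author asks for.
//! enum SketchIntent {
//!     /// Use whatever CPUs the cgroup allows.
//!     Inherit,
//!     /// Ask for a specific CPU set.
//!     Exact(BTreeSet<usize>),
//! }
//!
//! /// Hypothetical `Resolved`-style type: concrete CPUs at spawn time.
//! #[derive(Debug, PartialEq)]
//! struct SketchResolved {
//!     cpus: BTreeSet<usize>,
//! }
//!
//! /// Combine author intent with runtime cgroup state.
//! fn resolve_sketch(intent: &SketchIntent, cgroup_cpus: &BTreeSet<usize>) -> SketchResolved {
//!     let cpus = match intent {
//!         SketchIntent::Inherit => cgroup_cpus.clone(),
//!         // Assumption: requested CPUs outside the cgroup are dropped.
//!         SketchIntent::Exact(want) => want.intersection(cgroup_cpus).copied().collect(),
//!     };
//!     SketchResolved { cpus }
//! }
//! ```
//!
//! The point is only the direction of data flow: the `Intent` value is
//! written once by the test author, and the `Resolved` value is derived
//! from it each time runtime state is known.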
//!
//! [`CloneMode`] takes neither affix: the test author writes
//! `CloneMode::Fork` / `CloneMode::Thread` directly and the value is
//! used as written (no resolution layer). The `Mode` suffix denotes a
//! single kernel-facing dispatch decision rather than a two-stage
//! intent/resolved pipeline.
//!
//! [`SchedClass`] and [`SchedPolicy`] follow the same coarse-intent /
//! concrete-runtime split using legacy kernel terminology rather
//! than the `Intent`/`Resolved` naming — see [`SchedClass`] for
//! the per-class mapping.
//!
//! ## "Churn" vs "Sweep" suffixes on [`WorkType`] variants
//!
//! Variants whose names end in `Churn` cycle their target setting at
//! high frequency to exercise the kernel's per-task state machines
//! under rapid transitions. [`WorkType::AffinityChurn`] samples a
//! random CPU from the effective cpuset on every iteration
//! (`rand::rng().random_range`); [`WorkType::PageFaultChurn`] touches
//! a fresh random subset of pages each cycle (xorshift64). Most Churn
//! variants pick each value randomly and independently of the
//! previous one; [`WorkType::PolicyChurn`] is the exception — despite
//! the `Churn` name it cycles through the supported scheduling
//! policies in a fixed, ordered sequence keyed on the iteration
//! counter (`iterations % policies.len()`).
//!
//! Variants whose names end in `Sweep` rotate their target setting
//! through an **ordered list or range** — the next value is a
//! deterministic function of the iteration counter, not a random
//! pick. [`WorkType::NiceSweep`] cycles nice values from
//! `effective_min..=19` modulo the range size;
//! [`WorkType::NumaWorkingSetSweep`] rotates the working-set
//! binding through `target_nodes` in declaration order. The
//! intent is to walk a phase space evenly so every value gets
//! comparable observation time, rather than producing the
//! unbiased-random transitions Churn produces.
//!
//! Choose `Churn` when the workload's value is its
//! transition-frequency entropy; choose `Sweep` when the workload
//! must visit every phase deterministically.
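//!
//! A minimal sketch of the two selection styles, not this crate's
//! implementation: `sweep_nice`, `xorshift64`, and `churn_pick` are
//! hypothetical helpers, and xorshift64 stands in for the random source
//! so the example needs only `std` (the text above says
//! `AffinityChurn` itself uses `rand`). It assumes
//! `effective_min <= 19` and a non-empty CPU slice.
//!
//! ```rust
//! /// Sweep: the nice value is a pure function of the iteration counter,
//! /// walking `effective_min..=19` and wrapping modulo the range size.
//! fn sweep_nice(effective_min: i32, iteration: u64) -> i32 {
//!     let span = (19 - effective_min + 1) as u64;
//!     effective_min + (iteration % span) as i32
//! }
//!
//! /// One xorshift64 step; `state` must be seeded non-zero.
//! fn xorshift64(state: &mut u64) -> u64 {
//!     let mut x = *state;
//!     x ^= x << 13;
//!     x ^= x >> 7;
//!     x ^= x << 17;
//!     *state = x;
//!     x
//! }
//!
//! /// Churn: each pick is drawn from the PRNG, not from the counter.
//! fn churn_pick(cpus: &[usize], state: &mut u64) -> usize {
//!     cpus[(xorshift64(state) % cpus.len() as u64) as usize]
//! }
//! ```
//!
//! With `effective_min = -20` the sweep visits all 40 nice values once
//! before repeating, whereas successive `churn_pick` calls may repeat or
//! skip CPUs.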

mod affinity;
mod config;
pub(crate) mod schbench;
mod spawn;
pub(crate) mod taobench;
mod types;
mod worker;

pub use affinity::*;
pub use config::*;
// `spawn` uses an itemised re-export rather than `pub use spawn::*`
// because the submodule contains internal helpers (`SpawnGuard`,
// `STOP`, `apply_mempolicy_with_flags`, …) that should stay
// crate-internal. Only the test-author-visible surface is
// re-exported here. `WorkerReportClaim` is the proc-macro-
// generated companion to `WorkerReport` (see the `crate::Claim`
// derive on the `WorkerReport` struct).
pub use spawn::{
    Migration, PhaseSlice, WorkerExitInfo, WorkerReport, WorkerReportClaim, WorkloadHandle,
};
// Crate-internal re-export of the wake-latency reservoir cap + the
// Algorithm-R push so the per-phase per-cgroup carrier builder
// (`crate::assert::phase_cgroup_stats`) re-caps the POOLED samples at the
// same bound the per-worker path uses (the carrier concatenates every
// worker's vec, so the pool must be re-capped before it crosses the
// size-limited guest bulk port). The `worker` module itself stays private.
pub(crate) use worker::{MAX_WAKE_SAMPLES, reservoir_push};
// `build_nodemask` is the low-level `set_mempolicy(2)` / `mbind(2)`
// nodemask builder. It's deliberately NOT in the public surface —
// test authors express NUMA placement through the [`MemPolicy`]
// enum — but `crate::vmm::host_topology` invokes `mbind(2)` directly to
// bind guest memory regions to host NUMA nodes
// (`crate::vmm::numa_mem`'s `mbind_regions` via
// `host_topology::mbind_to_nodes`) and needs an in-crate path to
// the helper.
pub(crate) use spawn::build_nodemask;
pub use types::*;
// schbench_rs is otherwise a private submodule. Its user-facing config type is
// re-exported (the WorkType::Schbench variant carries it), as are the standalone
// host-side validation entry point and its report type / percentile labels, plus
// the pipe-mode (`-p`) throughput helper (all used by the gated
// `ktstr-schbench-validate` bin for the side-by-side comparison against the
// reference schbench).
pub use schbench::{
    PipeTransferReport, SCHBENCH_PERCENTILES, SchbenchConfig, StandaloneReport,
    pipe_transfer_report, run_standalone,
};
// taobench_rs is a private submodule like schbench_rs above. Its user-facing
// config type + the host-side validation runner/report are re-exported here (the
// WorkType::Taobench variant carries the config; the runner backs the
// ktstr-taobench-validate driver). `run_standalone` is aliased to avoid colliding
// with schbench's flat `run_standalone` re-export above.
pub use taobench::{
    TaobenchConfig, TaobenchStandaloneReport, TaobenchStats,
    run_standalone as taobench_run_standalone,
};

// `FanOutCompute` stores its u64 generation counter at offset 0 of
// a 16-byte shared region. The futex syscall reads the raw u32 at
// `futex_ptr`, so the code relies on the low-order 4 bytes of the
// counter living at offset 0. That layout assumption holds on
// little-endian targets (x86_64, aarch64) and flips on big-endian:
// there the futex would read the high 32 bits instead, and an
// increment of the u64 would leave the 4 bytes at offset 0
// unchanged until the 2^32-th advance. Reject the big-endian build
// at compile time rather than shipping a silently-broken binary.
#[cfg(not(target_endian = "little"))]
compile_error!(
    "ktstr's FanOutCompute generation-counter layout assumes a \
     little-endian target — the u64 counter at offset 0 of the \
     shared futex region must expose its low 32 bits to the \
     futex syscall at that same offset. Porting to a big-endian \
     target requires reworking the layout so futex_wait sees the \
     incrementing low 4 bytes."
);