//! Diagnostic snapshot capture and traversal.
//!
//! Test scenarios use [`Op::CaptureSnapshot`](crate::scenario::ops::Op::CaptureSnapshot)
//! to request a host-side diagnostic capture mid-run. The capture
//! result, a `crate::monitor::dump::FailureDumpReport`, is keyed by the `name` argument
//! and stored on the scenario's [`SnapshotBridge`], where downstream
//! test code reaches it via [`Snapshot`] for typed traversal of
//! BTF-rendered map values, per-CPU entries, and scalar variables.
//!
//! # Lifecycle
//!
//! 1. **Wire-up.** Before [`execute_steps`](crate::scenario::ops::execute_steps)
//!    runs, host orchestration installs a [`SnapshotBridge`] in the
//!    current thread via [`SnapshotBridge::set_thread_local`]. The
//!    bridge owns the storage map and a callable that performs the
//!    capture.
//!
//! 2. **Capture.** When the executor reaches `Op::CaptureSnapshot { name }`,
//!    it invokes [`SnapshotBridge::capture`] with the name. The
//!    closure performs the freeze rendezvous (request/reply with
//!    the freeze coordinator), builds a `crate::monitor::dump::FailureDumpReport`, and
//!    returns it; the bridge stores it under the name.
//!
//! 3. **Inspection.** After the scenario completes, the test author
//!    pulls captured reports out via [`SnapshotBridge::drain`] and
//!    constructs [`Snapshot`] views to assert against rendered
//!    values:
//!    `snapshot.var("nr_cpus_onln").as_u64()? > 0`,
//!    `snapshot.map("scx_per_task")?.find(|e| e.get("tid").as_i64().map_or(false, |t| t == pid))`.
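//!
//! The three steps above can be sketched with stand-in types. This is a
//! minimal illustration under stated assumptions, not the crate's
//! implementation: `SketchBridge`, `SketchReport`, and `run_lifecycle` are
//! hypothetical names, and the real bridge is installed thread-locally
//! rather than held in a local variable.
//!
//! ```rust
//! use std::collections::BTreeMap;
//!
//! // Hypothetical stand-in for `FailureDumpReport`.
//! #[derive(Debug, Clone, PartialEq)]
//! struct SketchReport {
//!     nr_cpus_onln: u64,
//! }
//!
//! // Hypothetical bridge: owns the storage map and the capture callable.
//! struct SketchBridge {
//!     capture_fn: Box<dyn FnMut(&str) -> Result<SketchReport, String>>,
//!     stored: BTreeMap<String, SketchReport>,
//! }
//!
//! impl SketchBridge {
//!     fn new(capture_fn: impl FnMut(&str) -> Result<SketchReport, String> + 'static) -> Self {
//!         Self { capture_fn: Box::new(capture_fn), stored: BTreeMap::new() }
//!     }
//!
//!     // Step 2: run the callable and store its report under `name`.
//!     fn capture(&mut self, name: &str) -> Result<(), String> {
//!         let report = (self.capture_fn)(name)?;
//!         self.stored.insert(name.to_string(), report);
//!         Ok(())
//!     }
//!
//!     // Step 3: hand every stored report to the caller, emptying the bridge.
//!     fn drain(&mut self) -> BTreeMap<String, SketchReport> {
//!         std::mem::take(&mut self.stored)
//!     }
//! }
//!
//! fn run_lifecycle() -> BTreeMap<String, SketchReport> {
//!     // Step 1: wire-up, here with a callable that fabricates a report.
//!     let mut bridge = SketchBridge::new(|_name| Ok(SketchReport { nr_cpus_onln: 4 }));
//!     bridge.capture("after_spawn").expect("capture failed");
//!     bridge.drain()
//! }
//!
//! fn main() {
//!     let drained = run_lifecycle();
//!     assert!(drained["after_spawn"].nr_cpus_onln > 0);
//! }
//! ```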
//!
//! # On-demand vs error-trigger captures
//!
//! `Op::CaptureSnapshot` requests are orthogonal to the error-class freeze
//! path. The freeze coordinator's existing state machine for
//! `SCX_EXIT_ERROR` triggers (Idle → TookEarly → Done) governs the
//! *unsolicited* capture pipeline; on-demand captures funnel
//! through a separate request/reply channel and never touch the
//! error-trigger state. The coordinator services on-demand requests
//! even after Done so post-failure scenarios can still snapshot
//! state for context. Captures are serialised: at most one is in
//! flight at a time. The on-demand path waits for the previous
//! capture's vCPUs to fully return to `parked == false` before
//! issuing the next freeze request, mirroring the rendezvous
//! invariants the error-trigger path already obeys.
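//!
//! The wait-before-next-freeze rule can be illustrated with a small gate
//! built on `std::sync::Condvar`. This is only a sketch of the idea:
//! `ParkGate` and `demo_serialised_capture` are hypothetical names, and
//! the coordinator's actual synchronisation primitives are not shown in
//! this module.
//!
//! ```rust
//! use std::sync::{Arc, Condvar, Mutex};
//! use std::thread;
//!
//! // Hypothetical gate tracking one `parked` flag per vCPU.
//! struct ParkGate {
//!     parked: Mutex<Vec<bool>>,
//!     changed: Condvar,
//! }
//!
//! impl ParkGate {
//!     fn new(vcpus: usize) -> Self {
//!         Self { parked: Mutex::new(vec![false; vcpus]), changed: Condvar::new() }
//!     }
//!
//!     fn set_parked(&self, cpu: usize, parked: bool) {
//!         let mut flags = self.parked.lock().unwrap();
//!         flags[cpu] = parked;
//!         drop(flags);
//!         self.changed.notify_all();
//!     }
//!
//!     // Block until every vCPU reports `parked == false`.
//!     fn wait_all_unparked(&self) {
//!         let flags = self.parked.lock().unwrap();
//!         let _flags = self
//!             .changed
//!             .wait_while(flags, |f| f.iter().any(|parked| *parked))
//!             .unwrap();
//!     }
//! }
//!
//! fn demo_serialised_capture() -> bool {
//!     let gate = Arc::new(ParkGate::new(2));
//!     // Previous capture: both vCPUs are still parked.
//!     gate.set_parked(0, true);
//!     gate.set_parked(1, true);
//!     // The vCPUs resume in the background.
//!     let resumer = {
//!         let gate = Arc::clone(&gate);
//!         thread::spawn(move || {
//!             gate.set_parked(0, false);
//!             gate.set_parked(1, false);
//!         })
//!     };
//!     // The next freeze request would be issued only after this returns.
//!     gate.wait_all_unparked();
//!     resumer.join().unwrap();
//!     let all_running = gate.parked.lock().unwrap().iter().all(|p| !*p);
//!     all_running
//! }
//!
//! fn main() {
//!     assert!(demo_serialised_capture());
//! }
//! ```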
//!
//! # Guest → host wire: virtio-console port-1 TLV request/reply
//!
//! The guest-driven capture trigger rides the virtio-console bulk
//! port (`/dev/vport0p1`), not an ioeventfd/MMIO doorbell.
//!
//! 1. The guest [`Op::CaptureSnapshot`](crate::scenario::ops::Op::CaptureSnapshot)
//!    handler calls
//!    `crate::vmm::guest_comms::request_snapshot` with
//!    `crate::vmm::wire::SNAPSHOT_KIND_CAPTURE`, the capture
//!    `name` as the tag, and a timeout. `request_snapshot`
//!    allocates a per-request `request_id`, builds a
//!    `SnapshotRequestPayload { request_id, kind, tag }`, and
//!    sends it as a TLV frame over the port-1 TX writer.
//! 2. The host freeze coordinator services the request, builds the
//!    `crate::monitor::dump::FailureDumpReport`, and stores it on
//!    its [`SnapshotBridge`] keyed by the tag.
//! 3. `request_snapshot` blocks reading TLV reply frames from the
//!    same `O_RDWR` fd until it observes one whose payload
//!    `request_id` matches, then returns a
//!    `crate::vmm::wire::SnapshotRequestResult`.
//!
//! The guest
//! [`Op::WatchSnapshot`](crate::scenario::ops::Op::WatchSnapshot)
//! registration uses the same port-1 stream with
//! `crate::vmm::wire::SNAPSHOT_KIND_WATCH`.
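//!
//! The request/reply matching described in this section can be sketched
//! as follows. The byte layout here is an assumption made for
//! illustration (little-endian `u32` type, `u32` length, then payload,
//! with `request_id` as the first four payload bytes); the real encoding
//! lives in `crate::vmm::wire` and may differ. `SKETCH_TYPE_REQUEST`,
//! `SKETCH_TYPE_REPLY`, and all function names below are hypothetical.
//!
//! ```rust
//! // Hypothetical frame type tags.
//! const SKETCH_TYPE_REQUEST: u32 = 1;
//! const SKETCH_TYPE_REPLY: u32 = 2;
//!
//! // Wrap a payload in an assumed type/length header.
//! fn encode_frame(frame_type: u32, payload: &[u8]) -> Vec<u8> {
//!     let mut out = Vec::with_capacity(8 + payload.len());
//!     out.extend_from_slice(&frame_type.to_le_bytes());
//!     out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
//!     out.extend_from_slice(payload);
//!     out
//! }
//!
//! // Assumed request payload: request_id, kind byte, then the tag bytes.
//! fn request_payload(request_id: u32, kind: u8, tag: &str) -> Vec<u8> {
//!     let mut payload = request_id.to_le_bytes().to_vec();
//!     payload.push(kind);
//!     payload.extend_from_slice(tag.as_bytes());
//!     payload
//! }
//!
//! // Split a byte stream into (type, payload) frames; stops at a truncated frame.
//! fn decode_frames(mut buf: &[u8]) -> Vec<(u32, &[u8])> {
//!     let mut frames = Vec::new();
//!     while buf.len() >= 8 {
//!         let frame_type = u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]);
//!         let len = u32::from_le_bytes([buf[4], buf[5], buf[6], buf[7]]) as usize;
//!         if buf.len() < 8 + len {
//!             break;
//!         }
//!         frames.push((frame_type, &buf[8..8 + len]));
//!         buf = &buf[8 + len..];
//!     }
//!     frames
//! }
//!
//! // Return the first reply payload whose leading request_id matches.
//! fn find_reply(stream: &[u8], request_id: u32) -> Option<&[u8]> {
//!     decode_frames(stream).into_iter().find_map(|(frame_type, payload)| {
//!         let id = payload.get(..4)?;
//!         let id = u32::from_le_bytes([id[0], id[1], id[2], id[3]]);
//!         if frame_type == SKETCH_TYPE_REPLY && id == request_id {
//!             Some(payload)
//!         } else {
//!             None
//!         }
//!     })
//! }
//!
//! fn main() {
//!     let request = encode_frame(SKETCH_TYPE_REQUEST, &request_payload(7, 0, "after_spawn"));
//!     assert_eq!(request.len(), 8 + 4 + 1 + "after_spawn".len());
//!
//!     // Reply stream: a stale reply for request 6, then the awaited reply for 7.
//!     let mut stream = encode_frame(SKETCH_TYPE_REPLY, &6u32.to_le_bytes());
//!     stream.extend(encode_frame(SKETCH_TYPE_REPLY, &[7, 0, 0, 0, 1]));
//!     assert_eq!(find_reply(&stream, 7), Some(&[7u8, 0, 0, 0, 1][..]));
//! }
//! ```
//!
//! Skipping frames whose `request_id` does not match is what lets a
//! single shared stream carry replies for more than one outstanding
//! request kind.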
//!
//! # No-bridge path
//!
//! When `Op::CaptureSnapshot` runs with no installed bridge, the op
//! fails loudly rather than skipping (per the no-silent-drops
//! policy): in-guest it routes through the port-1 transport and
//! `bail`s on a transport failure (including a latched-dead
//! transport); in `host_only` mode with no test-fixture bridge it
//! `bail`s with a "not supported in host_only mode" error.
//!
//! # Field accessor traversal
//!
//! [`SnapshotMap`], [`SnapshotEntry`], and [`SnapshotField`] form a
//! lazy borrow chain over the report. Dotted-path lookups (e.g.
//! `entry.get("ctx.weight.value")`) walk
//! `RenderedValue::Struct` members by name and follow
//! `RenderedValue::Ptr` dereferences transparently, so the test
//! author writes the dotted path the BTF source would suggest and
//! pointer chasing stays invisible.
//!
//! Missing fields land in [`SnapshotField::Missing`] with an
//! actionable error string identifying the path component that
//! could not be resolved AND the available alternatives at that
//! level. Terminal accessors (`as_u64`, `as_i64`, `as_bool`,
//! `as_str`) return `Result<T, SnapshotError>` so an absent /
//! type-mismatched field bubbles up as a recoverable error rather
//! than panicking.
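//!
//! A reduced model of the dotted-path walk may help make the behaviour
//! concrete. The sketch below is an assumption-laden illustration, not
//! the crate's code: `SketchValue` is a hypothetical three-variant
//! stand-in for `RenderedValue`, `walk` is a hypothetical stand-in for
//! the real traversal, and the error is a plain `String` rather than
//! [`SnapshotError`].
//!
//! ```rust
//! use std::collections::BTreeMap;
//!
//! // Hypothetical reduction of `RenderedValue`.
//! #[derive(Debug, PartialEq)]
//! enum SketchValue {
//!     Uint(u64),
//!     Struct(BTreeMap<String, SketchValue>),
//!     Ptr(Box<SketchValue>),
//! }
//!
//! // Follow pointer dereferences until a non-pointer value is reached.
//! fn deref(mut value: &SketchValue) -> &SketchValue {
//!     while let SketchValue::Ptr(inner) = value {
//!         value = &**inner;
//!     }
//!     value
//! }
//!
//! // Walk `path` one component at a time, dereferencing pointers on the way.
//! // On failure, name the unresolved component and list the alternatives.
//! fn walk<'a>(root: &'a SketchValue, path: &str) -> Result<&'a SketchValue, String> {
//!     let mut cur = deref(root);
//!     for part in path.split('.') {
//!         let members = match cur {
//!             SketchValue::Struct(members) => members,
//!             _ => return Err(format!("`{part}`: parent is not a struct")),
//!         };
//!         cur = match members.get(part) {
//!             Some(value) => deref(value),
//!             None => {
//!                 let available: Vec<&str> = members.keys().map(String::as_str).collect();
//!                 return Err(format!("no member `{part}`; available: {}", available.join(", ")));
//!             }
//!         };
//!     }
//!     Ok(cur)
//! }
//!
//! // Build `{ ctx: { weight: -> { value: 100 } }, tid: 42 }`.
//! fn sample_entry() -> SketchValue {
//!     let weight = SketchValue::Struct(BTreeMap::from([
//!         ("value".to_string(), SketchValue::Uint(100)),
//!     ]));
//!     let ctx = SketchValue::Struct(BTreeMap::from([
//!         ("weight".to_string(), SketchValue::Ptr(Box::new(weight))),
//!     ]));
//!     SketchValue::Struct(BTreeMap::from([
//!         ("ctx".to_string(), ctx),
//!         ("tid".to_string(), SketchValue::Uint(42)),
//!     ]))
//! }
//!
//! fn main() {
//!     let entry = sample_entry();
//!     // The pointer under `weight` is crossed without appearing in the path.
//!     assert_eq!(walk(&entry, "ctx.weight.value"), Ok(&SketchValue::Uint(100)));
//!     // A misspelt component reports what was available at that level.
//!     let err = walk(&entry, "ctx.wieght").unwrap_err();
//!     assert!(err.contains("available: weight"));
//! }
//! ```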
//!
//! # Cross-surface accessor vocabulary
//!
//! [`SnapshotField`], [`JsonField`], and
//! `crate::monitor::btf_render::RenderedValue` share a uniform
//! method vocabulary so a test author moves between the
//! BTF-rendered (BPF maps + globals), JSON-rendered (scheduler
//! stats), and raw-tree surfaces without re-learning syntax:
//!
//! | Method | What it does |
//! |--------|--------------|
//! | `.as_u64()` / `.as_i64()` / `.as_f64()` / `.as_bool()` | Typed scalar extract. |
//! | `.as_str()` | UTF-8 string extract (SnapshotField / JsonField only; Enum variant / JSON string). |
//! | `.as_u64_array()` / `.as_u32_array()` / `.as_i64_array()` / `.as_f64_array()` / `.as_bool_array()` | Element-typed array extract. |
//! | `.get(path)` | Dotted-path walk (`"a.b.c"`); returns a typed sub-view. |
//! | `.member(name)` | Single-step struct-member walk (RenderedValue only; no dots). |
//! | `.index(i)` | Array element by 0-indexed position (RenderedValue only). |
//! | `.raw()` | Drop into the wrapper's underlying value for raw Option-returning navigation (RenderedValue for SnapshotField, serde_json::Value for JsonField). |
//!
//! The wrapper types ([`SnapshotField`], [`JsonField`]) return
//! `Result` with rich [`SnapshotError`] context; the raw
//! `RenderedValue` layer returns `Option` (the caller has already
//! pattern-matched into a known variant, so absence is a
//! programming-error class handled locally). Convert between
//! layers with `SnapshotField::raw()`.
//!
//! For multi-scheduler scenarios (after
//! [`crate::scenario::ops::Op::ReplaceScheduler`] or two
//! [`crate::scenario::ops::Op::AttachScheduler`] calls), use
//! [`Snapshot::active`] to project the view to the currently
//! attached scheduler's maps and chain the standard accessors
//! against it. [`Snapshot::live_var`] is the shorthand for
//! `self.active()?.var(name)`; [`Snapshot::vars`] iterates every
//! captured copy when the framework cannot determine "active"
//! automatically.

/// Maximum number of rendered keys captured into
/// [`SnapshotError::NoMatch::available_keys`] during a failed
/// `find` / `max_by` traversal. Three is a balance between
/// disambiguation power (enough to suggest the keyspace shape) and
/// failure-message readability (does not overrun a terminal line).
pub(super) const NO_MATCH_KEY_SAMPLE: usize = 3;

/// Maximum number of characters each rendered key in
/// [`SnapshotError::NoMatch::available_keys`] retains before being
/// truncated with a trailing `…`. Wide struct keys (e.g. a
/// 50-field `task_ctx`) would otherwise produce kilobytes of
/// failure text per sampled key.
pub(super) const NO_MATCH_KEY_CHAR_CAP: usize = 80;

/// Discriminator that `render_entry_key`'s fallback path prepends
/// to the raw `key_hex` bytes when an entry's BTF-rendered key was
/// missing at capture time. [`SnapshotError::NoMatch`]'s `Display`
/// impl uses the same prefix as the gate for its BTF-missing hint
/// (when every sampled key starts with this string, BTF was
/// uniformly absent for the map's key type and the hint points the
/// operator at `CONFIG_DEBUG_INFO_BTF=y`). Naming the producer +
/// consumer contract once here keeps a future rename of one side
/// from silently desynchronising the other. Test sites in this
/// module intentionally retain the literal `"hex:"` so they pin the
/// value separately from the const that synchronises production.
pub(super) const HEX_KEY_PREFIX: &str = "hex:";

mod error;

pub use error::{
    DrainedSnapshotEntry, ExcludedMap, MissingStatsReason, SnapshotError, SnapshotResult,
};

pub mod bridge;

pub use bridge::{
    BridgeGuard, CaptureCallback, CgroupProcsSnapshot, KernelOpCallback, MAX_STORED_EVENTS,
    MAX_STORED_SNAPSHOTS, MAX_WATCH_SNAPSHOTS, SnapshotBridge, SnapshotBridgeEvent,
    WatchRegisterCallback, with_active_bridge,
};

mod entry;
mod field;
mod json;
pub mod pickers;
mod view;

pub use entry::SnapshotEntry;
pub use field::SnapshotField;
pub(crate) use field::walk_dotted_path;
pub use json::{JsonField, stats_path};
pub use view::{Snapshot, SnapshotMap};

// ---------------------------------------------------------------------------
// Snapshot view over a captured FailureDumpReport
// ---------------------------------------------------------------------------

// ---------------------------------------------------------------------------
// SnapshotEntry
// ---------------------------------------------------------------------------

#[cfg(test)]
mod tests;