ktstr/flock/
mod.rs

//! Advisory flock(2) primitives shared across every ktstr lock file.
//!
//! ktstr uses advisory `flock(2)` in four places:
//!
//!  - LLC reservation locks at `{lock_dir}/ktstr-llc-{N}.lock` and
//!    per-CPU locks at `{lock_dir}/ktstr-cpu-{C}.lock` where
//!    `lock_dir` is resolved by `crate::cache::resolve_lock_dir`
//!    (`KTSTR_LOCK_DIR` env var, fallback `/tmp`). See
//!    `crate::vmm::host_topology::acquire_resource_locks` and
//!    friends.
//!  - Per-cache-entry coordination locks at
//!    `{cache_root}/.locks/{cache_key}.lock` (see
//!    `crate::cache::CacheDir::acquire_shared_lock` and friends).
//!  - Per-source-tree build locks at
//!    `{cache_root}/.locks/source-{path_hash}.lock` (see
//!    `crate::cli::kernel_build::build::acquire_source_tree_lock`),
//!    which serialize concurrent `make` invocations against the same
//!    kernel source checkout.
//!  - Observational enumeration from `ktstr locks --json`: a
//!    read-only scan that does NOT acquire flocks. It reads
//!    `/proc/locks` through `read_holders` to attribute holders
//!    without contending with active acquirers.
//!
//! All four share the following (the first two items apply only to
//! the paths that actually acquire a lock):
//!  - A non-blocking `LOCK_NB` attempt (the cache-entry path wraps
//!    this in a poll loop for timed-wait semantics).
//!  - `O_CLOEXEC` on every open, so the kernel's "release flock when
//!    the last fd referring to the open file description (OFD)
//!    closes" invariant matches what `OwnedFd::drop` does. An fd
//!    leaked across `exec(2)` would keep the lock alive in the child
//!    and fool the next acquirer's `/proc/locks` scan into naming
//!    the wrong pid.
//!  - `/proc/locks` parsing keyed on the mount-point-derived
//!    `{major:02x}:{minor:02x}:{inode}` triple, resolved via
//!    `/proc/self/mountinfo` (not `stat().st_dev`; see below).
//!  - [`HolderInfo`] with `pid` + truncated `/proc/{pid}/cmdline` for
//!    actionable error messages.
//!
//! # Module layout
//!
//! Each submodule owns a single, cohesive subsystem:
//!
//!  - `fs_filter`: refuses to operate on filesystems where
//!    `flock(2)` is unreliable (NFS, CIFS/SMB, CephFS, AFS, FUSE).
//!  - `primitives`: the kernel-syscall wrappers
//!    ([`try_flock`] / [`block_flock`] / `materialize`) that open a
//!    lockfile and request a flock operation.
//!  - `mountinfo`: `/proc/self/mountinfo` parser and the
//!    `{major:02x}:{minor:02x}:{inode}` needle derivation that
//!    `proc_locks` keys off.
//!  - `proc_locks`: `/proc/locks` scanner that enumerates the
//!    PIDs holding a given lockfile's flock.
//!  - `holder`: converts a PID into a
//!    [`HolderInfo`] (reads `/proc/{pid}/cmdline`) and renders a
//!    `&[HolderInfo]` into a multi-line operator-facing string.
//!  - `acquire`: high-level poll-with-timeout helper that wraps
//!    `primitives::try_flock` in a deadline loop and decorates
//!    timeout errors with the holder list from `proc_locks` and
//!    `holder`.
//!
//! # Why mountinfo, not `stat().st_dev`
//!
//! `/proc/locks` emits `i_sb->s_dev` for each held flock: the
//! filesystem's superblock device id. For most filesystems that
//! matches `stat().st_dev`, but on btrfs, overlayfs, and bind-mounts
//! the kernel installs a custom `getattr` implementation that returns
//! an anonymous device id (`anon_dev`) distinct from `s_dev`. That
//! divergence means a stat-derived needle would never match the
//! `/proc/locks` line. A naive `read_holders` would return an empty
//! holder list on every btrfs-backed `/tmp`, every overlay-rootfs
//! container, and every bind-mounted `/tmp`, which is a silent
//! correctness failure for `--cpu-cap` contention diagnostics and
//! the `ktstr locks` observational command.
//!
//! Needle production: `mountinfo::needle_from_path` resolves `path`
//! to the mount point covering it via `/proc/self/mountinfo`
//! (longest-prefix match on the `mount_point` field), then reads the
//! `{major:minor}` field of that mount entry and combines it with
//! `stat().st_ino` to form the full triple. The mountinfo
//! `{major:minor}` is the same `i_sb->s_dev` value that
//! `/proc/locks` prints (in decimal in mountinfo, in hex in
//! `/proc/locks`), so the resulting needle matches `/proc/locks` by
//! construction. The needle feeds
//! `proc_locks::read_holders_for_needle`, which scans
//! `/proc/locks` exactly once and byte-compares.
//!
//! # Remote-filesystem rejection
//!
//! [`try_flock`] refuses to operate on NFS / CIFS / SMB2 / CEPH /
//! AFS / FUSE (see `fs_filter::reject_remote_fs`). `flock(2)` on
//! those filesystems is either not coordinated across hosts under
//! some server configurations (NFSv3 without NLM coordination) or
//! silently returns success without serializing peers (FUSE when the
//! userspace server doesn't implement the flock op). ktstr's
//! resource-budget contract is not robust to that silent
//! degradation, so the safe call is to reject at lockfile-open
//! time with an actionable message.

use serde::Serialize;

pub(crate) mod acquire;
pub(crate) mod fs_filter;
pub(crate) mod holder;
pub(crate) mod mountinfo;
pub(crate) mod primitives;
pub(crate) mod proc_locks;

pub use holder::format_holder_list;
pub use primitives::{block_flock, try_flock};

pub(crate) use acquire::acquire_flock_with_timeout;
pub(crate) use holder::NO_HOLDERS_RECORDED;
pub(crate) use mountinfo::read_mountinfo;
pub(crate) use primitives::materialize;
pub(crate) use proc_locks::{read_holders, read_holders_with_mountinfo};
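
// Illustrative sketch only, not ktstr's real implementation. The module docs
// describe deriving a `{major:02x}:{minor:02x}:{inode}` needle from
// `/proc/self/mountinfo` and byte-comparing it against `/proc/locks`. The two
// helpers below are hypothetical stand-ins for `mountinfo::needle_from_path`
// and `proc_locks::read_holders_for_needle`. Assumptions: file contents are
// passed in as strings (so nothing here touches `/proc`), and mount points
// contain no octal-escaped characters such as `\040`.

/// Hypothetical helper: build the needle for `path` (whose inode is `ino`)
/// from mountinfo text, using a longest-prefix match on the mount point.
#[allow(dead_code)]
fn sketch_needle(mountinfo: &str, path: &str, ino: u64) -> Option<String> {
    // (mount point length, major, minor) of the best match so far.
    let mut best: Option<(usize, u32, u32)> = None;
    for line in mountinfo.lines() {
        // mountinfo fields: id, parent id, major:minor, root, mount point, ...
        let mut fields = line.split_whitespace();
        let Some(dev) = fields.nth(2) else { continue };
        let Some(mount_point) = fields.nth(1) else { continue };
        // Only accept prefixes that end on a path-component boundary.
        let covers = mount_point == "/"
            || path == mount_point
            || path
                .strip_prefix(mount_point)
                .map_or(false, |rest| rest.starts_with('/'));
        if !covers {
            continue;
        }
        let Some((major, minor)) = dev.split_once(':') else { continue };
        let (Ok(major), Ok(minor)) = (major.parse::<u32>(), minor.parse::<u32>()) else {
            continue;
        };
        // `>=` lets a later mount stacked on the same mount point win.
        if best.map_or(true, |(len, _, _)| mount_point.len() >= len) {
            best = Some((mount_point.len(), major, minor));
        }
    }
    // mountinfo prints the device in decimal; /proc/locks prints it in hex.
    best.map(|(_, major, minor)| format!("{major:02x}:{minor:02x}:{ino}"))
}

/// Hypothetical helper: PIDs of the FLOCK entries in `/proc/locks` text whose
/// `dev:inode` field equals `needle`.
#[allow(dead_code)]
fn sketch_holders(proc_locks: &str, needle: &str) -> Vec<u32> {
    let mut pids = Vec::new();
    for line in proc_locks.lines() {
        // Example line: `1: FLOCK  ADVISORY  WRITE 1234 00:1f:5678 0 EOF`
        let fields: Vec<&str> = line.split_whitespace().collect();
        if fields.len() >= 6 && fields[1] == "FLOCK" && fields[5] == needle {
            if let Ok(pid) = fields[4].parse::<u32>() {
                pids.push(pid);
            }
        }
    }
    pids
}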

/// Subdirectory name (under whatever root each caller picks) that
/// holds advisory `flock(2)` sentinels. Both [`crate::cache`] and
/// the run-dir flock surface in `crate::test_support::sidecar`
/// key off this constant for the `.locks/` convention. Also
/// referenced by run-listing walkers' dotfile filter
/// (`is_run_directory` in the same sidecar module) to keep
/// the lock subdirectory out of "list runs" output.
/// `crate::vmm::disk_template` maintains its own local copy of the
/// same value for the cache-side `.locks/` convention; the two are
/// kept in sync by convention rather than via a shared import.
pub(crate) const LOCK_DIR_NAME: &str = ".locks";

/// Requested sharing mode for [`try_flock`]. Translated to the
/// corresponding non-blocking [`rustix::fs::FlockOperation`]
/// internally; callers never see the libc-specific constants.
///
/// Shared between LLC + per-CPU flocks (`vmm::host_topology`) and
/// cache-entry flocks (`cache`). A single type prevents three-enum
/// drift: earlier revisions had `FlockMode` + `FlockKind` +
/// `LlcLockMode` with identical shape. `LlcLockMode` remains distinct
/// as the scheduler-intent layer (perf-mode vs. no-perf-mode
/// request), not a flock operation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FlockMode {
    /// Exclusive (`LOCK_EX`): sole access to the lock file.
    Exclusive,
    /// Shared (`LOCK_SH`): multiple holders can coexist.
    Shared,
}
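
// Illustrative sketch only, not the actual `acquire` module. Because every
// flock attempt is non-blocking, a timed wait can be layered on top by
// retrying until a deadline, as the module docs describe for the cache-entry
// path. `sketch_poll_until` and its parameters are hypothetical names;
// `attempt` stands in for a `try_flock`-style call that yields `None` while
// the lock is contended.
#[allow(dead_code)]
fn sketch_poll_until<T>(
    timeout: std::time::Duration,
    interval: std::time::Duration,
    mut attempt: impl FnMut() -> Option<T>,
) -> Option<T> {
    let deadline = std::time::Instant::now() + timeout;
    loop {
        // One non-blocking attempt per iteration.
        if let Some(guard) = attempt() {
            return Some(guard);
        }
        let now = std::time::Instant::now();
        if now >= deadline {
            // Timed out. A real caller would attach the holder list here.
            return None;
        }
        // Sleep for the poll interval, but never past the deadline.
        std::thread::sleep(interval.min(deadline - now));
    }
}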

/// Identity of a process holding an advisory flock. Used by error
/// messages in both LLC-coordination and cache-entry paths, plus the
/// `ktstr locks` observational subcommand.
///
/// The cmdline is read from `/proc/{pid}/cmdline` (NUL-separated by
/// the kernel), lossy-UTF-8 decoded, has each NUL replaced with a
/// space, and is truncated to roughly 100 chars (the
/// `holder::CMDLINE_MAX_CHARS` cap) with a `…` marker so a log line
/// remains single-line. A missing, racing, or permission-denied
/// `/proc/{pid}/cmdline` produces `"<cmdline unavailable>"` so the
/// pid still surfaces with diagnostic value.
///
/// `#[non_exhaustive]` so future fields (`start_time`, `fd_count`,
/// etc.) don't break external match arms or struct literals. Derives
/// `Serialize` (with `snake_case` field renaming for JSON schema
/// stability) for the `ktstr locks --json` surface; no `Deserialize`
/// because this type is produced-only.
#[derive(Debug, Clone, Serialize)]
#[serde(rename_all = "snake_case")]
#[non_exhaustive]
pub struct HolderInfo {
    /// PID of the flock holder as reported by `/proc/locks`.
    pub pid: u32,
    /// Truncated `/proc/{pid}/cmdline` of the holder process.
    pub cmdline: String,
}
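
// Illustrative sketch only, not the actual `holder` module. It mirrors the
// cmdline rendering described on `HolderInfo`: lossy UTF-8 decode, NUL to
// space, truncate with a `…` marker, and a placeholder when the file could
// not be read. `sketch_render_cmdline` is a hypothetical name; the exact
// truncation rule behind `holder::CMDLINE_MAX_CHARS` may differ, and trimming
// the trailing separator is an assumption of this sketch.
#[allow(dead_code)]
fn sketch_render_cmdline(raw: Option<&[u8]>, max_chars: usize) -> String {
    // `None` models a missing, racing, or permission-denied read.
    let Some(raw) = raw else {
        return "<cmdline unavailable>".to_string();
    };
    // The kernel separates arguments with NUL; render them as spaces.
    let text = String::from_utf8_lossy(raw).replace('\0', " ");
    // Drop the space left behind by the final NUL terminator.
    let text = text.trim_end();
    if text.chars().count() > max_chars {
        // Truncate on char boundaries so multi-byte UTF-8 is never split.
        let mut out: String = text.chars().take(max_chars).collect();
        out.push('…');
        out
    } else {
        text.to_string()
    }
}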