Resource Budget

ktstr boots KVM VMs and builds kernels on hosts that are usually doing other things at the same time — more tests, kernel builds, a developer session. The resource budget is how concurrent ktstr processes share host CPUs without silently corrupting each other’s measurements: every run reserves host LLCs through advisory file locks, and budgeted runs are additionally confined to an exact CPU count by a cgroup v2 cpuset sandbox.

When to use it

  • Multi-tenant CI hosts where unbounded parallelism starves concurrent jobs but the full performance-mode contract (RT scheduling, hugepages, NUMA mbind) is too heavy.
  • Kernel builds beside perf-mode tests — the build’s shared lock coordinates with the perf-mode exclusive lock, so make never stomps a measurement in progress.
  • Concurrent no-perf-mode VMs — a cap of N CPUs bounds how much capacity each run reserves; peers wait instead of racing for CPU.

The three coordination modes

Every VM run takes one of three coordination paths, selected by two switches: performance_mode on the test or builder, and --no-perf-mode / KTSTR_NO_PERF_MODE (any non-empty value).

| Mode | Selected by | LLC lockfiles | Per-CPU lockfiles | Enforcement |
|---|---|---|---|---|
| Performance mode | performance_mode = true | exclusive (LOCK_EX), one per virtual LLC | none (the exclusive LLC lock covers its CPUs) | vCPU pinning, RT scheduling, hugepages, NUMA mbind |
| Budgeted (no-perf-mode) | --no-perf-mode / KTSTR_NO_PERF_MODE | shared (LOCK_SH) on the planned LLC set | none (the cgroup cpuset is the enforcement layer) | cgroup v2 cpuset sandbox + soft affinity mask |
| Default | neither | shared (LOCK_SH) on the 1:1 plan’s LLCs | exclusive (LOCK_EX), one per assigned host CPU | none (reservation only) |

Lockfiles live at {KTSTR_LOCK_DIR or /tmp}/ktstr-llc-{N}.lock and {KTSTR_LOCK_DIR or /tmp}/ktstr-cpu-{C}.lock. The modes compose through ordinary shared/exclusive lock semantics: shared holders coexist with each other, while an exclusive holder blocks every shared acquirer and any shared holder blocks an exclusive acquirer. Any number of budgeted and default runs therefore share LLCs among themselves, a perf-mode run waits for all of them to release, and while a perf-mode run holds its LLCs nobody else touches those CPUs. Default runs additionally exclude each other per CPU, so two default VMs never time-slice the same host CPU. Kernel builds take the budgeted path.
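
The way shared and exclusive holders compose can be reproduced with plain flock(2) calls. The following is a minimal sketch, not ktstr's implementation: the helper names (try_lock, release, demo) are hypothetical, and a temporary directory stands in for KTSTR_LOCK_DIR. It relies on the Linux behavior that two separate open() calls on the same file yield independent flock holders, even inside one process.

```python
import fcntl
import os
import tempfile


def try_lock(path, mode):
    """Open `path` and attempt a non-blocking flock; return the fd, or None if busy."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, mode | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


def release(fd):
    """Closing the only descriptor of an open file description drops its flock."""
    os.close(fd)


def demo():
    # Hypothetical stand-in for {KTSTR_LOCK_DIR or /tmp}.
    with tempfile.TemporaryDirectory() as lock_dir:
        llc0 = os.path.join(lock_dir, "ktstr-llc-0.lock")

        budgeted = try_lock(llc0, fcntl.LOCK_SH)      # shared holder
        default = try_lock(llc0, fcntl.LOCK_SH)       # second shared holder coexists
        perf_blocked = try_lock(llc0, fcntl.LOCK_EX)  # exclusive must wait: None

        release(budgeted)
        release(default)

        perf = try_lock(llc0, fcntl.LOCK_EX)          # all shared holders gone
        late_shared = try_lock(llc0, fcntl.LOCK_SH)   # blocked by exclusive: None

        outcome = {
            "shared_coexist": budgeted is not None and default is not None,
            "exclusive_waits_for_shared": perf_blocked is None,
            "exclusive_after_release": perf is not None,
            "shared_waits_for_exclusive": late_shared is None,
        }
        release(perf)
        return outcome
```

Calling demo() on a local Linux filesystem should report every entry as True, which mirrors the rule that shared holders coexist while an exclusive holder and shared holders block each other.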

When the default path cannot map its topology 1:1 onto the host it does not fail: if a plan exists but every slot is busy, the run skips with ResourceContention and nextest retries; if no plan can exist (host too small), the run proceeds overcommitted — every vCPU thread masked to the allowed CPUs — and warns when that means oversubscription (see below).

Too-small hosts: who asked determines the verdict

The outcome of an unsatisfiable request depends on where the request came from — an explicit guarantee must never silently degrade, and an operator typo must not look like a host limitation:

| Request | Error | Outcome |
|---|---|---|
| performance_mode = true, host can’t honor isolation | PerfModeUnavailable | skip (fail under KTSTR_NO_SKIP_MODE) |
| Per-test cpu_budget above the allowed CPUs | TopologyInsufficient | skip (fail under KTSTR_NO_SKIP_MODE) |
| Operator --cpu-cap / KTSTR_CPU_CAP above the allowed CPUs | CpuBudgetUnsatisfiable | hard fail |
| Default mode, no 1:1 placement possible | none | runs overcommitted, warns |

A test attribute is a capability requirement a bigger host would satisfy, so it skips. An operator-typed number that does not exist on this host is a misconfiguration, so it fails. The over-cap error names both numbers:

--cpu-cap N = 96 exceeds the 64 CPUs this process is allowed on (from
sched_getaffinity / Cpus_allowed_list). Pick a value ≤ 64, release the
cgroup/taskset constraint restricting this process, or omit --cpu-cap
to use the auto-sized default (30% of the allowed set for kernel
builds; the vCPU count, floored at 30%, for VMs).

The default-mode overcommit warning fires only when the allowed CPU set is genuinely smaller than the vCPU count (a CI runner or systemd slice can be narrower than the online host):

ktstr: WARNING: only 8 host CPUs available for 16 vCPUs (2.0x
oversubscription) — the process cpuset is smaller than the guest, so
the auto-sized CPU budget collapsed to it. NOTHING opted into this.
The host time-slices the vCPU threads, confounding guest-scheduler
measurement (absolute work scales ~1/2; timing metrics are host
artifacts). Widen the process cpuset, or shrink the guest topology.

The stamped cpu_budget in the run’s sidecar also drops below the vCPU count, so an A/B comparison against an overcommitted run is flagged rather than silently confounded.

The CPU budget

The budget is resolved in precedence order:

  1. --cpu-cap N on the command line.
  2. KTSTR_CPU_CAP=N when the flag is absent (empty string = unset).
  3. Neither: kernel builds get 30% of the allowed CPUs (rounded up, minimum 1); no-perf-mode VMs get max(30%, min(vcpus, allowed)) so a wide VM’s vCPU threads are not host-oversubscribed by the 30% mask — an oversubscribed guest measures host contention, not its own scheduler. An explicit cap below the vCPU count is the deliberate opt-in to oversubscription for contention testing.

A cap of 0 is rejected with the message "--cpu-cap must be ≥ 1 CPU (got 0)": zero is a scripting sentinel, not a silent “no cap”.

The reference set is the calling process’s allowed CPUs (sched_getaffinity, with a /proc/self/status fallback), not the host’s online count — so the reservation stays valid under cgroup-restricted CI runners. An empty allowed set is a hard error: guessing on a misconfigured host is worse than failing visibly.
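
As an illustration of how the reference set can be obtained, here is a minimal sketch assuming Linux. The function names (parse_cpu_list, allowed_cpus) are hypothetical, and the condition under which the /proc/self/status fallback is used (sched_getaffinity being unavailable or failing) is an assumption, since the text above only states that a fallback exists.

```python
import os


def parse_cpu_list(text):
    """Parse a kernel CPU list such as '0-3,8,10-11' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return sorted(cpus)


def allowed_cpus():
    """Return the CPUs this process may run on, not the host's online count."""
    try:
        cpus = sorted(os.sched_getaffinity(0))
    except (AttributeError, OSError):
        # Assumed fallback trigger: read Cpus_allowed_list from /proc/self/status.
        cpus = []
        with open("/proc/self/status") as status:
            for line in status:
                if line.startswith("Cpus_allowed_list:"):
                    cpus = parse_cpu_list(line.split(":", 1)[1])
                    break
    if not cpus:
        # An empty allowed set is treated as a hard error rather than guessed around.
        raise RuntimeError("allowed CPU set is empty; refusing to guess")
    return cpus
```

Running the process under taskset or inside a restricted cgroup should shrink the list returned by allowed_cpus(), which is the set every budget below is measured against.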

A per-test cpu_budget attribute on #[ktstr_test] overrides the auto-size for that test; an operator --cpu-cap / KTSTR_CPU_CAP wins over both.
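
The precedence and auto-sizing rules in this section can be sketched as one function. This is an illustration under stated assumptions, not ktstr's code: resolve_budget and its parameters are hypothetical names, the exception classes merely reuse the error names from the table earlier on this page, and applying the "≥ 1" check to KTSTR_CPU_CAP as well as to the flag is an assumption.

```python
class CpuBudgetUnsatisfiable(Exception):
    """Operator cap exceeds the allowed CPUs (hard fail)."""


class TopologyInsufficient(Exception):
    """Per-test cpu_budget exceeds the allowed CPUs (skip)."""


def thirty_percent(allowed):
    """30% of the allowed CPUs, rounded up, minimum 1 (integer arithmetic)."""
    return max(1, (allowed * 3 + 9) // 10)


def resolve_budget(allowed, *, flag=None, env=None, per_test=None, vcpus=None):
    """Resolve the CPU budget.

    allowed  -- number of CPUs in the process's allowed set
    flag     -- value of --cpu-cap, or None when absent
    env      -- raw KTSTR_CPU_CAP string, or None; empty string counts as unset
    per_test -- cpu_budget attribute of the test, or None
    vcpus    -- guest vCPU count for a VM run; None for a kernel build
    """
    operator = flag if flag is not None else (int(env) if env else None)
    if operator is not None:
        if operator < 1:
            raise ValueError(f"--cpu-cap must be ≥ 1 CPU (got {operator})")
        if operator > allowed:
            raise CpuBudgetUnsatisfiable(
                f"--cpu-cap N = {operator} exceeds the {allowed} CPUs "
                "this process is allowed on"
            )
        return operator  # operator input wins over everything else
    if per_test is not None:
        if per_test > allowed:
            raise TopologyInsufficient(
                f"cpu_budget {per_test} exceeds the {allowed} allowed CPUs"
            )
        return per_test  # overrides the auto-size only
    floor = thirty_percent(allowed)
    if vcpus is None:
        return floor  # kernel build
    return max(floor, min(vcpus, allowed))  # no-perf-mode VM
```

With 64 allowed CPUs, this sketch auto-sizes a kernel build to 20 CPUs and a 32-vCPU VM to 32 CPUs. With only 8 allowed CPUs, a 16-vCPU VM collapses to 8, which is the situation the oversubscription warning describes.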

Flag availability

  • --no-perf-mode: cargo ktstr test / coverage / llvm-cov / shell, and ktstr shell. KTSTR_NO_PERF_MODE (any non-empty value) works everywhere.
  • --cpu-cap N: ktstr shell, ktstr kernel build, cargo ktstr shell, cargo ktstr kernel build — and it requires --no-perf-mode (perf mode already holds whole LLCs exclusively, so a cap would double-reserve). For cargo ktstr test / coverage / llvm-cov set KTSTR_CPU_CAP=N instead.

How a reservation is planned

Budgeted acquisition runs three phases:

  1. Discover — stat every LLC lockfile and read /proc/locks once to snapshot current holders. No locks taken.
  2. Plan — rank LLCs: prefer LLCs that already have holders (consolidation packs shared runs together), seed on the best-ranked LLC’s NUMA node, and greedily fill that node before spilling to nearest-by-distance neighbors. Accumulate LLCs until their allowed CPUs cover the budget.
  3. Acquire — non-blocking shared locks on every selected LLC, all-or-nothing. If any lock is busy, every held lock is dropped and the whole cycle retries a few times with short ascending backoff; after the final attempt it bails with a ResourceContention error naming the winning holders.
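
The all-or-nothing Acquire phase can be sketched as follows. This is a simplified illustration, not ktstr's implementation: acquire_all_shared, the attempt count, and the delays are hypothetical, and the error here names only the busy lockfile, whereas the real error is described above as naming the holders.

```python
import fcntl
import os
import time


class ResourceContention(Exception):
    """Raised when the selected LLC set could not be locked as a whole."""


def acquire_all_shared(paths, attempts=3, base_delay=0.05):
    """Take LOCK_SH on every path, or on none of them.

    Returns the list of held file descriptors; closing them releases the locks.
    """
    busy = None
    for attempt in range(attempts):
        held = []
        busy = None
        for path in paths:
            fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
            try:
                fcntl.flock(fd, fcntl.LOCK_SH | fcntl.LOCK_NB)
            except BlockingIOError:
                os.close(fd)
                busy = path
                break
            held.append(fd)
        if busy is None:
            return held
        # One lock was busy: drop everything already held before retrying,
        # so a partial reservation never blocks an exclusive peer.
        for fd in held:
            os.close(fd)
        if attempt + 1 < attempts:
            time.sleep(base_delay * (attempt + 1))  # short ascending backoff
    raise ResourceContention(f"{busy} is held exclusively by another run")
```

Dropping every held lock before sleeping is the important design point: a run that keeps a partial set while it waits could deadlock against another run doing the same thing in a different order.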

The lock granularity is per-LLC, but the reserved CPU list holds exactly the budget — the last selected LLC typically contributes only a prefix of its CPUs. When the plan spans more than one NUMA node, stderr warns:

ktstr: reserving LLCs [0 (node 0), 2 (node 1)] across 2 NUMA nodes
(preferred single-node contiguous unavailable). Build will run;
memory-access latency may be higher.
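
The Plan phase and the exact-budget truncation can be sketched as below. This is an illustration only: the Llc record, the plan function, the distance table, and the tie-breaking order (more holders first, then lower LLC index) are assumptions, since the text specifies the ranking preferences but not the exact comparison.

```python
from dataclasses import dataclass


@dataclass
class Llc:
    index: int    # LLC number, as in ktstr-llc-{N}.lock
    node: int     # NUMA node the LLC belongs to
    cpus: list    # allowed CPUs in this LLC
    holders: int  # lock holders seen during Discover


def plan(llcs, budget, distance):
    """Pick LLCs for `budget` CPUs; distance[a][b] is the NUMA distance a -> b.

    Returns (selected LLC indices, reserved CPUs, NUMA nodes spanned).
    """
    # Seed on the NUMA node of the best-ranked LLC (most existing holders).
    best = min(llcs, key=lambda l: (-l.holders, l.index))
    seed = best.node
    # Fill the seed node first, then spill to the nearest nodes.
    order = sorted(
        llcs, key=lambda l: (distance[seed][l.node], -l.holders, l.index)
    )
    selected, cpus = [], []
    for llc in order:
        if len(cpus) >= budget:
            break
        selected.append(llc)
        cpus.extend(llc.cpus)
    if len(cpus) < budget:
        raise ValueError("allowed CPUs cannot cover the budget")
    nodes = sorted({l.node for l in selected})
    # Locks are per LLC, but the reserved list is cut to exactly the budget,
    # so the last LLC contributes only a prefix of its CPUs.
    return [l.index for l in selected], cpus[:budget], nodes
```

A plan whose third return value holds more than one node corresponds to the cross-node warning shown above.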

Cgroup v2 cpuset sandbox

Budgeted runs write the reserved CPUs and their NUMA nodes into a child cgroup — cpuset.cpus, then cpuset.mems, then the pid into cgroup.procs, in that order because the kernel may kill a task migrated into a cgroup whose cpuset.mems is still empty. After each write the effective value is read back: narrowing by a parent cgroup (a systemd slice, a container limit) is a fatal error under an explicit --cpu-cap and a warning otherwise. Kernel builds inside the sandbox also get their make -j width set to the reserved CPU count — without that, make -j$(nproc) fans gcc children out to a width the cpuset then has to time-slice, silently defeating the budget in scheduling terms.
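
The write order and the read-back check can be sketched as follows. This is a minimal illustration, not ktstr's code: enter_sandbox and its parameters are hypothetical, the child cgroup directory is assumed to exist already with the cpuset controller enabled by its parent, and the read-back is assumed to use the standard cgroup v2 files cpuset.cpus.effective and cpuset.mems.effective.

```python
import os


def parse_id_list(text):
    """Parse a kernel list such as '0-3,8' into a set of ints."""
    ids = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        ids.update(range(int(lo), int(hi or lo) + 1))
    return ids


def enter_sandbox(cgroup_dir, cpus, mems, pid, explicit_cap):
    """Confine `pid` to `cpus` / `mems`; return any narrowing warnings.

    Order matters: cpuset.cpus, then cpuset.mems, and only then cgroup.procs,
    so the task never lands in a cgroup whose cpuset.mems is still empty.
    """
    warnings = []
    for knob, wanted in (("cpuset.cpus", cpus), ("cpuset.mems", mems)):
        with open(os.path.join(cgroup_dir, knob), "w") as f:
            f.write(",".join(str(i) for i in sorted(wanted)))
        with open(os.path.join(cgroup_dir, knob + ".effective")) as f:
            effective = parse_id_list(f.read())
        if not set(wanted) <= effective:
            message = (
                f"{knob}: requested {sorted(wanted)}, "
                f"parent cgroup narrowed it to {sorted(effective)}"
            )
            if explicit_cap:
                raise RuntimeError(message)  # fatal under an explicit cap
            warnings.append(message)         # warning otherwise
    with open(os.path.join(cgroup_dir, "cgroup.procs"), "w") as f:
        f.write(str(pid))
    return warnings
```

A build launched inside the sandbox would then pass the reserved CPU count, len(cpus) in this sketch, as its make -j width instead of the host's nproc.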

Observing locks

ktstr locks (or cargo ktstr locks) prints every ktstr lock currently held on the host (LLC, per-CPU, kernel-cache, and run-dir locks) with each holder’s PID and command line. It is read-only and takes no locks itself. Use it when an acquire fails with ResourceContention: the error names the busy LLCs, while the snapshot shows every contending peer at once. The full output and flags are documented in ktstr (standalone).

KTSTR_BYPASS_LLC_LOCKS — escape hatch

Setting KTSTR_BYPASS_LLC_LOCKS=1 skips lock acquisition entirely: the VM boots or the build starts immediately, with no coordination against concurrent runs. Use it only when measurement noise is acceptable — an isolated workstation, or a CI queue that already serializes jobs at a higher layer. It is mutually exclusive with --cpu-cap / KTSTR_CPU_CAP at every entry point; the rejection message always contains "resource contract" so it is greppable.

Filesystem requirement

Every lockfile path must live on a local filesystem — tmpfs, ext4, xfs, btrfs, f2fs, and bcachefs are the accepted set. NFS, CIFS/SMB, CephFS, AFS, and FUSE mounts are rejected at open time: flock(2) coordination or /proc/locks holder enumeration is unreliable on these configurations, and ktstr refuses to run on a lock it cannot trust. The error names the offending filesystem and the fix: move the lockfile path (KTSTR_LOCK_DIR, the cache root, or the runs root) to a local filesystem. Unknown-but-local filesystems (zfs, erofs, …) pass through.
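
The three-way verdict (accepted, rejected, unknown-but-local) can be sketched as a classifier over filesystem type names. This is an illustration under assumptions: classify_fstype and fstype_of are hypothetical helpers, the type strings for the rejected set (nfs, nfs4, cifs, smb3, ceph, afs, fuse and fuse.*) are the names Linux reports in its mount table and are my mapping of the families listed above, and looking the type up through the mount table is an assumed detection method rather than what ktstr is documented to do.

```python
import os

ACCEPTED = {"tmpfs", "ext4", "xfs", "btrfs", "f2fs", "bcachefs"}
REJECTED = {"nfs", "nfs4", "cifs", "smb3", "ceph", "afs"}  # assumed type names


def classify_fstype(fstype):
    """Return 'accepted', 'rejected', or 'passthrough' for a filesystem type."""
    if fstype in ACCEPTED:
        return "accepted"
    if fstype in REJECTED or fstype == "fuse" or fstype.startswith("fuse."):
        return "rejected"
    return "passthrough"  # unknown-but-local, e.g. zfs or erofs


def fstype_of(path, mounts_file="/proc/self/mounts"):
    """Find the filesystem type of the mount that contains `path`.

    Picks the longest matching mount point; on a tie the later entry wins,
    since a later mount shadows an earlier one on the same directory.
    """
    real = os.path.realpath(path)
    best, best_type = "", None
    with open(mounts_file) as mounts:
        for line in mounts:
            fields = line.split()
            if len(fields) < 3:
                continue
            # The mount table escapes spaces in paths as \040.
            mount_point = fields[1].replace("\\040", " ")
            prefix = mount_point.rstrip("/") + "/"
            if real == mount_point or real.startswith(prefix):
                if len(mount_point) >= len(best):
                    best, best_type = mount_point, fields[2]
    return best_type
```

A caller would check classify_fstype(fstype_of(lock_dir)) before opening any lockfile and refuse to proceed on "rejected".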