Skip to content

perf(ci): add VERSION file and build/test caching fixes#19424

Draft
davdhacs wants to merge 11 commits intomasterfrom
davdhacs/go-cache-optimization
Draft

perf(ci): add VERSION file and build/test caching fixes#19424
davdhacs wants to merge 11 commits intomasterfrom
davdhacs/go-cache-optimization

Conversation

@davdhacs
Copy link
Contributor

@davdhacs davdhacs commented Mar 13, 2026

Summary

Enable Go build and test caching across commits by stabilizing cache ActionIDs. Reduces CI test job times by 3-9x and build compilation steps by 1.5-3.7x.

Measured results

Test jobs (unit-tests.yaml)

Seed run: 23069907040. Warm runs: 23071454130, 23073119966. Master comparison: 23063508964, 23054165953.

Job Master warm VERSION warm vs master
go (GOTAGS="") 38m 13.8m 2.7x faster
go (GOTAGS=release) 34m 13.5m 2.5x faster
go-postgres (GOTAGS="") 31m 3.8m 8.2x faster
go-postgres (GOTAGS=release) 30m 4.0m 7.5x faster
sensor-integration 22m 5.3m 4.2x faster
go-bench 18m 17.9m same
local-roxctl-tests 9m 8.8m same

Test cache hit rate: 78% (1946/2509 packages). Results confirmed stable across two warm runs.

Step-level breakdown for key test jobs (showing where time is spent):

Job Step Master VERSION
go (GOTAGS="") Container init 56s 54s
Cache restore 116s 92s
Go Unit Tests 1822s 488s
Cache save 70s 30s
go-postgres (GOTAGS="", 15) Container init 55s 57s
Cache restore 97s 72s
Go Unit Tests 1568s 38s
Cache save 44s 14s
sensor-integration Cache restore 52s 58s
Tests 1160s 152s
Cache save 13s 13s

Build jobs (build.yaml)

Build run: 23073119981. Master comparison: 23063508995. Step-level timing for Go compilation steps only:

pre-build-go-binaries (pure Go compilation):

Arch Master VERSION Speedup
amd64 100s 27s 3.7x
arm64 68s 19s 3.6x

pre-build-cli (cross-compiles for multiple platforms):

Master VERSION Speedup
Build CLI 101s 60s 1.7x

Build cache stats: 4401/4403 deps cached (2 compiled) for go-binaries. Only pkg/version/internal and pkg/version recompile per commit.

Why caches miss today

Go's GOCACHE keys builds and tests by "ActionID" — a hash of compiler flags, source file contents, and dependency outputs. On master, every commit gets 0% cache hits for two reasons:

1. The commit SHA is baked into every ActionID. We pass -X MainVersion=4.11.x-193-g<sha> as an ldflag, and Go includes ldflags in the link ActionID. Since the SHA changes on every commit, every link action gets a unique ID, causing a full cache miss for every binary. Go 1.18+ also silently enabled -buildvcs=auto which stamps vcs.revision and vcs.modified into binaries — another per-commit change to ActionIDs. No code reads these VCS stamps.

2. Git sets file mtimes to checkout time, and Go's test cache relies on mtime. Go's test result cache has two phases: first it checks the test binary identity (the ActionID from #1), then it validates that the source files haven't changed by checking (path, size, mtime) — not content. Since git checkout sets all file mtimes to "now", every CI run has different mtimes, so even if #1 were fixed, test results would still miss.

Together, this means master rebuilds and re-runs every test on every commit, even when nothing relevant changed.

The fix

1. chore: add VERSION file with base version tag

A VERSION file containing 4.11.x provides a stable base version tag. make/env.mk falls back to it when git describe fails (e.g. shallow clones). start-release.yml verifies VERSION matches the release and creates a PR to bump it on master for the next minor version.

2. perf(ci): replace -X ldflags with generated zversion.go

Instead of passing version info via -X ldflags (which pollute every binary's link ActionID), go-tool.sh now generates a pkg/version/internal/zversion.go source file that sets the version at init time. Because the version is now in a source file rather than in ldflags, it only affects the ActionID of the packages that contain it — not the link ActionID of every binary in the repo. Also adds -buildvcs=false to stop the unused VCS stamping.

For builds, zversion.go uses the full git-describe version so release binaries are correctly versioned. The changed file content causes pkg/version/internal and pkg/version to recompile, but the cascade stops there: Go's compile cache keys each package using the content hash of its dependency outputs (contentID), not their actionIDs. When pkg/version recompiles, its output is byte-identical (its source didn't change — only its dependency's init() body changed), so its contentID is unchanged. All ~4400 downstream packages see the same dependency contentID, compute the same actionID, and cache-hit. Verified locally: only 2 of 1559 packages recompile after an empty commit.

For tests, the stable base tag is necessary because test caching works differently. Go's test cache checks the test binary identity, which includes the link ActionID. The linker hashes all input .a files by content — and pkg/version/internal.a did change. Since 1191 of 1952 packages (61%) transitively depend on pkg/version, using the full version for tests would invalidate 61% of test cache entries on every commit. Verified locally: tests miss cache after an empty commit with full version, but cache-hit with stable base tag.

3. perf(ci): enable Go test cache with fixed mtime

After checkout, git ls-files -z | xargs -0 touch -t 200101010000 sets all file mtimes to a fixed past date. This makes Go's test cache mtime validation pass across CI runs when file content hasn't changed. Also adds a key-suffix input to the cache action so GOTAGS matrix entries ("" vs release) don't overwrite each other's GOCACHE.

How these pieces interact

The build cache (compilation + linking) is fixed by commits 1+2: removing -X ldflags stabilizes link ActionIDs, and the zversion.go contentID cascade stops at pkg/version so downstream compilation cache-hits.

The test cache requires all three: commit 2 stabilizes the test binary ActionID (phase 1 — the link ActionID no longer changes per commit), and commit 3 stabilizes the file mtimes that Go checks to validate cached test results (phase 2). Both phases must hit for a cached test result to be reused.

See also: #19425 (GOCACHE stale entry trimming — independent, reduces cache size by ~27%).

Partially generated by AI.

davdhacs and others added 3 commits March 13, 2026 13:42
Add a VERSION file containing the base version tag (4.11.x), following
the same pattern as COLLECTOR_VERSION, SCANNER_VERSION, FACT_VERSION.

- VERSION: contains "4.11.x"
- make/env.mk: make tag falls back to VERSION when git describe fails
- start-release.yml: verifies VERSION matches release, creates PR to
  update VERSION on master for the next minor release

Partially generated by AI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the //XDef: annotation + status.sh + -X ldflags mechanism with
a generated pkg/version/internal/zversion.go file.

For tests: uses the stable base tag from VERSION (e.g. "4.11.x"),
keeping Go link ActionIDs stable across commits. This enables both
the Go link cache (2x faster builds) and test result cache.

For builds: uses the full git-describe format (e.g. "4.11.x-320-gabcdef")
so release binaries have the correct version.

Also adds -buildvcs=false to remove unused VCS stamping that Go 1.18+
silently enabled (no code reads debug.ReadBuildInfo).

Verified locally:
- Same ldflags (link cache hit): 2.4s for 8 binaries
- Different ldflags (link cache miss): 5.2s (2.2x slower)

Partially generated by AI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Go test cache has two phases:
1. Test binary ID (includes link ActionID with ldflags) — fixed by commit 2
2. File mtime validation — fixed here

git checkout sets all file mtimes to "now", causing Phase 2 to miss.
Fix: set all tracked file mtimes to a fixed date after checkout.

Also adds key-suffix input to cache-go-dependencies to disambiguate
GOTAGS matrix entries. github.job is the same for all matrix entries
(e.g. "go" for both GOTAGS="" and GOTAGS=release). Without key-suffix,
matrix entries overwrite each other's GOCACHE.

Combined measured results (PR #19404):
- go unit tests: 45m → 14m (3.3x)
- go-postgres: 35m → 4m (8.7x)
- sensor-integration: 21m → 5m (4x)
- Test cache hit rate: 92% (1942/2108)

Partially generated by AI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@openshift-ci
Copy link

openshift-ci bot commented Mar 13, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Copy link
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 2 issues, and left some high level feedback:

  • The build cache statistics added to go_build rely on go build -v line counts and a fixed /tmp/go-build-compiled-count file, which can be both inaccurate (non-1:1 with compiled units) and racy if multiple builds run in parallel; consider using a unique temp file per invocation and/or gating this logging behind a flag or env var.
  • The git ls-files | xargs touch step in the composite action mutates mtimes for every tracked file, which could interfere with incremental build tooling or other languages in the same job; consider narrowing this to Go source/test files or to the directories actually used by Go.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The build cache statistics added to `go_build` rely on `go build -v` line counts and a fixed `/tmp/go-build-compiled-count` file, which can be both inaccurate (non-1:1 with compiled units) and racy if multiple builds run in parallel; consider using a unique temp file per invocation and/or gating this logging behind a flag or env var.
- The `git ls-files | xargs touch` step in the composite action mutates mtimes for every tracked file, which could interfere with incremental build tooling or other languages in the same job; consider narrowing this to Go source/test files or to the directories actually used by Go.

## Individual Comments

### Comment 1
<location path="scripts/go-tool.sh" line_range="108" />
<code_context>
-  invoke_go_build -o "$output" "${dirs[@]}"
+  local total
+  total=$(go list -deps "${dirs[@]}" 2>/dev/null | wc -l | tr -d ' ')
+  local compiled_count=/tmp/go-build-compiled-count
+  invoke_go_build -o "$output" "${dirs[@]}" 2> >(tee >(wc -l | tr -d ' ' > "$compiled_count") >&2)
+  wait
</code_context>
<issue_to_address>
**issue (bug_risk):** Use a per-process or per-invocation temp file instead of a fixed path for compiled_count

A fixed `/tmp` path means concurrent runs of this script (or multiple `go_build` calls in one job) will race on the same `compiled_count` file, causing corrupted or incorrect counts. Please use a unique temp file per invocation (e.g., via `mktemp` or including `$$`/`$BASHPID` and job name in the filename) and clean it up afterward.
</issue_to_address>

### Comment 2
<location path="scripts/go-tool.sh" line_range="112-115" />
<code_context>
+  invoke_go_build -o "$output" "${dirs[@]}" 2> >(tee >(wc -l | tr -d ' ' > "$compiled_count") >&2)
+  wait
+  local compiled
+  compiled=$(cat "$compiled_count" 2>/dev/null | tr -d ' ')
+  local cached=$((total - compiled))
+  echo >&2 "Build cache: ${cached}/${total} deps cached ($compiled compiled)"
+  if [[ "$total" -gt 0 ]] && [[ $((cached * 100 / total)) -lt 50 ]]; then
+    echo >&2 "WARNING: Build cache hit rate below 50% (${cached}/${total}). Significant code changes invalidate the cached build — expect slower compilation."
+  fi
</code_context>
<issue_to_address>
**issue (bug_risk):** Harden arithmetic against empty or non-numeric totals/compiled counts

If `go list` or the build step fails (or their output format changes), `total`/`compiled` can be empty or non-numeric. Then `cached=$((total - compiled))` and the percentage calculation will throw a bash arithmetic error and abort. Please default missing values to 0 and ensure both vars are integers (e.g. `${total:-0}`, `${compiled:-0}` with a simple regex check), or skip the cache-reporting logic when the counts aren’t valid.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

invoke_go_build -o "$output" "${dirs[@]}"
local total
total=$(go list -deps "${dirs[@]}" 2>/dev/null | wc -l | tr -d ' ')
local compiled_count=/tmp/go-build-compiled-count
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (bug_risk): Use a per-process or per-invocation temp file instead of a fixed path for compiled_count

A fixed /tmp path means concurrent runs of this script (or multiple go_build calls in one job) will race on the same compiled_count file, causing corrupted or incorrect counts. Please use a unique temp file per invocation (e.g., via mktemp or including $$/$BASHPID and job name in the filename) and clean it up afterward.

Comment on lines +112 to +115
compiled=$(cat "$compiled_count" 2>/dev/null | tr -d ' ')
local cached=$((total - compiled))
echo >&2 "Build cache: ${cached}/${total} deps cached ($compiled compiled)"
if [[ "$total" -gt 0 ]] && [[ $((cached * 100 / total)) -lt 50 ]]; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (bug_risk): Harden arithmetic against empty or non-numeric totals/compiled counts

If go list or the build step fails (or their output format changes), total/compiled can be empty or non-numeric. Then cached=$((total - compiled)) and the percentage calculation will throw a bash arithmetic error and abort. Please default missing values to 0 and ensure both vars are integers (e.g. ${total:-0}, ${compiled:-0} with a simple regex check), or skip the cache-reporting logic when the counts aren’t valid.

@rhacs-bot
Copy link
Contributor

rhacs-bot commented Mar 13, 2026

Images are ready for the commit at ba974ea.

To use with deploy scripts, first export MAIN_IMAGE_TAG=4.11.x-330-gba974eae96.

@codecov
Copy link

codecov bot commented Mar 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 49.71%. Comparing base (3d6b7a6) to head (ba974ea).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #19424   +/-   ##
=======================================
  Coverage   49.71%   49.71%           
=======================================
  Files        2701     2701           
  Lines      203453   203453           
=======================================
+ Hits       101139   101140    +1     
+ Misses      94786    94785    -1     
  Partials     7528     7528           
Flag Coverage Δ
go-unit-tests 49.71% <ø> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@davdhacs davdhacs added the ci-save-cache Enable Go cache saves on this PR branch for testing label Mar 13, 2026
davdhacs and others added 3 commits March 13, 2026 14:47
actionlint flags matrix.gotags as undefined in jobs without a matrix
strategy (go-bench, local-roxctl-tests, sensor-integration-tests).
The key-suffix is only needed for the go and go-postgres jobs which
have the gotags matrix to prevent GOCACHE key collisions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@davdhacs davdhacs changed the title perf(ci): enable Go build and test caching across commits perf(ci): add VERSION file and caching fixes Mar 14, 2026
@davdhacs davdhacs changed the title perf(ci): add VERSION file and caching fixes perf(ci): add VERSION file and build/test caching fixes Mar 14, 2026
davdhacs and others added 4 commits March 13, 2026 23:02
GOCACHE accumulates stale entries across commits as packages change:
each commit creates new cache entries for recompiled packages, but old
entries from previous commits are never cleaned up. Go's built-in trim
only runs once per 24h with 5-day retention — too slow for CI.

Fix: backdate all GOCACHE entries to year 2000 after restore. During
the build, Go's markUsed() updates mtimes of accessed entries to "now".
After the build, a trim step deletes entries still at year 2000 (stale).
The cache auto-save post-step then saves only the trimmed directory.

Key detail: trim.txt must be set to "now" after backdating, otherwise
Go's built-in Trim() sees "last trim in year 2000" and deletes ALL
backdated entries before the build even starts.

Also bumps cache key from v1 to v2 for a clean slate.

Locally verified: with ~50 commits of real source changes, trim removes
679MB (32%) of stale entries from a 2.1GB cache.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants