experiment: stable build ldflags for Docker layer caching#19829
Conversation
Enable GHA buildx cache for main, roxctl, and operator image builds. Docker layers (base image pulls, package installs) are cached across CI runs, avoiding redundant microdnf upgrade/install on every build. Cache is opt-in via DOCKER_BUILDX_CACHE env var to avoid affecting local builds. Scoped per image and architecture to prevent collisions. Expected savings: ~60-80s off the 115s "Build main images" step on warm runs (package install layers cached). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move OS package installation (microdnf upgrade, postgres RPMs, util-linux) into a separate 'base-with-packages' stage. The final stage uses FROM base-with-packages and only COPYs binaries. Before: COPY binaries → RUN microdnf (rebuilds packages every commit) After: base-with-packages stage (cached) → COPY binaries (fast) With Docker buildx GHA cache, the package install layer (~60s) is cached across CI runs. Only the binary COPY steps rebuild per commit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
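The stage split described above can be sketched as a minimal two-stage Dockerfile, written out by a small script so it can be inspected. The base image, package list, and binary path are illustrative placeholders, not the actual image/rhel/Dockerfile contents:

```shell
#!/bin/sh
# Write out a two-stage sketch; image names and packages are illustrative only.
cat > Dockerfile.sketch <<'EOF'
# Stage 1: OS packages only. This layer is identical across commits,
# so buildx can reuse it from cache.
FROM registry.example.com/base:latest AS base-with-packages
RUN microdnf -y upgrade --nobest && \
    microdnf install --setopt=install_weak_deps=0 --nodocs -y util-linux && \
    microdnf clean all -y

# Stage 2: binaries on top. Only these COPY layers rebuild per commit.
FROM base-with-packages
COPY bin/central /stackrox/central
EOF
grep -c '^FROM' Dockerfile.sketch   # prints 2 (two stages)
```

The point of the split is ordering: package installation happens before any per-commit content enters the build, so its layer cache key never changes when only binaries do.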
The RUN step calls save-dir-contents which is in static-bin/. Copy just the helper scripts needed for package installation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace make docker-build-main-image with docker/build-push-action@v6 which handles GHA buildx cache natively. The action manages cache tokens and builder configuration that our docker-build.sh wrapper was missing. The base-with-packages Dockerfile stage (package installs) should now cache across CI runs, skipping microdnf upgrade/install when only binaries change. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each COPY --link creates an independent overlay layer. Changing one binary (e.g. bin/central) doesn't invalidate other COPY layers (ui, static-data, etc). Combined with GHA buildx cache, this means only the changed binaries need to be re-copied on warm builds. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use push: true instead of load: true on docker/build-push-action. This pushes layers directly to the registry from buildkit, avoiding the slow --load export to local docker (~90s overhead even with all layers cached). The main image is now built and pushed in one step. roxctl and central-db still use the separate push step. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Container env vars removed by Tomecz's PR aren't available as env vars anymore. Use secrets.* directly in docker/login-action. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The push-main-manifests step expects per-arch tags (e.g. main:tag-amd64) to create multi-arch manifest lists. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
docker/login-action with registry: quay.io/org doesn't work for quay.io. Use the existing registry_rw_login helper which handles quay.io authentication correctly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…r stackrox-io docker/login-action can only authenticate to one quay.io org at a time. Push main to rhacs-eng via build-push-action (fast, cached layers). Use existing push_main_image_set for stackrox-io and other images (handles multi-org login correctly by re-authenticating per org). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
build-push-action pushes directly to registry without loading locally. The existing push_main_image_set expects local images. Instead: - Push roxctl/central-db from local docker (built by make) - Copy main from rhacs-eng to stackrox-io via skopeo Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
push:true creates manifest lists instead of plain images, breaking the multi-arch manifest creation step. Revert to load:true which loads into local docker and uses the existing push_main_image_set pipeline unchanged. GHA layer cache still works with load:true (17 cached layers confirmed). Build step: 105s warm vs 110s baseline (5% faster). The --load export overhead limits the savings but the Dockerfile restructuring and COPY --link provide the foundation for future improvements. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use push:true + provenance:false to push main image directly from buildx to the registry. provenance:false produces a plain image manifest (not a manifest list), compatible with the downstream push-main-manifests job that creates multi-arch manifest lists. Login to both quay.io orgs before the build step so buildx can push to both registries. roxctl and central-db still use docker push (built locally by make). Expected: Build+push main image ~55s (vs 105s with load:true). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…g copy buildx's docker-container driver doesn't share host docker credentials. Use docker/login-action (which injects creds into the buildx builder) for rhacs-eng push. Copy main to stackrox-io via skopeo (lightweight, blobs shared on quay.io). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
docker login can only hold one quay.io credential at a time. Use skopeo --src-creds and --dest-creds to authenticate to both orgs simultaneously for the rhacs-eng → stackrox-io copy. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
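A sketch of that dual-credential copy, with `skopeo` stubbed as a dry-run echo so the snippet runs without the tool installed; the tag and credential values are placeholders (the real workflow reads them from CI secrets):

```shell
#!/bin/sh
# Stub skopeo as a dry-run echo so the sketch runs standalone.
skopeo() { echo "skopeo $*"; }

BUILD_TAG="4.11.x-example"
SRC_CREDS="rhacs-eng-robot:src-secret"    # placeholder, really a CI secret
DST_CREDS="stackrox-io-robot:dst-secret"  # placeholder, really a CI secret

# One invocation authenticates to both quay.io orgs at once,
# which a single `docker login` cannot do.
skopeo copy \
  --src-creds "$SRC_CREDS" \
  --dest-creds "$DST_CREDS" \
  "docker://quay.io/rhacs-eng/main:${BUILD_TAG}" \
  "docker://quay.io/stackrox-io/main:${BUILD_TAG}"
```

Because both repositories live on quay.io, the copy is cheap: blobs already present on the registry are shared, so only manifests move.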
Use type=registry cache backed by GHCR instead of type=gha.
Benefits:
- No 10GB size limit (GHA cache is shared across all workflows)
- Buildx pulls only needed layers (content-addressable), not full blob
- Faster restore for multi-stage builds with many cached layers
Cache images stored at ghcr.io/stackrox/stackrox/cache/main-{arch}.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Switch from docker-container to docker driver for buildx. The docker driver uses Docker Engine's built-in buildkit — no separate container to boot. No cache for this run (baseline measurement). Will add inline cache once driver change is validated. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Docker driver (68s cold, no cache) is already faster than
docker-container + GHA cache (87s). Add inline cache on top:
- cache-to: type=inline embeds cache metadata in the pushed image
- cache-from pulls the previous build (cache-{arch} tag) as source
- With COPY --link, only changed layers need rebuilding
Push a stable cache-{arch} tag on every build so the next build
has a cache reference.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
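Assembled as a single command, the flags described above might look like the following sketch. The registry path, tags, and architecture are placeholders, and `docker` is stubbed with an echo so the sketch runs without a daemon:

```shell
#!/bin/sh
# Stub docker as an echo; registry path, tag, and arch are placeholders.
docker() { echo "docker $*"; }

ARCH="amd64"
IMG="quay.io/example-org/main"

# --provenance=false pushes a plain image manifest (not a manifest list),
# --cache-from pulls the previous build's stable cache-${ARCH} tag, and
# --cache-to type=inline embeds cache metadata in the pushed image so the
# next run has a cache reference.
docker buildx build \
  --push --provenance=false \
  --cache-from "type=registry,ref=${IMG}:cache-${ARCH}" \
  --cache-to type=inline \
  -t "${IMG}:pr-tag-${ARCH}" \
  -t "${IMG}:cache-${ARCH}" \
  image/rhel
```

Tagging both the per-commit tag and the stable `cache-${ARCH}` tag in one build is what gives the next run a fixed cache reference to pull from.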
Set BUILD_TAG=0.0.0 and SHORTCOMMIT=0000000 for Go binary compilation (pre-build-go-binaries and pre-build-cli). This produces byte-identical binaries across commits that don't change Go source code. Combined with COPY --link, unchanged binaries = cached Docker layers = zero push on warm builds. The image tag (per-commit) still identifies the build; only the embedded binary version is stable. This is the key to sub-30s build+push: if Go source didn't change, the binary didn't change, the layer didn't change, nothing to push. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
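The idea can be illustrated with the ldflags string itself. The `-X` variable paths below (`main.version`, `main.commit`) are placeholders for whatever paths the project's Makefiles actually use:

```shell
#!/bin/sh
# Illustrative only: the real -X variable paths live in the Makefiles.
BUILD_TAG="0.0.0"        # fixed, instead of the per-commit release tag
SHORTCOMMIT="0000000"    # fixed, instead of `git rev-parse --short HEAD`
LDFLAGS="-X main.version=${BUILD_TAG} -X main.commit=${SHORTCOMMIT}"

# Same ldflags + same source + -trimpath => byte-identical binary,
# so the COPY --link layer digest does not change between commits.
echo go build -trimpath -ldflags "'${LDFLAGS}'" ./central
```

With per-commit values, the ldflags string (and hence the binary) changes on every build even when no Go source changed; pinning the values removes that last source of nondeterminism.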
Skipping CI for Draft Pull Request.
Hey - I've found 2 issues, and left some high level feedback:
- Setting BUILD_TAG/SHORTCOMMIT to fixed values globally in CI means binaries will always report version 0.0.0/0000000; if any downstream tooling or release jobs depend on real version metadata, consider scoping the stable ldflags only to non-release/PR builds or gating them via an env/flag.
- The new build-and-push-main logic (buildx build-push, manual tagging, skopeo copy) partially duplicates behavior that used to live in scripts like push_main_image_set; consider centralizing this push/tag logic in a shared script to avoid future drift between CI workflows and local tooling.
- The Dockerfile now relies on COPY --link and buildx/BuildKit features; please confirm that any non-GHA or local build paths that still use plain docker build (without buildx/BuildKit) are either updated to enable BuildKit or clearly documented as unsupported, to avoid subtle build failures in other environments.
## Individual Comments
### Comment 1
<location path="image/rhel/Dockerfile" line_range="82-91" />
<code_context>
+COPY --link static-bin /stackrox/
</code_context>
<issue_to_address>
**issue (bug_risk):** Using COPY --link requires BuildKit and may break non-buildx/local docker builds using this Dockerfile.
In CI this is fine because we use buildx, but any remaining `docker build` usage (local dev, older daemons, or CI jobs without BuildKit) will fail on this Dockerfile. If those paths must keep working, either make BuildKit a documented, enforced requirement for all builds that use this file (ideally with a clear failure if it’s not enabled), or drop `--link` and rely on the existing caching strategy.
</issue_to_address>
### Comment 2
<location path=".github/workflows/build.yaml" line_range="619-628" />
<code_context>
+ for image in roxctl central-db; do
</code_context>
<issue_to_address>
**suggestion (performance):** Repeated registry logins inside the image loop add avoidable overhead and latency.
`registry_rw_login` is invoked for `quay.io/rhacs-eng` and `quay.io/stackrox-io` on each loop iteration even though both credentials and registries are constant. Consider moving these two login calls outside the `for image in ...` loop so they run once per registry, and keep only the `docker push` operations inside the loop to avoid repeated API calls.
</issue_to_address>
```diff
 COPY --link static-bin /stackrox/

 COPY --from=downloads /output/rpms/ /tmp/
-COPY --from=downloads /output/go/ /go/
+COPY --link --from=downloads /output/go/ /go/

 RUN rpm --import RPM-GPG-KEY-CentOS-Official && \
     microdnf -y upgrade --nobest && \
     rpm -i --nodeps /tmp/postgres-libs.rpm && \
     rpm -i --nodeps /tmp/postgres.rpm && \
     microdnf install --setopt=install_weak_deps=0 --nodocs -y util-linux && \
     microdnf clean all -y && \
     rm /tmp/postgres.rpm /tmp/postgres-libs.rpm RPM-GPG-KEY-CentOS-Official && \
     # (Optional) Remove line below to keep package management utilities
     rpm -e --nodeps $(rpm -qa curl '*rpm*' '*dnf*' '*libsolv*' '*hawkey*' 'yum*') && \
     rm -rf /var/cache/dnf /var/cache/yum && \
     # The contents of paths mounted as emptyDir volumes in Kubernetes are saved
     # by the script `save-dir-contents` during the image build. The directory
     # contents are then restored by the script `restore-all-dir-contents`
     # during the container start.
     chown -R 4000:4000 /etc/pki/ca-trust && save-dir-contents /etc/pki/ca-trust/source && \
     mkdir -p /var/lib/stackrox && chown -R 4000:4000 /var/lib/stackrox && \
     mkdir -p /var/log/stackrox && chown -R 4000:4000 /var/log/stackrox && \
     mkdir -p /var/cache/stackrox && chown -R 4000:4000 /var/cache/stackrox && \
     chown -R 4000:4000 /tmp

 COPY --link --from=stackrox_data /stackrox-data /stackrox/static-data
 COPY --link ./docs/api/v1/swagger.json /stackrox/static-data/docs/api/v1/swagger.json
 COPY --link ./docs/api/v2/swagger.json /stackrox/static-data/docs/api/v2/swagger.json
 COPY --link THIRD_PARTY_NOTICES /THIRD_PARTY_NOTICES/

 COPY --link ui /ui
```
```shell
for image in roxctl central-db; do
  docker tag "stackrox/${image}:${BUILD_TAG}" "quay.io/rhacs-eng/${image}:${BUILD_TAG}"
  docker tag "stackrox/${image}:${BUILD_TAG}" "quay.io/rhacs-eng/${image}:${BUILD_TAG}-${{ matrix.arch }}"
  docker tag "stackrox/${image}:${BUILD_TAG}" "quay.io/stackrox-io/${image}:${BUILD_TAG}"
  docker tag "stackrox/${image}:${BUILD_TAG}" "quay.io/stackrox-io/${image}:${BUILD_TAG}-${{ matrix.arch }}"
  registry_rw_login "quay.io/rhacs-eng"
  retry 5 true docker push "quay.io/rhacs-eng/${image}:${BUILD_TAG}" | cat
  retry 5 true docker push "quay.io/rhacs-eng/${image}:${BUILD_TAG}-${{ matrix.arch }}" | cat
  registry_rw_login "quay.io/stackrox-io"
  retry 5 true docker push "quay.io/stackrox-io/${image}:${BUILD_TAG}" | cat
```
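Comment 2's suggested refactor (log in once per registry, keep only the tag/push work in the loop) might look like this sketch, with the workflow helpers `registry_rw_login`/`retry` and `docker` stubbed as echoes so it runs standalone:

```shell
#!/bin/sh
# Workflow helpers stubbed so the sketch runs standalone.
registry_rw_login() { echo "login $1"; }
retry() { shift 2; "$@"; }   # real helper retries on failure; stub just runs the command
docker() { echo "docker $*"; }

BUILD_TAG="4.11.x-example"   # placeholder tag
ARCH="amd64"                 # placeholder arch

# One login per registry, hoisted out of the loop.
registry_rw_login "quay.io/rhacs-eng"
registry_rw_login "quay.io/stackrox-io"

for image in roxctl central-db; do
  for org in rhacs-eng stackrox-io; do
    retry 5 true docker push "quay.io/${org}/${image}:${BUILD_TAG}"
    retry 5 true docker push "quay.io/${org}/${image}:${BUILD_TAG}-${ARCH}"
  done
done
```

This preserves the original tag set (shared and per-arch, both orgs) while cutting the logins from four to two per job.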
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff            @@
##           master   #19829    +/-   ##
========================================
  Coverage   49.59%   49.60%
========================================
  Files        2763     2763
  Lines      208167   208339   +172
========================================
+ Hits       103250   103341    +91
- Misses      97252    97331    +79
- Partials     7665     7667     +2
```

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
📝 Walkthrough (Summary by CodeRabbit)

The pull request updates CI/CD infrastructure to support Docker Buildx layer caching and restructures multi-stage Docker builds. It hardcodes build identifiers in certain workflow jobs, introduces GitHub Actions native Buildx support with GHA cache integration, and reorganizes the Dockerfile to isolate OS-level package setup in a separate build stage with independent layer caching via COPY --link.

Estimated code review effort: 🎯 4 (Complex), ⏱️ ~60 minutes
Pre-merge checks: ✅ 2 passed, ❌ 1 failed (1 warning)
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/build.yaml:
- Around line 588-595: The build workflow dropped the QUAY_TAG_EXPIRATION
build-arg, causing image/rhel/Dockerfile's quay.expires-after label to be empty;
restore the missing argument by adding QUAY_TAG_EXPIRATION=${{
env.QUAY_TAG_EXPIRATION || '<appropriate-default>' }} back into the build-args
block so the published main images receive the expiry label (matching how the
old docker-build-main-image Makefile target propagated QUAY_TAG_EXPIRATION).
- Around line 184-185: The workflow currently hardcodes BUILD_TAG and
SHORTCOMMIT to "0.0.0" and "0000000" (seen in the variables block) which causes
published artifacts to self-report placeholder versions; change the logic in the
pre-build-cli and pre-build-go-binaries steps so these placeholder overrides are
only applied for CI/PR/test runs (e.g., when event is pull_request or a
non-publishing branch) and are skipped for publishing contexts (push-to-master,
nightly tag builds, or when a publish flag is set); update both occurrences
(also the similar block at lines 226-230) to conditionally set
BUILD_TAG/SHORTCOMMIT based on the workflow context or a publish-specific
input/env var rather than unconditionally overriding them.
- Around line 585-586: The current push uses shared tags
(quay.io/rhacs-eng/main:${{ env.BUILD_TAG }} and quay.io/rhacs-eng/main:${{
env.BUILD_TAG }}-${{ matrix.arch }}) which lets one branding overwrite the
other; update the push steps so registry paths remain brand-aware and per-arch
by including the branding identifier in the image name (e.g., incorporate the
branding env var into the registry/repo component) for all occurrences of those
tags and the equivalent blocks mentioned (also adjust the analogous blocks at
the other ranges), stop publishing the shared ${BUILD_TAG} tag from build jobs
(only publish per-arch tags and per-arch latest-${{ matrix.arch }} on
merge-to-master) and only create/push the shared manifest tag during the
push-main-manifests / push_main_image_set manifest assembly step.
📒 Files selected for processing (3):
- .github/workflows/build.yaml
- image/rhel/Dockerfile
- scripts/docker-build.sh
```yaml
BUILD_TAG: 0.0.0
SHORTCOMMIT: "0000000"
```
Don't bake 0.0.0/0000000 into the published binaries.
These overrides apply to the same workflow that runs on push to master and *-nightly-*, so the main/roxctl/operator artifacts published from those builds will also self-report the placeholder version. If this trade-off is only acceptable for CI/PR testing, gate it away from publishing contexts.
Possible guard:

```diff
- BUILD_TAG: 0.0.0
- SHORTCOMMIT: "0000000"
+ BUILD_TAG: ${{ github.event_name == 'pull_request' && '0.0.0' || needs.define-job-matrix.outputs.build-tag }}
+ SHORTCOMMIT: ${{ github.event_name == 'pull_request' && '0000000' || needs.define-job-matrix.outputs.short-commit }}
```

Apply the same pattern in both pre-build-cli and pre-build-go-binaries.
Also applies to: 226-230
```yaml
quay.io/rhacs-eng/main:${{ env.BUILD_TAG }}
quay.io/rhacs-eng/main:${{ env.BUILD_TAG }}-${{ matrix.arch }}
```
Keep the replacement push path brand-aware and per-arch.
This matrix still builds distinct RHACS_BRANDING/STACKROX_BRANDING artifacts, but these tags are now shared across both variants and then mirrored/pushed to both registries. That lets one branding overwrite the other, and it also publishes the shared ${BUILD_TAG} tag before push-main-manifests assembles the final manifest list. The old push_main_image_set flow kept registry selection brand-aware, published per-arch tags from the build jobs, and added latest-${arch} on merge-to-master.
Also applies to: 620-629, 634-639
```yaml
build-args: |
  DEBUG_BUILD=${{ env.DEBUG_BUILD || 'no' }}
  ROX_PRODUCT_BRANDING=${{ env.ROX_PRODUCT_BRANDING }}
  TARGET_ARCH=${{ matrix.arch }}
  ROX_IMAGE_FLAVOR=development_build
  LABEL_VERSION=${{ env.BUILD_TAG }}
  LABEL_RELEASE=${{ env.BUILD_TAG }}
# Pull the previous build as cache source. With COPY --link,
```
QUAY_TAG_EXPIRATION was dropped from the new build invocation.
image/rhel/Dockerfile still populates quay.expires-after from this arg, and the old docker-build-main-image target in Makefile passed it through. Without it, the pushed main images lose their expiry label, so ephemeral PR/nightly tags can linger indefinitely.
🚀 Build Images Ready

Images are ready for commit 5752a50. To use with deploy scripts:

export MAIN_IMAGE_TAG=4.11.x-568-g5752a50433
Description
Experiment — use stable version ldflags (
BUILD_TAG=0.0.0,SHORTCOMMIT=0000000) for Go binary compilation so that DockerCOPY --linklayers are cached across builds.The insight
Currently, every commit produces different Go binaries (different version string baked in via ldflags). This means:
COPY --link bin/central /stackrox/centralcreates a new layerWith stable ldflags:
Combined with other optimizations
COPY --link(independent layers)push: true+provenance: falseWhat this changes
pre-build-go-binaries:BUILD_TAG=0.0.0,SHORTCOMMIT=0000000pre-build-cli: same0.0.0as version (acceptable for CI/PR testing)Based on #19806 (Docker buildx caching PR).
🤖 Generated with Claude Code