fix(ci): resolve disk space exhaustion in operator build #19417
Conversation
Hey - I've found 2 issues and left some high-level feedback:
- In the operator Dockerfile you hard-code default cache paths (`/workspace/pkg/mod` and `/root/.cache/go-build`); consider wiring these to `go env GOMODCACHE`/`go env GOCACHE` (or at least documenting/aligning with the base image’s defaults) to avoid subtle breakage if the Go toolchain layout changes.
- The Docker layer cache key in the workflow uses `hashFiles('**/go.sum')`, so any go module change anywhere in the repo will invalidate the operator cache; if you want tighter caching, consider scoping this to the operator-related `go.sum` files (for example `go.sum` and `operator/go.sum`).
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In the operator Dockerfile you hard-code default cache paths (`/workspace/pkg/mod` and `/root/.cache/go-build`); consider wiring these to `go env GOMODCACHE`/`go env GOCACHE` (or at least documenting/aligning with the base image’s defaults) to avoid subtle breakage if the Go toolchain layout changes.
- The Docker layer cache key in the workflow uses `hashFiles('**/go.sum')`, so any go module change anywhere in the repo will invalidate the operator cache; if you want tighter caching, consider scoping this to the operator-related `go.sum` files (for example `go.sum` and `operator/go.sum`).
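For the second point, a scoped cache key could look like the sketch below; the step name, cache path, and action version are assumptions for illustration, not taken from this PR:

```
- name: Cache Docker layers
  uses: actions/cache@v4
  with:
    path: /tmp/.buildx-cache
    # Only operator-related module changes invalidate this cache.
    key: ${{ runner.os }}-buildx-operator-${{ hashFiles('go.sum', 'operator/go.sum') }}
    restore-keys: |
      ${{ runner.os }}-buildx-operator-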
## Individual Comments
### Comment 1
<location path=".github/workflows/build.yaml" line_range="627-628" />
<code_context>
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
+ with:
+ driver-opts: |
+ image=moby/buildkit:latest
+
+ - name: Cache Docker layers
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Using `moby/buildkit:latest` can introduce non-determinism and unexpected breakage in CI builds.
Pulling `moby/buildkit:latest` makes CI behavior change as the image updates, which can cause flaky, hard-to-reproduce failures. Please pin to a specific BuildKit tag or digest so builds are deterministic and upgrades are controlled.
Suggested implementation:
```
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
with:
driver-opts: |
image=moby/buildkit:v0.15.1
```
If your project has a documented set of pinned tool versions (e.g., in a README or tooling matrix), you may want to:
1. Align the `moby/buildkit` version here (`v0.15.1`) with whatever version is documented there.
2. Optionally further harden determinism by using a digest (`image=moby/buildkit@sha256:...`) instead of a tag, and updating it intentionally when upgrading BuildKit.
</issue_to_address>
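One possible shape of the digest pin, as a sketch; the digest placeholder is illustrative and must be resolved from the registry for the chosen release:

```
- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v3
  with:
    # Pinning by digest means CI behavior only changes when this line is edited.
    driver-opts: |
      image=moby/buildkit@sha256:<digest-of-chosen-release>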
### Comment 2
<location path="operator/Makefile" line_range="363-364" />
<code_context>
+ DOCKER_BUILDKIT=1 BUILDKIT_PROGRESS=plain ../scripts/docker-build.sh \
-t ${IMG} \
$(if $(GOARCH),--build-arg TARGET_ARCH=$(GOARCH)) \
+ --build-arg GOMODCACHE_PATH=$(shell go env GOMODCACHE) \
+ --build-arg GOCACHE_PATH=$(shell go env GOCACHE) \
+ $(if $(BUILDKIT_CACHE_FROM),--cache-from $(BUILDKIT_CACHE_FROM)) \
+ $(if $(BUILDKIT_CACHE_TO),--cache-to $(BUILDKIT_CACHE_TO)) \
</code_context>
<issue_to_address>
**issue (performance):** Passing host `go env` paths into the container may misalign cache paths with the container’s Go configuration.
Inside the Docker build, GOPATH is `/workspace`, so the container’s natural `GOMODCACHE`/`GOCACHE` differ from the host’s (e.g. `/home/runner/...`). Overriding Dockerfile defaults with host-derived paths means cache mounts may not match where `go mod download` writes, reducing cache effectiveness or sending writes to unexpected locations. Prefer using the Dockerfile defaults or deriving these paths inside the container from its own `go env` instead of the host’s.
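A sketch of the "derive inside the container" idea; the function name and the hard-coded fallbacks are illustrative (the fallbacks mirror the defaults quoted in this review):

```shell
#!/bin/sh
# Resolve a Go cache dir inside the container: explicit override first,
# then the container's own `go env`, then a hard-coded default.
resolve_cache_dir() {
  override="$1"; goenv_key="$2"; default="$3"
  if [ -n "$override" ]; then
    echo "$override"
  elif command -v go >/dev/null 2>&1; then
    go env "$goenv_key"
  else
    echo "$default"
  fi
}

# With an explicit override, the override always wins:
resolve_cache_dir /custom/mod GOMODCACHE /workspace/pkg/mod  # -> /custom/mod
```

This way the container never inherits a host path like `/home/runner/...` that does not exist in its own filesystem layout.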
</issue_to_address>
.github/workflows/build.yaml (outdated):

```
scripts/lib.sh retry 6 true make -C operator/ docker-build
# Move cache to avoid unbounded growth
rm -rf /tmp/.buildx-cache
mv /tmp/.buildx-cache-new /tmp/.buildx-cache
```
Is this because `BUILDKIT_CACHE_TO` is now the actual (refreshed) cache in use, and `*_FROM` is no longer relevant?
operator Dockerfile:

```
ARG roxpath
# Cache mount paths - can be overridden to match local go env
ARG GOMODCACHE_PATH=/workspace/pkg/mod
ARG GOCACHE_PATH=/root/.cache/go-build
```
💡 Great! So this allows the Go build within the Docker build to use the GOCACHE saved into the GitHub Actions cache and shared across CI runs?
That's the plan. It will also use the local cache when run locally.
Problem:
- Build Operator image action running out of disk space during go mod download
- Custom apollo-ci container limiting available space
- Proto files regenerated twice unnecessarily

Solution:
- Add BuildKit cache mounts to Dockerfile for go modules and build cache
- Configure cache paths to use local go env values for development
- Add GitHub Actions cache persistence for Docker layers
- Remove custom apollo-ci container, use ubuntu-latest (~60GB more space)
- Skip redundant proto generation in second build with ROX_OPERATOR_SKIP_PROTO_GENERATED_SRCS

Performance improvements:
- Local builds: 0 module downloads (100% cache hit), 43% faster (~3.4min vs ~6min)
- CI builds: Expected 60-70% faster on cached runs (~3-5min vs ~10-15min)
- Disk space: ~60GB more available without custom container

User request: Fix operator build disk space issues and optimize cache usage.

Note: Changes include AI-assisted code generation and optimization.
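The cache-mount part of this commit can be sketched as a Dockerfile fragment like the following; the surrounding build stage and the exact `go` invocation are assumptions, while the default paths match those quoted in the review:

```
ARG GOMODCACHE_PATH=/workspace/pkg/mod
ARG GOCACHE_PATH=/root/.cache/go-build

# BuildKit cache mounts persist the module and build caches across builds
# without baking them into any image layer.
RUN --mount=type=cache,target=${GOMODCACHE_PATH} \
    --mount=type=cache,target=${GOCACHE_PATH} \
    go mod download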
Force-pushed from 306d14e to 73138a1.
Replace type=local cache with type=gha (GitHub Actions cache backend):
- Removes need for manual cache rotation (rm/mv dance)
- Automatic cache size management by GitHub Actions
- Per-architecture cache scoping
- Simpler, cleaner code (-13 lines)

The cache rotation was needed with type=local to prevent unbounded growth, but type=gha handles this automatically.
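With the gha backend, the build step collapses to something like this sketch; the scope name, matrix variable, and `IMG` environment variable are assumptions for illustration:

```
- name: Build operator image
  run: |
    docker buildx build \
      --cache-from type=gha,scope=operator-${{ matrix.goarch }} \
      --cache-to type=gha,mode=max,scope=operator-${{ matrix.goarch }} \
      -t "${IMG}" .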
Images are ready for the commit at d682788. To use with deploy scripts, first …
Codecov Report ✅ All modified and coverable lines are covered by tests. Additional details and impacted files:

```
@@           Coverage Diff           @@
##           master   #19417   +/-  ##
=======================================
  Coverage   49.70%   49.71%
=======================================
  Files        2701     2701
  Lines      203453   203453
=======================================
+ Hits       101134   101141       +7
+ Misses      94790    94786       -4
+ Partials     7529     7526       -3
```
Add setup-go action to install Go 1.25 (required by go.mod). The apollo-ci container had Go 1.25, but ubuntu-latest only has Go 1.24. Without this, the build fails with: go: go.mod requires go >= 1.25.0 (running go 1.24.13)
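A minimal sketch of such a step, assuming the action reads the version requirement straight from go.mod (the action version shown is an assumption):

```
- name: Set up Go
  uses: actions/setup-go@v5
  with:
    # Resolves the `go >= 1.25.0` directive from go.mod.
    go-version-file: go.mod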
Enable cache: true in setup-go action and remove the custom cache-go-dependencies step. setup-go handles caching automatically and is simpler than the custom action. This removes 5 lines of configuration.
Only pass GOMODCACHE_PATH and GOCACHE_PATH build args if they're set as environment variables. This allows:
- CI: Uses container defaults (/workspace/pkg/mod, /root/.cache/go-build), which work correctly with BuildKit cache mounts
- Local dev: Can optionally set GOMODCACHE_PATH and GOCACHE_PATH env vars to use custom cache locations (e.g., /mnt/cache/go-mod)

Previously, always passing $(shell go env GOMODCACHE) caused issues in CI because the host path didn't match the container path, breaking cache mounts and causing disk space exhaustion.
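In Makefile terms, the conditional forwarding could look like this sketch; the target and script names mirror the diff quoted earlier in the review, and the rest of the recipe is assumed:

```
docker-build:
	DOCKER_BUILDKIT=1 ../scripts/docker-build.sh -t $(IMG) \
		$(if $(GOMODCACHE_PATH),--build-arg GOMODCACHE_PATH=$(GOMODCACHE_PATH)) \
		$(if $(GOCACHE_PATH),--build-arg GOCACHE_PATH=$(GOCACHE_PATH))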
Add setup-python step to install Python 3.9 required by operator bundle build. The bundle build uses pip 21.3.1 and setuptools 59.6.0 which require Python 3.9 (they use pkgutil.ImpImporter, removed in Python 3.12). The apollo-ci container had Python 3.9, but ubuntu-latest has Python 3.12. Without this, bundle build fails with: AttributeError: module 'pkgutil' has no attribute 'ImpImporter'
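A sketch of the pinning step; the action version is an assumption, and the exact patch release is left to the action's resolver:

```
- name: Set up Python
  uses: actions/setup-python@v5
  with:
    # pip 21.3.1 / setuptools 59.6.0 rely on pkgutil.ImpImporter,
    # which was removed in Python 3.12.
    python-version: '3.9'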
Restore ./.github/actions/cache-go-dependencies instead of setup-go cache. The cache-go-dependencies action is architecture-aware and creates separate caches for each GOARCH (amd64, arm64, ppc64le, s390x), which is essential for cross-platform builds. setup-go's built-in cache doesn't handle GOARCH-specific caching well, causing cache collisions between different architecture builds.
Set ROX_OPERATOR_SKIP_PROTO_GENERATED_SRCS=true for all operator builds:
- Bundle build
- Unit tests
- Native arch build (for dependencies)
- Target arch build (already had this)

Benefits:
- Faster builds: No protoc download or code generation
- Less disk space: No intermediate proto tools (~1-2GB saved)
- More reliable: Fewer steps that can fail
- Proto sources are already committed to the repo

This reduces the bundle build from ~2 minutes to ~30 seconds.
Remove 'Operator unit tests' step from build-and-push-operator job. These tests are already run in the unit-tests.yaml workflow via 'make go-unit-tests', which includes all operator unit tests (operator/api/, operator/internal/, operator/controllers/, etc.). Running them again in the build job is redundant and wastes CI time.

Benefits:
- Faster builds: ~1-2 minutes saved per architecture
- Less resource usage: Tests only run once instead of twice
- Clearer separation: Tests in unit-tests.yaml, builds in build.yaml