Skip to content

fix(ci): resolve disk space exhaustion in operator build#19417

Draft
janisz wants to merge 9 commits intomasterfrom
clenup_operator_build
Draft

fix(ci): resolve disk space exhaustion in operator build#19417
janisz wants to merge 9 commits intomasterfrom
clenup_operator_build

Conversation

@janisz
Copy link
Contributor

@janisz janisz commented Mar 13, 2026

Problem:

  • Build Operator image action running out of disk space during go mod download
  • Custom apollo-ci container limiting available space
  • Proto files regenerated twice unnecessarily

Solution:

  • Add BuildKit cache mounts to Dockerfile for go modules and build cache
  • Configure cache paths to use local go env values for development
  • Add GitHub Actions cache persistence for Docker layers
  • Remove custom apollo-ci container, use ubuntu-latest (~60GB more space)
  • Skip redundant proto generation in second build with ROX_OPERATOR_SKIP_PROTO_GENERATED_SRCS

Performance improvements:

  • Local builds: 0 module downloads (100% cache hit), 43% faster (~3.4min vs ~6min)
  • CI builds: Expected 60-70% faster on cached runs (~3-5min vs ~10-15min)
  • Disk space: ~60GB more available without custom container

Note: Changes include AI-assisted code generation and optimization.

@janisz janisz requested review from a team as code owners March 13, 2026 14:57
@janisz janisz requested review from vladbologa and removed request for a team March 13, 2026 14:57
@janisz janisz marked this pull request as draft March 13, 2026 14:57
Copy link
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 2 issues, and left some high level feedback:

  • In the operator Dockerfile you hard-code default cache paths (/workspace/pkg/mod and /root/.cache/go-build); consider wiring these to go env GOMODCACHE/go env GOCACHE (or at least documenting/aligning with the base image’s defaults) to avoid subtle breakage if the Go toolchain layout changes.
  • The Docker layer cache key in the workflow uses hashFiles('**/go.sum'), so any go module change anywhere in the repo will invalidate the operator cache; if you want tighter caching, consider scoping this to the operator-related go.sum files (for example go.sum and operator/go.sum).
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In the operator Dockerfile you hard-code default cache paths (`/workspace/pkg/mod` and `/root/.cache/go-build`); consider wiring these to `go env GOMODCACHE`/`go env GOCACHE` (or at least documenting/aligning with the base image’s defaults) to avoid subtle breakage if the Go toolchain layout changes.
- The Docker layer cache key in the workflow uses `hashFiles('**/go.sum')`, so any go module change anywhere in the repo will invalidate the operator cache; if you want tighter caching, consider scoping this to the operator-related `go.sum` files (for example `go.sum` and `operator/go.sum`).

## Individual Comments

### Comment 1
<location path=".github/workflows/build.yaml" line_range="627-628" />
<code_context>
       - name: Set up Docker Buildx
         uses: docker/setup-buildx-action@v3
+        with:
+          driver-opts: |
+            image=moby/buildkit:latest
+
+      - name: Cache Docker layers
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Using `moby/buildkit:latest` can introduce non-determinism and unexpected breakage in CI builds.

Pulling `moby/buildkit:latest` makes CI behavior change as the image updates, which can cause flaky, hard-to-reproduce failures. Please pin to a specific BuildKit tag or digest so builds are deterministic and upgrades are controlled.

Suggested implementation:

```
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
        with:
          driver-opts: |
            image=moby/buildkit:v0.15.1

```

If your project has a documented set of pinned tool versions (e.g., in a README or tooling matrix), you may want to:
1. Align the `moby/buildkit` version here (`v0.15.1`) with whatever version is documented there.
2. Optionally further harden determinism by using a digest (`image=moby/buildkit@sha256:...`) instead of a tag, and updating it intentionally when upgrading BuildKit.
</issue_to_address>

### Comment 2
<location path="operator/Makefile" line_range="363-364" />
<code_context>
+	DOCKER_BUILDKIT=1 BUILDKIT_PROGRESS=plain ../scripts/docker-build.sh \
 		-t ${IMG} \
 		$(if $(GOARCH),--build-arg TARGET_ARCH=$(GOARCH)) \
+		--build-arg GOMODCACHE_PATH=$(shell go env GOMODCACHE) \
+		--build-arg GOCACHE_PATH=$(shell go env GOCACHE) \
+		$(if $(BUILDKIT_CACHE_FROM),--cache-from $(BUILDKIT_CACHE_FROM)) \
+		$(if $(BUILDKIT_CACHE_TO),--cache-to $(BUILDKIT_CACHE_TO)) \
</code_context>
<issue_to_address>
**issue (performance):** Passing host `go env` paths into the container may misalign cache paths with the container’s Go configuration.

Inside the Docker build, GOPATH is `/workspace`, so the container’s natural `GOMODCACHE`/`GOCACHE` differ from the host’s (e.g. `/home/runner/...`). Overriding Dockerfile defaults with host-derived paths means cache mounts may not match where `go mod download` writes, reducing cache effectiveness or sending writes to unexpected locations. Prefer using the Dockerfile defaults or deriving these paths inside the container from its own `go env` instead of the host’s.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Copy link
Contributor

@davdhacs davdhacs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great, and fast!

scripts/lib.sh retry 6 true make -C operator/ docker-build
# Move cache to avoid unbounded growth
rm -rf /tmp/.buildx-cache
mv /tmp/.buildx-cache-new /tmp/.buildx-cache
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this because the BUILDKIT_CACHE_TO is the actual used(re-freshed) cache (and the *_FROM is now not relevant)?

ARG roxpath
# Cache mount paths - can be overridden to match local go env
ARG GOMODCACHE_PATH=/workspace/pkg/mod
ARG GOCACHE_PATH=/root/.cache/go-build
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Great! So this allows the go build within the docker build to use the gocache saved into github actions cache and shared across CI runs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's the plan also it will use the local cache when run locally

Problem:
- Build Operator image action running out of disk space during go mod download
- Custom apollo-ci container limiting available space
- Proto files regenerated twice unnecessarily

Solution:
- Add BuildKit cache mounts to Dockerfile for go modules and build cache
- Configure cache paths to use local go env values for development
- Add GitHub Actions cache persistence for Docker layers
- Remove custom apollo-ci container, use ubuntu-latest (~60GB more space)
- Skip redundant proto generation in second build with ROX_OPERATOR_SKIP_PROTO_GENERATED_SRCS

Performance improvements:
- Local builds: 0 module downloads (100% cache hit), 43% faster (~3.4min vs ~6min)
- CI builds: Expected 60-70% faster on cached runs (~3-5min vs ~10-15min)
- Disk space: ~60GB more available without custom container

User request: Fix operator build disk space issues and optimize cache usage.

Note: Changes include AI-assisted code generation and optimization.
@janisz janisz force-pushed the clenup_operator_build branch from 306d14e to 73138a1 Compare March 13, 2026 16:09
Replace type=local cache with type=gha (GitHub Actions cache backend):
- Removes need for manual cache rotation (rm/mv dance)
- Automatic cache size management by GitHub Actions
- Per-architecture cache scoping
- Simpler, cleaner code (-13 lines)

The cache rotation was needed with type=local to prevent unbounded
growth, but type=gha handles this automatically.
@rhacs-bot
Copy link
Contributor

rhacs-bot commented Mar 13, 2026

Images are ready for the commit at d682788.

To use with deploy scripts, first export MAIN_IMAGE_TAG=4.11.x-327-gd68278838b.

@codecov
Copy link

codecov bot commented Mar 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 49.71%. Comparing base (66a258b) to head (d682788).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #19417   +/-   ##
=======================================
  Coverage   49.70%   49.71%           
=======================================
  Files        2701     2701           
  Lines      203453   203453           
=======================================
+ Hits       101134   101141    +7     
+ Misses      94790    94786    -4     
+ Partials     7529     7526    -3     
Flag Coverage Δ
go-unit-tests 49.71% <ø> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

janisz added 7 commits March 13, 2026 18:05
Add setup-go action to install Go 1.25 (required by go.mod).

The apollo-ci container had Go 1.25, but ubuntu-latest only has Go 1.24.
Without this, the build fails with:
  go: go.mod requires go >= 1.25.0 (running go 1.24.13)
Enable cache: true in setup-go action and remove the custom
cache-go-dependencies step. setup-go handles caching automatically
and is simpler than the custom action.

This removes 5 lines of configuration.
Only pass GOMODCACHE_PATH and GOCACHE_PATH build args if they're set
as environment variables. This allows:

- CI: Uses container defaults (/workspace/pkg/mod, /root/.cache/go-build)
  which work correctly with BuildKit cache mounts
- Local dev: Can optionally set GOMODCACHE_PATH and GOCACHE_PATH env vars
  to use custom cache locations (e.g., /mnt/cache/go-mod)

Previously, always passing $(shell go env GOMODCACHE) caused issues in CI
because the host path didn't match the container path, breaking cache mounts
and causing disk space exhaustion.
Add setup-python step to install Python 3.9 required by operator bundle build.

The bundle build uses pip 21.3.1 and setuptools 59.6.0 which require
Python 3.9 (they use pkgutil.ImpImporter, removed in Python 3.12).

The apollo-ci container had Python 3.9, but ubuntu-latest has Python 3.12.
Without this, bundle build fails with:
  AttributeError: module 'pkgutil' has no attribute 'ImpImporter'
Restore ./.github/actions/cache-go-dependencies instead of setup-go cache.

The cache-go-dependencies action is architecture-aware and creates
separate caches for each GOARCH (amd64, arm64, ppc64le, s390x), which
is essential for cross-platform builds.

setup-go's built-in cache doesn't handle GOARCH-specific caching well,
causing cache collisions between different architecture builds.
Set ROX_OPERATOR_SKIP_PROTO_GENERATED_SRCS=true for all operator builds:
- Bundle build
- Unit tests
- Native arch build (for dependencies)
- Target arch build (already had this)

Benefits:
- Faster builds: No protoc download or code generation
- Less disk space: No intermediate proto tools (~1-2GB saved)
- More reliable: Fewer steps that can fail
- Proto sources are already committed to the repo

This reduces the bundle build from ~2 minutes to ~30 seconds.
Remove 'Operator unit tests' step from build-and-push-operator job.

These tests are already run in the unit-tests.yaml workflow via
'make go-unit-tests', which includes all operator unit tests
(operator/api/, operator/internal/, operator/controllers/, etc.).

Running them again in the build job is redundant and wastes CI time.

Benefits:
- Faster builds: ~1-2 minutes saved per architecture
- Less resource usage: Tests only run once instead of twice
- Clearer separation: Tests in unit-tests.yaml, builds in build.yaml
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants