
ci: split Go unit test job into parallel shards#19702

Draft
guzalv wants to merge 3 commits into master from ci/split-go-unit-tests

Conversation

Contributor

@guzalv guzalv commented Mar 30, 2026

Description

Split the single go unit test CI job into 4 parallel component shards to
reduce wall-clock feedback time. Each shard runs independently.

Measured impact

Wall-clock time (first start to last finish):

| Metric  | Baseline (monolithic) | Split (sharded) | Delta |
|---------|-----------------------|-----------------|-------|
| Average | 35.2 min              | 19.5 min        | -45%  |
| Range   | 32.0 - 39.1 min       | 19.5 min        | -     |

Baseline: 10 measurements from 5 recent master runs (2 gotags x 5 runs).
Split: first clean run on this PR.

Per-shard timings:

| Shard                          | Packages | Time            |
|--------------------------------|----------|-----------------|
| central                        | 469      | 18.8 - 19.5 min |
| pkg                            | 244      | 17.8 - 18.7 min |
| sensor + compliance            | 90       | 10.8 - 11.0 min |
| other                          | 137      | 10.9 - 11.4 min |
| extra (integration + operator) | -        | 10.5 - 10.6 min |

Baseline monolithic job step breakdown (one representative run):

  • Go Unit Tests step: 29.3 min
  • Go Integration + Operator tests: 3.3 min
  • Setup (container, checkout, cache): 3.0 min

Failure isolation (validated with deliberate test failure):

  • Introduced a failing test in sensor/common/sensor/
  • Only the sensor shard failed (~11 min); central, pkg, other, and extra shards all passed independently
  • Failure was immediately identifiable from the job name (Go Unit Tests (sensor, ...))

Trade-offs

  • Total runner-minutes roughly double (~70 to ~140 min). Per-job setup overhead (~3 min each x 10 jobs) accounts for ~30 min of the increase. This is the main cost.
  • Codecov project check may need configuration to handle multiple partial coverage uploads (currently shows as failing).

Reasons for this change (validated)

  1. Faster feedback (validated): Wall-clock time drops 45% (35 min to 19.5 min). Developers see results ~15 minutes sooner.
  2. Faster failure identification (validated): A sensor test failure is surfaced in ~11 min vs ~35 min. The job name immediately tells you which component failed.
  3. Cheaper re-runs (validated): Re-running a failed sensor shard costs ~11 min vs re-running the full monolithic job at ~35 min. GitHub Actions re-run failed jobs only re-runs the failed matrix combinations.
  4. Narrower blast radius (structural): A flaky test in one component does not block or delay results from other components.

Claims I initially made that are not clearly supported

  • "Better caching" — each shard restores the same full Go module cache. The cache is not smaller or faster per-shard. This claim is dropped.

What changed

  • Makefile: Added UNIT_TEST_PKG_INCLUDE/UNIT_TEST_PKG_EXCLUDE grep filter variables with no-op defaults. Local make test behavior is unchanged. Extracted go-log-level-tests into its own target (still included in make test).
  • Workflow: Replaced single go job with go-unit-tests (4-component matrix x 2 gotags = 8 jobs) and go-extra-tests (integration + operator tests, 2 jobs). Updated slack notification references.
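The filter mechanics can be sketched in shell. This is a minimal illustration under assumptions: the package names and the sensor-shard pattern below are hypothetical stand-ins, not the actual Makefile values or workflow wiring.

```shell
#!/bin/sh
# No-op defaults, as described in the PR: the include pattern matches anything,
# and the exclude pattern matches only empty lines, so the unfiltered package
# list passes through untouched and local `make test` behavior is unchanged.
UNIT_TEST_PKG_INCLUDE='.'
UNIT_TEST_PKG_EXCLUDE='^$'

list_pkgs() {
    # Stand-in for `go list ./...` (hypothetical package names).
    printf '%s\n' \
        github.com/stackrox/rox/central/foo \
        github.com/stackrox/rox/pkg/bar \
        github.com/stackrox/rox/sensor/baz
}

echo "== default: all three packages pass through =="
list_pkgs | grep -E "$UNIT_TEST_PKG_INCLUDE" | grep -Ev "$UNIT_TEST_PKG_EXCLUDE"

echo "== sensor shard (hypothetical pattern): only the sensor package =="
UNIT_TEST_PKG_INCLUDE='/(sensor|compliance)/'
list_pkgs | grep -E "$UNIT_TEST_PKG_INCLUDE" | grep -Ev "$UNIT_TEST_PKG_EXCLUDE"
```

CI would override the variables per shard, while the no-op defaults keep both grep stages transparent for everyone else.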

User-facing documentation

Testing and quality

  • the change is production ready: the change is GA, or otherwise the functionality is gated by a feature flag
  • CI results are inspected

Automated testing

  • added unit tests
  • added e2e tests
  • added regression tests
  • added compatibility tests
  • modified existing tests

How I validated my change

  1. Verified filter patterns cover all 940 packages exactly: 469 + 244 + 90 + 137 = 940 (no gaps, no overlaps).
  2. First clean CI run: all 10 Go test jobs passed.
  3. Deliberate failure test: introduced a failing test in sensor/common/sensor/, confirmed only the sensor shard failed while all other shards passed.
  4. Compared wall-clock times against 5 recent master runs (10 data points).
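The partition check in step 1 can be approximated with a shell sketch. The patterns and package list here are illustrative assumptions; the real check would run the actual shard filters against all 940 packages.

```shell
#!/bin/sh
# Illustrative partition check: the per-shard counts must sum to the total
# package count, indicating no package is dropped or double-counted.
PKGS='github.com/stackrox/rox/central/a
github.com/stackrox/rox/pkg/b
github.com/stackrox/rox/sensor/c
github.com/stackrox/rox/migrator/d'

central=$(echo "$PKGS" | grep -Ec '/central/')
pkg=$(echo "$PKGS" | grep -Ec '/pkg/')
sensor=$(echo "$PKGS" | grep -Ec '/(sensor|compliance)/')
# "other" is everything the named shards do not claim.
other=$(echo "$PKGS" | grep -Evc '/(central|pkg|sensor|compliance)/')
total=$(echo "$PKGS" | grep -c '')

if [ $((central + pkg + sensor + other)) -eq "$total" ]; then
    echo "partition OK: $total packages, no gaps or overlaps"
else
    echo "partition BROKEN: $((central + pkg + sensor + other)) != $total" >&2
    exit 1
fi
```

The sum test catches a gap or an overlap individually; a stricter check could additionally verify each package matches exactly one include pattern.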

The single `go` job running all ~940 Go unit test packages is split into
4 parallel shards (central, pkg, sensor, other) × 2 gotags variants = 8
parallel jobs. This reduces wall-clock feedback time significantly.

Changes:
- Makefile: add UNIT_TEST_PKG_INCLUDE/EXCLUDE filter variables (grep
  patterns) to go-unit-tests and go-postgres-unit-tests targets. Defaults
  are no-op so local `make test` behavior is unchanged.
- Makefile: extract go-log-level-tests into its own target (run only in
  the pkg shard in CI, still included in `make test`).
- Workflow: replace single `go` job with `go-unit-tests` (4-component
  matrix) and `go-extra-tests` (integration + operator tests).
- Workflow: update slack notification needs and mention_author references.

Package distribution across shards:
  central: 469, pkg: 244, sensor+compliance: 90, other: 137

Partially generated by AI.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@openshift-ci

openshift-ci bot commented Mar 30, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Contributor

@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 1 issue

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location path="Makefile" line_range="545-547" />
<code_context>
-	# Exercise the logging package for all supported logging levels to make sure that initialization works properly
+
+# Exercise the logging package for all supported logging levels to make sure that initialization works properly.
+.PHONY: go-log-level-tests
+go-log-level-tests:
 	@echo "Run log tests"
 	for encoding in console json; do \
</code_context>
<issue_to_address>
**suggestion:** The new `go-log-level-tests` target does not depend on `build-prep`/`test-prep`, which may cause issues when run directly.

When this logic lived under `go-unit-tests`, it inherited `build-prep` and `test-prep`. Now that it’s a separate `go-log-level-tests` target, that prep only runs when invoked via a higher-level target like `make test`. Invoking `make go-log-level-tests` directly could fail due to missing setup (e.g., env, `test-output/`, or other prep side effects). Please add `build-prep test-prep` as prerequisites (or factor out a shared prerequisite target) so it behaves correctly when run standalone.

```suggestion
# Exercise the logging package for all supported logging levels to make sure that initialization works properly.
.PHONY: go-log-level-tests
go-log-level-tests: build-prep test-prep
```
</issue_to_address>


@coderabbitai

coderabbitai bot commented Mar 30, 2026

Walkthrough

Split the original single go CI job into two jobs—go-unit-tests (matrix over gotags and component, with package include/exclude filters and conditional log-level step) and go-extra-tests (integration/helm targets). Makefile adds CI sharding variables and a go-log-level-tests target; a failing probe test was added under sensor CI shard.

Changes

  • GitHub Actions workflow (.github/workflows/unit-tests.yaml): Replaced the single go job with go-unit-tests (matrix on gotags and component, parameterized envs and conditional go-log-level-tests step, Jira junit2jira reporting) and go-extra-tests (runs integration/helm Make targets, own matrix and Jira step). Updated downstream Slack failure dependencies and mention_author to use the new job outputs.
  • Makefile: Added UNIT_TEST_PKG_INCLUDE ?= . and UNIT_TEST_PKG_EXCLUDE ?= ^$$. Applied include/exclude filtering to go-unit-tests and go-postgres-unit-tests package discovery via grep -E/grep -Ev. Added a .PHONY go-log-level-tests target (runs logging package tests across encodings/levels) and added it to the test dependencies.
  • Sensor CI probe test (sensor/common/sensor/ci_shard_probe_test.go): Added TestCIShardIsolationProbe, which unconditionally fails (calls t.Fatal), intended as a CI shard probe; no production code changes.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 3 passed

  • Title check ✅ Passed — The title directly and clearly summarizes the main change: splitting a monolithic Go unit test CI job into parallel shards to improve efficiency.
  • Docstring coverage ✅ Passed — Docstring coverage is 100.00%, which is sufficient (required threshold: 80.00%).
  • Description check ✅ Passed — The pull request description is comprehensive and thorough, covering all required template sections with detailed explanations, measured impact data, validation methodology, and trade-offs analysis.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/unit-tests.yaml:
- Around line 40-41: The workflow currently exposes outputs like new-jiras from
the junit2jira step but for matrix jobs (jobs named go-unit-tests,
go-extra-tests, go-postgres) GitHub Actions only preserves the last completed
shard’s outputs; change this by having each matrix shard upload its junit2jira
result (e.g., JSON or text) as a per-shard artifact (use unique names including
matrix variables), then add a non-matrix aggregation job that runs after the
matrix (needs depends-on: the matrix job), downloads all shard artifacts,
merges/concatenates them to produce the combined new-jiras value, and sets that
aggregated value as the workflow output (replacing the current direct step
output reference to junit2jira.outputs.new-jiras). Ensure you reference the
junit2jira step name in each shard for artifact creation and the aggregation job
reads those artifacts to compute the final output.


📥 Commits

Reviewing files that changed from the base of the PR and between 574b0ee and 41df802.

📒 Files selected for processing (2)
  • .github/workflows/unit-tests.yaml
  • Makefile

Comment on lines 40 to 41
outputs:
new-jiras: ${{ steps.junit2jira.outputs.new-jiras }}


⚠️ Potential issue | 🟡 Minor

Matrix job outputs capture only the last completed shard, not all shards.

For matrix jobs, GitHub Actions only captures the output from the last completed matrix job, not an aggregation of all. Since go-unit-tests runs 8 parallel jobs (4 components × 2 gotags), the new-jiras output will only reflect one shard's result, potentially missing Jira tickets created by other shards. The same applies to go-extra-tests and go-postgres.

Consider implementing an aggregation job that collects outputs from all matrix shards using artifact uploads and aggregation, rather than relying on the single final shard output.
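A minimal sketch of that artifact-based aggregation, assuming each shard writes its junit2jira result to new-jiras.json; the job, artifact, and file names here are illustrative, not the actual workflow:

```yaml
jobs:
  go-unit-tests:
    strategy:
      matrix:
        component: [central, pkg, sensor, other]
    runs-on: ubuntu-latest
    steps:
      # ... run tests and junit2jira, writing new-jiras.json ...
      - name: Upload per-shard junit2jira result
        uses: actions/upload-artifact@v4
        with:
          name: new-jiras-${{ matrix.component }}
          path: new-jiras.json

  aggregate-jiras:
    needs: go-unit-tests
    if: always()
    runs-on: ubuntu-latest
    outputs:
      new-jiras: ${{ steps.merge.outputs.new-jiras }}
    steps:
      - uses: actions/download-artifact@v4
        with:
          pattern: new-jiras-*
          merge-multiple: true
          path: shards
      - id: merge
        run: echo "new-jiras=$(jq -cs 'add' shards/*.json)" >> "$GITHUB_OUTPUT"
```

Because the non-matrix aggregate-jiras job downloads every shard's artifact, its output reflects all shards rather than whichever matrix job happened to finish last.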


@codecov

codecov bot commented Mar 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 49.65%. Comparing base (574b0ee) to head (38dbc41).
⚠️ Report is 1 commit behind head on master.

Additional details and impacted files
@@             Coverage Diff             @@
##           master   #19702       +/-   ##
===========================================
+ Coverage   49.02%   49.65%    +0.63%     
===========================================
  Files         768     2747     +1979     
  Lines       74760   207261   +132501     
===========================================
+ Hits        36648   102909    +66261     
- Misses      34756    96697    +61941     
- Partials     3356     7655     +4299     
| Flag          | Coverage Δ              |
|---------------|-------------------------|
| go-unit-tests | 49.65% <ø> (+0.63%) ⬆️ |

Flags with carried forward coverage won't be shown.

@rhacs-bot
Contributor

rhacs-bot commented Mar 30, 2026

Images are ready for the commit at 38dbc41.

To use with deploy scripts, first export MAIN_IMAGE_TAG=4.11.x-491-g38dbc41367.

Temporary commit to prove that the split CI jobs correctly isolate failures
to the failing component's shard. Will be reverted after data collection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@sensor/common/sensor/ci_shard_probe_test.go`:
- Around line 6-7: Update TestCIShardIsolationProbe to gate the deliberate
failure behind an explicit opt-in (default skip): at the top of the test
(function TestCIShardIsolationProbe) check an environment variable or test flag
(for example os.Getenv("RUN_CI_SHARD_PROBE") == "1") and call t.Skip("skipping
CI shard isolation probe; set RUN_CI_SHARD_PROBE=1 to run") when not enabled;
only call t.Fatal("deliberate failure to prove shard isolation") when the opt-in
is present so the shard does not fail by default.
📥 Commits

Reviewing files that changed from the base of the PR and between 41df802 and 5d63808.

📒 Files selected for processing (1)
  • sensor/common/sensor/ci_shard_probe_test.go

Comment on lines +6 to +7
func TestCIShardIsolationProbe(t *testing.T) {
t.Fatal("deliberate failure to prove shard isolation")


⚠️ Potential issue | 🟠 Major

Guard the deliberate failure to avoid persistent CI breakage.

Line 7 always fails, and this package is routed into the sensor shard by .github/workflows/unit-tests.yaml (Lines 33-35, 47-49) and Makefile (Lines 539-543). If this isn’t reverted immediately, that shard stays red on every run. Please gate the probe behind an explicit opt-in flag (default skip).

Proposed safe probe pattern
 import "testing"
+import "os"

 // Temporary test to validate CI shard isolation - will be reverted.
 func TestCIShardIsolationProbe(t *testing.T) {
+	if os.Getenv("CI_SHARD_ISOLATION_PROBE") != "1" {
+		t.Skip("set CI_SHARD_ISOLATION_PROBE=1 to run shard isolation probe")
+	}
 	t.Fatal("deliberate failure to prove shard isolation")
 }

As per coding guidelines: "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."


@coderabbitai

coderabbitai bot commented Mar 30, 2026

CodeRabbit chat interactions are restricted to organization members for this repository. Ask an organization member to interact with CodeRabbit, or set chat.allow_non_org_members: true in your configuration.

Shard isolation validated: sensor shard failed in ~11min while all other
shards (central, pkg, other, extra) passed independently.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
