ci: split Go unit test job into parallel shards #19702
Conversation
The single `go` job running all ~940 Go unit test packages is split into 4 parallel shards (central, pkg, sensor, other) × 2 gotags variants = 8 parallel jobs. This significantly reduces wall-clock feedback time.

Changes:
- Makefile: add `UNIT_TEST_PKG_INCLUDE`/`UNIT_TEST_PKG_EXCLUDE` filter variables (grep patterns) to the `go-unit-tests` and `go-postgres-unit-tests` targets. Defaults are no-op, so local `make test` behavior is unchanged.
- Makefile: extract `go-log-level-tests` into its own target (run only in the pkg shard in CI, still included in `make test`).
- Workflow: replace the single `go` job with `go-unit-tests` (4-component matrix) and `go-extra-tests` (integration + operator tests).
- Workflow: update slack notification `needs` and `mention_author` references.

Package distribution across shards: central: 469, pkg: 244, sensor+compliance: 90, other: 137

Partially generated by AI.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
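The grep-based shard filtering can be sketched as follows. Only the variable names `UNIT_TEST_PKG_INCLUDE`/`UNIT_TEST_PKG_EXCLUDE` come from the PR; the package paths and the wiring into `go list` are illustrative assumptions:

```shell
# Simulated `go list ./...` output; in CI the real list comes from the Go toolchain.
pkgs='github.com/stackrox/rox/central/foo
github.com/stackrox/rox/pkg/bar
github.com/stackrox/rox/sensor/baz
github.com/stackrox/rox/migrator/qux'

# No-op defaults: include everything, exclude nothing (local `make test` unchanged).
UNIT_TEST_PKG_INCLUDE='.*'
UNIT_TEST_PKG_EXCLUDE='^$'

# A hypothetical "sensor" shard would narrow the include pattern:
UNIT_TEST_PKG_INCLUDE='/sensor/'

# Apply include filter, then drop excluded packages.
echo "$pkgs" | grep -E "$UNIT_TEST_PKG_INCLUDE" | grep -vE "$UNIT_TEST_PKG_EXCLUDE"
```

With the sensor pattern set, only the sensor package survives the filter; with the no-op defaults, the full list passes through unchanged.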
Skipping CI for Draft Pull Request.
Hey - I've found 1 issue
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location path="Makefile" line_range="545-547" />
<code_context>
- # Exercise the logging package for all supported logging levels to make sure that initialization works properly
+
+# Exercise the logging package for all supported logging levels to make sure that initialization works properly.
+.PHONY: go-log-level-tests
+go-log-level-tests:
@echo "Run log tests"
for encoding in console json; do \
</code_context>
<issue_to_address>
**suggestion:** The new `go-log-level-tests` target does not depend on `build-prep`/`test-prep`, which may cause issues when run directly.
When this logic lived under `go-unit-tests`, it inherited `build-prep` and `test-prep`. Now that it’s a separate `go-log-level-tests` target, that prep only runs when invoked via a higher-level target like `make test`. Invoking `make go-log-level-tests` directly could fail due to missing setup (e.g., env, `test-output/`, or other prep side effects). Please add `build-prep test-prep` as prerequisites (or factor out a shared prerequisite target) so it behaves correctly when run standalone.
```suggestion
# Exercise the logging package for all supported logging levels to make sure that initialization works properly.
.PHONY: go-log-level-tests
go-log-level-tests: build-prep test-prep
```
</issue_to_address>
```makefile
# Exercise the logging package for all supported logging levels to make sure that initialization works properly.
.PHONY: go-log-level-tests
go-log-level-tests:
```
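The prerequisite fix can be illustrated with a miniature standalone Makefile. The target and directory names below are simplified stand-ins, not the real stackrox Makefile:

```shell
# Write a tiny Makefile where the test target depends on a prep target,
# mirroring the suggested `go-log-level-tests: build-prep test-prep` shape.
workdir=$(mktemp -d)
printf 'test-prep:\n\t@mkdir -p test-output\n\t@echo prep-done\n\ngo-log-level-tests: test-prep\n\t@test -d test-output && echo log-tests-ran\n' > "$workdir/Makefile"

# Invoking the test target standalone now runs the prep step first,
# so the target no longer depends on a higher-level target having run.
(cd "$workdir" && make -s go-log-level-tests)
```

Without the `test-prep` prerequisite, the `test -d test-output` check would fail on a fresh checkout, which is exactly the standalone-invocation hazard the review points out.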
📝 Walkthrough
Split the original single `go` job into parallel shard jobs.
Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/unit-tests.yaml:
- Around line 40-41: The workflow currently exposes outputs like new-jiras from
the junit2jira step but for matrix jobs (jobs named go-unit-tests,
go-extra-tests, go-postgres) GitHub Actions only preserves the last completed
shard’s outputs; change this by having each matrix shard upload its junit2jira
result (e.g., JSON or text) as a per-shard artifact (use unique names including
matrix variables), then add a non-matrix aggregation job that runs after the
matrix (needs depends-on: the matrix job), downloads all shard artifacts,
merges/concatenates them to produce the combined new-jiras value, and sets that
aggregated value as the workflow output (replacing the current direct step
output reference to junit2jira.outputs.new-jiras). Ensure you reference the
junit2jira step name in each shard for artifact creation and the aggregation job
reads those artifacts to compute the final output.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 2a338411-0ecf-4981-8ebe-6244426c7d85
📒 Files selected for processing (2)
- .github/workflows/unit-tests.yaml
- Makefile
```yaml
outputs:
  new-jiras: ${{ steps.junit2jira.outputs.new-jiras }}
```
🧩 Analysis chain
🏁 Script executed:
```shell
#!/bin/bash
# Check GitHub Actions documentation or existing patterns in the repo for matrix output handling
rg -n "outputs:" --type yaml -A5 .github/workflows/ | head -60
```
Repository: stackrox/stackrox
Length of output: 3776
🏁 Script executed:
```shell
cat -n .github/workflows/unit-tests.yaml | head -150
```
Repository: stackrox/stackrox
Length of output: 5288
🏁 Script executed:
```shell
cat -n .github/workflows/unit-tests.yaml | tail -n +150 | head -200
```
Repository: stackrox/stackrox
Length of output: 7078
🏁 Script executed:
```shell
cat -n .github/workflows/unit-tests.yaml | tail -n +350
```
Repository: stackrox/stackrox
Length of output: 8503
🏁 Script executed:
```shell
# Verify how matrix job outputs work in GitHub Actions
# Search for any documentation or comments about this limitation
rg -i "matrix.*output|last.*completed" .github/workflows/ -A3 -B3
```
Repository: stackrox/stackrox
Length of output: 14667
🌐 Web query:
GitHub Actions matrix job outputs limitation last completed shard only
💡 Result:
Yes, GitHub Actions matrix jobs have a limitation where job-level outputs (defined in `jobs.<job_id>.outputs`) are overwritten by each matrix shard, and dependent jobs only receive the output from the last completed shard. This is a known behavior since matrix jobs share the same job ID, and outputs are not preserved per shard. Official documentation on job outputs and matrix strategies does not mention per-shard output preservation, confirming the limitation. A 2023 runner PR (#2477) aimed to support matrix context in output keys (e.g., `outputs: ${{ matrix.os }}_result: ...`), but server-side changes were pending, and no changelog entry confirms full rollout by 2026-03-30.

Workarounds include:
- Upload per-shard data as artifacts from matrix jobs, download and aggregate in a dependent job.
- Use third-party actions like cloudposse/github-action-matrix-outputs-{write,read}.
- Predefine fixed output names matching matrix values (not dynamic).

Example artifact workaround:

```yaml
jobs:
  test:
    strategy:
      matrix:
        runner: [DEV, TEST]
    steps:
      - run: jq -cn --arg runner ${{ matrix.runner }} --arg outcome success '$ARGS.named' > outcome.json
      - uses: actions/upload-artifact@v4
        with:
          name: outcome-${{ matrix.runner }}
          path: outcome.json
  aggregate:
    needs: test
    steps:
      - uses: actions/download-artifact@v4
      - id: collect
        run: |
          jq -s . outcomes/*.json > all-outcomes.json
          echo "outcomes=$(cat all-outcomes.json)" >> $GITHUB_OUTPUT
```
Citations:
- 1: Outputs from matrix returns only the last value actions/runner#1835
- 2: https://stackoverflow.com/questions/78915275/how-can-i-capture-the-results-of-matrix-jobs-in-a-github-actions-workflow
- 3: https://stackoverflow.com/questions/70287603/dynamic-outputs-for-job-with-strategy-matrix
- 4: https://stackoverflow.com/questions/77879832/retrieve-outputs-from-a-matrix-job
- 5: Preparation work for Supporting matrix context in output keys actions/runner#2477
- 6: Matrix jobs dynamic output actions/runner#2499
- 7: https://github.com/marketplace/actions/matrix-outputs-read
- 8: https://docs.github.com/en/actions/how-tos/writing-workflows/choosing-what-your-workflow-does/passing-information-between-jobs
- 9: https://docs.github.com/en/actions/using-jobs/using-a-matrix-for-your-jobs
Matrix job outputs capture only the last completed shard, not all shards.
For matrix jobs, GitHub Actions only captures the output from the last completed matrix job, not an aggregation of all. Since go-unit-tests runs 8 parallel jobs (4 components × 2 gotags), the new-jiras output will only reflect one shard's result, potentially missing Jira tickets created by other shards. The same applies to go-extra-tests and go-postgres.
Consider implementing an aggregation job that collects outputs from all matrix shards using artifact uploads and aggregation, rather than relying on the single final shard output.
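A minimal sketch of the per-shard artifact merge described above. The `new-jiras-<shard>.txt` file naming and the newline-separated key format are assumptions; the real junit2jira output format may differ:

```shell
# Simulate the artifacts each matrix shard would upload after its junit2jira step.
artifacts=$(mktemp -d)
printf 'ROX-101\nROX-102\n' > "$artifacts/new-jiras-central.txt"
printf 'ROX-102\nROX-103\n' > "$artifacts/new-jiras-sensor.txt"

# The non-matrix aggregation job downloads all shard artifacts and merges them,
# de-duplicating Jira keys before publishing the combined result.
merged=$(sort -u "$artifacts"/new-jiras-*.txt)
echo "$merged"
```

In the workflow, `merged` would then be written to `$GITHUB_OUTPUT` as the aggregated `new-jiras` value, replacing the direct reference to a single shard's step output.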
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files:
```
@@            Coverage Diff             @@
##           master   #19702       +/-   ##
===========================================
+ Coverage   49.02%   49.65%    +0.63%
===========================================
  Files         768     2747     +1979
  Lines       74760   207261   +132501
===========================================
+ Hits        36648   102909    +66261
- Misses      34756    96697    +61941
- Partials     3356     7655     +4299
```
Flags with carried forward coverage won't be shown.
☔ View full report in Codecov by Sentry.
Images are ready for the commit at 38dbc41. To use with deploy scripts, first
Temporary commit to prove that the split CI jobs correctly isolate failures to the failing component's shard. Will be reverted after data collection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@sensor/common/sensor/ci_shard_probe_test.go`:
- Around line 6-7: Update TestCIShardIsolationProbe to gate the deliberate
failure behind an explicit opt-in (default skip): at the top of the test
(function TestCIShardIsolationProbe) check an environment variable or test flag
(for example os.Getenv("RUN_CI_SHARD_PROBE") == "1") and call t.Skip("skipping
CI shard isolation probe; set RUN_CI_SHARD_PROBE=1 to run") when not enabled;
only call t.Fatal("deliberate failure to prove shard isolation") when the opt-in
is present so the shard does not fail by default.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: b72d427e-891d-4313-afd6-709c628f5129
📒 Files selected for processing (1)
sensor/common/sensor/ci_shard_probe_test.go
```go
func TestCIShardIsolationProbe(t *testing.T) {
	t.Fatal("deliberate failure to prove shard isolation")
}
```
Guard the deliberate failure to avoid persistent CI breakage.
Line 7 always fails, and this package is routed into the sensor shard by .github/workflows/unit-tests.yaml (Lines 33-35, 47-49) and Makefile (Lines 539-543). If this isn’t reverted immediately, that shard stays red on every run. Please gate the probe behind an explicit opt-in flag (default skip).
Proposed safe probe pattern

```diff
 import "testing"
+import "os"

 // Temporary test to validate CI shard isolation - will be reverted.
 func TestCIShardIsolationProbe(t *testing.T) {
+	if os.Getenv("CI_SHARD_ISOLATION_PROBE") != "1" {
+		t.Skip("set CI_SHARD_ISOLATION_PROBE=1 to run shard isolation probe")
+	}
 	t.Fatal("deliberate failure to prove shard isolation")
 }
```

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
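The gate in the proposed pattern boils down to an environment-variable check with skip-by-default behavior. A shell sketch of the same logic (the variable name comes from the review suggestion; the function is a stand-in for the Go test):

```shell
# Default-skip probe: only fails when explicitly opted in via the environment.
run_probe() {
  if [ "${CI_SHARD_ISOLATION_PROBE:-}" != "1" ]; then
    echo "SKIP: set CI_SHARD_ISOLATION_PROBE=1 to run shard isolation probe"
    return 0
  fi
  echo "FAIL: deliberate failure to prove shard isolation"
  return 1
}

unset CI_SHARD_ISOLATION_PROBE
run_probe   # default path: the probe is skipped and the shard stays green
```

Only a run that deliberately sets `CI_SHARD_ISOLATION_PROBE=1` triggers the failure, so a forgotten revert cannot keep the sensor shard permanently red.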
CodeRabbit chat interactions are restricted to organization members for this repository. Ask an organization member to interact with CodeRabbit, or set
Shard isolation validated: sensor shard failed in ~11min while all other shards (central, pkg, other, extra) passed independently. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Description
Split the single `go` unit test CI job into 4 parallel component shards to reduce wall-clock feedback time. Each shard runs independently.
Measured impact
Wall-clock time (first start to last finish):
Baseline: 10 measurements from 5 recent master runs (2 gotags x 5 runs).
Split: first clean run on this PR.
Per-shard timings:
Baseline monolithic job step breakdown (one representative run):
Failure isolation (validated with deliberate test failure):
A deliberately failing test in `sensor/common/sensor/` failed only its own shard (`Go Unit Tests (sensor, ...)`).
Reasons for this change (validated)
Reasons I initially claimed but are not clearly supported
What changed
- Makefile: added `UNIT_TEST_PKG_INCLUDE`/`UNIT_TEST_PKG_EXCLUDE` grep filter variables with no-op defaults. Local `make test` behavior is unchanged. Extracted `go-log-level-tests` into its own target (still included in `make test`).
- Workflow: replaced the single `go` job with `go-unit-tests` (4-component matrix x 2 gotags = 8 jobs) and `go-extra-tests` (integration + operator tests, 2 jobs). Updated slack notification references.

User-facing documentation
Testing and quality
Automated testing
How I validated my change
Pushed a temporary deliberately failing test in `sensor/common/sensor/`, confirmed only the sensor shard failed while all other shards passed.