fix(ci): emailsender test (kind bug)#19050
Conversation
|
Images are ready for the commit at d42a41d. To use with deploy scripts, first |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #19050 +/- ##
==========================================
- Coverage 49.51% 49.51% -0.01%
==========================================
Files 2670 2670
Lines 201549 201549
==========================================
- Hits 99803 99802 -1
Misses 94293 94293
- Partials 7453 7454 +1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Hey - I've left some high level feedback:
- Instead of disabling the Go modules cache entirely, consider using
cache-dependency-path: stackrox/go.sumwithactions/setup-goso you still benefit from caching while accommodating thestackrox/subdirectory layout.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Instead of disabling the Go modules cache entirely, consider using `cache-dependency-path: stackrox/go.sum` with `actions/setup-go` so you still benefit from caching while accommodating the `stackrox/` subdirectory layout.Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
4c669b6 to
4d85bc0
Compare
8a0de8f to
bdb1934
Compare
bdb1934 to
708a174
Compare
kind load docker-image fails on multi-arch OCI images because docker save produces an incomplete archive (index references both platforms but only one platform's blobs are present), and kind's ctr import --all-platforms fails on the missing blobs. Pre-load main and central-db directly into the Kind node via ctr pull --platform linux/amd64, so kind load finds them already present and skips the broken docker save path. Also replace broken n1hility/cancel-previous-runs with native concurrency group, and disable setup-go cache (repo in subdir). See kubernetes-sigs/kind#3795 Partially generated by AI. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
708a174 to
950fd9a
Compare
Avoid intermediate QUAY_CREDS variable; pass secrets directly to --user like sensor integration tests do in #19016. Partially generated by AI. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Hey - I've found 2 issues, and left some high level feedback:
- The
until docker exec ...loop in thePre-load multi-arch images into Kindstep can run indefinitely on persistent failures; consider adding a maximum number of retries or an overall timeout to avoid hanging the workflow. - The pre-load step hardcodes
--platform linux/amd64; if this workflow might ever run on non-amd64 runners, consider deriving the platform from the runner or a matrix variable to keep it portable.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The `until docker exec ...` loop in the `Pre-load multi-arch images into Kind` step can run indefinitely on persistent failures; consider adding a maximum number of retries or an overall timeout to avoid hanging the workflow.
- The pre-load step hardcodes `--platform linux/amd64`; if this workflow might ever run on non-amd64 runners, consider deriving the platform from the runner or a matrix variable to keep it portable.
## Individual Comments
### Comment 1
<location> `.github/workflows/emailsender-central-compatibility.yaml:22-24` </location>
<code_context>
- 'central/auth/m2m/**'
- '.github/workflows/emailsender-central-compatibility.yaml'
+concurrency:
+ group: ${{ github.workflow }}
+ cancel-in-progress: true
+
jobs:
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Concurrency group should likely include the ref to avoid cross-branch cancellations
`group: ${{ github.workflow }}` will cancel in-progress runs for this workflow across all branches. If you only want to cancel previous runs for the same branch/PR (like the old `cancel-previous-runs` action), scope the group with the ref, e.g. `group: ${{ github.workflow }}-${{ github.ref }}` or `group: ${{ github.workflow }}-${{ github.head_ref || github.ref }}` so cancellations stay per-branch instead of global.
```suggestion
concurrency:
group: ${{ github.workflow }}-${{ github.head_ref || github.ref }}
cancel-in-progress: true
```
</issue_to_address>
### Comment 2
<location> `.github/workflows/emailsender-central-compatibility.yaml:77-79` </location>
<code_context>
+ # See https://github.com/kubernetes-sigs/kind/issues/3795
+ # Same approach as #19016 for sensor integration tests.
+ TAG=$(make --no-print-directory -C stackrox tag)
+ for img in quay.io/rhacs-eng/main:$TAG quay.io/rhacs-eng/central-db:$TAG; do
+ echo "Waiting for $img ..."
+ until docker exec kind-control-plane ctr -n k8s.io images pull \
+ --user "${{ secrets.QUAY_RHACS_ENG_RO_USERNAME }}:${{ secrets.QUAY_RHACS_ENG_RO_PASSWORD }}" \
+ --platform linux/amd64 \
</code_context>
<issue_to_address>
**issue (bug_risk):** Add a maximum retry or timeout around the image pull loop to avoid hanging the job indefinitely
This `until ...; do sleep 30; done` construct will loop indefinitely if the image never becomes available (e.g., bad tag or credentials), effectively hanging the workflow until the global job timeout. Please bound this with either a max retry counter or a time-based cutoff (for example, tracking a `retries` variable and exiting with non-zero once it exceeds a threshold) so failures surface earlier and more predictably.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
The pre-load via ctr pull works but kind load docker-image still runs (called by acs-fleet-manager entrypoint) and fails due to digest mismatch between host Docker and node containerd. Wrap kind binary so load failures are non-fatal warnings. Locally-built single-arch images (emailsender) still load normally. Partially generated by AI. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bound the until loop to 60 retries (30 minutes) so failures surface earlier than the global job timeout. Partially generated by AI. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
@porridge I tested/researched more, and it looks like the bug is really in kind itself. For the emailsender test, I threw together this workaround but I would be okay with alternative fixes for it. |
porridge
left a comment
There was a problem hiding this comment.
This is a rather elaborate workaround, too "magic" for my taste, and I think the fix should go into the other repo instead.
It seems something like
would be a lot simpler...
|
stackrox/acs-fleet-manager#2577 I tried kubernetes-sigs/kind#3795 (comment) and it appears to work for us in the acs-fleet-manager repo. Once that is merged, the test in this repo should automatickly pick it up. We could still merge the other changes:
|
|
@johannes94 that's great! Thank you. Here is the cleanup/update only. |
There was a problem hiding this comment.
Hey - I've found 1 issue
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `.github/workflows/emailsender-central-compatibility.yaml:22-24` </location>
<code_context>
- 'central/auth/m2m/**'
- '.github/workflows/emailsender-central-compatibility.yaml'
+concurrency:
+ group: ${{ github.workflow }}
+ cancel-in-progress: true
+
jobs:
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Concurrency group keyed only by workflow name may cause runs from different branches/PRs to cancel each other.
Using only `${{ github.workflow }}` here means all runs of this workflow across all branches/PRs share one concurrency group. With `cancel-in-progress: true`, a run on one branch/PR can cancel a run on another, which can disrupt parallel PR validation. Consider including branch/ref in the group, e.g. `${{ github.workflow }}-${{ github.ref }}` or `${{ github.workflow }}-${{ github.head_ref || github.ref }}` so only runs for the same branch/PR cancel each other.
```suggestion
concurrency:
group: ${{ github.workflow }}-${{ github.head_ref || github.ref }}
cancel-in-progress: true
```
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
|
@davdhacs: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
johannes94
left a comment
There was a problem hiding this comment.
Looks like the merge of stackrox/acs-fleet-manager#2577 fixed the test.
The remaining changes LGTM.
Description
Update to emailsender workflow (cleanup, fix/workaround for kind bug moved to acs-fleet-manager).
Fixed in stackrox/acs-fleet-manager#2577
kind load docker-imagefails when importing multi-arch images because it usesctr import --all-platformsbut Docker only pulls content for the local platform. This started failing when the rox-ci-image image docker client was upgraded.See kubernetes-sigs/kind#3795Changes:
n1hility/cancel-previous-runs@v3(broken credentials) with nativeconcurrencygroupcache: falsetosetup-go(repo is checked out tostackrox/subdir,setup-gocan't findgo.sumat workspace root)Testing and quality
How I validated my change
DOCKER_DEFAULT_PLATFORM=linux/amd64alone fails: https://github.com/stackrox/stackrox/actions/runs/22102838003kind load --all-platformserror: https://github.com/stackrox/stackrox/actions/runs/22087026367, https://github.com/stackrox/stackrox/actions/runs/22086333792