ROX-29771: Unflake TestPods and TestContainerInstances#17216

Merged
vikin91 merged 7 commits into master from
piotr/ROX-29771-flake
Oct 24, 2025

Conversation

@vikin91
Contributor

@vikin91 vikin91 commented Oct 9, 2025

Description

I approached this flake and it turned out to be a collection of problems, not a single root cause. Here are the issues that I have found:

  1. Problem 1: The main detection loop exits in a way that an assertion failure stops the test instead of triggering another retry.
  2. Problem 2: Docker Hub rate limiting: the images used for TestPods and TestContainerInstances were not pulled from quay.io and were thus prone to rate limiting.
  3. Problem 3: Coupling of tests/yamls/multi-container-pod.yaml to tests/yamls/roxctl_verification.sh. One cannot edit one without risking that the other fails.
  4. Problem 4: (related to 3) Count-based assertions in tests/yamls/roxctl_verification.sh: we were not checking which policies were violated but only counted the number of violations.
  5. Problem 5: Process detection does not work reliably on busybox. (This was not affecting the state before this PR, but I discovered it while working on this PR, so I am recording it for the future.)
  6. Problem 6: Collector struggles to reliably detect some processes on selected operating systems (e.g., /bin/date on Ubuntu) - see internal discussion: https://redhat-internal.slack.com/archives/CFMQ5C2TT/p1760538397043239 and ticket: https://issues.redhat.com/browse/ROX-31331

⚠️ Problem 2 was extracted into #17327
⚠️ Problems 3 & 4 were solved in: #17294
⚠️ Refactoring from solving Problem 1 was also extracted to a separate PR and applied to other nongroovy tests in: #17374

Problem 1: Custom Event Detection Loop

I address the following flaw in this PR:

Manual Event Detection Loop - Major Issue

// Current problematic code
for {
    events = getEvents(retryT, pod)
    testT.Logf("%d: Events: %+v", loopCount, events)
    if len(events) == 4 {
        break
    }
    loopCount++
    require.LessOrEqual(retryT, loopCount, 20)  // ❌ FAILS IMMEDIATELY
    time.Sleep(4 * time.Second)
}

Problem: require.LessOrEqual(retryT, loopCount, 20) fails immediately when reached, preventing the outer retry from working. The outer retry (testutils.Retry(testT, 3, 5*time.Second, func(retryT testutils.T) {...})) can't help because the test has already terminated.

I changed the for-loop into another testutils.Retry. Note that calling require.Something inside the testutils.Retry is actually safe and does not terminate the test.
Example to verify this:

func TestRetry_StopsWhenNoLongerRequiring(t *testing.T) {
    retryCount := 0
    Retry(t, 10, 0, func(t T) {
        retryCount++
        require.Equal(t, 5, retryCount)  // Uses require inside retry!
    })
    assert.Equal(t, 5, retryCount)  // Proves the loop retried until the 5th attempt
}
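For illustration, here is a minimal standalone sketch of the rewritten detection loop. The retry helper below is a simplified stand-in for pkg/testutils.Retry (the real one works with a testing.T-like interface and aborts the failing attempt), and getEvents is faked with a closure:

```go
package main

import (
	"fmt"
	"time"
)

// retry is a minimal stand-in for pkg/testutils.Retry. It runs fn up
// to attempts times; fn signals failure by calling fail, which
// schedules another attempt instead of terminating the test.
func retry(attempts int, sleep time.Duration, fn func(fail func(msg string))) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		failed := false
		fn(func(msg string) {
			failed = true
			lastErr = fmt.Errorf("%s", msg)
		})
		if !failed {
			return nil
		}
		time.Sleep(sleep)
	}
	return lastErr
}

func main() {
	// Simulate event collection that only yields 4 events on the 3rd poll.
	poll := 0
	getEvents := func() []string {
		poll++
		if poll < 3 {
			return []string{"/bin/sh", "/usr/sbin/nginx"}
		}
		return []string{"/bin/sh", "/usr/sbin/nginx", "/usr/sbin/nginx", "/bin/sleep"}
	}

	err := retry(20, 0, func(fail func(string)) {
		events := getEvents()
		if len(events) != 4 {
			// Unlike require.LessOrEqual in the old loop, failing here
			// does not terminate the test; the outer loop retries.
			fail(fmt.Sprintf("want 4 events, got %d", len(events)))
		}
	})
	fmt.Println("err:", err, "polls:", poll)
}
```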

Problem 2: Docker Hub Rate Limiting

The test was also failing due to Docker Hub rate limits when pulling nginx and debian images. Switched to pre-fetched quay.io/rhacs-eng images to avoid this issue.
See: #17216 (comment)

Problem 3 & 4: Inelasticity of roxctl_verification.sh

The tests/yamls/*.yaml files are coupled to tests/yamls/roxctl_verification.sh: one cannot edit one without risking that the other fails. A change to tests/yamls/multi-container-pod.yaml to stop relying on docker.io and the latest tag caused roxctl_verification.sh to fail. Thus, I refactored roxctl_verification.sh so that there is an expectation for each yaml file and the violated policies are specified by their names.

See also this solution separately under: #17294
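To illustrate the expectation-per-yaml idea (the actual fix lives in tests/yamls/roxctl_verification.sh as shell; the policy names below are made-up examples, not the real expectations):

```go
package main

import "fmt"

// Each yaml maps to the policy names it must violate, so editing one
// yaml cannot silently break another file's count-based check.
var expectedViolations = map[string][]string{
	"multi-container-pod.yaml": {"Latest tag", "Ubuntu Package Manager in Image"},
}

// verify checks that every expected policy name for the given yaml
// appears among the actually violated policies.
func verify(yaml string, violated []string) error {
	got := make(map[string]bool, len(violated))
	for _, v := range violated {
		got[v] = true
	}
	for _, want := range expectedViolations[yaml] {
		if !got[want] {
			return fmt.Errorf("%s: expected violation of policy %q, got %v", yaml, want, violated)
		}
	}
	return nil
}

func main() {
	err := verify("multi-container-pod.yaml",
		[]string{"Latest tag", "Ubuntu Package Manager in Image", "some extra policy"})
	fmt.Println("verification error:", err)
}
```

Extra violations are tolerated; only missing expected policies fail, which is less brittle than asserting an exact count.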

Problem 5: Detecting processes on Busybox

I noticed that the date and sleep processes cannot be reliably detected by Collector when using busybox. Based on a conversation with Collector colleagues, busybox aliases these commands to its single binary, which breaks detection: the original process cannot be detected when it is only an alias.
I abandoned the idea of running busybox and went back to a debian image (actually Ubuntu, as it was available on quay.io/rhacs-eng).

AI Use

I let AI analyze the problem based on the logs from ROX-29771 and propose the retry-loop fix.
I have checked and corrected the generated code manually.

Additionally

I err a bit more on the defensive side and validate that the pod under test has started and all containers are up.
I also moved the collector-related checks to the end, so that we can exit early when NO_COLLECTION is used.
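A rough sketch of that ordering (NO_COLLECTION is the env toggle mentioned above; runChecks and the check names are illustrative, not the real test code):

```go
package main

import (
	"fmt"
	"os"
)

// runChecks runs collection-independent checks first and returns
// early when collection is disabled, so collector-dependent
// assertions never run in NO_COLLECTION mode.
func runChecks(env func(string) string) []string {
	var ran []string
	ran = append(ran, "pod-started")
	ran = append(ran, "containers-ready")
	if env("NO_COLLECTION") != "" {
		return ran // exit before any collector-dependent checks
	}
	ran = append(ran, "process-events")
	return ran
}

func main() {
	fmt.Println(runChecks(os.Getenv))
}
```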

User-facing documentation

Testing and quality

  • the change is production ready: the change is GA, or otherwise the functionality is gated by a feature flag
  • CI results are inspected

Automated testing

  • added unit tests
  • added e2e tests
  • added regression tests
  • added compatibility tests
  • modified existing tests

How I validated my change

  • Only repeated CI runs can give us confidence.
  • I let the test pass on CI 20 times (commit 534d729) and discovered the docker.io rate-limit issue (1 failure over 20 runs)
  • I let the test pass on CI X times after that (commit d881e4d)
  • I let the test pass on CI 6 times on the commit (8f7fa00)
  • Last commit passes CI (failures are unrelated - see comments)

@vikin91 vikin91 added the auto-retest PRs with this label will be automatically retested if prow checks fails label Oct 9, 2025
@rhacs-bot
Contributor

Images are ready for the commit at 7b2868b.

To use with deploy scripts, first export MAIN_IMAGE_TAG=4.9.x-1031-g7b2868b83f.

@codecov

codecov bot commented Oct 9, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 48.81%. Comparing base (abce4b4) to head (ed04971).
⚠️ Report is 33 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #17216   +/-   ##
=======================================
  Coverage   48.81%   48.81%           
=======================================
  Files        2723     2723           
  Lines      203103   203088   -15     
=======================================
  Hits        99139    99139           
+ Misses      96190    96181    -9     
+ Partials     7774     7768    -6     
Flag Coverage Δ
go-unit-tests 48.81% <100.00%> (+<0.01%) ⬆️


@vikin91 vikin91 requested review from a team and janisz October 9, 2025 11:16
Contributor

@sourcery-ai sourcery-ai bot left a comment


Hey there - I've reviewed your changes and they look great!



@janisz
Contributor

janisz commented Oct 9, 2025

Contributor

@janisz janisz left a comment


You may need to rebase and unskip GetPod test

@rhacs-bot
Contributor

rhacs-bot commented Oct 9, 2025

Images are ready for the commit at d881e4d.

To use with deploy scripts, first export MAIN_IMAGE_TAG=4.9.x-1038-gd881e4deb6.


@vikin91 vikin91 force-pushed the piotr/ROX-29771-flake branch from 7b2868b to ce3a1ab Compare October 9, 2025 12:07
@vikin91
Contributor Author

vikin91 commented Oct 9, 2025

You may need to rebase and unskip GetPod test

I have just rebased. I will run it a few times, hoping that we see better behavior. If it gets merged and keeps failing, we may disable it again (CC @porridge).

@porridge
Contributor

porridge commented Oct 9, 2025

Thanks @vikin91 .
When you merge, please remove the CI_Disabled label from the related Jira ticket.

@stackrox stackrox deleted a comment from rhacs-bot Oct 10, 2025
@vikin91
Contributor Author

vikin91 commented Oct 10, 2025

/retest-times 10 gke-nongroovy-e2e-tests

@vikin91
Contributor Author

vikin91 commented Oct 13, 2025

The failure from #17216 (comment) is different from this flake.

@vikin91
Contributor Author

vikin91 commented Oct 13, 2025

This is still a valid failure: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/stackrox_stackrox/17216/pull-ci-stackrox-stackrox-master-ocp-4-19-nongroovy-e2e-tests/1977044661092487168
Saturday 6 PM, so the current state still flakes :(

pkg/testutils: 2025/10/11 17:43:12.206920 retry.go:64: Info: Final test attempt
    common.go:372: [DEBUG] GET https://api.rox-ci-92487168.ocp.ci.rox.systems:6443/api/v1/namespaces/default/pods/end-to-end-api-test-pod-multi-container?timeout=10s
    pods_test.go:65: Pod phase: Pending, Reason: , Message: 
    pods_test.go:79: 
        	Error Trace:	/go/src/github.com/stackrox/stackrox/tests/pods_test.go:79
        	            				/go/src/github.com/stackrox/stackrox/pkg/testutils/retry.go:65
        	            				/go/src/github.com/stackrox/stackrox/tests/pods_test.go:59
        	            				/go/src/github.com/stackrox/stackrox/pkg/testutils/retry.go:65
        	            				/go/src/github.com/stackrox/stackrox/tests/pods_test.go:48
        	            				/usr/local/go/src/runtime/asm_amd64.s:1700
        	Error:      	pod not in Running phase
        	Test:       	TestPod
        	Messages:   	Pod end-to-end-api-test-pod-multi-container is in Pending phase (expected Running)
        	            	Containers:
        	            	  - 1st: ready=false, started=false, waiting: ErrImagePull - initializing source docker://nginx@sha256:cc54bf7fa755cebebbe98e11da2ff3626852fc5a9db3397bdbec74339da9ff72: reading manifest sha256:cc54bf7fa755cebebbe98e11da2ff3626852fc5a9db3397bdbec74339da9ff72 in docker.io/library/nginx: toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
        	            	  - 2nd: ready=false, started=false, waiting: ErrImagePull - initializing source docker://debian@sha256:1e5f2d70c9441c971607727f56d0776fb9eecf23cd37b595b26db7a974b2301d: reading manifest sha256:1e5f2d70c9441c971607727f56d0776fb9eecf23cd37b595b26db7a974b2301d in docker.io/library/debian: toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
        	            	Pod Reason: 
        	            	Pod Message: 
    common.go:372: [DEBUG] DELETE https://api.rox-ci-92487168.ocp.ci.rox.systems:6443/api/v1/namespaces/default/pods/end-to-end-api-test-pod-multi-container?timeout=10s
--- FAIL: TestPod (957.74s)

^^ I think this explains much about that flake - we are hitting the docker.io rate limit.

@stackrox stackrox deleted a comment from rhacs-bot Oct 13, 2025
@rhacs-bot
Contributor

/test ocp-4-19-nongroovy-e2e-tests

@rhacs-bot
Contributor

/retest

@vikin91
Contributor Author

vikin91 commented Oct 21, 2025

Recent failure: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/stackrox_stackrox/17216/pull-ci-stackrox-stackrox-master-ocp-4-19-nongroovy-e2e-tests/1980566210068090880

pkg/testutils: 2025/10/21 11:22:47.762182 retry.go:69: Info: Final test attempt
main: 2025/10/21 11:22:47.796382 pods_test.go:245: Info: Get Events: {Pod:{IDStruct:{ID:aeee4183-4af8-5a78-b020-499696b3b961} Name:end-to-end-api-test-pod-multi-container ContainerCount:2 Started:2025-10-21 11:20:10 +0000 UTC Events:[{IDStruct:{ID:fe135d1f-2792-5f9a-942d-cb165f1a8aca} Name:/usr/sbin/nginx Timestamp:2025-10-21 11:20:14.136041913 +0000 UTC} {IDStruct:{ID:fc5fab37-806c-5830-bec3-bd6d4766c110} Name:/usr/sbin/nginx Timestamp:2025-10-21 11:20:14.209134983 +0000 UTC} {IDStruct:{ID:c6a53b7f-0df4-5609-b7fb-72ee3dcba7da} Name:/bin/sh Timestamp:2025-10-21 11:20:14.425185538 +0000 UTC}]}}
    pods_test.go:117: Found 3 events: [{IDStruct:{ID:fe135d1f-2792-5f9a-942d-cb165f1a8aca} Name:/usr/sbin/nginx Timestamp:2025-10-21 11:20:14.136041913 +0000 UTC} {IDStruct:{ID:fc5fab37-806c-5830-bec3-bd6d4766c110} Name:/usr/sbin/nginx Timestamp:2025-10-21 11:20:14.209134983 +0000 UTC} {IDStruct:{ID:c6a53b7f-0df4-5609-b7fb-72ee3dcba7da} Name:/bin/sh Timestamp:2025-10-21 11:20:14.425185538 +0000 UTC}]
    pods_test.go:126: Event names: [/usr/sbin/nginx /usr/sbin/nginx /bin/sh]
    pods_test.go:132: 
        	Error Trace:	/go/src/github.com/stackrox/stackrox/tests/pods_test.go:132
        	            				/go/src/github.com/stackrox/stackrox/pkg/testutils/retry.go:70
        	            				/go/src/github.com/stackrox/stackrox/tests/pods_test.go:115
        	Error:      	[]string{"/usr/sbin/nginx", "/usr/sbin/nginx", "/bin/sh"} does not contain "/bin/sleep"
        	Test:       	TestPod
        	Messages:   	Pod: required processes: [/bin/sh /usr/sbin/nginx /bin/sleep] not found in events: [/usr/sbin/nginx /usr/sbin/nginx /bin/sh]
    common.go:435: [DEBUG] DELETE https://api.rox-ci-68090880.ocp.ci.rox.systems:6443/api/v1/namespaces/default/pods/end-to-end-api-test-pod-multi-container?timeout=30s
--- FAIL: TestPod (158.38s)

I will put it into the https://issues.redhat.com/browse/ROX-31331 bucket.

@vikin91 vikin91 force-pushed the piotr/ROX-29771-flake branch from bad577d to 46a6d68 Compare October 21, 2025 13:16
@vikin91
Contributor Author

vikin91 commented Oct 21, 2025

I am close to giving up after seeing this failure:

pkg/testutils: 2025/10/21 15:07:21.929450 retry.go:69: Info: Final test attempt
main: 2025/10/21 15:07:21.967999 pods_test.go:245: Info: Get Events: {Pod:{IDStruct:{ID:642aa96e-23b1-57e2-9387-31d70ea09823} Name:end-to-end-api-test-pod-multi-container ContainerCount:2 Started:2025-10-21 15:04:44 +0000 UTC Events:[]}}
    pods_test.go:117: Found 0 events: []
    pods_test.go:126: Event names: []
    pods_test.go:132: 
        	Error Trace:	/go/src/github.com/stackrox/stackrox/tests/pods_test.go:132
        	            				/go/src/github.com/stackrox/stackrox/pkg/testutils/retry.go:70
        	            				/go/src/github.com/stackrox/stackrox/tests/pods_test.go:115
        	Error:      	[]string{} does not contain "/bin/sh"
        	Test:       	TestPod
        	Messages:   	Pod: required processes: [/bin/sh /usr/sbin/nginx] not found in events: []
    common.go:435: [DEBUG] DELETE https://api.rox-ci-26175232.ocp.ci.rox.systems:6443/api/v1/namespaces/default/pods/end-to-end-api-test-pod-multi-container?timeout=30s}


b1d1e5d6fc Use Retry instead of custom for loop
91aff75524 Unskip test
b6b1211fff Wait longer for pod to be ready and add extra info when it fails
930ff8d64d Use pre-fetched images from quay instead of docker.io
7b8c79ff46 Use correct T object
3e78c6b0d6 Improve retry patterns
3316b7f1b5 Add image prefetching to all qa-nongroovy-e2e tests
d85cd2d6f4 Style: Remove blank line at the end of file
9d5afd4304 Add imagePullPolicy to multi-container-pod.yaml
7431c36a0c Pin test image versions in multi-container-pod.yaml
9d156b348c Fix the tests/yamls/roxctl_verification.sh removing the hardcoded things
9d9855d8a3 Change assertion to 'at least' for processes
05bc2479a1 Don't expect to always catch /bin/date as it is short lived
ec41e98510 Add comments explaining details of investigation
87b15cf9c5 Try with ubuntu instead of busybox
58f4d33024 Address review comments
c51fdf5b61 Ensure getPodCountInCentral is called
98e413bd7a Don't expect Collector to detect /bin/date
62fa4e43aa Leave reference to ROX-31331 in comments
@vikin91 vikin91 force-pushed the piotr/ROX-29771-flake branch from 1387ec6 to 2a2434a Compare October 22, 2025 14:15
@vikin91
Contributor Author

vikin91 commented Oct 23, 2025

All recent failures are unrelated (last: "CertExpiryTest > Test Scanner cert expiry FAILED").

I am merging this as the original issue seems to be resolved. If ROX-31331 causes trouble, we may reconsider disabling this test, knowing that the source of the problem is not flakiness of the test but unexpected behavior of our product.

@vikin91 vikin91 changed the title ROX-29771: Unflake TestPods ROX-29771: Unflake TestPods and TestContainerInstances Oct 23, 2025
@vikin91
Contributor Author

vikin91 commented Oct 23, 2025

@dcaravel I am noticing this branch flaking the TestDelegatedScanning in a way that I don't see elsewhere.

I cannot wrap my head around that, as my changes should not influence the behavior of that test. I saw that test failing in different ways:

  1. Not being able to connect to Central here
  2. Timing out waiting for pod to stop here while the pod starts shortly after this timeout (see ROX-29771: Use prefetched images in non-groovy tests #17354 (comment)). That is different branch but also related to prefetching images.

I don't want to break that test by simply merging this. Do you see any relation maybe?

@vikin91
Contributor Author

vikin91 commented Oct 23, 2025

/retest

@openshift-ci

openshift-ci bot commented Oct 23, 2025

@vikin91: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/gke-qa-e2e-tests ed04971 link false /test gke-qa-e2e-tests
ci/prow/ocp-4-12-qa-e2e-tests ed04971 link false /test ocp-4-12-qa-e2e-tests
ci/prow/ocp-4-19-qa-e2e-tests ed04971 link false /test ocp-4-19-qa-e2e-tests


@dcaravel
Contributor

@dcaravel I am noticing this branch flaking the TestDelegatedScanning in a way that I don't see elsewhere.

I cannot wrap my head around that, as my changes should not influence the behavior of that test. I saw that test failing in different ways:

First impression - I'm not seeing how these changes would influence the delegated scanning tests either.

  1. Not being able to connect to Central here

The scan request made it to Central and Sensor just fine; the error grpc: the client connection is closing happens because, in the middle of Sensor processing the scan request, the grpc connection to Scanner was closed (based on the error and the presence of the Reset scanner client log message just before it).

The connection was closed because of the Central/Sensor state change:

common/detector: 2025/10/23 09:49:16.451865 detector.go:272: Info: Component runs now in Online mode
common/sensor: 2025/10/23 09:49:16.451908 notifiable.go:26: Info: Component 'Kernel probe server handler' runs now in Online mode
common/sensor: 2025/10/23 09:49:16.451936 notifiable.go:26: Info: Component 'Kernel object cache' runs now in Online mode
common/scannerdefinitions: 2025/10/23 09:49:16.451992 http_handler.go:47: Info: Component 'Scanner definitions handler' runs now in Online mode
common/scannerclient: 2025/10/23 09:49:16.452023 reset_notifiable.go:33: Info: Component runs now in Online mode
...
common/scannerclient: 2025/10/23 09:49:16.452714 reset_notifiable.go:37: Debug: Reset scanner client

When the state changes to online mode the connection is reset.

func (r *resetNotifiable) Notify(e common.SensorComponentEvent) {
	log.Info(common.LogSensorComponentEvent(e))
	switch e {
	case common.SensorComponentEventCentralReachable:
		resetGRPCClient()
		log.Debug("Reset scanner client")
	}
}

The subsequent scan for the same image executed OK (so this is a bit of a race / unlucky timing). We could do better (presumably separate from this PR) - top of mind:

  • Before resetting the client, determine what the previous connection state was and only reset the client as necessary.
    • Similarly, be smarter about the client reset and only reset it if the 'changes' from the prior state warrant it (this may be brittle as new "stuff" is added)
  • Delay processing scan requests until Sensor full initialization / sync's completed.
  • In test setup change it to "more aggressively" ensure the Central/Sensor connection is stable before running.
  • etc. ...
  2. Timing out waiting for pod to stop here while the pod starts shortly after this timeout (see ROX-29771: Use prefetched images in non-groovy tests #17354 (comment)). That is a different branch but also related to prefetching images.

I don't want to break that test by simply merging this. Do you see any relation maybe?

I'm not seeing an obvious relation here, also not seeing any events, logs, etc. related to why deployment deletion timed out - that seems to be unrelated to the delegated scanning tests - perhaps k8s / environmental.

  • The deployment teardown is not important for the test, it is there to be a good citizen - could change that so the test doesn't wait on the result.
  • Also could change the image to one that 'does less' on startup if that is indeed the cause.

I'm happy to create some follow-ups to improve this, but agree that these changes do not seem like they should be impacting the test results.
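The first suggestion above (only reset on a genuine state transition) could look roughly like this as a self-contained sketch; the type and names are illustrative, not the actual sensor code:

```go
package main

import "fmt"

// stateAwareResetter tracks the previous reachability state and only
// resets the scanner client on an actual offline -> online
// transition, instead of on every online notification.
type stateAwareResetter struct {
	online bool
	resets int
}

func (r *stateAwareResetter) Notify(nowOnline bool) {
	if nowOnline && !r.online {
		// Only a genuine transition warrants dropping the connection.
		r.resets++
	}
	r.online = nowOnline
}

func main() {
	r := &stateAwareResetter{}
	// Repeated online events no longer trigger repeated resets.
	for _, e := range []bool{true, true, false, true} {
		r.Notify(e)
	}
	fmt.Println("resets:", r.resets) // resets: 2
}
```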

@vikin91
Contributor Author

vikin91 commented Oct 24, 2025

Many thanks for your input @dcaravel !

@vikin91
Contributor Author

vikin91 commented Oct 24, 2025

All 3 recent failures are for CertExpiryTest: Test Scanner cert expiry

@vikin91 vikin91 merged commit 8c2c913 into master Oct 24, 2025
96 of 99 checks passed
@vikin91 vikin91 deleted the piotr/ROX-29771-flake branch October 24, 2025 09:21