ROX-29771: Unflake TestPods and TestContainerInstances#17216
Conversation
|
Images are ready for the commit at 7b2868b. To use with deploy scripts, first |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #17216 +/- ##
=======================================
Coverage 48.81% 48.81%
=======================================
Files 2723 2723
Lines 203103 203088 -15
=======================================
Hits 99139 99139
+ Misses 96190 96181 -9
+ Partials 7774 7768 -6
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
janisz
left a comment
There was a problem hiding this comment.
You may need to rebase and unskip GetPod test
|
Images are ready for the commit at d881e4d. To use with deploy scripts, first |
This comment was marked as off-topic.
This comment was marked as off-topic.
7b2868b to
ce3a1ab
Compare
I have just rebased. Will run it few times hoping that we see a better behavior. If gets merged and will be failing, we may disable it again (CC @porridge). |
|
Thanks @vikin91 . |
|
/retest-times 10 gke-nongroovy-e2e-tests |
|
The failure from #17216 (comment) is different to this flake. |
|
This is still a valid failure: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/stackrox_stackrox/17216/pull-ci-stackrox-stackrox-master-ocp-4-19-nongroovy-e2e-tests/1977044661092487168 ^^ I think this explains much about that flake - we are hitting the docker.io rate limit. |
|
/test ocp-4-19-nongroovy-e2e-tests |
|
/retest |
a815f1b to
847efe3
Compare
|
/retest |
1 similar comment
|
/retest |
I will put it into the https://issues.redhat.com/browse/ROX-31331 bucket. |
bad577d to
46a6d68
Compare
|
I am close to giving up after seeing this failure: |
|
/retest |
b1d1e5d6fc Use Retry instead of custom for loop 91aff75524 Unskip test b6b1211fff Wait longer for pod to be ready and add extra info when it fails 930ff8d64d Use pre-fetched images from quay instead of docker.io 7b8c79ff46 Use correct T object 3e78c6b0d6 Improve retry patterns 3316b7f1b5 Add image prefetching to all qa-nongroovy-e2e tests d85cd2d6f4 Style: Remove blank line at the end of file 9d5afd4304 Add imagePullPolicy to multi-container-pod.yaml 7431c36a0c Pin test image versions in multi-container-pod.yaml 9d156b348c Fix the tests/yamls/roxctl_verification.sh removing the hardcoded things 9d9855d8a3 Change assertion to 'at least' for processes 05bc2479a1 Don't expect to always catch /bin/date as it is short lived ec41e98510 Add comments explaining details of investigation 87b15cf9c5 Try with ubuntu instead of busybox 58f4d33024 Address review comments c51fdf5b61 Ensure getPodCountInCentral is called 98e413bd7a Don't expect Collector to detect /bin/date 62fa4e43aa Leave reference to ROX-31331 in comments
1387ec6 to
2a2434a
Compare
|
All recent failures are unrelated (last: "CertExpiryTest > Test Scanner cert expiry FAILED"). I am merging this as the original issue seem to be resolved. If ROX-31331 is causing troubles, we may reconsider disabling this test knowing that the source of the problem is not in the flakiness of the test, but unexpected behavior of our product. |
|
@dcaravel I am noticing this branch flaking the I cannot wrap my head around that, as my changes should not influence the behavior of that test. I saw that test failing in different ways:
I don't want to break that test by simply merging that. Do you see any relation mabye? |
|
/retest |
|
@vikin91: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
First impression - I'm not seeing how these changes would influence the delegated scanning tests either.
The scan request made it to Central and Sensor just fine, the error The connection was closed because of the Central/Sensor state change: When the state changes to online mode the connection is reset. stackrox/sensor/common/scannerclient/reset_notifiable.go Lines 32 to 39 in 88ee0b0 The subsequent scan for the same image executed OK (so this is a bit of a race / unlucky timing) - we could do better (assuming separate from this PR) - top of mind:
I'm not seeing an obvious relation here, also not seeing any events, logs, etc. related to why deployment deletion timed out - that seems to be unrelated to the delegated scanning tests - perhaps k8s / environmental.
I'm happy to create some followup's to improve this, but agree that these changes do not seem like they should be impacting the test results. |
|
Many thanks for your input @dcaravel ! |
|
All 3 recent failures are for |
Description
I approached this flake and it turned out to be a collection of problems, not a single root cause. Here are the issues that I have found:
TestPodsandTestContainerInstanceswere not taken from quay.io and thus prone to rate limiting.Coupling oftests/yamls/multi-container-pod.yamltotests/yamls/roxctl_verification.sh. One cannot edit the one without the risk that the other would fail.(related to 3) Count-based assertions intests/yamls/roxctl_verification.sh- we were not checking which policies were violated but just counted the number of violations./bin/dateon Ubuntu) - see internal discussion: https://redhat-internal.slack.com/archives/CFMQ5C2TT/p1760538397043239 and ticket: https://issues.redhat.com/browse/ROX-31331Problem 1: Custom Event Detection Loop
I am approaching the following flaw in this PR:
Manual Event Detection Loop - Major Issue
Problem:
require.LessOrEqual(retryT, loopCount, 20)fails immediately when reached, preventing the outer retry from working. The outer retry (testutils.Retry(testT, 3, 5*time.Second, func(retryT testutils.T)) can't help because the test has already terminated.I changed the for-loop into another
testutils.Retry. Note that callingrequire.Somethinginside thetestutils.Retryis actually safe and does not terminate the test.Example to verify this:
Problem 2: Docker Hub Rate Limiting
The test was also failing due to Docker Hub rate limits when pulling
nginxanddebianimages. Switched to pre-fetchedquay.io/rhacs-engimages to avoid this issue.See: #17216 (comment)
Problem 3 & 4: Inelasticity of
roxctl_verification.shCoupling of
tests/yamls/*.yamlfiles totests/yamls/roxctl_verification.sh. One cannot edit the one without the risk that the other would fail. A change totests/yamls/multi-container-pod.yamlto not rely on the docker.io and thelatesttag, caused theroxctl_verification.shto fail. Thus, I refactored theroxctl_verification.shin a way that there is an expectation for each yaml file and the violated policies are specified by their names.See also this solution separately under: #17294
Problem 5: Detecting processes on Busybox
I noticed that the
dateandsleepprocesses cannot be reliably detected by Collector when using busybox. Based on a conversation with Collector colleagues, busybox may alias the commands and thus make the detection not working correctly as the original process cannot be detected when it is an alias.I abandoned the idea of running busybox and went back to a debian image (actually Ubuntu as it was available on quay.io/rhacs-eng).
AI Use
I let AI analyze the problem based on the logs from ROX-29771 and propose the retry-loop fix.
I have checked and corrected the generated code manually.
Additionally
I go a bit more into defensive side and validate that the pod-under-test has started and all containers are up.
I also move the collector-related checks to the end, so that we can exit early when
NO_COLLECTIONis used.User-facing documentation
Testing and quality
Automated testing
How I validated my change
I let the test pass on CI X times after that (commit d881e4d)