ROX-19064: Scanner V4 CI Wait for Vulns to Load#19836
Conversation
Skipping CI for Draft Pull Request.
3c47842 to 3d0cf01
Codecov Report ✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
## master #19836 +/- ##
==========================================
- Coverage 49.60% 49.56% -0.04%
==========================================
Files 2763 2764 +1
Lines 208339 208357 +18
==========================================
- Hits 103341 103269 -72
- Misses 97331 97436 +105
+ Partials 7667 7652 -15
🚀 Build Images Ready: images are ready for commit 8beb887. To use with deploy scripts: export MAIN_IMAGE_TAG=4.11.x-562-g8beb887623

🚀 Build Images Ready: images are ready for commit 3c47842. To use with deploy scripts: export MAIN_IMAGE_TAG=4.11.x-561-g3c47842981
/test all
# (i.e. database connectivity). Call this separately in jobs that verify scan
# results, after deploy_stackrox has returned.
wait_for_scanner_v4_vuln_load() {
    local max_seconds="${SCANNER_V4_VULN_LOAD_TIMEOUT:-2400}"
Is it a 40-minute wait in CI? Can we add a tag for Scanner V4 tests and run everything else instead of waiting?
Alternatively, load a smaller vulnerability list so it will be ready in a second. We have a list of images in the prefetcher config, so let's load only what's really needed.
Please see #19835 - Scanner V4 isn't enabled yet; I'm working on reducing this as much as possible while managing scope.
Can we add a tag for scanner v4 tests and run everything else instead of waiting?
Possibly. Many tests (UI, compliance, deployment, policy, etc.) rely on the ability to scan images, so this may add complexity without buying much. In my testing, overall CI runs completed in a similar timeframe as today with the other optimizations in review; I'll continue to look for optimizations. FWIW, CI today already waits for Scanner V2 vulns to load (via pod readiness), so this isn't a new concept.
# This is distinct from wait_for_scanner_V4, which only waits for pod readiness
# (i.e. database connectivity). Call this separately in jobs that verify scan
# results, after deploy_stackrox has returned.
wait_for_scanner_v4_vuln_load() {
Going over the reasons not to use readiness for this:
The polling approach was favored because it does not require making changes in CI for each different install type (manifest, helm, operator, etc.)
The change required is an environment variable. That type of customization already exists for different install types, so why is a custom var in CI not desired for this particular case? For example, weigh adding the custom var against the bash script you're proposing.
the cause of timeouts would be 'less obvious' when jobs fail - with polling the failure reason is directly in the build logs (amongst other things).
Pods not getting ready in time for other reasons are still part of the failure path; we would have to investigate them regardless. Is it the case that timeouts due to readiness tied to vulnerability updates are completely obscure? If so, wouldn't improving the outcome (status code + body message) of the readiness probe be better than re-implementing in a bash script the same job Kubernetes already does?
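For context, the readiness-based alternative discussed here would roughly mean gating the matcher pod's Ready state on the vuln load; a hypothetical sketch follows (the path, port, and thresholds are placeholders, not taken from the actual matcher deployment):

```yaml
# Hypothetical readinessProbe on the Scanner V4 matcher: the pod only
# reports Ready once vulnerability data is loaded, so standard waits
# (kubectl rollout status / wait_for_scanner_V4) would cover the load too.
readinessProbe:
  httpGet:
    path: /health/readiness   # placeholder path
    port: 9443                # placeholder port
  periodSeconds: 30
  failureThreshold: 80        # 80 x 30s = 2400s, matching the CI timeout
```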
Here is the alternative: #19930
Please give that an approve if preferred. I see benefits of both approaches (I'll admit the bash 'magic' in this PR isn't my favorite).
Some thoughts on this approach:
- Quicker to debug, especially while iterating on optimizing load times
- Can see the timings right on the Prow landing page. Loading failures are highlighted, and it's clear when a failure is due to taking too long vs. other causes of the pod not being ready (such as DB connection failures)
- This logic doesn't need to change if installers change (albeit trivial)
- It tests the same mechanism UI/users use to determine if vulns are loaded / up to date.
- It keeps matcher behavior consistent with production installs. The interactions between Central and Matcher can be tested while in a loading state (if desired).
/test gke-nongroovy-e2e-tests
Description
Alternative: #19930 (only one is needed)
Adds the rails for CI jobs to wait for vuln loads to finish before starting tests.
This PR polls the Central API to determine if vulns are loaded (same API that is used by System Health).
Another option considered was to use the 'readiness' setting in the Scanner V4 matcher so that the pod does not reach a ready state until vulns are loaded. The polling approach was favored because it does not require making changes in CI for each different install type (manifest, helm, operator, etc.), and because with readiness the cause of timeouts would be less obvious when jobs fail; with polling, the failure reason is directly in the build logs (amongst other things).
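As a rough illustration of the polling approach (hypothetical helper names and wiring; the actual Central endpoint used by System Health is not reproduced here), the wait can be sketched as a generic timeout loop:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Sketch of a poll-until-ready loop. The real CI script polls the Central
# API that System Health uses; here the check is any command passed in
# (e.g. a curl against Central) that exits 0 once vulns are loaded.
wait_for_condition() {
    local max_seconds="$1" interval="$2"
    shift 2
    local start elapsed
    start="$(date +%s)"
    while true; do
        if "$@"; then
            echo "ready after $(( $(date +%s) - start ))s"
            return 0
        fi
        elapsed=$(( $(date +%s) - start ))
        if (( elapsed >= max_seconds )); then
            echo "timed out after ${max_seconds}s waiting for: $*" >&2
            return 1
        fi
        sleep "$interval"
    done
}

# Example wiring, mirroring the timeout variable shown in the diff:
# wait_for_condition "${SCANNER_V4_VULN_LOAD_TIMEOUT:-2400}" 30 check_vulns_loaded
```

The timeout lives in one place and the check command can be swapped without touching the loop, which is what makes the approach install-type agnostic.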
Prior to polling, the available storage classes for the cluster are listed to assist troubleshooting if loads are slow (e.g. to verify whether the DB PVC is backed by an SSD).
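For reference, that diagnostic could be as simple as the following (a sketch; the PVC name is a placeholder, not taken from this PR):

```shell
# List storage classes, and the class backing the Scanner V4 DB PVC
# (hypothetical PVC name), to check whether loads run on SSD-backed storage.
kubectl get storageclass
kubectl -n stackrox get pvc scanner-v4-db -o jsonpath='{.spec.storageClassName}'
```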
User-facing documentation
Testing and quality
Automated testing
The changes themselves are tests
How I validated my change
Against StackRox Scanner these changes will be tested by CI as part of this PR
Against Scanner V4 these changes were validated in #19236 and will be validated again in a future PR when Scanner V4 is officially turned on in CI.