Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions qa-tests-backend/scripts/run-part-1.sh
Original file line number Diff line number Diff line change
Expand Up @@ -48,6 +48,11 @@ config_part_1() {
image_prefetcher_system_await

deploy_stackrox "$ROOT/$DEPLOY_DIR/client_TLS_certs"

if [[ "${ROX_SCANNER_V4:-}" == "true" ]]; then
wait_for_scanner_v4_vuln_load
fi

deploy_optional_e2e_components
setup_workload_identities

Expand Down
54 changes: 54 additions & 0 deletions tests/e2e/lib.sh
Original file line number Diff line number Diff line change
Expand Up @@ -1167,6 +1167,60 @@ wait_for_scanner_V4() {
wait_for_ready_deployment "$namespace" "scanner-v4-matcher" "$max_seconds"
}

# wait_for_scanner_v4_vuln_load waits until the Scanner V4 matcher has completed
# its initial vulnerability load, verified via Central's integration health API.
# This is distinct from wait_for_scanner_V4, which only waits for pod readiness
# (i.e. database connectivity). Call this separately in jobs that verify scan
# results, after deploy_stackrox has returned.
wait_for_scanner_v4_vuln_load() {
Copy link
Copy Markdown
Contributor

@jvdm jvdm Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going ove the reasons as to not use readiness for this:

The polling approach was favor because it does not require making changes in CI for each different install type (manifest, helm, operator, etc.)

The change required is an environment variable. That type of customization already exists for different install types, why for this particular case a custom vars in CI is not desired? For example, weight adding the custom var with the bash script that you're proposing.

the cause of timeouts would be 'less obvious' when jobs fail - with polling the failure reason is directly in the build logs (amongst other things).

Pods not getting ready in time for other reasons are still part of the failure path. We would still have to investigate them regardless. Is it the case that timeouts due to readiness tied to vulnerability updates are completely obscure? If that the case, improving the outcome (status code + body message) of the readiness probe would be better compared to re-implementing the same job kubernetes does in the bash script?

Copy link
Copy Markdown
Contributor Author

@dcaravel dcaravel Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the alternative: #19930

Please give that an approve if preferred, I see benefits of both approaches (will admit the bash 'magic' in this PR isn't my favorite)

Some thoughts on this approach:

  • Quicker to debug, especially while iterating on optimizing load times
  • Can see the timings right on the prow landing page. Loading failures are highlighted and it's clear when due to taking too long vs. other causes of pod not being ready (such DB connect failures)
  • This logic doesn't need to change if installers change (albeit trivial)
  • It tests the same mechanism UI/users use to determine if vulns are loaded / up to date.
  • It keeps matcher behavior consistent with production installs. The interactions between Central and Matcher can be tested while in a loading state (if desired).

local max_seconds="${SCANNER_V4_VULN_LOAD_TIMEOUT:-2400}"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it 40 minutes wait in CI? Can we add a tag for scanner v4 tests and run everything else instead of waiting?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively load smaller vulnerability list so it will be ready in a second. We have list of images in prefetcher config so let's load only what's really needed.

Copy link
Copy Markdown
Contributor Author

@dcaravel dcaravel Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see #19835 - Scanner V4 isn't enabled yet - working on reducing this as much as possible while managing scope.

Can we add a tag for scanner v4 tests and run everything else instead of waiting?

Possibly - many tests (UI, compliance, deployment, policy, etc.) rely on the ability to scan images, so this may add complexity and not buy much - in my testing the overall CI runs were completing in a similar timeframe as today with the other optimizations in review - will continue to find optimizations. FWIW CI today waits for Scanner V2 vulns to load (via pod readiness) so this isn't a 'new' concept.

info "Waiting for Scanner V4 to finish loading vulnerabilities..."

info "--- Available storage classes on this cluster ---"
kubectl describe storageclasses 2>/dev/null || true
info "SCANNER_V4_DB_STORAGE_CLASS=${SCANNER_V4_DB_STORAGE_CLASS:-<unset>}"

require_environment "API_ENDPOINT"
require_environment "ROX_ADMIN_PASSWORD"

local start_time elapsed
start_time="$(date '+%s')"
while true; do
# -w '\n%{http_code}' appends a newline and the HTTP status code after the
# response body, letting us capture both in a single variable.
local response http_code body
response=$(curl -sk \
--config <(curl_cfg user "admin:${ROX_ADMIN_PASSWORD}") \
-w '\n%{http_code}' \
"https://${API_ENDPOINT}/v1/integrationhealth/vulndefinitions?component=SCANNER_V4")
# ##*$'\n' strips up to and including the last newline (leaving just the 3-digit code).
http_code="${response##*$'\n'}"
# %$'\n'* strips the trailing newline+code (leaving just the body).
body="${response%$'\n'*}"

elapsed=$(( $(date '+%s') - start_time ))

if [[ "${http_code}" != "200" ]]; then
# Try to extract the message field from a JSON error body; fall back to
# the raw body if it's not JSON or the field is absent.
local err_detail
err_detail=$(jq -r '.message // empty' <<< "${body}" 2>/dev/null)
info "Scanner V4 vuln load check: HTTP ${http_code} (${elapsed}s/${max_seconds}s): ${err_detail:-${body}}"
elif echo "${body}" | jq -e '.lastUpdatedTimestamp != null' >/dev/null 2>&1; then
info "Scanner V4 vulnerability loading complete (${elapsed}s elapsed)."
return
else
info "Scanner V4 vuln load check: HTTP 200, lastUpdatedTimestamp not set yet (${elapsed}s/${max_seconds}s)"
fi

if (( elapsed > max_seconds )); then
die "wait_for_scanner_v4_vuln_load() timed out after ${max_seconds}s."
fi

sleep 30
done
}

# shellcheck disable=SC2120
wait_for_api() {
local central_namespace=${1:-stackrox}
Expand Down
4 changes: 4 additions & 0 deletions tests/e2e/run-ui-e2e.sh
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,10 @@ test_ui_e2e() {
deploy_optional_e2e_components
deploy_stackrox

if [[ "${ROX_SCANNER_V4:-}" == "true" ]]; then
wait_for_scanner_v4_vuln_load
fi

run_ui_e2e_tests
}

Expand Down
4 changes: 4 additions & 0 deletions tests/e2e/run.sh
Original file line number Diff line number Diff line change
Expand Up @@ -41,6 +41,10 @@ test_e2e() {
deploy_optional_e2e_components
deploy_stackrox

if [[ "${ROX_SCANNER_V4:-}" == "true" ]]; then
wait_for_scanner_v4_vuln_load
fi

rm -f FAIL

prepare_for_endpoints_test
Expand Down
Loading