ROX-32420: Add relay vsock retry when KubeVirt is unavailable#19391

Draft
vikin91 wants to merge 1 commit into master from piotr/ROX-32420-vsock-retry

vikin91 commented Mar 12, 2026

Description

This PR removes a timing dependency in VM relay startup when StackRox starts before KubeVirt with vsock support is installed.

Before:

  • VM relay attempted a one-time vsock bind during startup.
  • If bind failed (vsock unsupported/not ready), relay startup aborted permanently for the pod lifetime.

After:

  • VM relay stream creation retries with exponential backoff until success or context cancellation.
  • Once KubeVirt/vsock becomes available, relay starts without requiring a compliance pod restart.
  • Added focused tests for retry success, immediate success, and cancellation behavior.

Code changes:

  • Add createVMRelayStreamWithRetry(...) in compliance and use it from VM relay startup path.
  • Add stream.NewWithListener(...) test helper constructor for deterministic stream tests without real vsock.
  • Add comments clarifying one-shot bind behavior and retry ownership.

AI-assisted contribution:

  • AI generated the retry helper, related test scaffolding, and documentation comments.
  • User provided problem framing, reviewed test flakiness risk, and corrected the cancellation test assertion strategy (moved from wall-clock threshold to deterministic retry-attempt assertion).

User-facing documentation

Testing and quality

  • the change is production ready: the change is GA, or otherwise the functionality is gated by a feature flag
  • CI results are inspected

Automated testing

  • added unit tests
  • added e2e tests
  • added regression tests
  • added compatibility tests
  • modified existing tests

How I validated my change

  • Added unit tests
  • Tested on a cluster where I deployed ACS before enabling vsock

When StackRox starts before KubeVirt, the VM relay's vsock bind fails once
and never retries, leaving the relay disabled for the pod lifetime. Add
exponential backoff retry around stream creation so the relay recovers
when vsock becomes available after KubeVirt is installed.

- Extract createVMRelayStreamWithRetry() with context-aware retry loop
- Add stream.NewWithListener() for tests
- Add unit tests for retry success, cancellation, and immediate success
- Document bind-failure flow in compliance, stream, and vsock packages

AI-generated: retry logic, tests, NewWithListener, and documentation.
User-reviewed: design and implementation.

vikin91 commented Mar 12, 2026

This change is part of the following stack:

Change managed by git-spice.


openshift-ci bot commented Mar 12, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@rhacs-bot
Contributor

Images are ready for the commit at e56dae0.

To use with deploy scripts, first export MAIN_IMAGE_TAG=4.11.x-310-ge56dae03ff.


codecov bot commented Mar 12, 2026

Codecov Report

❌ Patch coverage is 45.45455% with 24 lines in your changes missing coverage. Please review.
✅ Project coverage is 49.60%. Comparing base (52688db) to head (e56dae0).

Files with missing lines                                Patch %   Lines
compliance/compliance.go                                0.00%     15 Missing ⚠️
...machines/relay/stream/vsock_index_report_stream.go   0.00%     8 Missing ⚠️
compliance/vm_relay_retry.go                            95.23%    1 Missing ⚠️

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #19391      +/-   ##
==========================================
- Coverage   49.68%   49.60%   -0.09%     
==========================================
  Files        2700     2702       +2     
  Lines      203278   203655     +377     
==========================================
+ Hits       100999   101013      +14     
- Misses      94753    95116     +363     
  Partials     7526     7526              

Flag            Coverage Δ
go-unit-tests   49.60% <45.45%> (-0.09%) ⬇️
