Skip to content

ROX-32848: Wire VM relay ACK flow with rate limiting and UMH#19321

Draft
vikin91 wants to merge 1 commit intopiotr/ROX-32316-umh-node-ackfrom
piotr/ROX-32316-vm-relay-ack-flow
Draft

ROX-32848: Wire VM relay ACK flow with rate limiting and UMH#19321
vikin91 wants to merge 1 commit intopiotr/ROX-32316-umh-node-ackfrom
piotr/ROX-32316-vm-relay-ack-flow

Conversation

@vikin91
Copy link
Contributor

@vikin91 vikin91 commented Mar 6, 2026

Description

Integrates the VM relay with the per-resource UMH (from the parent PR) and replaces the sender's
inline retry loop with a single-attempt send. Retry responsibility now lives in the UMH, which
tracks ACK state per VSOCK ID.

What changed:

  • Relay (compliance/virtualmachines/relay/relay.go): Added per-VSOCK rate limiting (leaky
    bucket via golang.org/x/time/rate), UMH integration (ObserveSending/OnACK), and a metadata
    cache that tracks updatedAt / lastAckedAt per VM for stale-ACK detection. Reports that exceed
    the rate limit are dropped with a metric — the agent will resubmit on its own schedule.
  • Sender (index_report_sender.go): Removed the 10-retry retry.WithRetry loop and
    isRetryableGRPCError helper. The sender now makes a single gRPC call; failures are reported
    back so the UMH can schedule a retry at the appropriate backoff interval. Added per-attempt
    latency and result metrics.
  • Compliance (compliance.go): Added umhVMIndex field d handleVMIndexACK to forward
    ComplianceACK messages for VM_INDEX_REPORT to the relay's UMH. The VM relay startup now reads
    ROX_VM_RELAY_MAX_REPORTS_PER_MINUTE and ROX_VM_RELAY_STALE_ACK_THRESHOLD from env.
  • Metrics (relay/metrics/metrics.go): New counters/histograms for send attempts, rate limiting,
    and ACKs received.
  • Env vars (pkg/env/virtualmachine.go): ROX_VM_RELAY_MAX_REPORTS_PER_MINUTE (default 1.0)
    and ROX_VM_RELAY_STALE_ACK_THRESHOLD (default 4h).

Bug fix during split: The feature branch passed *v1.IndexReport to sender.Send() which
expects *v1.VMReport. Fixed handleIncomingReport to carry the full VMReport through.

Depends on: piotr/ROX-32316-umh-node-ack (UMH per-resource refactor).

AI-assisted: code was extracted and adapted from a larger feature branch by AI,
with a type mismatch bug fix applied during the split. Reviewed and verified by the author.

User-facing documentation

Testing and quality

  • the change is production ready: the change is GA, or otherwise the functionality is gated by a feature flag
  • CI results are inspected

Automated testing

  • added unit tests
  • added e2e tests
  • added regression tests
  • added compatibility tests
  • modified existing tests

How I validated my change

  • Unit tests
  • On a cluster

@vikin91
Copy link
Contributor Author

vikin91 commented Mar 6, 2026

@openshift-ci
Copy link

openshift-ci bot commented Mar 6, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@rhacs-bot
Copy link
Contributor

rhacs-bot commented Mar 6, 2026

Images are ready for the commit at 0ad5d24.

To use with deploy scripts, first export MAIN_IMAGE_TAG=4.11.x-262-g0ad5d24127.

@codecov
Copy link

codecov bot commented Mar 6, 2026

Codecov Report

❌ Patch coverage is 74.64789% with 36 lines in your changes missing coverage. Please review.
✅ Project coverage is 49.58%. Comparing base (7faf2a3) to head (0ad5d24).

Files with missing lines Patch % Lines
compliance/virtualmachines/relay/relay.go 80.20% 17 Missing and 2 partials ⚠️
compliance/compliance.go 41.37% 17 Missing ⚠️
Additional details and impacted files
@@                      Coverage Diff                      @@
##           piotr/ROX-32316-umh-node-ack   #19321   +/-   ##
=============================================================
  Coverage                         49.58%   49.58%           
=============================================================
  Files                              2690     2690           
  Lines                            202962   203047   +85     
=============================================================
+ Hits                             100633   100687   +54     
- Misses                            94834    94862   +28     
- Partials                           7495     7498    +3     
Flag Coverage Δ
go-unit-tests 49.58% <74.64%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Integrates the VM relay with the per-resource UMH from the previous commit.
The relay now rate-limits reports per VSOCK ID (leaky bucket), tracks ACK
metadata for stale-ACK detection, and delegates retry responsibility to UMH
instead of retrying inline in the sender. The sender is simplified to a
single-attempt send. Adds handleVMIndexACK in compliance to forward
ComplianceACK messages to the VM relay's UMH.

Also fixes type mismatch in relay where handleIncomingReport passed
*IndexReport to sender.Send() which expects *VMReport.

AI-assisted: code was extracted from the feature branch by AI, with bug
fixes applied during the split. Reviewed and verified by the author.
@vikin91 vikin91 force-pushed the piotr/ROX-32316-vm-relay-ack-flow branch from 63a45ce to 0ad5d24 Compare March 6, 2026 13:29
@vikin91 vikin91 changed the title ROX-32316: Wire VM relay ACK flow with rate limiting and UMH ROX-32848: Wire VM relay ACK flow with rate limiting and UMH Mar 6, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants