
ROX-30485,ROX-30486,ROX-30470: Use go-retryablehttp in k8s client #16725

Merged
janisz merged 6 commits into master from fix/rox-30485-add-retry-logic-to-pod-operations
Sep 11, 2025
Conversation


@janisz janisz commented Sep 9, 2025

This PR replaces pod-level retry logic with HTTP client-level retries using hashicorp/go-retryablehttp to provide better network
resilience for GKE API connectivity issues while working within existing 10-second context timeouts.

Problem

Our CI tests are experiencing frequent network timeout failures, particularly in GKE environments:

  • ROX-30485, ROX-30486, ROX-30470: context deadline exceeded errors on GKE endpoints (34.133.244.129, 34.56.5.71)
  • ROX-29713, ROX-29356: GRPC connection resolution failures and TLS establishment issues
  • General network instability: Transient network issues causing test flakiness

The previous approach added retry logic at the pod creation/deletion level, but this is less efficient than handling retries at the HTTP
transport layer.

Solution

HTTP Client Retry Implementation

  • go-retryablehttp integration into the Kubernetes client HTTP transport using the existing WrapTransport pattern
  • Fast retry strategy designed to work within existing 10-second context timeouts
  • Smart retry logic that only retries on network errors and server errors (5xx, 429)

Key Configuration

// Individual request timeout: 3 seconds
retryClient.HTTPClient.Timeout = 3 * time.Second

// Retry policy: 3 retries with exponential backoff
retryClient.RetryMax = 3
retryClient.RetryWaitMin = 500 * time.Millisecond
retryClient.RetryWaitMax = 2 * time.Second

Timeline within 10-second context:

  • Attempt 1: 0s → 3s (timeout) → 0.5s delay
  • Attempt 2: 3.5s → 6.5s (timeout) → 1s delay
  • Attempt 3: 7.5s → 10s (context deadline)

This allows up to 3 retry attempts within the existing 10-second context window.

Implementation Details

Custom Retry Policy

  retryClient.CheckRetry = func(ctx context.Context, resp *http.Response, err error) (bool, error) {
      if err != nil {
          // Retry on network errors (timeouts, connection refused, etc.)
          logf(t, "K8s API network error, will retry: %v", err)
          return true, nil
      }

      if resp != nil {
          // Retry on 5xx server errors and 429 (rate limit)
          if resp.StatusCode >= 500 || resp.StatusCode == 429 {
              logf(t, "K8s API server error %d, will retry", resp.StatusCode)
              return true, nil
          }
      }

      return false, nil
  }

Transport Integration

Uses the established WrapTransport pattern found in:

  • pkg/k8sintrospect/collector.go:58-64
  • sensor/kubernetes/upgrade/process.go
  • sensor/upgrader/upgradectx/context.go

Benefits

✅ Network Resilience: Automatic retry for transient network issues
✅ Backward Compatibility: Works with existing 10-second context timeouts
✅ Smart Retry Logic: Only retries appropriate errors, not client mistakes
✅ Comprehensive Logging: Debug visibility into retry attempts
✅ Existing Dependency: Uses hashicorp/go-retryablehttp v0.7.8 already in go.mod
✅ Proven Pattern: Follows established WrapTransport usage in codebase

Testing

  • ✅ Code compiles successfully with go build -tags test_e2e
  • ✅ Uses existing dependency (hashicorp/go-retryablehttp v0.7.8)
  • ✅ Follows established patterns in the codebase
  • 🧪 Ready for CI testing against problematic GKE endpoints

Related Issues

This implementation directly addresses:

  • ROX-30485: GKE API timeout with endpoint 34.133.244.129
  • ROX-30486: GKE API timeout with endpoint 34.56.5.71
  • ROX-30470: Context deadline exceeded in GKE tests
  • ROX-29713: GRPC connection resolution failures
  • ROX-29356: Service discovery and TLS establishment issues

Test Plan

  1. Monitor CI runs for reduced timeout failures in GKE environments
  2. Check retry logs to verify appropriate retry behavior
  3. Verify no regression in test execution times
  4. Confirm fast failure for genuine API errors (4xx responses)

This change adds robust retry logic to createPod and teardownPod
functions to handle transient Kubernetes API failures that were
causing TestPod failures in gke-nongroovy-compatibility-tests.

Changes:
- Replace single-attempt operations with 3-attempt retry logic
- Use existing pkg/retry package for consistent retry behavior
- Add 2-second backoff between retry attempts
- Maintain 10-second timeout per individual attempt
- Add informative logging for retry attempts

The retry logic provides better fault tolerance for:
- Transient network connectivity issues
- Kubernetes API server temporary unavailability
- Resource contention in test environments

Total maximum time per operation: ~34 seconds (3×10s + 2×2s backoff)
vs previous 10-second single attempt, significantly improving
reliability while maintaining reasonable test execution times.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@janisz janisz requested a review from mtodor September 9, 2025 11:28
rhacs-bot commented Sep 9, 2025

Images are ready for the commit at 12f681f.

To use with deploy scripts, first export MAIN_IMAGE_TAG=4.9.x-713-g12f681f5ae.

codecov bot commented Sep 9, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 48.80%. Comparing base (d3afc34) to head (12f681f).
⚠️ Report is 46 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #16725      +/-   ##
==========================================
+ Coverage   48.67%   48.80%   +0.13%     
==========================================
  Files        2675     2679       +4     
  Lines      199760   200331     +571     
==========================================
+ Hits        97235    97775     +540     
- Misses      94918    94947      +29     
- Partials     7607     7609       +2     
Flag Coverage Δ
go-unit-tests 48.80% <ø> (+0.13%) ⬆️

@parametalol parametalol left a comment


I'm sad to see a need for such a change.

janisz and others added 2 commits September 9, 2025 18:16
…port

Replace pod-level retry logic with HTTP client-level retries using
hashicorp/go-retryablehttp. This provides better network resilience
for GKE API connectivity issues while working within existing 10s
context timeouts.

Key changes:
- Configure retryablehttp with 3 retries, 3s timeout per request
- Custom retry policy for network errors and 5xx/429 responses
- Use WrapTransport pattern to integrate with k8s client-go
- Fast retry timing (0.5s-2s backoff) to fit multiple attempts in 10s

Addresses timeout issues with GKE endpoints:
- 34.133.244.129, 34.56.5.71 (ROX-30485, ROX-30486, ROX-30470)
- GRPC connection resolution failures (ROX-29713, ROX-29356)

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@janisz janisz changed the title from "ROX-30485: Add retry logic to pod creation and deletion operations" to "ROX-30485: Use go-retryablehttp in Kubernetes client" Sep 9, 2025
@janisz janisz requested a review from parametalol September 9, 2025 18:07
janisz and others added 2 commits September 9, 2025 20:12
Increase HTTP client timeout from 3s to 8s per request based on test
results showing GKE endpoint 34.86.46.176 needs more time to respond.
This still allows for retry within 10s context window while giving
individual requests sufficient time to complete.

Timeline: 8s attempt + 0.5s delay + 1.5s retry = fits in 10s context

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Increase context timeout in createPod and teardownPod functions from
10s to 30s to allow more retry attempts with the 8s HTTP timeout.
This provides up to 4 retry attempts within the context window:
- Attempt 1: 0s → 8s
- Attempt 2: 8.5s → 16.5s
- Attempt 3: 17.5s → 25.5s
- Attempt 4: 27.5s → 30s

Better accommodates slow GKE endpoints while maintaining reasonable
overall test execution time.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@janisz janisz requested a review from a team September 10, 2025 11:42

janisz commented Sep 10, 2025

/retest

@red-hat-konflux

Caution

There are some errors in your PipelineRun template.

PipelineRun Error

The following pipelines all failed with the same CEL expression evaluation error — the expression references body.pull_request, which is missing from the event payload (failed to evaluate: no such key: pull_request):

  • central-db-on-push
  • main-on-push
  • operator-on-push
  • operator-bundle-on-push
  • retag-collector
  • retag-scanner-db-slim
  • retag-scanner-db
  • retag-scanner-slim
  • retag-scanner
  • roxctl-on-push
  • scanner-v4-on-push
  • scanner-v4-db-on-push

Failing expression (identical for each pipeline):

  (
    event == "push" && target_branch.matches("^(master|release-.*|refs/tags/.*)$")
  ) || (
    event == "pull_request" && (
      target_branch.startsWith("release-") ||
      source_branch.matches("(konflux|renovate|appstudio|rhtap)") ||
      (has(body.pull_request.labels) && body.pull_request.labels.exists(l, l.name == "konflux-build"))
    ) && body.action != "ready_for_review"
  )

@janisz janisz mentioned this pull request Sep 10, 2025
@janisz janisz changed the title from "ROX-30485: Use go-retryablehttp in Kubernetes client" to "ROX-30485,ROX-30486,ROX-30470: Use go-retryablehttp in Kubernetes client" Sep 11, 2025
@janisz janisz changed the title from "ROX-30485,ROX-30486,ROX-30470: Use go-retryablehttp in Kubernetes client" to "ROX-30485,ROX-30486,ROX-30470: Use go-retryablehttp in k8s client" Sep 11, 2025
Signed-off-by: Tomasz Janiszewski <tomek@redhat.com>

janisz commented Sep 11, 2025

Added a unit test for retry:

=== RUN   TestCreateK8sClientWithConfig_RetriesOnFailure
    common.go:364: [DEBUG] GET https://mock-k8s-api.example.com/version?timeout=8s
    common_test.go:80: Mock transport call #1 to https://mock-k8s-api.example.com/version?timeout=8s
    common.go:364: [ERR] GET https://mock-k8s-api.example.com/version?timeout=8s request failed: Get "https://mock-k8s-api.example.com/version?timeout=8s": network error: connection refused
    common.go:364: [DEBUG] GET https://mock-k8s-api.example.com/version?timeout=8s: retrying in 500ms (3 left)
    common_test.go:80: Mock transport call #2 to https://mock-k8s-api.example.com/version?timeout=8s
    common.go:364: [ERR] GET https://mock-k8s-api.example.com/version?timeout=8s request failed: Get "https://mock-k8s-api.example.com/version?timeout=8s": network error: connection refused
    common.go:364: [DEBUG] GET https://mock-k8s-api.example.com/version?timeout=8s: retrying in 1s (2 left)
    common_test.go:80: Mock transport call #3 to https://mock-k8s-api.example.com/version?timeout=8s
    common_test.go:130: Successfully completed after 3 calls (including retries)
--- PASS: TestCreateK8sClientWithConfig_RetriesOnFailure (1.50s)
PASS
ok  	github.com/stackrox/rox/tests	1.634s

@janisz janisz merged commit fdd6300 into master Sep 11, 2025
90 checks passed
@janisz janisz deleted the fix/rox-30485-add-retry-logic-to-pod-operations branch September 11, 2025 15:41