ROX-29382: Improve ReconciliationTest timing robustness#16707

Merged: janisz merged 1 commit into master from fix/ROX-29382-reconciliation-test-improvements on Sep 8, 2025
Conversation

@janisz janisz commented Sep 8, 2025

This change addresses the flaky `ReconciliationTest` failure where the sensor fails to detect resource deletions that occurred during sensor downtime.

Problem Description

ROX-29382 represents a recurring flaky test failure in `ReconciliationTest` where the sensor fails to detect resource deletions that occurred during sensor downtime. The test intermittently reports 0 deletions when ≥1 deletion is expected, specifically affecting `*central.SensorEvent_Secret` and `*central.SensorEvent_NetworkPolicy` resources.

**Root Cause**: a timing/race condition in the sensor reconciliation logic: the test checks reconciliation statistics before the sensor completes its full resource state comparison and deletion detection.

Solution Implemented

  • Extract the reconciliation waiting logic into a dedicated `@Retry`-annotated function
  • Wait for actual deletion detection, not just the reconciliation completion status
  • Add enhanced logging for debugging timing issues
  • Use the StackRox `@Retry` annotation with 30 attempts and 5-second delays
  • Improve error messages with expected vs. actual values

The race condition occurred because the test checked `reconciliationDone=true` but didn't ensure all deletions had been detected yet. This implementation waits for both reconciliation completion AND non-empty deletion counts.

Technical Analysis

Test Workflow

```groovy
def "Verify the Sensor reconciles after being restarted"() {
    // 1. Create test resources (Secret, NetworkPolicy, Deployment, Pod)
    // 2. Scale sensor deployment to 0 (sensor goes offline)
    // 3. Delete test resources while sensor is down
    // 4. Scale sensor deployment to 1 (sensor restarts)
    // 5. Wait for reconciliation completion
    // 6. Verify reconciliation stats show ≥1 deletions detected
}
```

Expected vs Actual Results

```groovy
// From the ROX-29382 failure
EXPECTED_MIN_DELETIONS_BY_KEY = [
    "*central.SensorEvent_Secret": 1,           // Expected ≥1, Got 0
    "*central.SensorEvent_NetworkPolicy": 1,    // Expected ≥1, Got 0
    "*central.SensorEvent_Pod": 1,              // Expected ≥1, Got 0
    "*central.SensorEvent_Deployment": 1,       // Expected ≥1, Got 0
    "*central.SensorEvent_Namespace": 1,        // Expected ≥1, Got 0
]
```

Root Cause Analysis

The reconciliation process happens in this order:

  1. Reconciliation starts: the sensor sends the deduper sync complete event
  2. `Pipeline.Reconcile()` is called, processing all resource differences
  3. `reconciliationMap.Close()` is called, populating the deletion counts
  4. `reconciliationDone` becomes true

The race condition occurs due to the asynchronous nature of connection handling and potential delays between when `reconciliationMap.Close()` is called and when the API endpoint reflects the changes.

Historical Context

Pattern Evolution

  1. 2022-2023: "Overly aggressive reconciliation" - sensor detected MORE deletions than expected
  2. 2025: "Under-detection reconciliation" - sensor detected FEWER deletions than expected

This pattern shift suggests the reconciliation mechanism has changed behavior over time.

Environment Specifics

  • Only fails on GKE compatibility tests
  • Skipped on OpenShift (@IgnoreIf({ Env.mustGetOrchestratorType() == OrchestratorTypes.OPENSHIFT }))
  • Environment-specific timing differences

🤖 Generated with Claude Code

rhacs-bot commented Sep 8, 2025

Images are ready for the commit at 3564066.

To use with deploy scripts, first `export MAIN_IMAGE_TAG=4.9.x-694-g3564066a0d`.

codecov bot commented Sep 8, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 48.63%. Comparing base (017d800) to head (3564066).
⚠️ Report is 5 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #16707   +/-   ##
=======================================
  Coverage   48.62%   48.63%           
=======================================
  Files        2664     2664           
  Lines      199343   199336    -7     
=======================================
+ Hits        96928    96939   +11     
+ Misses      94819    94805   -14     
+ Partials     7596     7592    -4     
| Flag | Coverage Δ |
|---|---|
| go-unit-tests | 48.63% <ø> (+<0.01%) ⬆️ |

Flags with carried forward coverage won't be shown.

@janisz janisz force-pushed the fix/ROX-29382-reconciliation-test-improvements branch from 8a2bd3a to 3564066 on September 8, 2025 09:55
@janisz janisz changed the title Fix ROX-29382: Improve ReconciliationTest timing robustness ROX-29382: Improve ReconciliationTest timing robustness Sep 8, 2025
@janisz janisz requested review from a team and mtodor September 8, 2025 11:58
@janisz janisz mentioned this pull request Sep 8, 2025
@janisz janisz merged commit ddf371e into master Sep 8, 2025
89 of 90 checks passed
@janisz janisz deleted the fix/ROX-29382-reconciliation-test-improvements branch September 8, 2025 12:33