ROX-29771: Set minimal process filtering for E2E tests #19890

Closed
janisz wants to merge 1 commit into master from ROX-29771/fix-testpod-process-filtering

Conversation

@janisz
Contributor

@janisz janisz commented Apr 8, 2026

Description

The TestPod E2E test was failing because the default process filter (introduced in ROX-32679) is too aggressive for short-lived processes. The test expects to observe /bin/date and /bin/sleep processes that execute for milliseconds to 1 second in a loop, but the default filter treats these as "spam" and drops them after initial detections.

Root cause:

  • Default filter: MaxExactPathMatches=5, FanOutLevels=[8,6,4,2]
  • Test processes: /bin/date (runs for ms), /bin/sleep 1 (runs 1s)
  • Filter sees repeated short-lived processes as container spamming

Solution:
Set ROX_PROCESS_FILTER_MODE=minimal for all E2E tests to ensure comprehensive process event capture for validation. This gives tests:

  • MaxExactPathMatches: 100 (vs default: 5)
  • FanOutLevels: [20,15,10,5] (vs default: [8,6,4,2])
  • MaxProcessPaths: 20000 (vs default: 5000)

This allows E2E tests to validate process detection as customers would experience it with minimal filtering enabled.

Related issues: ROX-31331, ROX-32679

User-facing documentation

Testing and quality

  • the change is production ready: the change is GA, or otherwise the functionality is gated by a feature flag
  • CI results are inspected

Automated testing

  • added unit tests
  • added e2e tests
  • added regression tests
  • added compatibility tests
  • modified existing tests

How I validated my change

CI

TestPod Investigation - Build 2040047831439380480

Executive Summary

The TestPod E2E test failed because process filtering introduced in March 2026 is too aggressive and drops short-lived processes (/bin/date, /bin/sleep) that the test expects to observe. This is a regression introduced by commit df1f1e84aa (ROX-32679) after the test was previously "unflaked" in October 2025.

Timeline of TestPod History

Phase 1: Early Flakiness (Pre-Oct 2025)

  • Test was historically flaky due to unreliable process event detection
  • Multiple skip/unskip cycles: cab7af93bb, c3a9e96f1c added skips

Phase 2: Major Unflake Effort (Oct 2025)

  • Oct 24, 2025 - 8c2c913c9f - ROX-29771: Unflake TestPods (major refactor)

    • Removed testT.Skip()
    • Added setupMultiContainerPodTest() helper for robust setup
    • Added retry logic (30 attempts over 150 seconds)
    • Changed from exact match to "at least" semantics for process events
    • Initially only required /usr/sbin/nginx (commented out /bin/sh, /bin/date, /bin/sleep)
  • Oct 28, 2025 - 39ca19f7da - ROX-29771: Ensure Sensor is up

    • Added waitForSensorHealthy() check before test
    • Prevents pipeline delays from previous tests
  • Oct 30, 2025 - 76e375a63a - ROX-31331: Require Collector reporting all processes

    • Made test STRICTER: removed workaround, now requires ALL 4 processes:
      requiredProcesses := []string{"/usr/sbin/nginx", "/bin/sh", "/bin/date", "/bin/sleep"}
    • This assumed Collector would reliably detect short-lived processes
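The retry budget from the October refactor (30 attempts over 150 seconds, which works out to one attempt every 5 seconds) can be sketched roughly as below. This is only an illustration: `retry` and `noSleep` are hypothetical names, not helpers from tests/pods_test.go, and the sleep function is injected so the sketch runs without waiting.

```go
package main

import (
	"fmt"
	"time"
)

// noSleep is a stand-in for time.Sleep so the sketch finishes instantly.
func noSleep(time.Duration) {}

// retry calls check up to attempts times, pausing for interval between tries
// through the injected sleep function. It returns the 1-based attempt that
// succeeded, or 0 if every attempt failed.
func retry(attempts int, interval time.Duration, sleep func(time.Duration), check func() bool) int {
	for i := 1; i <= attempts; i++ {
		if check() {
			return i
		}
		if i < attempts {
			sleep(interval)
		}
	}
	return 0
}

func main() {
	const attempts = 30
	const interval = 5 * time.Second

	// Total budget: 30 attempts x 5 seconds.
	fmt.Println(time.Duration(attempts) * interval) // 2m30s

	// Hypothetical check that starts passing on the third poll.
	polls := 0
	succeededOn := retry(attempts, interval, noSleep, func() bool {
		polls++
		return polls == 3
	})
	fmt.Println(succeededOn) // 3
}
```

In a real test the check would query Central for process events and `time.Sleep` would be passed in place of `noSleep`.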

Phase 3: Flakiness Returns (Feb 2026)

  • Feb 10, 2026 - 5515da9706 - Re-disabled test as flaky
    • Added testT.Skip("Flaky: https://issues.redhat.com/browse/ROX-29771")
    • Test was failing again despite October's unflake efforts

Phase 4: Breaking Change (Mar 2026)

  • Mar 5, 2026 - df1f1e84aa - ROX-32679: Preset process filter levels
    • Introduced ROX_PROCESS_FILTER_MODE with three presets:
      • aggressive: MaxExactPathMatches=1, FanOutLevels=[]
      • default: MaxExactPathMatches=5, FanOutLevels=[8,6,4,2]
      • minimal: MaxExactPathMatches=100, FanOutLevels=[20,15,10,5]
    • DEFAULT MODE is too aggressive for short-lived processes
    • Designed to prevent "container spamming" by limiting repeated process executions
    • This is the likely culprit for the current failures
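For reference, the three presets quoted in this PR can be laid out as a small lookup table. This is a sketch, not the code in pkg/env/process_filter_mode.go: the type and function names are hypothetical, the MaxProcessPaths values for default and minimal come from the PR description, and the aggressive value is left at 0 because the PR does not state it.

```go
package main

import "fmt"

// filterPreset holds the settings this PR quotes for one filter mode.
type filterPreset struct {
	MaxExactPathMatches int
	FanOutLevels        []int
	MaxProcessPaths     int // 0 means "not stated in this PR"
}

// presets mirrors the values listed in the PR text only.
var presets = map[string]filterPreset{
	"aggressive": {MaxExactPathMatches: 1, FanOutLevels: []int{}},
	"default":    {MaxExactPathMatches: 5, FanOutLevels: []int{8, 6, 4, 2}, MaxProcessPaths: 5000},
	"minimal":    {MaxExactPathMatches: 100, FanOutLevels: []int{20, 15, 10, 5}, MaxProcessPaths: 20000},
}

// presetFor looks up a mode by name. The boolean is false for unknown modes;
// how the real code handles an unknown mode is not covered by this sketch.
func presetFor(mode string) (filterPreset, bool) {
	p, ok := presets[mode]
	return p, ok
}

func main() {
	for _, mode := range []string{"aggressive", "default", "minimal"} {
		p, _ := presetFor(mode)
		fmt.Printf("%-10s exact=%d fanOut=%v maxPaths=%d\n",
			mode, p.MaxExactPathMatches, p.FanOutLevels, p.MaxProcessPaths)
	}
}
```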

Phase 5: Current State (Apr 2026)

  • Apr 3, 2026 - Build b61d01eaddcb8e79eb758847bfaeec191a37efef tested
    • Test was ENABLED (skip removed at some point between Feb 10 and Apr 3)
    • Test FAILED with missing processes: /bin/date and /bin/sleep
    • Only saw: /usr/sbin/nginx, /bin/sh, /usr/sbin/nginx (duplicate)

Root Cause Analysis

Primary Issue: Process Filter Too Aggressive

The default process filter configuration introduced in ROX-32679:

  • MaxExactPathMatches: 5 - limits exact path matches
  • FanOutLevels: [8,6,4,2] - limits fan-out tree depth
  • Designed to prevent spam from containers that spawn many short-lived processes

Why it breaks the test:

  1. /bin/date executes for milliseconds in a loop
  2. /bin/sleep 1 executes for 1 second repeatedly
  3. Filter sees these as "spam" and drops them after initial detections
  4. Test expects to see these processes but they're filtered out

Secondary Issue: Event Collection Delays

The investigation showed a ~100-second delay before ANY events appeared in Central:

  • Pod created: 13:50:33 UTC
  • First events: ~13:52:12 UTC (~1min 39sec delay)
  • Test timeout: 13:53:03 UTC (exhausted all retries)

This suggests pipeline issues beyond just filtering.

Test Expectations vs Reality

Test expects (from tests/yamls/multi-container-pod.yaml):

  • Container "1st" (nginx): /usr/sbin/nginx (long-lived) ✅
  • Container "2nd" (ubuntu): /bin/sh running loop ✅
    • Loop executes: date >> /html/index.html; sleep 1 repeatedly
    • Expected: /bin/date ❌ FILTERED
    • Expected: /bin/sleep ❌ FILTERED
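Under the "at least" semantics the test uses, the comparison reduces to a set difference between required and observed paths. A minimal sketch using the four required processes and the three events reported for the failing build (`missingProcesses` is a hypothetical helper, not the function in tests/pods_test.go):

```go
package main

import "fmt"

// missingProcesses returns the required paths that never appear in observed,
// in the order they are listed in required. Duplicate or extra observed
// entries are ignored.
func missingProcesses(required, observed []string) []string {
	seen := make(map[string]bool, len(observed))
	for _, p := range observed {
		seen[p] = true
	}
	var missing []string
	for _, p := range required {
		if !seen[p] {
			missing = append(missing, p)
		}
	}
	return missing
}

func main() {
	required := []string{"/usr/sbin/nginx", "/bin/sh", "/bin/date", "/bin/sleep"}
	// Events reported for the failing build, including the duplicate nginx.
	observed := []string{"/usr/sbin/nginx", "/bin/sh", "/usr/sbin/nginx"}
	fmt.Println(missingProcesses(required, observed)) // [/bin/date /bin/sleep]
}
```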

Responsible Team

@stackrox/sensor-ecosystem

This involves:

  • Runtime process event collection (Collector → Sensor → Central pipeline)
  • Process filtering implementation (sensor/common/processfilter/, pkg/process/filter/)
  • Event ingestion and processing

Permanent Solution

Immediate Fix (Option A): Use Minimal Filtering in CI

# In .openshift-ci/dispatch.sh or test deployment configuration
export ROX_PROCESS_FILTER_MODE=minimal

This lets E2E tests capture far more process events for validation (up to 100 exact-path matches instead of 5), although some filtering still applies.

Immediate Fix (Option B): Make Test More Robust

Change test to use longer-lived processes that won't be filtered:

# In tests/yamls/multi-container-pod.yaml
command: ["/bin/sh"]
args:
  - -c
  - |
    while true; do
      /usr/bin/uptime >> /html/index.html
      sleep 30
    done

Then update expected processes to match.

Long-term Improvements

  1. Test Improvements (tests/pods_test.go):

    • Add diagnostics to verify process filter mode is compatible with test
    • Add explicit pre-check that event pipeline is working
    • Better error messages showing which processes are missing and why
    • Consider testing with different filter modes
  2. Process Filter Improvements (pkg/process/filter/):

    • Add debug logging when processes are filtered (at configurable level)
    • Add metrics for filter hit rates and dropped processes
    • Consider exempting certain "test" namespaces from aggressive filtering
    • Document filter behavior and tuning guidance
  3. Event Pipeline Investigation:

    • Investigate 100-second delay in event collection
    • Check for buffering, resource constraints, or backpressure issues
    • Add metrics for end-to-end event latency (Collector → Central)
  4. Documentation:

    • Document process filtering behavior in AGENTS.md or test docs
    • Add guidance on when to use different filter modes
    • Explain CI environment configuration

Key Files and Locations

Test Code:

  • tests/pods_test.go:134-169 - TestPod function
  • tests/yamls/multi-container-pod.yaml - Test pod definition
  • tests/common.go - Test helper functions

Process Filtering:

  • pkg/env/process_filter_mode.go - Filter mode configuration (ROX_PROCESS_FILTER_MODE)
  • pkg/env/process_filter.go - Individual filter settings
  • pkg/process/filter/filter.go - Core filter implementation
  • sensor/common/processfilter/filter.go - Sensor filter singleton
  • central/processindicator/filter/singleton.go - Central filter singleton

CI Configuration:

  • .openshift-ci/dispatch.sh - CI test dispatch script
  • Environment variables for test configuration

Related Issues

  • ROX-29771 - Original flaky test tracking issue
  • ROX-31331 - Collector process detection improvements
  • ROX-32679 - Process filter preset modes (introduced regression)

Recommendations

  1. Immediate: Set ROX_PROCESS_FILTER_MODE=minimal in CI test environments
  2. Short-term: Update test to use longer-lived processes or verify filter compatibility
  3. Medium-term: Add observability to process filtering (logs, metrics)
  4. Long-term: Improve filter logic to handle test scenarios better

Investigation Artifacts

Build failure investigated: 2040047831439380480
Related issues: ROX-31331, ROX-32679

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@janisz janisz requested review from JoukoVirtanen and vikin91 April 8, 2026 08:44
@github-actions
Contributor

github-actions bot commented Apr 8, 2026

🚀 Build Images Ready

Images are ready for commit fb04fc6. To use with deploy scripts:

export MAIN_IMAGE_TAG=4.11.x-582-gfb04fc63d2

Contributor

@vikin91 vikin91 left a comment


Not bad for a start, but we can tune the filter even more by setting:

  • ROX_PROCESS_FILTER_MAX_PROCESS_PATHS
  • ROX_PROCESS_FILTER_FAN_OUT_LEVELS
  • ROX_PROCESS_FILTER_MAX_EXACT_PATH_MATCHES

separately with even more permissive settings than those for the minimal preset.

The minimal preset sets them as follows:

  • ROX_PROCESS_FILTER_MAX_PROCESS_PATHS = 20000
  • ROX_PROCESS_FILTER_FAN_OUT_LEVELS = [20, 15, 10, 5]
  • ROX_PROCESS_FILTER_MAX_EXACT_PATH_MATCHES = 100
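One way to picture this suggestion is to start from the minimal preset and layer the individual variables on top. The sketch below is illustrative only: the precedence (individual variables win over the preset), the accepted syntax for the fan-out list, and the override value 1000 are all assumptions, and `settings`, `parseFanOutLevels`, and `applyOverrides` are hypothetical names rather than code from pkg/env/process_filter.go.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

type settings struct {
	MaxProcessPaths     int
	FanOutLevels        []int
	MaxExactPathMatches int
}

// parseFanOutLevels accepts "[20, 15, 10, 5]" or "20,15,10,5".
// The accepted syntax is an assumption made for this sketch.
func parseFanOutLevels(raw string) ([]int, error) {
	trimmed := strings.TrimSpace(raw)
	trimmed = strings.TrimPrefix(trimmed, "[")
	trimmed = strings.TrimSuffix(trimmed, "]")
	if strings.TrimSpace(trimmed) == "" {
		return []int{}, nil
	}
	parts := strings.Split(trimmed, ",")
	levels := make([]int, 0, len(parts))
	for _, part := range parts {
		n, err := strconv.Atoi(strings.TrimSpace(part))
		if err != nil {
			return nil, fmt.Errorf("invalid fan-out level %q: %w", part, err)
		}
		levels = append(levels, n)
	}
	return levels, nil
}

// applyOverrides returns base with any individually set variables applied.
// lookup has the same shape as os.LookupEnv so the sketch stays testable.
func applyOverrides(base settings, lookup func(string) (string, bool)) (settings, error) {
	out := base
	if v, ok := lookup("ROX_PROCESS_FILTER_MAX_PROCESS_PATHS"); ok {
		n, err := strconv.Atoi(v)
		if err != nil {
			return base, fmt.Errorf("max process paths: %w", err)
		}
		out.MaxProcessPaths = n
	}
	if v, ok := lookup("ROX_PROCESS_FILTER_FAN_OUT_LEVELS"); ok {
		levels, err := parseFanOutLevels(v)
		if err != nil {
			return base, err
		}
		out.FanOutLevels = levels
	}
	if v, ok := lookup("ROX_PROCESS_FILTER_MAX_EXACT_PATH_MATCHES"); ok {
		n, err := strconv.Atoi(v)
		if err != nil {
			return base, fmt.Errorf("max exact path matches: %w", err)
		}
		out.MaxExactPathMatches = n
	}
	return out, nil
}

func main() {
	minimal := settings{MaxProcessPaths: 20000, FanOutLevels: []int{20, 15, 10, 5}, MaxExactPathMatches: 100}
	// Hypothetical override; 1000 is an illustrative value, not a recommendation.
	overrides := map[string]string{"ROX_PROCESS_FILTER_MAX_EXACT_PATH_MATCHES": "1000"}
	lookup := func(key string) (string, bool) {
		v, ok := overrides[key]
		return v, ok
	}
	effective, err := applyOverrides(minimal, lookup)
	fmt.Printf("%+v %v\n", effective, err)
}
```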

@codecov

codecov bot commented Apr 8, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 49.58%. Comparing base (19fcb9e) to head (fb04fc6).
⚠️ Report is 20 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #19890      +/-   ##
==========================================
- Coverage   49.58%   49.58%   -0.01%     
==========================================
  Files        2766     2766              
  Lines      208530   208540      +10     
==========================================
+ Hits       103398   103401       +3     
- Misses      97457    97464       +7     
  Partials     7675     7675              
Flag Coverage Δ
go-unit-tests 49.58% <ø> (-0.01%) ⬇️


@JoukoVirtanen
Contributor

JoukoVirtanen commented Apr 8, 2026

#18556 should not have had any impact on e2e tests, as it did not actually change the filtering used in e2e tests. It made it easier to change the process filtering settings, but it did not change the default process filtering settings. I don't think the tests should be run with minimal filtering, because that will not be used in production very much.

The PR description has the following:

The TestPod E2E test was failing because the default process filter (introduced in ROX-32679) is too aggressive for short-lived processes.

The filter is not impacted by how long the processes live. Collector does not report when processes terminate, so sensor and central do not know how long or short-lived processes are. The filter operates according to how frequently it sees the same process and its arguments, not how short-lived the process is.
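To make the distinction concrete, here is a deliberately simplified sketch of a frequency-based limit. It is not the implementation in pkg/process/filter/filter.go: it ignores fan-out levels entirely, the key of path plus arguments is an assumption, and `exactMatchFilter` is a hypothetical name. The point is only that the decision depends on how many times the same key has been seen, and that no process duration enters into it.

```go
package main

import "fmt"

// exactMatchFilter admits at most limit events per distinct (path, args) key.
type exactMatchFilter struct {
	limit int
	seen  map[string]int
}

func newExactMatchFilter(limit int) *exactMatchFilter {
	return &exactMatchFilter{limit: limit, seen: make(map[string]int)}
}

// Admit reports whether the event is kept. Only the repeat count of the key
// matters; there is no input describing how long the process ran.
func (f *exactMatchFilter) Admit(path, args string) bool {
	key := path + "\x00" + args
	if f.seen[key] >= f.limit {
		return false
	}
	f.seen[key]++
	return true
}

func main() {
	f := newExactMatchFilter(5) // default MaxExactPathMatches quoted in this PR
	admitted := 0
	for i := 0; i < 60; i++ { // 60 iterations of the "sleep 1" loop
		if f.Admit("/bin/sleep", "1") {
			admitted++
		}
	}
	fmt.Println(admitted) // 5
}
```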

I think more understanding of why the test is flaky is needed before a fix can be implemented.

@JoukoVirtanen
Contributor

When did this test become flaky?

# Use minimal process filtering for E2E tests to ensure short-lived test processes
# (like /bin/date, /bin/sleep) are captured for validation in tests like TestPod.
# See ROX-29771 for details on process filtering impact on test reliability.
export ROX_PROCESS_FILTER_MODE=minimal
Contributor


Holding this PR open.

@janisz janisz closed this Apr 8, 2026
@janisz
Contributor Author

janisz commented Apr 8, 2026

Closing as this needs a proper investigation. Maybe the solution will be to sleep longer and remove date.

@vikin91
Contributor

vikin91 commented Apr 9, 2026

Closing as this needs a proper investigation. Maybe the solution will be to sleep longer and remove date.

I tried that approach in the past, and it was not the right way to go.
The best results came from ensuring that Sensor is not restarted and does not cycle through offline mode during the test, since either can cause messages from Collector to be lost.

3 participants