Skip to content

ROX-28259: Reenable entities store history & add enrichment purger#14538

Merged
vikin91 merged 9 commits intomasterfrom
piotr/ROX-28259-the-sensor-fix
Apr 30, 2025
Merged

ROX-28259: Reenable entities store history & add enrichment purger#14538
vikin91 merged 9 commits intomasterfrom
piotr/ROX-28259-the-sensor-fix

Conversation

@vikin91
Copy link
Contributor

@vikin91 vikin91 commented Mar 7, 2025

Description

This PR implements solutions for the following goals:

  • Stop accumulating network flows and endpoints in Sensor's memory when running with fake workloads
  • Periodically purge network flows and endpoints from Sensor's memory that were never removed by the enrichment process.
  • Implement & refactor a large group of metrics around the enrichment to get a better picture around the process.

How this part of the code works

The enrichment process is there to add data known to Sensor to the data sent by Collector.
Sensor keeps track of open connections (network flows) and endpoints that participate in connections.
When Collector sends an update (every 30s), Sensor analyzes the data and does the following procedure:

  1. Stores the data from Collector in the connectionsByHost map. This map can be understood as a queue of updates to be enriched.
  2. The data is periodically (every 30s) checked by Sensor according to the following schema:
  • calculate a diff between the update from Collector and the state of connections and endpoints known in Sensor.
  • if there is any new information from Collector or the result of the enrichment is different than in the last cycle: send update to Central.
  • all connections and endpoints that are active (not closed) are stored in Sensor's memory (activeConnections & activeEndpoints), because further updates from Collector are expected.
  • if there is data missing required for enrichment, Sensor puts back the item to the connectionsByHost map, so that the enrichment can be tried again in the next cycle.
  • all successfully enriched entities are removed from the connectionsByHost map.
  • additionally, Sensor may delete the data from the connectionsByHost map if it knows that the data is faulty or incomplete and there is no chance for it to work in the next cycle.

The goal is that at the end of each 30s cycle, the connectionsByHost map contains only those entries that should be retried again in the next cycle. It is expected that over longer time periods the size of connectionsByHost map fluctuates around a known (relatively small) number.

Adding the purger

Unfortunately, we observed that in some cases connections and endpoints were never deleted from the connectionsByHost map. Moreover, in each enrichment cycle they were copied to the activeConnections & activeEndpoints slices and were also stuck there. The enrichment process is expected to handle those data structures so that data stays there only for a limited time, but that was not the case.

To make sure that no data is stuck in Sensor, I add a purger in this PR. The purger periodically (30m) checks the three data structures (activeConnections, activeEndpoints, and connectionsByHost) and checks a list of rules whether a given entry can be removed. The ultimate rule is the max-age where the purger checks the time the data resides in Sensor and deletes it if it has not been enriched for a given time (default 4h).

In clusters with high workload, the purger takes up to 2s (based on metrics collected for a long-running cluster) to complete the cleanup. During that time, it locks two mutexes, so I made sure that it runs rarely (twice per hour). On smaller clusters the runtime of purger is negligible.

User-facing documentation

  • CHANGELOG is updated OR update is not needed
  • documentation PR is created and is linked above OR is not needed

Testing and quality

  • the change is production ready: the change is GA or otherwise the functionality is gated by a feature flag
  • CI results are inspected

Automated testing

  • added unit tests
  • added e2e tests
  • added regression tests
  • added compatibility tests
  • modified existing tests

How I validated my change

Mainly by long-running cluster and checking metrics. I have been observing long-running clusters for about 2 weeks until I got to the point where:

  • activeEndpoints, activeConnections, and connectionsByHost in the networkFlowManager are not constantly growing
  • memory consumption of Sensor over longer time periods is not constantly growing (the way it used to before).

BEFORE: Applying purger only to activeConnections & activeEndpoints

Enrichment-of-Endpoints-Dashboards-Grafana-04-04-2025_10_04_AM

AFTER: Purger cleans activeConnections, activeEndpoints, and connectionsByHost

Enrichment-of-Endpoints-Dashboards-Grafana-04-07-2025_11_00_AM

BEFORE: Typical CPU & memory consumption for fake workloads long-running cluster

Screenshot 2025-04-07 at 12 29 48

AFTER: CPU & memory consumption for fake workloads long-running cluster (with purger)

Screenshot 2025-04-07 at 12 30 11

Summary by Sourcery

Implement a purger mechanism to prevent memory accumulation in Sensor's network flow tracking by periodically cleaning up stale network flows, endpoints, and connections

New Features:

  • Introduced a purger mechanism that periodically checks and removes stale network flows, endpoints, and connections from Sensor's memory
  • Added configurable max age and cycle time for the purger via environment variables
  • Implemented detailed metrics tracking for purger events and performance

Enhancements:

  • Refactored network flow tracking to support historical container tracking
  • Added more granular metrics for network flow enrichment and tracking
  • Improved memory management in Sensor's network flow processing

CI:

  • Updated CI configurations to support new environment variables for purger settings

Tests:

  • Added unit tests for purger functionality across different scenarios
  • Implemented test cases for purging host connections, active endpoints, and active connections

Chores:

  • Updated interfaces and mocks to support historical container tracking
  • Added comprehensive test coverage for the new purger mechanism

@vikin91
Copy link
Contributor Author

vikin91 commented Mar 7, 2025

@openshift-ci
Copy link

openshift-ci bot commented Mar 7, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@rhacs-bot
Copy link
Contributor

rhacs-bot commented Mar 7, 2025

Images are ready for the commit at 2c27b6f.

To use with deploy scripts, first export MAIN_IMAGE_TAG=4.8.x-566-g2c27b6f83d.

@codecov
Copy link

codecov bot commented Mar 7, 2025

Codecov Report

Attention: Patch coverage is 83.44828% with 72 lines in your changes missing coverage. Please review.

Project coverage is 49.06%. Comparing base (428f3be) to head (2c27b6f).
Report is 2 commits behind head on master.

Files with missing lines Patch % Lines
sensor/common/networkflow/manager/manager_impl.go 72.80% 59 Missing and 6 partials ⚠️
sensor/common/networkflow/manager/purger.go 98.39% 2 Missing and 1 partial ⚠️
...or/kubernetes/listener/resources/store_provider.go 25.00% 2 Missing and 1 partial ⚠️
sensor/common/processsignal/enricher.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #14538      +/-   ##
==========================================
+ Coverage   49.00%   49.06%   +0.06%     
==========================================
  Files        2550     2551       +1     
  Lines      187280   187595     +315     
==========================================
+ Hits        91768    92035     +267     
- Misses      88256    88299      +43     
- Partials     7256     7261       +5     
Flag Coverage Δ
go-unit-tests 49.06% <83.44%> (+0.06%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@vikin91 vikin91 force-pushed the piotr/ROX-28259-the-sensor-fix branch from f2033e5 to bc9c4b5 Compare March 10, 2025 08:35
@vikin91 vikin91 added the turbo-build Uses a faster path to images label Mar 11, 2025
@vikin91 vikin91 force-pushed the piotr/ROX-28259-the-sensor-fix branch 2 times, most recently from 85380f7 to 353c7a9 Compare March 13, 2025 09:17
@vikin91 vikin91 force-pushed the piotr/ROX-28259-the-sensor-fix branch 3 times, most recently from 063753c to 6dbdc5c Compare April 1, 2025 20:08
@vikin91 vikin91 force-pushed the piotr/ROX-28259-the-sensor-fix branch from 6ab33a6 to 633ea1a Compare April 7, 2025 13:00
@vikin91 vikin91 mentioned this pull request Apr 7, 2025
10 tasks
@vikin91 vikin91 changed the base branch from master to piotr/upgrade-grafana April 7, 2025 13:22
@vikin91 vikin91 force-pushed the piotr/ROX-28259-the-sensor-fix branch from 633ea1a to cbc809c Compare April 7, 2025 13:22
@vikin91 vikin91 force-pushed the piotr/upgrade-grafana branch from bfec36a to e7b173b Compare April 7, 2025 15:22
@vikin91 vikin91 force-pushed the piotr/ROX-28259-the-sensor-fix branch from 9aba1ec to 2f5c078 Compare April 7, 2025 15:22
Base automatically changed from piotr/upgrade-grafana to master April 8, 2025 09:08
Copy link
Contributor

@lvalerom lvalerom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial review (I haven't looked at the metrics in too many detail yet). Looks good, my main concern is not fully understanding why we need to also purge activeEndpoints if we are purging the hostConnection.endpoint`.

@vikin91 vikin91 force-pushed the piotr/ROX-28259-the-sensor-fix branch 2 times, most recently from 27bd009 to 7e8149c Compare April 9, 2025 10:20
@vikin91 vikin91 marked this pull request as ready for review April 9, 2025 15:00
@vikin91 vikin91 requested a review from a team as a code owner April 9, 2025 15:00
@vikin91 vikin91 force-pushed the piotr/ROX-28259-the-sensor-fix branch from cdc7141 to b91a26a Compare April 22, 2025 14:06
@vikin91 vikin91 force-pushed the piotr/ROX-28259-the-sensor-fix branch from b91a26a to 6324301 Compare April 22, 2025 14:10
@vikin91
Copy link
Contributor Author

vikin91 commented Apr 23, 2025

/retest

@vikin91
Copy link
Contributor Author

vikin91 commented Apr 23, 2025

/retest

1 similar comment
@vikin91
Copy link
Contributor Author

vikin91 commented Apr 23, 2025

/retest

@vikin91
Copy link
Contributor Author

vikin91 commented Apr 24, 2025

The gke-upgrade-test fails in a strange way:

PolicyConfigurationTest > Verify env var policy configuration for source #envVarSource fails validation > Verify env var policy configuration for source RESOURCE_FIELD fails validation STANDARD_OUT
    18:41:22 | INFO  | PolicyConfigurationTest   | PolicyConfigurationTest   | Starting testcase: Verify env var policy configuration for source RESOURCE_FIELD fails validation
    18:41:23 | ERROR | PolicyConfigurationTest   | PolicyService             | error creating new policy
    io.grpc.StatusRuntimeException: INVALID_ARGUMENT: policy invalid error: error validating lifecycle stage error: policy configuration is invalid for deploy time: policy validation error: validation of section "" error: policy criteria "Environment Variable" has invalid value[0]="KEY=VALUE" must match regex "((?m)^(?i:(UNSET|RAW|UNKNOWN|^)=([^=]*)=.*)|((SECRET_KEY|CONFIG_MAP_KEY|FIELD|RESOURCE_FIELD)=([^=]*)=$)$)": invalid arguments
    	at io.stackrox.proto.api.v1.PolicyServiceGrpc$PolicyServiceBlockingStub.postPolicy(PolicyServiceGrpc.java:1022) [3 skipped]
    	at services.PolicyService.createNewPolicy(PolicyService.groovy:31) [1 skipped]
    	at PolicyConfigurationTest.$spock_feature_1_3(PolicyConfigurationTest.groovy:714) [9 skipped]
    	at util.OnFailureInterceptor.intercept(OnFailure.groovy:72) [9 skipped]
    	at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$6(NodeTestTask.java:151) [7 skipped]
    	at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
    	at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:141)
    	at util.OnFailureInterceptor.intercept(OnFailure.groovy:72) [6 skipped]
    	at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61) [4 skipped]
     [4 skipped]
    18:41:23 | INFO  | PolicyConfigurationTest   | PolicyConfigurationTest   | Ending testcase
PolicyConfigurationTest > Verify env var policy configuration for source #envVarSource fails validation > Verify env var policy configuration for source RESOURCE_FIELD fails validation PASSED

This looks unrelated.

All other *-qa-e2e-tests fail due to VulnScanWithGraphQLTest: Verify image info from CVE-2017-12611 in GraphQL.

Copy link
Contributor

@lvalerom lvalerom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I'm not giving it a green yet in case you decided to implement my suggestion about disabling the purger. If you'd rather not implement it, let me know and I'll give it a green.

@openshift-ci
Copy link

openshift-ci bot commented Apr 25, 2025

@vikin91: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/ocp-4-12-qa-e2e-tests 2c27b6f link false /test ocp-4-12-qa-e2e-tests
ci/prow/ocp-4-12-nongroovy-e2e-tests 2c27b6f link false /test ocp-4-12-nongroovy-e2e-tests
ci/prow/ocp-4-12-operator-e2e-tests 2c27b6f link false /test ocp-4-12-operator-e2e-tests

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Copy link
Contributor

@lvalerom lvalerom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you for all the improvements 🙂

Copy link
Contributor

@mtodor mtodor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! The changes look good to me. I didn't find anything major that could cause issues. There may be a few nitpicks, but I didn't see them as relevant to further delay PR.

PR is a bit difficult to review. It's simply a lot of changes in one PR. Since this PR blocks other tasks, I would suggest merging it.

One proposal for future work. Let's try to split PR into smaller ones. That could help with reviews. In this case, I can see three parts:

  • refactor manager logic
  • introduce additional metrics collection
  • introduce purger

@vikin91 vikin91 merged commit 253c521 into master Apr 30, 2025
86 of 89 checks passed
@vikin91 vikin91 deleted the piotr/ROX-28259-the-sensor-fix branch April 30, 2025 06:24
janisz added a commit that referenced this pull request May 13, 2025
janisz added a commit that referenced this pull request May 13, 2025
…urger (#14538)"

This reverts commit 253c521.

# Conflicts:
#	pkg/env/sensor.go
#	sensor/common/networkflow/manager/manager_impl.go
#	sensor/common/networkflow/manager/purger.go
#	sensor/common/networkflow/manager/purger_test.go
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/sensor turbo-build Uses a faster path to images

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants