ROX-28259: Reenable entities store history & add enrichment purger by vikin91 · Pull Request #14538 · stackrox/stackrox

vikin91 · 2025-03-07T11:54:46Z

Description

This PR implements solutions for the following goals:

Stop accumulating network flows and endpoints in Sensor's memory when running with fake workloads
Periodically purge network flows and endpoints from Sensor's memory that were never removed by the enrichment process.
Implement & refactor a large group of metrics around the enrichment to get a better picture around the process.

How this part of the code works

The enrichment process is there to add data known to Sensor to the data sent by Collector.
Sensor keeps track of open connections (network flows) and endpoints that participate in connections.
When Collector sends an update (every 30s), Sensor analyzes the data and does the following procedure:

Stores the data from Collector in the connectionsByHost map. This map can be understood as a queue of updates to be enriched.
The data is periodically (every 30s) checked by Sensor according to the following schema:

calculate a diff between the update from Collector and the state of connections and endpoints known in Sensor.
if there is any new information from Collector or the result of the enrichment is different than in the last cycle: send update to Central.
all connections and endpoints that are active (not closed) are stored in Sensor's memory (activeConnections & activeEndpoints), because further updates from Collector are expected.
if there is data missing required for enrichment, Sensor puts back the item to the connectionsByHost map, so that the enrichment can be tried again in the next cycle.
all successfully enriched entities are removed from the connectionsByHost map.
additionally, Sensor may delete the data from the connectionsByHost map if it knows that the data is faulty or incomplete and there is no chance for it to work in the next cycle.

The goal is that at the end of each 30s cycle, the connectionsByHost map contains only those entries that should be retried again in the next cycle. It is expected that over longer time periods the size of connectionsByHost map fluctuates around a known (relatively small) number.

Adding the purger

Unfortunately, we observed that in some cases connections and endpoints were never deleted from the connectionsByHost map. Moreover, in each enrichment cycle they were copied to the activeConnections & activeEndpoints slices and were also stuck there. The enrichment process is expected to handle those data structures so that data stays there only for a limited time, but that was not the case.

To make sure that no data is stuck in Sensor, I add a purger in this PR. The purger periodically (30m) checks the three data structures (activeConnections, activeEndpoints, and connectionsByHost) and checks a list of rules whether a given entry can be removed. The ultimate rule is the max-age where the purger checks the time the data resides in Sensor and deletes it if it has not been enriched for a given time (default 4h).

In clusters with high workload, the purger takes up to 2s (based on metrics collected for a long-running cluster) to complete the cleanup. During that time, it locks two mutexes, so I made sure that it runs rarely (twice per hour). On smaller clusters the runtime of purger is negligible.

User-facing documentation

CHANGELOG is updated OR update is not needed
documentation PR is created and is linked above OR is not needed

Testing and quality

the change is production ready: the change is GA or otherwise the functionality is gated by a feature flag
CI results are inspected

Automated testing

How I validated my change

Mainly by long-running cluster and checking metrics. I have been observing long-running clusters for about 2 weeks until I got to the point where:

activeEndpoints, activeConnections, and connectionsByHost in the networkFlowManager are not constantly growing
memory consumption of Sensor over longer time periods is not constantly growing (the way it used to before).

BEFORE: Applying purger only to activeConnections & activeEndpoints

AFTER: Purger cleans activeConnections, activeEndpoints, and connectionsByHost

BEFORE: Typical CPU & memory consumption for fake workloads long-running cluster

AFTER: CPU & memory consumption for fake workloads long-running cluster (with purger)

Summary by Sourcery

Implement a purger mechanism to prevent memory accumulation in Sensor's network flow tracking by periodically cleaning up stale network flows, endpoints, and connections

New Features:

Introduced a purger mechanism that periodically checks and removes stale network flows, endpoints, and connections from Sensor's memory
Added configurable max age and cycle time for the purger via environment variables
Implemented detailed metrics tracking for purger events and performance

Enhancements:

Refactored network flow tracking to support historical container tracking
Added more granular metrics for network flow enrichment and tracking
Improved memory management in Sensor's network flow processing

CI:

Updated CI configurations to support new environment variables for purger settings

Tests:

Added unit tests for purger functionality across different scenarios
Implemented test cases for purging host connections, active endpoints, and active connections

Chores:

Updated interfaces and mocks to support historical container tracking
Added comprehensive test coverage for the new purger mechanism

vikin91 · 2025-03-07T11:54:48Z

This change is part of the following stack:

ROX-28859: Upgrade Grafana to v10 #14901
- ROX-28259: Reenable entities store history & add enrichment purger #14538 ◀
  - ROX-28259: Add Grafana dashboard for endpoints enrichment in Sensor #14899

_{Change managed by git-spice.}

openshift-ci · 2025-03-07T11:54:50Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

rhacs-bot · 2025-03-07T12:20:09Z

Images are ready for the commit at 2c27b6f.

To use with deploy scripts, first export MAIN_IMAGE_TAG=4.8.x-566-g2c27b6f83d.

codecov · 2025-03-07T13:33:09Z

Codecov Report

Attention: Patch coverage is 83.44828% with 72 lines in your changes missing coverage. Please review.

Project coverage is 49.06%. Comparing base (428f3be) to head (2c27b6f).
Report is 2 commits behind head on master.

Files with missing lines	Patch %	Lines
sensor/common/networkflow/manager/manager_impl.go	72.80%	59 Missing and 6 partials ⚠️
sensor/common/networkflow/manager/purger.go	98.39%	2 Missing and 1 partial ⚠️
...or/kubernetes/listener/resources/store_provider.go	25.00%	2 Missing and 1 partial ⚠️
sensor/common/processsignal/enricher.go	0.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #14538      +/-   ##
==========================================
+ Coverage   49.00%   49.06%   +0.06%     
==========================================
  Files        2550     2551       +1     
  Lines      187280   187595     +315     
==========================================
+ Hits        91768    92035     +267     
- Misses      88256    88299      +43     
- Partials     7256     7261       +5

Flag	Coverage Δ
go-unit-tests	`49.06% <83.44%> (+0.06%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

lvalerom

Partial review (I haven't looked at the metrics in too many detail yet). Looks good, my main concern is not fully understanding why we need to also purge activeEndpoints if we are purging the hostConnection.endpoint`.

sensor/common/clusterentities/store_container_id_test.go

sensor/common/networkflow/manager/manager_impl.go

sensor/common/networkflow/manager/purger.go

sensor/common/networkflow/manager/manager_impl.go

sensor/common/networkflow/manager/purger.go

vikin91 · 2025-04-23T06:33:02Z

/retest

sensor/common/networkflow/manager/manager_impl.go

vikin91 · 2025-04-23T13:50:15Z

/retest

vikin91 · 2025-04-23T16:57:23Z

/retest

vikin91 · 2025-04-24T08:48:31Z

The gke-upgrade-test fails in a strange way:

PolicyConfigurationTest > Verify env var policy configuration for source #envVarSource fails validation > Verify env var policy configuration for source RESOURCE_FIELD fails validation STANDARD_OUT
    18:41:22 | INFO  | PolicyConfigurationTest   | PolicyConfigurationTest   | Starting testcase: Verify env var policy configuration for source RESOURCE_FIELD fails validation
    18:41:23 | ERROR | PolicyConfigurationTest   | PolicyService             | error creating new policy
    io.grpc.StatusRuntimeException: INVALID_ARGUMENT: policy invalid error: error validating lifecycle stage error: policy configuration is invalid for deploy time: policy validation error: validation of section "" error: policy criteria "Environment Variable" has invalid value[0]="KEY=VALUE" must match regex "((?m)^(?i:(UNSET|RAW|UNKNOWN|^)=([^=]*)=.*)|((SECRET_KEY|CONFIG_MAP_KEY|FIELD|RESOURCE_FIELD)=([^=]*)=$)$)": invalid arguments
    	at io.stackrox.proto.api.v1.PolicyServiceGrpc$PolicyServiceBlockingStub.postPolicy(PolicyServiceGrpc.java:1022) [3 skipped]
    	at services.PolicyService.createNewPolicy(PolicyService.groovy:31) [1 skipped]
    	at PolicyConfigurationTest.$spock_feature_1_3(PolicyConfigurationTest.groovy:714) [9 skipped]
    	at util.OnFailureInterceptor.intercept(OnFailure.groovy:72) [9 skipped]
    	at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$6(NodeTestTask.java:151) [7 skipped]
    	at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
    	at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:141)
    	at util.OnFailureInterceptor.intercept(OnFailure.groovy:72) [6 skipped]
    	at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61) [4 skipped]
     [4 skipped]
    18:41:23 | INFO  | PolicyConfigurationTest   | PolicyConfigurationTest   | Ending testcase
PolicyConfigurationTest > Verify env var policy configuration for source #envVarSource fails validation > Verify env var policy configuration for source RESOURCE_FIELD fails validation PASSED

This looks unrelated.

All other *-qa-e2e-tests fail due to VulnScanWithGraphQLTest: Verify image info from CVE-2017-12611 in GraphQL.

lvalerom

LGTM. I'm not giving it a green yet in case you decided to implement my suggestion about disabling the purger. If you'd rather not implement it, let me know and I'll give it a green.

sensor/common/networkflow/manager/manager_impl.go

sensor/common/networkflow/manager/purger.go

sensor/common/networkflow/manager/manager_impl.go

sensor/common/networkflow/manager/purger.go

sensor/common/networkflow/manager/manager_impl.go

openshift-ci · 2025-04-25T17:34:51Z

@vikin91: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/ocp-4-12-qa-e2e-tests	`2c27b6f`	link	false	`/test ocp-4-12-qa-e2e-tests`
ci/prow/ocp-4-12-nongroovy-e2e-tests	`2c27b6f`	link	false	`/test ocp-4-12-nongroovy-e2e-tests`
ci/prow/ocp-4-12-operator-e2e-tests	`2c27b6f`	link	false	`/test ocp-4-12-operator-e2e-tests`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

lvalerom

LGTM. Thank you for all the improvements 🙂

mtodor

Nice work! The changes look good to me. I didn't find anything major that could cause issues. There may be a few nitpicks, but I didn't see them as relevant to further delay PR.

PR is a bit difficult to review. It's simply a lot of changes in one PR. Since this PR blocks other tasks, I would suggest merging it.

One proposal for future work. Let's try to split PR into smaller ones. That could help with reviews. In this case, I can see three parts:

refactor manager logic
introduce additional metrics collection
introduce purger

…14538)

…urger (#14538)" This reverts commit 253c521.

…urger (#14538)" This reverts commit 253c521. # Conflicts: # pkg/env/sensor.go # sensor/common/networkflow/manager/manager_impl.go # sensor/common/networkflow/manager/purger.go # sensor/common/networkflow/manager/purger_test.go

openshift-ci bot added the do-not-merge/work-in-progress label Mar 7, 2025

github-actions bot added the area/sensor label Mar 7, 2025

vikin91 force-pushed the piotr/ROX-28259-the-sensor-fix branch from f2033e5 to bc9c4b5 Compare March 10, 2025 08:35

lvalerom mentioned this pull request Mar 10, 2025

ROX-28315: Add UpdatedAt field to the NetFlows table #14483

Merged

15 tasks

vikin91 added the turbo-build Uses a faster path to images label Mar 11, 2025

vikin91 force-pushed the piotr/ROX-28259-the-sensor-fix branch 2 times, most recently from 85380f7 to 353c7a9 Compare March 13, 2025 09:17

vikin91 mentioned this pull request Mar 25, 2025

ROX-28479: Add external IPs to fake workload generation #14574

Merged

5 tasks

vikin91 force-pushed the piotr/ROX-28259-the-sensor-fix branch 3 times, most recently from 063753c to 6dbdc5c Compare April 1, 2025 20:08

This was referenced Apr 4, 2025

Merge PLOP enrichment with endpoint enrichment #14880

Closed

ROX-28259: Add Grafana dashboard for endpoints enrichment in Sensor #14899

Closed

vikin91 force-pushed the piotr/ROX-28259-the-sensor-fix branch from 6ab33a6 to 633ea1a Compare April 7, 2025 13:00

vikin91 mentioned this pull request Apr 7, 2025

ROX-28859: Upgrade Grafana to v10 #14901

Merged

10 tasks

vikin91 changed the base branch from master to piotr/upgrade-grafana April 7, 2025 13:22

vikin91 force-pushed the piotr/ROX-28259-the-sensor-fix branch from 633ea1a to cbc809c Compare April 7, 2025 13:22

vikin91 mentioned this pull request Apr 7, 2025

feat(fake-workload): Add bool option for generating unclosed endpoints #14904

Merged

10 tasks

vikin91 force-pushed the piotr/upgrade-grafana branch from bfec36a to e7b173b Compare April 7, 2025 15:22

vikin91 force-pushed the piotr/ROX-28259-the-sensor-fix branch from 9aba1ec to 2f5c078 Compare April 7, 2025 15:22

Base automatically changed from piotr/upgrade-grafana to master April 8, 2025 09:08

lvalerom reviewed Apr 8, 2025

View reviewed changes

vikin91 force-pushed the piotr/ROX-28259-the-sensor-fix branch 2 times, most recently from 27bd009 to 7e8149c Compare April 9, 2025 10:20

vikin91 marked this pull request as ready for review April 9, 2025 15:00

vikin91 requested a review from a team as a code owner April 9, 2025 15:00

lvalerom reviewed Apr 22, 2025

View reviewed changes

sensor/common/networkflow/manager/manager_impl.go Outdated Show resolved Hide resolved

sensor/common/networkflow/manager/purger.go Show resolved Hide resolved

vikin91 force-pushed the piotr/ROX-28259-the-sensor-fix branch from cdc7141 to b91a26a Compare April 22, 2025 14:06

Reenable history and add purger to enrichment in Sensor

6324301

vikin91 force-pushed the piotr/ROX-28259-the-sensor-fix branch from b91a26a to 6324301 Compare April 22, 2025 14:10

rhybrillou reviewed Apr 23, 2025

View reviewed changes

sensor/common/networkflow/manager/manager_impl.go Show resolved Hide resolved

sensor/common/networkflow/manager/manager_impl.go Show resolved Hide resolved

vikin91 added 3 commits April 23, 2025 12:45

Enrsure that purger starts when using a ticker

6c22d10

Refactor: Move cutOff calculation out of the for loop

efa57d2

Minor: Update comments

e6e7f52

lvalerom reviewed Apr 24, 2025

View reviewed changes

vikin91 mentioned this pull request Apr 24, 2025

ROX-29060: Refactor network flow enrichment #15076

Merged

12 tasks

vikin91 added 4 commits April 24, 2025 16:51

Refactor: Unify ordering of lines in purger

8d3b7d6

Validate and provide fallback for EnrichmentPurgerTickerMaxAge

fdbfa3c

Add WithManager, make WithPurgerTicker test-only option

875fd26

Don't start purger if misconfigured

1bf977c

rhybrillou reviewed Apr 25, 2025

View reviewed changes

sensor/common/networkflow/manager/manager_impl.go Show resolved Hide resolved

Merge branch 'master' into piotr/ROX-28259-the-sensor-fix

2c27b6f

lvalerom approved these changes Apr 28, 2025

View reviewed changes

mtodor approved these changes Apr 29, 2025

View reviewed changes

vikin91 merged commit 253c521 into master Apr 30, 2025
86 of 89 checks passed

vikin91 deleted the piotr/ROX-28259-the-sensor-fix branch April 30, 2025 06:24

vikin91 added a commit that referenced this pull request Apr 30, 2025

ROX-28259: Reenable entities store history & add enrichment purger (#…

06d52cd

…14538)

This was referenced Apr 30, 2025

ROX-28259: Refactor and test connection enrichment in Sensor #14464

Closed

ROX-24313: Disable test "Verify two flows co-exist if ..." #15127

Merged

janisz added a commit that referenced this pull request May 13, 2025

Revert "ROX-28259: Reenable entities store history & add enrichment p…

6746501

…urger (#14538)" This reverts commit 253c521.

Conversation

vikin91 commented Mar 7, 2025 • edited by sourcery-ai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

How this part of the code works

Adding the purger

User-facing documentation

Testing and quality

Automated testing

How I validated my change

BEFORE: Applying purger only to activeConnections & activeEndpoints

AFTER: Purger cleans activeConnections, activeEndpoints, and connectionsByHost

BEFORE: Typical CPU & memory consumption for fake workloads long-running cluster

AFTER: CPU & memory consumption for fake workloads long-running cluster (with purger)

Summary by Sourcery

Uh oh!

vikin91 commented Mar 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

openshift-ci bot commented Mar 7, 2025

Uh oh!

rhacs-bot commented Mar 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Mar 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

lvalerom left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

vikin91 commented Apr 23, 2025

Uh oh!

Uh oh!

Uh oh!

vikin91 commented Apr 23, 2025

Uh oh!

vikin91 commented Apr 23, 2025

Uh oh!

vikin91 commented Apr 24, 2025

Uh oh!

lvalerom left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

openshift-ci bot commented Apr 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lvalerom left a comment

Choose a reason for hiding this comment

Uh oh!

mtodor left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

vikin91 commented Mar 7, 2025 •

edited by sourcery-ai bot

Loading

vikin91 commented Mar 7, 2025 •

edited

Loading

rhacs-bot commented Mar 7, 2025 •

edited

Loading

codecov bot commented Mar 7, 2025 •

edited

Loading

openshift-ci bot commented Apr 25, 2025 •

edited

Loading