
ROX-30390: Add TransitionBased update computer#16522

Merged
vikin91 merged 16 commits into master from piotr/ROX-30390-categorized
Sep 11, 2025

Conversation

Contributor

@vikin91 vikin91 commented Aug 25, 2025

🐑 Shepherd of this stack is @lvalerom and @rhybrillou

Suggestion on how to review

  1. The "meat" of this PR is transition_based.go (~500 lines).
  2. Most lines are unit tests and comments.
  3. I moved metric NumUpdated to the updatecomputer pkg and added a gauge variant to it.
  4. The rest is plugging the new updateComputer into the tests and the production code (boilerplate).

Description

In this description, whenever I write about enriched entity (EE), I mean a connection, endpoint, or process.

This PR adds a new update computer called "TransitionBased". This implementation is optimized for calculating updates based on state transitions of enriched entities.

What problem does it solve?

The new update computer is meant to simplify calculating the updates that are to be sent to Central. This is done by:

  1. Not relying on the updates sent in the previous tick. This allows Sensor to drop some of the enriched entities (EE) without triggering an update to Central, giving Sensor the power to decide whether Central should know about something or not.
  2. Decreasing memory pressure by storing less data in memory. Currently, the EEs being enriched are stored in three data structures that may grow to many hundreds of millions of entries in one day (for particular workloads - e.g., the fake workload used in release testing). Delegating one of those three data structures to the update computer allows us to control it better and test various implementations.
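The first point can be illustrated with a minimal sketch (all names here - updateComputer, compute, state - are illustrative and do not mirror the actual stackrox code): updates are derived only from state transitions observed in the current tick, with no copy of the previously sent batch.

```go
package main

import "fmt"

type state uint8

const (
	stateUnseen state = iota // never reported to Central
	stateOpen                // reported as open
	stateClosed              // reported as closed
)

// updateComputer remembers only the last state sent per entity key,
// instead of keeping the full batch of updates sent in the previous tick.
type updateComputer struct {
	lastSent map[string]state
}

// compute returns the keys whose state transition requires an update.
func (u *updateComputer) compute(current map[string]state) []string {
	var updates []string
	for key, st := range current {
		if u.lastSent[key] != st {
			updates = append(updates, key)
			u.lastSent[key] = st
		}
	}
	return updates
}

func main() {
	u := &updateComputer{lastSent: map[string]state{}}
	fmt.Println(u.compute(map[string]state{"conn-1": stateOpen}))   // unseen -> open: update
	fmt.Println(u.compute(map[string]state{"conn-1": stateOpen}))   // no transition: no update
	fmt.Println(u.compute(map[string]state{"conn-1": stateClosed})) // open -> closed: update
}
```

Because the computer only reacts to transitions it actually sees, dropping an entity from the `current` map simply means no further updates for it - which is exactly the property the description relies on.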

Benefits of TransitionBased update computer

  1. Reduction of memory pressure by 50%* when using the CPU-optimized keyString() functions for the respective EE indicators. Using this function also reduces CPU usage, so a single enrichment is faster compared to the legacy implementation.
  2. Reduction of memory pressure by 75%* when using the memory-optimized keyHash() functions for the respective EE indicators.
  3. Reduction of the number of redundant updates to Central by roughly 20% (difficult to measure precisely) for the fake workloads.

*) The memory saving was measured by comparing the byte-size of transitionBased.deduper (3 instances - one for each EE) plus transitionBased.closedConnTimestamps against legacy.enrichedConnsLastSentState + legacy.enrichedEndpointsLastSentState + legacy.enrichedProcessesLastSentState.

Disadvantages of TransitionBased update computer

  1. Increased CPU usage when using the memory-optimized keyHash() functions for the respective EE indicators. This is difficult to estimate; based on the CPU profiles, transitionBased.ComputeUpdatedConns may take about 20% longer to execute than legacy.ComputeUpdatedConns.
    • Option: Allow customers to select whether they optimize for CPU with the keyString() function (good for small clusters) or for memory with the keyHash() function (good for large clusters where Sensor is getting OOM-killed).
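The CPU-vs-memory tradeoff between the two key functions can be sketched like this (the endpointIndicator struct and its fields are hypothetical - the real indicators carry more data - but the idea matches the description: a readable string key is cheap to build but large to store, while a fixed 8-byte hash keeps the deduper map small at the cost of hashing CPU):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strconv"
)

// endpointIndicator is a stand-in for a real EE indicator.
type endpointIndicator struct {
	containerID string
	port        uint16
}

// keyString builds a readable key; fast to compute, but the deduper map
// stores the whole string, so memory grows with the key length.
func (e endpointIndicator) keyString() string {
	return e.containerID + ":" + strconv.Itoa(int(e.port))
}

// keyHash collapses the indicator into 8 bytes; the map stays small,
// at the cost of extra hashing CPU (and a theoretical collision risk).
func (e endpointIndicator) keyHash() uint64 {
	h := fnv.New64a()
	h.Write([]byte(e.containerID))
	h.Write([]byte{byte(e.port >> 8), byte(e.port)})
	return h.Sum64()
}

func main() {
	e := endpointIndicator{containerID: "abc123", port: 8080}
	fmt.Println(e.keyString()) // prints "abc123:8080"
	fmt.Println(e.keyHash())
}
```

This is why the option above makes sense as a per-cluster knob: small clusters rarely feel the string keys' memory cost, while large clusters trade some CPU for a much smaller deduper.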

Changes in behavior

The transition-based update computer offers many performance improvements, but there are some consequences.
The most important parts were discussed with the Collector team (internal Slack link to the discussion).

1️⃣ Deleted open EEs no longer trigger an update to Central

In the legacy implementation, an update to Central was generated when an EE was opened and then disappeared (without Collector sending the closed timestamp for that EE). In that case, Sensor would assume that the EE was closed at the moment it disappeared from memory.

In the transition-based implementation, we do not send an update to Central when an open EE disappears from Sensor memory (e.g., deleted by purger, or some future algorithm that would manipulate the enrichment queue).

(There are unit tests documenting that behavior.)
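A unit-test-style sketch of this behavior change (computeUpdates and the boolean open/closed state are illustrative, not the real test code): since the transition-based computer only diffs entities present in the current tick, an open entity that silently vanishes produces no synthesized close update.

```go
package main

import "fmt"

// computeUpdates diffs only entities present in the current tick, so an
// open entity that vanished from memory yields no update to Central.
func computeUpdates(lastSent, current map[string]bool) []string {
	var updates []string
	for key, open := range current {
		if lastSent[key] != open {
			updates = append(updates, key)
		}
	}
	return updates
}

func main() {
	lastSent := map[string]bool{"ep-1": true} // ep-1 was reported open
	current := map[string]bool{}              // ep-1 was dropped (e.g., by the purger)
	fmt.Println(len(computeUpdates(lastSent, current))) // prints 0: no close is synthesized
}
```

In contrast, the legacy behavior corresponds to additionally walking `lastSent` and emitting a close for every key missing from `current`.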

2️⃣ No more filtering of closing messages if multiple of them arrive

(This change was decided by the author and is not a limitation of the transition-based implementation. It applies to endpoints and processes, but not to connections, as this behavior seems to play a slightly more important role for connections than for the other EEs.)

If Collector emits multiple close messages with different timestamps for the same EE, the legacy implementation would send an update to Central only if the current close timestamp was later than the previous one. In other words, only updates telling Central that a given EE lived longer than previously reported were sent.
The goal was to limit the number of updates to Central in case multiple close updates happen.

In the transition-based implementation, this behavior is preserved for connections but dropped for endpoints and processes.

  1. Why keep it for connections? Features that rely on the timestamps and the connection information (e.g., the baselines) may benefit from the timestamps not jumping back in time.
  2. Why drop that behavior? Keeping it requires an additional map - e.g., closedConnTimestamps - that costs memory. The memory cost is negligible, but the complexity in the code remains.

📝 I am open to reviewer suggestions on whether to: (1) keep it the way it is, (2) implement the past behavior for all EEs, or (3) drop this behavior for all EEs (i.e., remove it for connections).

3️⃣ Enables the biggest memory-win change

See #16532, where we change the rule for removing the already enriched EEs from the queue. Before that change, we kept the open EEs in hostConns for as long as they were open. In #16532, we change hostConns to be a real enrichment queue, so that it holds only those items that have not been enriched yet or whose enrichment must be retried.
That allows us to remove a large number of items from the enrichment queue (memory saving) and also saves a significant amount of CPU, since all items in the queue are processed on each tick.

The memory and CPU savings are a bit tricky to estimate using metrics (we cannot separate the memory savings and CPU increase introduced in this PR from those introduced in #16532), but I will try to provide a rough estimate.

User-facing documentation

  • CHANGELOG.md is updated OR update is not needed
  • documentation PR is created and is linked above OR is not needed
    • This PR description can serve as internal documentation of that change

Testing and quality

  • the change is production ready: the change is GA, or otherwise the functionality is gated by a feature flag
  • CI results are inspected

Automated testing

  • added unit tests
  • added e2e tests
  • added regression tests
  • added compatibility tests
  • modified existing tests

How I validated my change

  • Extensive unit test coverage
  • Existing integration tests are passing
  • Extensive benchmarking of crucial functions where memory and CPU consumption play a role (shall I commit the benchmarks as well?)
  • CPU and memory profiling for comparing the behavior of the keyString() and keyHash() methods.
  • Multiple measurements using metrics and Grafana dashboard for the estimation of memory consumption during fake workloads

Memory saved by using the new updateComputer

The graph shows the number of bytes consumed by the legacy and transition-based update computers (generated from #16238). The cluster was running fake-workloads over a weekend.

The new update computer uses only 25% of the memory for endpoints when running under fake workloads (the workload is endpoint-heavy).

Screenshot 2025-09-01 at 17 33 05

Note that the memory usage is higher for processes - visible in the legend. This is something specific to the fake workloads and does not happen in production. The effect is caused by generating the load using random processes attached to a limited pool of endpoints. To Sensor, it looks like the same endpoint starts multiple different processes without closing the previous ones. The deduper sees that as a new process indicator running on the same endpoint and skips deduping.

In reality, this happens rarely, as we usually stop a process before starting a new one. However, one cannot exclude that a malicious actor may start a benign process listening on a port and later attach a malicious process to the same port without closing the previous process. This forces us to use the process data in the deduper and thus pay the price of more memory - see the discussion with the Collector team in internal Slack.

The way to go is to change the fake workload generator to not reuse existing endpoints and to generate unique endpoints for unique processes. This is being addressed in the follow-ups #16629 (prefactor) and #16628 (change how processes are generated).

Number of updates skipped (not sent to Central)

The secondary goal of the enricher is to filter out unnecessary updates to Central. In the graph, we can see that over 7 billion updates were not sent over 3 days - note skip in the legend. The workload was heavy on endpoints and processesListening and very light on connections.
This data is only available for the transition-based update computer.
I will try to generate another graph comparing the number of updates sent to Central for both computers.
Screenshot 2025-09-01 at 17 44 29

CPU and memory consumption

Note that in this experiment both update computers were running in parallel - the graph shows the memory consumption of both the legacy and the transition-based computer.
Sensor was assigned a 16 GB memory limit and didn't go OOM over 3 days.

Screenshot 2025-09-01 at 17 48 23


openshift-ci bot commented Aug 25, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all


Contributor

rhacs-bot commented Aug 25, 2025

Images are ready for the commit at c972ee5.

To use with deploy scripts, first export MAIN_IMAGE_TAG=4.9.x-640-gc972ee5e67.


codecov bot commented Aug 25, 2025

Codecov Report

❌ Patch coverage is 81.87135% with 62 lines in your changes missing coverage. Please review.
✅ Project coverage is 48.80%. Comparing base (ab86e03) to head (15ce78d).
⚠️ Report is 17 commits behind head on master.

Files with missing lines Patch % Lines
...mon/networkflow/updatecomputer/transition_based.go 81.67% 58 Missing and 1 partial ⚠️
sensor/common/networkflow/manager/manager_impl.go 84.61% 2 Missing ⚠️
sensor/common/networkflow/updatecomputer/legacy.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #16522      +/-   ##
==========================================
+ Coverage   48.67%   48.80%   +0.13%     
==========================================
  Files        2676     2679       +3     
  Lines      199819   200331     +512     
==========================================
+ Hits        97264    97776     +512     
- Misses      94944    94946       +2     
+ Partials     7611     7609       -2     
Flag Coverage Δ
go-unit-tests 48.80% <81.87%> (+0.13%) ⬆️


@vikin91 vikin91 force-pushed the piotr/ROX-30390-update-computer branch from 40a11cd to 3a951ff Compare August 25, 2025 13:04
@vikin91 vikin91 force-pushed the piotr/ROX-30390-categorized branch from c2fddcf to 4908166 Compare August 25, 2025 13:04
@vikin91 vikin91 added ci-all-qa-tests Tells CI to run all API tests (not just BAT). ci-race-tests Uses a `-race` build for all e2e tests labels Aug 26, 2025
@vikin91 vikin91 force-pushed the piotr/ROX-30390-categorized branch 4 times, most recently from 3fd11bf to 67d9d64 Compare August 29, 2025 14:42
@vikin91 vikin91 force-pushed the piotr/ROX-30390-update-computer branch from b136877 to 9f663c4 Compare August 29, 2025 16:18
@vikin91 vikin91 force-pushed the piotr/ROX-30390-categorized branch from 3b07945 to 4dfb3c0 Compare August 29, 2025 16:19
Base automatically changed from piotr/ROX-30390-update-computer to master September 1, 2025 08:53
@vikin91 vikin91 force-pushed the piotr/ROX-30390-categorized branch from 4dfb3c0 to fd59e5a Compare September 1, 2025 08:54
@vikin91 vikin91 marked this pull request as ready for review September 1, 2025 16:21
@vikin91 vikin91 requested a review from a team as a code owner September 1, 2025 16:21

@sourcery-ai sourcery-ai bot left a comment


Hey there - I've reviewed your changes - here's some feedback:

  • The new Categorized update computer is quite large; consider splitting it into smaller, focused modules (e.g., deduper management, categorization logic, and metrics) to improve readability and maintainability.
  • There’s substantial duplication in transition categorization (categorizeUpdate vs categorizeUpdateNoPast) and in keyString/keyHash implementations—extract shared logic into reusable helpers to reduce repetition.
  • The name ‘Categorized’ and the use of raw strings for metrics labels can be confusing; consider choosing a more descriptive name for the update computer and using constants for label values to improve clarity and avoid typos.


@vikin91 vikin91 added the auto-retest PRs with this label will be automatically retested if prow checks fails label Sep 2, 2025
@vikin91 vikin91 force-pushed the piotr/ROX-30390-categorized branch from 48cb81b to 526b33f Compare September 10, 2025 07:41
Contributor Author

vikin91 commented Sep 10, 2025

It looks like a plop e2e test is failing in ocp-4-12-qa-e2e-tests and I saw there an OOM kill. I am investigating this.

Edit: I confirmed that the entire e2e test ./gradlew test --tests="ProcessesListeningOnPortsTest" is passing when I run this locally against a remote cluster. The failure must have been a flake caused by Sensor crash. Let's see if CI can reproduce it.

Contributor

rhacs-bot commented Sep 10, 2025

Images are ready for the commit at c435711.

To use with deploy scripts, first export MAIN_IMAGE_TAG=4.9.x-737-gc4357118ba.

Contributor

rhacs-bot commented Sep 10, 2025

Images are ready for the commit at 15ce78d.

To use with deploy scripts, first export MAIN_IMAGE_TAG=4.9.x-738-g15ce78d821.

Contributor

@lvalerom lvalerom left a comment


Looks really good. I added a comment regarding having a way to roll back to legacy mode, just in case.

Contributor Author

vikin91 commented Sep 10, 2025

/retest

Contributor Author

vikin91 commented Sep 11, 2025

The two recent failures of ocp-4-12-qa-e2e-tests were unrelated. Looks like no more OOM kills. Also the test ProcessesListeningOnPortsTest is passing.

@vikin91 vikin91 enabled auto-merge (squash) September 11, 2025 07:32

openshift-ci bot commented Sep 11, 2025

@vikin91: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/ocp-4-12-qa-e2e-tests 15ce78d link false /test ocp-4-12-qa-e2e-tests


@rhacs-bot
Contributor

/retest

Contributor Author

vikin91 commented Sep 11, 2025

Unfortunately the run https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/stackrox_stackrox/16522/pull-ci-stackrox-stackrox-master-ocp-4-12-qa-e2e-tests/1966038639137787904 had OOM kill for Sensor.

It may be a better idea to merge this together with #16532 - wdyt @lvalerom ?

Oops, and it got auto-merged 😅 I will investigate this OOM, as it makes no sense to me that 8GB of memory would be consumed so fast.

@vikin91 vikin91 merged commit 9678d9b into master Sep 11, 2025
140 of 174 checks passed
@vikin91 vikin91 deleted the piotr/ROX-30390-categorized branch September 11, 2025 09:32

Labels

ai-review area/sensor auto-retest PRs with this label will be automatically retested if prow checks fails ci-all-qa-tests Tells CI to run all API tests (not just BAT). ci-race-tests Uses a `-race` build for all e2e tests
