
ROX-30390: Add TransitionBased update computer#16522

Merged
vikin91 merged 16 commits into master from piotr/ROX-30390-categorized
Sep 11, 2025

Conversation

Contributor

@vikin91 vikin91 commented Aug 25, 2025

🐑 Shepherd of this stack is @lvalerom and @rhybrillou

Suggestion on how to review

  1. The "meat" of this PR is transition_based.go (~500 lines).
  2. Most lines are unit tests and comments.
  3. I moved metric NumUpdated to the updatecomputer pkg and added a gauge variant to it.
  4. The rest is plugging the new updateComputer into the tests and the production code (boilerplate).

Description

In this description, whenever I write about enriched entity (EE), I mean a connection, endpoint, or process.

This PR adds a new update computer called "TransitionBased". This implementation is optimized for calculating updates based on state transitions of enriched entities.

What problem does it solve?

The new update computer is meant to simplify calculating the updates that are to be sent to Central. This is done by:

  1. Not relying on the updates sent in the previous tick. This allows Sensor to drop some of the enriched entities (EE) without triggering an update to Central, giving Sensor the power to decide whether Central should know about something or not.
  2. Decreasing memory pressure by storing less data in memory. Currently, the EEs being enriched are stored in three data structures that may grow to many hundreds of millions of entries in one day (for particular workloads - e.g., the fake workload used in release testing). Delegating one of those three data structures to the update computer allows us to control it better and test various implementations.
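The first point can be illustrated with a minimal sketch (all names here - updateComputer, compute, state - are illustrative and do not mirror the actual stackrox code): updates are derived only from state transitions observed in the current tick, with no copy of the previously sent batch.

```go
package main

import "fmt"

type state uint8

const (
	stateUnseen state = iota // never reported to Central
	stateOpen                // reported as open
	stateClosed              // reported as closed
)

// updateComputer remembers only the last state sent per entity key,
// instead of keeping the full batch of updates sent in the previous tick.
type updateComputer struct {
	lastSent map[string]state
}

// compute returns the keys whose state transition requires an update.
func (u *updateComputer) compute(current map[string]state) []string {
	var updates []string
	for key, st := range current {
		if u.lastSent[key] != st {
			updates = append(updates, key)
			u.lastSent[key] = st
		}
	}
	return updates
}

func main() {
	u := &updateComputer{lastSent: map[string]state{}}
	fmt.Println(u.compute(map[string]state{"conn-1": stateOpen}))   // unseen -> open: update
	fmt.Println(u.compute(map[string]state{"conn-1": stateOpen}))   // no transition: no update
	fmt.Println(u.compute(map[string]state{"conn-1": stateClosed})) // open -> closed: update
}
```

Because the computer only reacts to transitions it actually sees, dropping an entity from the `current` map simply means no further updates for it - which is exactly the property the description relies on.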

Benefits of TransitionBased update computer

  1. Reduction of memory pressure by 50%* when using the CPU-optimized keyString() functions for the respective EE indicators. Using this function also reduces CPU usage, so a single enrichment is faster compared to the legacy implementation.
  2. Reduction of memory pressure by 75%* when using the memory-optimized keyHash() functions for the respective EE indicators.
  3. Reduction of the number of redundant updates to Central by roughly 20% (difficult to measure precisely) for the fake workloads.

*) The memory saving was measured by comparing the byte-size of transitionBased.deduper (3 instances - one for each EE) plus transitionBased.closedConnTimestamps against legacy.enrichedConnsLastSentState + legacy.enrichedEndpointsLastSentState + legacy.enrichedProcessesLastSentState.

Disadvantages of TransitionBased update computer

  1. Increased CPU usage when using the memory-optimized keyHash() functions for the respective EE indicators. This is difficult to estimate; based on the CPU profiles, transitionBased.ComputeUpdatedConns may take about 20% longer to execute than legacy.ComputeUpdatedConns.
    • Option: Allow customers to select whether they optimize for CPU with the keyString() function (good for small clusters) or for memory with the keyHash() function (good for large clusters where Sensor is getting OOM-killed).
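The CPU-vs-memory tradeoff between the two key functions can be sketched like this (the endpointIndicator struct and its fields are hypothetical - the real indicators carry more data - but the idea matches the description: a readable string key is cheap to build but large to store, while a fixed 8-byte hash keeps the deduper map small at the cost of hashing CPU):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strconv"
)

// endpointIndicator is a stand-in for a real EE indicator.
type endpointIndicator struct {
	containerID string
	port        uint16
}

// keyString builds a readable key; fast to compute, but the deduper map
// stores the whole string, so memory grows with the key length.
func (e endpointIndicator) keyString() string {
	return e.containerID + ":" + strconv.Itoa(int(e.port))
}

// keyHash collapses the indicator into 8 bytes; the map stays small,
// at the cost of extra hashing CPU (and a theoretical collision risk).
func (e endpointIndicator) keyHash() uint64 {
	h := fnv.New64a()
	h.Write([]byte(e.containerID))
	h.Write([]byte{byte(e.port >> 8), byte(e.port)})
	return h.Sum64()
}

func main() {
	e := endpointIndicator{containerID: "abc123", port: 8080}
	fmt.Println(e.keyString()) // prints "abc123:8080"
	fmt.Println(e.keyHash())
}
```

This is why the option above makes sense as a per-cluster knob: small clusters rarely feel the string keys' memory cost, while large clusters trade some CPU for a much smaller deduper.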

Changes in behavior

The transition-based update computer offers many performance improvements, but there are some consequences.
The most important parts were discussed with the Collector team (internal Slack link to the discussion).

1️⃣ Deleted open EEs no longer trigger an update to Central

In the legacy implementation, an update to Central was generated when an EE was opened and then disappeared (without Collector sending the closed timestamp for that EE). In that case, Sensor would assume that the EE was closed at the moment it disappeared from memory.

In the transition-based implementation, we do not send an update to Central when an open EE disappears from Sensor memory (e.g., deleted by purger, or some future algorithm that would manipulate the enrichment queue).

(There are unit tests documenting that behavior.)
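A unit-test-style sketch of this behavior change (computeUpdates and the boolean open/closed state are illustrative, not the real test code): since the transition-based computer only diffs entities present in the current tick, an open entity that silently vanishes produces no synthesized close update.

```go
package main

import "fmt"

// computeUpdates diffs only entities present in the current tick, so an
// open entity that vanished from memory yields no update to Central.
func computeUpdates(lastSent, current map[string]bool) []string {
	var updates []string
	for key, open := range current {
		if lastSent[key] != open {
			updates = append(updates, key)
		}
	}
	return updates
}

func main() {
	lastSent := map[string]bool{"ep-1": true} // ep-1 was reported open
	current := map[string]bool{}              // ep-1 was dropped (e.g., by the purger)
	fmt.Println(len(computeUpdates(lastSent, current))) // prints 0: no close is synthesized
}
```

In contrast, the legacy behavior corresponds to additionally walking `lastSent` and emitting a close for every key missing from `current`.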

2️⃣ No more filtering of closing messages if multiple of them arrive

(This change was decided by the author and is not a limitation of the transition-based implementation. It applies to endpoints and processes, but not to connections, as this behavior seems to play a slightly more important role for connections than for the other EEs.)

If Collector emits multiple close messages with different timestamps for the same EE, the legacy implementation would send an update to Central only if the current close timestamp was later than the previous one. In other words, only updates telling Central that a given EE lived longer than previously reported were sent.
The goal was to limit the number of updates to Central in case multiple close updates happen.

In the transition-based implementation, this behavior is preserved for connections but dropped for endpoints and processes.

  1. Why keep it for connections? Features that rely on the timestamps and the connection information (e.g., the baselines) may benefit from the timestamps not jumping back in time.
  2. Why drop that behavior? Keeping it requires an additional map - e.g., closedConnTimestamps - that costs memory. The memory cost is negligible, but the complexity in the code remains.

📝 I am open to reviewer suggestions on whether to: (1) keep it the way it is, (2) implement the past behavior for all EEs, or (3) drop this behavior for all EEs (i.e., remove it for connections).

3️⃣ Enables the biggest memory-win change

See #16532, where we change the rule for removing the already enriched EEs from the queue. Before that change, we kept the open EEs in hostConns for as long as they were open. In #16532, we change hostConns to be a real enrichment queue, so that it holds only those items that have not been enriched yet or whose enrichment must be retried.
That allows us to remove a large number of items from the enrichment queue (memory saving) and also saves a significant amount of CPU, since all items in the queue are processed on each tick.

The memory and CPU savings are a bit tricky to estimate using metrics (we cannot separate the memory savings and CPU increase introduced in this PR from those introduced in #16532), but I will try to provide a rough estimate.

User-facing documentation

  • CHANGELOG.md is updated OR update is not needed
  • documentation PR is created and is linked above OR is not needed
    • This PR description can serve as internal documentation of that change

Testing and quality

  • the change is production ready: the change is GA, or otherwise the functionality is gated by a feature flag
  • CI results are inspected

Automated testing

  • added unit tests
  • added e2e tests
  • added regression tests
  • added compatibility tests
  • modified existing tests

How I validated my change

  • Extensive unit test coverage
  • Existing integration tests are passing
  • Extensive benchmarking of crucial functions where memory and CPU consumption play a role (shall I commit the benchmarks as well?)
  • CPU and memory profiling for comparing the behavior of the keyString() and keyHash() methods.
  • Multiple measurements using metrics and Grafana dashboard for the estimation of memory consumption during fake workloads

Memory saved by using the new updateComputer

The graph shows the number of bytes consumed by the legacy and transition-based update computers (generated from #16238). The cluster was running fake-workloads over a weekend.

The new update computer uses only 25% of the memory for endpoints when running under fake workloads (the workload is endpoint-heavy).

Screenshot 2025-09-01 at 17 33 05

Note that the memory usage is higher for processes - visible in the legend. This is something specific to the fake workloads and does not happen in production. The effect is caused by generating the load using random processes attached to a limited pool of endpoints. To Sensor, it looks like the same endpoint starts multiple different processes without closing the previous ones. The deduper sees that as a new process indicator running on the same endpoint and skips deduping.

In reality, this happens rarely, as we usually stop a process before starting a new one. However, one cannot exclude that a malicious actor may start a benign process listening on a port and later attach a malicious process to the same port without closing the previous process. This forces us to use the process data in the deduper and thus pay the price of more memory - see the discussion with the Collector team in internal Slack.

The way to go is to change the fake workload generator to not reuse existing endpoints and to generate unique endpoints for unique processes. This is being addressed in the follow-ups #16629 (prefactor) and #16628 (change how processes are generated).

Number of updates skipped (not sent to Central)

The secondary goal of the enricher is to filter out unnecessary updates to Central. In the graph, we can see that over 7 billion updates were not sent over 3 days - note skip in the legend. The workload was heavy on endpoints and processesListening and very light on connections.
This data is only available for the transition-based update computer.
I will try to generate another graph comparing the number of updates sent to Central for both computers.
Screenshot 2025-09-01 at 17 44 29

CPU and memory consumption

Note that in this experiment both update computers were running in parallel - the graph shows the memory consumption of both the legacy and the transition-based computer.
Sensor was assigned a 16 GB memory limit and didn't go OOM over 3 days.

Screenshot 2025-09-01 at 17 48 23


openshift-ci bot commented Aug 25, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all


Contributor

rhacs-bot commented Aug 25, 2025

Images are ready for the commit at c972ee5.

To use with deploy scripts, first export MAIN_IMAGE_TAG=4.9.x-640-gc972ee5e67.


codecov bot commented Aug 25, 2025

Codecov Report

❌ Patch coverage is 81.87135% with 62 lines in your changes missing coverage. Please review.
✅ Project coverage is 48.80%. Comparing base (ab86e03) to head (15ce78d).
⚠️ Report is 17 commits behind head on master.

Files with missing lines Patch % Lines
...mon/networkflow/updatecomputer/transition_based.go 81.67% 58 Missing and 1 partial ⚠️
sensor/common/networkflow/manager/manager_impl.go 84.61% 2 Missing ⚠️
sensor/common/networkflow/updatecomputer/legacy.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #16522      +/-   ##
==========================================
+ Coverage   48.67%   48.80%   +0.13%     
==========================================
  Files        2676     2679       +3     
  Lines      199819   200331     +512     
==========================================
+ Hits        97264    97776     +512     
- Misses      94944    94946       +2     
+ Partials     7611     7609       -2     
Flag Coverage Δ
go-unit-tests 48.80% <81.87%> (+0.13%) ⬆️


@vikin91 vikin91 force-pushed the piotr/ROX-30390-update-computer branch from 40a11cd to 3a951ff Compare August 25, 2025 13:04
@vikin91 vikin91 force-pushed the piotr/ROX-30390-categorized branch from c2fddcf to 4908166 Compare August 25, 2025 13:04
@vikin91 vikin91 added ci-all-qa-tests Tells CI to run all API tests (not just BAT). ci-race-tests Uses a `-race` build for all e2e tests labels Aug 26, 2025
@vikin91 vikin91 force-pushed the piotr/ROX-30390-categorized branch 4 times, most recently from 3fd11bf to 67d9d64 Compare August 29, 2025 14:42
@vikin91 vikin91 force-pushed the piotr/ROX-30390-update-computer branch from b136877 to 9f663c4 Compare August 29, 2025 16:18
@vikin91 vikin91 force-pushed the piotr/ROX-30390-categorized branch from 3b07945 to 4dfb3c0 Compare August 29, 2025 16:19
Base automatically changed from piotr/ROX-30390-update-computer to master September 1, 2025 08:53
@vikin91 vikin91 force-pushed the piotr/ROX-30390-categorized branch from 4dfb3c0 to fd59e5a Compare September 1, 2025 08:54
@vikin91 vikin91 marked this pull request as ready for review September 1, 2025 16:21
@vikin91 vikin91 requested a review from a team as a code owner September 1, 2025 16:21

@sourcery-ai sourcery-ai bot left a comment


Hey there - I've reviewed your changes - here's some feedback:

  • The new Categorized update computer is quite large; consider splitting it into smaller, focused modules (e.g., deduper management, categorization logic, and metrics) to improve readability and maintainability.
  • There’s substantial duplication in transition categorization (categorizeUpdate vs categorizeUpdateNoPast) and in keyString/keyHash implementations—extract shared logic into reusable helpers to reduce repetition.
  • The name ‘Categorized’ and the use of raw strings for metrics labels can be confusing; consider choosing a more descriptive name for the update computer and using constants for label values to improve clarity and avoid typos.


@vikin91 vikin91 added the auto-retest PRs with this label will be automatically retested if prow checks fails label Sep 2, 2025
@vikin91 vikin91 force-pushed the piotr/ROX-30390-categorized branch from 48cb81b to 526b33f Compare September 10, 2025 07:41
Contributor Author

vikin91 commented Sep 10, 2025

It looks like a plop e2e test is failing in ocp-4-12-qa-e2e-tests and I saw there an OOM kill. I am investigating this.

Edit: I confirmed that the entire e2e test ./gradlew test --tests="ProcessesListeningOnPortsTest" is passing when I run this locally against a remote cluster. The failure must have been a flake caused by Sensor crash. Let's see if CI can reproduce it.

Contributor

rhacs-bot commented Sep 10, 2025

Images are ready for the commit at c435711.

To use with deploy scripts, first export MAIN_IMAGE_TAG=4.9.x-737-gc4357118ba.

Contributor

rhacs-bot commented Sep 10, 2025

Images are ready for the commit at 15ce78d.

To use with deploy scripts, first export MAIN_IMAGE_TAG=4.9.x-738-g15ce78d821.

Contributor

@lvalerom lvalerom left a comment


Looks really good. I added a comment regarding having a way to roll back to legacy mode, just in case.

Contributor Author

vikin91 commented Sep 10, 2025

/retest

Contributor Author

vikin91 commented Sep 11, 2025

The two recent failures of ocp-4-12-qa-e2e-tests were unrelated. Looks like no more OOM kills. Also the test ProcessesListeningOnPortsTest is passing.

@vikin91 vikin91 enabled auto-merge (squash) September 11, 2025 07:32

openshift-ci bot commented Sep 11, 2025

@vikin91: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/ocp-4-12-qa-e2e-tests 15ce78d link false /test ocp-4-12-qa-e2e-tests


@rhacs-bot
Contributor

/retest

Contributor Author

vikin91 commented Sep 11, 2025

Unfortunately the run https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/stackrox_stackrox/16522/pull-ci-stackrox-stackrox-master-ocp-4-12-qa-e2e-tests/1966038639137787904 had OOM kill for Sensor.

It may be a better idea to merge this together with #16532 - wdyt @lvalerom ?

Oops, and it got auto-merged 😅 I will investigate this OOM, as it makes no sense to me that 8GB of memory would be consumed so fast.

@vikin91 vikin91 merged commit 9678d9b into master Sep 11, 2025
140 of 174 checks passed
@vikin91 vikin91 deleted the piotr/ROX-30390-categorized branch September 11, 2025 09:32

Labels

ai-review area/sensor auto-retest PRs with this label will be automatically retested if prow checks fails ci-all-qa-tests Tells CI to run all API tests (not just BAT). ci-race-tests Uses a `-race` build for all e2e tests
