ROX-30390: Add TransitionBased update computer #16522
This change is part of the following stack. Change managed by git-spice.
Images are ready for the commit at c972ee5. To use with deploy scripts, first
Codecov Report
## master #16522 +/- ##
==========================================
+ Coverage 48.67% 48.80% +0.13%
==========================================
Files 2676 2679 +3
Lines 199819 200331 +512
==========================================
+ Hits 97264 97776 +512
- Misses 94944 94946 +2
+ Partials 7611 7609 -2
Hey there - I've reviewed your changes - here's some feedback:
- The new Categorized update computer is quite large; consider splitting it into smaller, focused modules (e.g., deduper management, categorization logic, and metrics) to improve readability and maintainability.
- There’s substantial duplication in transition categorization (categorizeUpdate vs categorizeUpdateNoPast) and in keyString/keyHash implementations—extract shared logic into reusable helpers to reduce repetition.
- The name ‘Categorized’ and the use of raw strings for metrics labels can be confusing; consider choosing a more descriptive name for the update computer and using constants for label values to improve clarity and avoid typos.
It looks like a plop e2e test is failing. Edit: I confirmed that the entire e2e test
Images are ready for the commit at c435711. To use with deploy scripts, first

Images are ready for the commit at 15ce78d. To use with deploy scripts, first
lvalerom
left a comment
Looks really good. I added a comment regarding having a way to roll back to legacy mode, just in case.
/retest
Unfortunately the run https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/stackrox_stackrox/16522/pull-ci-stackrox-stackrox-master-ocp-4-12-qa-e2e-tests/1966038639137787904 had an OOM kill for Sensor. It may be a better idea to merge this together with #16532 - wdyt @lvalerom? Oops, and it got auto-merged 😅 I will investigate this OOM, as it makes no sense to me that 8GB of memory would be consumed so fast.
🐑 Shepherds of this stack are @lvalerom and @rhybrillou
Suggestion on how to review
- `transition_based.go` - ~500 lines.
- Moved `NumUpdated` to the `updatecomputer` pkg and added a gauge variant to it.
- Added the `updateComputer` to tests and the production code (boilerplate).

Description
In this description, whenever I write about enriched entity (EE), I mean a connection, endpoint, or process.
This PR adds a new update computer called "TransitionBased". This implementation is optimized for calculating updates based on state transitions of enriched entities.
What problem does it solve?
The new update computer is meant to simplify calculating the updates that are to be sent to Central. This is done by:
Benefits of TransitionBased update computer
- Memory savings* thanks to the `keyString()` functions for the respective EE indicators. Using this function also reduces CPU usage, so that a single enrichment is faster when compared against the legacy implementation.
- Further memory reduction is possible thanks to the `keyHash()` functions for the respective EE indicators.

*) The memory saving was measured by comparing the byte-size of `transitionBased.deduper` (3 instances - one for each EE) plus `transitionBased.closedConnTimestamps` against `legacy.enrichedConnsLastSentState` + `legacy.enrichedEndpointsLastSentState` + `enrichedProcessesLastSentState`.

Disadvantages of TransitionBased update computer

- Additional CPU cost of computing the `keyHash()` functions for the respective EE indicators. Difficult to estimate; based on the CPU profiles, this may be a ~20% longer execution of `transitionBased.ComputeUpdatedConns` than `legacy.ComputeUpdatedConns`.
- The implementation can be optimized either for CPU with the `keyString()` function (good for small clusters) or for memory with the `keyHash()` function (good for large clusters with Sensor getting OOM kills).

Changes in behavior
The transition-based update computer offers many performance improvements, but there are some consequences.
The most important parts were consulted with the Collector team (internal Slack link to discussion).
1️⃣ Deleted open EEs are no longer triggering an update to Central
In the legacy implementation, an update to Central was generated when an EE was opened and then disappeared (without Collector sending the closed timestamp for that EE). In that case, Sensor would assume that the EE was closed at the moment it disappeared from memory.
In the transition-based implementation, we do not send an update to Central when an open EE disappears from Sensor memory (e.g., deleted by purger, or some future algorithm that would manipulate the enrichment queue).
(There are unit tests documenting that behavior.)
2️⃣ No more filtering of closing messages if multiple of them arrive
(This change was decided by the author and is not a limitation of the transition-based implementation. Moreover, it is applied to endpoints and processes, but not to connections, as it looks like it plays a slightly more important role for connections than for other EEs.)
If Collector emits multiple close messages with different timestamps for the same EE, then we would send updates to Central only if the current close timestamp is later than the previous one. In other words, only an update that would tell Central that a given EE lived longer was sent to Central.
The goal of that was to limit the number of updates to Central, in case multiple close updates happen.
In the transition-based implementation, this behavior is preserved for connections but dropped for endpoints and processes.
To preserve this behavior for connections, we keep `closedConnTimestamps` - that costs memory. The memory cost is negligible, but the complexity in code remains.

📝 I am open to reviewer suggestions whether to: (1) keep it the way it is, (2) implement the past behavior for all EEs, (3) drop this behavior for all EEs (i.e., remove it for connections).
3️⃣ Enables the biggest memory-win change
See #16532, where we change the rule for removing already-enriched EEs from the queue. Before that, we kept the open EEs in `hostConns` for as long as they were open. In #16532, we change `hostConns` to be a real enrichment queue, so that it holds only those items that have not been enriched yet or whose enrichment must be retried. That allows removing a large number of items from the enrichment queue (memory saving) and also saves a significant amount of CPU, as all items in the queue are processed on each tick.
The memory and CPU savings are a bit tricky to estimate using metrics (we cannot separate the memory savings and CPU increase won in this PR from the ones introduced in #16532), but I will try to provide some rough estimate.
User-facing documentation
Testing and quality
Automated testing
How I validated my change
- Unit tests for the `keyString()` and `keyHash()` methods.
- A `combined` update computer that runs both `legacy` and `transition-based` in parallel for the purpose of gathering the metrics (see: ROX-30390: Leftovers from the prototype & code instrumentation #16238 - some parts are only for measurement and should never be merged to master).

Memory saved by using the new updateComputer
The graph shows the number of bytes consumed by the `legacy` and `transition-based` update computers (generated from #16238). The cluster was running fake workloads over a weekend.

The new update computer uses only 25% of the memory for endpoints when running under fake workloads (the workload is endpoint-heavy).
Note that the memory usage is higher for processes - visible in the legend. This is something specific to the fake workloads and does not happen in production. The effect is caused by generating the load using random processes attached to a limited pool of endpoints. To Sensor, it looks like the same endpoint starts multiple different processes without closing the previous ones. The deduper sees that as a new process indicator running on the same endpoint and skips deduping. In reality, this happens really rarely, as we usually stop a process before starting a new one. However, one cannot exclude that a malicious actor may start a benign process listening on a port to later attach a malicious process to the same port without closing the previous process. This forces us to use the process data in the deduper and thus pay the price of more memory - see discussion with the Collector team in internal Slack.
The way to go is to change the fake workload generator to not reuse existing endpoints and generate unique endpoints for unique processes. This is being addressed in followups #16629 (prefactor) and #16628 (change how processes are generated).
Number of updates skipped (not sent to Central)
The secondary goal of the enricher is to filter out unnecessary updates to Central. In the graph we can see that over 7 billion updates were not sent (over 3 days) - note `skip` in the legend. The workload was heavy on endpoints and `processesListening` and very light on connections.



This data is only available for the transition-based update computer.
I will try to generate another graph where we compare the number of updates sent to Central for both computers.
CPU and memory consumption
Note that in the experiment both update computers were running in parallel - it means that we see memory consumption of both legacy and transition-based in the graph.
Sensor was assigned a 16GB memory limit and didn't go OOM over 3 days.