ROX-28259: Reenable entities store history & add enrichment purger#14538
Conversation
This change is part of the following stack: Change managed by git-spice.

Skipping CI for Draft Pull Request.

Images are ready for the commit at 2c27b6f. To use with deploy scripts, first
Codecov Report

Attention: Patch coverage is

@@            Coverage Diff             @@
##           master   #14538      +/-   ##
==========================================
+ Coverage   49.00%   49.06%   +0.06%
==========================================
  Files        2550     2551       +1
  Lines      187280   187595     +315
==========================================
+ Hits        91768    92035     +267
- Misses      88256    88299      +43
- Partials     7256     7261       +5

Flags with carried forward coverage won't be shown.
lvalerom
left a comment
Partial review (I haven't looked at the metrics in too much detail yet). Looks good; my main concern is not fully understanding why we need to also purge `activeEndpoints` if we are purging the `hostConnection.endpoint`.
/retest

/retest

/retest
The gke-upgrade-test fails in a strange way: This looks unrelated. All other *-qa-e2e-tests fail due to
lvalerom
left a comment
LGTM. I'm not giving it a green yet in case you decide to implement my suggestion about disabling the purger. If you'd rather not implement it, let me know and I'll give it a green.
@vikin91: The following tests failed, say
lvalerom
left a comment
LGTM. Thank you for all the improvements 🙂
mtodor
left a comment
Nice work! The changes look good to me. I didn't find anything major that could cause issues. There may be a few nitpicks, but I didn't see them as relevant enough to further delay the PR.
The PR is a bit difficult to review; it's simply a lot of changes in one PR. Since this PR blocks other tasks, I would suggest merging it.
One proposal for future work: let's try to split PRs into smaller ones. That could help with reviews. In this case, I can see three parts:
- refactor manager logic
- introduce additional metrics collection
- introduce purger
Description
This PR implements solutions for the following goals:
How this part of the code works
The enrichment process is there to add data known to Sensor to the data sent by Collector.
Sensor keeps track of open connections (network flows) and endpoints that participate in connections.
When Collector sends an update (every 30s), Sensor analyzes the data and does the following procedure:
- New updates are added to the `connectionsByHost` map. This map can be understood as a queue of updates to be enriched.
- Successfully enriched entries are kept in Sensor (in `activeConnections` & `activeEndpoints`), because further updates from Collector are expected.
- Entries that could not be enriched stay in the `connectionsByHost` map, so that the enrichment can be tried again in the next cycle.
- Entries that were successfully enriched are removed from the `connectionsByHost` map.
- Sensor may remove an entry from the `connectionsByHost` map if it knows that the data is faulty or incomplete and there is no chance for it to work in the next cycle.

The goal is that at the end of each 30s cycle, the `connectionsByHost` map contains only those entries that should be retried again in the next cycle. It is expected that over longer time periods the size of `connectionsByHost` fluctuates around a known (relatively small) number.
Adding the purger

Unfortunately, we observed that in some cases connections and endpoints were never deleted from the `connectionsByHost` map. Moreover, in each enrichment cycle they were copied to the `activeConnections` & `activeEndpoints` slices and were also stuck there. The enrichment process is expected to handle those data structures so that data stays there only for a limited time, but that was not the case.

To make sure that no data is stuck in Sensor, I add a purger in this PR. The purger periodically (30m) checks the three data structures (`activeConnections`, `activeEndpoints`, and `connectionsByHost`) against a list of rules that decide whether a given entry can be removed. The ultimate rule is the `max-age` rule, where the purger checks how long the data has resided in Sensor and deletes it if it has not been enriched for a given time (default 4h).

In clusters with high workload, the purger takes up to 2s (based on metrics collected for a long-running cluster) to complete the cleanup. During that time, it locks two mutexes, so I made sure that it runs rarely (twice per hour). On smaller clusters the runtime of the purger is negligible.
User-facing documentation
Testing and quality
Automated testing
How I validated my change
Mainly by using a long-running cluster and checking metrics. I have been observing long-running clusters for about 2 weeks until I got to the point where:
- The sizes of `activeEndpoints`, `activeConnections`, and `connectionsByHost` in the `networkFlowManager` are not constantly growing

BEFORE: Applying purger only to activeConnections & activeEndpoints
AFTER: Purger cleans activeConnections, activeEndpoints, and connectionsByHost
BEFORE: Typical CPU & memory consumption for fake workloads long-running cluster
AFTER: CPU & memory consumption for fake workloads long-running cluster (with purger)
Summary by Sourcery
Implement a purger mechanism to prevent memory accumulation in Sensor's network flow tracking by periodically cleaning up stale network flows, endpoints, and connections
New Features:
Enhancements:
CI:
Tests:
Chores: