ROX-32965: throttle reprocessing of deployments #18919
Conversation
This change is part of a stack managed by git-spice.

Images are ready for the commit at 9393aec.
Codecov Report

@@           Coverage Diff           @@
##           master   #18919   +/-  ##
=======================================
  Coverage   49.48%   49.48%
=======================================
  Files        2661     2662     +1
  Lines      200784   200879    +95
=======================================
+ Hits        99350    99407    +57
- Misses      94016    94053    +37
- Partials     7418     7419     +1
Force-pushed: 15e2b63 → 439af8c
/test all
johannes94 left a comment:
LGTM. Only a nitpick about the error handling.
Force-pushed: e42c255 → 9393aec
Codecov Report

@@           Coverage Diff           @@
##           master   #18919   +/-  ##
=======================================
  Coverage   49.52%   49.52%
=======================================
  Files        2666     2667     +1
  Lines      201181   201264    +83
=======================================
+ Hits        99630    99674    +44
- Misses      94114    94151    +37
- Partials     7437     7439     +2
@dashrews78: The following tests failed.
Description
I received a comment from an open-source user on a previous PR dealing with evaluating process baselines, which is part of the reprocessing of a deployment. The discussion starts at #16965 (comment).
It turns out the user's problem was an invalid index that made the queries incredibly slow. What they were actually experiencing, though, was that the slow queries on resync caused connections to stack up until all of the database connections were consumed and queries timed out.
That point is key: when we reprocess deployments, we need to be good stewards of our database connections so that other work within Central can proceed while reprocessing is occurring. The point of this PR, then, is to add throttling to the reprocessing of deployments so that the calls to the database are spread out rather than blasted all at once. This also reduces memory pressure, since we no longer hold large amounts of data while waiting on subsequent calls.
User-facing documentation
Testing and quality
Automated testing
How I validated my change
Claude added a test.
CI was used.
This was also executed in a long-running cluster for days. The diagnostic bundle was retrieved and analyzed, with the help of Claude, by comparing it to a diagnostic bundle from another long-running cluster. The query results and the memory pressure both look improved.
In the data below, `throttle` represents a long-running cluster built from this PR, and `btree` represents a run from a couple of weeks ago when the btree indexes were merged. I will run this evaluation again against the stable 4.10 RC once that long-running cluster has run for a few days. (This also compares the profiles attached to ROX-33091.)
Reprocessing Throttle Performance Comparison
Comparison of two test runs evaluating the impact of semaphore-based throttling on deployment risk reprocessing.
The throttle run used dashrews/reprocess-throttle-32965, with a semaphore limiting concurrency to 15 operations. Both runs used identical hardware, database configuration (PostgreSQL 15.12, 10-90 connection pool), and container resource limits (16Gi memory limit, 8Gi request).
Query Performance
Worst-case (max) query times also improved with throttling:
Aggregate Database Time
Reprocessing Performance
Tail Latency (Deployment Risk Processing)
The baseline had 87% fewer sub-4ms operations and 82% fewer sub-8ms operations, indicating severe tail latency degradation from connection pool contention.
Database Health (Dead Tuples)
Throttled incremental processing allows autovacuum to keep up with changes rather than being overwhelmed by large batch operations.
Go Runtime / Memory
The baseline spent 70% more CPU on garbage collection, consistent with higher memory churn from goroutine pile-up.
Memory Spike Analysis (Restart/Resync Scenario)
Heap profiles from a sensor restart scenario (before throttle fix) show the connection pool exhaustion problem:
Root cause: Sensor restart queues all deployments simultaneously. Without throttling, 17 workers/cluster × multiple clusters = 102+ goroutines contend for 90 database connections. Blocked goroutines hold deserialized objects in memory while waiting for connections.
With throttle: The semaphore at 15 limits concurrent DB operations to ~150 (15 goroutines × 10 calls) vs ~1,020 (102 goroutines × 10 calls). Remaining goroutines wait at the semaphore before fetching data, preventing heap accumulation.
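The acquire-before-fetch ordering is what prevents the heap accumulation: a goroutine waiting its turn owns nothing, whereas fetching first and then waiting for a connection would leave every blocked goroutine pinning its deserialized data in memory. A sketch of the pattern, with hypothetical helper names rather than the PR's actual code:

```go
package main

import "fmt"

// sem caps concurrent DB-bound work at 15, the semaphore size used by
// this PR. Illustrative only.
var sem = make(chan struct{}, 15)

// throttledProcess acquires a slot *before* loading any data, so a
// goroutine waiting at the semaphore holds no deserialized objects on
// the heap. Loading first and blocking afterward (the pre-fix behavior)
// would keep the data alive for the whole wait.
func throttledProcess(load func() []byte, use func([]byte) int) int {
	sem <- struct{}{} // wait here, owning nothing
	defer func() { <-sem }()
	data := load() // allocate only once a slot is owned
	return use(data)
}

func main() {
	n := throttledProcess(
		func() []byte { return make([]byte, 1024) }, // stand-in for a DB fetch
		func(b []byte) int { return len(b) },        // stand-in for risk scoring
	)
	fmt.Println(n) // prints 1024
}
```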
Issues Found
No Issues
Conclusion
Throttling reprocessing concurrency to 15 operations via semaphore produces consistent improvements across all measured dimensions: query latency, tail latency, reprocessor throughput, GC overhead, database health, and system responsiveness. It directly addresses the connection pool exhaustion pattern observed during sensor restart/resync scenarios.