Skip to content

ROX-32965: throttle reprocessing of deployments#18919

Merged
dashrews78 merged 8 commits intomasterfrom
dashrews/reprocess-throttle-32965
Feb 19, 2026
Merged

ROX-32965: throttle reprocessing of deployments#18919
dashrews78 merged 8 commits intomasterfrom
dashrews/reprocess-throttle-32965

Conversation

@dashrews78
Copy link
Contributor

@dashrews78 dashrews78 commented Feb 9, 2026

Description

I received a comment from an opensource user on a previous PR dealing with evaluating process baselines. That is part of the reprocessing of a deployment. The comments can be found from this comment on down. #16965 (comment).

Turns out the users problem was an invalid index causing the queries to be incredibly slow. But what they were experience was the slow queries on resync causing connections to stack up and eventually all the connections to the database were consumed and queries timed out.

That point is key, when we reprocess deployments we need to be good stewards of our database connections so that other work within central can be performed while reprocessing is occurring. So after all that, the point of this PR is to add throttling into the reprocessing of deployments so that that the calls to the database are more spread out and not blasted at once. This also helps memory pressure as we are no longer holding loads of data waiting on subsequent calls.

User-facing documentation

Testing and quality

  • the change is production ready: the change is GA, or otherwise the functionality is gated by a feature flag
  • CI results are inspected

Automated testing

  • added unit tests
  • added e2e tests
  • added regression tests
  • added compatibility tests
  • modified existing tests

How I validated my change

Claude added a test.
CI was used.
This was also executed in a long running cluster for days. The diagnostic bundle was retrieved and analyzed with the help of Claude comparing it to a diagnostic bundle from another long running cluster. The results of the queries and the memory pressure looks to be improved.

In the data below throttle represents a long running cluster from this PR. 'btree' represents a run from a couple of weeks ago when btree indexes were merged. I will run this evaluation again against the stable 4.10 RC once that long running cluster has ran for a few days. (This also compares the profiles attached to ROX-33091)

Reprocessing Throttle Performance Comparison

Comparison of two test runs evaluating the impact of semaphore-based throttling on deployment risk reprocessing.

  • Throttle run: Branch dashrews/reprocess-throttle-32965, semaphore limiting concurrency to 15 operations
  • Btree run (baseline): No throttling, all reprocessing goroutines proceed concurrently

Both runs used identical hardware, database configuration (PostgreSQL 15.12, 10-90 connection pool), and container resource limits (16Gi memory limit, 8Gi request).


Query Performance

Query Type Throttle Mean Btree Mean Change Throttle Total Btree Total Change
COPY listening_endpoints 490.89 ms 681.16 ms -27.9% 8,070s 11,803s -31.6%
COPY process_indicators 22.37 ms 30.22 ms -26.0% 419s 608s -31.1%
SELECT alerts 0.19 ms 0.24 ms -21.9% 1,694s 2,230s -24.0%
DELETE listening_endpoints (PodUid) 0.20 ms 0.24 ms -19.2% 329s 426s -22.9%
SELECT image_cves_v2 0.06 ms 0.07 ms -13.2% 500s 630s -20.7%
DELETE listening_endpoints (Id) 79.91 ms 88.38 ms -9.6% 1,505s 1,723s -12.7%
DELETE network_flows_v2 10.56 ms 10.92 ms -3.4% 1,100s 1,202s -8.5%
COPY image_cves_v2 57.56 ms 52.94 ms +8.7% 356s 331s +7.3%

Worst-case (max) query times also improved with throttling:

Query Type Throttle Max Btree Max Change
COPY listening_endpoints 4,129 ms 6,962 ms -40.7%
COPY image_cves_v2 2,224 ms 3,463 ms -35.8%

Aggregate Database Time

Metric Throttle Btree Change
Total query execution time 459.6 min 584.2 min -21.3%
Total query calls 89.8M 94.1M -4.6%
Mean time per call 307 us 372 us -17.5%

Reprocessing Performance

Metric Throttle Btree Change
Reprocessor loop duration 717s 857s -16.3%
Mean deployment risk time 15.88 ms 19.32 ms -17.8%
Mean image risk time 65.17 ms 73.29 ms -11.1%
Node component risk time 0.017 ms 0.035 ms -51.4%

Tail Latency (Deployment Risk Processing)

Bucket Throttle Btree Change
< 4 ms 102 13 +685%
< 8 ms 24,725 4,380 +465%
< 16 ms 185,388 112,868 +64%
< 32 ms 283,293 289,397 +2.2%

The baseline had 87% fewer sub-4ms operations and 82% fewer sub-8ms operations, indicating severe tail latency degradation from connection pool contention.


Database Health (Dead Tuples)

Table Throttle Dead % Btree Dead % Improvement
listening_endpoints 0.09% 10.9% 121x cleaner
network_flows (partition) 0.00% 3.22% eliminated
process_indicators 2.34% 4.49% 48% fewer
image_component_v2 2.81% 3.35% 16% fewer
image_cves_v2 2.11% 2.97% 29% fewer

Throttled incremental processing allows autovacuum to keep up with changes rather than being overwhelmed by large batch operations.


Go Runtime / Memory

Metric Throttle Btree Change
RSS 1.00 GB 1.01 GB ~identical
Heap allocated 265 MB 290 MB -8.6%
GC cycles 30,525 32,392 -5.8%
GC total pause 4.65s 11.03s -57.8%
GC mean pause 0.152 ms 0.341 ms -55.4%
GC CPU fraction 0.42% 0.72% -41.7%
Health check (ping) latency 0.221 ms 0.320 ms -30.9%

The baseline spent 70% more CPU on garbage collection, consistent with higher memory churn from goroutine pile-up.


Memory Spike Analysis (Restart/Resync Scenario)

Heap profiles from a sensor restart scenario (before throttle fix) show the connection pool exhaustion problem:

Metric Before Restart After Restart Growth
Total heap 711.54 MB 769.02 MB +57.48 MB (+8.1%)
pgx scanning (DB deserialization) 174.44 MB 222.25 MB +47.81 MB (+27.4%)
Process filter/baseline eval 46.01 MB 61.01 MB +15.00 MB (+32.6%)
Network graph construction 16.55 MB 24.14 MB +7.59 MB (+45.9%)

Root cause: Sensor restart queues all deployments simultaneously. Without throttling, 17 workers/cluster x multiple clusters = 102+ goroutines contend for 90 database connections. Blocked goroutines hold deserialized objects in memory while waiting for connections.

With throttle: Semaphore at 15 limits concurrent DB operations to ~150 (15 goroutines x 10 calls) vs ~1,020 (102 goroutines x 10 calls). Remaining goroutines wait at the semaphore before fetching data, preventing heap accumulation.


Issues Found

No Issues

  • 0 panics, fatals, OOM errors
  • 0 connection pool exhaustion, broken pipes, context deadline exceeded
  • Semaphore mechanism functioned correctly: 10,734 acquire/release cycles, perfect 1:1:1 ratio, sub-millisecond acquisition latency
  • Network flows DELETE spikes (up to 7,988ms) are periodic maintenance on a 60-minute schedule, not throttle-related
  • Alerts table dead tuple anomaly (3.39% vs 0.65%) not caused by throttle activity

Conclusion

Throttling reprocessing concurrency to 15 operations via semaphore produces consistent improvements across all measured dimensions: query latency, tail latency, reprocessor throughput, GC overhead, database health, and system responsiveness. It directly addresses the connection pool exhaustion pattern observed during sensor restart/resync scenarios.

@dashrews78
Copy link
Contributor Author

dashrews78 commented Feb 9, 2026

This change is part of the following stack:

Change managed by git-spice.

@openshift-ci
Copy link

openshift-ci bot commented Feb 9, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@rhacs-bot
Copy link
Contributor

rhacs-bot commented Feb 9, 2026

Images are ready for the commit at 9393aec.

To use with deploy scripts, first export MAIN_IMAGE_TAG=4.11.x-112-g9393aec812.

@codecov
Copy link

codecov bot commented Feb 9, 2026

Codecov Report

❌ Patch coverage is 85.96491% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 49.48%. Comparing base (336c1f1) to head (15e2b63).

Files with missing lines Patch % Lines
...l/sensor/service/pipeline/reprocessing/pipeline.go 85.96% 7 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           master   #18919   +/-   ##
=======================================
  Coverage   49.48%   49.48%           
=======================================
  Files        2661     2662    +1     
  Lines      200784   200879   +95     
=======================================
+ Hits        99350    99407   +57     
- Misses      94016    94053   +37     
- Partials     7418     7419    +1     
Flag Coverage Δ
go-unit-tests 49.48% <85.96%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@dashrews78 dashrews78 force-pushed the dashrews/reprocess-throttle-32965 branch from 15e2b63 to 439af8c Compare February 11, 2026 19:59
@dashrews78
Copy link
Contributor Author

/test all

@dashrews78 dashrews78 changed the title ROX-32965: throttle reprocessing of deployments ROX-32965, ROX-33091: throttle reprocessing of deployments Feb 11, 2026
@dashrews78 dashrews78 marked this pull request as ready for review February 12, 2026 11:20
Copy link
Contributor

@johannes94 johannes94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Only a nitpick about the error handling.

@dashrews78 dashrews78 changed the title ROX-32965, ROX-33091: throttle reprocessing of deployments ROX-32965: throttle reprocessing of deployments Feb 12, 2026
@dashrews78 dashrews78 force-pushed the dashrews/reprocess-throttle-32965 branch from e42c255 to 9393aec Compare February 13, 2026 17:39
@codecov
Copy link

codecov bot commented Feb 13, 2026

Codecov Report

❌ Patch coverage is 75.55556% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 49.52%. Comparing base (bfcce03) to head (9393aec).
⚠️ Report is 3 commits behind head on master.

Files with missing lines Patch % Lines
...l/sensor/service/pipeline/reprocessing/pipeline.go 75.55% 10 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           master   #18919   +/-   ##
=======================================
  Coverage   49.52%   49.52%           
=======================================
  Files        2666     2667    +1     
  Lines      201181   201264   +83     
=======================================
+ Hits        99630    99674   +44     
- Misses      94114    94151   +37     
- Partials     7437     7439    +2     
Flag Coverage Δ
go-unit-tests 49.52% <75.55%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@openshift-ci
Copy link

openshift-ci bot commented Feb 13, 2026

@dashrews78: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/ocp-4-20-qa-e2e-tests 9393aec link false /test ocp-4-20-qa-e2e-tests
ci/prow/ocp-4-20-operator-e2e-tests 9393aec link false /test ocp-4-20-operator-e2e-tests

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@dashrews78 dashrews78 merged commit 7dff17d into master Feb 19, 2026
170 of 185 checks passed
@dashrews78 dashrews78 deleted the dashrews/reprocess-throttle-32965 branch February 19, 2026 11:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants