ROX-32965: throttle reprocessing of deployments #18919
Conversation
This change is part of a stack managed by git-spice.

Images are ready for the commit at 9393aec.
Codecov Report

@@           Coverage Diff           @@
##           master   #18919   +/-  ##
=======================================
  Coverage   49.48%   49.48%
=======================================
  Files        2661     2662     +1
  Lines      200784   200879    +95
=======================================
+ Hits        99350    99407    +57
- Misses      94016    94053    +37
- Partials     7418     7419     +1
Force-pushed: 15e2b63 → 439af8c
/test all
johannes94 left a comment:
LGTM. Only a nitpick about the error handling.
Force-pushed: e42c255 → 9393aec
Codecov Report

@@           Coverage Diff           @@
##           master   #18919   +/-  ##
=======================================
  Coverage   49.52%   49.52%
=======================================
  Files        2666     2667     +1
  Lines      201181   201264    +83
=======================================
+ Hits        99630    99674    +44
- Misses      94114    94151    +37
- Partials     7437     7439     +2
@dashrews78: The following tests failed.
Description
I received a comment from an open-source user on a previous PR dealing with evaluating process baselines, which is part of the reprocessing of a deployment. The discussion starts at #16965 (comment).
It turns out the user's problem was an invalid index that made the queries incredibly slow. What they were actually experiencing, though, was that the slow queries on resync caused connections to stack up until all of the database connections were consumed and queries timed out.
That point is key: when we reprocess deployments, we need to be good stewards of our database connections so that other work within Central can proceed while reprocessing is occurring. The point of this PR, then, is to add throttling to the reprocessing of deployments so that the calls to the database are spread out rather than blasted all at once. This also reduces memory pressure, since we no longer hold large amounts of data while waiting on subsequent calls.
User-facing documentation
Testing and quality
Automated testing
How I validated my change
Claude added a test.
CI was used.
This was also executed in a long-running cluster for days. The diagnostic bundle was retrieved and analyzed, with the help of Claude, by comparing it to a diagnostic bundle from another long-running cluster. The query results and the memory pressure both look improved.
In the data below, `throttle` represents a long-running cluster built from this PR, and `btree` represents a run from a couple of weeks ago when the btree indexes were merged. I will run this evaluation again against the stable 4.10 RC once that long-running cluster has run for a few days. (This also compares the profiles attached to ROX-33091.)
Reprocessing Throttle Performance Comparison
Comparison of two test runs evaluating the impact of semaphore-based throttling on deployment risk reprocessing.
The throttle run used dashrews/reprocess-throttle-32965, with a semaphore limiting concurrency to 15 operations. Both runs used identical hardware, database configuration (PostgreSQL 15.12, 10-90 connection pool), and container resource limits (16Gi memory limit, 8Gi request).
Query Performance
Worst-case (max) query times also improved with throttling:
Aggregate Database Time
Reprocessing Performance
Tail Latency (Deployment Risk Processing)
The baseline had 87% fewer sub-4ms operations and 82% fewer sub-8ms operations, indicating severe tail latency degradation from connection pool contention.
Database Health (Dead Tuples)
Throttled incremental processing allows autovacuum to keep up with changes rather than being overwhelmed by large batch operations.
Go Runtime / Memory
The baseline spent 70% more CPU on garbage collection, consistent with higher memory churn from goroutine pile-up.
Memory Spike Analysis (Restart/Resync Scenario)
Heap profiles from a sensor restart scenario (before throttle fix) show the connection pool exhaustion problem:
Root cause: Sensor restart queues all deployments simultaneously. Without throttling, 17 workers/cluster × multiple clusters = 102+ goroutines contend for 90 database connections. Blocked goroutines hold deserialized objects in memory while waiting for connections.
With throttle: The semaphore at 15 limits concurrent DB operations to ~150 (15 goroutines × 10 calls) vs ~1,020 (102 goroutines × 10 calls). Remaining goroutines wait at the semaphore before fetching data, preventing heap accumulation.
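The acquire-before-fetch ordering is what prevents the heap accumulation: a goroutine waiting its turn owns nothing, whereas fetching first and then waiting for a connection would leave every blocked goroutine pinning its deserialized data in memory. A sketch of the pattern, with hypothetical helper names rather than the PR's actual code:

```go
package main

import "fmt"

// sem caps concurrent DB-bound work at 15, the semaphore size used by
// this PR. Illustrative only.
var sem = make(chan struct{}, 15)

// throttledProcess acquires a slot *before* loading any data, so a
// goroutine waiting at the semaphore holds no deserialized objects on
// the heap. Loading first and blocking afterward (the pre-fix behavior)
// would keep the data alive for the whole wait.
func throttledProcess(load func() []byte, use func([]byte) int) int {
	sem <- struct{}{} // wait here, owning nothing
	defer func() { <-sem }()
	data := load() // allocate only once a slot is owned
	return use(data)
}

func main() {
	n := throttledProcess(
		func() []byte { return make([]byte, 1024) }, // stand-in for a DB fetch
		func(b []byte) int { return len(b) },        // stand-in for risk scoring
	)
	fmt.Println(n) // prints 1024
}
```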
Issues Found
No Issues
Conclusion
Throttling reprocessing concurrency to 15 operations via semaphore produces consistent improvements across all measured dimensions: query latency, tail latency, reprocessor throughput, GC overhead, database health, and system responsiveness. It directly addresses the connection pool exhaustion pattern observed during sensor restart/resync scenarios.