
ROX-30963: Only query indicator fields needed for evaluation #16965

Merged
dashrews78 merged 7 commits into master from dashrews/evaluate-baseline-query-needed-30963
Oct 1, 2025

Conversation

@dashrews78
Contributor

@dashrews78 dashrews78 commented Sep 22, 2025

Description

This PR strips down the data pulled from the database. It uses our "view"-like mechanism to run SELECT queries that retrieve only the six or so fields required for re-evaluation. Notice in the explain plans below that the width of each returned row is cut in half, which will save memory on both central and central-db as well as traffic between the two. The benchmark results below can also be compared to those in #16932: the speed is very slightly slower, but the memory allocations are cut in half with this PR.

 go test -tags sql_integration -bench=. -benchmem -run=^$
goos: darwin
goarch: arm64
pkg: github.com/stackrox/rox/central/processbaseline/evaluator
cpu: Apple M3 Pro
BenchmarkEvaluateBaselinesAndPersistResult/100_processes_2_containers-12         	    3396	    317166 ns/op	  154897 B/op	    3761 allocs/op
BenchmarkEvaluateBaselinesAndPersistResult/500_processes_3_containers-12         	    1332	   1200867 ns/op	  745970 B/op	   19772 allocs/op
BenchmarkEvaluateBaselinesAndPersistResult/1000_processes_5_containers-12        	     471	   3253969 ns/op	 1902272 B/op	   50981 allocs/op
BenchmarkEvaluateBaselinesAndPersistResult/1000_processes_5_containers_large_args-12         	     254	   5629626 ns/op	 4639233 B/op	   82186 allocs/op
BenchmarkEvaluateBaselinesAndPersistResult/2000_processes_10_containers-12                   	     150	   9233982 ns/op	 6856404 B/op	  143391 allocs/op
BenchmarkEvaluateBaselinesAndPersistResult/10000_processes_20_containers-12                  	      80	  22013626 ns/op	14314797 B/op	  345402 allocs/op
BenchmarkEvaluateBaselinesAndPersistResult/25000_processes_50_containers-12                  	      32	  52227600 ns/op	32828808 B/op	  847411 allocs/op
BenchmarkEvaluateBaselinesAndPersistResult/50000_processes_100_containers-12                 	      13	 116390516 ns/op	68937164 B/op	 1849421 allocs/op
BenchmarkEvaluateBaselinesAndPersistResult/50000_processes_100_containers_large_args-12      	       4	 287536584 ns/op	165883168 B/op	 2851423 allocs/op
BenchmarkEvaluateBaselinesSmallScale/2500_processes_5_containers-12                          	     738	   1622809 ns/op	  973947 B/op	   26472 allocs/op
PASS
ok  	github.com/stackrox/rox/central/processbaseline/evaluator	28.193s

Explain plans -- the top one is the old query, the bottom one is the query from this PR. Both queries access the same rows, but you can see by the width that the new query returns far less data per row, which is friendlier to both central and central-db.

-----------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.process_indicators  (cost=4.19..102.48 rows=25 width=441) (actual time=0.020..0.080 rows=12 loops=1)
   Output: serialized
   Recheck Cond: (process_indicators.deploymentid = '4f95d656-8460-4fd0-8778-96d268ec791f'::uuid)
   Heap Blocks: exact=12
   Buffers: shared hit=14
   WAL: records=2 bytes=112
   ->  Bitmap Index Scan on processindicators_deploymentid  (cost=0.00..4.19 rows=25 width=0) (actual time=0.011..0.011 rows=12 loops=1)
         Index Cond: (process_indicators.deploymentid = '4f95d656-8460-4fd0-8778-96d268ec791f'::uuid)
         Buffers: shared hit=2
 Query Identifier: -2289536428206420033
 Planning:
   Buffers: shared hit=80
 Planning Time: 0.392 ms
 Execution Time: 0.097 ms
(14 rows)

central_active=# explain (analyze, verbose, buffers, costs, wal) select process_indicators.Id::text as process_id, process_indicators.ContainerName as container_name, process_indicators.Signal_ExecFilePath as process_path, process_indicators.ContainerStartTime as process_container_start_time, process_indicators.Signal_Time as process_creation_time, process_indicators.Signal_Name as process_name, process_indicators.Signal_Args as process_arguments from process_indicators where process_indicators.DeploymentId = '4f95d656-8460-4fd0-8778-96d268ec791f';
                                                               QUERY PLAN                                                                
-----------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.process_indicators  (cost=4.19..102.60 rows=25 width=149) (actual time=0.019..0.057 rows=12 loops=1)
   Output: (id)::text, containername, signal_execfilepath, containerstarttime, signal_time, signal_name, signal_args
   Recheck Cond: (process_indicators.deploymentid = '4f95d656-8460-4fd0-8778-96d268ec791f'::uuid)
   Heap Blocks: exact=12
   Buffers: shared hit=14
   WAL: records=1 bytes=56
   ->  Bitmap Index Scan on processindicators_deploymentid  (cost=0.00..4.19 rows=25 width=0) (actual time=0.010..0.010 rows=12 loops=1)
         Index Cond: (process_indicators.deploymentid = '4f95d656-8460-4fd0-8778-96d268ec791f'::uuid)
         Buffers: shared hit=2
 Query Identifier: -4387973057236564826
 Planning:
   Buffers: shared hit=97
 Planning Time: 0.324 ms
 Execution Time: 0.073 ms
(14 rows)

User-facing documentation

Testing and quality

  • the change is production ready: the change is GA, or otherwise the functionality is gated by a feature flag
  • CI results are inspected

Automated testing

  • added unit tests
  • added e2e tests
  • added regression tests
  • added compatibility tests
  • modified existing tests

How I validated my change

change me!

@dashrews78
Contributor Author

dashrews78 commented Sep 22, 2025

This change is part of the following stack:

Change managed by git-spice.

@openshift-ci

openshift-ci bot commented Sep 22, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@rhacs-bot
Contributor

rhacs-bot commented Sep 22, 2025

Images are ready for the commit at 135af34.

To use with deploy scripts, first export MAIN_IMAGE_TAG=4.9.x-940-g135af34b1e.

@dashrews78 dashrews78 force-pushed the dashrews/container-start-migration-30961 branch 2 times, most recently from 494fd3a to c4ccc51 on September 24, 2025 19:37
@dashrews78 dashrews78 force-pushed the dashrews/evaluate-baseline-query-needed-30963 branch from 50cf275 to 1db6ed5 on September 24, 2025 19:37
@dashrews78 dashrews78 requested a review from janisz on September 24, 2025 19:44
@dashrews78 dashrews78 force-pushed the dashrews/container-start-migration-30961 branch 2 times, most recently from 894aad5 to 3915aa5 on September 29, 2025 13:40
@dashrews78 dashrews78 changed the base branch from dashrews/container-start-migration-30961 to dashrews/container-start-migration-30961-sequential on September 29, 2025 18:12
@dashrews78 dashrews78 force-pushed the dashrews/evaluate-baseline-query-needed-30963 branch from 1db6ed5 to 06aa438 on September 29, 2025 18:12
@dashrews78 dashrews78 changed the base branch from dashrews/container-start-migration-30961-sequential to master on September 29, 2025 18:30
@dashrews78 dashrews78 force-pushed the dashrews/evaluate-baseline-query-needed-30963 branch from 06aa438 to 7b39116 on September 29, 2025 18:30
@dashrews78 dashrews78 marked this pull request as ready for review on September 29, 2025 18:30
Contributor

@sourcery-ai sourcery-ai bot left a comment

Hey there - I've reviewed your changes - here's some feedback:

  • In GetProcessIndicatorsRiskView you’re using WriteAllowed for SAC checks even though it’s a read-only operation—switch to ReadAllowed to accurately reflect permissions.
  • Avoid logging errors inside GetProcessIndicatorsRiskView; wrap and return the error instead so callers can decide how to report it without duplicate log entries.
  • You now have almost identical logic for storage.ProcessIndicator and views.ProcessIndicatorRiskView (e.g., BaselineItemFromProcess vs. BaselineItemFromProcessView); consider unifying these via a shared interface or helper to avoid drift.
## Individual Comments

### Comment 1
<location> `central/processbaseline/baselineutils.go:89` </location>
<code_context>
+// IsStartupProcess determines if the process is a startup process
+// A process is considered a startup process if it happens within the first ContainerStartupDuration and was not scraped
+// but instead pulled from exec
+func IsStartupProcessView(process *views.ProcessIndicatorRiskView) bool {
+	if process.ContainerStartTime == nil {
+		return false
</code_context>

<issue_to_address>
**issue (bug_risk):** IsStartupProcessView does not check for nil SignalTime.

A nil SignalTime can result in protoutils.Sub panicking or misbehaving. Please add a nil check for SignalTime as done for ContainerStartTime.
</issue_to_address>

### Comment 2
<location> `central/processindicator/datastore/datastore_impl.go:229` </location>
<code_context>
 }

+// GetProcessIndicatorsRiskView retrieves minimal fields from process indicator for risk evaluation
+func (ds *datastoreImpl) GetProcessIndicatorsRiskView(ctx context.Context, q *v1.Query) ([]*views.ProcessIndicatorRiskView, error) {
+	if ok, err := deploymentExtensionSAC.WriteAllowed(ctx); err != nil {
+		return nil, err
</code_context>

<issue_to_address>
**issue (complexity):** Consider extracting the SAC check and risk view query-selects into helper functions to simplify GetProcessIndicatorsRiskView.

We can collapse most of the plumbing in `GetProcessIndicatorsRiskView` by extracting the SAC check and the "risk view" query selects into small, reusable helpers. This keeps the method focused and avoids manual duplication.

1) In `central/processindicator/views/helpers.go`, define the select‐fields builder:

```go
package views

import (
  pkgSearch "github.com/stackrox/rox/pkg/search"
  v1 "github.com/stackrox/rox/generated/api/v1"
)

// BuildRiskViewQuery clones `q` and injects exactly the selects needed for risk view.
func BuildRiskViewQuery(q *v1.Query) *v1.Query {
  clone := q.CloneVT()
  clone.Selects = []*v1.QuerySelect{
    pkgSearch.NewQuerySelect(pkgSearch.ProcessID).Proto(),
    pkgSearch.NewQuerySelect(pkgSearch.ContainerName).Proto(),
    pkgSearch.NewQuerySelect(pkgSearch.ProcessExecPath).Proto(),
    pkgSearch.NewQuerySelect(pkgSearch.ProcessContainerStartTime).Proto(),
    pkgSearch.NewQuerySelect(pkgSearch.ProcessCreationTime).Proto(),
    pkgSearch.NewQuerySelect(pkgSearch.ProcessName).Proto(),
    pkgSearch.NewQuerySelect(pkgSearch.ProcessArguments).Proto(),
  }
  return clone
}
```

2) In your datastore file, factor out the SAC guard:

```go
func requireDeploymentExtensionWrite(ctx context.Context) error {
  ok, err := deploymentExtensionSAC.WriteAllowed(ctx)
  if err != nil {
    return err
  }
  if !ok {
    return sac.ErrResourceAccessDenied
  }
  return nil
}
```

3) Simplify `GetProcessIndicatorsRiskView`:

```go
func (ds *datastoreImpl) GetProcessIndicatorsRiskView(
  ctx context.Context,
  q *v1.Query,
) ([]*views.ProcessIndicatorRiskView, error) {
  if err := requireDeploymentExtensionWrite(ctx); err != nil {
    return nil, err
  }

  riskQ := views.BuildRiskViewQuery(q)
  return pgSearch.RunSelectRequestForSchema[views.ProcessIndicatorRiskView](
    ctx,
    ds.db,
    pkgSchema.ProcessIndicatorsSchema,
    riskQ,
  )
}
```

This keeps all functionality intact but removes inline plumbing, making the datastore method much easier to read and maintain.
</issue_to_address>


@dashrews78 dashrews78 force-pushed the dashrews/evaluate-baseline-query-needed-30963 branch from f03a3cd to 55f4df3 on September 30, 2025 09:51
@codecov

codecov bot commented Sep 30, 2025

Codecov Report

❌ Patch coverage is 33.33333% with 30 lines in your changes missing coverage. Please review.
✅ Project coverage is 48.78%. Comparing base (85aa9d2) to head (135af34).
⚠️ Report is 6 commits behind head on master.

Files with missing lines Patch % Lines
...ntral/processindicator/datastore/datastore_impl.go 0.00% 25 Missing ⚠️
central/processindicator/datastore/singleton.go 0.00% 4 Missing ⚠️
...entral/processbaseline/evaluator/evaluator_impl.go 87.50% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #16965      +/-   ##
==========================================
- Coverage   48.79%   48.78%   -0.02%     
==========================================
  Files        2711     2711              
  Lines      202284   202311      +27     
==========================================
- Hits        98712    98692      -20     
- Misses      95793    95834      +41     
- Partials     7779     7785       +6     
Flag Coverage Δ
go-unit-tests 48.78% <33.33%> (-0.02%) ⬇️


@janisz
Contributor

janisz commented Sep 30, 2025

Out of curiosity I ran the updated benchmark from #17078 on master, #16987, and this PR.
What we can observe: ideally we should combine the two approaches to get the best memory-reduction results:

                                                                              │   master       │            #16987                   │              #16965                  │
                                                                              │     sec/op     │    sec/op     vs base               │    sec/op      vs base               │
EvaluateBaselinesAndPersistResult/100_processes_2_containers-8                   3.234m ± 582%   2.982m ± 22%        ~ (p=0.240 n=6)    3.800m ± 52%        ~ (p=0.394 n=6)
EvaluateBaselinesAndPersistResult/500_processes_3_containers-8                   5.712m ±  12%   5.006m ± 25%        ~ (p=0.240 n=6)    6.997m ± 35%  +22.50% (p=0.009 n=6)
EvaluateBaselinesAndPersistResult/1000_processes_5_containers-8                 13.669m ±  37%   9.174m ± 39%  -32.88% (p=0.009 n=6)   11.470m ± 25%        ~ (p=0.093 n=6)
EvaluateBaselinesAndPersistResult/1000_processes_5_containers_large_args-8       13.96m ±   9%   11.48m ± 15%  -17.77% (p=0.009 n=6)    18.36m ± 25%  +31.53% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/2000_processes_10_containers-8                 15.64m ±  67%   15.07m ± 25%        ~ (p=0.394 n=6)    17.11m ± 14%        ~ (p=0.065 n=6)
EvaluateBaselinesAndPersistResult/10000_processes_20_containers-8                73.41m ±  57%   62.33m ± 19%  -15.09% (p=0.026 n=6)    86.49m ± 21%        ~ (p=0.065 n=6)
EvaluateBaselinesAndPersistResult/25000_processes_50_containers-8                176.7m ±   4%   145.7m ±  5%  -17.59% (p=0.002 n=6)    205.9m ± 16%  +16.49% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/50000_processes_100_containers-8               352.4m ±   7%   277.5m ±  2%  -21.26% (p=0.002 n=6)    413.4m ± 23%  +17.30% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/50000_processes_100_containers_large_args-8    501.9m ±  10%   466.5m ± 11%   -7.07% (p=0.026 n=6)    509.8m ± 25%        ~ (p=0.310 n=6)
geomean                                                                          36.89m          31.18m        -15.48%                  41.43m        +12.30%

                                                                              │   master       │            #16987                     │              #16965                   │
                                                                              │ max_heap_bytes │ max_heap_bytes  vs base               │ max_heap_bytes  vs base               │
EvaluateBaselinesAndPersistResult/100_processes_2_containers-8                     10.76M ± 1%     10.79M ±  0%        ~ (p=0.180 n=6)      10.27M ± 0%   -4.57% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/500_processes_3_containers-8                     14.75M ± 0%     14.72M ±  0%   -0.18% (p=0.015 n=6)      11.82M ± 0%  -19.88% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/1000_processes_5_containers-8                    19.54M ± 1%     19.26M ±  1%   -1.42% (p=0.026 n=6)      13.56M ± 0%  -30.62% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/1000_processes_5_containers_large_args-8         25.30M ± 0%     19.75M ±  3%  -21.91% (p=0.002 n=6)      18.87M ± 2%  -25.42% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/2000_processes_10_containers-8                   29.68M ± 0%     22.67M ± 14%  -23.62% (p=0.002 n=6)      17.09M ± 0%  -42.42% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/10000_processes_20_containers-8                 109.78M ± 1%     41.24M ± 56%  -62.43% (p=0.002 n=6)      40.25M ± 3%  -63.33% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/25000_processes_50_containers-8                 260.41M ± 1%     50.99M ± 44%  -80.42% (p=0.002 n=6)      88.32M ± 2%  -66.09% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/50000_processes_100_containers-8                509.47M ± 1%     80.57M ± 44%  -84.19% (p=0.002 n=6)     173.10M ± 4%  -66.02% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/50000_processes_100_containers_large_args-8      832.9M ± 0%     100.1M ± 69%  -87.98% (p=0.002 n=6)      470.3M ± 1%  -43.53% (p=0.002 n=6)
geomean                                                                            67.28M          30.56M        -54.58%                    37.69M       -43.98%

                                                                              │    master        │           #16987                        │               #16965                    │
                                                                              │ max_heap_objects │ max_heap_objects  vs base               │ max_heap_objects  vs base               │
EvaluateBaselinesAndPersistResult/100_processes_2_containers-8                       68.75k ± 1%       68.91k ±  0%        ~ (p=0.065 n=6)        73.46k ± 0%   +6.85% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/500_processes_3_containers-8                       96.87k ± 0%       96.99k ±  0%        ~ (p=0.240 n=6)       113.70k ± 0%  +17.38% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/1000_processes_5_containers-8                      130.3k ± 1%       128.5k ±  1%        ~ (p=0.065 n=6)        161.4k ± 0%  +23.92% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/1000_processes_5_containers_large_args-8           130.2k ± 0%       106.1k ±  1%  -18.53% (p=0.002 n=6)        160.2k ± 2%  +23.06% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/2000_processes_10_containers-8                     200.1k ± 0%       153.1k ± 15%  -23.46% (p=0.002 n=6)        255.5k ± 0%  +27.70% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/10000_processes_20_containers-8                    758.0k ± 0%       285.9k ± 50%  -62.29% (p=0.002 n=6)        874.0k ± 5%  +15.29% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/25000_processes_50_containers-8                   1807.4k ± 0%       355.3k ± 31%  -80.34% (p=0.002 n=6)       2096.2k ± 2%  +15.98% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/50000_processes_100_containers-8                  3557.0k ± 0%       546.7k ± 42%  -84.63% (p=0.002 n=6)       4257.1k ± 4%  +19.68% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/50000_processes_100_containers_large_args-8       3557.1k ± 0%       436.0k ± 54%  -87.74% (p=0.002 n=6)       4170.7k ± 2%  +17.25% (p=0.002 n=6)
geomean                                                                              418.3k            191.0k        -54.34%                      495.3k       +18.43%

                                                                              │  master       │            #16987                   │             #16965                  │
                                                                              │     B/op      │     B/op       vs base              │     B/op      vs base               │
EvaluateBaselinesAndPersistResult/100_processes_2_containers-8                  1166.6Ki ± 0%   1160.3Ki ± 0%  -0.54% (p=0.002 n=6)   680.4Ki ± 0%  -41.68% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/500_processes_3_containers-8                   4.914Mi ± 0%    4.852Mi ± 0%  -1.27% (p=0.002 n=6)   2.117Mi ± 0%  -56.92% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/1000_processes_5_containers-8                  9.687Mi ± 0%    9.462Mi ± 0%  -2.33% (p=0.002 n=6)   3.774Mi ± 0%  -61.04% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/1000_processes_5_containers_large_args-8      15.051Mi ± 0%   14.826Mi ± 1%  -1.49% (p=0.002 n=6)   8.958Mi ± 0%  -40.48% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/2000_processes_10_containers-8                19.228Mi ± 0%   18.675Mi ± 0%  -2.87% (p=0.002 n=6)   7.130Mi ± 0%  -62.92% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/10000_processes_20_containers-8                96.39Mi ± 0%    92.99Mi ± 0%  -3.52% (p=0.002 n=6)   34.68Mi ± 0%  -64.02% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/25000_processes_50_containers-8               241.36Mi ± 0%   232.97Mi ± 0%  -3.48% (p=0.002 n=6)   87.45Mi ± 0%  -63.77% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/50000_processes_100_containers-8               482.8Mi ± 0%    466.2Mi ± 0%  -3.44% (p=0.002 n=6)   175.7Mi ± 0%  -63.61% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/50000_processes_100_containers_large_args-8    791.4Mi ± 0%    772.7Mi ± 0%  -2.35% (p=0.002 n=6)   463.0Mi ± 0%  -41.50% (p=0.002 n=6)
geomean                                                                          37.29Mi         36.41Mi       -2.37%                 16.35Mi       -56.15%

                                                                              │ master      │             #16987                │             #16965                  │
                                                                              │  allocs/op  │  allocs/op   vs base              │  allocs/op    vs base               │
EvaluateBaselinesAndPersistResult/100_processes_2_containers-8                  10.36k ± 0%   10.39k ± 0%  +0.29% (p=0.002 n=6)    15.59k ± 0%  +50.48% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/500_processes_3_containers-8                  38.40k ± 0%   38.41k ± 0%  +0.03% (p=0.002 n=6)    57.99k ± 0%  +51.03% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/1000_processes_5_containers-8                 73.47k ± 0%   73.45k ± 0%  -0.02% (p=0.002 n=6)   108.68k ± 0%  +47.93% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/1000_processes_5_containers_large_args-8      73.45k ± 0%   73.44k ± 0%  -0.01% (p=0.002 n=6)   108.72k ± 0%  +48.02% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/2000_processes_10_containers-8                143.4k ± 0%   143.4k ± 0%       ~ (p=0.065 n=6)    208.6k ± 0%  +45.51% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/10000_processes_20_containers-8               703.5k ± 0%   703.5k ± 0%  -0.01% (p=0.002 n=6)   1014.9k ± 0%  +44.25% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/25000_processes_50_containers-8               1.754M ± 0%   1.754M ± 0%  -0.00% (p=0.002 n=6)    2.515M ± 0%  +43.42% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/50000_processes_100_containers-8              3.504M ± 0%   3.504M ± 0%  -0.00% (p=0.009 n=6)    5.015M ± 0%  +43.14% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/50000_processes_100_containers_large_args-8   3.504M ± 0%   3.504M ± 0%       ~ (p=0.240 n=6)    5.015M ± 0%  +43.14% (p=0.002 n=6)
geomean                                                                         255.6k        255.7k       +0.03%                  374.0k       +46.29%

@dashrews78 dashrews78 force-pushed the dashrews/evaluate-baseline-query-needed-30963 branch from a4720d0 to 135af34 on October 1, 2025 10:06
@dashrews78
Contributor Author

/test ocp-4-18-nongroovy-e2e-tests

@dashrews78 dashrews78 merged commit 9c4cae0 into master Oct 1, 2025
97 of 98 checks passed
@dashrews78 dashrews78 deleted the dashrews/evaluate-baseline-query-needed-30963 branch October 1, 2025 16:25
@danekantner

Has anyone that has upgraded to 4.9.x from 4.8 or earlier seen issues related to database timeouts after this change? My tenant has become unusable -- littered with timeout errors from STATEMENT: select process_indicators.Id::text as process_id, process_indicators.ContainerName as container_name, process_indicators.Signal_ExecFilePath as process_path, process_indicators.ContainerStartTime as process_container_start_time, process_indicators.Signal_Time as process_creation_time, process_indicators.Signal_Name as process_name, process_indicators.Signal_Args as process_arguments from process_indicators where process_indicators.DeploymentId = $1

central / central-db starts up fine, but then minutes in everything goes haywire: those process indicator queries start timing out, and then nothing works in the console after it goes into this loop of timeouts. I have tried tuning the central database back to defaults and gently upwards, and it is sitting now at the upper limits of what a 16 CPU GKE node can handle but still timing out. Is there any tuning that can/should be done on the process indicators itself? Could changing the required review period to a much lower time help, or anything along those lines?

@dashrews78
Contributor Author

> Has anyone that has upgraded to 4.9.x from 4.8 or earlier seen issues related to database timeouts after this change? My tenant has become unusable -- littered with timeout errors from STATEMENT: select process_indicators.Id::text as process_id, process_indicators.ContainerName as container_name, process_indicators.Signal_ExecFilePath as process_path, process_indicators.ContainerStartTime as process_container_start_time, process_indicators.Signal_Time as process_creation_time, process_indicators.Signal_Name as process_name, process_indicators.Signal_Args as process_arguments from process_indicators where process_indicators.DeploymentId = $1
>
> central / central-db starts up fine but then minutes in everything goes haywire and starts timing out those process indicator queries, and then nothing works in the console after it goes into this loop of timeouts. I have tried tuning the central database back to defaults and gently upwards, and it is sitting now at the upper limits of what a 16 CPU GKE node can handle but still timing out. Is there any tuning that can/should be done on the process indicators itself? Could changing the required review period to a much lower time help, or anything along those lines?

@danekantner We have not seen any occurrences of that to my knowledge. That query should be using the index and should be fast unless there is some other action backing up the database or there are just too many records. Far faster than what it was doing before when it was selecting serialized. Suggestions:

  • Were you seeing process indicator timeouts before in 4.8 on
select serialized from process_indicators where process_indicators.DeploymentId = $1?
  • Run an explain plan on that query.
  • Watch the memory and CPU profiles on both the database and central and see whether either is spiking.
  • In 4.10 the indexes are going to be changed to btree, so you could try switching the 2 hash indexes on the process_indicators table to btree.
  • Check the max number of connections on the database and the number of currently active connections by querying select * from pg_stat_activity;. There are some connection issues we solved in the upcoming 4.9.3 and 4.10 releases that could also be at play.

All those are just guesses based on the information you supplied.
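Those checks can be sketched in SQL. This is illustrative only: verify the actual index definition with `\d process_indicators` first, and note that the btree change is only the stated 4.10 default, not a supported migration path.

```sql
-- Inspect the existing indexes first (psql meta-command):
-- \d process_indicators

-- If processindicators_deploymentid turns out to be a hash index,
-- it could be rebuilt as btree (the 4.10 default):
DROP INDEX IF EXISTS processindicators_deploymentid;
CREATE INDEX processindicators_deploymentid
    ON public.process_indicators USING btree (deploymentid);

-- Check connection saturation against the configured ceiling:
SHOW max_connections;
SELECT state, count(*) FROM pg_stat_activity GROUP BY state;
```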

@danekantner

What I am seeing is that whatever process handles the process_indicators dumps about 80-90 of these queries simultaneously onto central-db. If I select from pg_stat_activity repeatedly to view activity, the pids change but the count remains relatively similar. On the central side these show as context deadline timeouts (Error: Couldn't evaluate process baseline: iterating over process indicators for deployment inf-prd-env-cluster/stackdriver-exporter/algos-cms-crc: retry timer is expired: timeout: context deadline exceeded), and in the database log they show as ERROR: canceling statement due to user request -- the timeout seems to be around 130s, and I'm not clear where that originates or whether I can alter it. I have the external db config timeout set to 20 minutes, and the database itself seems to have no value set; when I query it, it shows the default 0.

SELECT pid, query, state, now() - query_start AS duration
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY duration DESC;
  pid   | query (abbreviated)                                  | state  | duration
--------+------------------------------------------------------+--------+-----------------
 106642 | select process_indicators.Id::text as process_id, ... | active | 00:00:42.17414
 106629 | select process_indicators.Id::text as process_id, ... | active | 00:00:42.127686
... (90 rows)

I will see if I can optimize other postgres configurations to help this along too

@dashrews78
Contributor Author

@danekantner So a couple possibilities. It could be that #17126, in how it changed the logic, is what is actually causing this, but it is hard to say.

I would really be curious what it looks like when 4.9.3 drops as we backported a bunch of postgres connection fixes

What is your max_connections count, and what are the pool sizes on the database connection? We have seen issues frequently in the last few releases where connections hit the max, other calls wait on a connection, and wind up timing out.

You could also tinker with ROX_POSTGRES_DEFAULT_TIMEOUT, which defaults to 60 seconds, to see if that makes a difference. But I'm fairly confident that 4.9.3 or 4.10 will help a lot, or at least help us narrow it down.

Some logs or a diagnostic bundle may be useful, but this is probably not the forum to share those. We did have a case recently where something unrelated was not properly rolling back a transaction, causing transactions to get stuck. Based on what you said about pg_stat_activity, that is probably not the case here, but it is something we would look for in the logs.

@danekantner

danekantner commented Feb 3, 2026

@dashrews78 thank you -- I believe I have now gotten it stable enough that I won't need to roll back to 4.8. I had just started testing an increased ROX_POSTGRES_DEFAULT_TIMEOUT earlier today; that alone did not fix it, but it let a lot more of the queries actually finish. The secondary problem became the default 90-connection pool limit. I increased that to 175, and immediately saw about 175 connections from this process again, which seems expected, because it is iterating through 4400 deployments overall. Of those, about half seem to process pretty quickly, and others still seem to take maybe 5-6 minutes. This in turn allows the other queries central is trying to start to go through and process. It does seem like an independently configured waitgroup count or worker pool specifically for these baseline queries could benefit the overall ecosystem.

overall fixes were:

  • increased the memory upper limit on central-db (and central) to 44Gi; the CPU upper limit of 8 corresponds with the default postgres worker count
  • in the db-configmap postgresql.conf, increased shared_buffers = 10GB and wal_buffers = 32MB
  • increased pool_max_conns=175 in the external-db-configmap
  • set ROX_POSTGRES_DEFAULT_TIMEOUT='15m'

@dashrews78
Contributor Author

Yeah, once I saw your first message on this I was thinking we probably want to figure out a way to throttle those requests (I will file an internal ticket to do so). I suspect that the ones that take a while are waiting on connections.

@danekantner

Thanks again for your help in troubleshooting this and suggesting to check the database -- just circling back to update here. I think the pool size issues stemmed from a bad index (shown by \d process_indicators): after running REINDEX INDEX processindicators_deploymentid;, the process_indicators query that previously took 2.5 minutes now takes 300 ms. That was essentially why they were so stacked up.

(also in the UX previously, the Risk tab was showing Unknown Deployment / 'the selected deployment may have been removed' -- which also related to this whole problem of the query not returning fast)
