
ROX-30963: Only query indicator fields needed for evaluation #16965

Merged
dashrews78 merged 7 commits into master from dashrews/evaluate-baseline-query-needed-30963
Oct 1, 2025

Conversation

@dashrews78
Contributor

@dashrews78 dashrews78 commented Sep 22, 2025

Description

This PR strips down the data pulled from the database. It uses our "view"-like mechanism to run SELECT queries that retrieve only the six or so fields required for re-evaluation. Notice in the explain plans below that the width of each returned row is cut in half, which will save memory on both central and central-db as well as traffic between the two. The benchmark results below can also be compared to those in #16932: the speed is very slightly slower, but the memory allocations are cut in half with this PR.

 go test -tags sql_integration -bench=. -benchmem -run=^$
goos: darwin
goarch: arm64
pkg: github.com/stackrox/rox/central/processbaseline/evaluator
cpu: Apple M3 Pro
BenchmarkEvaluateBaselinesAndPersistResult/100_processes_2_containers-12         	    3396	    317166 ns/op	  154897 B/op	    3761 allocs/op
BenchmarkEvaluateBaselinesAndPersistResult/500_processes_3_containers-12         	    1332	   1200867 ns/op	  745970 B/op	   19772 allocs/op
BenchmarkEvaluateBaselinesAndPersistResult/1000_processes_5_containers-12        	     471	   3253969 ns/op	 1902272 B/op	   50981 allocs/op
BenchmarkEvaluateBaselinesAndPersistResult/1000_processes_5_containers_large_args-12         	     254	   5629626 ns/op	 4639233 B/op	   82186 allocs/op
BenchmarkEvaluateBaselinesAndPersistResult/2000_processes_10_containers-12                   	     150	   9233982 ns/op	 6856404 B/op	  143391 allocs/op
BenchmarkEvaluateBaselinesAndPersistResult/10000_processes_20_containers-12                  	      80	  22013626 ns/op	14314797 B/op	  345402 allocs/op
BenchmarkEvaluateBaselinesAndPersistResult/25000_processes_50_containers-12                  	      32	  52227600 ns/op	32828808 B/op	  847411 allocs/op
BenchmarkEvaluateBaselinesAndPersistResult/50000_processes_100_containers-12                 	      13	 116390516 ns/op	68937164 B/op	 1849421 allocs/op
BenchmarkEvaluateBaselinesAndPersistResult/50000_processes_100_containers_large_args-12      	       4	 287536584 ns/op	165883168 B/op	 2851423 allocs/op
BenchmarkEvaluateBaselinesSmallScale/2500_processes_5_containers-12                          	     738	   1622809 ns/op	  973947 B/op	   26472 allocs/op
PASS
ok  	github.com/stackrox/rox/central/processbaseline/evaluator	28.193s

Explain plans -- the top one is the old query, the bottom one is the query from this PR. Both queries access the same rows, but you can see by the width that the new query returns far less data per row, which is friendlier to both central and central-db.

-----------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.process_indicators  (cost=4.19..102.48 rows=25 width=441) (actual time=0.020..0.080 rows=12 loops=1)
   Output: serialized
   Recheck Cond: (process_indicators.deploymentid = '4f95d656-8460-4fd0-8778-96d268ec791f'::uuid)
   Heap Blocks: exact=12
   Buffers: shared hit=14
   WAL: records=2 bytes=112
   ->  Bitmap Index Scan on processindicators_deploymentid  (cost=0.00..4.19 rows=25 width=0) (actual time=0.011..0.011 rows=12 loops=1)
         Index Cond: (process_indicators.deploymentid = '4f95d656-8460-4fd0-8778-96d268ec791f'::uuid)
         Buffers: shared hit=2
 Query Identifier: -2289536428206420033
 Planning:
   Buffers: shared hit=80
 Planning Time: 0.392 ms
 Execution Time: 0.097 ms
(14 rows)

central_active=# explain (analyze, verbose, buffers, costs, wal) select process_indicators.Id::text as process_id, process_indicators.ContainerName as container_name, process_indicators.Signal_ExecFilePath as process_path, process_indicators.ContainerStartTime as process_container_start_time, process_indicators.Signal_Time as process_creation_time, process_indicators.Signal_Name as process_name, process_indicators.Signal_Args as process_arguments from process_indicators where process_indicators.DeploymentId = '4f95d656-8460-4fd0-8778-96d268ec791f';
                                                               QUERY PLAN                                                                
-----------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.process_indicators  (cost=4.19..102.60 rows=25 width=149) (actual time=0.019..0.057 rows=12 loops=1)
   Output: (id)::text, containername, signal_execfilepath, containerstarttime, signal_time, signal_name, signal_args
   Recheck Cond: (process_indicators.deploymentid = '4f95d656-8460-4fd0-8778-96d268ec791f'::uuid)
   Heap Blocks: exact=12
   Buffers: shared hit=14
   WAL: records=1 bytes=56
   ->  Bitmap Index Scan on processindicators_deploymentid  (cost=0.00..4.19 rows=25 width=0) (actual time=0.010..0.010 rows=12 loops=1)
         Index Cond: (process_indicators.deploymentid = '4f95d656-8460-4fd0-8778-96d268ec791f'::uuid)
         Buffers: shared hit=2
 Query Identifier: -4387973057236564826
 Planning:
   Buffers: shared hit=97
 Planning Time: 0.324 ms
 Execution Time: 0.073 ms
(14 rows)

User-facing documentation

Testing and quality

  • the change is production ready: the change is GA, or otherwise the functionality is gated by a feature flag
  • CI results are inspected

Automated testing

  • added unit tests
  • added e2e tests
  • added regression tests
  • added compatibility tests
  • modified existing tests

How I validated my change

change me!

@dashrews78
Contributor Author

dashrews78 commented Sep 22, 2025

This change is part of the following stack:

Change managed by git-spice.

@openshift-ci

openshift-ci bot commented Sep 22, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@rhacs-bot
Contributor

rhacs-bot commented Sep 22, 2025

Images are ready for the commit at 135af34.

To use with deploy scripts, first export MAIN_IMAGE_TAG=4.9.x-940-g135af34b1e.

@dashrews78 dashrews78 force-pushed the dashrews/container-start-migration-30961 branch 2 times, most recently from 494fd3a to c4ccc51 on September 24, 2025 19:37
@dashrews78 dashrews78 force-pushed the dashrews/evaluate-baseline-query-needed-30963 branch from 50cf275 to 1db6ed5 on September 24, 2025 19:37
@dashrews78 dashrews78 requested a review from janisz on September 24, 2025 19:44
@dashrews78 dashrews78 force-pushed the dashrews/container-start-migration-30961 branch 2 times, most recently from 894aad5 to 3915aa5 on September 29, 2025 13:40
@dashrews78 dashrews78 changed the base branch from dashrews/container-start-migration-30961 to dashrews/container-start-migration-30961-sequential on September 29, 2025 18:12
@dashrews78 dashrews78 force-pushed the dashrews/evaluate-baseline-query-needed-30963 branch from 1db6ed5 to 06aa438 on September 29, 2025 18:12
@dashrews78 dashrews78 changed the base branch from dashrews/container-start-migration-30961-sequential to master on September 29, 2025 18:30
@dashrews78 dashrews78 force-pushed the dashrews/evaluate-baseline-query-needed-30963 branch from 06aa438 to 7b39116 on September 29, 2025 18:30
@dashrews78 dashrews78 marked this pull request as ready for review on September 29, 2025 18:30
Contributor

@sourcery-ai sourcery-ai bot left a comment

Hey there - I've reviewed your changes - here's some feedback:

  • In GetProcessIndicatorsRiskView you’re using WriteAllowed for SAC checks even though it’s a read-only operation—switch to ReadAllowed to accurately reflect permissions.
  • Avoid logging errors inside GetProcessIndicatorsRiskView; wrap and return the error instead so callers can decide how to report it without duplicate log entries.
  • You now have almost identical logic for storage.ProcessIndicator and views.ProcessIndicatorRiskView (e.g., BaselineItemFromProcess vs. BaselineItemFromProcessView); consider unifying these via a shared interface or helper to avoid drift.
## Individual Comments

### Comment 1
<location> `central/processbaseline/baselineutils.go:89` </location>
<code_context>
+// IsStartupProcess determines if the process is a startup process
+// A process is considered a startup process if it happens within the first ContainerStartupDuration and was not scraped
+// but instead pulled from exec
+func IsStartupProcessView(process *views.ProcessIndicatorRiskView) bool {
+	if process.ContainerStartTime == nil {
+		return false
</code_context>

<issue_to_address>
**issue (bug_risk):** IsStartupProcessView does not check for nil SignalTime.

A nil SignalTime can result in protoutils.Sub panicking or misbehaving. Please add a nil check for SignalTime as done for ContainerStartTime.
</issue_to_address>

### Comment 2
<location> `central/processindicator/datastore/datastore_impl.go:229` </location>
<code_context>
 }

+// GetProcessIndicatorsRiskView retrieves minimal fields from process indicator for risk evaluation
+func (ds *datastoreImpl) GetProcessIndicatorsRiskView(ctx context.Context, q *v1.Query) ([]*views.ProcessIndicatorRiskView, error) {
+	if ok, err := deploymentExtensionSAC.WriteAllowed(ctx); err != nil {
+		return nil, err
</code_context>

<issue_to_address>
**issue (complexity):** Consider extracting the SAC check and risk view query-selects into helper functions to simplify GetProcessIndicatorsRiskView.

We can collapse most of the plumbing in `GetProcessIndicatorsRiskView` by extracting the SAC check and the "risk view" query selects into small, reusable helpers. This keeps the method focused and avoids manual duplication.

1) In `central/processindicator/views/helpers.go`, define the select‐fields builder:

```go
package views

import (
  pkgSearch "github.com/stackrox/rox/pkg/search"
  v1 "github.com/stackrox/rox/generated/api/v1"
)

// BuildRiskViewQuery clones `q` and injects exactly the selects needed for risk view.
func BuildRiskViewQuery(q *v1.Query) *v1.Query {
  clone := q.CloneVT()
  clone.Selects = []*v1.QuerySelect{
    pkgSearch.NewQuerySelect(pkgSearch.ProcessID).Proto(),
    pkgSearch.NewQuerySelect(pkgSearch.ContainerName).Proto(),
    pkgSearch.NewQuerySelect(pkgSearch.ProcessExecPath).Proto(),
    pkgSearch.NewQuerySelect(pkgSearch.ProcessContainerStartTime).Proto(),
    pkgSearch.NewQuerySelect(pkgSearch.ProcessCreationTime).Proto(),
    pkgSearch.NewQuerySelect(pkgSearch.ProcessName).Proto(),
    pkgSearch.NewQuerySelect(pkgSearch.ProcessArguments).Proto(),
  }
  return clone
}
```

2) In your datastore file, factor out the SAC guard:

```go
func requireDeploymentExtensionWrite(ctx context.Context) error {
  ok, err := deploymentExtensionSAC.WriteAllowed(ctx)
  if err != nil {
    return err
  }
  if !ok {
    return sac.ErrResourceAccessDenied
  }
  return nil
}
```

3) Simplify `GetProcessIndicatorsRiskView`:

```go
func (ds *datastoreImpl) GetProcessIndicatorsRiskView(
  ctx context.Context,
  q *v1.Query,
) ([]*views.ProcessIndicatorRiskView, error) {
  if err := requireDeploymentExtensionWrite(ctx); err != nil {
    return nil, err
  }

  riskQ := views.BuildRiskViewQuery(q)
  return pgSearch.RunSelectRequestForSchema[views.ProcessIndicatorRiskView](
    ctx,
    ds.db,
    pkgSchema.ProcessIndicatorsSchema,
    riskQ,
  )
}
```

This keeps all functionality intact but removes inline plumbing, making the datastore method much easier to read and maintain.
</issue_to_address>


@dashrews78 dashrews78 force-pushed the dashrews/evaluate-baseline-query-needed-30963 branch from f03a3cd to 55f4df3 on September 30, 2025 09:51
@codecov

codecov bot commented Sep 30, 2025

Codecov Report

❌ Patch coverage is 33.33333% with 30 lines in your changes missing coverage. Please review.
✅ Project coverage is 48.78%. Comparing base (85aa9d2) to head (135af34).
⚠️ Report is 6 commits behind head on master.

Files with missing lines Patch % Lines
...ntral/processindicator/datastore/datastore_impl.go 0.00% 25 Missing ⚠️
central/processindicator/datastore/singleton.go 0.00% 4 Missing ⚠️
...entral/processbaseline/evaluator/evaluator_impl.go 87.50% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #16965      +/-   ##
==========================================
- Coverage   48.79%   48.78%   -0.02%     
==========================================
  Files        2711     2711              
  Lines      202284   202311      +27     
==========================================
- Hits        98712    98692      -20     
- Misses      95793    95834      +41     
- Partials     7779     7785       +6     
Flag Coverage Δ
go-unit-tests 48.78% <33.33%> (-0.02%) ⬇️


@janisz
Contributor

janisz commented Sep 30, 2025

Out of curiosity I ran the updated benchmark from #17078 on master, #16987, and this PR.
What we can observe: ideally we should combine the two approaches to get the best memory-reduction results:

                                                                              │   master       │            #16987                   │              #16965                  │
                                                                              │     sec/op     │    sec/op     vs base               │    sec/op      vs base               │
EvaluateBaselinesAndPersistResult/100_processes_2_containers-8                   3.234m ± 582%   2.982m ± 22%        ~ (p=0.240 n=6)    3.800m ± 52%        ~ (p=0.394 n=6)
EvaluateBaselinesAndPersistResult/500_processes_3_containers-8                   5.712m ±  12%   5.006m ± 25%        ~ (p=0.240 n=6)    6.997m ± 35%  +22.50% (p=0.009 n=6)
EvaluateBaselinesAndPersistResult/1000_processes_5_containers-8                 13.669m ±  37%   9.174m ± 39%  -32.88% (p=0.009 n=6)   11.470m ± 25%        ~ (p=0.093 n=6)
EvaluateBaselinesAndPersistResult/1000_processes_5_containers_large_args-8       13.96m ±   9%   11.48m ± 15%  -17.77% (p=0.009 n=6)    18.36m ± 25%  +31.53% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/2000_processes_10_containers-8                 15.64m ±  67%   15.07m ± 25%        ~ (p=0.394 n=6)    17.11m ± 14%        ~ (p=0.065 n=6)
EvaluateBaselinesAndPersistResult/10000_processes_20_containers-8                73.41m ±  57%   62.33m ± 19%  -15.09% (p=0.026 n=6)    86.49m ± 21%        ~ (p=0.065 n=6)
EvaluateBaselinesAndPersistResult/25000_processes_50_containers-8                176.7m ±   4%   145.7m ±  5%  -17.59% (p=0.002 n=6)    205.9m ± 16%  +16.49% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/50000_processes_100_containers-8               352.4m ±   7%   277.5m ±  2%  -21.26% (p=0.002 n=6)    413.4m ± 23%  +17.30% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/50000_processes_100_containers_large_args-8    501.9m ±  10%   466.5m ± 11%   -7.07% (p=0.026 n=6)    509.8m ± 25%        ~ (p=0.310 n=6)
geomean                                                                          36.89m          31.18m        -15.48%                  41.43m        +12.30%

                                                                              │   master       │            #16987                     │              #16965                   │
                                                                              │ max_heap_bytes │ max_heap_bytes  vs base               │ max_heap_bytes  vs base               │
EvaluateBaselinesAndPersistResult/100_processes_2_containers-8                     10.76M ± 1%     10.79M ±  0%        ~ (p=0.180 n=6)      10.27M ± 0%   -4.57% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/500_processes_3_containers-8                     14.75M ± 0%     14.72M ±  0%   -0.18% (p=0.015 n=6)      11.82M ± 0%  -19.88% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/1000_processes_5_containers-8                    19.54M ± 1%     19.26M ±  1%   -1.42% (p=0.026 n=6)      13.56M ± 0%  -30.62% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/1000_processes_5_containers_large_args-8         25.30M ± 0%     19.75M ±  3%  -21.91% (p=0.002 n=6)      18.87M ± 2%  -25.42% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/2000_processes_10_containers-8                   29.68M ± 0%     22.67M ± 14%  -23.62% (p=0.002 n=6)      17.09M ± 0%  -42.42% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/10000_processes_20_containers-8                 109.78M ± 1%     41.24M ± 56%  -62.43% (p=0.002 n=6)      40.25M ± 3%  -63.33% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/25000_processes_50_containers-8                 260.41M ± 1%     50.99M ± 44%  -80.42% (p=0.002 n=6)      88.32M ± 2%  -66.09% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/50000_processes_100_containers-8                509.47M ± 1%     80.57M ± 44%  -84.19% (p=0.002 n=6)     173.10M ± 4%  -66.02% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/50000_processes_100_containers_large_args-8      832.9M ± 0%     100.1M ± 69%  -87.98% (p=0.002 n=6)      470.3M ± 1%  -43.53% (p=0.002 n=6)
geomean                                                                            67.28M          30.56M        -54.58%                    37.69M       -43.98%

                                                                              │    master        │           #16987                        │               #16965                    │
                                                                              │ max_heap_objects │ max_heap_objects  vs base               │ max_heap_objects  vs base               │
EvaluateBaselinesAndPersistResult/100_processes_2_containers-8                       68.75k ± 1%       68.91k ±  0%        ~ (p=0.065 n=6)        73.46k ± 0%   +6.85% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/500_processes_3_containers-8                       96.87k ± 0%       96.99k ±  0%        ~ (p=0.240 n=6)       113.70k ± 0%  +17.38% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/1000_processes_5_containers-8                      130.3k ± 1%       128.5k ±  1%        ~ (p=0.065 n=6)        161.4k ± 0%  +23.92% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/1000_processes_5_containers_large_args-8           130.2k ± 0%       106.1k ±  1%  -18.53% (p=0.002 n=6)        160.2k ± 2%  +23.06% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/2000_processes_10_containers-8                     200.1k ± 0%       153.1k ± 15%  -23.46% (p=0.002 n=6)        255.5k ± 0%  +27.70% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/10000_processes_20_containers-8                    758.0k ± 0%       285.9k ± 50%  -62.29% (p=0.002 n=6)        874.0k ± 5%  +15.29% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/25000_processes_50_containers-8                   1807.4k ± 0%       355.3k ± 31%  -80.34% (p=0.002 n=6)       2096.2k ± 2%  +15.98% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/50000_processes_100_containers-8                  3557.0k ± 0%       546.7k ± 42%  -84.63% (p=0.002 n=6)       4257.1k ± 4%  +19.68% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/50000_processes_100_containers_large_args-8       3557.1k ± 0%       436.0k ± 54%  -87.74% (p=0.002 n=6)       4170.7k ± 2%  +17.25% (p=0.002 n=6)
geomean                                                                              418.3k            191.0k        -54.34%                      495.3k       +18.43%

                                                                              │  master       │            #16987                   │             #16965                  │
                                                                              │     B/op      │     B/op       vs base              │     B/op      vs base               │
EvaluateBaselinesAndPersistResult/100_processes_2_containers-8                  1166.6Ki ± 0%   1160.3Ki ± 0%  -0.54% (p=0.002 n=6)   680.4Ki ± 0%  -41.68% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/500_processes_3_containers-8                   4.914Mi ± 0%    4.852Mi ± 0%  -1.27% (p=0.002 n=6)   2.117Mi ± 0%  -56.92% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/1000_processes_5_containers-8                  9.687Mi ± 0%    9.462Mi ± 0%  -2.33% (p=0.002 n=6)   3.774Mi ± 0%  -61.04% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/1000_processes_5_containers_large_args-8      15.051Mi ± 0%   14.826Mi ± 1%  -1.49% (p=0.002 n=6)   8.958Mi ± 0%  -40.48% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/2000_processes_10_containers-8                19.228Mi ± 0%   18.675Mi ± 0%  -2.87% (p=0.002 n=6)   7.130Mi ± 0%  -62.92% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/10000_processes_20_containers-8                96.39Mi ± 0%    92.99Mi ± 0%  -3.52% (p=0.002 n=6)   34.68Mi ± 0%  -64.02% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/25000_processes_50_containers-8               241.36Mi ± 0%   232.97Mi ± 0%  -3.48% (p=0.002 n=6)   87.45Mi ± 0%  -63.77% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/50000_processes_100_containers-8               482.8Mi ± 0%    466.2Mi ± 0%  -3.44% (p=0.002 n=6)   175.7Mi ± 0%  -63.61% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/50000_processes_100_containers_large_args-8    791.4Mi ± 0%    772.7Mi ± 0%  -2.35% (p=0.002 n=6)   463.0Mi ± 0%  -41.50% (p=0.002 n=6)
geomean                                                                          37.29Mi         36.41Mi       -2.37%                 16.35Mi       -56.15%

                                                                              │ master      │             #16987                │             #16965                  │
                                                                              │  allocs/op  │  allocs/op   vs base              │  allocs/op    vs base               │
EvaluateBaselinesAndPersistResult/100_processes_2_containers-8                  10.36k ± 0%   10.39k ± 0%  +0.29% (p=0.002 n=6)    15.59k ± 0%  +50.48% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/500_processes_3_containers-8                  38.40k ± 0%   38.41k ± 0%  +0.03% (p=0.002 n=6)    57.99k ± 0%  +51.03% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/1000_processes_5_containers-8                 73.47k ± 0%   73.45k ± 0%  -0.02% (p=0.002 n=6)   108.68k ± 0%  +47.93% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/1000_processes_5_containers_large_args-8      73.45k ± 0%   73.44k ± 0%  -0.01% (p=0.002 n=6)   108.72k ± 0%  +48.02% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/2000_processes_10_containers-8                143.4k ± 0%   143.4k ± 0%       ~ (p=0.065 n=6)    208.6k ± 0%  +45.51% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/10000_processes_20_containers-8               703.5k ± 0%   703.5k ± 0%  -0.01% (p=0.002 n=6)   1014.9k ± 0%  +44.25% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/25000_processes_50_containers-8               1.754M ± 0%   1.754M ± 0%  -0.00% (p=0.002 n=6)    2.515M ± 0%  +43.42% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/50000_processes_100_containers-8              3.504M ± 0%   3.504M ± 0%  -0.00% (p=0.009 n=6)    5.015M ± 0%  +43.14% (p=0.002 n=6)
EvaluateBaselinesAndPersistResult/50000_processes_100_containers_large_args-8   3.504M ± 0%   3.504M ± 0%       ~ (p=0.240 n=6)    5.015M ± 0%  +43.14% (p=0.002 n=6)
geomean                                                                         255.6k        255.7k       +0.03%                  374.0k       +46.29%

@dashrews78 dashrews78 force-pushed the dashrews/evaluate-baseline-query-needed-30963 branch from a4720d0 to 135af34 on October 1, 2025 10:06
@dashrews78
Contributor Author

/test ocp-4-18-nongroovy-e2e-tests

@dashrews78 dashrews78 merged commit 9c4cae0 into master Oct 1, 2025
97 of 98 checks passed
@dashrews78 dashrews78 deleted the dashrews/evaluate-baseline-query-needed-30963 branch October 1, 2025 16:25
@danekantner

Has anyone that has upgraded to 4.9.x from 4.8 or earlier seen issues related to database timeouts after this change? My tenant has become unusable -- littered with timeout errors from STATEMENT: select process_indicators.Id::text as process_id, process_indicators.ContainerName as container_name, process_indicators.Signal_ExecFilePath as process_path, process_indicators.ContainerStartTime as process_container_start_time, process_indicators.Signal_Time as process_creation_time, process_indicators.Signal_Name as process_name, process_indicators.Signal_Args as process_arguments from process_indicators where process_indicators.DeploymentId = $1

central / central-db starts up fine, but then minutes in everything goes haywire: those process indicator queries start timing out, and then nothing works in the console after it goes into this loop of timeouts. I have tried tuning the central database back to defaults and gently upwards, and it is sitting now at the upper limits of what a 16 CPU GKE node can handle but still timing out. Is there any tuning that can/should be done on the process indicators itself? Could changing the required review period to a much lower time help, or anything along those lines?

@dashrews78
Contributor Author

> Has anyone that has upgraded to 4.9.x from 4.8 or earlier seen issues related to database timeouts after this change? My tenant has become unusable -- littered with timeout errors from STATEMENT: select process_indicators.Id::text as process_id, process_indicators.ContainerName as container_name, process_indicators.Signal_ExecFilePath as process_path, process_indicators.ContainerStartTime as process_container_start_time, process_indicators.Signal_Time as process_creation_time, process_indicators.Signal_Name as process_name, process_indicators.Signal_Args as process_arguments from process_indicators where process_indicators.DeploymentId = $1
>
> central / central-db starts up fine but then minutes in everything goes haywire and starts timing out those process indicator queries, and then nothing works in the console after it goes into this loop of timeouts. I have tried tuning the central database back to defaults and gently upwards, and it is sitting now at the upper limits of what a 16 CPU GKE node can handle but still timing out. Is there any tuning that can/should be done on the process indicators itself? Could changing the required review period to a much lower time help, or anything along those lines?

@danekantner We have not seen any occurrences of that to my knowledge. That query should be using the index and should be fast unless there is some other action backing up the database or there are just too many records. Far faster than what it was doing before when it was selecting serialized. Suggestions:

  • Were you seeing process indicator timeouts before in 4.8 on
select serialized from process_indicators where process_indicators.DeploymentId = $1?
  • Run an explain plan on that query.
  • Watch the memory and CPU profiles on both the database and central and see whether either is spiking.
  • In 4.10 the indexes are going to be changed to btree, so you could try switching the 2 hash indexes on the process_indicators table to btree.
  • Check the max number of connections on the database and the number of currently active connections by querying select * from pg_stat_activity;. There are some connection issues we solved in the upcoming 4.9.3 and 4.10 releases that could also be at play.

All those are just guesses based on the information you supplied.
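Those checks can be sketched in SQL. This is illustrative only: verify the actual index definition with `\d process_indicators` first, and note that the btree change is only the stated 4.10 default, not a supported migration path.

```sql
-- Inspect the existing indexes first (psql meta-command):
-- \d process_indicators

-- If processindicators_deploymentid turns out to be a hash index,
-- it could be rebuilt as btree (the 4.10 default):
DROP INDEX IF EXISTS processindicators_deploymentid;
CREATE INDEX processindicators_deploymentid
    ON public.process_indicators USING btree (deploymentid);

-- Check connection saturation against the configured ceiling:
SHOW max_connections;
SELECT state, count(*) FROM pg_stat_activity GROUP BY state;
```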

@danekantner

What I am seeing is that whatever process handles the process_indicators dumps about 80-90 of these queries simultaneously onto central-db. If I select from pg_stat_activity repeatedly to view activity, the pids change but the count remains relatively similar. On the central side these show as context deadline timeouts (Error: Couldn't evaluate process baseline: iterating over process indicators for deployment inf-prd-env-cluster/stackdriver-exporter/algos-cms-crc: retry timer is expired: timeout: context deadline exceeded), and in the database log they show as ERROR: canceling statement due to user request -- the timeout seems to be around 130s, and I'm not clear where that originates or whether I can alter it. I have the external db config timeout set to 20 minutes, and the database itself seems to have no value set; when I query it, it shows the default 0.

SELECT pid, query, state, now() - query_start AS duration
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY duration DESC;
  pid   | query (abbreviated)                                  | state  | duration
--------+------------------------------------------------------+--------+-----------------
 106642 | select process_indicators.Id::text as process_id, ... | active | 00:00:42.17414
 106629 | select process_indicators.Id::text as process_id, ... | active | 00:00:42.127686
... (90 rows)

I will see if I can optimize other postgres configurations to help this along too

@dashrews78
Contributor Author

@danekantner So a couple possibilities. It could be that #17126, in how it changed the logic, is what is actually causing this, but it is hard to say.

I would really be curious what it looks like when 4.9.3 drops as we backported a bunch of postgres connection fixes

What is your max_connections count, and what are the pool sizes on the database connection? We have seen issues frequently in the last few releases where connections hit the max, other calls wait on a connection, and wind up timing out.

You could also tinker with ROX_POSTGRES_DEFAULT_TIMEOUT, which defaults to 60 seconds, to see if that makes a difference. But I'm fairly confident that 4.9.3 or 4.10 will help a lot, or at least help us narrow it down.

Some logs or a diagnostic bundle may be useful, but this is probably not the forum to share those. We did have a case recently where something unrelated was not properly rolling back a transaction, causing transactions to get stuck. Based on what you said about pg_stat_activity, that is probably not the case here, but it is something we would look for in the logs.

@danekantner

danekantner commented Feb 3, 2026

@dashrews78 thank you -- I believe I have now gotten it stable enough that I won't need to roll back to 4.8. I had just started testing an increased ROX_POSTGRES_DEFAULT_TIMEOUT earlier today; that alone did not fix it, but it let a lot more of the queries actually finish. The secondary problem became the default 90-connection pool limit. I increased that to 175, and immediately saw about 175 connections from this process again, which seems expected, because it is iterating through 4400 deployments overall. Of those, about half seem to process pretty quickly, and others still seem to take maybe 5-6 minutes. This in turn allows the other queries central is trying to start to go through and process. It does seem like an independently configured waitgroup count or worker pool specifically for these baseline queries could benefit the overall ecosystem.

overall fixes were:

  • increased the memory upper limit on central-db (and central) to 44Gi; the CPU upper limit of 8 corresponds with the default postgres worker count
  • in the db-configmap postgresql.conf, increased shared_buffers = 10GB and wal_buffers = 32MB
  • increased pool_max_conns=175 in the external-db-configmap
  • set ROX_POSTGRES_DEFAULT_TIMEOUT='15m'

@dashrews78
Contributor Author

Yeah, once I saw your first message on this I was thinking we probably want to figure out a way to throttle those requests (I will file an internal ticket to do so). I suspect that the ones that take a while are waiting on connections.

@danekantner

Thanks again for your help in troubleshooting this and suggesting to check the database -- just circling back to update here. I think the pool size issues stemmed from a bad index (shown by \d process_indicators): after running REINDEX INDEX processindicators_deploymentid;, the process_indicators query that previously took 2.5 minutes now takes 300 ms. That was essentially why they were so stacked up.

(also in the UX previously, the Risk tab was showing Unknown Deployment / 'the selected deployment may have been removed' -- which also related to this whole problem of the query not returning fast)
