ROX-30963: Only query indicator fields needed for evaluation #16965
dashrews78 merged 7 commits into master
Conversation
This change is part of the following stack: Change managed by git-spice.

Skipping CI for Draft Pull Request.
Images are ready for the commit at 135af34. To use with deploy scripts, first |
Hey there - I've reviewed your changes - here's some feedback:
- In GetProcessIndicatorsRiskView you’re using WriteAllowed for SAC checks even though it’s a read-only operation—switch to ReadAllowed to accurately reflect permissions.
- Avoid logging errors inside GetProcessIndicatorsRiskView; wrap and return the error instead so callers can decide how to report it without duplicate log entries.
- You now have almost identical logic for storage.ProcessIndicator and views.ProcessIndicatorRiskView (e.g., BaselineItemFromProcess vs. BaselineItemFromProcessView); consider unifying these via a shared interface or helper to avoid drift.
## Individual Comments
### Comment 1
<location> `central/processbaseline/baselineutils.go:89` </location>
<code_context>
+// IsStartupProcess determines if the process is a startup process
+// A process is considered a startup process if it happens within the first ContainerStartupDuration and was not scraped
+// but instead pulled from exec
+func IsStartupProcessView(process *views.ProcessIndicatorRiskView) bool {
+ if process.ContainerStartTime == nil {
+ return false
</code_context>
<issue_to_address>
**issue (bug_risk):** IsStartupProcessView does not check for nil SignalTime.
A nil SignalTime can result in protoutils.Sub panicking or misbehaving. Please add a nil check for SignalTime as done for ContainerStartTime.
</issue_to_address>
### Comment 2
<location> `central/processindicator/datastore/datastore_impl.go:229` </location>
<code_context>
}
+// GetProcessIndicatorsRiskView retrieves minimal fields from process indicator for risk evaluation
+func (ds *datastoreImpl) GetProcessIndicatorsRiskView(ctx context.Context, q *v1.Query) ([]*views.ProcessIndicatorRiskView, error) {
+ if ok, err := deploymentExtensionSAC.WriteAllowed(ctx); err != nil {
+ return nil, err
</code_context>
<issue_to_address>
**issue (complexity):** Consider extracting the SAC check and risk view query-selects into helper functions to simplify GetProcessIndicatorsRiskView.
We can collapse most of the plumbing in `GetProcessIndicatorsRiskView` by extracting the SAC check and the "risk view" query selects into small, reusable helpers. This keeps the method focused and avoids manual duplication.
1) In `central/processindicator/views/helpers.go`, define the select‐fields builder:
```go
package views
import (
pkgSearch "github.com/stackrox/rox/pkg/search"
v1 "github.com/stackrox/rox/generated/api/v1"
)
// BuildRiskViewQuery clones `q` and injects exactly the selects needed for risk view.
func BuildRiskViewQuery(q *v1.Query) *v1.Query {
clone := q.CloneVT()
clone.Selects = []*v1.QuerySelect{
pkgSearch.NewQuerySelect(pkgSearch.ProcessID).Proto(),
pkgSearch.NewQuerySelect(pkgSearch.ContainerName).Proto(),
pkgSearch.NewQuerySelect(pkgSearch.ProcessExecPath).Proto(),
pkgSearch.NewQuerySelect(pkgSearch.ProcessContainerStartTime).Proto(),
pkgSearch.NewQuerySelect(pkgSearch.ProcessCreationTime).Proto(),
pkgSearch.NewQuerySelect(pkgSearch.ProcessName).Proto(),
pkgSearch.NewQuerySelect(pkgSearch.ProcessArguments).Proto(),
}
return clone
}
```
2) In your datastore file, factor out the SAC guard:
```go
func requireDeploymentExtensionWrite(ctx context.Context) error {
ok, err := deploymentExtensionSAC.WriteAllowed(ctx)
if err != nil {
return err
}
if !ok {
return sac.ErrResourceAccessDenied
}
return nil
}
```
3) Simplify `GetProcessIndicatorsRiskView`:
```go
func (ds *datastoreImpl) GetProcessIndicatorsRiskView(
ctx context.Context,
q *v1.Query,
) ([]*views.ProcessIndicatorRiskView, error) {
if err := requireDeploymentExtensionWrite(ctx); err != nil {
return nil, err
}
riskQ := views.BuildRiskViewQuery(q)
return pgSearch.RunSelectRequestForSchema[views.ProcessIndicatorRiskView](
ctx,
ds.db,
pkgSchema.ProcessIndicatorsSchema,
riskQ,
)
}
```
This keeps all functionality intact but removes inline plumbing, making the datastore method much easier to read and maintain.
</issue_to_address>
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master   #16965      +/-   ##
==========================================
- Coverage   48.79%   48.78%   -0.02%
==========================================
  Files        2711     2711
  Lines      202284   202311      +27
==========================================
- Hits        98712    98692      -20
- Misses      95793    95834      +41
- Partials     7779     7785       +6
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Out of curiosity, I ran the updated benchmark from #17078 on master, #16987, and this PR.
/test ocp-4-18-nongroovy-e2e-tests
Has anyone that has upgraded to 4.9.x from 4.8 or earlier seen issues related to database timeouts after this change? My tenant has become unusable, littered with timeout errors from central. central-db starts up fine, but minutes in everything goes haywire: those process indicator queries start timing out, and then nothing works in the console once it enters this loop of timeouts. I have tried tuning the central database back to defaults and then gently upwards, and it is now sitting at the upper limits of what a 16 CPU GKE node can handle but still timing out. Is there any tuning that can or should be done on the process indicators themselves? Could changing the required review period to a much lower time help, or anything along those lines?
@danekantner We have not seen any occurrences of that to my knowledge. That query should be using the index and should be fast, far faster than what it was doing before when it selected the serialized data, unless some other action is backing up the database or there are simply too many records. Suggestions:

All those are just guesses based on the information you supplied.
What I am seeing is that whatever process handles the process_indicators dumps about 80-90 of these queries simultaneously onto central-db. If I select from

I will see if I can optimize other postgres configurations to help this along too.
@danekantner So a couple of possibilities. It could be that #17126, in how it changed the logic, is what is actually causing this, but it is hard to say. I would really be curious what it looks like when 4.9.3 drops, as we backported a bunch of postgres connection fixes.

What is your max_connection count and the pools on the database connection? We have seen issues frequently in the last few releases where connections hit the max, and other calls wait on a connection and wind up timing out. You could also tinker with

Some logs or a diagnostic bundle may be useful, but this is probably not the forum to share those. We did have a case recently where something unrelated was not properly rolling back a transaction, causing transactions to get stuck. Based on what you said about pg_stat_activity, that is probably not the case, but it is something we would look for in the logs.
@dashrews78 thank you -- I was able to get it stable enough now, I believe, that I won't need to roll back to 4.8. I had just started testing an increased overall

fixes were:
Yeah, once I saw your first message on this I was thinking we probably want to figure out a way to throttle those requests (I will file an internal ticket to do so). I suspect that the ones that take a while are waiting on connections.
Thanks again for your help in troubleshooting this and suggesting to check the database. Just circling back to update here: I think the pool size issues stemmed from a bad index shown by

(Also, in the UX previously, the Risk tab was showing Unknown Deployment / "the selected deployment may have been removed", which also related to this whole problem of the query not returning fast.)
Description

Stripping the data pulled from the database. Using our "view"-like mechanism to run Select queries that retrieve only the 6 or so fields required for re-evaluation. Notice in the explain plans below that the width of each row returned is cut in half, which will save memory on both central and central-db as well as traffic between the two. The benchmark results below can also be compared to those in #16932: the speed is very slightly slower, but the memory allocations are cut in half with this PR.

Explain plans: the top one is the old query, the bottom one is the query from this PR. Both queries access the same rows, but you can see by the width that the new query returns far less data per row, which is friendlier to both central and central-db.
User-facing documentation
Testing and quality
Automated testing
How I validated my change
change me!