ROX-30135: Send baselines to sensor when deployment leaves observation #16077
JoukoVirtanen wants to merge 30 commits into master
Conversation
Skipping CI for Draft Pull Request.

Caution: There are some errors in your PipelineRun template.
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@ Coverage Diff @@
##           master   #16077      +/- ##
==========================================
+ Coverage   49.06%   49.07%   +0.01%
==========================================
  Files        2645     2645
  Lines      196451   196512      +61
==========================================
+ Hits        96382    96439      +57
- Misses      92517    92520       +3
- Partials     7552     7553       +1
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Images are ready for the commit at 6c97e89. To use with deploy scripts, first
```go
if !exists {
	_, err = m.baselines.UpsertProcessBaseline(lifecycleMgrCtx, key, elements, true, true)
```

```go
if exists {
```
Previously this came at the end. Before, ReprocessRiskForDeployments was called after a check that `exists` was false; if `exists` was false the function would return before this point could be reached. ReprocessRiskForDeployments is only run if the baseline was user or StackRox locked. Now the baseline will always be user locked, but I don't think we always want to call ReprocessRiskForDeployments. I think we only want to call it if the process baseline already exists and is user or StackRox locked. Therefore the locked check is done before the baseline is locked later in the function.
This needs to be changed so that it only returns if the baseline is already user locked.
I don't really understand why auto locking requires all this refactoring and logic change. Couldn't you just create a function like `processbaseline.IsAutoLocked(baseline)` and do logic on that where the existing logic is done?
I think sending the baseline should be separate from the processing of the indicator. Perhaps a queue or some other process should handle that, so we can leave this logic and the handling of the indicators alone.
> Couldn't you just create a function like processbaseline.IsAutoLocked(baseline) and do logic on that where the existing logic is done?
Where would the baseline here come from? What existing logic are you referring to? Where would this be called? Would IsAutoLocked require a change to the protobuf definition for baselines or would it use the existing fields in it?
> I think sending the baseline should be separate from the processing of the indicator.
What indicator are you referring to? The changes I am making apply to what happens when a deployment leaves the observation period and is removed from the queue of deployments in observation. A part of that is querying the database for processes in the deployment and adding those to the baseline. Is that what you mean?
I believe that a deployment is added to the queue when the first process within the deployment is run. Is that the indicator you are referring to?
> Perhaps a queue or some other process should handle that so we can leave this logic and the handling of the indicators alone.
You also had a similar comment in Slack:

> Couldn't we start a queue or something that handles sending the baselines to sensor when the lock time expires?
The lock time expires when the deployment leaves the observation window, and the existing queue handles those cases; that is why I did not see a need for a separate queue. Would it be better if I had two versions of checkAndUpdateBaseline, one for when autolocking is used and another for when it is not?

Are you proposing that whenever the datastore adds a baseline to the database, it is also added to a queue, and the queue checks whether the stack_rox_lock timestamp is in the past and sends the baseline to sensor? That might not be as simple as it first seems. There are many places that upsert process baselines into the database, and all of them would need to use a common function that both adds to the queue and upserts to the database. In some cases a baseline might need to be added somewhere other than the end of the queue, in which case a queue is not the right data structure; I am not sure if there are such cases. In some cases elements will need to be removed from the queue. In the future an auto_lock_timestamp could be used instead of stack_rox_lock_timestamp.

What would the queue consist of? When I first read your comments I thought you meant another queue of deployments, but now I think you mean a queue of baselines.
> I don't really understand why auto locking requires all this refactoring and logic change.
If I understand how you want the second queue to work, then the refactoring is not needed, but there are advantages to it. E.g. UpsertProcessBaseline does not actually take a process baseline as input, as the name suggests. Instead it takes a process baseline key and a list of process baseline elements, creates a process baseline from them, and then upserts it. I believe it is better to have one function that takes a key and a list of elements and returns a baseline, and another that upserts it. Those two functions could then be called from a renamed UpsertProcessBaseline if desired; I have not actually done that to UpsertProcessBaseline.
I didn't plan on doing this refactoring when I started work on this PR, but when I realized that checkAndUpdateBaseline didn't actually have access to the process baselines being upserted into the database, I felt the refactoring was necessary in order to send the baselines to sensor.
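The split described above can be sketched as follows; the function names, stand-in types, and map-backed "datastore" are illustrative assumptions, not the PR's actual code:

```go
package main

import "fmt"

// Minimal stand-ins for the storage protobuf types.
type ProcessBaselineKey struct {
	DeploymentID  string
	ContainerName string
}

type BaselineElement struct {
	ProcessName string
}

type ProcessBaseline struct {
	Key      ProcessBaselineKey
	Elements []BaselineElement
}

// buildBaseline is the first half of the split: construct the baseline
// from a key and elements without touching the database. (The PR's actual
// helper is named BaselineFromKeysItemsAndExistingBaseline.)
func buildBaseline(key ProcessBaselineKey, elements []BaselineElement) *ProcessBaseline {
	return &ProcessBaseline{Key: key, Elements: elements}
}

// upsertBaseline is the second half: persist an already-built baseline.
// A real implementation would write to the datastore; this sketch records
// it in a map keyed by deployment ID.
func upsertBaseline(db map[string]*ProcessBaseline, b *ProcessBaseline) *ProcessBaseline {
	db[b.Key.DeploymentID] = b
	return b
}

func main() {
	db := map[string]*ProcessBaseline{}
	b := buildBaseline(
		ProcessBaselineKey{DeploymentID: "dep-1", ContainerName: "app"},
		[]BaselineElement{{ProcessName: "/bin/sh"}},
	)
	upsertBaseline(db, b)
	// The caller now holds the same baseline that was persisted, so it can
	// also be sent to sensor, which the old signature did not allow.
	fmt.Println(db["dep-1"].Elements[0].ProcessName) // prints /bin/sh
}
```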
- I had my queues mixed up when I made this comment, so you are correct: the observation queue is already the second queue.
- I still think the refactor is excessive. Wouldn't it be simpler to pass the baseline back up the call chain and send to sensor from ?
And at the crux of it, I don't think we can auto lock anything really until we understand the impact.
> I still think the refactor is excessive. Wouldn't it be simpler to pass the baseline back up the call chain and send to sensor from
Yes. I will do that. I think the datastore for process baselines does need to be refactored, but that can be done separately from this work. I don't think I did a great job refactoring in this PR and it can be done better.
> And at the crux of it, I don't think we can auto lock anything really until we understand the impact.
I will let this PR be dormant for now and focus on testing.
```go
auto := true
rox_lock := true
user_lock := features.AutolockAllProcessBaselines.Enabled()
baseline, err = processBaselinePkg.BaselineFromKeysItemsAndExistingBaseline(baseline, key, elements, auto, rox_lock, user_lock)
```
Previously the baseline that was inserted into the database was not available here. Some refactoring has been done to make it available, so that it can be sent to sensor from this point.
This is outdated.
```go
	},
}

err = m.connectionManager.SendMessage(baseline.GetKey().GetClusterId(), msg)
```
It might be better to have some common place where these types of messages are sent from, rather than sending them from both the service and here. I don't know what all of the pros and cons would be.
I think the service still needs to send them to sensor. But the service already calls this manager, so you could add a function here to do it and have both the manager and the service call it, if that is what you are looking for.
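The shared helper suggested here could look roughly like this sketch; the helper name, the interface, and the fake manager are assumptions standing in for the real connection manager and `central.MsgToSensor`:

```go
package main

import "fmt"

// MsgToSensor is a stand-in for the real central.MsgToSensor protobuf.
type MsgToSensor struct {
	Payload string
}

// ConnectionManager captures the one method this sketch needs from the
// real connection manager.
type ConnectionManager interface {
	SendMessage(clusterID string, msg *MsgToSensor) error
}

// fakeManager records sent messages so the sketch is self-contained.
type fakeManager struct {
	sent []string
}

func (f *fakeManager) SendMessage(clusterID string, msg *MsgToSensor) error {
	f.sent = append(f.sent, clusterID+":"+msg.Payload)
	return nil
}

// sendBaselineToSensor is the hypothetical shared helper: one place that
// both the lifecycle manager and the service call, so the send logic is
// not duplicated between them.
func sendBaselineToSensor(cm ConnectionManager, clusterID string, msg *MsgToSensor) error {
	return cm.SendMessage(clusterID, msg)
}

func main() {
	cm := &fakeManager{}
	_ = sendBaselineToSensor(cm, "cluster-1", &MsgToSensor{Payload: "baseline-sync"})
	fmt.Println(cm.sent[0]) // prints cluster-1:baseline-sync
}
```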
```go
	genDuration = env.BaselineGenerationDuration.DurationSetting()
)

func KeyToID(key *storage.ProcessBaselineKey) (string, error) {
```
Moving this was not strictly necessary, but since I was moving other code here, it seemed to make sense to move this as well. Having it here means we do not need to import the datastore. I think it made sense to move the things that did not actually interact with the database from the datastore to here.
IMO, from a reviewers perspective, unnecessary refactors are best done in a separate PR. They distract from the rest of the review and create changes that need to be digested and understood in context with the changes being made. They make the blast radius larger and the review more expensive.
I will not do the refactoring for this ticket. I have created a ticket for the refactoring: https://issues.redhat.com/browse/ROX-30341
```go
roxBaseline := processbaseline.IsRoxLocked(baseline) && hasNonStartupProcess
if userBaseline || roxBaseline {
	// We already checked if it's in the baseline and it is not, so reprocess risk to mark the results are suspicious if necessary
	m.reprocessor.ReprocessRiskForDeployments(baselineKey.deploymentID)
```
I suspect we would still need to do this for auto lock.
Force-pushed from 01ac856 to dbf4f87.
```go
	return false, err
}

inObservation := m.deploymentObservationQueue.InObservation(key.GetDeploymentId())
```
There are now more checks for the deployment being in observation, so the result is saved to a new variable.
```diff
 	elements = append(elements, insertableElement)
 }
-if len(elements) == 0 {
+if len(elements) == 0 && (inObservation || !features.AutolockAllProcessBaselines.Enabled() || processbaseline.IsUserLocked(baseline)) {
```
Without this change, or something like it, a process baseline leaving the observation period would not get locked if it had previously been created (for example, by looking at the processes for a deployment in "Risk") and no new processes were launched between when the baseline was created and when it left the observation period.
This is a lot. Can you extract that check into a well-named function or something, for readability?
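One way to extract the compound condition, as requested, is a named predicate over the three inputs; the function and struct names below are hypothetical, and the stand-in fields mirror `inObservation`, `features.AutolockAllProcessBaselines.Enabled()`, and `processbaseline.IsUserLocked(baseline)`:

```go
package main

import "fmt"

// baselineState bundles the values the real check reads.
type baselineState struct {
	inObservation   bool // deployment is still in the observation window
	autolockEnabled bool // features.AutolockAllProcessBaselines.Enabled()
	userLocked      bool // processbaseline.IsUserLocked(baseline)
}

// shouldSkipEmptyBaselineUpdate is a hypothetical name for the extracted
// condition: when there are no new elements, the update can be skipped
// unless autolocking still needs to lock a baseline that has left
// observation and is not yet user locked.
func shouldSkipEmptyBaselineUpdate(s baselineState) bool {
	return s.inObservation || !s.autolockEnabled || s.userLocked
}

func main() {
	// Deployment left observation, autolock on, not yet user locked:
	// the empty update must still run so the baseline gets locked.
	fmt.Println(shouldSkipEmptyBaselineUpdate(baselineState{
		inObservation: false, autolockEnabled: true, userLocked: false,
	})) // prints false: do not skip
}
```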
```diff
 if !exists {
-	_, err = m.baselines.UpsertProcessBaseline(lifecycleMgrCtx, key, elements, true, true)
+	userLocked := features.AutolockAllProcessBaselines.Enabled() && !inObservation
+	upsertedBaseline, err := m.baselines.UpsertProcessBaseline(lifecycleMgrCtx, key, elements, true, true, userLocked)
```
UpsertProcessBaseline now needs to take in whether the baseline should be user locked or not.
You need to do something with the err.
```go
if userBaseline || roxBaseline {
	// We already checked if it's in the baseline and it is not, so reprocess risk to mark the results are suspicious if necessary
	m.reprocessor.ReprocessRiskForDeployments(baselineKey.deploymentID)
	if !features.AutolockAllProcessBaselines.Enabled() {
```
This section could be refactored, but perhaps this is now the simplest way to make it clear that the behavior here does not change when the feature flag is disabled.
This is very difficult to follow, with a lot of repetition. I would suggest that you do refactor this area. Most importantly, I would recommend some comments around things like `if userLocked || !roxBaseline`, as it is not obvious why that case matters so much, given that userLocked generally takes precedence.
This has been refactored and comments have been added. Previously, if the baseline was neither user nor StackRox locked it would be updated; the same behavior is kept here.
```go
	// We already checked if it's in the baseline and it is not, so reprocess risk to mark the results are suspicious if necessary
	m.reprocessor.ReprocessRiskForDeployments(baselineKey.deploymentID)
}
if !userBaseline {
```
We may need to user lock the process baseline if it is only StackRox locked.
```diff
 }
 
-func (ds *datastoreImpl) updateProcessBaselineElements(ctx context.Context, baseline *storage.ProcessBaseline, addElements []*storage.BaselineItem, removeElements []*storage.BaselineItem, auto bool) (*storage.ProcessBaseline, error) {
+func (ds *datastoreImpl) updateProcessBaselineElements(ctx context.Context, baseline *storage.ProcessBaseline, addElements []*storage.BaselineItem, removeElements []*storage.BaselineItem, auto bool, locked bool) (*storage.ProcessBaseline, error) {
```
updateProcessBaselineElements will need to user lock the process baseline when the deployment leaves the observation period.
```diff
 }
 
-func (ds *datastoreImpl) UpdateProcessBaselineElements(ctx context.Context, key *storage.ProcessBaselineKey, addElements []*storage.BaselineItem, removeElements []*storage.BaselineItem, auto bool) (*storage.ProcessBaseline, error) {
+func (ds *datastoreImpl) UpdateProcessBaselineElements(ctx context.Context, key *storage.ProcessBaselineKey, addElements []*storage.BaselineItem, removeElements []*storage.BaselineItem, auto bool, locked bool) (*storage.ProcessBaseline, error) {
```
Similar comment as above.
```diff
 }
 
-func (ds *datastoreImpl) UpsertProcessBaseline(ctx context.Context, key *storage.ProcessBaselineKey, addElements []*storage.BaselineItem, auto bool, lock bool) (*storage.ProcessBaseline, error) {
+func (ds *datastoreImpl) UpsertProcessBaseline(ctx context.Context, key *storage.ProcessBaselineKey, addElements []*storage.BaselineItem, auto bool, lock bool, userLock bool) (*storage.ProcessBaseline, error) {
```
Similar comment as above.
Co-authored-by: David Shrewsberry <99685630+dashrews78@users.noreply.github.com>
… is passed to .updateProcessBaselineElements even if the baseline already exists
Force-pushed from 8575a13 to 6c97e89.
Replaced by #16564
Description
When a deployment leaves the observation window, a "user" locked process baseline is created for it automatically, persisted in the database, and sent to sensor. This change is behind a feature flag and is done in the detection lifecycle manager. Previously, the detection lifecycle manager created a StackRox-locked baseline and persisted it in the database without sending it to sensor.
There are two PRs built on top of this PR:

- Add process baseline autolocking to cluster config: #16427
- Configure process baseline auto locking via helm: #16462
User-facing documentation
Testing and quality
Automated testing
How I validated my change
Testing was not done with the latest version, so results might now be different.
Testing with autolocking enabled
1. Set the following environment variables
2. Deployed ACS.
3. Created a pod that could be used to run some processes and entered it.
4. Logged into the UI and checked "Risk". Initially the process baseline is unlocked.
5. Ran the following command inside the pod
6. Checked the process baseline after 5 minutes. Note that the processes are locked without any action having been taken.
7. Ran a new process which is not in the baseline.
8. Checked Violations; the anomalous process is there.
9. Checked the process baselines for other deployments as well; they were locked too.
Testing with locking process baseline before auto-lock
1. Set the following environment variables
2. Deployed ACS.
3. Created a pod that could be used to run some processes and entered it.
4. Logged into the UI and checked "Risk".
5. Locked the process baseline.
6. Ran an anomalous process. The anomalous process is flagged in "Process Discovery".
7. Unlocked the process baseline.
8. Ran another process. The new process is not considered to be anomalous, but the process that was run while the process baseline was locked shows up in violations.
Testing with auto-locking disabled
Did the same steps except the feature flag was disabled.

1. Deployed ACS.
2. Created a pod that could be used to run some processes and entered it.
3. Logged into the UI and checked "Risk". Just as before, the process baseline was unlocked at the beginning.
4. Ran the following command inside the pod
5. The process baseline was still unlocked after more than 5 minutes.
6. Ran a new process which is not in the baseline.
7. None of the process baselines were locked and there was no alerting for process baseline violations.
Testing with locking process baseline before observation period ends with feature disabled
Repeated the steps in "Testing with locking process baseline before auto-lock", except the feature flag was disabled. The results were the same.