
ROX-30135: Send baselines to sensor when deployment leaves observation#16077

Closed
JoukoVirtanen wants to merge 30 commits into master from
jv-ROX-30135-send-baselines-to-sensor-when-deployment-leaves-observation

Conversation

Contributor

@JoukoVirtanen JoukoVirtanen commented Jul 17, 2025

Description

When a deployment leaves the observation window, a "user"-locked process baseline is created for it automatically, persisted in the database, and sent to sensor. This is done in the detection lifecycle manager and is gated behind a feature flag. Previously the detection lifecycle manager created a stackrox-locked baseline and persisted it in the database without sending it to sensor.

Two follow-up PRs build on this one:

Add process baseline autolocking to cluster config
#16427

Configure process baseline auto locking via helm
#16462

User-facing documentation

Testing and quality

  • the change is production ready: the change is GA, or otherwise the functionality is gated by a feature flag
  • CI results are inspected

Automated testing

  • added unit tests
  • added e2e tests
  • added regression tests
  • added compatibility tests
  • modified existing tests

How I validated my change

Testing was not done with the latest version, so results may differ now.

Testing with autolocking enabled

Set the following environment variables

export ROX_BASELINE_GENERATION_DURATION=5m
export ROX_AUTOLOCK_PROCESS_BASELINES=true

Deployed ACS.

Created a pod that could be used to run some processes and entered it.

kubectl run ubuntu-pod --image=ubuntu --restart=Never --command -- sleep infinity
kubectl exec ubuntu-pod -it -- /bin/bash

Logged into the UI and checked "Risk".

Screenshot From 2025-08-13 20-17-52 Screenshot From 2025-08-13 20-18-03

Initially the process baseline is unlocked.

The following command is run inside the pod

# cat /proc/1/net/tcp

The process baseline is checked after 5 minutes.

Screenshot From 2025-08-13 20-22-36

Note that the process baseline is locked without any user action having been taken.

A new process which is not in the baseline is run

# tac /proc/1/net/tcp
Screenshot From 2025-08-13 20-22-58 Screenshot From 2025-08-13 20-23-17 Screenshot From 2025-08-13 20-23-27

Violations are checked and the anomalous process is there.

The process baselines for other deployments were checked as well.

Screenshot From 2025-08-13 20-25-25

They were locked as well.

Testing with locking process baseline before auto-lock

Set the following environment variables

export ROX_BASELINE_GENERATION_DURATION=5m
export ROX_AUTOLOCK_PROCESS_BASELINES=true

Deployed ACS.

Created a pod that could be used to run some processes and entered it.

kubectl run ubuntu-pod --image=ubuntu --restart=Never --command -- sleep infinity
kubectl exec ubuntu-pod -it -- /bin/bash

Logged into the UI and checked "Risk".

Screenshot From 2025-08-26 22-10-52
# tac /proc/1/net/tcp
Screenshot From 2025-08-26 22-11-50 Screenshot From 2025-08-26 22-12-02

Locked the process baseline

Ran an anomalous process

# tac /proc/1/net/tcp
Screenshot From 2025-08-26 22-12-58

The anomalous process is flagged in "Process Discovery"

Screenshot From 2025-08-26 22-13-17

The process baseline is unlocked.

Another process is run

ls /proc/1/net/tcp
Screenshot From 2025-08-26 22-14-28

The new process is not considered to be anomalous.

Screenshot From 2025-08-26 22-16-43

The process that was run while the process baseline was locked shows up in violations.

Testing with auto-locking disabled

Performed the same steps, except with the feature flag disabled.

export ROX_BASELINE_GENERATION_DURATION=5m
export ROX_AUTOLOCK_PROCESS_BASELINES=false

Deployed ACS.

Created a pod that could be used to run some processes and entered it.

kubectl run ubuntu-pod --image=ubuntu --restart=Never --command -- sleep infinity
kubectl exec ubuntu-pod -it -- /bin/bash

Logged into the UI and checked "Risk".

Screenshot From 2025-08-14 07-44-47

Just as before the process baseline was unlocked at the beginning.

Screenshot From 2025-08-14 07-45-40

The following command is run inside the pod

# cat /proc/1/net/tcp
Screenshot From 2025-08-14 08-15-01

The process baseline was still unlocked after more than 5 minutes.

Screenshot From 2025-08-14 08-15-45

A new process which is not in the baseline is run

# tac /proc/1/net/tcp
Screenshot From 2025-08-14 08-15-53 Screenshot From 2025-08-14 08-16-08

None of the process baselines were locked and there was no alerting for process baseline violations.

Testing with locking process baseline before observation period ends with feature disabled

Repeated the steps in "Testing with locking process baseline before auto-lock", except the feature flag was disabled. The results were the same.


openshift-ci bot commented Jul 17, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@red-hat-konflux
Contributor

Caution

There are some errors in your PipelineRun template.

PipelineRun Error
quay-proxy no kind "ImageDigestMirrorSet" is registered for version "config.openshift.io/v1" in scheme "k8s.io/client-go/kubernetes/scheme/register.go:83"


codecov bot commented Jul 17, 2025

Codecov Report

❌ Patch coverage is 68.75000% with 30 lines in your changes missing coverage. Please review.
✅ Project coverage is 49.07%. Comparing base (9be56fb) to head (6c97e89).
⚠️ Report is 63 commits behind head on master.

Files with missing lines Patch % Lines
central/detection/lifecycle/manager_impl.go 78.94% 9 Missing and 3 partials ⚠️
central/processbaseline/service/service_impl.go 22.22% 6 Missing and 1 partial ⚠️
...entral/processbaseline/datastore/datastore_impl.go 73.68% 3 Missing and 2 partials ⚠️
central/detection/lifecycle/manager.go 0.00% 3 Missing ⚠️
central/alert/service/service_impl.go 0.00% 1 Missing ⚠️
central/detection/lifecycle/singleton.go 0.00% 1 Missing ⚠️
central/processbaseline/service/singleton.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #16077      +/-   ##
==========================================
+ Coverage   49.06%   49.07%   +0.01%     
==========================================
  Files        2645     2645              
  Lines      196451   196512      +61     
==========================================
+ Hits        96382    96439      +57     
- Misses      92517    92520       +3     
- Partials     7552     7553       +1     
Flag Coverage Δ
go-unit-tests 49.07% <68.75%> (+0.01%) ⬆️


Contributor

rhacs-bot commented Jul 17, 2025

Images are ready for the commit at 6c97e89.

To use with deploy scripts, first export MAIN_IMAGE_TAG=4.9.x-618-g6c97e8953a.

if !exists {
_, err = m.baselines.UpsertProcessBaseline(lifecycleMgrCtx, key, elements, true, true)

if exists {
Contributor Author

@JoukoVirtanen JoukoVirtanen Jul 17, 2025

Previously this came at the end. ReprocessRiskForDeployments was called after a check of exists; if exists was false the function returned before this could be reached. ReprocessRiskForDeployments was only run if the baseline was user or stackrox locked. Now the baseline will always be user locked, but I don't think we always want to call ReprocessRiskForDeployments; I think we only want to call it if the process baseline already exists and is user or stackrox locked. Therefore the locked check is done before the baseline is locked later in the function.

This needs to be changed so that this only returns if the baseline is already user locked.

Contributor

I don't really understand why auto locking requires all this refactoring and logic change. Couldn't you just create a function like processbaseline.IsAutoLocked(baseline) and do logic on that where the existing logic is done?

Contributor

I think sending the baseline should be separate from the processing of the indicator. Perhaps a queue or some other process should handle that so we can leave this logic and the handling of the indicators alone.

Contributor Author

Couldn't you just create a function like processbaseline.IsAutoLocked(baseline) and do logic on that where the existing logic is done?

Where would the baseline here come from? What existing logic are you referring to? Where would this be called? Would IsAutoLocked require a change to the protobuf definition for baselines or would it use the existing fields in it?

I think sending the baseline should be separate from the processing of the indicator.

What indicator are you referring to? The changes I am making apply to what happens when a deployment leaves the observation period and is removed from the queue of deployments in observation. A part of that is querying the database for processes in the deployment and adding those to the baseline. Is that what you mean?

I believe that a deployment is added to the queue when the first process within the deployment is run. Is that the indicator you are referring to?

Perhaps a queue or some other process should handle that so we can leave this logic and the handling of the indicators alone.

You also had a similar comment in slack.

Couldn't we start a queue or something that handles sending the baselines to sensor when the lock time expires?

The lock time expires when the deployment leaves the observation window, and the existing queue already handles those cases. That is why I did not see a need for a separate queue. Would it be better if I had two versions of checkAndUpdateBaseline, one for when autolocking is used and one for when it is not?

Are you proposing that whenever the datastore adds a baseline to the database, it is also added to a queue, and the queue checks whether the stack_rox_lock timestamp is in the past and sends the baseline to sensor? That might not be as simple as it first seems. There are many places that upsert process baselines into the database, and all of them would need to use a common function that both adds to the queue and upserts to the database. In some cases a baseline might need to be added somewhere other than the end of the queue, in which case a queue is not the right data structure; I am not sure if there are such cases. In some cases elements will need to be removed from the queue. In the future an auto_lock_timestamp could be used instead of stack_rox_lock_timestamp.

What would the queue consist of? When I first read your comments I thought you meant another queue of deployments, but now I think you mean a queue of baselines.

I don't really understand why auto locking requires all this refactoring and logic change.

If I understand correctly how you want the second queue to work, then the refactoring is not needed, but there are advantages to it. For example, UpsertProcessBaseline does not actually take a process baseline as input, as the name suggests. Instead it takes a process baseline key and a list of process baseline elements, creates a process baseline from them, and then upserts it. I believe it is better to have one function that takes a key and a list of elements and returns a baseline, and another that upserts it. Those two functions could then be called from a renamed UpsertProcessBaseline if desired. I have not actually done that to UpsertProcessBaseline.

I didn't plan on doing this refactoring when I started work on this PR, but when I realized that checkAndUpdateBaseline didn't actually have access to the process baselines that were being upserted into the database I felt that the refactoring was necessary to send the baselines to sensor.

Contributor

  1. I had my queues mixed up when I made this comment. So you are correct the observation queue is already the second queue.
  2. I still think the refactor is excessive. Wouldn't it be simpler to pass the baseline back up the call chain and send it to sensor from func (m *managerImpl) flushBaselineQueue()?

Contributor

And at the crux of it, I don't think we can auto lock anything really until we understand the impact.

Contributor Author

I still think the refactor is excessive. Wouldn't it be simpler to pass the baseline back up the call chain and send to sensor from

Yes. I will do that. I think the datastore for process baselines does need to be refactored, but that can be done separately from this work. I don't think I did a great job refactoring in this PR and it can be done better.

And at the crux of it, I don't think we can auto lock anything really until we understand the impact.

I will let this PR be dormant for now and focus on testing.

auto := true
rox_lock := true
user_lock := features.AutolockAllProcessBaselines.Enabled()
baseline, err = processBaselinePkg.BaselineFromKeysItemsAndExistingBaseline(baseline, key, elements, auto, rox_lock, user_lock)
Contributor Author

Previously the baseline that was inserted into the database was not available here. There has been some refactoring done to make it available here, so that it can be sent to sensor from here.

Contributor Author

This is outdated.

},
}

err = m.connectionManager.SendMessage(baseline.GetKey().GetClusterId(), msg)
Contributor Author

It might be better to have some common place where these types of messages are sent from, rather than sending them from both the service and here. I don't know what all the pros and cons would be.

Contributor

I think the service still needs to send them to sensor. But the service already calls this manager, so you could add a function here to do it and have both the manager and the service call it, if that is what you are looking for.

Contributor Author

Makes sense.

Contributor Author

Done

genDuration = env.BaselineGenerationDuration.DurationSetting()
)

func KeyToID(key *storage.ProcessBaselineKey) (string, error) {
Contributor Author

Moving this here was not necessary, but since I was moving other code here, it seemed to make sense to move this as well. Having it here means this package does not need to import the datastore. I think it made sense to move things out of the datastore that did not actually interact with the database.

Contributor

IMO, from a reviewer's perspective, unnecessary refactors are best done in a separate PR. They distract from the rest of the review and create changes that need to be digested and understood in the context of the changes being made. They make the blast radius larger and the review more expensive.

Contributor Author

I will not do the refactoring for this ticket. I have created a ticket for the refactoring: https://issues.redhat.com/browse/ROX-30341

@JoukoVirtanen JoukoVirtanen requested review from a team and dashrews78 and removed request for a team July 17, 2025 23:02
roxBaseline := processbaseline.IsRoxLocked(baseline) && hasNonStartupProcess
if userBaseline || roxBaseline {
// We already checked if it's in the baseline and it is not, so reprocess risk to mark the results are suspicious if necessary
m.reprocessor.ReprocessRiskForDeployments(baselineKey.deploymentID)
Contributor

I suspect we would still need to do this for auto lock.

@JoukoVirtanen JoukoVirtanen force-pushed the jv-ROX-30135-send-baselines-to-sensor-when-deployment-leaves-observation branch from 01ac856 to dbf4f87 Compare August 7, 2025 22:03
return false, err
}

inObservation := m.deploymentObservationQueue.InObservation(key.GetDeploymentId())
Contributor Author

There are now more checks for the deployment being in observation, so this is saved to a new variable.

elements = append(elements, insertableElement)
}
if len(elements) == 0 {
if len(elements) == 0 && (inObservation || !features.AutolockAllProcessBaselines.Enabled() || processbaseline.IsUserLocked(baseline)) {
Contributor Author

@JoukoVirtanen JoukoVirtanen Aug 14, 2025

Without this change, or something like it, a process baseline would not get locked when the deployment leaves the observation period if the baseline had been created earlier (for example by viewing the processes for a deployment in "Risk") and no new processes were launched between its creation and the end of the observation period.

Contributor

This is a lot. Can you extract that check into a well named function or something for readability?

Contributor Author

Done

if !exists {
_, err = m.baselines.UpsertProcessBaseline(lifecycleMgrCtx, key, elements, true, true)
userLocked := features.AutolockAllProcessBaselines.Enabled() && !inObservation
upsertedBaseline, err := m.baselines.UpsertProcessBaseline(lifecycleMgrCtx, key, elements, true, true, userLocked)
Contributor Author

UpsertProcessBaseline now needs to take in whether the baseline should be user locked or not.

Contributor

You need to do something with the err.

Contributor Author

Done

if userBaseline || roxBaseline {
// We already checked if it's in the baseline and it is not, so reprocess risk to mark the results are suspicious if necessary
m.reprocessor.ReprocessRiskForDeployments(baselineKey.deploymentID)
if !features.AutolockAllProcessBaselines.Enabled() {
Contributor Author

This section could be refactored, but perhaps now it is the simplest way to make it clear that the behavior here is not changing if the feature flag is disabled.

Contributor

This is very difficult to follow and there is a lot of repetition. I would suggest that you do refactor this area. Most importantly, I would recommend some comments around things like
if userLocked || !roxBaseline {
as it is not obvious why that case matters so much, since userLocked generally takes precedence.

Contributor Author

This has been refactored and comments have been added. Previously, if the baseline was neither user nor stackrox locked, it would be updated. The same behavior is kept here.

// We already checked if it's in the baseline and it is not, so reprocess risk to mark the results are suspicious if necessary
m.reprocessor.ReprocessRiskForDeployments(baselineKey.deploymentID)
}
if !userBaseline {
Contributor Author

We may need to user-lock the process baseline if it is only stackrox locked.

}

func (ds *datastoreImpl) updateProcessBaselineElements(ctx context.Context, baseline *storage.ProcessBaseline, addElements []*storage.BaselineItem, removeElements []*storage.BaselineItem, auto bool) (*storage.ProcessBaseline, error) {
func (ds *datastoreImpl) updateProcessBaselineElements(ctx context.Context, baseline *storage.ProcessBaseline, addElements []*storage.BaselineItem, removeElements []*storage.BaselineItem, auto bool, locked bool) (*storage.ProcessBaseline, error) {
Contributor Author

updateProcessBaselineElements will need to user lock the process baseline when the deployment leaves the observation period.

}

func (ds *datastoreImpl) UpdateProcessBaselineElements(ctx context.Context, key *storage.ProcessBaselineKey, addElements []*storage.BaselineItem, removeElements []*storage.BaselineItem, auto bool) (*storage.ProcessBaseline, error) {
func (ds *datastoreImpl) UpdateProcessBaselineElements(ctx context.Context, key *storage.ProcessBaselineKey, addElements []*storage.BaselineItem, removeElements []*storage.BaselineItem, auto bool, locked bool) (*storage.ProcessBaseline, error) {
Contributor Author

Similar comment as above.

}

func (ds *datastoreImpl) UpsertProcessBaseline(ctx context.Context, key *storage.ProcessBaselineKey, addElements []*storage.BaselineItem, auto bool, lock bool) (*storage.ProcessBaseline, error) {
func (ds *datastoreImpl) UpsertProcessBaseline(ctx context.Context, key *storage.ProcessBaselineKey, addElements []*storage.BaselineItem, auto bool, lock bool, userLock bool) (*storage.ProcessBaseline, error) {
Contributor Author

Similar comment as above.

@JoukoVirtanen JoukoVirtanen marked this pull request as ready for review August 15, 2025 04:35
@JoukoVirtanen JoukoVirtanen requested a review from a team as a code owner August 15, 2025 04:35
JoukoVirtanen and others added 25 commits August 27, 2025 10:16
Co-authored-by: David Shrewsberry <99685630+dashrews78@users.noreply.github.com>
… is passed to .updateProcessBaselineElements even if the baseline already exists
@JoukoVirtanen JoukoVirtanen force-pushed the jv-ROX-30135-send-baselines-to-sensor-when-deployment-leaves-observation branch from 8575a13 to 6c97e89 Compare August 27, 2025 17:16
@JoukoVirtanen (Contributor Author)

Replaced by #16564

@JoukoVirtanen JoukoVirtanen deleted the jv-ROX-30135-send-baselines-to-sensor-when-deployment-leaves-observation branch September 2, 2025 18:56