ROX-30135: Send baselines to sensor when deployment leaves observation #16077
JoukoVirtanen wants to merge 30 commits into master
Conversation
Skipping CI for Draft Pull Request.

Caution: There are some errors in your PipelineRun template.
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@ Coverage Diff @@
##           master   #16077      +/- ##
==========================================
+ Coverage   49.06%   49.07%   +0.01%
==========================================
  Files        2645     2645
  Lines      196451   196512      +61
==========================================
+ Hits        96382    96439      +57
- Misses      92517    92520       +3
- Partials     7552     7553       +1
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Images are ready for the commit at 6c97e89. To use with deploy scripts, first
```go
if !exists {
	_, err = m.baselines.UpsertProcessBaseline(lifecycleMgrCtx, key, elements, true, true)
```

```go
if exists {
```
Previously this came at the end. Before, ReprocessRiskForDeployments was called after a check that `exists` was false; if `exists` was false the function would return before this point could be reached. ReprocessRiskForDeployments is only run if the baseline was user or StackRox locked. Now the baseline will always be user locked, but I don't think we always want to call ReprocessRiskForDeployments. I think we only want to call it if the process baseline already exists and is user or StackRox locked. Therefore the locked check is done before the baseline is locked later in the function.
This needs to be changed so that it only returns if the baseline is already user locked.
I don't really understand why auto locking requires all this refactoring and logic change. Couldn't you just create a function like `processbaseline.IsAutoLocked(baseline)` and do logic on that where the existing logic is done?
I think sending the baseline should be separate from the processing of the indicator. Perhaps a queue or some other process should handle that, so we can leave this logic and the handling of the indicators alone.
> Couldn't you just create a function like processbaseline.IsAutoLocked(baseline) and do logic on that where the existing logic is done?
Where would the baseline here come from? What existing logic are you referring to? Where would this be called? Would IsAutoLocked require a change to the protobuf definition for baselines or would it use the existing fields in it?
> I think sending the baseline should be separate from the processing of the indicator.
What indicator are you referring to? The changes I am making apply to what happens when a deployment leaves the observation period and is removed from the queue of deployments in observation. A part of that is querying the database for processes in the deployment and adding those to the baseline. Is that what you mean?
I believe that a deployment is added to the queue when the first process within the deployment is run. Is that the indicator you are referring to?
> Perhaps a queue or some other process should handle that so we can leave this logic and the handling of the indicators alone.
You also had a similar comment in Slack:

> Couldn't we start a queue or something that handles sending the baselines to sensor when the lock time expires?
The lock time expires when the deployment leaves the observation window, and the existing queue handles those cases; that is why I did not see a need for a separate queue. Would it be better if I had two versions of checkAndUpdateBaseline, one for when autolocking is used and another for when it is not?

Are you proposing that whenever the datastore adds a baseline to the database, it is also added to a queue, and the queue checks whether the stack_rox_lock timestamp is in the past and sends the baseline to sensor? That might not be as simple as it first seems. There are many places that upsert process baselines into the database, and all of them would need to use a common function that both adds to the queue and upserts to the database. In some cases a baseline might need to be added somewhere other than the end of the queue, in which case a queue is not the right data structure; I am not sure if there are such cases. In some cases elements will need to be removed from the queue. In the future an auto_lock_timestamp could be used instead of stack_rox_lock_timestamp.

What would the queue consist of? When I first read your comments I thought you meant another queue of deployments, but now I think you mean a queue of baselines.
> I don't really understand why auto locking requires all this refactoring and logic change.
If I understand how you want the second queue to work, then the refactoring is not needed, but there are advantages to it. E.g. UpsertProcessBaseline does not actually take a process baseline as input, as the name suggests. Instead it takes a process baseline key and a list of process baseline elements, creates a process baseline from them, and then upserts it. I believe it is better to have one function that takes a key and a list of elements and returns a baseline, and another that upserts it. Those two functions could then be called from a renamed UpsertProcessBaseline if desired; I have not actually done that to UpsertProcessBaseline.
I didn't plan on doing this refactoring when I started work on this PR, but when I realized that checkAndUpdateBaseline didn't actually have access to the process baselines being upserted into the database, I felt the refactoring was necessary in order to send the baselines to sensor.
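The split described above can be sketched as follows; the function names, stand-in types, and map-backed "datastore" are illustrative assumptions, not the PR's actual code:

```go
package main

import "fmt"

// Minimal stand-ins for the storage protobuf types.
type ProcessBaselineKey struct {
	DeploymentID  string
	ContainerName string
}

type BaselineElement struct {
	ProcessName string
}

type ProcessBaseline struct {
	Key      ProcessBaselineKey
	Elements []BaselineElement
}

// buildBaseline is the first half of the split: construct the baseline
// from a key and elements without touching the database. (The PR's actual
// helper is named BaselineFromKeysItemsAndExistingBaseline.)
func buildBaseline(key ProcessBaselineKey, elements []BaselineElement) *ProcessBaseline {
	return &ProcessBaseline{Key: key, Elements: elements}
}

// upsertBaseline is the second half: persist an already-built baseline.
// A real implementation would write to the datastore; this sketch records
// it in a map keyed by deployment ID.
func upsertBaseline(db map[string]*ProcessBaseline, b *ProcessBaseline) *ProcessBaseline {
	db[b.Key.DeploymentID] = b
	return b
}

func main() {
	db := map[string]*ProcessBaseline{}
	b := buildBaseline(
		ProcessBaselineKey{DeploymentID: "dep-1", ContainerName: "app"},
		[]BaselineElement{{ProcessName: "/bin/sh"}},
	)
	upsertBaseline(db, b)
	// The caller now holds the same baseline that was persisted, so it can
	// also be sent to sensor, which the old signature did not allow.
	fmt.Println(db["dep-1"].Elements[0].ProcessName) // prints /bin/sh
}
```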
- I had my queues mixed up when I made this comment, so you are correct: the observation queue is already the second queue.
- I still think the refactor is excessive. Wouldn't it be simpler to pass the baseline back up the call chain and send to sensor from ?
And at the crux of it, I don't think we can auto lock anything really until we understand the impact.
> I still think the refactor is excessive. Wouldn't it be simpler to pass the baseline back up the call chain and send to sensor from
Yes. I will do that. I think the datastore for process baselines does need to be refactored, but that can be done separately from this work. I don't think I did a great job refactoring in this PR and it can be done better.
> And at the crux of it, I don't think we can auto lock anything really until we understand the impact.
I will let this PR be dormant for now and focus on testing.
```go
auto := true
rox_lock := true
user_lock := features.AutolockAllProcessBaselines.Enabled()
baseline, err = processBaselinePkg.BaselineFromKeysItemsAndExistingBaseline(baseline, key, elements, auto, rox_lock, user_lock)
```
Previously the baseline that was inserted into the database was not available here. Some refactoring has been done to make it available, so that it can be sent to sensor from this point.
This is outdated.
```go
	},
}

err = m.connectionManager.SendMessage(baseline.GetKey().GetClusterId(), msg)
```
It might be better to have some common place where these types of messages are sent from, rather than sending them from both the service and here. I don't know what all of the pros and cons would be.
I think the service still needs to send them to sensor. But the service already calls this manager, so you could add a function here to do it and have both the manager and the service call it, if that is what you are looking for.
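The shared helper suggested here could look roughly like this sketch; the helper name, the interface, and the fake manager are assumptions standing in for the real connection manager and `central.MsgToSensor`:

```go
package main

import "fmt"

// MsgToSensor is a stand-in for the real central.MsgToSensor protobuf.
type MsgToSensor struct {
	Payload string
}

// ConnectionManager captures the one method this sketch needs from the
// real connection manager.
type ConnectionManager interface {
	SendMessage(clusterID string, msg *MsgToSensor) error
}

// fakeManager records sent messages so the sketch is self-contained.
type fakeManager struct {
	sent []string
}

func (f *fakeManager) SendMessage(clusterID string, msg *MsgToSensor) error {
	f.sent = append(f.sent, clusterID+":"+msg.Payload)
	return nil
}

// sendBaselineToSensor is the hypothetical shared helper: one place that
// both the lifecycle manager and the service call, so the send logic is
// not duplicated between them.
func sendBaselineToSensor(cm ConnectionManager, clusterID string, msg *MsgToSensor) error {
	return cm.SendMessage(clusterID, msg)
}

func main() {
	cm := &fakeManager{}
	_ = sendBaselineToSensor(cm, "cluster-1", &MsgToSensor{Payload: "baseline-sync"})
	fmt.Println(cm.sent[0]) // prints cluster-1:baseline-sync
}
```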
```go
	genDuration = env.BaselineGenerationDuration.DurationSetting()
)

func KeyToID(key *storage.ProcessBaselineKey) (string, error) {
```
Moving this was not strictly necessary, but since I was moving other code here, it seemed to make sense to move this as well. Having it here means we do not need to import the datastore. I think it made sense to move the things that did not actually interact with the database from the datastore to here.
IMO, from a reviewers perspective, unnecessary refactors are best done in a separate PR. They distract from the rest of the review and create changes that need to be digested and understood in context with the changes being made. They make the blast radius larger and the review more expensive.
I will not do the refactoring for this ticket. I have created a ticket for the refactoring: https://issues.redhat.com/browse/ROX-30341
```go
roxBaseline := processbaseline.IsRoxLocked(baseline) && hasNonStartupProcess
if userBaseline || roxBaseline {
	// We already checked if it's in the baseline and it is not, so reprocess risk to mark the results are suspicious if necessary
	m.reprocessor.ReprocessRiskForDeployments(baselineKey.deploymentID)
```
I suspect we would still need to do this for auto lock.
Force-pushed from 01ac856 to dbf4f87.
```go
	return false, err
}

inObservation := m.deploymentObservationQueue.InObservation(key.GetDeploymentId())
```
There are now more checks for the deployment being in observation, so the result is saved to a new variable.
```diff
 	elements = append(elements, insertableElement)
 }
-if len(elements) == 0 {
+if len(elements) == 0 && (inObservation || !features.AutolockAllProcessBaselines.Enabled() || processbaseline.IsUserLocked(baseline)) {
```
Without this change, or something like it, a process baseline leaving the observation period would not get locked if it had previously been created (for example, by looking at the processes for a deployment in "Risk") and no new processes were launched between when the baseline was created and when it left the observation period.
This is a lot. Can you extract that check into a well-named function or something, for readability?
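One way to extract the compound condition, as requested, is a named predicate over the three inputs; the function and struct names below are hypothetical, and the stand-in fields mirror `inObservation`, `features.AutolockAllProcessBaselines.Enabled()`, and `processbaseline.IsUserLocked(baseline)`:

```go
package main

import "fmt"

// baselineState bundles the values the real check reads.
type baselineState struct {
	inObservation   bool // deployment is still in the observation window
	autolockEnabled bool // features.AutolockAllProcessBaselines.Enabled()
	userLocked      bool // processbaseline.IsUserLocked(baseline)
}

// shouldSkipEmptyBaselineUpdate is a hypothetical name for the extracted
// condition: when there are no new elements, the update can be skipped
// unless autolocking still needs to lock a baseline that has left
// observation and is not yet user locked.
func shouldSkipEmptyBaselineUpdate(s baselineState) bool {
	return s.inObservation || !s.autolockEnabled || s.userLocked
}

func main() {
	// Deployment left observation, autolock on, not yet user locked:
	// the empty update must still run so the baseline gets locked.
	fmt.Println(shouldSkipEmptyBaselineUpdate(baselineState{
		inObservation: false, autolockEnabled: true, userLocked: false,
	})) // prints false: do not skip
}
```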
```diff
 if !exists {
-	_, err = m.baselines.UpsertProcessBaseline(lifecycleMgrCtx, key, elements, true, true)
+	userLocked := features.AutolockAllProcessBaselines.Enabled() && !inObservation
+	upsertedBaseline, err := m.baselines.UpsertProcessBaseline(lifecycleMgrCtx, key, elements, true, true, userLocked)
```
UpsertProcessBaseline now needs to take in whether the baseline should be user locked or not.
You need to do something with the err.
```go
if userBaseline || roxBaseline {
	// We already checked if it's in the baseline and it is not, so reprocess risk to mark the results are suspicious if necessary
	m.reprocessor.ReprocessRiskForDeployments(baselineKey.deploymentID)
	if !features.AutolockAllProcessBaselines.Enabled() {
```
This section could be refactored, but perhaps this is now the simplest way to make it clear that the behavior here does not change when the feature flag is disabled.
This is very difficult to follow, with a lot of repetition. I would suggest that you do refactor this area. Most importantly, I would recommend some comments around things like `if userLocked || !roxBaseline`, as it is not obvious why that case matters so much, given that userLocked generally takes precedence.
This has been refactored and comments have been added. Previously, if the baseline was neither user nor StackRox locked it would be updated; the same behavior is kept here.
```go
	// We already checked if it's in the baseline and it is not, so reprocess risk to mark the results are suspicious if necessary
	m.reprocessor.ReprocessRiskForDeployments(baselineKey.deploymentID)
}
if !userBaseline {
```
We may need to user lock the process baseline if it is only StackRox locked.
```diff
 }
 
-func (ds *datastoreImpl) updateProcessBaselineElements(ctx context.Context, baseline *storage.ProcessBaseline, addElements []*storage.BaselineItem, removeElements []*storage.BaselineItem, auto bool) (*storage.ProcessBaseline, error) {
+func (ds *datastoreImpl) updateProcessBaselineElements(ctx context.Context, baseline *storage.ProcessBaseline, addElements []*storage.BaselineItem, removeElements []*storage.BaselineItem, auto bool, locked bool) (*storage.ProcessBaseline, error) {
```
updateProcessBaselineElements will need to user lock the process baseline when the deployment leaves the observation period.
```diff
 }
 
-func (ds *datastoreImpl) UpdateProcessBaselineElements(ctx context.Context, key *storage.ProcessBaselineKey, addElements []*storage.BaselineItem, removeElements []*storage.BaselineItem, auto bool) (*storage.ProcessBaseline, error) {
+func (ds *datastoreImpl) UpdateProcessBaselineElements(ctx context.Context, key *storage.ProcessBaselineKey, addElements []*storage.BaselineItem, removeElements []*storage.BaselineItem, auto bool, locked bool) (*storage.ProcessBaseline, error) {
```
Similar comment as above.
```diff
 }
 
-func (ds *datastoreImpl) UpsertProcessBaseline(ctx context.Context, key *storage.ProcessBaselineKey, addElements []*storage.BaselineItem, auto bool, lock bool) (*storage.ProcessBaseline, error) {
+func (ds *datastoreImpl) UpsertProcessBaseline(ctx context.Context, key *storage.ProcessBaselineKey, addElements []*storage.BaselineItem, auto bool, lock bool, userLock bool) (*storage.ProcessBaseline, error) {
```
Similar comment as above.
Co-authored-by: David Shrewsberry <99685630+dashrews78@users.noreply.github.com>
… is passed to .updateProcessBaselineElements even if the baseline already exists
Force-pushed from 8575a13 to 6c97e89.
Replaced by #16564
Description
When a deployment leaves the observation window, a "user" locked process baseline is created for it automatically, persisted in the database, and sent to sensor. This change is behind a feature flag and is done in the detection lifecycle manager. Previously, the detection lifecycle manager created a StackRox-locked baseline and persisted it in the database without sending it to sensor.
There are two PRs built on top of this PR:

- Add process baseline autolocking to cluster config: #16427
- Configure process baseline auto locking via helm: #16462
User-facing documentation
Testing and quality
Automated testing
How I validated my change
Testing was not done with the latest version, so results might now be different.
Testing with autolocking enabled
1. Set the following environment variables
2. Deployed ACS.
3. Created a pod that could be used to run some processes and entered it.
4. Logged into the UI and checked "Risk". Initially the process baseline is unlocked.
5. Ran the following command inside the pod
6. Checked the process baseline after 5 minutes. Note that the processes are locked without any action having been taken.
7. Ran a new process which is not in the baseline.
8. Checked Violations; the anomalous process is there.
9. Checked the process baselines for other deployments as well; they were locked too.
Testing with locking process baseline before auto-lock
1. Set the following environment variables
2. Deployed ACS.
3. Created a pod that could be used to run some processes and entered it.
4. Logged into the UI and checked "Risk".
5. Locked the process baseline.
6. Ran an anomalous process. The anomalous process is flagged in "Process Discovery".
7. Unlocked the process baseline.
8. Ran another process. The new process is not considered to be anomalous, but the process that was run while the process baseline was locked shows up in violations.
Testing with auto-locking disabled
Did the same steps except the feature flag was disabled.

1. Deployed ACS.
2. Created a pod that could be used to run some processes and entered it.
3. Logged into the UI and checked "Risk". Just as before, the process baseline was unlocked at the beginning.
4. Ran the following command inside the pod
5. The process baseline was still unlocked after more than 5 minutes.
6. Ran a new process which is not in the baseline.
7. None of the process baselines were locked and there was no alerting for process baseline violations.
Testing with locking process baseline before observation period ends with feature disabled
Repeated the steps in "Testing with locking process baseline before auto-lock", except the feature flag was disabled. The results were the same.