Skip to content

ROX-13540: Allow Sensor Component NodeScanHandler to start max. once #3800

Merged
vikin91 merged 38 commits intomasterfrom
pr/ROX-12835-followup-1
Nov 29, 2022
Merged

ROX-13540: Allow Sensor Component NodeScanHandler to start max. once #3800
vikin91 merged 38 commits intomasterfrom
pr/ROX-12835-followup-1

Conversation

@vikin91
Copy link
Contributor

@vikin91 vikin91 commented Nov 14, 2022

Description

This is a follow-up PR to address review comments from #3533.
The original PR was merged by accident without addressing optional but
good suggestions.

This PR changes the NodeScan component to start maximally once.

It includes:

  • 🗑️ Removing stoppedC leftover (readded in d7353ae)
  • 🐛 Handling a panic when two or more consecutive Start() calls happen without a Stop() call in-between
  • 🧪 Adding a regression test for the two or more consecutive Start() calls
  • 🐛 Fixing previously undiscovered race condition
  • 💄 Removing one level of indentation in sendScan
  • 💄 Using s.Assert() instead assert.
  • ✨ (Extra) Making the component restartable (removed in d7353ae)
  • Re-added Stopped() so that the tests can wait for the handler to finish work.

Extra

After the first comment about not closing the toCentral channel, I decided that the only clean solution to that challenge it to make the component restartable. I added the restartability in be4bb7c.

Checklist

  • Investigated and inspected CI test results
  • Unit test and regression tests added

Not applies

  • Evaluated and added CHANGELOG entry if required
  • Determined and documented upgrade steps
  • Documented user facing changes (create PR based on openshift/openshift-docs and merge into rhacs-docs)

Testing Performed

  • CI
  • go test -race -count=1 -v -run ^TestNodeScanHandler$ ./sensor/common/compliance
  • go test -race -count=12 -v -run ^TestNodeScanHandler$ ./sensor/common/compliance (more executions at once)
  • Manual deployment of cluster locally, enabling the FF (ROX_RHCOS_NODE_SCANNING=true), and
    confirming that the messages are sent to Sensor and Central. Also
    killing sensor pod and making sure that it gets restarted and
    self-heals.

@openshift-ci
Copy link

openshift-ci bot commented Nov 14, 2022

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@vikin91
Copy link
Contributor Author

vikin91 commented Nov 14, 2022

@vikin91 vikin91 force-pushed the pr/ROX-12835-followup-1 branch from 0880bd0 to c9416c7 Compare November 15, 2022 08:50
@vikin91 vikin91 marked this pull request as ready for review November 15, 2022 09:17
@vikin91 vikin91 changed the title Fix: Ensure that consecutive calls to nodeScanHandlerImpl.Start don't panic ROX-13540: Ensure that consecutive calls to nodeScanHandlerImpl.Start don't panic Nov 15, 2022
@ghost
Copy link

ghost commented Nov 15, 2022

Images are ready for the commit at 5246c62.

To use with deploy scripts, first export MAIN_IMAGE_TAG=3.73.x-119-g5246c62090.

return
default:
nodeScans <- fakeNodeScanV2("Node")
case nodeScans <- fakeNodeScanV2("Node"):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix for the race: placing the blocking part in the case instead of default

Comment on lines +82 to +120
case _, ok := <-h.ResponsesC():
if !ok {
return
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplification: Instead of forwarding to central, we now receive and discard all msgs that are meant to go to central.

@vikin91 vikin91 requested review from RTann and msugakov November 15, 2022 12:58
s.Assert().NotPanics(func() {
defer close(nodeScans)
h := NewNodeScanHandler(nodeScans)
s.Assert().NoError(h.Start())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah I found out a *suite is an *assert.Assertions, so no need to do s.Assert()..... You can just do s.NoError(...)

@vikin91 vikin91 marked this pull request as draft November 16, 2022 11:37
@vikin91 vikin91 force-pushed the pr/ROX-12835-followup-1 branch from 3f47d78 to be4bb7c Compare November 16, 2022 12:19
@vikin91 vikin91 changed the title ROX-13540: Ensure that consecutive calls to nodeScanHandlerImpl.Start don't panic ROX-13540: Make Sensor Component NodeScanHandler restartable Nov 16, 2022
@vikin91 vikin91 marked this pull request as ready for review November 16, 2022 12:30
@vikin91 vikin91 requested review from RTann, fredrb and lvalerom November 16, 2022 16:18
@vikin91
Copy link
Contributor Author

vikin91 commented Nov 16, 2022

Inviting additional Sensor experts to review if they find a time :) @fredrb and @lvalerom

toCentral chan *central.MsgFromSensor
toCentral <-chan *central.MsgFromSensor
// toCentralMux prevents the race condition between Start() [writer] and ResponsesC() [reader]
toCentralMux *sync.Mutex
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just make this a sync.Mutex, so you don't have to initialize it. The zero-value is an unlocked mutex

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, it looks like this mutex it used for more than just toCentral, but rather to ensure Start() and ResponseC do are not run in parallel/concurrently. Maybe this can just be called lock? (Also I think mux typically implies multiplexer instead of mutex)

nodeScans <-chan *storage.NodeScanV2
toCentral chan *central.MsgFromSensor
toCentral <-chan *central.MsgFromSensor
// toCentralMux prevents the race condition between Start() [writer] and ResponsesC() [reader]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I can tell, we call Start() for each component before eventually calling ResponsesC() for each (see https://github.com/stackrox/stackrox/blob/3.72.2/sensor/common/sensor/sensor.go#L210. it looks like we call Start() for each component and then we eventually run ProcessMessage() and ResponsesC() concurrently). It definitely does not hurt at all to be defensive, but I don't think other sensor components have bothered.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, I wonder if we should just have some flag which indicates if Start() has been called and only allow ResponsesC continue once that happens? Or we even just comment that you must call ResponsesC after Start() completes and leave it up to the caller

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that some usage patterns are not used in our setup, but my tests reported race or a leak and I wanted them to be clean. Moreover, we discussed the restartablilty of sensor components in the F2F meeting and we may be soon rely on it.


func (c *nodeScanHandlerImpl) Start() error {
// c.run closes chan toCentral on exit, so we do not want to close twice in case of a restart
c.toCentralMux.Lock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The more I read this and look at other Sensor components, the more I wonder if this may be overkill and all my questions about how to use this library were not very relevant. This seems to be correct, but I also wonder if it's worth worrying so much about this since we know how these methods are called (well at least now I know because I finally looked into it). Not saying to remove it but I wanted to see what you think

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to not overkill. In my opinion, it will be sufficient to check whether a Start() was once attempted, and, if so, return an error. I don't think there's a need to make this component restartable provided that others don't take advantage of it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is an overkill a bit, but I think this investment is worth it. We may need to make all Sensor components restartable, so I thought that this may be also a good way to have the race & goroutine leak tests green.

@RTann
Copy link
Contributor

RTann commented Nov 18, 2022

havent read it all yet, but wanted to leave some comments in case i forget to come back to this in a timely manner

Copy link
Contributor

@msugakov msugakov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checked the main code mostly, skipped tests.


func (c *nodeScanHandlerImpl) Start() error {
// c.run closes chan toCentral on exit, so we do not want to close twice in case of a restart
c.toCentralMux.Lock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to not overkill. In my opinion, it will be sufficient to check whether a Start() was once attempted, and, if so, return an error. I don't think there's a need to make this component restartable provided that others don't take advantage of it.

@vikin91 vikin91 force-pushed the pr/ROX-12835-followup-1 branch from 56dab50 to e385c20 Compare November 28, 2022 17:15
@vikin91
Copy link
Contributor Author

vikin91 commented Nov 28, 2022

@RTann @fredrb @lvalerom we are done with changes (provided that CI is happy), if you want to review again, then it may be a good time for it.

@vikin91 vikin91 merged commit 5fc5870 into master Nov 29, 2022
@vikin91 vikin91 deleted the pr/ROX-12835-followup-1 branch November 29, 2022 10:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants