ROX-13540: Allow Sensor Component NodeScanHandler to start max. once #3800
ROX-13540: Allow Sensor Component NodeScanHandler to start max. once #3800
NodeScanHandler to start max. once #3800Conversation
|
Skipping CI for Draft Pull Request. |
|
Current dependencies on/for this PR: This comment was auto-generated by Graphite. |
0880bd0 to
c9416c7
Compare
|
Images are ready for the commit at 5246c62. To use with deploy scripts, first |
| return | ||
| default: | ||
| nodeScans <- fakeNodeScanV2("Node") | ||
| case nodeScans <- fakeNodeScanV2("Node"): |
There was a problem hiding this comment.
Fix for the race: placing the blocking part in the case instead of default
| case _, ok := <-h.ResponsesC(): | ||
| if !ok { | ||
| return | ||
| } |
There was a problem hiding this comment.
Simplification: Instead of forwarding to central, we now receive and discard all msgs that are meant to go to central.
| s.Assert().NotPanics(func() { | ||
| defer close(nodeScans) | ||
| h := NewNodeScanHandler(nodeScans) | ||
| s.Assert().NoError(h.Start()) |
There was a problem hiding this comment.
ah I found out a *suite is an *assert.Assertions, so no need to do s.Assert()..... You can just do s.NoError(...)
3f47d78 to
be4bb7c
Compare
NodeScanHandler restartable
| toCentral chan *central.MsgFromSensor | ||
| toCentral <-chan *central.MsgFromSensor | ||
| // toCentralMux prevents the race condition between Start() [writer] and ResponsesC() [reader] | ||
| toCentralMux *sync.Mutex |
There was a problem hiding this comment.
just make this a sync.Mutex, so you don't have to initialize it. The zero-value is an unlocked mutex
There was a problem hiding this comment.
also, it looks like this mutex it used for more than just toCentral, but rather to ensure Start() and ResponseC do are not run in parallel/concurrently. Maybe this can just be called lock? (Also I think mux typically implies multiplexer instead of mutex)
| nodeScans <-chan *storage.NodeScanV2 | ||
| toCentral chan *central.MsgFromSensor | ||
| toCentral <-chan *central.MsgFromSensor | ||
| // toCentralMux prevents the race condition between Start() [writer] and ResponsesC() [reader] |
There was a problem hiding this comment.
From what I can tell, we call Start() for each component before eventually calling ResponsesC() for each (see https://github.com/stackrox/stackrox/blob/3.72.2/sensor/common/sensor/sensor.go#L210. it looks like we call Start() for each component and then we eventually run ProcessMessage() and ResponsesC() concurrently). It definitely does not hurt at all to be defensive, but I don't think other sensor components have bothered.
There was a problem hiding this comment.
Similarly, I wonder if we should just have some flag which indicates if Start() has been called and only allow ResponsesC continue once that happens? Or we even just comment that you must call ResponsesC after Start() completes and leave it up to the caller
There was a problem hiding this comment.
I know that some usage patterns are not used in our setup, but my tests reported race or a leak and I wanted them to be clean. Moreover, we discussed the restartablilty of sensor components in the F2F meeting and we may be soon rely on it.
|
|
||
| func (c *nodeScanHandlerImpl) Start() error { | ||
| // c.run closes chan toCentral on exit, so we do not want to close twice in case of a restart | ||
| c.toCentralMux.Lock() |
There was a problem hiding this comment.
The more I read this and look at other Sensor components, the more I wonder if this may be overkill and all my questions about how to use this library were not very relevant. This seems to be correct, but I also wonder if it's worth worrying so much about this since we know how these methods are called (well at least now I know because I finally looked into it). Not saying to remove it but I wanted to see what you think
There was a problem hiding this comment.
+1 to not overkill. In my opinion, it will be sufficient to check whether a Start() was once attempted, and, if so, return an error. I don't think there's a need to make this component restartable provided that others don't take advantage of it.
There was a problem hiding this comment.
I know this is an overkill a bit, but I think this investment is worth it. We may need to make all Sensor components restartable, so I thought that this may be also a good way to have the race & goroutine leak tests green.
|
havent read it all yet, but wanted to leave some comments in case i forget to come back to this in a timely manner |
msugakov
left a comment
There was a problem hiding this comment.
Checked the main code mostly, skipped tests.
|
|
||
| func (c *nodeScanHandlerImpl) Start() error { | ||
| // c.run closes chan toCentral on exit, so we do not want to close twice in case of a restart | ||
| c.toCentralMux.Lock() |
There was a problem hiding this comment.
+1 to not overkill. In my opinion, it will be sufficient to check whether a Start() was once attempted, and, if so, return an error. I don't think there's a need to make this component restartable provided that others don't take advantage of it.
…and receiver receiving first message" This reverts commit 8d13af4.
56dab50 to
e385c20
Compare

Description
This is a follow-up PR to address review comments from #3533.
The original PR was merged by accident without addressing optional but
good suggestions.
This PR changes the NodeScan component to start maximally once.
It includes:
🗑️ Removing(readded in d7353ae)stoppedCleftoversendScans.Assert()insteadassert.✨ (Extra) Making the component restartable(removed in d7353ae)Stopped()so that the tests can wait for the handler to finish work.Extra
After the first comment about not closing the
toCentralchannel, I decided that the only clean solution to that challenge it to make the component restartable. I added the restartability in be4bb7c.Checklist
Not applies
Testing Performed
go test -race -count=1 -v -run ^TestNodeScanHandler$ ./sensor/common/compliancego test -race -count=12 -v -run ^TestNodeScanHandler$ ./sensor/common/compliance(more executions at once)ROX_RHCOS_NODE_SCANNING=true), andconfirming that the messages are sent to Sensor and Central. Also
killing sensor pod and making sure that it gets restarted and
self-heals.