ROX-27014: Cert refresh concurrency clean-up #13369
Conversation
Skipping CI for Draft Pull Request.
Images are ready for the commit at e067b50. To use with deploy scripts, first …
Codecov Report — Attention: Patch coverage is …

```
@@            Coverage Diff             @@
##           master   #13369      +/-   ##
==========================================
- Coverage   48.56%   48.56%   -0.01%
==========================================
  Files        2473     2473
  Lines      178760   178759       -1
==========================================
- Hits        86821    86818       -3
- Misses      84991    84993       +2
  Partials     6948     6948
```
Force-pushed 5a92d1a to 3268667
Force-pushed 3268667 to 5a2956b
vikin91 left a comment:
Unfortunately, I must return it again. The changes still violate the rules on Go concurrency that we try to stick to when writing new components or refactoring existing ones. I am happy to follow up on each comment that I posted, as I understand that the patterns may not be immediately clear.
The important factors for us are:
- Channels must always be closed when we know that there will be no more messages written to them.
- The writer goroutine (or object) is responsible for opening, writing to, and closing the channel.
- We always return channels in read-only mode (there are some exceptions to this rule).
- We should avoid creating channels in `Start()` and closing them in `Stop()` functions of the components, unless there is a good reason to do so.
- Always make sure that the code works correctly with unbuffered channels; only then add a buffer if that improves performance (and always comment why the buffer is there).
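The "writer owns the channel" rules above can be illustrated with a minimal, self-contained sketch (the `produce` function and its string payloads are hypothetical, not code from this PR):

```go
package main

import "fmt"

// produce demonstrates the ownership rules: the goroutine that writes to the
// channel also creates and closes it, and callers only ever see the read-only
// (<-chan) side. The channel is unbuffered, so the code must work without
// buffering first; a buffer would only be added (and commented) for performance.
func produce(items []string) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out) // closed by the writer once no more messages will follow
		for _, it := range items {
			out <- it
		}
	}()
	return out
}

func main() {
	// Consumers can simply range until the writer closes the channel.
	for msg := range produce([]string{"a", "b", "c"}) {
		fmt.Println(msg)
	}
}
```

Because only the writer closes `out`, readers never risk a send-on-closed-channel panic, and `range` terminates naturally.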
In the following, I will try to respond to the PR description. It may not be the best form of addressing that, so please let me know if we should speak about it in a meeting.
There weren't more objects writing to a channel,
I saw it differently, so please convince me if you think I was wrong.
For now I left that as is. It's not clear to me if Sensor components should support calling Stop() and then Start() again, but depending on that, changes might be needed. I don't think it's a problem if we leave the msgToCentralC channel open when Sensor is in offline mode.
Unfortunately, it is a problem that cost us a lot of time when a bug in Sensor gets reported. For simplicity, please assume that multiple calls to Start and Stop are possible during runtime and there is no guarantee that each Start is followed by Stop and each Stop is followed by Start.
Regarding the offline mode - I would still ask you to write an integration test for this component that cycles through offline and online modes to be sure that it works correctly.
The CertificateRequester periodically checks cert validity. If required, it makes a request to Central (using RequestCertificates() in certificates.go) which is passed to the msgToCentralC chan. It's ok if Sensor is offline, the context will expire and we'll try again. It's a synchronous function that waits until there's a reply (or an error, or the context expires).
Wouldn't it be nicer to check if Sensor is online and only then send the message? Or even pause the periodic checks while offline and run one immediately after changing to online?
I think it's because there were different signals with similar names, that's why I renamed some of them in this PR. We have the following channels:
If you think I missed something let me know. I'll try to address the other comments.
I think we should look at this together in a call. Would that be okay? If yes, then feel free to schedule something - even tomorrow (Friday) after 10:30 is okay for me.
Force-pushed 5117826 to e3598df
- Create send channel in requester
- Add test and some comments
- Rename channels
- Explicitly close internal channel
Force-pushed e3598df to 23c8982
vikin91 left a comment:
This PR is great, because it pushes the problematic part to a single place in the code, which makes it much easier to comment on and to suggest a solution.
I see one blocker here (marked as "the blocker") and a few other ideas for improvement.
Some comments are strictly meant to propose a solution to the blocker, so I would suggest the following order of reading this review:
- Read about the blocker issue first - it uses the 🛑 icon
- Then check my proposal for how to resolve it - those comments start with "(solution to the blocker)"
- All other comments contain various ideas for improvement and are less crucial (but should still be addressed).
sensor/kubernetes/certrefresh/securedcluster_tls_issuer_test.go
porridge left a comment:
I like the direction of this. Reducing the complexity is great!
I would even go as far as suggesting an experiment where you make a generic issuer (the two issuer types seem identical modulo some strings and parameters, and the response extraction function) and try to merge issuer and requester. The responsibility boundary between them always seemed very fuzzy to me, and the fact that they call each other in what seems like cycles (to my small brain) suggests that they should in fact be a single object. We might end up with something much easier to understand?
I'm not sure how this would impact testability, but I think it might be worth a try.
Also, I noticed that the NewResponseFrom*Certs functions might be possible to merge into one function that takes a custom interface parameter, since it only uses function calls to retrieve information from the proto object.
Some questions inline.
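The `NewResponseFrom*Certs` merge suggested above could look roughly like this (the interface, getter names, and types are hypothetical placeholders, not the actual proto API):

```go
package main

import "fmt"

// certSource abstracts the getters shared by the two proto response types,
// so a single constructor can replace the two NewResponseFrom*Certs functions.
// The getter names here are invented for illustration.
type certSource interface {
	GetCaPem() string
	GetServiceCerts() map[string]string
}

type response struct {
	caPem string
	certs map[string]string
}

// newResponse works for any type exposing the two getters, regardless of
// which concrete proto message backs it.
func newResponse(src certSource) response {
	return response{caPem: src.GetCaPem(), certs: src.GetServiceCerts()}
}

// fakeScannerCerts stands in for one of the proto response types.
type fakeScannerCerts struct{}

func (fakeScannerCerts) GetCaPem() string { return "CA" }
func (fakeScannerCerts) GetServiceCerts() map[string]string {
	return map[string]string{"scanner": "PEM"}
}

func main() {
	r := newResponse(fakeScannerCerts{})
	fmt.Println(r.caPem, len(r.certs))
}
```

Since the original functions reportedly only call getters on the proto object, a small interface like this is usually enough to deduplicate them without touching the generated proto code.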
Force-pushed 5a2e620 to 11fc05c
vikin91 left a comment:
The recent changes are really great as they eliminated all issues that I worried about - many thanks for taking the legacy code and agreeing to improve it to our current standards!
I see no blockers to merging this, but I commented on some places that may need to be checked and improved. I am giving it ✅, but please check my comments.
Force-pushed 2988890 to 3d2c97b
vikin91 left a comment:
It looks good, thanks for improving!
Description
This PR aims to address the issues mentioned in this comment from another PR: #13229 (review)
The main changes are:
- `LocalScannerTLSIssuer` and `SecuredClusterTLSIssuer` […] into one type
- `Requester` […] into `TLSIssuer`, as it makes more sense from a design perspective; they had similar responsibilities and were passing data back and forth between them
- […] `ResponsesC` channel of the Sensor component, the `TLSIssuer`)
- The `requestCertificates` method no longer supports concurrent calls, as they were not needed and were increasing complexity. There are checks that prevent a second call to `requestCertificates` if the previous call did not complete.
- […] `ProcessMessage` to `DispatchResponse` has been removed, and replaced with a simple goroutine call.
- The `sync.Map` that was keeping a response channel for each request has also been removed, and replaced with a `concurrency.Signal` that signals when a response has been received.

There is still similar-looking code remaining, but only in the tests of `LocalScannerTLSIssuer` and `SecuredClusterTLSIssuer`. This could be made more generic and consolidated, but given that `LocalScannerTLSIssuer` is deprecated and will be removed, it will be much easier to remove as it is (simply deleting the files).

Support for Sensor offline mode will be added in a follow-up PR: #13596
How the cert refresher works
The main components here are:
- `TLSIssuer` (`LocalScannerTLSIssuerImpl` and `SecuredClusterTLSIssuerImpl`), which implement `SensorComponent`. They are independent of each other, and `LocalScannerTLSIssuer` will be deprecated in the future.
- `CertificateRefresher`, a ticker that periodically checks certificate validity.

If certs are expired, the refresher makes a request to Central (using `tlsIssuer.requestCertificates`) which is passed to the `msgToCentralC` chan. It's ok if Sensor is offline: the context will expire and we'll try again - but #13596 adds explicit support for Sensor offline mode. It's a synchronous function that waits until there's a reply (or an error, or the context expires). Central replies in an async way, and that is handled by `ProcessMessage()` in `TLSIssuer`.
User-facing documentation
Testing and quality
Automated testing
How I validated my change
Manual smoke test that shows that the Secured Cluster cert refresher works:
```
SENSOR_HELM_MANAGED=true ./deploy/k8s/deploy-local.sh
❯ kubectl -n stackrox get secret | grep tls-
tls-cert-collector            Opaque   3   23s
tls-cert-admission-control    Opaque   3   23s
tls-cert-scanner-v4-db        Opaque   3   22s
tls-cert-scanner              Opaque   3   22s
tls-cert-scanner-db           Opaque   3   21s
tls-cert-scanner-v4-indexer   Opaque   3   21s
tls-cert-sensor               Opaque   3   21s
```