ROX-25948: Secured Clusters certificate refresher (#13229)
Conversation
Skipping CI for Draft Pull Request.
Images are ready for the commit at 9412802. To use with deploy scripts, first …
/test all
Codecov Report. Attention: Patch coverage is …
Additional details and impacted files:
```
@@            Coverage Diff             @@
##           master   #13229      +/-   ##
==========================================
+ Coverage   48.50%   48.56%   +0.05%
==========================================
  Files        2468     2469       +1
  Lines      178017   178071      +54
==========================================
+ Hits        86349    86473     +124
+ Misses      84735    84666      -69
+ Partials     6933     6932       -1
```
Flags with carried forward coverage won't be shown.
porridge left a comment:
LGTM, some nitpicks and philosophical questions inline.
Two resolved review threads on sensor/kubernetes/certrefresh/securedcluster/certificate_requester.go (outdated).
Force-pushed d129c00 to 989b538, then 989b538 to 12fe0e5.
Two resolved review threads on sensor/kubernetes/certrefresh/securedcluster_tls_issuer_test.go (outdated).
/test ocp-4-17-nongroovy-e2e-tests
porridge left a comment:
FTR I changed the sentence about init containers a bit.
jschnath left a comment:
Just confirmed with Vlad that the component is robust against Sensor losing connectivity to Central and entering offline mode; no other concerns from my end.
LGTM
vikin91 left a comment:
Apparently I cannot post a ❌ on this PR anymore, but I would like to ask you to address the issues that I found in a follow-up PR.
We want to adhere to some basic rules when writing new components. In this PR, I spotted:
- More than one object writes to a channel (which may lead to panics).
- The channel is never closed (an ugly way to avoid the first problem).
- Sensor offline mode is not tested; in particular, the code assumes the channel to Central will not block. It is unclear whether this component is okay with messages to Central not being delivered. This should be clarified.
```go
func NewCertificateRequester(sendC chan<- *message.ExpiringMessage,
	receiveC <-chan *central.IssueSecuredClusterCertsResponse) certificates.Requester {
	return &certificateRequesterImpl{
		sendC:    sendC,
		receiveC: receiveC,
	}
}
```
I was late for the review, but this looks like a recipe for problems, as it breaks some of the concurrency patterns that we try to follow. I mean in particular the rule "channels open for writing should be created and closed by the same goroutine (or object)".
Here, we accept a channel for writing from the outside and need to take special care not to write into a closed channel. Ideally, we would not want to take any special care when writing code in the future, so maybe this could be refactored to avoid this?
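The rule being cited can be sketched in isolation; this is a generic illustration (the `produce`/`collect` names are made up, not the actual Sensor code): the goroutine that creates the channel is its only writer and the one that closes it, so readers can safely range until close.

```go
package main

import "fmt"

// produce owns its output channel: it creates it, is the only writer,
// and closes it when done. Callers get a read-only view.
func produce(items []string) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out) // the creator closes; no other writer exists
		for _, it := range items {
			out <- it
		}
	}()
	return out
}

// collect drains a read-only channel until it is closed.
func collect(c <-chan string) []string {
	var got []string
	for it := range c {
		got = append(got, it)
	}
	return got
}

func main() {
	fmt.Println(collect(produce([]string{"a", "b"})))
}
```

With this shape, no caller can ever write into a closed channel, because no caller can write at all.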
```go
msgToCentralC := make(chan *message.ExpiringMessage)
msgFromCentralC := make(chan *central.IssueSecuredClusterCertsResponse)
return &securedClusterTLSIssuerImpl{
	sensorNamespace:              sensorNamespace,
	sensorPodName:                sensorPodName,
	k8sClient:                    k8sClient,
	msgToCentralC:                msgToCentralC,
	msgFromCentralC:              msgFromCentralC,
	certRefreshBackoff:           certRefreshBackoff,
	getCertificateRefresherFn:    newCertificatesRefresher,
	getServiceCertificatesRepoFn: securedcluster.NewServiceCertificatesRepo,
	certRequester:                securedcluster.NewCertificateRequester(msgToCentralC, msgFromCentralC),
}
```
If we look at this block, it should be trivial to refactor it so that the certRequester creates the channel and returns it. It would also know best when to close it (currently this channel is never closed).
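A minimal sketch of this suggested refactor, using stand-in types (`Message` and `certificateRequester` here are illustrative, not the actual stackrox types): the requester creates the channel, is its only writer, exposes it read-only, and can therefore close it safely on Stop.

```go
package main

import "fmt"

// Message stands in for *message.ExpiringMessage.
type Message struct{ ID string }

// certificateRequester owns msgToCentralC: it creates the channel and is
// the only writer, so it alone may close it.
type certificateRequester struct {
	msgToCentralC chan *Message
}

func newCertificateRequester() *certificateRequester {
	return &certificateRequester{msgToCentralC: make(chan *Message, 1)}
}

// MsgToCentralC exposes the channel read-only to the TLS issuer.
func (r *certificateRequester) MsgToCentralC() <-chan *Message { return r.msgToCentralC }

// Send is the single write path into the channel.
func (r *certificateRequester) Send(m *Message) { r.msgToCentralC <- m }

// Stop closes the channel; only the owner calls this.
func (r *certificateRequester) Stop() { close(r.msgToCentralC) }

func main() {
	r := newCertificateRequester()
	r.Send(&Message{ID: "req-1"})
	r.Stop()
	m, ok := <-r.MsgToCentralC()
	fmt.Println(m.ID, ok) // buffered value is still delivered after close
	_, ok = <-r.MsgToCentralC()
	fmt.Println(ok) // false: channel closed by its owner
}
```

Note that a buffered value already in the channel is still delivered after close; only subsequent receives report the channel as closed.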
```go
// Doesn't block even if the corresponding call to RequestCertificates is cancelled and no one
// ever reads this, because requestC has buffer of 1, and we removed it from `r.requests` above,
// in case we get more than 1 response for `msg.GetRequestId()`.
responseC.(chan *central.IssueLocalScannerCertsResponse) <- msg
```
Oh, this will block, or the message will be dropped when Sensor is offline. We should add tests for this component that cover operation in offline mode and transitioning between modes (something happens online, then we go offline for some time, then go back online, and the logic should still work).
I agree that this can be improved, but I actually don't think this is blocking; it might just be confusing because of the naming.
We have several channels here:
- r.receiveC: messages received from Central.
- r.sendC: messages to be sent to Central.
- responseC: used internally by the requester, to pass messages that were received on r.receiveC to another goroutine. It's a buffered channel of size 1, and we make sure to write to it only once. It doesn't help that at creation time this channel is called receiveC in the same file.
The message also won't be dropped if Sensor is offline. If we got to this point, we already have the message from Central; it's just being passed from one goroutine to another.
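The size-1 buffer pattern described above can be sketched like this (illustrative names; `dispatcher` and `dispatch` stand in for the real request-tracking code): deleting the map entry before sending guarantees at most one write per channel, and the buffer of 1 means that single write never blocks, even if the requester already gave up.

```go
package main

import "fmt"

// dispatcher tracks pending requests by ID; each pending request gets a
// buffered channel of size 1 for its response.
type dispatcher struct {
	requests map[string]chan string
}

// dispatch delivers a response for the given request ID, or drops it if
// the request is unknown or was already answered.
func (d *dispatcher) dispatch(id, resp string) bool {
	responseC, ok := d.requests[id]
	if !ok {
		return false // duplicate or unknown response: dropped
	}
	delete(d.requests, id) // ensures at most one write to responseC
	responseC <- resp      // buffer of 1: never blocks, even unread
	return true
}

func main() {
	d := &dispatcher{requests: map[string]chan string{
		"req-1": make(chan string, 1),
	}}
	fmt.Println(d.dispatch("req-1", "certs")) // delivered
	fmt.Println(d.dispatch("req-1", "certs")) // second response for same ID: dropped
}
```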
Description
This PR adds Sensor-side handling for the gRPC endpoint implemented here:
It was preceded by several refactor PRs:
The new Sensor component (SecuredClusterTLSIssuer) checks the Secured Cluster TLS certificates at startup. If they're missing or nearing expiration, it automatically requests new ones. Additionally, it schedules a certificate refresh when the certificates reach ~50% of their validity period, so there is little risk of the certificates becoming stale if Sensor never restarts during the last months of certificate validity.
The added components (the Secured Cluster TLS Issuer and Certificate Requester) are based on their Local Scanner counterparts, with a few changes, common code extracted, and tests adapted. Because these components are tightly coupled to their respective gRPC APIs, this was the easier approach, although it results in some similar code; the majority of the certrefresh code is reused anyway. It also makes it easier for the Secured Cluster cert refresh logic to diverge, if necessary.
This PR uses generics to remove the duplication mentioned above, at the expense of adding complexity to already quite complicated code. It's also worth mentioning that from 4.7 onwards, the Local Scanner certificate refresher will be provided only for backwards compatibility with older Centrals, and will eventually be deprecated.
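One way such deduplication with generics might look, using stand-in types rather than the actual central proto messages (a sketch, not the PR's implementation): a single requester parameterized over the response type serves both the Local Scanner and Secured Cluster flows.

```go
package main

import "fmt"

// Stand-ins for the two gRPC response types the requesters handle.
type LocalScannerCertsResponse struct{ RequestID string }
type SecuredClusterCertsResponse struct{ RequestID string }

// requester is generic over the response type, so one implementation
// can back both certificate flows.
type requester[R any] struct {
	receiveC <-chan R
}

// await blocks until a response arrives on receiveC.
func (r *requester[R]) await() R { return <-r.receiveC }

func main() {
	c := make(chan *SecuredClusterCertsResponse, 1)
	c <- &SecuredClusterCertsResponse{RequestID: "req-1"}
	r := &requester[*SecuredClusterCertsResponse]{receiveC: c}
	fmt.Println(r.await().RequestID)
}
```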
❗Important notes ❗
User-facing documentation
Testing and quality
Automated testing
How I validated my change
With a local deployment, I checked that the new set of Secured Cluster certificates is created. I also tried running with a short refresh interval (20 seconds instead of 9 months) to check that the certificates are rotated.