ROX-25948: Secured Clusters certificate refresher#13229

Merged
vladbologa merged 6 commits into master from vb/secured-cluster-tls-issuer
Nov 14, 2024

Conversation

@vladbologa
Contributor

@vladbologa vladbologa commented Nov 5, 2024

Description

This PR adds Sensor-side handling for the gRPC endpoint implemented here.

It was preceded by several refactor PRs.

The new Sensor component (SecuredClusterTLSIssuer) checks for Secured Cluster TLS certificates at startup. If they’re missing or nearing expiration, it automatically requests new ones.

Additionally, it schedules a certificate refresh at roughly 50% of the certificates' validity period, so there is little risk of the certificates becoming stale if Sensor never restarts during the last months of their validity.

The added components (the Secured Cluster TLS issuer and Certificate Requester) are based on their Local Scanner counterparts, with a few changes: common code was extracted and tests were adapted. Because these components are tightly coupled to their respective gRPC APIs, this was the simpler approach, although it results in some similar code; the majority of the certrefresh code is reused anyway. It also makes it easier for the Secured Cluster cert refresh logic to diverge, if necessary.

This PR uses generics to remove the duplication mentioned above, at the expense of adding complexity to code that is already quite complicated. It is also worth mentioning that from 4.7 onwards, the Local Scanner certificate refresher will be kept only for backwards compatibility with older Centrals, and will eventually be deprecated.
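The generics idea can be sketched roughly like this: the request/response-matching logic is written once, parameterized over the concrete gRPC response type. All type and method names below are illustrative, not the actual stackrox code:

```go
package main

import "fmt"

// Response is the constraint both certificate-response types would satisfy.
type Response interface {
	GetRequestId() string
}

// localScannerCertsResponse stands in for one of the generated gRPC types.
type localScannerCertsResponse struct{ requestID string }

func (r localScannerCertsResponse) GetRequestId() string { return r.requestID }

// requester is generic over the response type, so the same matching logic
// can serve both the Local Scanner and the Secured Cluster services.
type requester[T Response] struct {
	responses map[string]chan T // pending requests keyed by request ID
}

func newRequester[T Response]() *requester[T] {
	return &requester[T]{responses: make(map[string]chan T)}
}

// dispatch routes an incoming response to whoever is waiting on its ID,
// removing the entry so at most one message is ever written per channel.
func (r *requester[T]) dispatch(msg T) bool {
	c, ok := r.responses[msg.GetRequestId()]
	if !ok {
		return false // no one is waiting for this request ID
	}
	delete(r.responses, msg.GetRequestId())
	c <- msg // safe: caller registered a channel with capacity >= 1
	return true
}

func main() {
	r := newRequester[localScannerCertsResponse]()
	c := make(chan localScannerCertsResponse, 1)
	r.responses["abc"] = c
	fmt.Println(r.dispatch(localScannerCertsResponse{requestID: "abc"})) // true
	fmt.Println((<-c).GetRequestId())                                    // abc
}
```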

❗Important notes ❗

  • Secured Clusters using (soon to be legacy) init bundle secrets will retrieve the new refreshed certificates on startup (i.e. in 4.7 we'll have an automatic transition to the new certificates)
  • the new TLS secrets have a new naming scheme, and they can co-exist with the older init bundle secrets (see here)
  • the new certificates are not yet used by the services. That will be done after this PR is merged, see ROX-26055. tl;dr: the services will have initContainers that check which TLS secrets exist in the cluster and use the preferred ones.
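The secret-preference check mentioned in the last bullet could look roughly like this. The helper and the secret names are hypothetical; the real logic is tracked in ROX-26055:

```go
package main

import "fmt"

// preferredSecret picks the first secret name from the preference order that
// actually exists in the cluster, so a service can prefer the new naming
// scheme while falling back to the legacy init-bundle secret.
func preferredSecret(existing map[string]bool, preference []string) (string, bool) {
	for _, name := range preference {
		if existing[name] {
			return name, true
		}
	}
	return "", false
}

func main() {
	// Both the new-scheme secret and the legacy secret exist; the new one wins.
	existing := map[string]bool{
		"tls-cert-sensor": true, // hypothetical new-scheme name
		"sensor-tls":      true, // hypothetical legacy init-bundle name
	}
	name, _ := preferredSecret(existing, []string{"tls-cert-sensor", "sensor-tls"})
	fmt.Println(name) // tls-cert-sensor
}
```

In the real initContainer this lookup would be backed by the Kubernetes API rather than an in-memory map.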

User-facing documentation

  • CHANGELOG is updated OR update is not needed
  • documentation PR is created and is linked above OR is not needed

Testing and quality

  • the change is production ready: the change is GA or otherwise the functionality is gated by a feature flag
  • CI results are inspected

Automated testing

  • added unit tests
  • added e2e tests
  • added regression tests
  • added compatibility tests
  • modified existing tests

How I validated my change

With a local deployment, I checked that the new set of Secured Cluster certificates is created. I also ran with a short refresh interval (20 seconds instead of 9 months) to check that they are rotated.

@openshift-ci

openshift-ci bot commented Nov 5, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@rhacs-bot
Contributor

rhacs-bot commented Nov 5, 2024

Images are ready for the commit at 9412802.

To use with deploy scripts, first export MAIN_IMAGE_TAG=4.7.x-69-g9412802f8f.

@vladbologa
Contributor Author

/test all

@codecov

codecov bot commented Nov 5, 2024

Codecov Report

Attention: Patch coverage is 94.15584% with 9 lines in your changes missing coverage. Please review.

Project coverage is 48.56%. Comparing base (f0da821) to head (9412802).
Report is 57 commits behind head on master.

Files with missing lines | Patch % | Lines
...ubernetes/certrefresh/securedcluster_tls_issuer.go | 91.42% | 6 Missing ⚠️
...ertrefresh/securedcluster/certificate_requester.go | 95.31% | 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #13229      +/-   ##
==========================================
+ Coverage   48.50%   48.56%   +0.05%     
==========================================
  Files        2468     2469       +1     
  Lines      178017   178071      +54     
==========================================
+ Hits        86349    86473     +124     
+ Misses      84735    84666      -69     
+ Partials     6933     6932       -1     
Flag | Coverage Δ
go-unit-tests | 48.56% <94.15%> (+0.05%) ⬆️


Contributor

@porridge porridge left a comment


LGTM, some nitpicks and philosophical questions inline.

@vladbologa vladbologa force-pushed the vb/secured-cluster-tls-issuer branch 2 times, most recently from d129c00 to 989b538 Compare November 7, 2024 16:35
@vladbologa vladbologa force-pushed the vb/secured-cluster-tls-issuer branch from 989b538 to 12fe0e5 Compare November 7, 2024 17:49
Contributor

@mclasmeier mclasmeier left a comment


empty

@vladbologa vladbologa marked this pull request as ready for review November 12, 2024 14:13
@vladbologa vladbologa requested a review from a team as a code owner November 12, 2024 14:13
@vladbologa
Contributor Author

/test ocp-4-17-nongroovy-e2e-tests

Contributor

@porridge porridge left a comment


FTR I changed the sentence about init containers a bit.

Contributor

@jschnath jschnath left a comment


Just confirmed with Vlad that the component is robust against Sensor losing connectivity to Central and entering offline mode; no other concerns from my end.

LGTM

@vladbologa vladbologa merged commit 0d2d698 into master Nov 14, 2024
@vladbologa vladbologa deleted the vb/secured-cluster-tls-issuer branch November 14, 2024 13:26
Contributor

@vikin91 vikin91 left a comment


Apparently I cannot post a ❌ on this PR anymore, but I would like to ask you to address the issues that I found in a follow-up PR.

We want to adhere to some basic rules when writing new components. In this PR, I spotted:

  • more than one object writes to a channel (which may lead to panics)
  • the channel is not being closed (an ugly way to avoid the first problem)
  • Sensor offline mode is not tested, especially given the assumption that the channel to Central will not block. It is unclear whether this component tolerates messages to Central not being delivered; this should be clarified.

Comment on lines +31 to +37
func NewCertificateRequester(sendC chan<- *message.ExpiringMessage,
	receiveC <-chan *central.IssueSecuredClusterCertsResponse) certificates.Requester {
	return &certificateRequesterImpl{
		sendC:    sendC,
		receiveC: receiveC,
	}
}
Contributor


I was late for the review, but this looks like a recipe for problems, as it breaks some of the concurrency patterns that we try to follow; in particular the rule that "channels open for writing should be created and closed by the same goroutine (or object)".
Here we accept a channel for writing from the outside and need to take special care not to write into a closed channel. Ideally, we would not want to take any special care when writing code in the future, so maybe this could be refactored to avoid this?
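The ownership pattern being suggested, where the object that writes to a channel also creates and closes it, can be sketched like this (all names are illustrative):

```go
package main

import "fmt"

type message struct{ payload string }

// requester owns msgC: it creates it, is the only writer, and closes it when
// Run returns. Consumers only ever receive, so a send on a closed channel
// cannot happen outside this type.
type requester struct {
	msgC chan message
	stop chan struct{}
}

func newRequester() *requester {
	return &requester{msgC: make(chan message), stop: make(chan struct{})}
}

// Messages exposes the channel read-only, so callers cannot write or close it.
func (r *requester) Messages() <-chan message { return r.msgC }

func (r *requester) Run() {
	defer close(r.msgC) // the single writer closes what it created
	for i := 0; ; i++ {
		select {
		case <-r.stop:
			return
		case r.msgC <- message{payload: fmt.Sprintf("request-%d", i)}:
		}
	}
}

func (r *requester) Stop() { close(r.stop) }

func main() {
	r := newRequester()
	go r.Run()
	fmt.Println((<-r.Messages()).payload) // request-0
	r.Stop()
	for range r.Messages() { // drains until Run closes msgC
	}
	fmt.Println("done")
}
```

Because the requester is the sole writer, no caller ever needs to reason about whether the channel was closed underneath them.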

Comment on lines +27 to +40
msgToCentralC := make(chan *message.ExpiringMessage)
msgFromCentralC := make(chan *central.IssueSecuredClusterCertsResponse)
return &securedClusterTLSIssuerImpl{
	sensorNamespace:              sensorNamespace,
	sensorPodName:                sensorPodName,
	k8sClient:                    k8sClient,
	msgToCentralC:                msgToCentralC,
	msgFromCentralC:              msgFromCentralC,
	certRefreshBackoff:           certRefreshBackoff,
	getCertificateRefresherFn:    newCertificatesRefresher,
	getServiceCertificatesRepoFn: securedcluster.NewServiceCertificatesRepo,
	certRequester:                securedcluster.NewCertificateRequester(msgToCentralC, msgFromCentralC),
}
Contributor


If we look at this block, it should be trivial to refactor it so that the certRequester creates the channel and returns it. It would also know best when to close it (currently this channel is not being closed).

Comment on lines 78 to 81
// Doesn't block even if the corresponding call to RequestCertificates is cancelled and no one
// ever reads this, because requestC has buffer of 1, and we removed it from `r.requests` above,
// in case we get more than 1 response for `msg.GetRequestId()`.
responseC.(chan *central.IssueLocalScannerCertsResponse) <- msg
Contributor


Oh, this will block, or the message will be dropped, when Sensor is offline. We should add tests for this component that cover operation in offline mode and transitions between modes (something happens online, then we go offline for some time, then back online, and the logic should still work).

Contributor Author


I agree that this can be improved, but I don't actually think this blocks; it might just be confusing because of the naming.

We have several channels here:

  • r.receiveC -> messages received from Central
  • r.sendC -> messages to be sent to Central
  • responseC is used internally by the requester, to pass messages that were received on r.receiveC to another goroutine. It's a buffered channel of size 1, and we make sure to write into it only once. It doesn't help that at creation time this channel is called receiveC in the same file.

The message also won't be dropped if Sensor is offline. If we got to this point, we already have the message from Central; it's just being passed from one goroutine to another.
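The non-blocking handoff described above relies on the channel having capacity 1 and being written to exactly once; a minimal sketch:

```go
package main

import "fmt"

func main() {
	// responseC mirrors the per-request channel described above: capacity 1,
	// written to at most once, so the send completes even if the requesting
	// goroutine was cancelled and never reads it.
	responseC := make(chan string, 1)

	responseC <- "signed-certificates" // never blocks: the buffer has room

	// If the requester is still waiting, it picks the response up later.
	fmt.Println(<-responseC) // signed-certificates
}
```

Deleting the map entry before the send (as the quoted comment explains) is what guarantees the "written at most once" half of the invariant.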
