ROX-25948: Secured Clusters certificate refresher (#13229)
Conversation
Skipping CI for Draft Pull Request.
Images are ready for the commit at 9412802. To use with deploy scripts, first …
/test all
Codecov Report. Attention: Patch coverage is …
Additional details and impacted files:
```
@@            Coverage Diff             @@
##           master   #13229      +/-   ##
==========================================
+ Coverage   48.50%   48.56%   +0.05%
==========================================
  Files        2468     2469       +1
  Lines      178017   178071      +54
==========================================
+ Hits        86349    86473     +124
+ Misses      84735    84666      -69
+ Partials     6933     6932       -1
```
Flags with carried forward coverage won't be shown.
porridge left a comment:
LGTM, some nitpicks and philosophical questions inline.
Two resolved review threads on sensor/kubernetes/certrefresh/securedcluster/certificate_requester.go (outdated).
Force-pushed d129c00 to 989b538, then 989b538 to 12fe0e5.
Two resolved review threads on sensor/kubernetes/certrefresh/securedcluster_tls_issuer_test.go (outdated).
/test ocp-4-17-nongroovy-e2e-tests
porridge left a comment:
FTR I changed the sentence about init containers a bit.
jschnath left a comment:
Just confirmed with Vlad that the component is robust against Sensor losing connectivity to Central and entering offline mode; no other concerns from my end.
LGTM
vikin91 left a comment:
Apparently I cannot post a ❌ on this PR anymore, but I would like to ask you to address the issues that I found in a follow-up PR.
We want to adhere to some basic rules when writing new components. In this PR, I spotted:
- More than one object writes to a channel (which may lead to panics).
- The channel is never closed (an ugly way to avoid the first problem).
- Sensor offline mode is not tested; in particular, the code assumes the channel to Central will not block. It is unclear whether this component is okay with messages to Central not being delivered. This should be clarified.
```go
func NewCertificateRequester(sendC chan<- *message.ExpiringMessage,
	receiveC <-chan *central.IssueSecuredClusterCertsResponse) certificates.Requester {
	return &certificateRequesterImpl{
		sendC:    sendC,
		receiveC: receiveC,
	}
}
```
I was late for the review, but this looks like a recipe for problems, as it breaks some of the concurrency patterns that we try to follow. I mean in particular the rule "channels open for writing should be created and closed by the same goroutine (or object)".
Here, we accept a channel for writing from the outside and need to take special care not to write into a closed channel. Ideally, we would not want to take any special care when writing code in the future, so maybe this could be refactored to avoid this?
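The rule being cited can be sketched in isolation; this is a generic illustration (the `produce`/`collect` names are made up, not the actual Sensor code): the goroutine that creates the channel is its only writer and the one that closes it, so readers can safely range until close.

```go
package main

import "fmt"

// produce owns its output channel: it creates it, is the only writer,
// and closes it when done. Callers get a read-only view.
func produce(items []string) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out) // the creator closes; no other writer exists
		for _, it := range items {
			out <- it
		}
	}()
	return out
}

// collect drains a read-only channel until it is closed.
func collect(c <-chan string) []string {
	var got []string
	for it := range c {
		got = append(got, it)
	}
	return got
}

func main() {
	fmt.Println(collect(produce([]string{"a", "b"})))
}
```

With this shape, no caller can ever write into a closed channel, because no caller can write at all.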
```go
msgToCentralC := make(chan *message.ExpiringMessage)
msgFromCentralC := make(chan *central.IssueSecuredClusterCertsResponse)
return &securedClusterTLSIssuerImpl{
	sensorNamespace:              sensorNamespace,
	sensorPodName:                sensorPodName,
	k8sClient:                    k8sClient,
	msgToCentralC:                msgToCentralC,
	msgFromCentralC:              msgFromCentralC,
	certRefreshBackoff:           certRefreshBackoff,
	getCertificateRefresherFn:    newCertificatesRefresher,
	getServiceCertificatesRepoFn: securedcluster.NewServiceCertificatesRepo,
	certRequester:                securedcluster.NewCertificateRequester(msgToCentralC, msgFromCentralC),
}
```
If we look at this block, it should be trivial to refactor it so that the certRequester creates the channel and returns it. It would also know best when to close it (currently this channel is never closed).
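A minimal sketch of this suggested refactor, using stand-in types (`Message` and `certificateRequester` here are illustrative, not the actual stackrox types): the requester creates the channel, is its only writer, exposes it read-only, and can therefore close it safely on Stop.

```go
package main

import "fmt"

// Message stands in for *message.ExpiringMessage.
type Message struct{ ID string }

// certificateRequester owns msgToCentralC: it creates the channel and is
// the only writer, so it alone may close it.
type certificateRequester struct {
	msgToCentralC chan *Message
}

func newCertificateRequester() *certificateRequester {
	return &certificateRequester{msgToCentralC: make(chan *Message, 1)}
}

// MsgToCentralC exposes the channel read-only to the TLS issuer.
func (r *certificateRequester) MsgToCentralC() <-chan *Message { return r.msgToCentralC }

// Send is the single write path into the channel.
func (r *certificateRequester) Send(m *Message) { r.msgToCentralC <- m }

// Stop closes the channel; only the owner calls this.
func (r *certificateRequester) Stop() { close(r.msgToCentralC) }

func main() {
	r := newCertificateRequester()
	r.Send(&Message{ID: "req-1"})
	r.Stop()
	m, ok := <-r.MsgToCentralC()
	fmt.Println(m.ID, ok) // buffered value is still delivered after close
	_, ok = <-r.MsgToCentralC()
	fmt.Println(ok) // false: channel closed by its owner
}
```

Note that a buffered value already in the channel is still delivered after close; only subsequent receives report the channel as closed.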
```go
// Doesn't block even if the corresponding call to RequestCertificates is cancelled and no one
// ever reads this, because requestC has buffer of 1, and we removed it from `r.requests` above,
// in case we get more than 1 response for `msg.GetRequestId()`.
responseC.(chan *central.IssueLocalScannerCertsResponse) <- msg
```
Oh, this will block, or the message will be dropped when Sensor is offline. We should add tests for this component that cover operation in offline mode and transitioning between modes (something happens online, then we go offline for some time, then go back online, and the logic should still work).
I agree that this can be improved, but I actually don't think this is blocking; it might just be confusing because of the naming.
We have several channels here:
- r.receiveC: messages received from Central.
- r.sendC: messages to be sent to Central.
- responseC: used internally by the requester, to pass messages that were received on r.receiveC to another goroutine. It's a buffered channel of size 1, and we make sure to write to it only once. It doesn't help that at creation time this channel is called receiveC in the same file.
The message also won't be dropped if Sensor is offline. If we got to this point, we already have the message from Central; it's just being passed from one goroutine to another.
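The size-1 buffer pattern described above can be sketched like this (illustrative names; `dispatcher` and `dispatch` stand in for the real request-tracking code): deleting the map entry before sending guarantees at most one write per channel, and the buffer of 1 means that single write never blocks, even if the requester already gave up.

```go
package main

import "fmt"

// dispatcher tracks pending requests by ID; each pending request gets a
// buffered channel of size 1 for its response.
type dispatcher struct {
	requests map[string]chan string
}

// dispatch delivers a response for the given request ID, or drops it if
// the request is unknown or was already answered.
func (d *dispatcher) dispatch(id, resp string) bool {
	responseC, ok := d.requests[id]
	if !ok {
		return false // duplicate or unknown response: dropped
	}
	delete(d.requests, id) // ensures at most one write to responseC
	responseC <- resp      // buffer of 1: never blocks, even unread
	return true
}

func main() {
	d := &dispatcher{requests: map[string]chan string{
		"req-1": make(chan string, 1),
	}}
	fmt.Println(d.dispatch("req-1", "certs")) // delivered
	fmt.Println(d.dispatch("req-1", "certs")) // second response for same ID: dropped
}
```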
Description
This PR adds Sensor-side handling for the gRPC endpoint implemented here:
It was preceded by several refactor PRs:
The new Sensor component (SecuredClusterTLSIssuer) checks the Secured Cluster TLS certificates at startup. If they're missing or nearing expiration, it automatically requests new ones. Additionally, it schedules a certificate refresh when the certificates reach ~50% of their validity period, so there is little risk of the certificates becoming stale if Sensor never restarts during the last months of certificate validity.
The added components (the Secured Cluster TLS Issuer and Certificate Requester) are based on their Local Scanner counterparts, with a few changes, common code extracted, and tests adapted. Because these components are tightly coupled to their respective gRPC APIs, this was the easier approach, although it results in some similar code; the majority of the certrefresh code is reused anyway. It also makes it easier for the Secured Cluster cert refresh logic to diverge, if necessary.
This PR uses generics to remove the duplication mentioned above, at the expense of adding complexity to already quite complicated code. It's also worth mentioning that from 4.7 onwards, the Local Scanner certificate refresher will be provided only for backwards compatibility with older Centrals, and will eventually be deprecated.
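One way such deduplication with generics might look, using stand-in types rather than the actual central proto messages (a sketch, not the PR's implementation): a single requester parameterized over the response type serves both the Local Scanner and Secured Cluster flows.

```go
package main

import "fmt"

// Stand-ins for the two gRPC response types the requesters handle.
type LocalScannerCertsResponse struct{ RequestID string }
type SecuredClusterCertsResponse struct{ RequestID string }

// requester is generic over the response type, so one implementation
// can back both certificate flows.
type requester[R any] struct {
	receiveC <-chan R
}

// await blocks until a response arrives on receiveC.
func (r *requester[R]) await() R { return <-r.receiveC }

func main() {
	c := make(chan *SecuredClusterCertsResponse, 1)
	c <- &SecuredClusterCertsResponse{RequestID: "req-1"}
	r := &requester[*SecuredClusterCertsResponse]{receiveC: c}
	fmt.Println(r.await().RequestID)
}
```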
❗Important notes ❗
User-facing documentation
Testing and quality
Automated testing
How I validated my change
With a local deployment, I checked that the new set of Secured Cluster certificates is created. I also tried running with a short refresh interval (20 seconds instead of 9 months) to check that the certificates are rotated.