Skip to content

ROX-27014: Cert refresh concurrency clean-up#13369

Merged
vladbologa merged 8 commits intomasterfrom
vb/cert-refresh-channels
Jan 7, 2025
Merged

ROX-27014: Cert refresh concurrency clean-up#13369
vladbologa merged 8 commits intomasterfrom
vb/cert-refresh-channels

Conversation

@vladbologa
Copy link
Contributor

@vladbologa vladbologa commented Nov 19, 2024

Description

This PR aims to address the issues mentioned in this comment from another PR: #13229 (review)

The main changes are:

  • Refactor common code of LocalScannerTLSIssuer and SecuredClusterTLSIssuer into one type
  • merge Requester into TLSIssuer, as it makes more sense from a design perspective, they had similar responsibilities and were passing data back and forth between them
  • this in turn allows us to use only one channel (the mandatory ResponsesC channel of the Sensor component, the TLSIssuer)
  • the requestCertificates method no longer supports concurrent calls, as they were not needed and were increasing complexity. There are checks that prevent a second call to requestCertificates if the previous call did not complete.
  • The channel that used to send the messages from Central from ProcessMessage to DispatchResponse has been removed, and replaced with a simple goroutine call.
  • The sync.Map that was keeping a response channel for each request has also been removed, and replaced with a concurrency.Signal that signals when a response has been received.

There is still similar looking code remaining, but only in the tests of LocalScannerTLSIssuer and SecuredClusterTLSIssuer. This could be made more generic and consolidated, but given that LocalScannerTLSIssuer is deprecated and will be removed, it will be much easier to remove as it is (simply deleting the files).

Support for Sensor offline mode will be added in a follow-up PR: #13596

How the cert refresher works

The main components here are:

  • TLSIssuer (LocalScannerTLSIssuerImpl and SecuredClusterTLSIssuerImpl), which implement SensorComponent. They are independent of each other, and LocalScannerTLSIssuer will be deprecated in the future.
  • the CertificateRefresher a ticker that periodically checks certificate validity

If certs are expired, the refresher makes a request to Central (using tlsIssuer.requestCertificates) which is passed to the msgToCentralC chan. It's ok if Sensor is offline, the context will expire and we'll try again - but #13596 adds explicit support for Sensor offline mode. It's a synchronous function that waits until there's a reply (or an error, or the context expires). Central replies in an async way, and that is handled by ProcessMessage() in TLSIssuer.

User-facing documentation

  • CHANGELOG is updated OR update is not needed
  • documentation PR is created and is linked above OR is not needed

Testing and quality

  • the change is production ready: the change is GA or otherwise the functionality is gated by a feature flag
  • CI results are inspected

Automated testing

  • added unit tests
  • added e2e tests
  • added regression tests
  • added compatibility tests
  • modified existing tests

How I validated my change

Manual smoke test that shows that the Secured Cluster cert refresher works:

SENSOR_HELM_MANAGED=true ./deploy/k8s/deploy-local.sh

❯ kubectl -n stackrox get secret | grep tls-
tls-cert-collector                                        Opaque               3      23s
tls-cert-admission-control                                Opaque               3      23s
tls-cert-scanner-v4-db                                    Opaque               3      22s
tls-cert-scanner                                          Opaque               3      22s
tls-cert-scanner-db                                       Opaque               3      21s
tls-cert-scanner-v4-indexer                               Opaque               3      21s
tls-cert-sensor                                           Opaque               3      21s

@openshift-ci
Copy link

openshift-ci bot commented Nov 19, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@rhacs-bot
Copy link
Contributor

rhacs-bot commented Nov 19, 2024

Images are ready for the commit at e067b50.

To use with deploy scripts, first export MAIN_IMAGE_TAG=4.7.x-283-ge067b50fcd.

@codecov
Copy link

codecov bot commented Nov 19, 2024

Codecov Report

Attention: Patch coverage is 76.47059% with 8 lines in your changes missing coverage. Please review.

Project coverage is 48.56%. Comparing base (c9ca8c8) to head (5a2956b).
Report is 47 commits behind head on master.

Files with missing lines Patch % Lines
...r/kubernetes/certrefresh/certificates/requester.go 66.66% 8 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #13369      +/-   ##
==========================================
- Coverage   48.56%   48.56%   -0.01%     
==========================================
  Files        2473     2473              
  Lines      178760   178759       -1     
==========================================
- Hits        86821    86818       -3     
- Misses      84991    84993       +2     
  Partials     6948     6948              
Flag Coverage Δ
go-unit-tests 48.56% <76.47%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@vladbologa vladbologa force-pushed the vb/cert-refresh-channels branch from 5a92d1a to 3268667 Compare November 19, 2024 14:01
@vladbologa vladbologa requested a review from vikin91 November 20, 2024 18:40
@vladbologa vladbologa changed the title ROX-26055: Cert refresh concurrency clean-up ROX-27014: Cert refresh concurrency clean-up Nov 21, 2024
@vladbologa vladbologa force-pushed the vb/cert-refresh-channels branch from 3268667 to 5a2956b Compare November 21, 2024 17:15
@vladbologa vladbologa marked this pull request as ready for review November 28, 2024 09:42
@vladbologa vladbologa requested a review from a team as a code owner November 28, 2024 09:42
Copy link
Contributor

@vikin91 vikin91 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, I must return it again. The changes still violate the rules on go concurrency that we try to stick to when writing new components or refactoring existing ones. I am happy to follow up on each comment that I posted, as I understand that the patterns may be not immediately clear.

The important factors for us are:

  1. Channels must always be closed when we know that there will be no more messages written to it.
  2. The writer goroutine (or object) is responsible for opening, writing, and closing the channel.
  3. We always return channels in read-only mode (there are some exceptions to this rule).
  4. We should avoid creating channels in Start() and closing them in Stop() functions of the components - unless there is a good reason to do so.
  5. Aways make sure that the code works correctly with unbuffered channels, only then add buffer if that improves the performance (and always comment why there are buffers).

In the following, I will try to respond to the PR description. It may be not the best form of addressing that, so please let me know if we should speak about it in a meeting.

There weren't more objects writing to a channel,

I saw it differently, so please convince me if you think I was wrong.

For now I left that as is. It's not clear to me if Sensor components should support calling Stop() and then Start() again, but depending on that, changes might be needed. I don't think it's a problem if we leave the msgToCentralC channel open when Sensor is in offline mode.

Unfortunately, it is a problem that cost us a lot of time when a bug in Sensor gets reported. For simplicity, please assume that multiple calls to Start and Stop are possible during runtime and there is no guarantee that each Start is followed by Stop and each Stop is followed by Start.

Regarding the offline mode - I would still ask you to write an integration test for this component that cycles through offline and online modes to be sure that it works correctly.

The CertificateRequester periodically checks cert validity. If required, it makes a request to Central (using RequestCertificates() in certificates.go) which is passed to the msgToCentralC chan. It's ok if Sensor is offline, the context will expire and we'll try again. It's a synchronous function that waits until there's a reply (or an error, or the context expires).

Wouldn't it be nicer to check if Sensor is online and only then send the message? Or even pause the periodic checks while offline and run one immediately after changing to online?

@vladbologa vladbologa marked this pull request as draft December 2, 2024 09:49
@vladbologa
Copy link
Contributor Author

There weren't more objects writing to a channel,

I saw it differently, so please convince me if you think I was wrong.

I think it's because there were different signals with similar names, that's why I renamed some of them in this PR. We have the following channels:

  • msgToCentralC -> only written to in requester.go:221
  • certsResponseC -> only written to in requester.go:176
  • respFromCentralC -> only written to in securedcluster_tls_issuer.go:127 (and in localscanner_tls_issuer.go:132 but it's not the same channel)

If you think I missed something let me know. I'll try to address the other comments.

@vikin91
Copy link
Contributor

vikin91 commented Dec 5, 2024

There weren't more objects writing to a channel,

I saw it differently, so please convince me if you think I was wrong.

I think it's because there were different signals with similar names, that's why I renamed some of them in this PR. We have the following channels:

  • msgToCentralC -> only written to in requester.go:221
  • certsResponseC -> only written to in requester.go:176
  • respFromCentralC -> only written to in securedcluster_tls_issuer.go:127 (and in localscanner_tls_issuer.go:132 but it's not the same channel)

If you think I missed something let me know. I'll try to address the other comments.

I think, we should look at this together in a call. Would that be okay? If yes, then feel free to schedule something - even Friday tomorrow after 10:30 is okay for me.

@vladbologa vladbologa force-pushed the vb/cert-refresh-channels branch 7 times, most recently from 5117826 to e3598df Compare December 12, 2024 15:50
* Create send channel in requester
* Add test and some comments
* Rename channels
* Explicitly close internal channel
@vladbologa vladbologa force-pushed the vb/cert-refresh-channels branch from e3598df to 23c8982 Compare December 12, 2024 15:50
@vladbologa vladbologa marked this pull request as ready for review December 12, 2024 15:56
Copy link
Contributor

@vikin91 vikin91 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That PR is great, because it pushes the problematic part to a single place in the code and that makes it much easier to comment on that and suggest a solution.

I see here one blocker (marked as "the blocker") and few other ideas for improvement.
Some comments are strictly meant to propose a solution to the blocker, so I would suggest the following order of reading this review:

  • Read about the blocker issue first - it uses the 🛑 icon
  • Then check my proposal how to resolve it - the comments starts with "(solution to the blocker)"
  • All other comments contain various ideas for improving and are less crucial (but shall still be addressed).

Copy link
Contributor

@porridge porridge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the direction of this. Reducing the complexity is great!

I would even go as far as suggesting an experiment where you make a generic issuer (the two issuer types seem identical modulo some strings and parameters, and the response extraction function) and try to merge issuer and requester. The responsibility boundary between them always seemed very fuzzy to me, and the fact that they call each other in what seems like cycles in my small brain, suggests that they should in fact be a single object. We might end up with something much more easy to understand?

I'm not sure how this would impact testability, but I think it might be worth a try.

Also, I noticed that the NewResponseFrom*Certs functions might be possible to merge into one function that takes a custom interface parameter, since it only uses function calls to retrieve information from the proto object.

Some questions inline.

@vladbologa vladbologa force-pushed the vb/cert-refresh-channels branch from 5a2e620 to 11fc05c Compare December 20, 2024 09:20
Copy link
Contributor

@vikin91 vikin91 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The recent changes are really great as they eliminated all issues that I worried about - many thanks for taking the legacy code and agreeing to improve it to our current standards!

I see no blockers to merging this, but I commented on some places that may need to be checked and improved. I am giving it ✅ , but please check my comments.

@vladbologa vladbologa force-pushed the vb/cert-refresh-channels branch from 2988890 to 3d2c97b Compare December 20, 2024 16:17
Copy link
Contributor

@vikin91 vikin91 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good, thanks for improving!

@vladbologa vladbologa merged commit be1384a into master Jan 7, 2025
@vladbologa vladbologa deleted the vb/cert-refresh-channels branch January 7, 2025 11:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants