Skip to content

ROX-27394: Sensor offline mode support in TLS issuer#13596

Merged
vladbologa merged 7 commits intomasterfrom
vb/sensor-offline-cert-refresh
Feb 25, 2025
Merged

ROX-27394: Sensor offline mode support in TLS issuer#13596
vladbologa merged 7 commits intomasterfrom
vb/sensor-offline-cert-refresh

Conversation

@vladbologa
Copy link
Contributor

@vladbologa vladbologa commented Dec 12, 2024

Description

This PR adds explicit handling for Sensor offline mode in the TLS issuer components. As these components cannot do anything useful in offline mode, the certificate refresh logic is now only active when Sensor is online.
(note: the TLS issuer was already resilient to Sensor being offline due to its retry mechanism)

Main changes:

  • start the certificate refresher only when the component is both started and online
  • changed behavior of Start / Stop so that they can be called in the wrong order (e.g. calling Start several times in a row is now fine)
  • Perform the Central capability check when Sensor enters online mode
  • integration tests that cycle through online and offline modes
  • unit test for Central capability check
  • some testing code clean-up

User-facing documentation

  • CHANGELOG is updated OR update is not needed
  • documentation PR is created and is linked above OR is not needed

Testing and quality

  • the change is production ready: the change is GA or otherwise the functionality is gated by a feature flag
  • CI results are inspected

Automated testing

  • added unit tests
  • added e2e tests
  • added regression tests
  • added compatibility tests
  • modified existing tests

How I validated my change

Manual smoke test that shows that the Secured Cluster cert refresher works:

# Disable CRS to trigger a cert refresh on startup (otherwise the CRS logic retrieves the certs)
ROX_DEPLOY_SENSOR_WITH_CRS=false SENSOR_HELM_MANAGED=true ./deploy/k8s/deploy-local.sh

# Wait for sensor to be in Ready state

❯ kubectl -n stackrox get secret | grep tls-
tls-cert-scanner-v4-db                                    Opaque               3      5m31s
tls-cert-sensor                                           Opaque               3      5m31s
tls-cert-collector                                        Opaque               3      5m30s
tls-cert-admission-control                                Opaque               3      5m30s
tls-cert-scanner                                          Opaque               3      5m29s
tls-cert-scanner-db                                       Opaque               3      5m29s
tls-cert-scanner-v4-indexer                               Opaque               3      5m29s 

Sensor logs:

kubernetes/certrefresh: 2025/02/20 15:59:19.365620 tls_issuer_common.go:139: Info: Component runs now in Online mode
...
kubernetes/certrefresh: 2025/02/20 15:59:21.641724 cert_refresher.go:110: Warn: secured cluster not found (this is expected on a new deployment), will refresh certificates immediately: 7 errors occurred:
	* secrets "tls-cert-scanner-db" not found
	* secrets "tls-cert-scanner-v4-indexer" not found
	* secrets "tls-cert-scanner-v4-db" not found
	* secrets "tls-cert-sensor" not found
	* secrets "tls-cert-collector" not found
	* secrets "tls-cert-admission-control" not found
	* secrets "tls-cert-scanner" not found

kubernetes/certrefresh: 2025/02/20 15:59:24.442826 cert_refresher.go:85: Info: successfully refreshed secured cluster for: SENSOR_SERVICE, COLLECTOR_SERVICE, ADMISSION_CONTROL_SERVICE, SCANNER_V4_DB_SERVICE, SCANNER_DB_SERVICE, SCANNER_SERVICE, SCANNER_V4_INDEXER_SERVICE
kubernetes/certrefresh: 2025/02/20 15:59:24.442884 cert_refresher.go:46: Info: secured cluster scheduled to be refreshed in 3604h37m14.557118633s

@openshift-ci
Copy link

openshift-ci bot commented Dec 12, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@vladbologa vladbologa changed the title WIP: Support Sensor offline mode in TLS issuer ROX-27394: Support Sensor offline mode in TLS issuer Dec 12, 2024
@rhacs-bot
Copy link
Contributor

rhacs-bot commented Dec 12, 2024

Images are ready for the commit at 949d12a.

To use with deploy scripts, first export MAIN_IMAGE_TAG=4.8.x-17-g949d12a184.

@vladbologa vladbologa force-pushed the vb/sensor-offline-cert-refresh branch 5 times, most recently from dbece3b to d2a3e4c Compare December 17, 2024 12:55
@vladbologa vladbologa changed the title ROX-27394: Support Sensor offline mode in TLS issuer ROX-27394: Sensor offline mode support in TLS issuer Dec 17, 2024
@vladbologa vladbologa force-pushed the vb/sensor-offline-cert-refresh branch from 0c5f579 to 74476fc Compare December 17, 2024 14:11
@vladbologa vladbologa marked this pull request as ready for review December 17, 2024 14:58
@vladbologa vladbologa requested a review from a team as a code owner December 17, 2024 14:58
@vladbologa vladbologa requested a review from vikin91 December 17, 2024 14:58
Copy link
Contributor

@vikin91 vikin91 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Vlad, I just quickly browsed the code and have two impressions:

  1. The changes required to handle offline mode look pretty complex. This is not typical. Are you sure that we cannot simplify somewhere?
  2. I read the PR description (it is great and all points make sense) and I think you still need to add the integration tests. They are pretty important for testing offline mode as they typically uncover many problems. Please make sure to add them.

I will go over the code properly soon (I hope).

@vladbologa
Copy link
Contributor Author

vladbologa commented Dec 19, 2024

Vlad, I just quickly browsed the code and have two impressions:

  1. The changes required to handle offline mode look pretty complex. This is not typical. Are you sure that we cannot simplify somewhere?

Most changes are in the tests (and I also extracted some common test functions).
I'm also applying the same changes to 2 Sensor components (LocalScannerTLSIssuer the legacy component, and SecuredClusterTLSIssuer the new component), which makes the change look bigger.

But otherwise I wouldn't say that the changes are complex, it's mostly moving some code from the Start/Stop methods to goOnline/goOffline.

  1. I read the PR description (it is great and all points make sense) and I think you still need to add the integration tests. They are pretty important for testing offline mode as they typically uncover many problems. Please make sure to add them.

I did add tests for cycling between offline and online for both components here:

func (s *securedClusterTLSIssuerIntegrationTests) TestSensorOnlineOfflineModes() {

and here:
func (s *localScannerTLSIssuerIntegrationTests) TestSensorOnlineOfflineModes() {

EDIT: The above refers to an older version of this PR. The code has changed significantly since then, and now there are no distinct LocalScannerTLSIssuer and SecuredClusterTLSIssuer, they are now both the same type, with different configuration.

Base automatically changed from vb/cert-refresh-channels to master January 7, 2025 11:08
@vladbologa vladbologa marked this pull request as draft January 9, 2025 13:52
@vladbologa vladbologa force-pushed the vb/sensor-offline-cert-refresh branch from e88e47f to f7d07e3 Compare January 31, 2025 13:36
@vladbologa vladbologa marked this pull request as ready for review January 31, 2025 14:52
@vladbologa vladbologa requested a review from vikin91 January 31, 2025 14:58
@vladbologa vladbologa force-pushed the vb/sensor-offline-cert-refresh branch from 3539f93 to 82d798b Compare February 3, 2025 09:34
@codecov
Copy link

codecov bot commented Feb 3, 2025

Codecov Report

Attention: Patch coverage is 95.08197% with 3 lines in your changes missing coverage. Please review.

Project coverage is 49.04%. Comparing base (d4e1ed2) to head (949d12a).
Report is 85 commits behind head on master.

Files with missing lines Patch % Lines
sensor/kubernetes/certrefresh/tls_issuer_common.go 94.33% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #13596      +/-   ##
==========================================
- Coverage   49.07%   49.04%   -0.04%     
==========================================
  Files        2514     2524      +10     
  Lines      182769   183446     +677     
==========================================
+ Hits        89691    89963     +272     
- Misses      85956    86363     +407     
+ Partials     7122     7120       -2     
Flag Coverage Δ
go-unit-tests 49.04% <95.08%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@vladbologa vladbologa force-pushed the vb/sensor-offline-cert-refresh branch from 82d798b to d25f9c3 Compare February 5, 2025 09:13
@vikin91
Copy link
Contributor

vikin91 commented Feb 12, 2025

Sorry @vladbologa , I didn't make it with the review before my PTO. Do you maybe want to ping others from Maple for a review?

@vladbologa vladbologa requested a review from lvalerom February 13, 2025 09:32
Copy link
Contributor

@lvalerom lvalerom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to have a closer look a the tests but here's some early feedback to the production logic.

@vladbologa vladbologa force-pushed the vb/sensor-offline-cert-refresh branch from 8331dd0 to 656a37f Compare February 13, 2025 18:58
@vladbologa vladbologa requested a review from a team as a code owner February 20, 2025 15:03
@vladbologa vladbologa requested a review from lvalerom February 20, 2025 15:05
@vladbologa
Copy link
Contributor Author

/retest

Copy link
Contributor

@lvalerom lvalerom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the handling of the cancel function should be protected with a mutex but other than that looks really good.

@vladbologa vladbologa requested a review from lvalerom February 24, 2025 15:34
Copy link
Contributor

@lvalerom lvalerom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thank you for making the changes!

@vladbologa vladbologa merged commit b3acb5d into master Feb 25, 2025
92 checks passed
@vladbologa vladbologa deleted the vb/sensor-offline-cert-refresh branch February 25, 2025 09:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants