Skip to content

ROX-25946, ROX-25947: New cluster registration logic#13239

Merged
mclasmeier merged 95 commits intomasterfrom
mc/crs-40-central-cluster-registration
Jan 16, 2025
Merged

ROX-25946, ROX-25947: New cluster registration logic#13239
mclasmeier merged 95 commits intomasterfrom
mc/crs-40-central-cluster-registration

Conversation

@mclasmeier
Copy link
Contributor

@mclasmeier mclasmeier commented Nov 6, 2024

Description

This PR implements the CRS-based registration flow between central and sensor. The sensor-side of the CRS flow is implemented as a sub-command of the sensor binary for convenient testing and usage in the form of an init container in the sensor pod.

Some reviewing notes:

  • To enable MTLS authentication using the CRS credentials, which initially only exist wrapped in some opaque secret, the cert+key are extracted and written to storage (which will be backed in a follow-up PR by a meory-based emptyDir) so that they can be referenced as files and thus injected into the existing MTLS-based authentication flow.

User-facing documentation

  • CHANGELOG update is not needed (will be done in a follow-up PR)
  • documentation PR is not needed

Testing and quality

Automated testing

How I validated my change

In order to keep PRs at a manageable size this PR only contains the changes for the flow between central and sensor. Actually leveraging this flow in our Helm charts is part of #12713. Thus, most testing has been done on the feature branch which contains the Helm chart changes, which is based on this branch.

@openshift-ci
Copy link

openshift-ci bot commented Nov 6, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@rhacs-bot
Copy link
Contributor

rhacs-bot commented Nov 6, 2024

Images are ready for the commit at 690cae2.

To use with deploy scripts, first export MAIN_IMAGE_TAG=4.7.x-479-g690cae2e76.

@codecov
Copy link

codecov bot commented Nov 6, 2024

Codecov Report

Attention: Patch coverage is 23.72881% with 90 lines in your changes missing coverage. Please review.

Project coverage is 48.56%. Comparing base (d5b309b) to head (e79a523).
Report is 44 commits behind head on master.

Files with missing lines Patch % Lines
pkg/protoconv/certs/typed_service_certificates.go 18.75% 51 Missing and 1 partial ⚠️
central/sensor/service/service_impl.go 0.00% 17 Missing ⚠️
pkg/mtls/env.go 0.00% 5 Missing ⚠️
pkg/services/convert.go 0.00% 5 Missing ⚠️
central/cluster/datastore/datastore_impl.go 0.00% 4 Missing ⚠️
pkg/mtls/crypto.go 0.00% 4 Missing ⚠️
sensor/common/centralclient/client.go 82.35% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #13239      +/-   ##
==========================================
+ Coverage   48.54%   48.56%   +0.01%     
==========================================
  Files        2467     2474       +7     
  Lines      177811   178837    +1026     
==========================================
+ Hits        86319    86849     +530     
- Misses      84564    85045     +481     
- Partials     6928     6943      +15     
Flag Coverage Δ
go-unit-tests 48.56% <23.72%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


🚨 Try these New Features:

@mclasmeier mclasmeier force-pushed the mc/crs-40-central-cluster-registration branch 3 times, most recently from 4d2b146 to e744aa1 Compare November 12, 2024 11:23
@mclasmeier mclasmeier marked this pull request as ready for review November 12, 2024 23:44
@mclasmeier mclasmeier requested a review from a team as a code owner November 12, 2024 23:44
@mclasmeier mclasmeier requested review from porridge and vladbologa and removed request for a team November 12, 2024 23:44
@mclasmeier
Copy link
Contributor Author

/retest

Copy link
Contributor

@porridge porridge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall, but I have some questions, some important change requests, and a whole bunch of optional nitpicks.

@mclasmeier mclasmeier requested a review from porridge November 19, 2024 21:07
@mclasmeier
Copy link
Contributor Author

/test all

@mclasmeier mclasmeier force-pushed the mc/crs-40-central-cluster-registration branch from 7d1c267 to 769d838 Compare November 23, 2024 23:27
@mclasmeier
Copy link
Contributor Author

FTR, the updated code, including the refactoring of the crs flow implementation, is tested successfully in a follow-up PR.

@mclasmeier
Copy link
Contributor Author

/retest

Copy link
Contributor

@lvalerom lvalerom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finished a first pass. It looks good. I have a couple of overall questions:

  • Out of curiosity, are these init containers executed every time the pod restarts or only the first time?
  • Do I need to rebase any branches in order to test these changes? I'd like to play a little bit with different sensor/central versions.

@mclasmeier
Copy link
Contributor Author

Finished a first pass. It looks good. I have a couple of overall questions:

  • Out of curiosity, are these init containers executed every time the pod restarts or only the first time?
  • Do I need to rebase any branches in order to test these changes? I'd like to play a little bit with different sensor/central versions.

The init containers are executed on every pod start. But the CRS container would terminate very quickly, because it sees that nothing is to do and the init-tls-certs container should be very quick as well, it's a custom go binary that just does a filesystem lookup and copies to an emptyDir.

@mclasmeier
Copy link
Contributor Author

  • Do I need to rebase any branches in order to test these changes? I'd like to play a little bit with different sensor/central versions.

Yes, you can test this. Go to this branch for that. It contains the changes of this branch plus the necessary modifications to the Helm chart.

@mclasmeier mclasmeier requested a review from lvalerom January 13, 2025 09:49
…ficateSet, adjust code and tests accordingly
@lvalerom
Copy link
Contributor

@mclasmeier I was able to make sensor panic by:

  1. Deploying a central that supports CRS installations
  2. Generating the helm charts from ROX-25944: CRS Support in Helm Chart #12713 using roxctl
  3. Generating an init-bundle
  4. Deploying sensor with:
    • Image tag: 4.6.1
    • ROX_DEPLOY_SENSOR_WITH_CRS set to true
    • Using the helm charts generated in step (2)

I'm not sure if this is something that we even want to address. It looks to me like a very extreme case but it thought it was worth mentioning.

@mclasmeier
Copy link
Contributor Author

@mclasmeier I was able to make sensor panic by:

  1. Deploying a central that supports CRS installations

  2. Generating the helm charts from ROX-25944: CRS Support in Helm Chart #12713 using roxctl

  3. Generating an init-bundle

  4. Deploying sensor with:

    • Image tag: 4.6.1
    • ROX_DEPLOY_SENSOR_WITH_CRS set to true
    • Using the helm charts generated in step (2)

I'm not sure if this is something that we even want to address. It looks to me like a very extreme case but it thought it was worth mentioning.

Thank you for the testing!

I think what you are outlining above seems like somewhat of an artificial breakage caused by combining code fragments from different versions that were not designed to play nice with each other. My guess is that in your scenario the following happens:

ROX_DEPLOY_SENSOR_WITH_CRS=true (within the deploy scripts) causes the new Helm chart to use CRS. But the specified sensor (4.6.1) doesn't know how to do CRS, hence sensor would immediately try to start normally as a service and fails to find certificates?

@mclasmeier
Copy link
Contributor Author

/retest

2 similar comments
@mclasmeier
Copy link
Contributor Author

/retest

@mclasmeier
Copy link
Contributor Author

/retest

@lvalerom
Copy link
Contributor

I think what you are outlining above seems like somewhat of an artificial breakage caused by combining code fragments from different versions that were not designed to play nice with each other.

Agree.

ROX_DEPLOY_SENSOR_WITH_CRS=true (within the deploy scripts) causes the new Helm chart to use CRS.

Just to clarify. I didn't use the deploy scripts. I generated the helm charts with roxctl v4.7 and I had the ROX_DEPLOY_SENSOR_WITH_CRS set to true in my terminal, but I deployed directly with helm.

@lvalerom
Copy link
Contributor

hence sensor would immediately try to start normally as a service and fails to find certificates?

Yeah sensor panics because it cannot find the certificates. There's only one way I can see we can avoid this (if we want to):

  • Patching old sensor to not panic if the certificates are missing.
  • I don't know if it's possible but we could add a check in the helm charts that makes sure the CRS path is not taken if sensor version < 4.7 (?)

Copy link
Contributor

@lvalerom lvalerom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thank you for the changes!

@mclasmeier
Copy link
Contributor Author

  • Patching old sensor to not panic if the certificates are missing.

Isn't sensor deliberately panicking in such a case? In any case, it would still require an upgrade on the user's side to benefit from whatever we change here.

  • I don't know if it's possible but we could add a check in the helm charts that makes sure the CRS path is not taken if sensor version < 4.7 (?)

Thank you, I think this one -- if feasible -- would be the way to go. But not as part of this PR.

@mclasmeier
Copy link
Contributor Author

Just to clarify. I didn't use the deploy scripts. I generated the helm charts with roxctl v4.7 and I had the ROX_DEPLOY_SENSOR_WITH_CRS set to true in my terminal, but I deployed directly with helm.

Hmm.
I think ROX_DEPLOY_SENSOR_WITH_CRS is only relevant for our deploy scripts, it's not consumed during generation of the Helm chart with roxctl? That's why I thought you were using the deploy scripts.

@mclasmeier
Copy link
Contributor Author

LGTM! Thank you for the changes!

Thank you so much!

@mclasmeier mclasmeier requested a review from porridge January 16, 2025 09:16
@mclasmeier mclasmeier merged commit 0ee4721 into master Jan 16, 2025
@mclasmeier mclasmeier deleted the mc/crs-40-central-cluster-registration branch January 16, 2025 10:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants