Skip to content

ROX-8790: Create gRPC endpoint to generate Scanner certificate#219

Merged
juanrh merged 36 commits intomasterfrom
juanrh/ROX-8790
Jan 17, 2022
Merged

ROX-8790: Create gRPC endpoint to generate Scanner certificate#219
juanrh merged 36 commits intomasterfrom
juanrh/ROX-8790

Conversation

@juanrh
Copy link
Contributor

@juanrh juanrh commented Jan 3, 2022

Description

Create new gRPC endpoint in Central that Sensor can use to ask for the TLS certificates for a local Scanner. See design doc for additional context.

Note: this also has includes commits from #211, which hasn't been merged. First relevant commit is 76d65ee.

Checklist

  • Investigated and inspected CI test results
  • Unit test and regression tests added
  • Evaluated and added CHANGELOG entry if required
  • Determined and documented upgrade steps

This creates a new gRPC endpoint, so it doesn't change the behaviour of the system, therefore there no need to add neither a CHANGELOG entry nor upgrade instructions.

Testing Performed

Added additional unit test.

@ghost
Copy link

ghost commented Jan 3, 2022

Tag for build #107531 is 3.67.x-267-g747e03b381.

💻 For deploying this image using the dev scripts, run the following first:

export MAIN_IMAGE_TAG='3.67.x-267-g747e03b381'

📦 You can also generate an installation bundle with:

docker run -i --rm stackrox/main:3.67.x-267-g747e03b381 central generate interactive > bundle.zip

🕹️ A roxctl binary artifact can be downloaded from CircleCI.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be inferred from the connection credentials, rather than specified explicitly by the client?
As things are now, this would allow cluster A to generate scanner certs for cluster B, which shouldn't be allowed or necessary IMHO?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now getting the cluster id from authn.IdentityFromContextOrNil(ctx) following what is done in certdistribution service. Please let me know if that works

Copy link
Contributor Author

@juanrh juanrh Jan 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here followed what it's done in serviceImpl.Communicate and serviceImpl.getClusterForConnection from central/sensor/service/service_impl.go to get the cluster id using authn.IdentityFromContext. However, after investigating how authn.IdentityFromContext works, I understand this would use as cluster id whatever was used for the Identifier field of the mtls.Subject that was used to issue the sensor certificates. Except for statically registered clusters, that should always be either centralsensor.RegisteredInitCertClusterID or centralsensor.EphemeralInitCertClusterID, not the cluster id that Central eventually assigns during the hello protocol, that is returned in the CentralHello message.

I also see that Sensor caches CentralHello.cluster_id in /var/cache/stackrox/cluster-id (with helmconfig.StoreCachedClusterID in centralCommunicationImpl.initialSync), and adds it to sensorHello.HelmManagedConfigInit.ClusterId when it is available (with helmconfig.LoadCachedClusterIDin centralCommunicationImpl.sendEvents), which I guess happens on a restart of the sensor Pod, when the sensor has already performed the hello protocol once, and already got a cluster id from central. This is an example of an existing gRPC in central that receives the cluster id twice, once in the request context as part of the TLS auth, and another in an argument of the rpc operation. This makes sense to me because, in order to be able to get the cluster id just from the auth, Sensor would need to create new TLS certificates that use the cluster id returned in CentralHello.cluster_id for the Identity field of the mtls.Subject, and then restart the connection to Central by calling client.Communicate with the new credentials, which is not currently being done.

So I was thinking on the following:

  • adding a parameter for the cluster id to IssueLocalScannerCertsRequest, and in central/localscanner/service.go for authorizeAndGetClusterID do something similar to what it's done in getClusterForConnection at central/sensor/service/service_impl.go that uses centralsensor.GetClusterID to check that the cluster id obtained from the auth is compatible with the one sent in the request. That would cover "would allow cluster A to generate scanner certs for cluster B", at least partially.
  • On a future PR, for the client side I was thinking of adding a public method to sensor/common/sensor/sensor.go to return s.centralConnection, and pass that to the operator in sensor/kubernetes/main.go. Checking how grpcUtil.LazyClientConn works, that would block any operation on sensor that uses s.centralConnection until the connection setup is completed in sensor.gRPCConnectToCentralWithRetries that calls s.centralConnection.Set(centralConnection). I could then get the cluster id from clusterid.Get(), which also blocks until centralCommunicationImpl.initialSync calls clusterid.Set(clusterID) with the cluster id returned in CentralHello. That would ensure that the operator only calls IssueLocalScannerCerts when the connection to central is ready, and using the cluster id returned by central during hello.

Do you think that could work?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the first bullet point.

As for the second bullet point, I'm not sure all this complexity is worth it currently. Perhaps for a request from a sensor that presents an init cert cluster ID we just create an init-style (wildcard ID) scanner certificates?

After all, if Mallory has an init bundle, he could impersonate any cluster created with that bundle. (Potentially any other cluster as well, I don't know whether our code checks that.)

Once we tackle https://issues.redhat.com/browse/ROX-8091 we could tighten this up.

WDYT @SimonBaeumer ?

Copy link
Contributor

@SimonBaeumer SimonBaeumer Jan 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, yeah the complexity worries me too.
In the beginning I had the idea of an "embedded" operator running orthogonal to Sensor but given authn/authz complexities emerging now I would rather take a step back and ignore this for now.
Running the secret creation as part of Sensor itself and using the already established bi-directional connection to me looks like the better approach without speculative improvements.

The only downside I see is that the cert is generated only after the initial syncs.

For using the gRPC stream the message needs to be dispatched which is done in the connection_impl.go:runRecv which calls the connection_impl.go:handleMessage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Simon,

I've saved the previous prototype using the embedded operator as client in juanrh/ROX-8752-embedded-operator.
I'll now pivot to the new approach you outline. Just so we are on the same page, let me write down what I understand here:

  • The client will be now sensor/common/sensor/sensor.go
  • Instead of a new gRPC service LocalScannerService we use the existing bidirectional rcp Communicate from service SensorService, which implies adding new variants to message MsgFromSensor and message MsgToSensor, and using the methods you point out to handle the new messsages.
  • With initial sync I understand you mean the Hello protocol, because s.manager.HandleConnection(server.Context(), sensorHello, cluster, eventPipeline, server) is only called in serviceImpl.Communicate after the Centrall hello response is sent in server.Send(&central.MsgToSensor{Msg: &central.MsgToSensor_Hello{Hello: centralHello}})

I'll send a draft PR when I have to prototype to foster alignment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The client will be now sensor/common/sensor/sensor.go

Yes, more specifically started from the Sensor start-up routine and not outside of it (i.e. in main.go).

Instead of a new gRPC service LocalScannerService we use the existing bidirectional rcp Communicate from service SensorService, which implies adding new variants to message MsgFromSensor and message MsgToSensor, and using the methods you point out to handle the new messsages.

Exactly

With initial sync I understand you mean the Hello protocol, because s.manager.HandleConnection(server.Context(), sensorHello, cluster, eventPipeline, server) is only called in serviceImpl.Communicate after the Centrall hello response is sent in server.Send(&central.MsgToSensor{Msg: &central.MsgToSensor_Hello{Hello: centralHello}})

Yes. To clarify I meant the time to wait until the first scanner cert can be generated in Central.

@juanrh juanrh marked this pull request as ready for review January 4, 2022 16:54
Copy link
Contributor

@porridge porridge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes sense. Can you please rebase on the tip of the other branch for a final pass?

@juanrh juanrh requested a review from porridge January 5, 2022 13:43
Also add validations on caller identity
@juanrh juanrh requested a review from porridge January 5, 2022 16:12
Copy link
Contributor

@SimonBaeumer SimonBaeumer Jan 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, yeah the complexity worries me too.
In the beginning I had the idea of an "embedded" operator running orthogonal to Sensor but given authn/authz complexities emerging now I would rather take a step back and ignore this for now.
Running the secret creation as part of Sensor itself and using the already established bi-directional connection to me looks like the better approach without speculative improvements.

The only downside I see is that the cert is generated only after the initial syncs.

For using the gRPC stream the message needs to be dispatched which is done in the connection_impl.go:runRecv which calls the connection_impl.go:handleMessage.

Copy link
Contributor

@SimonBaeumer SimonBaeumer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool to see how you already navigate around the source!

Imho let's handle the cert gen message inside the stream instead of doing it outside to prevent duplicating logic and issues due to cluster initialization.

Copy link
Contributor

@SimonBaeumer SimonBaeumer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks already very promising!
@misberner could you take a look at the review & changes as well?

certs, err := localscanner.IssueLocalScannerCerts(namespace, c.clusterID)
errMsgTemplate := "Error issuing local Scanner certificates for cluster with ID %s and namespace %s"
if err != nil {
return errors.Wrapf(err, errMsgTemplate, c.clusterID, namespace)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, with this approach Sensor never get notified that the certificate was not generated. Imho Sensor should expect a valid response, even if containing an error (and Sensor should imho expect a response).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right, I modified IssueLocalScannerCertsResponse to use a oneof that returns either the certificates or the error message

@misberner
Copy link
Contributor

This looks already very promising! @misberner could you take a look at the review & changes as well?

I'm happy to TAL, but given the size of this PR I won't be able to do so before Friday at best. Is there maybe some subset of the PR that you specifically would like me to take a look at?

@juanrh
Copy link
Contributor Author

juanrh commented Jan 13, 2022

This looks already very promising! @misberner could you take a look at the review & changes as well?

I'm happy to TAL, but given the size of this PR I won't be able to do so before Friday at best. Is there maybe some subset of the PR that you specifically would like me to take a look at?

Thanks for taking a look @misberner. Here we are adding new variants to message MsgFromSensor and message MsgToSensor so Sensor can make a request to Central through the existing rpc Communicate. So I'd ask your feedback about the changes to central/sensor/service/connection/connection_impl.go and the proto files.

`envisolator` doesn't enable env vars in release builds

package central;

message LocalScannerCertificates {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, per the design doc, the reason to not request the certificates via the SensorHello protocol is that the local scanner cert's validity might be shorter than the lifetime of the sensor<>central connection. That makes sense. However, I find it unfortunate that you deviate from the format that significantly. Encoding the service name into the file name might not be the most elegant thing, but neither is sending two distinct ca certificates even if they can only ever possibly be the same.

If you want to keep it more explicitly structured, I'd still suggest to generalize this message to message ServiceCertificate (there really isn't anything scanner-specific here), and then structure request/response as follows:

message TypedServiceCertificate {
  storage.ServiceType service_type = 1;
  ServiceCertificate cert = 2;
}
message IssueServiceCertsRequest {
  repeated storage.ServiceType service_types = 1;
}
message IssueServiceCertsResponse {
  repeated TypedServiceCertificate service_certs = 1;
}

this will hardly be any harder to use, but way more future-proof. However, I would maintain that the map<string, string> cert_bundle approach is perfectly fine as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. I modified the protos to avoid sending the ca twice, and avoid local scanner specific stuff when it's not needed. However, I still think IssueLocalScannerCertsRequest make sense because local scanner certificates should have a specific expiration time of days. Let me know what you think

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense about the expiration, though I'd say that could be addressed by an extra parameter. In the end, we might want to switch to more short-lived certificates for other services as well.

Juan Rodriguez Hortala and others added 5 commits January 14, 2022 14:09
Co-authored-by: Malte Isberner <2822367+misberner@users.noreply.github.com>
Co-authored-by: Malte Isberner <2822367+misberner@users.noreply.github.com>
instead of assertions with info
Copy link
Contributor

@SimonBaeumer SimonBaeumer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!
Please wait for @porridge comment on the race condition in subtests before merging 👍

Comment on lines +142 to +143
for tcName, tc := range testCases {
s.Run(tcName, func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@porridge Is here a possible race? I remember you working on a problem here in tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was rumour from someone (unfortunately don't remember who that was) that the suite tests are run in parallel by default, in which case this would be racy. I vaguely remember the same person later claimed they might have been wrong. So it's inconclusive.
If you'd like to be safe, just add tc := tc right before s.Run() and mention something // TODO(ROX-8730): just in case this is racy so we remember to remove it if this turns out to not be necessary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never heard that, but I can promise you that suite tests are not run in parallel, and in fact cannot be run in parallel

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There. So it's safe.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll make that change anyway, as it doesn't hurt

Copy link
Contributor

@porridge porridge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few optional nitpicks inline.
This looks great!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants