ROX-8790: Create gRPC endpoint to generate Scanner certificate#219
Conversation
central/localscanner/service.go
Outdated
Shouldn't this be inferred from the connection credentials, rather than specified explicitly by the client?
As things are now, this would allow cluster A to generate scanner certs for cluster B, which shouldn't be allowed or necessary IMHO?
Now getting the cluster id from `authn.IdentityFromContextOrNil(ctx)`, following what is done in the certdistribution service. Please let me know if that works.
Here I followed what is done in `serviceImpl.Communicate` and `serviceImpl.getClusterForConnection` in central/sensor/service/service_impl.go to get the cluster id using `authn.IdentityFromContext`. However, after investigating how `authn.IdentityFromContext` works, I understand it would use as cluster id whatever was set in the `Identifier` field of the `mtls.Subject` used to issue the sensor certificates. Except for statically registered clusters, that should always be either `centralsensor.RegisteredInitCertClusterID` or `centralsensor.EphemeralInitCertClusterID`, not the cluster id that Central eventually assigns during the hello protocol and returns in the `CentralHello` message.
I also see that Sensor caches `CentralHello.cluster_id` in /var/cache/stackrox/cluster-id (with `helmconfig.StoreCachedClusterID` in `centralCommunicationImpl.initialSync`), and adds it to `sensorHello.HelmManagedConfigInit.ClusterId` when it is available (with `helmconfig.LoadCachedClusterID` in `centralCommunicationImpl.sendEvents`). I guess that happens on a restart of the sensor Pod, when the sensor has already performed the hello protocol once and already got a cluster id from Central. So this is an example of an existing gRPC service in Central that receives the cluster id twice: once in the request context as part of the TLS auth, and again in an argument of the rpc operation. This makes sense to me because, in order to get the cluster id from the auth alone, Sensor would need to create new TLS certificates that use the cluster id returned in `CentralHello.cluster_id` for the `Identifier` field of the `mtls.Subject`, and then restart the connection to Central by calling `client.Communicate` with the new credentials, which is not currently done.
So I was thinking of the following:
- Adding a parameter for the cluster id to `IssueLocalScannerCertsRequest`, and in central/localscanner/service.go having `authorizeAndGetClusterID` do something similar to what is done in `getClusterForConnection` in central/sensor/service/service_impl.go, which uses `centralsensor.GetClusterID` to check that the cluster id obtained from the auth is compatible with the one sent in the request. That would cover "would allow cluster A to generate scanner certs for cluster B", at least partially.
- In a future PR, for the client side, adding a public method to sensor/common/sensor/sensor.go that returns `s.centralConnection`, and passing that to the operator in sensor/kubernetes/main.go. Checking how `grpcUtil.LazyClientConn` works, that would block any operation on sensor that uses `s.centralConnection` until the connection setup is completed in `sensor.gRPCConnectToCentralWithRetries`, which calls `s.centralConnection.Set(centralConnection)`. I could then get the cluster id from `clusterid.Get()`, which also blocks until `centralCommunicationImpl.initialSync` calls `clusterid.Set(clusterID)` with the cluster id returned in `CentralHello`. That would ensure that the operator only calls `IssueLocalScannerCerts` when the connection to central is ready, using the cluster id returned by central during hello.
Do you think that could work?
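The check in the first bullet point could look roughly like this sketch, assuming (in the spirit of `centralsensor.GetClusterID`) that init-bundle certificates carry a wildcard cluster id while statically registered clusters carry a concrete one. The constant and function below are illustrative, not the real API:

```go
package main

import "fmt"

// Illustrative stand-in for a wildcard init-cert cluster id; not the
// real value used by centralsensor.
const wildcardInitCertClusterID = "00000000-0000-0000-0000-000000000000"

// reconcileClusterID checks that the cluster id obtained from the service
// certificate (TLS auth) is compatible with the one sent explicitly in
// the request, and returns the effective cluster id to use.
func reconcileClusterID(idFromCert, idFromRequest string) (string, error) {
	switch {
	case idFromCert == idFromRequest:
		return idFromCert, nil
	case idFromCert == wildcardInitCertClusterID && idFromRequest != "":
		// Wildcard init cert: trust the explicit id from the request.
		return idFromRequest, nil
	default:
		return "", fmt.Errorf("cluster id %q in request is incompatible with cluster id %q from the service certificate", idFromRequest, idFromCert)
	}
}
```

This prevents a concrete cluster A from requesting certificates on behalf of cluster B, while still letting init-bundle sensors name their cluster explicitly.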
I agree with the first bullet point.
As for the second bullet point, I'm not sure all this complexity is worth it currently. Perhaps for a request from a sensor that presents an init-cert cluster ID we just create an init-style (wildcard ID) scanner certificate?
After all, if Mallory has an init bundle, he could impersonate any cluster created with that bundle. (Potentially any other cluster as well, I don't know whether our code checks that.)
Once we tackle https://issues.redhat.com/browse/ROX-8091 we could tighten this up.
WDYT @SimonBaeumer ?
Hm, yeah the complexity worries me too.
In the beginning I had the idea of an "embedded" operator running orthogonal to Sensor but given authn/authz complexities emerging now I would rather take a step back and ignore this for now.
Running the secret creation as part of Sensor itself, using the already established bidirectional connection, looks to me like the better approach without speculative improvements.
The only downside I see is that the cert is generated only after the initial sync.
To use the gRPC stream, the message needs to be dispatched, which is done in `connection_impl.go:runRecv`, which in turn calls `connection_impl.go:handleMessage`.
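Dispatching a new message variant inside the existing stream handler could be sketched as follows. The types here are hypothetical stand-ins for the generated proto variants, not the real `MsgFromSensor` API or the actual `handleMessage` signature:

```go
package main

import "fmt"

// msgFromSensor mimics the oneof wrapper interface generated for
// MsgFromSensor variants.
type msgFromSensor interface{ isMsgFromSensor() }

// issueLocalScannerCertsRequest is a stand-in for the new variant
// that would ask Central for local scanner certificates.
type issueLocalScannerCertsRequest struct{ namespace string }

func (issueLocalScannerCertsRequest) isMsgFromSensor() {}

// someOtherSensorMsg represents any already-handled variant.
type someOtherSensorMsg struct{}

func (someOtherSensorMsg) isMsgFromSensor() {}

// handleMessage routes each incoming variant to its handler, in the
// spirit of connection_impl.go:handleMessage.
func handleMessage(msg msgFromSensor) (string, error) {
	switch m := msg.(type) {
	case issueLocalScannerCertsRequest:
		return "issuing local scanner certs for namespace " + m.namespace, nil
	case someOtherSensorMsg:
		return "handled by an existing pipeline", nil
	default:
		return "", fmt.Errorf("unsupported message type %T", msg)
	}
}
```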
Hi Simon,
I've saved the previous prototype using the embedded operator as client in juanrh/ROX-8752-embedded-operator.
I'll now pivot to the new approach you outline. Just so we are on the same page, let me write down what I understand here:
- The client will now be sensor/common/sensor/sensor.go
- Instead of a new gRPC `service LocalScannerService` we use the existing bidirectional rpc `Communicate` from `service SensorService`, which implies adding new variants to `message MsgFromSensor` and `message MsgToSensor`, and using the methods you point out to handle the new messages.
- With initial sync I understand you mean the Hello protocol, because `s.manager.HandleConnection(server.Context(), sensorHello, cluster, eventPipeline, server)` is only called in `serviceImpl.Communicate` after the Central hello response is sent in `server.Send(&central.MsgToSensor{Msg: &central.MsgToSensor_Hello{Hello: centralHello}})`

I'll send a draft PR when I have the prototype, to foster alignment.
The client will now be sensor/common/sensor/sensor.go
Yes, more specifically started from the Sensor start-up routine and not outside of it (i.e. in main.go).
Instead of a new gRPC `service LocalScannerService` we use the existing bidirectional rpc `Communicate` from `service SensorService`, which implies adding new variants to `message MsgFromSensor` and `message MsgToSensor`, and using the methods you point out to handle the new messages.
Exactly
With initial sync I understand you mean the Hello protocol, because `s.manager.HandleConnection(server.Context(), sensorHello, cluster, eventPipeline, server)` is only called in `serviceImpl.Communicate` after the Central hello response is sent in `server.Send(&central.MsgToSensor{Msg: &central.MsgToSensor_Hello{Hello: centralHello}})`
Yes. To clarify I meant the time to wait until the first scanner cert can be generated in Central.
porridge
left a comment
I think this makes sense. Can you please rebase on the tip of the other branch for a final pass?
Also add validations on caller identity
SimonBaeumer
left a comment
Cool to see how you already navigate around the source!
Imho let's handle the cert-gen message inside the stream instead of outside it, to prevent duplicating logic and issues due to cluster initialization.
SimonBaeumer
left a comment
This looks already very promising!
@misberner could you take a look at the review & changes as well?
```go
certs, err := localscanner.IssueLocalScannerCerts(namespace, c.clusterID)
errMsgTemplate := "Error issuing local Scanner certificates for cluster with ID %s and namespace %s"
if err != nil {
	return errors.Wrapf(err, errMsgTemplate, c.clusterID, namespace)
```
Hm, with this approach Sensor never gets notified that the certificate was not generated. Imho Sensor should always expect a response, even if it contains an error.
That's right, I modified `IssueLocalScannerCertsResponse` to use a oneof that returns either the certificates or an error message.
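Unwrapping such a oneof-style response on the receiving side could look like the sketch below. Field and type names are illustrative stand-ins, not the generated proto API:

```go
package main

import "errors"

// localScannerCertsResponse mimics a response whose oneof carries
// either the certificates or an error message, never both.
type localScannerCertsResponse struct {
	certificates map[string][]byte // set on success
	errorMessage string            // set on failure
}

// certsFromResponse converts the wire-level oneof into an idiomatic
// (certs, error) pair for the caller.
func certsFromResponse(resp *localScannerCertsResponse) (map[string][]byte, error) {
	switch {
	case resp == nil:
		return nil, errors.New("empty response")
	case resp.errorMessage != "":
		return nil, errors.New(resp.errorMessage)
	case resp.certificates == nil:
		return nil, errors.New("response contains neither certificates nor an error")
	default:
		return resp.certificates, nil
	}
}
```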
I'm happy to TAL, but given the size of this PR I won't be able to do so before Friday at best. Is there maybe some subset of the PR that you specifically would like me to take a look at?
Thanks for taking a look @misberner. Here we are adding new variants to …
`envisolator` doesn't enable env vars in release builds
```proto
package central;

// ...

message LocalScannerCertificates {
```
So, per the design doc, the reason not to request the certificates via the SensorHello protocol is that the local scanner cert's validity might be shorter than the lifetime of the sensor<>central connection. That makes sense. However, I find it unfortunate that you deviate from the format so significantly. Encoding the service name into the file name might not be the most elegant thing, but neither is sending two distinct CA certificates even though they can only ever be the same.
If you want to keep it more explicitly structured, I'd still suggest generalizing this message to `message ServiceCertificate` (there really isn't anything scanner-specific here), and then structuring the request/response as follows:
```proto
message TypedServiceCertificate {
  storage.ServiceType service_type = 1;
  ServiceCertificate cert = 2;
}

message IssueServiceCertsRequest {
  repeated storage.ServiceType service_types = 1;
}

message IssueServiceCertsResponse {
  repeated TypedServiceCertificate service_certs = 1;
}
```

This will hardly be any harder to use, but is way more future-proof. However, I would maintain that the `map<string, string> cert_bundle` approach is perfectly fine as well.
You are right. I modified the protos to avoid sending the CA twice, and to avoid local-scanner-specific fields where they are not needed. However, I still think `IssueLocalScannerCertsRequest` makes sense, because local scanner certificates should have a specific expiration time on the order of days. Let me know what you think.
Makes sense about the expiration, though I'd say that could be addressed by an extra parameter. In the end, we might want to switch to more short-lived certificates for other services as well.
Also add test to check the right feature flag is being used.
Also extend test cases to cover more missing parameters combinations
Co-authored-by: Malte Isberner <2822367+misberner@users.noreply.github.com>
Co-authored-by: Malte Isberner <2822367+misberner@users.noreply.github.com>
instead of assertions with info
SimonBaeumer
left a comment
Looks good to me!
Please wait for @porridge comment on the race condition in subtests before merging 👍
```go
for tcName, tc := range testCases {
	s.Run(tcName, func() {
```
@porridge Is there a possible race here? I remember you working on a problem like this in tests.
There was a rumour from someone (unfortunately I don't remember who) that suite tests are run in parallel by default, in which case this would be racy. I vaguely remember the same person later claiming they might have been wrong, so it's inconclusive.
If you'd like to be safe, just add `tc := tc` right before `s.Run()` and mention something like `// TODO(ROX-8730): just in case this is racy` so we remember to remove it if this turns out to be unnecessary.
Never heard that, but I can promise you that suite tests are not run in parallel, and in fact cannot be run in parallel
I'll make that change anyway, as it doesn't hurt
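For illustration, here is why the `tc := tc` copy is harmless insurance: if the subtest closures ever run concurrently (or simply outlive their iteration), each needs its own copy of the loop variable rather than the single variable that pre-Go-1.22 `range` loops reuse across iterations. The function below is a hypothetical stand-alone example, not the actual test code:

```go
package main

import (
	"sort"
	"sync"
)

// runSubtests launches one goroutine per name, the way parallel subtests
// would, and records which name each goroutine observed.
func runSubtests(names []string) []string {
	var (
		mu  sync.Mutex
		ran []string
		wg  sync.WaitGroup
	)
	for _, name := range names {
		name := name // per-iteration copy, like the suggested tc := tc
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			ran = append(ran, name)
			mu.Unlock()
		}()
	}
	wg.Wait()
	sort.Strings(ran)
	return ran
}
```

Without the copy, all goroutines could observe the final value of `name`; with it, each sees exactly one distinct name.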
porridge
left a comment
Just a few optional nitpicks inline.
This looks great!
Description
Create new gRPC endpoint in Central that Sensor can use to ask for the TLS certificates for a local Scanner. See design doc for additional context.
Note: this also includes commits from #211, which hasn't been merged. The first relevant commit is 76d65ee.
Checklist
This creates a new gRPC endpoint, so it doesn't change the behaviour of the system; therefore there is no need to add either a CHANGELOG entry or upgrade instructions.
Testing Performed
Added additional unit test.