ROX-9015: new SensorComponent to keep local scanner secrets up to date (#565)
Conversation
Tag for build #244115 is 3.68.x-268-g7de21b009e
💻 For deploying this image using the dev scripts, run the following first: export MAIN_IMAGE_TAG='3.68.x-268-g7de21b009e'
📦 You can also generate an installation bundle with: docker run -i --rm stackrox/main:3.68.x-268-g7de21b009e central generate interactive > bundle.zip
Our notion of recoverable error is very narrow and confined to specific application-level situations.
In particular a transient API availability error would not count as recoverable here.
So I'm not convinced it is a good idea to refuse to Start() in this case (which presumably causes a sensor crash).
I agree this should be much more resilient. I'll remove the call to certsRepo.getServiceCertificates(ctx) altogether, as the refresher will already do that in refreshCertificates, and it will stop itself in case the secrets have the wrong owner. refreshCertificates is called right after starting the refresher, and it retries with backoff on API errors. This should be simpler than retrying on Start (for example with retry.OnError), but let me know what you think.
In any case, a start failure in a sensor component doesn't lead to a crash, just to a log at panic level, as seen in func (s *Sensor) Start() in sensor/common/sensor/sensor.go. That said, I agree that starting this component reliably is critical:
for _, component := range s.components {
	if err := component.Start(); err != nil {
		log.Panicf("Sensor component %T failed to start: %v", component, err)
	}
}
On second thought, we are still calling the k8s API in fetchSensorDeploymentOwnerRef to fetch the replica set. This component cannot continue if we are unable to get that replica set, so I'll use retry.OnError for that, as it is reasonable to expect connectivity to the k8s API on sensor startup.
However, if that is not a reasonable assumption, one option is changing serviceCertificatesRepoSecretsImpl so its factory takes a func() metav1.OwnerReference instead of a metav1.OwnerReference. The repo would fetch the owner reference on each use until it has been fetched successfully once; that gives us a retry mechanism, because the refresher retries failed attempts.
I'll implement the first alternative, but please let me know if you think the second one is better.
@juanrh note that log.Panicf does cause an actual panic, i.e. a crash with stack trace.
I didn't know that, thanks for the clarification
Sorry, but I'm a little lost: can you remind me what will stop i.requester if we return an error here?
Nothing will stop it, so I'm adding an abortStart method to stop the component if there is an error at startup.
That would be my preference. Otherwise we just shift debugging to later, when scanner suddenly stops working because its cert expired. With the 1-year validity this might not feel urgent, but I'm a believer in the "fail loud and early" approach.
Added integration tests.
@SimonBaeumer what are your thoughts on this?
No, it should not fail if the TLSIssuer can't start. Sensor can run without the TLSIssuer and is still capable of enforcing policies even without the LocalScanner, so such a restart is not required, just like when the connection to Central can't be established or a certificate is missing. Imagine we shipped a bug in the TLSIssuer and it could not start on thousands of Sensor installations; that error would immediately be escalated. On another note, btw: it is still an issue for customers that Sensor restarts if it loses the connection to Central (i.e. on every update).
OK, this convinces me @SimonBaeumer
SimonBaeumer
left a comment
Still reviewing the tests, but looking good so far.
Sending the review now so you can already start addressing the comment around the managedBy field.
sensorOwnerReference, fetchSensorDeploymentErr := i.fetchSensorDeploymentOwnerRef(ctx, fetchSensorDeploymentOwnerRefBackoff)
if fetchSensorDeploymentErr != nil {
	return i.abortStart(errors.Wrap(fetchSensorDeploymentErr, "fetching sensor deployment"))
}

ok := concurrency.PollWithTimeout(func() bool {
	return tlsIssuer.certRefresher != nil && tlsIssuer.certRefresher.Stopped()
}, 10*time.Millisecond, 100*time.Millisecond)
s.True(ok)
Can you add a message here describing what is being asserted?
ctx, cancel := context.WithTimeout(context.Background(), startTimeout)
defer cancel()
if i.sensorManagedBy != storage.ManagerType_MANAGER_TYPE_HELM_CHART &&
I would expect the component not to be started in the first place. The information about how the Sensor is managed should be accessible from CreateSensor; with that, Sensor can be configured with the necessary components.
moved to CreateSensor
@porridge @SimonBaeumer I have created https://issues.redhat.com/browse/ROX-9476 for this, and added both ideas (crash on dev builds only using …)
Good point, started a thread on Slack
SimonBaeumer
left a comment
Good job, especially on the testing parts.
Some smaller things and we are there 👊
ca, err := mtls.CAForSigning()
s.Require().NoError(err)
scannerCert := s.getCertificate()
scannerDBCert := s.getCertificate()
I think this always issues SCANNER_SERVICE service types; can you add the SCANNER_DB service type as well?
var secrets *v1.SecretList
ok := concurrency.PollWithTimeout(func() bool {
	secrets, err = k8sClient.CoreV1().Secrets(sensorNamespace).List(context.Background(), metav1.ListOptions{})
	s.Require().NoError(err)
	return len(secrets.Items) == 2 && len(secrets.Items[0].Data) > 0 && len(secrets.Items[1].Data) > 0
}, 10*time.Millisecond, testTimeout)
s.Require().True(ok)
for _, secret := range secrets.Items {
	var expectedCert *mtls.IssuedCert
You can restructure this test to make it a bit easier to read, e.g. by moving the checks from the PollWithTimeout condition into assertions:
Original:

var secrets *v1.SecretList
ok := concurrency.PollWithTimeout(func() bool {
	secrets, err = k8sClient.CoreV1().Secrets(sensorNamespace).List(context.Background(), metav1.ListOptions{})
	s.Require().NoError(err)
	return len(secrets.Items) == 2 && len(secrets.Items[0].Data) > 0 && len(secrets.Items[1].Data) > 0
}, 10*time.Millisecond, testTimeout)
s.Require().True(ok)
for _, secret := range secrets.Items {
	var expectedCert *mtls.IssuedCert

Suggested:

var secrets *v1.SecretList
_ = concurrency.PollWithTimeout(func() bool {
	secrets, err = k8sClient.CoreV1().Secrets(sensorNamespace).List(context.Background(), metav1.ListOptions{})
	s.Require().NoError(err)
	return true
}, 10*time.Millisecond, testTimeout)
s.Require().Len(secrets.Items, 2)
for _, secret := range secrets.Items {
	s.Require().NotEmpty(secret.Data)
	var expectedCert *mtls.IssuedCert
If we return true in the first argument of PollWithTimeout, then polling would complete on the first iteration, before the secrets are written to the k8s API fake. Also, I discovered that the fake has some kind of eventual-consistency behaviour, and it takes some time until the data is populated in the secrets; that's why I added 832fa82, to ensure that we not only have 2 secrets but that they also have non-empty data. You have to run the tests 100+ times to detect that race; with this additional check I'm not seeing the problem anymore.
I've added a message to s.Require().True(ok) so the test is easier to understand.
Interesting, did not know that. 👍
sensor/kubernetes/sensor/sensor.go
Outdated
return nil, errors.Wrap(err, "creating central client")
}
if features.LocalImageScanning.Enabled() && (helmManagedConfig.GetManagedBy() == storage.ManagerType_MANAGER_TYPE_HELM_CHART ||
We can assume that future managedBy types should allow local image scanning.
For this, the check should be changed to allow any managedBy except unknown and manual.
I think it should look like this:
(helmManagedConfig.GetManagedBy() != storage.ManagerType_MANAGER_TYPE_UNKNOWN &&
helmManagedConfig.GetManagedBy() != storage.ManagerType_MANAGER_TYPE_MANUAL)
Description
This is not rebasing cleanly on top of #474 for some reason, but it should take it from there, starting at 32c468b08.
Checklist
No upgrade steps required; as seen below, this is compatible with older versions of central.
CHANGELOG will be updated as part of parent epic.
Testing Performed
New unit tests
Ran the new unit tests, and the unit tests for modified types.
Manual end to end testing