Skip to content

ROX-9015: new SensorComponent to keep local scanner secrets up to date#565

Merged
juanrh merged 46 commits intomasterfrom
juanrh/ROX-9015
Feb 24, 2022
Merged

ROX-9015: new SensorComponent to keep local scanner secrets up to date#565
juanrh merged 46 commits intomasterfrom
juanrh/ROX-9015

Conversation

@juanrh
Copy link
Contributor

@juanrh juanrh commented Feb 4, 2022

Description

This is not rebasing right on top of #474 for some reason, but it should take it from there starting 32c468b08.

Checklist

  • Investigated and inspected CI test results
  • Unit test and regression tests added
  • Evaluated and added CHANGELOG entry if required
  • Determined and documented upgrade steps

No upgrade steps required, as seen below this is compatible with older versions of central.
CHANGELOG will be updated as part of parent epic.

Testing Performed

New unit tests

Run new unit test, and unit tests for modified types:

go test -count=100 -race -v -p=1 github.com/stackrox/rox/sensor/kubernetes/localscanner -run TestLocalScannerTLSIssuer && \
go test -count=100 -race -v -p=1 github.com/stackrox/rox/pkg/concurrency -run TestRetryTicker && \
go test -count=100 -race -v -p=1 github.com/stackrox/rox/sensor/kubernetes/localscanner -run TestCertRefresher && \
go test -count=100 -race -v -p=1 github.com/stackrox/rox/sensor/kubernetes/localscanner -run TestGetSecretRenewalTime
  • Manual end to end testing in local k8s cluster

Manual end to end testing

function rox_get_pod_name_by_component {
  component="${1}"

  POD_NAME=$(kubectl -n stackrox get pods -l app.kubernetes.io/component=${component} -o jsonpath='{.items[0].metadata.name}')
  echo ${POD_NAME}
}

function rox_sensor_logs {
  kubectl logs -n stackrox $(rox_get_pod_name_by_component sensor) -f
}

teardown
make image

########
## Test: TLS issuer does not start for manual deployments PASS

# deploy
export STORAGE=pvc
export SKIP_UI_BUILD=1
export HOTRELOAD=true
./deploy/k8s/deploy-local.sh

# Error from server (NotFound): secrets "scanner-slim-tls" not found
# Error from server (NotFound): secrets "scanner-db-slim-tls" not found
kubectl -n stackrox get secret scanner-slim-tls
kubectl -n stackrox get secret scanner-db-slim-tls

# no mention to tls_issuer
rox_sensor_logs

# enable feature: this already restarts the pod
kubectl -n stackrox set env deployment sensor ROX_LOCAL_IMAGE_SCANNING=true

# no mention to tls_issuer: this is a manual deployment
rox_sensor_logs

########
## Test Create secrets with certificates if certificates are missing: PASS

teardown
./deploy/k8s/central.sh
ADMIN_PWD=$(cat ./deploy/k8s/central-deploy/password)
echo $ADMIN_PWD
kubectl -n stackrox port-forward svc/central 8443:443 &
sleep 1
rm cluster-init-bundle.yaml 
./bin/darwin/roxctl central init-bundles generate init_bundle --output cluster-init-bundle.yaml  --insecure-skip-tls-verify -p ${ADMIN_PWD}
INIT_BUNDLE_PATH=cluster-init-bundle.yaml  
roxctl helm output secured-cluster-services --debug --remove

helm -n stackrox install stackrox-secured-cluster-services ./stackrox-secured-cluster-services-chart \
  --set clusterName=docker-desktop \
  -f ${INIT_BUNDLE_PATH} \
  --set mainImagePullSecrets.allowNone=true --set collectorImagePullSecrets.allowNone=true

# enable feature: this already restarts the pod
kubectl -n stackrox set env deployment sensor ROX_LOCAL_IMAGE_SCANNING=true

# kubernetes/localscanner: 2022/02/24 14:42:25.384914 cert_refresher.go:54: Error: refreshing local scanner credentials: central refused to issue certificates: issuing local Scanner certificates for request ID "2e6290e5-c6ea-419b-86c9-0e7b22e25918", cluster ID "d72ed4f9-acb4-4d31-8e34-f2c8088432ff" and namespace "stackrox": feature 'Enable OpenShift local-image scanning' is disabled
rox_sensor_logs

# enable feature in central
kubectl -n stackrox set env deployment central ROX_LOCAL_IMAGE_SCANNING=true

# kubernetes/localscanner: 2022/02/24 14:44:12.787698 cert_refresher.go:92: Info: successfully refreshed local scanner credentials
# kubernetes/localscanner: 2022/02/24 14:44:12.787812 cert_refresher.go:58: Info: local scanner credentials scheduled to be refreshed in 22h44m32.2122016s
rox_sensor_logs


# scanner-slim-tls:
# 	ca.pem:
# -----BEGIN CERTIFICATE---
# 	cert.pem: -----BEGIN CERTIFICATE---
# 	key.pem: -----BEGIN EC PRIVATE KEY
# 
# scanner-db-slim-tls:
# 	ca.pem: -----BEGIN CERTIFICATE---
# 	cert.pem: -----BEGIN CERTIFICATE---
# 	key.pem: -----BEGIN EC PRIVATE KEY
for secret in scanner-slim-tls scanner-db-slim-tls
do
	echo "${secret}:"
	for filename in ca.pem cert.pem key.pem
	do
		printf "\t${filename}: "
		kubectl -n stackrox get secret $secret -o json | jq -r ".data[\"${filename}\"]" | base64 -d | head -c 25
		echo
	done
	echo
done 


########
## Test TLS issuer start crash does not crash sensor startup PASS
kubectl -n stackrox set env deployment sensor POD_NAME=wrong_name

# kubernetes/localscanner: 2022/02/24 14:49:02.997008 tls_issuer.go:125: Error: local scanner TLS issuer start aborted due to error: fetching sensor deployment: fetching sensor pod with name "wrong_name": pods "wrong_name" not found
rox_sensor_logs

$ kubectl -n stackrox get pods -l app=sensor
NAME                      READY   STATUS    RESTARTS   AGE
sensor-675d59cd96-kw4md   1/1     Running   0          111s

########
## Test TLS issuer stops on unexpected owner PASS

# recreate secret with wrong owner
kubectl -n stackrox delete secret scanner-slim-tls
kubectl -n stackrox  create secret generic  scanner-slim-tls

# restart sensor with correct env var
kubectl -n stackrox delete deployment sensor
helm -n stackrox upgrade stackrox-secured-cluster-services ./stackrox-secured-cluster-services-chart \
  --set clusterName=docker-desktop \
  -f ${INIT_BUNDLE_PATH} \
  --set mainImagePullSecrets.allowNone=true --set collectorImagePullSecrets.allowNone=true

kubectl -n stackrox set env deployment sensor LOGLEVEL=debug
kubectl -n stackrox set env deployment sensor ROX_LOCAL_IMAGE_SCANNING=true

# kubernetes/localscanner: 2022/02/24 15:06:44.295040 cert_refresher.go:50: Error: non-recoverable error refreshing local scanner credentials, automatic refresh will be stopped: 2 errors occurred:
# 	* secrets "scanner-db-slim-tls" not found
# 	* unexpected owner for certificate secrets
rox_sensor_logs

@juanrh juanrh changed the title Juanrh/rox 9015 ROX-9015: new SensorComponent to keep local scanner secrets up to date Feb 4, 2022
@ghost
Copy link

ghost commented Feb 4, 2022

Tag for build #244115 is 3.68.x-268-g7de21b009e.

💻 For deploying this image using the dev scripts, run the following first:

export MAIN_IMAGE_TAG='3.68.x-268-g7de21b009e'

📦 You can also generate an installation bundle with:

docker run -i --rm stackrox/main:3.68.x-268-g7de21b009e central generate interactive > bundle.zip

🕹️ A roxctl binary artifact can be downloaded from CircleCI.

@juanrh juanrh marked this pull request as ready for review February 8, 2022 18:36
@juanrh juanrh requested a review from SimonBaeumer February 8, 2022 18:36
@juanrh juanrh marked this pull request as draft February 10, 2022 18:39
@juanrh juanrh changed the base branch from master to juanrh/ROX-9148 February 15, 2022 16:23
@juanrh juanrh marked this pull request as ready for review February 15, 2022 17:05
@juanrh juanrh requested a review from porridge February 15, 2022 17:09
Base automatically changed from juanrh/ROX-9148 to master February 17, 2022 08:34
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our notion of recoverable error is very narrow and confined to specific application-level situations.
In particular a transient API availability error would not count as recoverable here.
So I'm not convinced it is a good idea to refuse to Start() in this case (which presumably causes a sensor crash).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this should be much more resilient. I'll remove the call to certsRepo.getServiceCertificates(ctx) altogether, as the refresher already will to that in refreshCertificates, and it'll stop itself in case the secrets have the wrong owner. refreshCertificates is called right after starting the refresher, and it will retry with backoff for API errors. This should be simpler than retrying on Start (for example with retry.OnError), but let me know what you think.

In any case a start failure in a sensor component doesn't lead to a crash, just to a log with panic level as seen in func (s *Sensor) Start() at sensor/common/sensor/sensor.go, but I agree that starting this component in a reliable way is critical

for _, component := range s.components {
	if err := component.Start(); err != nil {
		log.Panicf("Sensor component %T failed to start: %v", component, err)
	}
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second though, we still are calling the k8s API on fetchSensorDeploymentOwnerRef to fetch the replica set. This component cannot continue if we are not able to get that replica set, so I'll use retry.OnError for that, as it would be reasonable to expect connectivity to the k8s API on sensor startup.

However, if that is not a reasonable assumption, one option is changing serviceCertificatesRepoSecretsImpl so it's factory takes a func() metav1.OwnerReference instead of a metav1.OwnerReference. The repo would fetch the owner reference until that reference is fetched successfully once, which implies a retry mechanism because the refresher could retry failed attempts.

I'll implement the first alternative, but please let me know if you think the second one if better

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@juanrh note that log.Panicf does cause an actual panic, i.e. a crash with stack trace.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know that, thanks for the clarification

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, but I'm a little lost: can you remind me what will stop i.requester if we return an error here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nothing will stop that, so I'm adding a abortStart method to stop the component if there is an error at startup

@porridge
Copy link
Contributor

Or do you think it's better to just crash the sensor if this component fails to start?

That would be my preference. Otherwise we just shift debugging to later when suddenly scanner has stopped working because its cert expired. With the 1 year validity this might not feel urgent, but I'm a believer in the "fail loud and early" approach.

@juanrh
Copy link
Contributor Author

juanrh commented Feb 23, 2022

Added integration tests.
Still pending helm test and a bit more manual testing, but ready for a new review round

@juanrh juanrh requested a review from SimonBaeumer February 23, 2022 17:29
@juanrh
Copy link
Contributor Author

juanrh commented Feb 23, 2022

Or do you think it's better to just crash the sensor if this component fails to start?

That would be my preference. Otherwise we just shift debugging to later when suddenly scanner has stopped working because its cert expired. With the 1 year validity this might not feel urgent, but I'm a believer in the "fail loud and early" approach.

@SimonBaeumer what are your thoughts on this?

@SimonBaeumer
Copy link
Contributor

SimonBaeumer commented Feb 24, 2022

Or do you think it's better to just crash the sensor if this component fails to start?

That would be my preference. Otherwise we just shift debugging to later when suddenly scanner has stopped working because its cert expired. With the 1 year validity this might not feel urgent, but I'm a believer in the "fail loud and early" approach.

No, it should not fail if the TLSIssuer can't start.
For debugging purposes we can writes a log message in a release build and panic in development with utils.Should.

Sensor can run without the TLSIssuer and is still capable of enforcing the policies even without the LocalScanner, such a restart is not required, like i.e. the connection to Central can't be established or a missing certificate.

Imagine we implemented an error in the TLSIssuer and it could not start on all thousands of Sensor installations, this means the error directly is escalated as a Blocker which affects all customers not being able to secure their clusters - basically the worst case.

The other SensorComponent.Start never return an error. We can refactor the components start loop to not cause Sensor to crash, using utils.Should instead here, but definitely needs more discussion.

btw: it is still an issue for customers that Sensor restarts if it looses the connection to Central (i.e. on every update)

@porridge
Copy link
Contributor

OK, this convinces me @SimonBaeumer
Although in the long term I think we need something better than logging messages that nobody looks as.
I think there is a notion of sensor health level that can be shown in central UI right? Perhaps we can take advantage of that (not in this PR, but I'd say it's something worth creating a ticket).

Copy link
Contributor

@SimonBaeumer SimonBaeumer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still reviewing the tests, but looking good so far.
Send the review so you can already start addressing the comment around the mangedBy field.


sensorOwnerReference, fetchSensorDeploymentErr := i.fetchSensorDeploymentOwnerRef(ctx, fetchSensorDeploymentOwnerRefBackoff)
if fetchSensorDeploymentErr != nil {
return i.abortStart(errors.Wrap(fetchSensorDeploymentErr, "fetching sensor deployment"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

ok := concurrency.PollWithTimeout(func() bool {
return tlsIssuer.certRefresher != nil && tlsIssuer.certRefresher.Stopped()
}, 10*time.Millisecond, 100*time.Millisecond)
s.True(ok)
Copy link
Contributor

@SimonBaeumer SimonBaeumer Feb 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a message here what you have asserted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

ctx, cancel := context.WithTimeout(context.Background(), startTimeout)
defer cancel()

if i.sensorManagedBy != storage.ManagerType_MANAGER_TYPE_HELM_CHART &&
Copy link
Contributor

@SimonBaeumer SimonBaeumer Feb 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect the component not being started in the first place. The information how the Sensor is managed should be accessible from CreateSensor, with that Sensor can be configured with the components necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to CreateSensor

@juanrh juanrh requested a review from SimonBaeumer February 24, 2022 11:29
@juanrh
Copy link
Contributor Author

juanrh commented Feb 24, 2022

I think there is a notion of sensor health level that can be shown in central UI right? Perhaps we can take advantage of that (not in this PR, but I'd say it's something worth creating a ticket).

@porridge @SimonBaeumer I have created https://issues.redhat.com/browse/ROX-9476 for this, and added both ideas (crash on dev builds only using utils.Should, and use the sensor health level feature of the UI to make the problem visible to customers)

@SimonBaeumer
Copy link
Contributor

OK, this convinces me @SimonBaeumer Although in the long term I think we need something better than logging messages that nobody looks as. I think there is a notion of sensor health level that can be shown in central UI right? Perhaps we can take advantage of that (not in this PR, but I'd say it's something worth creating a ticket).

Good point, started a thread on Slack

Copy link
Contributor

@SimonBaeumer SimonBaeumer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job, especially on the testing parts.
Some smaller things and we are there 👊

ca, err := mtls.CAForSigning()
s.Require().NoError(err)
scannerCert := s.getCertificate()
scannerDBCert := s.getCertificate()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this always issues SCANNER_SERVICE service types, can you add the SCANNER_DB service type as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines +303 to +311
var secrets *v1.SecretList
ok := concurrency.PollWithTimeout(func() bool {
secrets, err = k8sClient.CoreV1().Secrets(sensorNamespace).List(context.Background(), metav1.ListOptions{})
s.Require().NoError(err)
return len(secrets.Items) == 2 && len(secrets.Items[0].Data) > 0 && len(secrets.Items[1].Data) > 0
}, 10*time.Millisecond, testTimeout)
s.Require().True(ok)
for _, secret := range secrets.Items {
var expectedCert *mtls.IssuedCert
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can restructure this test to make it a bit more easier, i.e. moving the return of the PollWithTimeout into assertions.

Suggested change
var secrets *v1.SecretList
ok := concurrency.PollWithTimeout(func() bool {
secrets, err = k8sClient.CoreV1().Secrets(sensorNamespace).List(context.Background(), metav1.ListOptions{})
s.Require().NoError(err)
return len(secrets.Items) == 2 && len(secrets.Items[0].Data) > 0 && len(secrets.Items[1].Data) > 0
}, 10*time.Millisecond, testTimeout)
s.Require().True(ok)
for _, secret := range secrets.Items {
var expectedCert *mtls.IssuedCert
var secrets *v1.SecretList
_ = concurrency.PollWithTimeout(func() bool {
secrets, err = k8sClient.CoreV1().Secrets(sensorNamespace).List(context.Background(), metav1.ListOptions{})
s.Require().NoError(err)
return true
}, 10*time.Millisecond, testTimeout)
s.Require().Len(secrets.Items, 2)
for _, secret := range secrets.Items {
s.Require().NotEmpty(secret.Data)
var expectedCert *mtls.IssuedCert

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we return true in the first argument of PollWithTimeout then polling would complete on the first iteration, before the secrets are written to the k8s API fake. Also, I discovered that the fake has some kind of eventual consistency behaviour, and it takes some time until the data is populated in the secrets, that's why I added 832fa82 to ensure that we not only have 2 secrets but they have non-empty data. You have to run the tests 100+ times to detect that race, with this additional check I'm not seeing that problem anymore.

I've added a message to s.Require().True(ok) to the test is easier to understand.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, did not know that. 👍

return nil, errors.Wrap(err, "creating central client")
}

if features.LocalImageScanning.Enabled() && (helmManagedConfig.GetManagedBy() == storage.ManagerType_MANAGER_TYPE_HELM_CHART ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can assume that future managedBy types allow local image scanning to be allowed.
For this the check should be changed to check not allow managedBy except unknown and manual.

I think it should look like this:

(helmManagedConfig.GetManagedBy() != storage.ManagerType_MANAGER_TYPE_UNKNOWN &&
		helmManagedConfig.GetManagedBy() != storage.ManagerType_MANAGER_TYPE_MANUAL)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@juanrh juanrh requested a review from SimonBaeumer February 24, 2022 15:13
Copy link
Contributor

@SimonBaeumer SimonBaeumer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🚢

@juanrh juanrh merged commit e0ac710 into master Feb 24, 2022
@juanrh juanrh deleted the juanrh/ROX-9015 branch February 24, 2022 15:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants