ROX-9015: new SensorComponent to keep local scanner secrets up to date by juanrh · Pull Request #565 · stackrox/stackrox

juanrh · 2022-02-04T15:47:52Z

Description

This is not rebasing right on top of #474 for some reason, but it should take it from there starting 32c468b08.

Checklist

Investigated and inspected CI test results
Unit test and regression tests added
Evaluated and added CHANGELOG entry if required
Determined and documented upgrade steps

No upgrade steps required, as seen below this is compatible with older versions of central.
CHANGELOG will be updated as part of parent epic.

Testing Performed

New unit tests

Run new unit test, and unit tests for modified types:

go test -count=100 -race -v -p=1 github.com/stackrox/rox/sensor/kubernetes/localscanner -run TestLocalScannerTLSIssuer && \
go test -count=100 -race -v -p=1 github.com/stackrox/rox/pkg/concurrency -run TestRetryTicker && \
go test -count=100 -race -v -p=1 github.com/stackrox/rox/sensor/kubernetes/localscanner -run TestCertRefresher && \
go test -count=100 -race -v -p=1 github.com/stackrox/rox/sensor/kubernetes/localscanner -run TestGetSecretRenewalTime

Manual end to end testing in local k8s cluster

Manual end to end testing

function rox_get_pod_name_by_component {
  component="${1}"

  POD_NAME=$(kubectl -n stackrox get pods -l app.kubernetes.io/component=${component} -o jsonpath='{.items[0].metadata.name}')
  echo ${POD_NAME}
}

function rox_sensor_logs {
  kubectl logs -n stackrox $(rox_get_pod_name_by_component sensor) -f
}

teardown
make image

########
## Test: TLS issuer does not start for manual deployments PASS

# deploy
export STORAGE=pvc
export SKIP_UI_BUILD=1
export HOTRELOAD=true
./deploy/k8s/deploy-local.sh

# Error from server (NotFound): secrets "scanner-slim-tls" not found
# Error from server (NotFound): secrets "scanner-db-slim-tls" not found
kubectl -n stackrox get secret scanner-slim-tls
kubectl -n stackrox get secret scanner-db-slim-tls

# no mention to tls_issuer
rox_sensor_logs

# enable feature: this already restarts the pod
kubectl -n stackrox set env deployment sensor ROX_LOCAL_IMAGE_SCANNING=true

# no mention to tls_issuer: this is a manual deployment
rox_sensor_logs

########
## Test Create secrets with certificates if certificates are missing: PASS

teardown
./deploy/k8s/central.sh
ADMIN_PWD=$(cat ./deploy/k8s/central-deploy/password)
echo $ADMIN_PWD
kubectl -n stackrox port-forward svc/central 8443:443 &
sleep 1
rm cluster-init-bundle.yaml 
./bin/darwin/roxctl central init-bundles generate init_bundle --output cluster-init-bundle.yaml  --insecure-skip-tls-verify -p ${ADMIN_PWD}
INIT_BUNDLE_PATH=cluster-init-bundle.yaml  
roxctl helm output secured-cluster-services --debug --remove

helm -n stackrox install stackrox-secured-cluster-services ./stackrox-secured-cluster-services-chart \
  --set clusterName=docker-desktop \
  -f ${INIT_BUNDLE_PATH} \
  --set mainImagePullSecrets.allowNone=true --set collectorImagePullSecrets.allowNone=true

# enable feature: this already restarts the pod
kubectl -n stackrox set env deployment sensor ROX_LOCAL_IMAGE_SCANNING=true

# kubernetes/localscanner: 2022/02/24 14:42:25.384914 cert_refresher.go:54: Error: refreshing local scanner credentials: central refused to issue certificates: issuing local Scanner certificates for request ID "2e6290e5-c6ea-419b-86c9-0e7b22e25918", cluster ID "d72ed4f9-acb4-4d31-8e34-f2c8088432ff" and namespace "stackrox": feature 'Enable OpenShift local-image scanning' is disabled
rox_sensor_logs

# enable feature in central
kubectl -n stackrox set env deployment central ROX_LOCAL_IMAGE_SCANNING=true

# kubernetes/localscanner: 2022/02/24 14:44:12.787698 cert_refresher.go:92: Info: successfully refreshed local scanner credentials
# kubernetes/localscanner: 2022/02/24 14:44:12.787812 cert_refresher.go:58: Info: local scanner credentials scheduled to be refreshed in 22h44m32.2122016s
rox_sensor_logs


# scanner-slim-tls:
# 	ca.pem:
# -----BEGIN CERTIFICATE---
# 	cert.pem: -----BEGIN CERTIFICATE---
# 	key.pem: -----BEGIN EC PRIVATE KEY
# 
# scanner-db-slim-tls:
# 	ca.pem: -----BEGIN CERTIFICATE---
# 	cert.pem: -----BEGIN CERTIFICATE---
# 	key.pem: -----BEGIN EC PRIVATE KEY
for secret in scanner-slim-tls scanner-db-slim-tls
do
	echo "${secret}:"
	for filename in ca.pem cert.pem key.pem
	do
		printf "\t${filename}: "
		kubectl -n stackrox get secret $secret -o json | jq -r ".data[\"${filename}\"]" | base64 -d | head -c 25
		echo
	done
	echo
done 


########
## Test TLS issuer start crash does not crash sensor startup PASS
kubectl -n stackrox set env deployment sensor POD_NAME=wrong_name

# kubernetes/localscanner: 2022/02/24 14:49:02.997008 tls_issuer.go:125: Error: local scanner TLS issuer start aborted due to error: fetching sensor deployment: fetching sensor pod with name "wrong_name": pods "wrong_name" not found
rox_sensor_logs

$ kubectl -n stackrox get pods -l app=sensor
NAME                      READY   STATUS    RESTARTS   AGE
sensor-675d59cd96-kw4md   1/1     Running   0          111s

########
## Test TLS issuer stops on unexpected owner PASS

# recreate secret with wrong owner
kubectl -n stackrox delete secret scanner-slim-tls
kubectl -n stackrox  create secret generic  scanner-slim-tls

# restart sensor with correct env var
kubectl -n stackrox delete deployment sensor
helm -n stackrox upgrade stackrox-secured-cluster-services ./stackrox-secured-cluster-services-chart \
  --set clusterName=docker-desktop \
  -f ${INIT_BUNDLE_PATH} \
  --set mainImagePullSecrets.allowNone=true --set collectorImagePullSecrets.allowNone=true

kubectl -n stackrox set env deployment sensor LOGLEVEL=debug
kubectl -n stackrox set env deployment sensor ROX_LOCAL_IMAGE_SCANNING=true

# kubernetes/localscanner: 2022/02/24 15:06:44.295040 cert_refresher.go:50: Error: non-recoverable error refreshing local scanner credentials, automatic refresh will be stopped: 2 errors occurred:
# 	* secrets "scanner-db-slim-tls" not found
# 	* unexpected owner for certificate secrets
rox_sensor_logs

ghost · 2022-02-04T16:09:08Z

Tag for build #244115 is 3.68.x-268-g7de21b009e.

💻 For deploying this image using the dev scripts, run the following first:

export MAIN_IMAGE_TAG='3.68.x-268-g7de21b009e'

📦 You can also generate an installation bundle with:

docker run -i --rm stackrox/main:3.68.x-268-g7de21b009e central generate interactive > bundle.zip

🕹️ A roxctl binary artifact can be downloaded from CircleCI.

sensor/kubernetes/localscanner/tls_issuer.go

porridge · 2022-02-17T08:33:12Z

sensor/kubernetes/localscanner/tls_issuer.go

Our notion of recoverable error is very narrow and confined to specific application-level situations.
In particular a transient API availability error would not count as recoverable here.
So I'm not convinced it is a good idea to refuse to Start() in this case (which presumably causes a sensor crash).

I agree this should be much more resilient. I'll remove the call to certsRepo.getServiceCertificates(ctx) altogether, as the refresher already will to that in refreshCertificates, and it'll stop itself in case the secrets have the wrong owner. refreshCertificates is called right after starting the refresher, and it will retry with backoff for API errors. This should be simpler than retrying on Start (for example with retry.OnError), but let me know what you think.

In any case a start failure in a sensor component doesn't lead to a crash, just to a log with panic level as seen in func (s *Sensor) Start() at sensor/common/sensor/sensor.go, but I agree that starting this component in a reliable way is critical

for _, component := range s.components { if err := component.Start(); err != nil { log.Panicf("Sensor component %T failed to start: %v", component, err) } }

On second though, we still are calling the k8s API on fetchSensorDeploymentOwnerRef to fetch the replica set. This component cannot continue if we are not able to get that replica set, so I'll use retry.OnError for that, as it would be reasonable to expect connectivity to the k8s API on sensor startup.

However, if that is not a reasonable assumption, one option is changing serviceCertificatesRepoSecretsImpl so it's factory takes a func() metav1.OwnerReference instead of a metav1.OwnerReference. The repo would fetch the owner reference until that reference is fetched successfully once, which implies a retry mechanism because the refresher could retry failed attempts.

I'll implement the first alternative, but please let me know if you think the second one if better

@juanrh note that log.Panicf does cause an actual panic, i.e. a crash with stack trace.

I didn't know that, thanks for the clarification

porridge · 2022-02-17T08:40:03Z

sensor/kubernetes/localscanner/tls_issuer.go

Sorry, but I'm a little lost: can you remind me what will stop i.requester if we return an error here?

nothing will stop that, so I'm adding a abortStart method to stop the component if there is an error at startup

porridge · 2022-02-23T11:03:30Z

Or do you think it's better to just crash the sensor if this component fails to start?

That would be my preference. Otherwise we just shift debugging to later when suddenly scanner has stopped working because its cert expired. With the 1 year validity this might not feel urgent, but I'm a believer in the "fail loud and early" approach.

to make test more stable

juanrh · 2022-02-23T17:29:32Z

Added integration tests.
Still pending helm test and a bit more manual testing, but ready for a new review round

juanrh · 2022-02-23T17:30:51Z

Or do you think it's better to just crash the sensor if this component fails to start?

That would be my preference. Otherwise we just shift debugging to later when suddenly scanner has stopped working because its cert expired. With the 1 year validity this might not feel urgent, but I'm a believer in the "fail loud and early" approach.

@SimonBaeumer what are your thoughts on this?

SimonBaeumer · 2022-02-24T10:40:57Z

Or do you think it's better to just crash the sensor if this component fails to start?

That would be my preference. Otherwise we just shift debugging to later when suddenly scanner has stopped working because its cert expired. With the 1 year validity this might not feel urgent, but I'm a believer in the "fail loud and early" approach.

No, it should not fail if the TLSIssuer can't start.
For debugging purposes we can writes a log message in a release build and panic in development with utils.Should.

Sensor can run without the TLSIssuer and is still capable of enforcing the policies even without the LocalScanner, such a restart is not required, like i.e. the connection to Central can't be established or a missing certificate.

Imagine we implemented an error in the TLSIssuer and it could not start on all thousands of Sensor installations, this means the error directly is escalated as a Blocker which affects all customers not being able to secure their clusters - basically the worst case.

The other SensorComponent.Start never return an error. We can refactor the components start loop to not cause Sensor to crash, using utils.Should instead here, but definitely needs more discussion.

btw: it is still an issue for customers that Sensor restarts if it looses the connection to Central (i.e. on every update)

porridge · 2022-02-24T10:55:46Z

OK, this convinces me @SimonBaeumer
Although in the long term I think we need something better than logging messages that nobody looks as.
I think there is a notion of sensor health level that can be shown in central UI right? Perhaps we can take advantage of that (not in this PR, but I'd say it's something worth creating a ticket).

SimonBaeumer

Still reviewing the tests, but looking good so far.
Send the review so you can already start addressing the comment around the mangedBy field.

SimonBaeumer · 2022-02-24T10:49:55Z

sensor/kubernetes/localscanner/tls_issuer.go

+
+	sensorOwnerReference, fetchSensorDeploymentErr := i.fetchSensorDeploymentOwnerRef(ctx, fetchSensorDeploymentOwnerRefBackoff)
+	if fetchSensorDeploymentErr != nil {
+		return i.abortStart(errors.Wrap(fetchSensorDeploymentErr, "fetching sensor deployment"))


Looks good!

SimonBaeumer · 2022-02-24T10:52:34Z

sensor/kubernetes/localscanner/tls_issuer_test.go

+			ok := concurrency.PollWithTimeout(func() bool {
+				return tlsIssuer.certRefresher != nil && tlsIssuer.certRefresher.Stopped()
+			}, 10*time.Millisecond, 100*time.Millisecond)
+			s.True(ok)


Can you add a message here what you have asserted?

SimonBaeumer · 2022-02-24T11:00:52Z

sensor/kubernetes/localscanner/tls_issuer.go

+	ctx, cancel := context.WithTimeout(context.Background(), startTimeout)
+	defer cancel()
+
+	if i.sensorManagedBy != storage.ManagerType_MANAGER_TYPE_HELM_CHART &&


I would expect the component not being started in the first place. The information how the Sensor is managed should be accessible from CreateSensor, with that Sensor can be configured with the components necessary.

moved to CreateSensor

juanrh · 2022-02-24T11:44:24Z

I think there is a notion of sensor health level that can be shown in central UI right? Perhaps we can take advantage of that (not in this PR, but I'd say it's something worth creating a ticket).

@porridge @SimonBaeumer I have created https://issues.redhat.com/browse/ROX-9476 for this, and added both ideas (crash on dev builds only using utils.Should, and use the sensor health level feature of the UI to make the problem visible to customers)

SimonBaeumer · 2022-02-24T13:10:01Z

OK, this convinces me @SimonBaeumer Although in the long term I think we need something better than logging messages that nobody looks as. I think there is a notion of sensor health level that can be shown in central UI right? Perhaps we can take advantage of that (not in this PR, but I'd say it's something worth creating a ticket).

Good point, started a thread on Slack

SimonBaeumer

Good job, especially on the testing parts.
Some smaller things and we are there 👊

SimonBaeumer · 2022-02-24T13:44:35Z

sensor/kubernetes/localscanner/tls_issuer_test.go

+			ca, err := mtls.CAForSigning()
+			s.Require().NoError(err)
+			scannerCert := s.getCertificate()
+			scannerDBCert := s.getCertificate()


I think this always issues SCANNER_SERVICE service types, can you add the SCANNER_DB service type as well?

SimonBaeumer · 2022-02-24T13:53:09Z

sensor/kubernetes/localscanner/tls_issuer_test.go

+			var secrets *v1.SecretList
+			ok := concurrency.PollWithTimeout(func() bool {
+				secrets, err = k8sClient.CoreV1().Secrets(sensorNamespace).List(context.Background(), metav1.ListOptions{})
+				s.Require().NoError(err)
+				return len(secrets.Items) == 2 && len(secrets.Items[0].Data) > 0 && len(secrets.Items[1].Data) > 0
+			}, 10*time.Millisecond, testTimeout)
+			s.Require().True(ok)
+			for _, secret := range secrets.Items {
+				var expectedCert *mtls.IssuedCert


You can restructure this test to make it a bit more easier, i.e. moving the return of the PollWithTimeout into assertions.

Suggested change

var secrets *v1.SecretList

ok := concurrency.PollWithTimeout(func() bool {

secrets, err = k8sClient.CoreV1().Secrets(sensorNamespace).List(context.Background(), metav1.ListOptions{})

s.Require().NoError(err)

return len(secrets.Items) == 2 && len(secrets.Items[0].Data) > 0 && len(secrets.Items[1].Data) > 0

}, 10*time.Millisecond, testTimeout)

s.Require().True(ok)

for _, secret := range secrets.Items {

var expectedCert *mtls.IssuedCert

var secrets *v1.SecretList

_ = concurrency.PollWithTimeout(func() bool {

secrets, err = k8sClient.CoreV1().Secrets(sensorNamespace).List(context.Background(), metav1.ListOptions{})

s.Require().NoError(err)

return true

}, 10*time.Millisecond, testTimeout)

s.Require().Len(secrets.Items, 2)

for _, secret := range secrets.Items {

s.Require().NotEmpty(secret.Data)

var expectedCert *mtls.IssuedCert

If we return true in the first argument of PollWithTimeout then polling would complete on the first iteration, before the secrets are written to the k8s API fake. Also, I discovered that the fake has some kind of eventual consistency behaviour, and it takes some time until the data is populated in the secrets, that's why I added 832fa82 to ensure that we not only have 2 secrets but they have non-empty data. You have to run the tests 100+ times to detect that race, with this additional check I'm not seeing that problem anymore.

I've added a message to s.Require().True(ok) to the test is easier to understand.

Interesting, did not know that. 👍

SimonBaeumer · 2022-02-24T13:56:34Z

sensor/kubernetes/sensor/sensor.go

 		return nil, errors.Wrap(err, "creating central client")
 	}

+	if features.LocalImageScanning.Enabled() && (helmManagedConfig.GetManagedBy() == storage.ManagerType_MANAGER_TYPE_HELM_CHART ||


We can assume that future managedBy types allow local image scanning to be allowed.
For this the check should be changed to check not allow managedBy except unknown and manual.

I think it should look like this:

(helmManagedConfig.GetManagedBy() != storage.ManagerType_MANAGER_TYPE_UNKNOWN && helmManagedConfig.GetManagedBy() != storage.ManagerType_MANAGER_TYPE_MANUAL)

SimonBaeumer

LGTM 🚢

#565)

github-actions bot added the area/sensor label Feb 4, 2022

juanrh changed the title ~~Juanrh/rox 9015~~ ROX-9015: new SensorComponent to keep local scanner secrets up to date Feb 4, 2022

juanrh mentioned this pull request Feb 8, 2022

ROX-8758: Parameterize ALL scanner-related resource names. #558

Merged

5 tasks

juanrh force-pushed the juanrh/ROX-9015 branch from a90d4eb to c39b882 Compare February 8, 2022 14:30

github-actions bot added the area/helm label Feb 8, 2022

juanrh marked this pull request as ready for review February 8, 2022 18:36

juanrh requested a review from SimonBaeumer February 8, 2022 18:36

juanrh marked this pull request as draft February 10, 2022 18:39

juanrh force-pushed the juanrh/ROX-9015 branch from 2f46f67 to b4c91f2 Compare February 15, 2022 14:44

juanrh changed the base branch from master to juanrh/ROX-9148 February 15, 2022 16:23

juanrh marked this pull request as ready for review February 15, 2022 17:05

juanrh requested a review from porridge February 15, 2022 17:09

Base automatically changed from juanrh/ROX-9148 to master February 17, 2022 08:34

porridge reviewed Feb 17, 2022

View reviewed changes

Juan Rodriguez Hortala added 15 commits February 17, 2022 17:11

new SensorComponent to keep local scanner secrets up to date

6e6dd26

fix style

84816d9

fetch sensor deployment starting with sensor pod owner name

16b5d08

add some tests

dfe2121

fix style

d583faa

wip adapting to changes in repo and refresher

d3f9d15

rebase

b52c3f5

rebase tests

40da452

add test for fetch sensor deployment

60844b5

complete tests for start

43be8fe

complete tests

d6e956d

improve comment

d4f7ced

launch cert refresher if feature flag enabled

39c6b83

minor refactor

efc86b6

build sensor deployment owner reference from replica set owner

d13bf17

Juan Rodriguez Hortala added 8 commits February 23, 2022 13:12

add integration test for happy path

9d2db92

check wrapped errors not only outer error

41e94fc

extend happy path integ test to corrupted secrets

0578fbd

test cert issue retries on integration test

707bb64

add public method to check if ticker was stopped

0565a66

give priority to ErrUnexpectedSecretsOwner on get secrets call

fb4af92

add integ test for tls issuer stop on wrong secrets owner

aef8dfa

account for eventual consistency of mock k8s client

832fa82

to make test more stable

juanrh requested a review from SimonBaeumer February 23, 2022 17:29

SimonBaeumer suggested changes Feb 24, 2022

View reviewed changes

Juan Rodriguez Hortala added 2 commits February 24, 2022 12:18

add message to assertion

8a3173c

decide component launch based on managedBy on CreateSensor

41218cf

juanrh requested a review from SimonBaeumer February 24, 2022 11:29

SimonBaeumer suggested changes Feb 24, 2022

View reviewed changes

Juan Rodriguez Hortala added 3 commits February 24, 2022 15:05

use deny list instead of allow list for tls issuer component enablement

bb322d7

use certificates for both scanner and scanner db in integ test

c4c2b07

add message to assertion derived from a poll, ot make it more clear

7de21b0

juanrh requested a review from SimonBaeumer February 24, 2022 15:13

SimonBaeumer approved these changes Feb 24, 2022

View reviewed changes

juanrh merged commit e0ac710 into master Feb 24, 2022

juanrh deleted the juanrh/ROX-9015 branch February 24, 2022 15:51

RTann pushed a commit that referenced this pull request Apr 6, 2022

ROX-9015: new SensorComponent to keep local scanner secrets up to date (

4150096

#565)

porridge mentioned this pull request May 5, 2022

ROX-9476 ROX-8533: Show health info of a local scanner #1445

Merged

5 tasks

Conversation

juanrh commented Feb 4, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Checklist

Testing Performed

Manual end to end testing

Uh oh!

ghost commented Feb 4, 2022 • edited by ghost Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

porridge commented Feb 23, 2022

Uh oh!

juanrh commented Feb 23, 2022

Uh oh!

juanrh commented Feb 23, 2022

Uh oh!

SimonBaeumer commented Feb 24, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

porridge commented Feb 24, 2022

Uh oh!

SimonBaeumer left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SimonBaeumer Feb 24, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SimonBaeumer Feb 24, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

juanrh commented Feb 24, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

SimonBaeumer commented Feb 24, 2022

Uh oh!

SimonBaeumer left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SimonBaeumer left a comment

Choose a reason for hiding this comment

Uh oh!

juanrh commented Feb 4, 2022 •

edited

Loading

ghost commented Feb 4, 2022 •

edited by ghost

Loading

SimonBaeumer commented Feb 24, 2022 •

edited

Loading

SimonBaeumer Feb 24, 2022 •

edited

Loading

SimonBaeumer Feb 24, 2022 •

edited

Loading

juanrh commented Feb 24, 2022 •

edited

Loading