Skip to content

ROX-13943: Hide "sourced" autogenerated image integrations behind a feature flag#4160

Merged
RTann merged 9 commits intomasterfrom
ROX-13943
Dec 15, 2022
Merged

ROX-13943: Hide "sourced" autogenerated image integrations behind a feature flag#4160
RTann merged 9 commits intomasterfrom
ROX-13943

Conversation

@RTann
Copy link
Contributor

@RTann RTann commented Dec 15, 2022

Description

This change adds a feature flag around the changes introduced in #2572.

Checklist

  • Investigated and inspected CI test results
  • [ ] Unit test and regression tests added
  • Evaluated and added CHANGELOG entry if required
  • [ ] Determined and documented upgrade steps
  • [ ] Documented user facing changes (create PR based on openshift/openshift-docs and merge into rhacs-docs)

Testing Performed

Given:

config.json:

{"10.2.3.9:5000":{"username":"serviceaccount","password":"<some-password>"}}

create-kubernetes-secrets.sh:

#!/bin/bash
for i in {1..5000}
do
    kubectl create secret docker-registry some-secret-$i --from-file=.dockerconfigjson=config.json
done
  1. Create a Kubernetes/OpenShift cluster
  2. Create a ton of dockerconfigjson secrets via create-kubernetes-secrets.sh
  3. Deploy ACS/StackRox 3.73.0
  4. Observe within the UI that a lot of image integration are being added and reconciled
    Screenshot 2022-12-15 at 02 01 32
  5. After a while, restart central by killing the pod
kubectl delete pod -l app=central -n stackrox
  1. Observe within the logs that central's log output is stuck
kubectl logs -l app=central -n stackrox
...
imageintegration/datastore: 2022/12/15 01:02:08.913010 datastore_impl.go:124: Info: [STARTUP] Found **1764** Image Integrations to be indexed
imageintegration/datastore: 2022/12/15 01:02:08.976003 datastore_impl.go:132: Info: [STARTUP] Successfully indexed 1764 Image Integrations
rbac/k8srole/datastore: 2022/12/15 01:02:09.789312 datastore_impl.go:63: Info: [STARTUP] Successfully indexed 753 roles
rbac/k8srolebinding/datastore: 2022/12/15 01:02:09.845649 datastore_impl.go:38: Info: [STARTUP] Indexing rolebindings
rbac/k8srolebinding/datastore: 2022/12/15 01:02:10.046431 datastore_impl.go:62: Info: [STARTUP] Successfully indexed 1135 rolebindings
policy/store/boltdb: 2022/12/15 01:02:10.190252 store.go:139: Info: Loaded 0 new default Policies
cve/fetcher: 2022/12/15 01:02:10.255639 manager_impl.go:55: Info: successfully copied preloaded CVE Istio files to persistent volume: "/var/lib/stackrox/cve/istio"
cve/fetcher: 2022/12/15 01:02:10.255774 orchestrator.go:62: Info: Found 2 clusters to scan for orchestrator vulnerabilities.
cve/fetcher: 2022/12/15 01:02:10.257331 orchestrator.go:237: Info: Successfully fetched 0 Kubernetes CVEs
cve/fetcher: 2022/12/15 01:02:10.257424 orchestrator.go:70: Error: failed to reconcile orchestrator OpenShift CVEs: no orchestrator scanners are integrated
cve/fetcher: 2022/12/15 01:02:10.272806 istio.go:51: Info: successfully fetched 23 istio CVEs
vulnerabilityrequest/datastore: 2022/12/15 01:02:10.299980 datastore_impl.go:52: Info: [STARTUP] Found 0 vulnerability requests to index
vulnerabilityrequest/datastore: 2022/12/15 01:02:10.300061 datastore_impl.go:66: Info: [STARTUP] Successfully indexed all vulnerability requests
  1. (optional): Set the log level to debug and observe a continuous stream of logs from pkg/registries/docker
kubectl set env -n stackrox deploy/central LOGLEVEL=Debug

kubectl logs -l app=central -n stackrox
...
cve/fetcher: 2022/12/15 01:03:32.219524 orchestrator.go:62: Info: Found 2 clusters to scan for orchestrator vulnerabilities.
cve/fetcher: 2022/12/15 01:03:32.221184 orchestrator.go:237: Info: Successfully fetched 0 Kubernetes CVEs
cve/fetcher: 2022/12/15 01:03:32.221281 orchestrator.go:70: Error: failed to reconcile orchestrator OpenShift CVEs: no orchestrator scanners are integrated
cve/fetcher: 2022/12/15 01:03:32.237856 istio.go:51: Info: successfully fetched 23 istio CVEs
vulnerabilityrequest/datastore: 2022/12/15 01:03:32.263665 datastore_impl.go:52: Info: [STARTUP] Found 0 vulnerability requests to index
vulnerabilityrequest/datastore: 2022/12/15 01:03:32.263751 datastore_impl.go:66: Info: [STARTUP] Successfully indexed all vulnerability requests
pkg/registries/docker: 2022/12/15 01:03:32.317643 docker.go:107: Debug: could not update repo list for integration Autogenerated https://172.30.181.135:5000 for cluster production from openshift-kube-controller-manager/builder-dockercfg-nrwvn: Get "https://172.30.181.135:5000/v2/_catalog": http: non-successful response (status=401 body="{\"errors\":[{\"code\":\"UNAUTHORIZED\",\"message\":\"authentication required\",\"detail\":[{\"Type\":\"registry\",\"Class\":\"\",\"Name\":\"catalog\",\"Action\":\"*\"}]}]}\n")
pkg/registries/docker: 2022/12/15 01:03:32.381961 docker.go:107: Debug: could not update repo list for integration Autogenerated https://image-registry.openshift-image-registry.svc:5000 for cluster nakul from openshift-kube-scheduler-operator/builder-dockercfg-xshn9: Get "https://image-registry.openshift-image-registry.svc:5000/v2/_catalog": http: non-successful response (status=401 body="{\"errors\":[{\"code\":\"UNAUTHORIZED\",\"message\":\"authentication required\",\"detail\":[{\"Type\":\"registry\",\"Class\":\"\",\"Name\":\"catalog\",\"Action\":\"*\"}]}]}\n")
pkg/registries/docker: 2022/12/15 01:03:32.418928 docker.go:107: Debug: could not update repo list for integration Autogenerated https://image-registry.openshift-image-registry.svc.cluster.local:5000 for cluster nakul from openshift-infra/image-trigger-controller-dockercfg-746lr: Get "https://image-registry.openshift-image-registry.svc.cluster.local:5000/v2/_catalog": http: non-successful response (status=401 body="{\"errors\":[{\"code\":\"UNAUTHORIZED\",\"message\":\"authentication required\",\"detail\":[{\"Type\":\"registry\",\"Class\":\"\",\"Name\":\"catalog\",\"Action\":\"*\"}]}]}\n")
pkg/registries/docker: 2022/12/15 01:03:37.420360 docker.go:107: Debug: could not update repo list for integration Autogenerated https://172.30.211.149:5000 for cluster nakul from openshift-infra/node-bootstrapper-dockercfg-hzlp8: Get "https://172.30.211.149:5000/v2/_catalog": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
pkg/registries/docker: 2022/12/15 01:03:37.475001 docker.go:107: Debug: could not update repo list for integration Autogenerated https://image-registry.openshift-image-registry.svc:5000 for cluster nakul from openshift-kube-scheduler/builder-dockercfg-l6g7x: Get "https://image-registry.openshift-image-registry.svc:5000/v2/_catalog": http: non-successful response (status=401 body="{\"errors\":[{\"code\":\"UNAUTHORIZED\",\"message\":\"authentication required\",\"detail\":[{\"Type\":\"registry\",\"Class\":\"\",\"Name\":\"catalog\",\"Action\":\"*\"}]}]}\n")
pkg/registries/docker: 2022/12/15 01:03:37.497478 docker.go:107: Debug: could not update repo list for integration Autogenerated https://image-registry.openshift-image-registry.svc:5000 for cluster nakul from openshift-network-diagnostics/deployer-dockercfg-87g6x: Get "https://image-registry.openshift-image-registry.svc:5000/v2/_catalog": http: non-successful response (status=401 body="{\"errors\":[{\"code\":\"UNAUTHORIZED\",\"message\":\"authentication required\",\"detail\":[{\"Type\":\"registry\",\"Class\":\"\",\"Name\":\"catalog\",\"Action\":\"*\"}]}]}\n")
  1. Bump the version of sensor and central by changing the image name to the tag of this PR
kubectl set image -n stackrox deploy/central central=quay.io/stackrox-io/main:.73.x-235-g53652ca213

kubectl set image -n stackrox deploy/sensor sensor=quay.io/stackrox-io/main:.73.x-235-g53652ca213
  1. Observe within the logs that central log isn't stuck and the number of indexed image integration has lowered drastically (compared to the 1764 from 3.73.0)
kubectl logs -l app=central -n stackrox
...
alert/datastore: 2022/12/15 01:55:14.990503 datastore_impl.go:429: Info: [STARTUP] Successfully indexed all out of sync alerts
imageintegration/datastore: 2022/12/15 01:55:15.018256 datastore_impl.go:125: Info: [STARTUP] Found **18** Image Integrations to be indexed
imageintegration/datastore: 2022/12/15 01:55:15.020582 datastore_impl.go:133: Info: [STARTUP] Successfully indexed 18 Image Integrations
...
secret/datastore: 2022/12/15 01:55:15.489349 datastore_impl.go:34: Info: [STARTUP] Indexing secrets
secret/datastore: 2022/12/15 01:55:15.761372 datastore_impl.go:50: Info: [STARTUP] Successfully indexed secrets
serviceaccount/datastore: 2022/12/15 01:55:15.898567 datastore_impl.go:35: Info: [STARTUP] Indexing service accounts
serviceaccount/datastore: 2022/12/15 01:55:15.982337 datastore_impl.go:57: Info: [STARTUP] Successfully indexed 717 service accounts
rbac/k8srole/datastore: 2022/12/15 01:55:15.982444 datastore_impl.go:39: Info: [STARTUP] Indexing roles
rbac/k8srole/datastore: 2022/12/15 01:55:16.150993 datastore_impl.go:63: Info: [STARTUP] Successfully indexed 753 roles
rbac/k8srolebinding/datastore: 2022/12/15 01:55:16.227863 datastore_impl.go:38: Info: [STARTUP] Indexing rolebindings
rbac/k8srolebinding/datastore: 2022/12/15 01:55:16.475709 datastore_impl.go:62: Info: [STARTUP] Successfully indexed 1135 rolebindings
policy/store/boltdb: 2022/12/15 01:55:16.696793 store.go:139: Info: Loaded 0 new default Policies
cve/fetcher: 2022/12/15 01:55:16.782266 manager_impl.go:55: Info: successfully copied preloaded CVE Istio files to persistent volume: "/var/lib/stackrox/cve/istio"
cve/fetcher: 2022/12/15 01:55:16.782399 orchestrator.go:62: Info: Found 2 clusters to scan for orchestrator vulnerabilities.
cve/fetcher: 2022/12/15 01:55:16.785945 orchestrator.go:250: Info: Successfully fetched 0 Kubernetes CVEs
cve/fetcher: 2022/12/15 01:55:16.786064 orchestrator.go:70: Error: failed to reconcile orchestrator OpenShift CVEs: no orchestrator scanners are integrated
cve/fetcher: 2022/12/15 01:55:16.787697 orchestrator.go:250: Info: Successfully fetched 0 Istio CVEs
vulnerabilityrequest/datastore: 2022/12/15 01:55:16.816730 datastore_impl.go:52: Info: [STARTUP] Found 0 vulnerability requests to index
vulnerabilityrequest/datastore: 2022/12/15 01:55:16.816815 datastore_impl.go:66: Info: [STARTUP] Successfully indexed all vulnerability requests
cve/fetcher: 2022/12/15 01:55:17.233920 orchestrator.go:62: Info: Found 2 clusters to scan for orchestrator vulnerabilities.
cve/fetcher: 2022/12/15 01:55:17.234629 orchestrator.go:250: Info: Successfully fetched 0 Kubernetes CVEs
cve/fetcher: 2022/12/15 01:55:17.253700 orchestrator.go:250: Info: Successfully fetched 0 OpenShift CVEs
cve/fetcher: 2022/12/15 01:55:17.254408 orchestrator.go:250: Info: Successfully fetched 0 Istio CVEs
detection/lifecycle: 2022/12/15 01:55:24.911478 singleton.go:41: Info: Injecting 84 policies into detectors.
detection/lifecycle: 2022/12/15 01:55:24.947495 singleton.go:46: Info: Done injecting policies.
reportconfigurations/datastore: 2022/12/15 01:55:24.947729 datastore_impl.go:40: Info: [STARTUP] Indexing report configurations
reportconfigurations/datastore: 2022/12/15 01:55:24.947794 datastore_impl.go:56: Info: [STARTUP] Successfully indexed report configurations
detection/lifecycle: 2022/12/15 01:55:25.005708 manager_impl.go:123: Info: Cleaning up 36 processes as a part of building process filter
detection/lifecycle: 2022/12/15 01:55:25.012346 manager_impl.go:127: Info: Successfully cleaned up those 36 processes
pruning: 2022/12/15 01:55:25.129198 pruning.go:141: Info: [Pruning] Starting a garbage collection cycle
cve/suppress: 2022/12/15 01:55:25.130005 reprocessor.go:120: Info: Successfully unsuppressed 0 CVEs
reprocessor: 2022/12/15 01:55:25.130068 reprocessor.go:272: Info: Found 12 nodes to scan
vulnerabilityrequest/manager/requestmgr: 2022/12/15 01:55:25.130438 manager_impl.go:506: Info: [STARTUP] Successfully cached all vulnerability requests
pruning: 2022/12/15 01:55:25.131074 pruning.go:501: Info: [Image pruning] Found 0 image search results
pkg/grpc: 2022/12/15 01:55:25.131063 server.go:216: Info: Launching backend gRPC listener
pkg/grpc: 2022/12/15 01:55:25.138359 server.go:371: Info: TLS-enabled multiplexed HTTP/gRPC server listening on [::]:8443
pruning: 2022/12/15 01:55:25.150599 pruning.go:357: Info: [Process pruning] Found 0 orphaned processes
pruning: 2022/12/15 01:55:25.151322 pruning.go:418: Info: [Process baseline pruning] Removed 0 process baselines
pruning: 2022/12/15 01:55:25.179487 pruning.go:449: Info: [Alert pruning] Found 0 orphaned alerts
pruning: 2022/12/15 01:55:25.184207 pruning.go:221: Info: [Pruning] Found no orphaned pods...
pruning: 2022/12/15 01:55:25.185965 pruning.go:241: Info: [Pruning] Found no orphaned service accounts...
pruning: 2022/12/15 01:55:25.187909 pruning.go:241: Info: [Pruning] Found no orphaned K8S roles...
pruning: 2022/12/15 01:55:25.191358 pruning.go:241: Info: [Pruning] Found no orphaned K8S role bindings...
pruning: 2022/12/15 01:55:25.207575 pruning.go:774: Info: [Risk pruning] Removing 0 deployment risks
pruning: 2022/12/15 01:55:25.208638 pruning.go:787: Info: [Risk pruning] Removing 0 image risks
pruning: 2022/12/15 01:55:25.223243 pruning.go:800: Info: [Risk pruning] Removing 0 image component risks
pruning: 2022/12/15 01:55:25.237008 pruning.go:813: Info: [Risk pruning] Removing 0 node risks
pruning: 2022/12/15 01:55:25.249212 pruning.go:537: Info: [Cluster Pruning] pruning is disabled.
pruning: 2022/12/15 01:55:25.249316 pruning.go:158: Info: [Pruning] Finished garbage collection cycle
networkgraph/entity/gatherer: 2022/12/15 01:55:25.322093 load_bundled.go:52: Info: Successfully loaded 28389 external networks
networkgraph/entity/gatherer: 2022/12/15 01:55:25.370469 load_bundled.go:64: Info: Found 28389 external networks in DB. Successfully stored 0/28389 new external networks
reprocessor: 2022/12/15 01:55:26.450142 reprocessor.go:301: Info: Successfully reprocessed 12/12 nodes
reprocessor: 2022/12/15 01:55:26.450307 reprocessor.go:272: Info: Found 0 watched images to scan
reprocessor: 2022/12/15 01:55:26.450334 reprocessor.go:301: Info: Successfully reprocessed 0/0 w
  1. Observe also within the UI that the number of integrations has lowered and stays stable
    Screenshot 2022-12-15 at 03 00 09

@RTann RTann requested a review from dhaus67 December 15, 2022 00:17
@dhaus67 dhaus67 added this to the 3.73.1-rc.3 milestone Dec 15, 2022
@dhaus67 dhaus67 added the turbo-build Uses a faster path to images label Dec 15, 2022
@ghost
Copy link

ghost commented Dec 15, 2022

Images are ready for the commit at 30c8e64.

To use with deploy scripts, first export MAIN_IMAGE_TAG=3.73.x-242-g30c8e64cde.

Copy link
Contributor

@md2119 md2119 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable approach.

I want to point out that we are releasing 3.73.1 patch which include fixes for other issues as well.

What does it mean?
Some customers (with the small environment) will most likely have Central up and running. However, we are recommending patch for other issues as well. Therefore, after applying 3.73.1, the behaviour introduced in 3.73 will change.

What can we do?
I think we should document the change in changelog. That is, basically capturing the rationale in ROX-11657, something like:

## [3.73.1]

### Added Features

### Removed Features

### Deprecated Features

### Technical Changes
3.73 introduced a change to ACS autogenerated image integration workflows. 
However, this change in workflow caused issue...(details [here](https://access.redhat.com/node/6990153)). 
To fix the issue introduced in 3.73, 3.73.1 will reinstate the old workflow....
Therefore, autogenerated integrations may not work successfully in environments where many different credentials
 are used for multiple repos within the global repository. 

@RTann RTann requested review from dhaus67 and md2119 December 15, 2022 03:34
Copy link
Contributor

@janisz janisz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you plan to add regression test for this?

Comment on lines +74 to +77
// Only upsert autogenerated integrations with a source if the feature is enabled.
if !features.SourcedAutogeneratedIntegrations.Enabled() && ii.GetAutogenerated() && ii.GetSource() != nil {
continue
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My mental boolean compiler claims the code and the comment do not quite match. Applying De Morgan's rule:

		if features.SourcedAutogeneratedIntegrations.Enabled() || !ii.GetAutogenerated() || ii.GetSource() == nil {
			//... upsert
		}

This reads to me as "Upsert all if feature is enabled, otherwise skip autogenerated and those with non-nil source". The autogenerated with non-nil source indeed will be inserted only when feature is enabled but what about non-autogenerated with non-nil source?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Source field was added into 73 and its addition has caused some issues. The goal is to ensure when the feature flag is turned off, we revert back to the old implementation, which inserted all autogenerated registries. The Source field should not be populated when the flag is off, so this check is meant to ensure that, if it did somehow pass through, only registries without Source are upserted

}
source := imageIntegration.GetSource()
if source == nil {
if !features.SourcedAutogeneratedIntegrations.Enabled() || source == nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we don't speak about autogenerated when the flag is disabled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This flag is specifically for the new version of autogenerated registries introduced in 3.73.0: ones with this Source field. We have a separate environment setting (not feature flag) to ignore all autogenerated registry integrations: https://github.com/stackrox/stackrox/blob/master/pkg/env/autogenerated_registries.go

Copy link
Contributor

@dhaus67 dhaus67 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One clarifying comment on how autogenerated registries are re-synced after deleting them on startup, otherwise LGTM.


var attempted, deleted int
for _, ii := range iis {
if ii.GetAutogenerated() && ii.GetSource() != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When there's customer who migrate from 3.72.0 -> 3.73.0, they will have the mix of autogenerated integrations with Source != nil and Source == nil.

In case of a "clean" installation of 3.73.0, all autogenerated image integrations will have Source != nil set.

So, we will delete all autogenerated image integrations then.

In that case, how will the recovery look like for this? Will it be eventual consistent such that the autogenerated registries we expect to have (i.e. for internal OpenShift registry) will show up after some time, or does that require bumps of sensors to send the autogenerated registries once more?

(I honestly don't think that the OpenShift internal ones are that important from an autogenerated image integration perspective, unless central can reach them, which won't be the case for most of them)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case of a clean install, I'd expect us to eventually find all the autogenerated registries. When it comes to OpenShift internal registry, we do not use the stuff sent to Central for that

Co-authored-by: Alex Rukletsov <alexr@stackrox.com>
@md2119
Copy link
Contributor

md2119 commented Dec 15, 2022

@dhaus67 @RTann, help me understand this. The behavior change for auto-generated image integrations was released in 3.73.0. Therefore, until Central gRPC listeners are ready and start receiving messages from Sensor, Central DB should not have had many image integrations because Central is beginning to receive messages from Sensor past the readiness probe.

Is the above understanding correct?
If yes, then,

  • If the observation was strictly made during the first time Central was being upgraded to 3.73.0, then I do not think we have covered everything.
  • If otherwise, then we have covered everything.

@RTann
Copy link
Contributor Author

RTann commented Dec 15, 2022

@dhaus67 @RTann, help me understand this. The behavior change for auto-generated image integrations was released in 3.73.0. Therefore, until Central gRPC listeners are ready and start receiving messages from Sensor, Central DB should not have had many image integrations because Central is beginning to receive messages from Sensor past the readiness probe.

Is the above understanding correct? If yes, then,

  • If the observation was strictly made during the first time Central was being upgraded to 3.73.0, then I do not think we have covered everything.
  • If otherwise, then we have covered everything.

I agree with your assessment. I am not sure if the cases we found were created after Central had already restarted due to some reason or if this was the first 3.73.0 upgrade attempt.

@RTann RTann requested a review from rukletsov December 15, 2022 18:48
@md2119
Copy link
Contributor

md2119 commented Dec 15, 2022

@dhaus67 @RTann, help me understand this. The behavior change for auto-generated image integrations was released in 3.73.0. Therefore, until Central gRPC listeners are ready and start receiving messages from Sensor, Central DB should not have had many image integrations because Central is beginning to receive messages from Sensor past the readiness probe.
Is the above understanding correct? If yes, then,

  • If the observation was strictly made during the first time Central was being upgraded to 3.73.0, then I do not think we have covered everything.
  • If otherwise, then we have covered everything.

I agree with your assessment. I am not sure if the cases we found were created after Central had already restarted due to some reason or if this was the first 3.73.0 upgrade attempt.

Okay, let's run by that theory.

@RTann RTann merged commit d283709 into master Dec 15, 2022
@RTann RTann deleted the ROX-13943 branch December 15, 2022 20:30
@github-actions
Copy link
Contributor

Please merge the changes to branch release-3.73.

davdhacs pushed a commit that referenced this pull request Dec 16, 2022
…eature flag (#4160)

minus the changes to central/image/service/service_impl.go
davdhacs added a commit that referenced this pull request Dec 16, 2022
* ROX-13943: Hide "sourced" autogenerated image integrations behind a feature flag (#4160)

Co-authored-by: Ross Tannenbaum <rtannenb@redhat.com>
Co-authored-by: dhaus67 <dhaus@redhat.com>
@davdhacs davdhacs removed this from the 3.73.1-rc.3 milestone Dec 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/central area/sensor turbo-build Uses a faster path to images

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants