
ROX-11657: Scope autogenerated image integrations by the image pull secrets#2572

Merged
dhaus67 merged 5 commits into master from cgorman-image-integration-fix on Oct 18, 2022

Conversation

@connorgorman
Contributor

@connorgorman connorgorman commented Aug 1, 2022

Description

This PR adds a new Source message to the ImageIntegration proto, with the goal of scoping autogenerated integrations to a specific cluster, namespace, and set of pull secrets.

For this to work, a couple of changes were required:

  • Proto changes to the ImageIntegration proto, adding the Source message.
  • API changes to ScanImageInternalRequest, allowing the cluster_id, namespace, and pull secrets that should be used for scanning to be sent.
  • A new ServiceAccountStore/Dispatcher within sensor to manage the image pull secrets associated with a service account in a specific namespace.
  • Changes in the enricher within sensor to send the associated image pull secrets in the ScanImageInternalRequest.
  • Changes in the image enricher within central to filter registries for internal images based on the image pull secrets attached to the enrichment context.
  • Changes implementing reconciliation for internal image integrations, ensuring changes from secured clusters are correctly reflected in central.
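The ServiceAccountStore above can be pictured as a small concurrent map from (namespace, service account) to pull secret names. The sketch below is illustrative only - the type names, method signatures, and key scheme are assumptions, not the PR's actual code.

```go
package main

import (
	"fmt"
	"sync"
)

// serviceAccountStore is a hypothetical sketch of the ServiceAccountStore
// idea: it maps (namespace, service account name) to the image pull secrets
// referenced by that service account.
type serviceAccountStore struct {
	mu      sync.RWMutex
	secrets map[string][]string // key: namespace + "/" + serviceAccount
}

func newServiceAccountStore() *serviceAccountStore {
	return &serviceAccountStore{secrets: map[string][]string{}}
}

func key(namespace, sa string) string { return namespace + "/" + sa }

// Upsert records the pull secrets for a service account, e.g. from a
// Kubernetes ServiceAccount event handled by a sensor dispatcher.
func (s *serviceAccountStore) Upsert(namespace, sa string, pullSecrets []string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.secrets[key(namespace, sa)] = pullSecrets
}

// GetImagePullSecrets returns the pull secrets recorded for a service
// account in a namespace (nil if none are known).
func (s *serviceAccountStore) GetImagePullSecrets(namespace, sa string) []string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.secrets[key(namespace, sa)]
}

func main() {
	store := newServiceAccountStore()
	store.Upsert("payments", "default", []string{"regcred"})
	fmt.Println(store.GetImagePullSecrets("payments", "default")) // prints [regcred]
}
```

The enricher side would then look up the deployment's service account in this store and attach the resulting secret names to the scan request.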

Checklist

  • Investigated and inspected CI test results
    - [ ] Unit test and regression tests added
    - [ ] Evaluated and added CHANGELOG entry if required
    - [ ] Determined and documented upgrade steps
    - [ ] Documented user facing changes (create PR based on openshift/openshift-docs and merge into rhacs-docs)

Testing Performed

  • see CI.

@openshift-ci

openshift-ci bot commented Aug 1, 2022

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@connorgorman
Contributor Author

/test

@openshift-ci

openshift-ci bot commented Aug 2, 2022

@connorgorman: The /test command needs one or more targets.
The following commands are available to trigger required jobs:

  • /test go-postgres-tests
  • /test go-unit-tests
  • /test go-unit-tests-release
  • /test grouped-static-checks
  • /test images
  • /test integration-unit-tests
  • /test mitre-bundles-checks
  • /test policy-checks
  • /test push-images
  • /test shell-unit-tests
  • /test stackrox_branding-images
  • /test stackrox_branding-push-images
  • /test style-checks
  • /test ui-unit-tests

The following commands are available to trigger optional jobs:

  • /test eks-qa-e2e-tests
  • /test gke-kernel-qa-e2e-tests
  • /test gke-nongroovy-e2e-tests
  • /test gke-postgres-qa-e2e-tests
  • /test gke-postgres-scale-tests
  • /test gke-qa-e2e-tests
  • /test gke-race-condition-qa-e2e-tests
  • /test gke-scale-tests
  • /test gke-ui-e2e-tests
  • /test gke-upgrade-tests
  • /test local-roxctl-tests
  • /test openshift-4-operator-e2e-tests
  • /test openshift-4-qa-e2e-tests

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-stackrox-stackrox-master-gke-kernel-qa-e2e-tests
  • pull-ci-stackrox-stackrox-master-gke-nongroovy-e2e-tests
  • pull-ci-stackrox-stackrox-master-gke-postgres-qa-e2e-tests
  • pull-ci-stackrox-stackrox-master-gke-postgres-scale-tests
  • pull-ci-stackrox-stackrox-master-gke-qa-e2e-tests
  • pull-ci-stackrox-stackrox-master-gke-race-condition-qa-e2e-tests
  • pull-ci-stackrox-stackrox-master-gke-scale-tests
  • pull-ci-stackrox-stackrox-master-gke-ui-e2e-tests
  • pull-ci-stackrox-stackrox-master-gke-upgrade-tests
  • pull-ci-stackrox-stackrox-master-go-postgres-tests
  • pull-ci-stackrox-stackrox-master-go-unit-tests
  • pull-ci-stackrox-stackrox-master-go-unit-tests-release
  • pull-ci-stackrox-stackrox-master-grouped-static-checks
  • pull-ci-stackrox-stackrox-master-images
  • pull-ci-stackrox-stackrox-master-integration-unit-tests
  • pull-ci-stackrox-stackrox-master-local-roxctl-tests
  • pull-ci-stackrox-stackrox-master-mitre-bundles-checks
  • pull-ci-stackrox-stackrox-master-openshift-4-operator-e2e-tests
  • pull-ci-stackrox-stackrox-master-policy-checks
  • pull-ci-stackrox-stackrox-master-push-images
  • pull-ci-stackrox-stackrox-master-shell-unit-tests
  • pull-ci-stackrox-stackrox-master-stackrox_branding-images
  • pull-ci-stackrox-stackrox-master-stackrox_branding-push-images
  • pull-ci-stackrox-stackrox-master-style-checks
  • pull-ci-stackrox-stackrox-master-ui-unit-tests

In response to this:

/test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@connorgorman
Contributor Author

/test images push-images

@connorgorman connorgorman force-pushed the cgorman-image-integration-fix branch 2 times, most recently from 84457ed to c94d024 on August 3, 2022 17:11
@connorgorman connorgorman force-pushed the cgorman-image-integration-fix branch from c94d024 to 15cebd1 on August 3, 2022 17:47
@connorgorman connorgorman marked this pull request as ready for review on August 3, 2022 17:58
@connorgorman connorgorman changed the title from "Cgorman image integration fix" to "Scope autogenerated image integrations by the image pull secrets" on Aug 3, 2022
@connorgorman
Contributor Author

/retest

@ghost

ghost commented Aug 4, 2022

Images are ready for the commit at 9048c66.

To use with deploy scripts, first export MAIN_IMAGE_TAG=3.72.x-201-g9048c66fbb.

@connorgorman
Contributor Author

/test all

@connorgorman connorgorman requested a review from dhaus67 August 10, 2022 15:56
return nil
}

integrations, err := s.datastore.GetImageIntegrations(ctx, &v1.GetImageIntegrationsRequest{})
Contributor

Can already filter the integrations by cluster here instead of within line 90.

Suggested change
-	integrations, err := s.datastore.GetImageIntegrations(ctx, &v1.GetImageIntegrationsRequest{})
+	integrations, err := s.datastore.GetImageIntegrations(ctx, &v1.GetImageIntegrationsRequest{Cluster: clusterID})

Contributor Author

Well, long story, but this doesn't work. I'll fix it in a follow-up.

return errors.Wrap(err, "getting image integrations for reconciliation")
}
for _, integration := range integrations {
if integration.GetEcr() != nil {
Contributor

out of curiosity: why do we skip ECR integrations?

Contributor Author

The ECR autogenerated integrations are special and use AWS IAM

Contributor

Maybe add a comment here, then?

@connorgorman
Contributor Author

@dhaus67 I apologize for taking forever to clean this up, but I have made the requested changes and also fixed some issues. Fingers crossed on the e2es

@connorgorman connorgorman force-pushed the cgorman-image-integration-fix branch 2 times, most recently from c44363a to 08f6ed6 on September 2, 2022 02:42
@connorgorman connorgorman force-pushed the cgorman-image-integration-fix branch from 08f6ed6 to 5750b72 on September 26, 2022 22:20
@connorgorman
Contributor Author

/retest

@connorgorman
Contributor Author

@dhaus67 This is good to go. TestScan is a test that was written for summaries, but summaries are checked in qa-backend-e2e

@connorgorman connorgorman force-pushed the cgorman-image-integration-fix branch from 29e4770 to d81fd6d on October 3, 2022 01:32
@connorgorman
Contributor Author

/retest

@connorgorman connorgorman changed the title from "Scope autogenerated image integrations by the image pull secrets" to "ROX-11657: Scope autogenerated image integrations by the image pull secrets" on Oct 3, 2022
@connorgorman connorgorman added the pls-merge label ("Indicates to the *reviewer* that the PR can be merged once tests pass and approval is granted") on Oct 4, 2022
@dhaus67 dhaus67 requested a review from misberner October 6, 2022 23:14
reserved 19; // Previously, scannerv2

message Source {
string cluster_id = 1;
Contributor

Sounds good, I've created ROX-12996 as a follow-up to remove exactly this.

if integration.GetEcr() != nil {
continue
}
if integration.GetClusterId() != clusterID {
Contributor

I looked at v1.GetImageIntegrationsRequest and it seems that it supports specifying a cluster (name only, not ID?). However, the handling of that field is extremely weird (`central/imageintegration/datastore/datastore_impl.go:55-64`):

	integrations, err := ds.storage.GetAll(ctx)
	if err != nil {
		return nil, err
	}

	integrationSlice := integrations[:0]
	for _, integration := range integrations {
		if request.GetCluster() != "" {
			continue
		}

So if the request specifies a non-empty cluster, it will always return an empty slice, while still doing a read on the DB. Not sure if there's a bug or this param was just never supported, but either way this probably deserves a JIRA to be filed and addressed at some point.
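For illustration, here is the presumed intent of that loop with the condition fixed - keep only integrations matching the requested cluster, instead of skipping everything whenever a cluster is given. The types are simplified stand-ins, not the actual datastore code.

```go
package main

import "fmt"

// imageIntegration is a simplified stand-in for the storage proto.
type imageIntegration struct {
	Name        string
	ClusterName string
}

// filterByCluster restores the presumed intent of the quoted loop: keep an
// integration when no cluster is requested or when its cluster matches.
// The buggy original (`if request.GetCluster() != "" { continue }`) skipped
// every integration as soon as a cluster was specified.
func filterByCluster(integrations []*imageIntegration, requestedCluster string) []*imageIntegration {
	out := integrations[:0] // reuse the backing array, as in the quoted code
	for _, integration := range integrations {
		if requestedCluster != "" && integration.ClusterName != requestedCluster {
			continue
		}
		out = append(out, integration)
	}
	return out
}

func main() {
	all := []*imageIntegration{
		{Name: "reg-a", ClusterName: "prod"},
		{Name: "reg-b", ClusterName: "staging"},
	}
	for _, ii := range filterByCluster(all, "prod") {
		fmt.Println(ii.Name) // prints reg-a
	}
}
```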

Contributor

So, it seems the original PR #658 was actually removing support for this parameter (but apparently this wasn't fully done, because autogenerated image integrations still had something left that uses it).

I do agree, it does justify a JIRA, I created one: ROX-13071

Comment on lines +280 to +293
_, exists, err = s.datastore.GetImageIntegration(ctx, imageIntegration.GetId())
if err != nil {
return errors.Wrapf(err, "retrieving image integration %q", imageIntegration.GetId())
}

if _, err := s.datastore.AddImageIntegration(ctx, imageIntegration); err != nil {
return errors.Wrapf(err, "adding image integration %q", imageIntegration.GetId())
}
if err := s.integrationManager.Upsert(imageIntegration); err != nil {
return errors.Wrapf(err, "notifying of update for image integration %q", imageIntegration.GetId())
}
if !exists {
s.enrichAndDetectLoop.ShortCircuit()
}
Contributor

Tbh, I find it a bit ugly that the pipeline is in charge of all of this. IMHO, the manager should both abstract away the store insertion logic as well as the need to run detection. However, that may be hard to refactor. Still, let's factor this out into its own method, at least?

if err := s.integrationManager.Upsert(imageIntegration); err != nil {
return errors.Wrapf(err, "notifying of update for image integration %q", imageIntegration.GetId())
}
if !exists {
Contributor

This doesn't make too much sense imho. A modified integration might just as well be a reason to run detection again. It would be nice if we could somehow scope the re-evaluation to images potentially using that integration, but that's far, far beyond the scope of this PR.

I wonder if the concern is that doing this on every update might be too frequent. I don't think, however, that updates of image pull secrets differ much in frequency from creations - most are probably never updated. A concern is the initial cluster connection, when we receive many new integrations and would start the detection loop several times. To avoid this, it might make sense to add a hasSynced concurrency.Flag that is set to true in Reconcile (followed by triggering detection), and then trigger detection here on every update whenever s.hasSynced.Get() is true.

Contributor

@dhaus67 dhaus67 Oct 12, 2022

@misberner Hm, not sure I completely get it from your comment, sorry.
Your suggestion is to set the flag within Reconcile() and then only short-circuit if the flag is set, right?

Regarding the frequencies, I saw this comment indicating a "higher" volume of updates due to service accounts, but to be fair I'm not quite sure whether this is of any concern.

Contributor

Let me try to rephrase:

  • Ideally, if an image integration is updated, we would want to reprocess all deployments with previously unscannable/unretrievable images that potentially use this registry integration.
  • As an approximation of the above, we could reprocess all deployments that potentially use this registry integration, unscannable images or not.
  • As an approximation of the above, we could reprocess all deployments that are somewhat related to this registry integration (e.g., same cluster), even if they don't use it, and regardless of whether they have unscannable images.

The point is that an updated registry integration can only affect a deployment evaluation if we can now scan an image used by the deployment that we previously could not scan (due to the content-addressing nature of IDs, "being able to scan an image" is a binary property that we can assume never reverts to false once it became `true`). Otherwise, there is no information to be gained from the updated image integration that could change anything about the evaluation.

The existing code however only reprocesses a deployment when a new image integration is created. This is neither an over- nor under-approximation of the "ideal" handling stated at the top of this post. It unnecessarily reprocesses all deployments (even those not using the updated image integration nor using an image from one of the registries this integration covers; it might even reprocess deployments in other clusters, and it also reprocesses deployments for which we already have comprehensive scan results for all images). At the same time, it misses cases where a previously existing, erroneous image integration is fixed and there actually are deployments with still-unscanned images due to the previously erroneous integration. Those will only be fixed after ~1h, whereas those with a missing image integration are fixed directly on creation.

My idea was therefore to just run reprocessing on every image integration update, in order to arrive at something that at least is a consistent over-approximation. Still reprocessing far too much, but at least not missing anything; a fully-overblown but sufficient solution rather than a half-overblown yet still insufficient one.

My suspicion was that this approach wasn't chosen because, during reconciliation (i.e., when a sensor reconnects), we would receive a flood of image integration updates, causing an avalanche of reprocessings. However: (a) this is not how ShortCircuit() works anyway - it just triggers a signal, and if the enricher/detector is still busy, a subsequent call will simply do nothing (note: this alone makes my comment on setting a flag and calling ShortCircuit once after reconciliation pointless, so just ignore it); and (b) the comment you pointed out suggests that this is due to a high volume of (spurious?) updates on OpenShift rather than the reconciliation phase.

So, in summary, keep the existing behavior (which I think now has accidentally been flipped), until we either can do reprocessing in a more targeted way (unscanned images only?), or understand why image integration updates on OpenShift are so noisy.
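Point (a) above - ShortCircuit() only raising a signal that a busy loop picks up later - can be sketched with a one-slot channel. This is a sketch of the semantics being described, not the StackRox implementation.

```go
package main

import "fmt"

// shortCircuitSignal coalesces triggers: any number of ShortCircuit() calls
// made while a run is already pending collapse into a single pending run.
type shortCircuitSignal struct {
	ch chan struct{}
}

func newShortCircuitSignal() *shortCircuitSignal {
	return &shortCircuitSignal{ch: make(chan struct{}, 1)}
}

// ShortCircuit requests a run; if one is already pending, it does nothing.
func (s *shortCircuitSignal) ShortCircuit() {
	select {
	case s.ch <- struct{}{}:
	default: // already signalled; coalesce
	}
}

// Pending consumes and reports a pending signal. A real loop would block on
// <-s.ch; this non-blocking form keeps the sketch easy to demonstrate.
func (s *shortCircuitSignal) Pending() bool {
	select {
	case <-s.ch:
		return true
	default:
		return false
	}
}

func main() {
	sig := newShortCircuitSignal()
	sig.ShortCircuit()
	sig.ShortCircuit() // coalesced with the first call
	fmt.Println(sig.Pending(), sig.Pending()) // prints true false
}
```

Under these semantics, a flood of updates during reconciliation results in at most one extra run once the loop becomes free, which is why the avalanche concern does not apply.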

func (e *enricher) getImages(deployment *storage.Deployment) []*storage.Image {
imageChan := make(chan imageChanResult, len(deployment.GetContainers()))

pullSecretSet := set.NewStringSet(e.serviceAccountStore.GetImagePullSecrets(deployment.GetNamespace(), deployment.GetServiceAccount())...)
Contributor

@lvalerom @fredrb here's another fun thing to consider for the sensor resync project: if a secret changes, and that secret is referenced as an image pull secret by a service account, then all deployments that use this service account need to be re-evaluated.
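The fan-out described here can be sketched as a reverse index: secret name, to the service accounts referencing it, to the deployments using those service accounts. The index shapes below are hypothetical, not sensor's actual data structures.

```go
package main

import "fmt"

// resyncIndex sketches the dependency chain from the comment above: when a
// secret changes, deployments are affected transitively via service accounts.
type resyncIndex struct {
	secretToSAs     map[string][]string // secret name -> "namespace/serviceaccount" keys
	saToDeployments map[string][]string // "namespace/serviceaccount" -> deployment IDs
}

// deploymentsAffectedBySecret returns every deployment whose service account
// lists the given secret as an image pull secret.
func (idx *resyncIndex) deploymentsAffectedBySecret(secret string) []string {
	var out []string
	for _, sa := range idx.secretToSAs[secret] {
		out = append(out, idx.saToDeployments[sa]...)
	}
	return out
}

func main() {
	idx := &resyncIndex{
		secretToSAs:     map[string][]string{"regcred": {"payments/default"}},
		saToDeployments: map[string][]string{"payments/default": {"dep-1", "dep-2"}},
	}
	fmt.Println(idx.deploymentsAffectedBySecret("regcred")) // prints [dep-1 dep-2]
}
```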

Refactor image integration pipeline, refactor proto messages
@dhaus67 dhaus67 requested a review from misberner October 12, 2022 04:32
Comment on lines +174 to 179
if tlsErr != nil {
log.Debugf("reaching out for TLS check to %s: %v", dockerCfg.GetEndpoint(), err)
// Assume TLS is true
validTLS = true
}
dockerCfg.Insecure = !validTLS
Contributor

Suggested change
-	if tlsErr != nil {
-		log.Debugf("reaching out for TLS check to %s: %v", dockerCfg.GetEndpoint(), err)
-		// Assume TLS is true
-		validTLS = true
-	}
-	dockerCfg.Insecure = !validTLS
+	if tlsErr != nil {
+		log.Debugf("reaching out for TLS check to %s: %v", dockerCfg.GetEndpoint(), err)
+		// Not enough evidence that we can skip TLS, so conservatively require it
+		dockerCfg.Insecure = false
+	} else {
+		dockerCfg.Insecure = !validTLS
+	}

Don't overload variables if the name doesn't fit the usage any more. validTLS is a misnomer in l.177.


func (s *pipelineImpl) OnFinish(_ string) {}

func (s *pipelineImpl) upsertImageIntegration(ctx context.Context, imageIntegration *storage.ImageIntegration) (bool, error) {
Contributor

Use a named return value for the bool return value, or at least add a comment
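The suggestion above could look like the sketch below, where the named return value documents at the signature what the bool means. The datastore is stubbed with a map; the real signatures in the PR differ.

```go
package main

import "fmt"

// upsertImageIntegration sketches a named bool return: existedBefore tells
// callers whether the integration was already present, without a comment.
func upsertImageIntegration(store map[string]string, id, integration string) (existedBefore bool, err error) {
	_, existedBefore = store[id]
	store[id] = integration
	return existedBefore, nil
}

func main() {
	store := map[string]string{}
	first, _ := upsertImageIntegration(store, "ii-1", "registry-a")
	second, _ := upsertImageIntegration(store, "ii-1", "registry-a-updated")
	fmt.Println(first, second) // prints false true
}
```

A caller can then write `if !existedBefore { shortCircuit() }` and the intent is clear from the name alone.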

if err := s.integrationManager.Upsert(imageIntegration); err != nil {
return exists, errors.Wrapf(err, "notifying of update for image integration %q", imageIntegration.GetId())
}
return exists, nil
Contributor

Didn't you now flip when to do a short-circuit? Why would we not want to short circuit upon creation of an image integration?

Contributor

Oh well, yup.


Contributor

@misberner misberner left a comment

Mostly looks good, except for what I think is an accidental flip in the behavior wrt when to reprocess

@dhaus67 dhaus67 requested a review from misberner October 18, 2022 04:45
@openshift-ci

openshift-ci bot commented Oct 18, 2022

@connorgorman: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Required | Rerun command
ci/prow/osd-aws-qa-e2e-tests | 7a51194 | false | /test osd-aws-qa-e2e-tests
ci/prow/openshift-oldest-qa-e2e-tests | 7a51194 | false | /test openshift-oldest-qa-e2e-tests
ci/prow/rosa-qa-e2e-tests | 7a51194 | false | /test rosa-qa-e2e-tests
ci/prow/gke-qa-e2e-tests | 9048c66 | false | /test gke-qa-e2e-tests
ci/prow/gke-postgres-qa-e2e-tests | 9048c66 | false | /test gke-postgres-qa-e2e-tests



@dhaus67
Contributor

dhaus67 commented Oct 18, 2022

Merging since the CI failures are fixed in master within #3415

@dhaus67 dhaus67 merged commit 2061f15 into master Oct 18, 2022
@dhaus67 dhaus67 deleted the cgorman-image-integration-fix branch October 18, 2022 11:03
ivan-degtiarenko pushed a commit that referenced this pull request Nov 5, 2022
…ecrets (#2572)

Co-authored-by: Daniel Haus <dhaus@redhat.com>

Labels

area/central, area/helm, area/operator, area/postgres, area/sensor, area/ui, pls-merge (Indicates to the *reviewer* that the PR can be merged once tests pass and approval is granted)
