ROX-11657: Scope autogenerated image integrations by the image pull secrets#2572
ROX-11657: Scope autogenerated image integrations by the image pull secrets#2572
Conversation
|
Skipping CI for Draft Pull Request. |
|
/test |
|
@connorgorman: The
The following commands are available to trigger optional jobs:
Use
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/test images push-images |
84457ed to
c94d024
Compare
c94d024 to
15cebd1
Compare
|
/retest |
|
Images are ready for the commit at 9048c66. To use with deploy scripts, first |
|
/test all |
| return nil | ||
| } | ||
|
|
||
| integrations, err := s.datastore.GetImageIntegrations(ctx, &v1.GetImageIntegrationsRequest{}) |
There was a problem hiding this comment.
Can already filter the integrations by cluster here instead of within line 90.
| integrations, err := s.datastore.GetImageIntegrations(ctx, &v1.GetImageIntegrationsRequest{}) | |
| integrations, err := s.datastore.GetImageIntegrations(ctx, &v1.GetImageIntegrationsRequest{Cluster: clusterID}) |
There was a problem hiding this comment.
Well, so long story, this doesn't work. I'll fix it in a follow up
| return errors.Wrap(err, "getting image integrations for reconciliation") | ||
| } | ||
| for _, integration := range integrations { | ||
| if integration.GetEcr() != nil { |
There was a problem hiding this comment.
out of curiosity: why do we skip ECR integrations?
There was a problem hiding this comment.
The ECR autogenerated integrations are special and use AWS IAM
There was a problem hiding this comment.
Maybe add a comment here, then?
aa6275b to
9027659
Compare
|
@dhaus67 I apologize for taking forever to clean this up, but I have made the requested changes and also fixed some issues. Fingers crossed on the e2es |
c44363a to
08f6ed6
Compare
08f6ed6 to
5750b72
Compare
|
/retest |
|
@dhaus67 This is good to go. The TestScan is a test that was written for summaries, but summaries are checked in qa-backend-e2e |
29e4770 to
d81fd6d
Compare
|
/retest |
| reserved 19; // Previously, scannerv2 | ||
|
|
||
| message Source { | ||
| string cluster_id = 1; |
There was a problem hiding this comment.
Sounds good, I've created ROX-12996 as a follow-up to remove exactly this.
| return errors.Wrap(err, "getting image integrations for reconciliation") | ||
| } | ||
| for _, integration := range integrations { | ||
| if integration.GetEcr() != nil { |
There was a problem hiding this comment.
Maybe add a comment here, then?
| if integration.GetEcr() != nil { | ||
| continue | ||
| } | ||
| if integration.GetClusterId() != clusterID { |
There was a problem hiding this comment.
I looked at v1.GetImageIntegrationsRequest and it seems that it supports specifying a cluster (name only, not ID?). However, the handling of that field is extremely weird (`central/imageintegration/datastore/datastore_impl.go:55-64):
integrations, err := ds.storage.GetAll(ctx)
if err != nil {
return nil, err
}
integrationSlice := integrations[:0]
for _, integration := range integrations {
if request.GetCluster() != "" {
continue
}
So if request specifies a non-empty cluster, it will always return an empty slice, while still doing a read on the DB. Not sure if there's a bug or this param was just never supported, but either way, this probably deserves a JIRA filed for it and addressed some time.
There was a problem hiding this comment.
| _, exists, err = s.datastore.GetImageIntegration(ctx, imageIntegration.GetId()) | ||
| if err != nil { | ||
| return errors.Wrapf(err, "retrieving image integration %q", imageIntegration.GetId()) | ||
| } | ||
|
|
||
| if _, err := s.datastore.AddImageIntegration(ctx, imageIntegration); err != nil { | ||
| return errors.Wrapf(err, "adding image integration %q", imageIntegration.GetId()) | ||
| } | ||
| if err := s.integrationManager.Upsert(imageIntegration); err != nil { | ||
| return errors.Wrapf(err, "notifying of update for image integration %q", imageIntegration.GetId()) | ||
| } | ||
| if !exists { | ||
| s.enrichAndDetectLoop.ShortCircuit() | ||
| } |
There was a problem hiding this comment.
Tbh, I find it a bit ugly that the pipeline is in charge of all of this. IMHO, the manager should both abstract away the store insertion logic as well as the need to run detection. However, that may be hard to refactor. Still, let's factor this out into its own method, at least?
| if err := s.integrationManager.Upsert(imageIntegration); err != nil { | ||
| return errors.Wrapf(err, "notifying of update for image integration %q", imageIntegration.GetId()) | ||
| } | ||
| if !exists { |
There was a problem hiding this comment.
This doesn't make too much sense imho. A modified integration might just as well be a reason to run detection again. It would be nice if we could somehow scope the re-evaluation to images potentially using that integration, but that's far, far beyond the scope of this PR.
I wonder if the concern is that doing this on every update might be too frequent. I don't think however that updates of image pull secrets are very different in frequency from creations - most are probably never updated. A concern is the initial cluster connection. when we receive many new integrations and would start the detection loop several times. To avoid this, it might make sense to add a hasSynced concurrency.Flag, that is set to true on Reconcile followed by triggering detection, and then trigger detection here on every update whenever s.hasSynced.Get() is true.
There was a problem hiding this comment.
@misberner Hm, not sure I completely get it from your comment, sorry.
Your suggestion is to set the flag within Reconcile() and then only short-circuit if the flag is set, right?
Regarding the frequencies, I saw this comment indicating a "higher" volume of updates due to service accounts, but to be fair I'm not quite sure whether this is of any concern.
There was a problem hiding this comment.
Let me try to rephrase:
- Ideally, if an image integration is updated, we would want to reprocess all deployments with previously unscannable/unretrievable images that potentially use this registry integration.
- As an approximation of the above, we could reprocess all deployments that potentially use this registry integration, unscannable images or not.
- As an approximation of the above, we could reprocess all deployments that are somewhat related to this registry integration (e.g., same cluster), even if they don't use it, and regardless of whether they have unscannable images.
The point is that an updated registry integration can only affect a deployment evaluation if we can now scan an image used by the deployment, that we previously could not scan (due to the content-addressing nature of IDs, "being able to scan an image" is a binary property that we can assume never reverts to false once it became `true). Otherwise, there is no information that can be gained from the updated image integration and that could change anything about the evaluation.
The existing code however only reprocesses a deployment when a new image integration is created. This is neither an over- nor under-approximation of the "ideal" handling stated at the top of this post. It unnecessarily reprocesses all deployments (even those not using the updated image integration nor using an image from one of the registries this integration covers, it might even reprocess deployments in other clusters, and it also reprocesses deployment for which we already have comprehensive scan results for all images). At the same time, it misses cases when a previously existing, erroneous image integration is fixed, and there actually are deployments with still unscanned images due to the previously erroneous image integration. Those will only be fixed after ~1h, whereas those with a missing image integration are fixed directly on creation.
My idea was therefore to just run reprocessing on every image integration update, in order to arrive at something that at least is a consistent over-approximation. Still reprocessing far too much, but at least not missing anything; a fully-overblown but sufficient solution rather than a half-overblown yet still insufficient one.
My suspicion was that the case this approach wasn't chosen was due to the fact that during reconciliation (i.e., a sensor reconnects), we would receive a flood of image integration updates, causing an avalanche of reprocessings. However: (a) this is not how ShortCircuit() works anyway, it just triggers a signal, but if the enricher/detector is still busy, a subseqent call will just do nothing (note: this alone makes my comment on setting a flag and calling ShortCircuit once after reconciliation pointless, so just ignore it), and (b) the comment you pointed out suggests that this is due to a high volume of (spurious?) updates on OpenShift rather than the reconciliation phase.
So, in summary, keep the existing behavior (which I think now has accidentally been flipped), until we either can do reprocessing in a more targeted way (unscanned images only?), or understand why image integration updates on OpenShift are so noisy.
| func (e *enricher) getImages(deployment *storage.Deployment) []*storage.Image { | ||
| imageChan := make(chan imageChanResult, len(deployment.GetContainers())) | ||
|
|
||
| pullSecretSet := set.NewStringSet(e.serviceAccountStore.GetImagePullSecrets(deployment.GetNamespace(), deployment.GetServiceAccount())...) |
Refactor image integration pipeline, refactor proto messages
| if tlsErr != nil { | ||
| log.Debugf("reaching out for TLS check to %s: %v", dockerCfg.GetEndpoint(), err) | ||
| // Assume TLS is true | ||
| validTLS = true | ||
| } | ||
| dockerCfg.Insecure = !validTLS |
There was a problem hiding this comment.
| if tlsErr != nil { | |
| log.Debugf("reaching out for TLS check to %s: %v", dockerCfg.GetEndpoint(), err) | |
| // Assume TLS is true | |
| validTLS = true | |
| } | |
| dockerCfg.Insecure = !validTLS | |
| if tlsErr != nil { | |
| log.Debugf("reaching out for TLS check to %s: %v", dockerCfg.GetEndpoint(), err) | |
| // Not enough evidence that we can skip TLS, so conservatively require it | |
| dockerCfg.Insecure = false | |
| } else { | |
| dockerCfg.Insecure = !validTLS | |
| } |
Don't overload variables if the name doesn't fit the usage any more. validTLS is a misnomer in l.177.
|
|
||
| func (s *pipelineImpl) OnFinish(_ string) {} | ||
|
|
||
| func (s *pipelineImpl) upsertImageIntegration(ctx context.Context, imageIntegration *storage.ImageIntegration) (bool, error) { |
There was a problem hiding this comment.
Use a named return value for the bool return value, or at least add a comment
| if err := s.integrationManager.Upsert(imageIntegration); err != nil { | ||
| return exists, errors.Wrapf(err, "notifying of update for image integration %q", imageIntegration.GetId()) | ||
| } | ||
| return exists, nil |
There was a problem hiding this comment.
Didn't you now flip when to do a short-circuit? Why would we not want to short circuit upon creation of an image integration?
| if err := s.integrationManager.Upsert(imageIntegration); err != nil { | ||
| return errors.Wrapf(err, "notifying of update for image integration %q", imageIntegration.GetId()) | ||
| } | ||
| if !exists { |
There was a problem hiding this comment.
Let me try to rephrase:
- Ideally, if an image integration is updated, we would want to reprocess all deployments with previously unscannable/unretrievable images that potentially use this registry integration.
- As an approximation of the above, we could reprocess all deployments that potentially use this registry integration, unscannable images or not.
- As an approximation of the above, we could reprocess all deployments that are somewhat related to this registry integration (e.g., same cluster), even if they don't use it, and regardless of whether they have unscannable images.
The point is that an updated registry integration can only affect a deployment evaluation if we can now scan an image used by the deployment, that we previously could not scan (due to the content-addressing nature of IDs, "being able to scan an image" is a binary property that we can assume never reverts to false once it became `true). Otherwise, there is no information that can be gained from the updated image integration and that could change anything about the evaluation.
The existing code however only reprocesses a deployment when a new image integration is created. This is neither an over- nor under-approximation of the "ideal" handling stated at the top of this post. It unnecessarily reprocesses all deployments (even those not using the updated image integration nor using an image from one of the registries this integration covers, it might even reprocess deployments in other clusters, and it also reprocesses deployment for which we already have comprehensive scan results for all images). At the same time, it misses cases when a previously existing, erroneous image integration is fixed, and there actually are deployments with still unscanned images due to the previously erroneous image integration. Those will only be fixed after ~1h, whereas those with a missing image integration are fixed directly on creation.
My idea was therefore to just run reprocessing on every image integration update, in order to arrive at something that at least is a consistent over-approximation. Still reprocessing far too much, but at least not missing anything; a fully-overblown but sufficient solution rather than a half-overblown yet still insufficient one.
My suspicion was that the case this approach wasn't chosen was due to the fact that during reconciliation (i.e., a sensor reconnects), we would receive a flood of image integration updates, causing an avalanche of reprocessings. However: (a) this is not how ShortCircuit() works anyway, it just triggers a signal, but if the enricher/detector is still busy, a subseqent call will just do nothing (note: this alone makes my comment on setting a flag and calling ShortCircuit once after reconciliation pointless, so just ignore it), and (b) the comment you pointed out suggests that this is due to a high volume of (spurious?) updates on OpenShift rather than the reconciliation phase.
So, in summary, keep the existing behavior (which I think now has accidentally been flipped), until we either can do reprocessing in a more targeted way (unscanned images only?), or understand why image integration updates on OpenShift are so noisy.
misberner
left a comment
There was a problem hiding this comment.
Mostly looks good, except for what I think is an accidental flip in the behavior wrt when to reprocess
|
@connorgorman: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
|
Merging since the CI failures are fixed in master within #3415 |
…ecrets (#2572) Co-authored-by: Daniel Haus <dhaus@redhat.com>
Description
This PR adds a new message
Sourceto theImageIntegrationproto with the goal to scope autogenerated integrations to a specific cluster, namespace, and pull secrets.For this to work, a couple of changes were required:
ImageIntegrationproto, adding theSourcemessage.ScanImageInternalRequest, allowing sending thecluster_id, namespace, and pullsecretsthat should be used for scanning.ServiceAccountStore/Dispatcherwithin sensor, allowing to manage image pull secrets associated with a service account for a specific namespace.ScanImageInternalRequest.Checklist
- [ ] Unit test and regression tests added- [ ] Evaluated and added CHANGELOG entry if required- [ ] Determined and documented upgrade steps- [ ] Documented user facing changes (create PR based on openshift/openshift-docs and merge into rhacs-docs)Testing Performed