fix: Harden informer cache with label selectors and memory optimizations#6242

Open
jyejare wants to merge 2 commits into feast-dev:master from jyejare:ugo-harden-cache

Conversation

Collaborator

@jyejare jyejare commented Apr 8, 2026

Summary

The feast-operator's Owns() calls create cluster-wide informers for ConfigMaps, Deployments, Services, and other resource types. On clusters with a large number of these objects, the informer cache can grow beyond the operator's 256Mi memory limit, causing OOMKill and restarts.

Changes

ByObject label selectors for all owned resource types

Restrict informer caches to only objects with app.kubernetes.io/managed-by: feast-operator. Covers all 10 owned types: ConfigMap, Deployment, Service, ServiceAccount, PVC, RoleBinding, Role, CronJob, HPA, PDB. Extracted into newCacheOptions() for clarity.
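The filtering effect can be illustrated with a minimal stdlib sketch (hypothetical predicate; the actual change configures a label selector in `cache.ByObject` rather than a Go function):

```go
package main

import "fmt"

// managedByMatch mimics what the informer-side label selector decides:
// only objects carrying app.kubernetes.io/managed-by: feast-operator
// are admitted to the cache. Illustrative only — the real filter is a
// labels.Selector passed to controller-runtime's cache options.
func managedByMatch(objLabels map[string]string) bool {
	return objLabels["app.kubernetes.io/managed-by"] == "feast-operator"
}

func main() {
	// An operator-managed ConfigMap is cached; unrelated objects are not.
	fmt.Println(managedByMatch(map[string]string{"app.kubernetes.io/managed-by": "feast-operator"})) // true
	fmt.Println(managedByMatch(map[string]string{"app": "something-else"}))                          // false
}
```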

DefaultTransform: cache.TransformStripManagedFields()

Strip managedFields from all cached objects, reducing per-object memory footprint by ~30-50%.
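A cache transform rewrites each object before it is stored; a stdlib-only sketch of the idea (hypothetical `obj` type — the real hook is controller-runtime's `cache.TransformStripManagedFields`, operating on Kubernetes objects):

```go
package main

import "fmt"

// obj stands in for a Kubernetes object; ManagedFields is the
// server-side-apply bookkeeping that can dominate object size.
type obj struct {
	Name          string
	ManagedFields []string
}

// stripManagedFields drops the managedFields payload before the object
// is placed in the cache, shrinking its resident footprint.
func stripManagedFields(o obj) obj {
	o.ManagedFields = nil
	return o
}

func main() {
	cached := stripManagedFields(obj{Name: "cm-1", ManagedFields: []string{"apply", "history", "entries"}})
	fmt.Println(cached.Name, len(cached.ManagedFields)) // cm-1 0
}
```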

GOMEMLIMIT=230MiB

Set the Go runtime's soft memory limit to 90% of the 256Mi container limit. This triggers GC pressure before a hard OOMKill, as defense in depth.

Additional changes

  • Add app.kubernetes.io/managed-by: feast-operator label to getLabels() so all FeatureStore-managed resources carry it
  • Introduce getSelectorLabels() for immutable selectors (Deployment spec.selector, Service spec.selector, TopologySpreadConstraints, PodAffinity) to avoid breaking existing resources on upgrade
  • Standardize notebook controller's managed-by label to app.kubernetes.io/managed-by
  • Use shared constants (services.ManagedByLabelKey/Value) throughout
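The selector/metadata split described above can be sketched as follows (label keys and values taken from this PR's discussion; function bodies are hypothetical):

```go
package main

import "fmt"

// getSelectorLabels returns only the immutable labels safe to use in
// Deployment/Service selectors — adding keys here would break
// existing resources on upgrade, since selectors cannot be changed.
func getSelectorLabels(name string) map[string]string {
	return map[string]string{"feast.dev/name": name}
}

// getLabels layers the mutable metadata (including the managed-by
// label the informer cache filters on) on top of the selector labels.
func getLabels(name string) map[string]string {
	l := getSelectorLabels(name)
	l["app.kubernetes.io/managed-by"] = "feast-operator"
	return l
}

func main() {
	fmt.Println(len(getSelectorLabels("demo")), len(getLabels("demo"))) // 1 2
}
```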

Test Results

Verified on a cluster with a large number of ConfigMaps pre-loaded:

| Metric | Before | After |
| --- | --- | --- |
| Memory usage | 254Mi (at limit) | 25Mi |
| Stability | OOMKilled, CrashLoopBackOff | Stable, no restarts |

Test plan

  • Deploy fixed operator on RHOAI 3.3.0 cluster
  • Verify memory usage stays well below 256Mi limit under load
  • Verify no OOMKill or CrashLoopBackOff
  • Run existing unit tests (make test) — all pass
  • Verify getSelectorLabels() prevents immutable selector breakage on upgrade

Summary by CodeRabbit

  • Chores
    • Operator cache now only watches/caches resources marked with the managed-by label, reducing resource usage.
    • Added GOMEMLIMIT=230MiB to the controller-manager container.
    • Standardized managed-by label constants and separated immutable selector labels from mutable metadata to improve stability.
  • Tests
    • Updated tests to use the shared label constants.


@jyejare jyejare requested a review from a team as a code owner April 8, 2026 17:15
@jyejare jyejare force-pushed the ugo-harden-cache branch 3 times, most recently from eab7bf4 to aa69c5b Compare April 8, 2026 18:34

Signed-off-by: Jitendra Yejare <11752425+jyejare@users.noreply.github.com>
@jyejare jyejare force-pushed the ugo-harden-cache branch from aa69c5b to 6a2995e Compare April 9, 2026 08:57

@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 2 new potential issues.

⚠️ 1 issue in files not directly in the diff

⚠️ Status.Selector reports getLabels() instead of getSelectorLabels(), mismatches actual Deployment selector (infra/feast-operator/internal/controller/services/scaling.go:252-253)

At scaling.go:252-253, cr.Status.Selector is computed from feast.getLabels() which now includes the ManagedByLabelKey. However, the Deployment's actual spec.selector uses getSelectorLabels() (only the name label). The Status.Selector field should reflect the actual selector used to identify the pods belonging to this FeatureStore's deployment. Any tooling or user inspecting .status.selector will see feast.dev/name=X,app.kubernetes.io/managed-by=feast-operator while the real Deployment selector is just feast.dev/name=X.
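The mismatch is easiest to see by rendering both label sets the way `.status.selector` appears (hypothetical helper; key/value strings follow the review comment above):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// formatSelector renders a label map as the comma-separated
// key=value string shown in .status.selector (sorted for stability).
func formatSelector(l map[string]string) string {
	parts := make([]string, 0, len(l))
	for k, v := range l {
		parts = append(parts, k+"="+v)
	}
	sort.Strings(parts)
	return strings.Join(parts, ",")
}

func main() {
	// What the Deployment's spec.selector actually matches on:
	fmt.Println(formatSelector(map[string]string{"feast.dev/name": "X"}))
	// What Status.Selector currently reports (extra managed-by pair):
	fmt.Println(formatSelector(map[string]string{
		"feast.dev/name":               "X",
		"app.kubernetes.io/managed-by": "feast-operator",
	}))
}
```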

View 9 additional findings in Devin Review.


Comment on lines +77 to +89
```go
ByObject: map[client.Object]cache.ByObject{
	&corev1.ConfigMap{}:                      managedByFilter,
	&appsv1.Deployment{}:                     managedByFilter,
	&corev1.Service{}:                        managedByFilter,
	&corev1.ServiceAccount{}:                 managedByFilter,
	&corev1.PersistentVolumeClaim{}:          managedByFilter,
	&rbacv1.RoleBinding{}:                    managedByFilter,
	&rbacv1.Role{}:                           managedByFilter,
	&batchv1.CronJob{}:                       managedByFilter,
	&autoscalingv2.HorizontalPodAutoscaler{}: managedByFilter,
	&policyv1.PodDisruptionBudget{}:          managedByFilter,
},
}
```


🔴 Cache label filter breaks operator for pre-existing resources on upgrade

The new cache ByObject configuration at infra/feast-operator/cmd/main.go:77-88 restricts informer caches to only watch resources with the app.kubernetes.io/managed-by: feast-operator label. However, only ConfigMap and Secret are in the DisableFor list (bypassing cache). All other types (Deployment, Service, ServiceAccount, PVC, RoleBinding, Role, CronJob, HPA, PDB) are read through the label-filtered cache.

On upgrade from a prior version, existing operator-managed resources won't have the managed-by label. When the operator reconciles an existing FeatureStore:

  1. controllerutil.CreateOrUpdate() calls Get() → goes through the cache → cache returns NotFound (object exists in API server but lacks the label, so the informer never cached it)
  2. CreateOrUpdate then calls Create() → fails with AlreadyExists
  3. Error propagates up, reconciliation fails and retries forever

This is a permanent deadlock: the operator can never update the resource to add the label because it can't find it through the cache. This affects all existing FeatureStore instances after an operator upgrade.
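The failure sequence can be modeled with a toy in-memory sketch (all names hypothetical — `apiServer` holds every object, while the filtered `cache` only holds labeled ones):

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errNotFound      = errors.New("not found")
	errAlreadyExists = errors.New("already exists")
)

// store maps object name -> labels.
type store map[string]map[string]string

// get models a cached read: objects missing from the filtered cache
// look NotFound even if they exist in the API server.
func get(cache store, name string) error {
	if _, ok := cache[name]; !ok {
		return errNotFound
	}
	return nil
}

// create models the API server rejecting a duplicate object.
func create(api store, name string) error {
	if _, ok := api[name]; ok {
		return errAlreadyExists
	}
	return nil
}

func main() {
	apiServer := store{"feast-deploy": {}} // pre-upgrade object, no managed-by label
	cache := store{}                       // label filter excluded it from the informer

	// CreateOrUpdate: Get misses the cache, then Create hits the
	// API server's AlreadyExists — and this repeats on every retry.
	fmt.Println(get(cache, "feast-deploy"))      // not found
	fmt.Println(create(apiServer, "feast-deploy")) // already exists
}
```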

Prompt for agents
The cache ByObject label-selector filtering causes all cached Get/List calls to miss pre-existing resources that lack the managed-by label. On upgrade, this creates a permanent deadlock where the operator cannot reconcile existing FeatureStore instances.

Possible approaches to fix:
1. Add the resource types that might pre-exist (Deployment, Service, ServiceAccount, PVC, RoleBinding, Role, CronJob, HPA, PDB) to the Client.CacheOptions.DisableFor list, similar to ConfigMap and Secret. This bypasses the cache for reads but still benefits from the label-filtered watch for triggering reconciliation events.
2. Implement a startup migration that labels all existing operator-owned resources (identified by owner references to FeatureStore CRs) with the managed-by label before starting the informers.
3. Use a cache.TransformFunc that adds the label to objects as they are fetched, though this doesn't help with the initial list.

Option 1 is simplest but loses cache benefits for reads. Option 2 is more correct but requires additional startup logic. The key files involved are cmd/main.go (cache config), and all resource creation paths in internal/controller/services/services.go, scaling.go, client.go, cronjob.go, namespace_registry.go, and internal/controller/authz/authz.go.


Signed-off-by: Jitendra Yejare <11752425+jyejare@users.noreply.github.com>

@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 1 new potential issue.

View 12 additional findings in Devin Review.


Comment on lines 206 to 207
}
}
Copy link
Copy Markdown
Contributor


🔴 Namespace registry Role Get from filtered cache fails after tolerated AlreadyExists on Create

In setNamespaceRegistryRoleBinding, the code creates a Role via Client.Create and tolerates AlreadyExists (line 187-189). It then immediately calls Client.Get on the same Role (line 193-198). Client.Get goes through the label-filtered cache. If the Role already existed without the managed-by label (pre-upgrade), or even if it was just created (cache hasn't synced the watch event yet), Get returns NotFound, causing the function to return an error. This breaks namespace registry reconciliation on upgrade and introduces a race condition even on fresh installs.

(Refers to lines 182-207)
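The suggested switch to a create-or-update flow can be sketched in miniature (hypothetical model; `controllerutil.CreateOrUpdate` actually takes a context, a client, the object, and a mutate function):

```go
package main

import "fmt"

// store maps object name -> serialized spec.
type store map[string]string

// createOrUpdate reads the current state once, then either creates or
// mutates in place — avoiding the Create-then-Get round trip that the
// filtered cache can turn into a NotFound race.
func createOrUpdate(s store, name string, mutate func(cur string) string) string {
	cur, ok := s[name]
	if !ok {
		s[name] = mutate("")
		return "created"
	}
	s[name] = mutate(cur)
	return "updated"
}

func main() {
	roles := store{}
	fmt.Println(createOrUpdate(roles, "registry-role", func(string) string { return "v1" })) // created
	fmt.Println(createOrUpdate(roles, "registry-role", func(string) string { return "v2" })) // updated
	fmt.Println(roles["registry-role"])                                                      // v2
}
```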

Prompt for agents
In setNamespaceRegistryRoleBinding (namespace_registry.go), the Role is created with Client.Create (AlreadyExists tolerated), then immediately re-fetched with Client.Get through the filtered cache. This fails if (a) the Role existed before the managed-by label was introduced, or (b) the cache informer hasn't processed the watch event yet for a newly created Role.

The fix should switch from the manual Create-then-Get pattern to using controllerutil.CreateOrUpdate for the Role (similar to how RoleBindings are handled), or use a direct API server read (bypassing cache) for the re-fetch. If using CreateOrUpdate, note that this is also affected by BUG-0001 for the upgrade case. A robust approach would be to use Server-Side Apply (Patch with ApplyPatchType) for the Role, similar to how HPA and PDB are handled in scaling.go, which avoids both the cache lookup and AlreadyExists issues entirely.

