fix: Harden informer cache with label selectors and memory optimizations#6242

Open
jyejare wants to merge 2 commits into feast-dev:master from jyejare:ugo-harden-cache

Conversation

Collaborator

@jyejare jyejare commented Apr 8, 2026

Summary

The feast-operator's Owns() calls create cluster-wide informers for ConfigMaps, Deployments, Services, and other resource types. On clusters with a large number of these objects, the informer cache can grow beyond the operator's 256Mi memory limit, causing OOMKill and restarts.

Changes

ByObject label selectors for all owned resource types

Restrict informer caches to only objects with app.kubernetes.io/managed-by: feast-operator. Covers all 10 owned types: ConfigMap, Deployment, Service, ServiceAccount, PVC, RoleBinding, Role, CronJob, HPA, PDB. Extracted into newCacheOptions() for clarity.
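The filtering effect can be illustrated with a minimal stdlib sketch (hypothetical predicate; the actual change configures a label selector in `cache.ByObject` rather than a Go function):

```go
package main

import "fmt"

// managedByMatch mimics what the informer-side label selector decides:
// only objects carrying app.kubernetes.io/managed-by: feast-operator
// are admitted to the cache. Illustrative only — the real filter is a
// labels.Selector passed to controller-runtime's cache options.
func managedByMatch(objLabels map[string]string) bool {
	return objLabels["app.kubernetes.io/managed-by"] == "feast-operator"
}

func main() {
	// An operator-managed ConfigMap is cached; unrelated objects are not.
	fmt.Println(managedByMatch(map[string]string{"app.kubernetes.io/managed-by": "feast-operator"})) // true
	fmt.Println(managedByMatch(map[string]string{"app": "something-else"}))                          // false
}
```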

DefaultTransform: cache.TransformStripManagedFields()

Strip managedFields from all cached objects, reducing per-object memory footprint by ~30-50%.
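A cache transform rewrites each object before it is stored; a stdlib-only sketch of the idea (hypothetical `obj` type — the real hook is controller-runtime's `cache.TransformStripManagedFields`, operating on Kubernetes objects):

```go
package main

import "fmt"

// obj stands in for a Kubernetes object; ManagedFields is the
// server-side-apply bookkeeping that can dominate object size.
type obj struct {
	Name          string
	ManagedFields []string
}

// stripManagedFields drops the managedFields payload before the object
// is placed in the cache, shrinking its resident footprint.
func stripManagedFields(o obj) obj {
	o.ManagedFields = nil
	return o
}

func main() {
	cached := stripManagedFields(obj{Name: "cm-1", ManagedFields: []string{"apply", "history", "entries"}})
	fmt.Println(cached.Name, len(cached.ManagedFields)) // cm-1 0
}
```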

GOMEMLIMIT=230MiB

Set the Go runtime's soft memory limit to 90% of the 256Mi container limit. This triggers GC pressure before a hard OOMKill, as defense in depth.

Additional changes

  • Add app.kubernetes.io/managed-by: feast-operator label to getLabels() so all FeatureStore-managed resources carry it
  • Introduce getSelectorLabels() for immutable selectors (Deployment spec.selector, Service spec.selector, TopologySpreadConstraints, PodAffinity) to avoid breaking existing resources on upgrade
  • Standardize notebook controller's managed-by label to app.kubernetes.io/managed-by
  • Use shared constants (services.ManagedByLabelKey/Value) throughout
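The selector/metadata split described above can be sketched as follows (label keys and values taken from this PR's discussion; function bodies are hypothetical):

```go
package main

import "fmt"

// getSelectorLabels returns only the immutable labels safe to use in
// Deployment/Service selectors — adding keys here would break
// existing resources on upgrade, since selectors cannot be changed.
func getSelectorLabels(name string) map[string]string {
	return map[string]string{"feast.dev/name": name}
}

// getLabels layers the mutable metadata (including the managed-by
// label the informer cache filters on) on top of the selector labels.
func getLabels(name string) map[string]string {
	l := getSelectorLabels(name)
	l["app.kubernetes.io/managed-by"] = "feast-operator"
	return l
}

func main() {
	fmt.Println(len(getSelectorLabels("demo")), len(getLabels("demo"))) // 1 2
}
```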

Test Results

Verified on a cluster with a large number of ConfigMaps pre-loaded:

| Metric | Before | After |
| --- | --- | --- |
| Memory usage | 254Mi (at limit) | 25Mi |
| Stability | OOMKilled, CrashLoopBackOff | Stable, no restarts |

Test plan

  • Deploy fixed operator on RHOAI 3.3.0 cluster
  • Verify memory usage stays well below 256Mi limit under load
  • Verify no OOMKill or CrashLoopBackOff
  • Run existing unit tests (make test) — all pass
  • Verify getSelectorLabels() prevents immutable selector breakage on upgrade

Summary by CodeRabbit

  • Chores
    • Operator cache now only watches/caches resources marked with the managed-by label, reducing resource usage.
    • Added GOMEMLIMIT=230MiB to the controller-manager container.
    • Standardized managed-by label constants and separated immutable selector labels from mutable metadata to improve stability.
  • Tests
    • Updated tests to use the shared label constants.


@jyejare jyejare requested a review from a team as a code owner April 8, 2026 17:15
@jyejare jyejare force-pushed the ugo-harden-cache branch 3 times, most recently from eab7bf4 to aa69c5b Compare April 8, 2026 18:34

Signed-off-by: Jitendra Yejare <11752425+jyejare@users.noreply.github.com>
@jyejare jyejare force-pushed the ugo-harden-cache branch from aa69c5b to 6a2995e Compare April 9, 2026 08:57

@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 2 new potential issues.

⚠️ 1 issue in files not directly in the diff

⚠️ Status.Selector reports getLabels() instead of getSelectorLabels(), mismatches actual Deployment selector (infra/feast-operator/internal/controller/services/scaling.go:252-253)

At scaling.go:252-253, cr.Status.Selector is computed from feast.getLabels() which now includes the ManagedByLabelKey. However, the Deployment's actual spec.selector uses getSelectorLabels() (only the name label). The Status.Selector field should reflect the actual selector used to identify the pods belonging to this FeatureStore's deployment. Any tooling or user inspecting .status.selector will see feast.dev/name=X,app.kubernetes.io/managed-by=feast-operator while the real Deployment selector is just feast.dev/name=X.
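The mismatch is easiest to see by rendering both label sets the way `.status.selector` appears (hypothetical helper; key/value strings follow the review comment above):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// formatSelector renders a label map as the comma-separated
// key=value string shown in .status.selector (sorted for stability).
func formatSelector(l map[string]string) string {
	parts := make([]string, 0, len(l))
	for k, v := range l {
		parts = append(parts, k+"="+v)
	}
	sort.Strings(parts)
	return strings.Join(parts, ",")
}

func main() {
	// What the Deployment's spec.selector actually matches on:
	fmt.Println(formatSelector(map[string]string{"feast.dev/name": "X"}))
	// What Status.Selector currently reports (extra managed-by pair):
	fmt.Println(formatSelector(map[string]string{
		"feast.dev/name":               "X",
		"app.kubernetes.io/managed-by": "feast-operator",
	}))
}
```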

View 9 additional findings in Devin Review.


Comment on lines +77 to +89
```go
ByObject: map[client.Object]cache.ByObject{
	&corev1.ConfigMap{}:                      managedByFilter,
	&appsv1.Deployment{}:                     managedByFilter,
	&corev1.Service{}:                        managedByFilter,
	&corev1.ServiceAccount{}:                 managedByFilter,
	&corev1.PersistentVolumeClaim{}:          managedByFilter,
	&rbacv1.RoleBinding{}:                    managedByFilter,
	&rbacv1.Role{}:                           managedByFilter,
	&batchv1.CronJob{}:                       managedByFilter,
	&autoscalingv2.HorizontalPodAutoscaler{}: managedByFilter,
	&policyv1.PodDisruptionBudget{}:          managedByFilter,
},
}
```


🔴 Cache label filter breaks operator for pre-existing resources on upgrade

The new cache ByObject configuration at infra/feast-operator/cmd/main.go:77-88 restricts informer caches to only watch resources with the app.kubernetes.io/managed-by: feast-operator label. However, only ConfigMap and Secret are in the DisableFor list (bypassing cache). All other types (Deployment, Service, ServiceAccount, PVC, RoleBinding, Role, CronJob, HPA, PDB) are read through the label-filtered cache.

On upgrade from a prior version, existing operator-managed resources won't have the managed-by label. When the operator reconciles an existing FeatureStore:

  1. controllerutil.CreateOrUpdate() calls Get() → goes through the cache → cache returns NotFound (object exists in API server but lacks the label, so the informer never cached it)
  2. CreateOrUpdate then calls Create() → fails with AlreadyExists
  3. Error propagates up, reconciliation fails and retries forever

This is a permanent deadlock: the operator can never update the resource to add the label because it can't find it through the cache. This affects all existing FeatureStore instances after an operator upgrade.
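The failure sequence can be modeled with a toy in-memory sketch (all names hypothetical — `apiServer` holds every object, while the filtered `cache` only holds labeled ones):

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errNotFound      = errors.New("not found")
	errAlreadyExists = errors.New("already exists")
)

// store maps object name -> labels.
type store map[string]map[string]string

// get models a cached read: objects missing from the filtered cache
// look NotFound even if they exist in the API server.
func get(cache store, name string) error {
	if _, ok := cache[name]; !ok {
		return errNotFound
	}
	return nil
}

// create models the API server rejecting a duplicate object.
func create(api store, name string) error {
	if _, ok := api[name]; ok {
		return errAlreadyExists
	}
	return nil
}

func main() {
	apiServer := store{"feast-deploy": {}} // pre-upgrade object, no managed-by label
	cache := store{}                       // label filter excluded it from the informer

	// CreateOrUpdate: Get misses the cache, then Create hits the
	// API server's AlreadyExists — and this repeats on every retry.
	fmt.Println(get(cache, "feast-deploy"))      // not found
	fmt.Println(create(apiServer, "feast-deploy")) // already exists
}
```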

Prompt for agents
The cache ByObject label-selector filtering causes all cached Get/List calls to miss pre-existing resources that lack the managed-by label. On upgrade, this creates a permanent deadlock where the operator cannot reconcile existing FeatureStore instances.

Possible approaches to fix:
1. Add the resource types that might pre-exist (Deployment, Service, ServiceAccount, PVC, RoleBinding, Role, CronJob, HPA, PDB) to the Client.CacheOptions.DisableFor list, similar to ConfigMap and Secret. This bypasses the cache for reads but still benefits from the label-filtered watch for triggering reconciliation events.
2. Implement a startup migration that labels all existing operator-owned resources (identified by owner references to FeatureStore CRs) with the managed-by label before starting the informers.
3. Use a cache.TransformFunc that adds the label to objects as they are fetched, though this doesn't help with the initial list.

Option 1 is simplest but loses cache benefits for reads. Option 2 is more correct but requires additional startup logic. The key files involved are cmd/main.go (cache config), and all resource creation paths in internal/controller/services/services.go, scaling.go, client.go, cronjob.go, namespace_registry.go, and internal/controller/authz/authz.go.


Signed-off-by: Jitendra Yejare <11752425+jyejare@users.noreply.github.com>

@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 1 new potential issue.

View 12 additional findings in Devin Review.


Comment on lines 206 to 207
}
}
Copy link
Copy Markdown
Contributor


🔴 Namespace registry Role Get from filtered cache fails after tolerated AlreadyExists on Create

In setNamespaceRegistryRoleBinding, the code creates a Role via Client.Create and tolerates AlreadyExists (line 187-189). It then immediately calls Client.Get on the same Role (line 193-198). Client.Get goes through the label-filtered cache. If the Role already existed without the managed-by label (pre-upgrade), or even if it was just created (cache hasn't synced the watch event yet), Get returns NotFound, causing the function to return an error. This breaks namespace registry reconciliation on upgrade and introduces a race condition even on fresh installs.

(Refers to lines 182-207)
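The suggested switch to a create-or-update flow can be sketched in miniature (hypothetical model; `controllerutil.CreateOrUpdate` actually takes a context, a client, the object, and a mutate function):

```go
package main

import "fmt"

// store maps object name -> serialized spec.
type store map[string]string

// createOrUpdate reads the current state once, then either creates or
// mutates in place — avoiding the Create-then-Get round trip that the
// filtered cache can turn into a NotFound race.
func createOrUpdate(s store, name string, mutate func(cur string) string) string {
	cur, ok := s[name]
	if !ok {
		s[name] = mutate("")
		return "created"
	}
	s[name] = mutate(cur)
	return "updated"
}

func main() {
	roles := store{}
	fmt.Println(createOrUpdate(roles, "registry-role", func(string) string { return "v1" })) // created
	fmt.Println(createOrUpdate(roles, "registry-role", func(string) string { return "v2" })) // updated
	fmt.Println(roles["registry-role"])                                                      // v2
}
```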

Prompt for agents
In setNamespaceRegistryRoleBinding (namespace_registry.go), the Role is created with Client.Create (AlreadyExists tolerated), then immediately re-fetched with Client.Get through the filtered cache. This fails if (a) the Role existed before the managed-by label was introduced, or (b) the cache informer hasn't processed the watch event yet for a newly created Role.

The fix should switch from the manual Create-then-Get pattern to using controllerutil.CreateOrUpdate for the Role (similar to how RoleBindings are handled), or use a direct API server read (bypassing cache) for the re-fetch. If using CreateOrUpdate, note that this is also affected by BUG-0001 for the upgrade case. A robust approach would be to use Server-Side Apply (Patch with ApplyPatchType) for the Role, similar to how HPA and PDB are handled in scaling.go, which avoids both the cache lookup and AlreadyExists issues entirely.

