fix: Harden informer cache with label selectors and memory optimizations#6242

Open
jyejare wants to merge 2 commits into feast-dev:master from jyejare:ugo-harden-cache

Conversation


jyejare (Collaborator) commented Apr 8, 2026

Summary

The feast-operator's Owns() calls create cluster-wide informers for ConfigMaps, Deployments, Services, and other resource types. On clusters with a large number of these objects, the informer cache can grow beyond the operator's 256Mi memory limit, causing OOMKill and restarts.

Changes

ByObject label selectors for all owned resource types

Restrict informer caches to only objects with app.kubernetes.io/managed-by: feast-operator. Covers all 10 owned types: ConfigMap, Deployment, Service, ServiceAccount, PVC, RoleBinding, Role, CronJob, HPA, PDB. Extracted into newCacheOptions() for clarity.
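In controller-runtime this restriction is expressed through `cache.Options.ByObject`, giving each owned type a label selector so non-matching objects never enter the cache at all. The stdlib-only sketch below models the effect of such a selector; the object shape and helper names are illustrative, not the PR's actual code.

```go
package main

import "fmt"

// obj models just the labels of a Kubernetes object.
type obj struct {
	name   string
	labels map[string]string
}

const (
	managedByKey   = "app.kubernetes.io/managed-by"
	managedByValue = "feast-operator"
)

// matchesSelector mimics a label selector restricted to the
// managed-by key/value pair, as the informer would apply it.
func matchesSelector(o obj) bool {
	return o.labels[managedByKey] == managedByValue
}

// filterCache keeps only objects the selector admits. With a real
// ByObject selector this filtering happens server-side, so unmatched
// objects are never transferred or stored.
func filterCache(objs []obj) []obj {
	var kept []obj
	for _, o := range objs {
		if matchesSelector(o) {
			kept = append(kept, o)
		}
	}
	return kept
}

func main() {
	cluster := []obj{
		{name: "feast-cm", labels: map[string]string{managedByKey: managedByValue}},
		{name: "unrelated-cm", labels: map[string]string{"app": "other"}},
		{name: "unlabeled-cm"},
	}
	cached := filterCache(cluster)
	fmt.Println(len(cached), cached[0].name) // 1 feast-cm
}
```

The key property is that cache size now scales with the number of feast-managed objects, not with the cluster's total object count.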

DefaultTransform: cache.TransformStripManagedFields()

Strip managedFields from all cached objects, reducing per-object memory footprint by ~30-50%.
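controller-runtime runs the `DefaultTransform` on every object before it is stored, and `TransformStripManagedFields` drops the server-side-apply bookkeeping the operator never reads. A stdlib-only model of that transform on a map-shaped metadata object:

```go
package main

import "fmt"

// stripManagedFields models what the cache transform does: it runs on
// each object before storage and deletes the (often large)
// managedFields metadata.
func stripManagedFields(objMeta map[string]any) map[string]any {
	delete(objMeta, "managedFields")
	return objMeta
}

func main() {
	meta := map[string]any{
		"name":          "feast-cm",
		"managedFields": []any{"...server-side-apply history..."},
	}
	stripManagedFields(meta)
	_, present := meta["managedFields"]
	fmt.Println(present) // false: the field is gone before caching
}
```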

GOMEMLIMIT=230MiB

Set the Go runtime's soft memory limit to 90% of the 256Mi container limit. This triggers GC pressure before a hard OOMKill, as defense in depth.

Additional changes

  • Add app.kubernetes.io/managed-by: feast-operator label to getLabels() so all FeatureStore-managed resources carry it
  • Introduce getSelectorLabels() for immutable selectors (Deployment spec.selector, Service spec.selector, TopologySpreadConstraints, PodAffinity) to avoid breaking existing resources on upgrade
  • Standardize notebook controller's managed-by label to app.kubernetes.io/managed-by
  • Use shared constants (services.ManagedByLabelKey/Value) throughout
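The split between mutable metadata labels and immutable selector labels can be sketched as follows; `getLabels()` and `getSelectorLabels()` are named in the PR description, but these bodies and the name-label key are illustrative guesses, not the actual implementation:

```go
package main

import "fmt"

const (
	managedByKey   = "app.kubernetes.io/managed-by"
	managedByValue = "feast-operator"
	nameKey        = "app.kubernetes.io/name"
)

// getSelectorLabels returns only the stable keys that are safe to use
// in immutable fields (Deployment/Service spec.selector,
// TopologySpreadConstraints, PodAffinity). This set must never grow,
// or existing resources would break on upgrade.
func getSelectorLabels(name string) map[string]string {
	return map[string]string{nameKey: name}
}

// getLabels returns the full metadata label set, which now also
// carries the managed-by pair used to scope the informer cache.
// Adding keys here is safe because metadata.labels is mutable.
func getLabels(name string) map[string]string {
	l := getSelectorLabels(name)
	l[managedByKey] = managedByValue
	return l
}

func main() {
	full := getLabels("my-featurestore")
	sel := getSelectorLabels("my-featurestore")
	// Every selector label must appear unchanged in the full set,
	// otherwise the Deployment's immutable selector stops matching
	// the pods the operator stamps out.
	for k, v := range sel {
		if full[k] != v {
			panic("selector drifted from metadata labels")
		}
	}
	fmt.Println(len(sel), len(full)) // 1 2
}
```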

Test Results

Verified on a cluster with a large number of ConfigMaps pre-loaded:

| Metric | Before | After |
| --- | --- | --- |
| Memory usage | 254Mi (at limit) | 25Mi |
| Stability | OOMKilled, CrashLoopBackOff | Stable, no restarts |

Test plan

  • Deploy fixed operator on RHOAI 3.3.0 cluster
  • Verify memory usage stays well below 256Mi limit under load
  • Verify no OOMKill or CrashLoopBackOff
  • Run existing unit tests (make test) — all pass
  • Verify getSelectorLabels() prevents immutable selector breakage on upgrade

Summary by CodeRabbit

  • Chores
    • Operator cache now only watches/caches resources marked with the managed-by label, reducing resource usage.
    • Added GOMEMLIMIT=230MiB to the controller-manager container.
    • Standardized managed-by label constants and separated immutable selector labels from mutable metadata to improve stability.
  • Tests
    • Updated tests to use the shared label constants.


@jyejare jyejare requested a review from a team as a code owner April 8, 2026 17:15

@jyejare jyejare force-pushed the ugo-harden-cache branch 3 times, most recently from eab7bf4 to aa69c5b Compare April 8, 2026 18:34

@jyejare jyejare force-pushed the ugo-harden-cache branch from aa69c5b to 6a2995e Compare April 9, 2026 08:57

@jyejare jyejare force-pushed the ugo-harden-cache branch 4 times, most recently from 344c7a0 to b3237d2 Compare April 10, 2026 11:09
jyejare added 2 commits April 10, 2026 16:40
Signed-off-by: Jitendra Yejare <11752425+jyejare@users.noreply.github.com>
Signed-off-by: Jitendra Yejare <11752425+jyejare@users.noreply.github.com>

devin-ai-integration (bot) left a comment


Devin Review found 1 new potential issue.

View 15 additional findings in Devin Review.


```go
return map[string]string{
	services.NameLabelKey:        authz.Handler.FeatureStore.Name,
	services.ServiceTypeLabelKey: string(services.AuthzFeastType),
	services.ManagedByLabelKey:   services.ManagedByLabelValue,
}
```

🟡 removeOrphanedRoles silently skips pre-upgrade custom auth Roles due to stricter label selector

The authz.getLabels() function now includes ManagedByLabelKey (authz.go:334), and removeOrphanedRoles uses this label set as a list selector (authz.go:85). Pre-upgrade custom auth Roles only have {NameLabelKey, ServiceTypeLabelKey} without ManagedByLabelKey, so the API server's label selector will never match them. These orphaned Roles will never be cleaned up by removeOrphanedRoles.

The main feast Role and RoleBinding are still cleaned up correctly via DeleteOwnedFeastObj (which looks up by name, not labels). Only custom auth roles from KubernetesAuthz.Roles are affected. The practical impact is limited: orphaned Roles have empty rules (no security impact) and have owner references for eventual GC on FeatureStore CR deletion. The window is narrow — it requires changing the Roles list concurrently with or very shortly after the operator upgrade, before the first reconciliation adds the label to existing Roles.

Prompt for agents
In authz.go, the removeOrphanedRoles function at line 81-101 lists Roles using authz.getLabels() as the label selector. Since getLabels() now includes ManagedByLabelKey, pre-upgrade Roles without this label are invisible to this cleanup function.

To fix: either (a) use a separate label set for removeOrphanedRoles that omits ManagedByLabelKey (matching by NameLabelKey and ServiceTypeLabelKey only), or (b) run a one-time migration during reconciliation that adds ManagedByLabelKey to all existing authz Roles before removeOrphanedRoles is called.
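Option (a) can be sketched as listing with a narrower label set that omits the managed-by pair; the label keys and helper names below are illustrative, modeled on the finding's description rather than the actual authz.go code:

```go
package main

import "fmt"

const (
	nameKey        = "feast.dev/name"         // hypothetical stand-in for NameLabelKey
	serviceTypeKey = "feast.dev/service-type" // hypothetical stand-in for ServiceTypeLabelKey
	managedByKey   = "app.kubernetes.io/managed-by"
)

// orphanSelector returns the label set removeOrphanedRoles should list
// with: it drops the managed-by pair so Roles created before the
// upgrade (which lack that label) remain visible to the cleanup.
func orphanSelector(full map[string]string) map[string]string {
	sel := make(map[string]string, len(full))
	for k, v := range full {
		if k != managedByKey {
			sel[k] = v
		}
	}
	return sel
}

// matches reports whether an object's labels satisfy a selector,
// the way the API server evaluates a labels.SelectorFromSet.
func matches(labels, sel map[string]string) bool {
	for k, v := range sel {
		if labels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	full := map[string]string{nameKey: "fs", serviceTypeKey: "authz", managedByKey: "feast-operator"}
	preUpgradeRole := map[string]string{nameKey: "fs", serviceTypeKey: "authz"}

	fmt.Println(matches(preUpgradeRole, full))                 // false: invisible with the strict selector
	fmt.Println(matches(preUpgradeRole, orphanSelector(full))) // true: visible with the narrower one
}
```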
