Skip to content

ROX-32798: Fix ipMap leak on IP recycling in clusterentities#19308

Open
vikin91 wants to merge 3 commits intomasterfrom
fix/ROX-32798-ip-recycling-pollution
Open

ROX-32798: Fix ipMap leak on IP recycling in clusterentities#19308
vikin91 wants to merge 3 commits intomasterfrom
fix/ROX-32798-ip-recycling-pollution

Conversation

@vikin91
Copy link
Contributor

@vikin91 vikin91 commented Mar 5, 2026

Description

Strongly related to ROX-32798, but it has not been confirmed yet as the sole root cause of the issue.

What was the bug?

Sensor maintains an in-memory map (ipMap) that tracks which deployment currently owns each pod IP address. When a deployment was removed, the cleanup function (deleteDeploymentFromCurrent) was supposed to remove that deployment's entry from the map. However, if the IP was temporarily shared by two or more deployments, the cleanup skipped the removal entirely and only logged a warning. The stale deployment ID stayed in the map forever.

When could it happen?

Kubernetes routinely recycles pod IP addresses: when a pod dies, its IP can be reassigned to a new pod in a different deployment. Sensor receives these lifecycle events asynchronously — and on clusters with high pod churn, it is common for the "new pod got IP X" event to arrive before the "old pod with IP X is gone" event. During that short window, two deployments appear to share the same IP. This is the exact condition that triggered the bug.

What were the consequences?

Once the bug wasriggered, it was self-reinforcing:

  1. Memory leak — (major) ipMap accumulated stale deployment IDs that were never cleaned up. On an affected customer cluster, a single IP (10.131.x.y) had accumulated references to 4 deployments, only one of which was still alive.
  2. Accelerating noise — (minor, but many complained) because the stale entries inflated the cardinality check, every subsequent IP recycling event for the same address hit the buggy code path again, adding more stale entries and producing more warnings.
  3. Metric cardinality explosion — (minor) the existing ips_having_multiple_containers_total Prometheus metric used deployment UUIDs as a label. Each new combination of stale + active UUIDs created a new time series, resulting in ~50 distinct series from only ~30 affected IPs on the customer cluster.
  4. Incorrect network attribution — (minor, but hard to detect) network flow lookups using ipMap could attribute traffic to deployments that no longer existed and thus incorrect data to the network graph.

What does this PR change?

  • Bug fix (2 lines): deleteDeploymentFromCurrent now removes the specific deployment ID from the IP's set, and only deletes the map entry when the set is empty.
  • Metric fix: the containers label (comma-joined deployment UUIDs) is dropped from ips_having_multiple_containers_total. The ip label is kept — its cardinality is bounded by the pod CIDR and is sufficient to identify which IPs are being recycled.
  • Regression tests: 5 new test cases exercise the fix both at the low-level (deleteDeploymentFromCurrent) and via the public API (Store.Apply), simulating realistic out-of-order event delivery.

The bug fix and regression tests were generated with AI assistance. The metric change, test review, and validation were done by the author.

User-facing documentation

Testing and quality

  • the change is production ready: the change is GA, or otherwise the functionality is gated by a feature flag
  • CI results are inspected

Automated testing

  • added unit tests
  • added e2e tests
  • added regression tests
  • added compatibility tests
  • modified existing tests

How I validated my change

Ran the full cluster entities test suite locally:

go test -v -count=1 ./sensor/common/clusterentities/...

All 67 tests pass, including 5 new regression tests that:

  • directly exercise deleteDeploymentFromCurrent with shared IPs (3 subtests)
  • simulate realistic IP recycling via Store.Apply with out-of-order events (2 subtests)

These tests fail on master (confirming the bug exists) and pass with the fix applied.

@vikin91 vikin91 requested a review from a team as a code owner March 5, 2026 12:48
@vikin91
Copy link
Contributor Author

vikin91 commented Mar 5, 2026

This change is part of the following stack:

Change managed by git-spice.

Copy link
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've left some high level feedback:

  • In the new tests you repeatedly call net.ParseIP("10.0.0.1") inline (including as a map key); consider parsing once into a local variable and reusing it to avoid subtle issues and make the tests a bit clearer.
  • Helper-style functions in the test file like totalDeploymentsInIPMap, totalIPsInReverseIPMap, and ptr could be marked as test helpers (e.g., via t.Helper if refactored to take *testing.T) or grouped/placed closer to their callers to make the test flow easier to follow.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In the new tests you repeatedly call net.ParseIP("10.0.0.1") inline (including as a map key); consider parsing once into a local variable and reusing it to avoid subtle issues and make the tests a bit clearer.
- Helper-style functions in the test file like totalDeploymentsInIPMap, totalIPsInReverseIPMap, and ptr could be marked as test helpers (e.g., via t.Helper if refactored to take *testing.T) or grouped/placed closer to their callers to make the test flow easier to follow.

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

@rhacs-bot
Copy link
Contributor

Images are ready for the commit at e8fb513.

To use with deploy scripts, first export MAIN_IMAGE_TAG=4.11.x-247-ge8fb513f77.

@codecov
Copy link

codecov bot commented Mar 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 49.63%. Comparing base (092168e) to head (e8fb513).
⚠️ Report is 4 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #19308   +/-   ##
=======================================
  Coverage   49.62%   49.63%           
=======================================
  Files        2680     2680           
  Lines      202231   202230    -1     
=======================================
+ Hits       100366   100370    +4     
+ Misses      94379    94377    -2     
+ Partials     7486     7483    -3     
Flag Coverage Δ
go-unit-tests 49.63% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants