
chore(scale): randomize node IPs and ports in fake workloads#19610

Open
vikin91 wants to merge 2 commits into master from piotr/fake-node-ips

Conversation

@vikin91 vikin91 commented Mar 25, 2026

Description

Fake workload nodes previously shared a single hardcoded IP pair and generated
random (possibly colliding) node ports. This caused unrealistic cardinality in
endpoint tracking during scale testing.

This change was required to reproduce a real customer issue: ROX-33474.

Changes:

  • Generate per-node IPs from the node index so every fake node gets a unique
    internal and external address.
  • Assign node ports from a monotonically incrementing counter (30000-32767,
    wrapping) to avoid collisions in typical workload sizes.
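
The two bullets above can be sketched as follows. This is a reconstruction from the description and the snippets quoted in the review comments (the actual diff lives in sensor/kubernetes/fake/service.go and nodes.go), not the patch itself:

```go
package main

import "fmt"

// nextNodePort tracks the next NodePort to hand out; Kubernetes reserves
// 30000-32767 for the NodePort range.
var nextNodePort int32 = 30000

// generateIP maps a node index into the 10.128.0.0/9 space so every fake
// node gets a distinct internal address (formula as quoted in review).
func generateIP(i int) string {
	return fmt.Sprintf("10.%d.%d.%d", 128+(i/(256*256))%128, (i/256)%256, i%256)
}

// getUniqueNodePort returns the next port from a monotonically
// incrementing counter, wrapping back to 30000 after 32767.
func getUniqueNodePort() int32 {
	p := nextNodePort
	nextNodePort++
	if nextNodePort > 32767 {
		nextNodePort = 30000
	}
	return p
}

func main() {
	fmt.Println(generateIP(0), generateIP(257)) // 10.128.0.0 10.128.1.1
	fmt.Println(getUniqueNodePort())            // 30000
}
```

The allocator is deliberately not synchronized; see the discussion below about the sequential call path during initialization.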

User-facing documentation

Testing and quality

  • the change is production ready: the change is GA, or otherwise the functionality is gated by a feature flag
  • CI results are inspected

Automated testing

  • added unit tests
  • added e2e tests
  • added regression tests
  • added compatibility tests
  • modified existing tests

How I validated my change

  • Ran it using a local sensor and observed the number of endpoints being processed in the metrics:
go run ./tools/local-sensor \
  -with-fakeworkload scale/workloads/np-load.yaml \
  -with-metrics \
  -with-pprof-server \
  -duration 20m \
  -skip-central-output \
  -no-mem-prof \
  -no-cpu-prof \
  2>&1 | tee /tmp/local-sensor-run.log

Previously all fake nodes shared a single hardcoded IP pair (10.138.28.6 /
35.185.217.58) and node ports were random (possibly colliding). This caused
unrealistic cardinality in endpoint tracking and masked issues that only
surface with many distinct IPs.

Generate per-node IPs from the node index so every fake node gets a unique
address, and assign node ports from a monotonically incrementing counter to
avoid collisions.

Partially generated by AI.

Made-with: Cursor

vikin91 commented Mar 25, 2026

This change is part of the following stack:

Change managed by git-spice.


openshift-ci bot commented Mar 25, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@sourcery-ai sourcery-ai bot left a comment


Hey - I've left some high level feedback:

  • The global nextNodePort counter is not concurrency-safe; if fake services are created from multiple goroutines, consider using atomic.AddInt32 or a sync.Mutex to avoid races and duplicate ports.
  • The getUniqueNodePort wrap-around at 32767 means ports will be reused once ~2.7k nodePorts are allocated; if scale tests can exceed this, consider increasing the range or tying the counter to the workload size to preserve uniqueness within a run.



codecov bot commented Mar 25, 2026

Codecov Report

❌ Patch coverage is 0% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 49.28%. Comparing base (123b457) to head (f1f8116).
⚠️ Report is 17 commits behind head on master.

Files with missing lines Patch % Lines
sensor/kubernetes/fake/service.go 0.00% 9 Missing ⚠️
sensor/kubernetes/fake/nodes.go 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #19610      +/-   ##
==========================================
+ Coverage   49.25%   49.28%   +0.02%     
==========================================
  Files        2735     2735              
  Lines      206138   206222      +84     
==========================================
+ Hits       101539   101634      +95     
+ Misses      97051    97047       -4     
+ Partials     7548     7541       -7     
Flag Coverage Δ
go-unit-tests 49.28% <0.00%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown.



vikin91 commented Mar 25, 2026

  • The global nextNodePort counter is not concurrency-safe; if fake services are created from multiple goroutines, consider using atomic.AddInt32 or a sync.Mutex to avoid races and duplicate ports.

Thanks for the suggestion. I checked the call graph — getUniqueNodePort is only
reachable from getServices, which runs sequentially during initialization before
any goroutines are started. The surrounding helpers (getRandPort, generateIP,
rand.Intn) are likewise unsynchronized for the same reason. Adding atomics here
would mislead readers into thinking concurrent access is expected.

  • The getUniqueNodePort wrap-around at 32767 means ports will be reused once ~2.7k nodePorts are allocated; if scale tests can exceed this, consider increasing the range or tying the counter to the workload size to preserve uniqueness within a run.

Good callout on the range. The Kubernetes NodePort range is fixed at 30000–32767
(2768 ports), so there's no room to widen it without generating invalid ports. The
previous code (getRandPort() % 63556) could produce out-of-range values and
collided randomly; the new counter is collision-free for the full valid range and
wraps only when real clusters would face the same exhaustion. For our current scale
workloads this is more than sufficient, but I'll add a brief comment noting the
wrap-around for future readers.
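
For illustration only: the PR deliberately keeps the counter unsynchronized, but if concurrent allocation were ever needed, a race-safe variant could look like the sketch below (`getUniqueNodePortSafe` and `portCounter` are hypothetical names, not code from this PR):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

const (
	nodePortMin = 30000
	nodePortMax = 32767 // Kubernetes NodePort range is fixed at 30000-32767
)

var portCounter atomic.Int64

// getUniqueNodePortSafe atomically claims a distinct counter value and
// folds it into the NodePort range, so concurrent callers never observe
// the same port until all 2768 ports have been handed out and the
// counter wraps.
func getUniqueNodePortSafe() int32 {
	n := portCounter.Add(1) - 1
	return int32(nodePortMin + n%(nodePortMax-nodePortMin+1))
}

func main() {
	fmt.Println(getUniqueNodePortSafe(), getUniqueNodePortSafe()) // 30000 30001
}
```

The modulo keeps wrapped values inside the valid range, unlike the old `getRandPort()`-based code, which could produce out-of-range ports.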

@vikin91 vikin91 marked this pull request as ready for review March 25, 2026 17:14
@vikin91 vikin91 requested a review from a team as a code owner March 25, 2026 17:14
@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 1 issue, and left some high level feedback:

  • The global nextNodePort counter is mutated without synchronization; if fake services are ever generated from multiple goroutines this will race and produce incorrect or duplicated ports, so consider guarding it with a mutex/atomic or making the allocator instance-scoped.
  • The IP generation logic using fmt.Sprintf("10.%d.%d.%d", 128+(i/(256*256))%128, (i/256)%256, i%256) is a bit opaque; extracting this into a small helper with a descriptive name or comment about the intended address space and max supported node count would improve readability and future modifications.

## Individual Comments

### Comment 1
<location path="sensor/kubernetes/fake/service.go" line_range="15-18" />
<code_context>
-	ipFamilies = [...]string{"IPv4", "IPv6"}
+	protocols          = [...]string{"TCP", "UDP", "SCTP"}
+	ipFamilies         = [...]string{"IPv4", "IPv6"}
+	nextNodePort int32 = 30000
 )

</code_context>
<issue_to_address>
**issue (bug_risk):** Global mutable nodePort counter is not concurrency-safe and may race under parallel usage.

If these helpers are called from multiple goroutines (common in concurrent tests), `nextNodePort` will race and may produce duplicate or invalid ports. Please synchronize access (e.g., with `atomic.AddInt32` or a mutex), or make the counter instance-scoped on a struct to keep concurrency behavior well-defined.
</issue_to_address>
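
A sketch of the extraction the reviewer suggests: the helper name `internalIPForNode` and the `maxFakeNodes` constant are illustrative, not from the PR:

```go
package main

import "fmt"

// maxFakeNodes is the number of distinct addresses available in the
// 10.128.0.0/9 pool used below (128 * 256 * 256).
const maxFakeNodes = 128 * 256 * 256

// internalIPForNode maps node index i into 10.128.0.0/9, giving each
// fake node a unique internal address for i < maxFakeNodes. Naming the
// octet computations makes the intended address space explicit.
func internalIPForNode(i int) string {
	octet2 := 128 + (i/(256*256))%128 // 10.[128..255].x.x
	octet3 := (i / 256) % 256
	octet4 := i % 256
	return fmt.Sprintf("10.%d.%d.%d", octet2, octet3, octet4)
}

func main() {
	fmt.Println(internalIPForNode(0), internalIPForNode(maxFakeNodes-1))
}
```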



@rhacs-bot

Images are ready for the commit at f1f8116.

To use with deploy scripts, first export MAIN_IMAGE_TAG=4.11.x-431-gf1f8116859.
