
ROX-30860: Limit parallel port use by processes #16628

Merged
vikin91 merged 5 commits into master from piotr/fake-workload-limit-port-reuse on Sep 25, 2025

Conversation

@vikin91
Contributor

@vikin91 vikin91 commented Sep 2, 2025

Description

Introduced configurable port reuse probability to limit unrealistic endpoint sharing behavior in fake workload generation.

Key Changes:

  • Added OriginatorCache: Implements probabilistic endpoint-to-originator caching with configurable reuse probability
  • Added openPortReuseProbability field: Configurable parameter (default 0.05) controlling how often different processes reuse the same IP:port without closing endpoints
  • Updated all workload configurations: Set realistic 5% port reuse probability across all test scenarios
  • Enhanced network endpoint generation: Uses consistent originator caching instead of random assignment

Why: Previous behavior (100% port reuse in releases ≤4.8) was unrealistic and caused memory pressure in Sensor's enrichment pipeline. Multiple processes listening on the same IP:port without proper endpoint closure created excessive deduplication overhead, as Sensor would retain stale entries indefinitely when no close messages were sent between different originators on the same endpoint.

The new approach simulates realistic container behavior where processes typically bind consistently to endpoints (95% default) while allowing occasional port reuse scenarios (5% for restarts/takeovers).
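The caching mechanism described above can be sketched as follows. This is a minimal, self-contained Go approximation: the `Originator` struct, the constructor, and the `fresh` callback are simplifications standing in for the actual `OriginatorCache` and `storage.NetworkProcessUniqueKey` types in `sensor/kubernetes/fake/flows.go`, not the PR's exact implementation.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
)

// Originator is a stand-in for storage.NetworkProcessUniqueKey (simplified).
type Originator struct {
	ContainerID string
	ProcessName string
}

// OriginatorCache maps an endpoint key (IP:port) to the process that owns it.
// A cached entry is kept with probability (1 - reuseProbability); with
// probability reuseProbability a fresh originator replaces it, simulating a
// different process reusing the port without a 'close' event in between.
type OriginatorCache struct {
	lock  sync.RWMutex
	cache map[string]*Originator
}

func NewOriginatorCache() *OriginatorCache {
	return &OriginatorCache{cache: make(map[string]*Originator)}
}

func (oc *OriginatorCache) GetOrSetOriginator(endpointKey string, reuseProbability float32, fresh func() *Originator) *Originator {
	oc.lock.RLock()
	cached, ok := oc.cache[endpointKey]
	oc.lock.RUnlock()

	// rand.Float32 returns a value in [0, 1), so with reuseProbability 0.0
	// a cached owner is always kept, and with 1.0 it is always replaced.
	if ok && rand.Float32() >= reuseProbability {
		return cached
	}

	// Assign a new originator (a port-reuse event) and cache it.
	o := fresh()
	oc.lock.Lock()
	oc.cache[endpointKey] = o
	oc.lock.Unlock()
	return o
}

func main() {
	oc := NewOriginatorCache()
	first := oc.GetOrSetOriginator("1.1.1.1:80", 0.0, func() *Originator {
		return &Originator{ContainerID: "c1", ProcessName: "nginx"}
	})
	// With reuse probability 0.0 the cached originator is always returned.
	second := oc.GetOrSetOriginator("1.1.1.1:80", 0.0, func() *Originator {
		return &Originator{ContainerID: "c1", ProcessName: "apache2"}
	})
	fmt.Println(first.ProcessName, second.ProcessName) // prints "nginx nginx"
}
```

With the default `openPortReuseProbability` of 0.05, roughly 95% of lookups for a known endpoint return the cached originator, which is what keeps Sensor's deduplication from accumulating stale entries.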

Impact:

  • Reduces memory pressure in test environments while maintaining test coverage for edge cases.
  • Long-running cluster measurements for 4.9 cannot be directly compared against versions ≤4.8, as the workload profile has changed.

User-facing documentation

Testing and quality

  • the change is production-ready: it is GA, or the functionality is gated by a feature flag
  • CI results are inspected

Automated testing

  • added unit tests
  • added e2e tests
  • added regression tests
  • added compatibility tests
  • modified existing tests

How I validated my change

  • CI and added unit tests
  • Manual run on the long-running cluster

The workload runs fine on the long running cluster.
Sensor is crashing every 12 hours with 8 GB of memory (due to OOMKill), which is a known pattern. That is not caused by this PR, but by: (1) a memory-management bug that is currently on master, and (2) giving Sensor only 8 GB of memory for the experiment.

The following chart confirms the rates of objects being generated by the fake workloads.
[Screenshot from 2025-09-23 at 10:39]

This chart shows that the processesListening objects are being processed and sent to Central.
[Screenshot from 2025-09-23 at 10:46]

@openshift-ci

openshift-ci bot commented Sep 2, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@rhacs-bot
Contributor

rhacs-bot commented Sep 2, 2025

Images are ready for the commit at ddb5b45.

To use with deploy scripts, first export MAIN_IMAGE_TAG=4.9.x-712-gddb5b450d0.
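For example (a minimal sketch; the deploy-script invocation itself is environment-specific and omitted):

```shell
# Pin the PR image tag before running the deploy scripts
export MAIN_IMAGE_TAG=4.9.x-712-gddb5b450d0
echo "$MAIN_IMAGE_TAG"   # prints 4.9.x-712-gddb5b450d0
```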

@codecov

codecov bot commented Sep 2, 2025

Codecov Report

❌ Patch coverage is 29.82456% with 40 lines in your changes missing coverage. Please review.
✅ Project coverage is 49.03%. Comparing base (d3afc34) to head (ddb5b45).
⚠️ Report is 177 commits behind head on master.

Files with missing lines Patch % Lines
sensor/kubernetes/fake/flows.go 39.53% 25 Missing and 1 partial ⚠️
sensor/kubernetes/fake/fake.go 0.00% 14 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #16628      +/-   ##
==========================================
+ Coverage   48.67%   49.03%   +0.36%     
==========================================
  Files        2675     2691      +16     
  Lines      199760   201669    +1909     
==========================================
+ Hits        97235    98897    +1662     
- Misses      94918    95103     +185     
- Partials     7607     7669      +62     
Flag Coverage Δ
go-unit-tests 49.03% <29.82%> (+0.36%) ⬆️


@vikin91 vikin91 changed the base branch from master to piotr/fake-workload-remove-globals September 2, 2025 13:08
@vikin91
Contributor Author

vikin91 commented Sep 2, 2025

This change is part of the following stack:

Change managed by git-spice.

@vikin91 vikin91 force-pushed the piotr/fake-workload-limit-port-reuse branch from 01545b1 to f1977c0 September 2, 2025 13:08
@vikin91 vikin91 changed the title from "feat(fake workload): Limit simultaneous port use by multiple processes" to "ROX-30390: Limit simultaneous port use by multiple processes" Sep 2, 2025
@vikin91 vikin91 force-pushed the piotr/fake-workload-limit-port-reuse branch 3 times, most recently from 67995be to 4a77577 September 2, 2025 15:55
@vikin91 vikin91 force-pushed the piotr/fake-workload-remove-globals branch from 502b8cf to 6c08ff8 September 2, 2025 16:35
@vikin91 vikin91 force-pushed the piotr/fake-workload-limit-port-reuse branch from 4a77577 to da9feb4 September 2, 2025 16:35
@vikin91 vikin91 force-pushed the piotr/fake-workload-remove-globals branch from 6c08ff8 to b75539a September 4, 2025 11:07
@vikin91 vikin91 force-pushed the piotr/fake-workload-limit-port-reuse branch from da9feb4 to 6a3376f September 4, 2025 11:07
@vikin91 vikin91 force-pushed the piotr/fake-workload-remove-globals branch from b75539a to 7e33a34 September 5, 2025 10:22
@vikin91 vikin91 force-pushed the piotr/fake-workload-limit-port-reuse branch from 6a3376f to d0c8150 September 5, 2025 10:22
Base automatically changed from piotr/fake-workload-remove-globals to master September 9, 2025 08:05
@vikin91 vikin91 force-pushed the piotr/fake-workload-limit-port-reuse branch from d0c8150 to 6a81807 September 9, 2025 08:06
@vikin91 vikin91 added the auto-retest label (PRs with this label will be automatically retested if prow checks fail) Sep 9, 2025
@vikin91 vikin91 force-pushed the piotr/fake-workload-limit-port-reuse branch from be53969 to 72a20e4 September 9, 2025 09:18
@vikin91 vikin91 marked this pull request as ready for review September 9, 2025 12:25
@vikin91 vikin91 requested a review from a team as a code owner September 9, 2025 12:25
Contributor

@sourcery-ai sourcery-ai bot left a comment

Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments

### Comment 1
<location> `sensor/kubernetes/fake/flows.go:59` </location>
<code_context>
+// <container1, 1.1.1.1:80, apache2>, then Sensor will keep the nginx-entry forever, as there was no 'close' message in between.
+//
+// The probability logic is explicit and configurable for different testing scenarios.
+func (oc *OriginatorCache) GetOrSetOriginator(endpointKey string, containerID string, openPortReuseProbability float32, processPool *ProcessPool) *storage.NetworkProcessUniqueKey {
+	// Use panic-safe read lock to check cache
+	originator, exists := concurrency.WithRLock2(&oc.lock, func() (*storage.NetworkProcessUniqueKey, bool) {
</code_context>

<issue_to_address>
Consider validating openPortReuseProbability input range.

Without an explicit range check, passing values outside [0.0, 1.0] could cause incorrect probability calculations. Please add validation or clamp the input to avoid such issues.
</issue_to_address>
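Such clamping could be sketched as follows (a hypothetical helper named `clampProbability`; the PR may validate or reject out-of-range inputs differently):

```go
package main

import "fmt"

// clampProbability bounds p to [0.0, 1.0] so that an out-of-range input
// cannot skew the reuse decision made against rand.Float32().
func clampProbability(p float32) float32 {
	if p < 0.0 {
		return 0.0
	}
	if p > 1.0 {
		return 1.0
	}
	return p
}

func main() {
	fmt.Println(clampProbability(-0.3), clampProbability(0.05), clampProbability(2.0)) // prints "0 0.05 1"
}
```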


@vikin91 vikin91 changed the title ROX-30390: Limit simultaneous port use by multiple processes ROX-30860 : Limit simultaneous port use by multiple processes Sep 11, 2025
@vikin91 vikin91 changed the title ROX-30860 : Limit simultaneous port use by multiple processes ROX-30860 : Limit simultaneous port use by processes Sep 12, 2025
@vikin91 vikin91 changed the title ROX-30860 : Limit simultaneous port use by processes ROX-30860 : Limit parallel port use by processes Sep 16, 2025
@vikin91 vikin91 changed the title ROX-30860 : Limit parallel port use by processes ROX-30860: Limit parallel port use by processes Sep 16, 2025
@vikin91
Contributor Author

vikin91 commented Sep 18, 2025

I still haven't done the long run on a cluster, but I will trigger it tomorrow.
The expectation is that the load on Sensor and Central will be lower.

Edit: running on pr-09-19-long

@rhacs-bot
Contributor

/retest


@vikin91
Contributor Author

vikin91 commented Sep 23, 2025

/retest-required

@vikin91
Contributor Author

vikin91 commented Sep 23, 2025

@JoukoVirtanen PTAL again, I have observations from the long-running cluster that workload generation works fine.

@vikin91
Contributor Author

vikin91 commented Sep 23, 2025

/retest-required

@vikin91
Contributor Author

vikin91 commented Sep 23, 2025

/retest

Contributor

@JoukoVirtanen JoukoVirtanen left a comment

LGTM!

@vikin91 vikin91 merged commit c0b5403 into master Sep 25, 2025
105 of 107 checks passed
@vikin91 vikin91 deleted the piotr/fake-workload-limit-port-reuse branch September 25, 2025 08:06

Labels

ai-review, area/sensor, auto-retest (PRs with this label will be automatically retested if prow checks fail)

3 participants