ROX-28479: Add external IPs to fake workload generation#14574

Merged
JoukoVirtanen merged 13 commits into master from jv-ROX-28479-add-external-ips-to-fake-workload-generation
Mar 25, 2025

Conversation

@JoukoVirtanen (Contributor) commented Mar 11, 2025

Description

Adds external IPs to fake workload generation. We need to be able to test the external IPs feature using the existing scale tests, in which Sensor creates fake data that it then consumes itself. This will enable us to find bugs that might cause crashes, scalability issues, and leaks. The feature will be exercised as part of the long-running cluster. It will also be run with K6 load testing, although at the moment the API endpoints relevant to external IPs are not called by K6 load testing; that change can be made later.

User-facing documentation

  • CHANGELOG is updated OR update is not needed
  • documentation PR is created and is linked above OR is not needed

Testing and quality

  • the change is production ready: the change is GA or otherwise the functionality is gated by a feature flag
  • CI results are inspected

Automated testing

  - [x] added unit tests
  - [ ] added e2e tests
  - [ ] added regression tests
  - [ ] added compatibility tests
  - [ ] modified existing tests

No code that runs in production is modified, so no testing is needed. The fake workload generation currently does not have any testing.

How I validated my change

Scale testing was run using the following commands from the root of the stackrox/stackrox repository.

cd scale/dev
./cluster.sh jv-0311-exip-scale
infractl artifacts jv-0311-exip-scale --download-dir /tmp/artifacts-jv-0311-exip-scale
export KUBECONFIG=/tmp/artifacts-jv-0311-exip-scale/kubeconfig
export ROX_EXTERNAL_IPS=true
./run-many.sh long-running 10

The following script was run to get metadata on the external entities.

#!/usr/bin/env bash
set -euo pipefail

ROX_ENDPOINT=${1:-localhost:8000}


clusters_json="$(curl --location --silent --request GET "https://${ROX_ENDPOINT}/v1/clusters" -k --header "Authorization: Bearer $ROX_API_TOKEN")"

cluster_id="$(echo "$clusters_json" | jq -r '.clusters[0].id')"

metadata="$(curl --location --silent --request GET "https://${ROX_ENDPOINT}/v1/networkgraph/cluster/$cluster_id/externalentities/metadata" -k --header "Authorization: Bearer $ROX_API_TOKEN")"

echo "$metadata" | jq

Part of the output is shown below.

{
  "entities": [ 
    {
      "entity": {
        "type": "EXTERNAL_SOURCE",
        "id": "__MC41NC4yMTMuMTk3LzMy",
        "externalSource": {
          "name": "0.54.213.197",
          "cidr": "0.54.213.197/32",
          "default": false,
          "discovered": true
        }
      },
      "flowsCount": 1
    },
    {
      "entity": {
        "type": "EXTERNAL_SOURCE",
        "id": "__MC41NS45MS4yNS8zMg",
        "externalSource": {
          "name": "0.55.91.25",
          "cidr": "0.55.91.25/32",
          "default": false,
          "discovered": true
        }
      },
      "flowsCount": 1
    },

....

   { 
      "entity": {
        "type": "EXTERNAL_SOURCE",
        "id": "__OTkuMjQuNjguNjIvMzI",
        "externalSource": {
          "name": "99.24.68.62",
          "cidr": "99.24.68.62/32",
          "default": false,
          "discovered": true
        } 
      },  
      "flowsCount": 1
    },
    { 
      "entity": {
        "type": "EXTERNAL_SOURCE",
        "id": "__OTkuNi44Ny44My8zMg",
        "externalSource": {
          "name": "99.6.87.83",
          "cidr": "99.6.87.83/32",
          "default": false,
          "discovered": true
        } 
      },  
      "flowsCount": 1
    } 
  ],  
  "totalEntities": 2585
}
The script was run a few times and the value of totalEntities was seen to fluctuate.
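The part of each entity id after the "__" separator appears to be the URL-safe base64 encoding of the CIDR, which makes it easy to cross-check an id against its reported externalSource. A sketch (the id below is copied from the output above):

```shell
# Decode the suffix of a discovered-entity id back into its CIDR.
id="__MC41NC4yMTMuMTk3LzMy"
# Strip the prefix, map URL-safe base64 characters to standard ones, decode.
echo "${id##*__}" | tr '_-' '/+' | base64 -d
# → 0.54.213.197/32
```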

Long running cluster

A long-running cluster was also created; it does essentially the same thing as above, except that it is triggered through a GitHub Action and runs for much longer.

The following was done to run the GitHub Action.

git commit -m "Empty commit" --allow-empty
git tag -a 0.0.1 -m "Test tag for long running cluster"
git push origin 0.0.1
git push origin HEAD

The commit used for this tag was 8436aaa. The tag was later deleted.

The images were then built by CI.

Went to https://github.com/stackrox/test-gh-actions/actions/workflows/create-clusters.yml

Clicked on “Run workflow”. Set the image tag in “Version of the images” to 0.0.1. Selected “Create a long-running cluster on RC1” and clicked on the green button that says “Run workflow”.

infractl artifacts long-fake-load-0-0-1 --download-dir /tmp/artifacts-long-fake-load-0-0-1
export KUBECONFIG=/tmp/artifacts-long-fake-load-0-0-1/kubeconfig

Set the ROX_EXTERNAL_IPS environment variable by editing the central deployment manually.
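One way to do this without editing the manifest by hand is `kubectl set env`; a sketch, assuming the default stackrox namespace used by the deploy scripts:

```shell
# Set the feature flag on Central; the deployment controller rolls the pod.
kubectl -n stackrox set env deploy/central ROX_EXTERNAL_IPS=true
# Confirm the variable is present on the deployment.
kubectl -n stackrox set env deploy/central --list | grep ROX_EXTERNAL_IPS
```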

Obtained the password and API token.

Ran the script from the first testing section and saw the following

{
  "entities": [
    {
      "entity": {
        "type": "EXTERNAL_SOURCE",
        "id": "1de6b523-4a36-4082-8fa9-51dc670990c7__MTA0LjE0Mi4xNjkuMTQ4LzMy",
        "externalSource": {
          "name": "104.142.169.148",
          "cidr": "104.142.169.148/32",
          "default": false,
          "discovered": true
        }
      },
      "flowsCount": 1
    },
    {
      "entity": {
        "type": "EXTERNAL_SOURCE",
        "id": "1de6b523-4a36-4082-8fa9-51dc670990c7__MTA0LjE0NS42LjQwLzMy",
        "externalSource": {
          "name": "104.145.6.40",
          "cidr": "104.145.6.40/32",
          "default": false,
          "discovered": true
        }
      },
      "flowsCount": 1
    },


...


    {
      "entity": {
        "type": "EXTERNAL_SOURCE",
        "id": "1de6b523-4a36-4082-8fa9-51dc670990c7__OTkuMjkuMTg2LjE1Mi8zMg",
        "externalSource": {
          "name": "99.29.186.152",
          "cidr": "99.29.186.152/32",
          "default": false,
          "discovered": true
        }
      },
      "flowsCount": 4
    },
    {
      "entity": {
        "type": "EXTERNAL_SOURCE",
        "id": "1de6b523-4a36-4082-8fa9-51dc670990c7__OTkuNDAuNzkuMjA2LzMy",
        "externalSource": {
          "name": "99.40.79.206",
          "cidr": "99.40.79.206/32",
          "default": false,
          "discovered": true
        }
      },
      "flowsCount": 1
    },
    {
      "entity": {
        "type": "EXTERNAL_SOURCE",
        "id": "1de6b523-4a36-4082-8fa9-51dc670990c7__OTkuNjcuMTc5LjU4LzMy",
        "externalSource": {
          "name": "99.67.179.58",
          "cidr": "99.67.179.58/32",
          "default": false,
          "discovered": true
        }
      },
      "flowsCount": 1
    }
  ],
  "totalEntities": 2697
}
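For quick checks across runs, the saved metadata can also be summarized with jq; a sketch using inlined sample data in the same shape as the responses above (the sample values are made up):

```shell
# Summarize discovered external entities: how many there are and the
# total flow count across them. Sample data is inlined so this runs offline.
cat > /tmp/metadata-sample.json <<'EOF'
{
  "entities": [
    {"entity": {"type": "EXTERNAL_SOURCE", "externalSource": {"cidr": "0.54.213.197/32", "discovered": true}}, "flowsCount": 1},
    {"entity": {"type": "EXTERNAL_SOURCE", "externalSource": {"cidr": "99.24.68.62/32", "discovered": true}}, "flowsCount": 4}
  ],
  "totalEntities": 2
}
EOF
jq '[.entities[] | select(.entity.externalSource.discovered)]
    | {discovered: length, totalFlows: (map(.flowsCount) | add)}' /tmp/metadata-sample.json
```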

Here are some screenshots from the long-running cluster.

Screenshot from 2025-03-17 08-27-52

Screenshot from 2025-03-17 08-28-04

Screenshot from 2025-03-17 08-28-19

Screenshot from 2025-03-17 08-28-28

Screenshot from 2025-03-17 08-28-33

@openshift-ci

openshift-ci bot commented Mar 11, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@rhacs-bot
Contributor

rhacs-bot commented Mar 11, 2025

Images are ready for the commit at 68301c6.

To use with deploy scripts, first export MAIN_IMAGE_TAG=4.8.x-221-g68301c6a8b.

@codecov

codecov bot commented Mar 11, 2025

Codecov Report

Attention: Patch coverage is 58.66667% with 31 lines in your changes missing coverage. Please review.

Project coverage is 48.97%. Comparing base (3cfa85c) to head (68301c6).
Report is 111 commits behind head on master.

Files with missing lines Patch % Lines
sensor/kubernetes/fake/flows.go 58.66% 30 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #14574      +/-   ##
==========================================
- Coverage   49.26%   48.97%   -0.30%     
==========================================
  Files        2528     2543      +15     
  Lines      184979   186443    +1464     
==========================================
+ Hits        91130    91305     +175     
- Misses      86621    87909    +1288     
- Partials     7228     7229       +1     
Flag Coverage Δ
go-unit-tests 48.97% <58.66%> (-0.30%) ⬇️


@JoukoVirtanen JoukoVirtanen force-pushed the jv-ROX-28479-add-external-ips-to-fake-workload-generation branch from 3137adc to e406412 Compare March 12, 2025 02:36
@JoukoVirtanen JoukoVirtanen marked this pull request as ready for review March 12, 2025 02:36
@JoukoVirtanen JoukoVirtanen requested a review from a team as a code owner March 12, 2025 02:36
Contributor

@Stringy Stringy left a comment


Looks good from my perspective, just a couple of minor comments/suggested simplifications. Will defer approval to the Sensor folks, who have more context on the implications of this than I do :)

@JoukoVirtanen JoukoVirtanen requested a review from Stringy March 12, 2025 21:46
Contributor

@parametalol parametalol left a comment


Please refactor a little.

Nits:
Not your change, but IPs could be integers until needed to be formatted.
Also, pools might not be needed at all, if sequential IPs are used instead of random.

Contributor

@parametalol parametalol left a comment


I see room for refactoring, but that's fine.

Contributor

@Stringy Stringy left a comment


LGTM!

Contributor

@vikin91 vikin91 left a comment


Thank you for addressing my review comments, this looks good!

Using that opportunity, I would like to highlight that this change may amplify the number of active connections being tracked in Sensor and Central (see https://issues.redhat.com/browse/ROX-28242, https://issues.redhat.com/browse/ROX-28259, and https://issues.redhat.com/browse/ROX-28315). The "bad" behavior is already there and it concerns mainly the long-running cluster with fake workloads, so this is not a blocker for this PR. However, I would keep the next release managers informed that such things could be observed (Sensor OOMing, maybe also Central). We have not seen that yet in production code (non-fake workload), but this could also theoretically happen over longer time spans.

Luis and I are trying to prevent this from happening in #14538 and #14483, but the proper fix for that would be the Collector closing the connections. We do not have a collector in fake-workloads, so adding that to the workload generation would be beneficial. If you have an appetite for implementing it (separately to this PR), I would be happy to review.

@JoukoVirtanen
Contributor Author

We do not have a collector in fake-workloads, so adding that to the workload generation would be beneficial. If you have an appetite for implementing it (separately to this PR), I would be happy to review.

The long running cluster is actually two clusters, one of which has collector and workload generated by kube-burner creating berserker pods.

@JoukoVirtanen JoukoVirtanen merged commit 4a12a6b into master Mar 25, 2025
92 checks passed
@JoukoVirtanen JoukoVirtanen deleted the jv-ROX-28479-add-external-ips-to-fake-workload-generation branch March 25, 2025 14:41
@vikin91
Contributor

vikin91 commented Mar 27, 2025

The long running cluster is actually two clusters, one of which has collector and workload generated by kube-burner creating berserker pods.

Yes, I am aware of this. I mentioned:

The "bad" behavior is already there and it concerns mainly the long-running cluster with fake workloads

and the hint about closing connections for this cluster is still valid.
