
ROX-31012: Optimize memory for UpdateComputer endpoint dedupers#17040

Merged
vikin91 merged 11 commits into master from piotr/ROX-31012-less-memory-deduper
Oct 15, 2025

Conversation


@vikin91 vikin91 commented Sep 26, 2025

(OPTIONAL FOR 4.9)

Description

This PR changes the key and value type of the endpoints deduper from string to a fixed-size [8]byte holding a uint64 hash. That significantly reduces the memory footprint of the deduper (in theory by 50%) without losing accuracy. The 8-byte type is used because of 64-bit hashing, which has a collision probability of roughly 0.027% for 100M entries*.
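For illustration, deriving the fixed-size key from a 64-bit hash sum could look like this (a minimal sketch; the stdlib FNV hash stands in for xxhash here, and binaryKey is a hypothetical helper, not the exact function in the PR):

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// binaryKey folds a 64-bit hash of the given fields into a fixed-size
// [8]byte value. Storing 8 raw bytes instead of a 16-character hex string
// halves the key size and avoids the string header overhead.
func binaryKey(parts ...string) [8]byte {
	h := fnv.New64a() // stand-in for xxhash in this sketch
	for i, s := range parts {
		if i > 0 {
			h.Write([]byte{0}) // delimiter between fields
		}
		h.Write([]byte(s))
	}
	var key [8]byte
	binary.BigEndian.PutUint64(key[:], h.Sum64())
	return key
}

func main() {
	k := binaryKey("deployment-1", "deployment-2")
	fmt.Printf("%x (len=%d)\n", k, len(k))
}
```

Because the key is a comparable fixed-size array, it can be used directly as a Go map key without any string conversion.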

Drawbacks:

  • It is no longer possible to switch the hashingAlgo between string and hash: the constant-length key type requires hashing.

*) A long-running cluster with 100 open endpoints per second generates roughly 16M entries in 3 days.
Increasing the hash length to 128 bits lowers the collision chance to below ~10⁻²⁰ while using the same amount of memory as the current map[string]string implementation.

User-facing documentation

Testing and quality

  • the change is production ready: the change is GA, or otherwise the functionality is gated by a feature flag
  • CI results are inspected

Automated testing

  • added unit tests
  • added e2e tests
  • added regression tests
  • added compatibility tests
  • modified existing tests

How I validated my change

Insights into deduper memory after 2 days on a long-running cluster with fake workloads


Size of single deduper entry

  • Before (code from piotr/ROX-30941-no-deduping-processes): cluster ran for over 6 days; value = 3.60 GB / 31.5 M deduper entries
  • After (code from this branch): cluster ran for a few hours; value = 11.3 MB / 393 K deduper entries
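The before/after figures above work out to roughly a 4x reduction in per-entry size, which a quick arithmetic check confirms:

```go
package main

import "fmt"

func main() {
	// Figures taken from the measurements quoted above.
	before := 3.60e9 / 31.5e6 // bytes per entry before
	after := 11.3e6 / 393e3   // bytes per entry after
	// prints: before: 114 B/entry, after: 29 B/entry, ratio: 4.0x
	fmt.Printf("before: %.0f B/entry, after: %.0f B/entry, ratio: %.1fx\n",
		before, after, before/after)
}
```

The measured reduction is larger than the theoretical 50% because the old map[string]string entries also carried string headers and heap-allocation overhead on top of the character data.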



openshift-ci bot commented Sep 26, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@vikin91 vikin91 changed the title from "ROX-31012:" to "ROX-31012: Optimize memory for UpdateComputer dedupers" Sep 26, 2025
@sourcery-ai sourcery-ai bot left a comment

Hey there - I've reviewed your changes and they look great!

@vikin91 vikin91 changed the title from "ROX-31012: Optimize memory for UpdateComputer dedupers" to "ROX-31012: Optimize memory for UpdateComputer endpoint dedupers" Sep 26, 2025

rhacs-bot commented Sep 26, 2025

Images are ready for the commit at 175c010.

To use with deploy scripts, first export MAIN_IMAGE_TAG=4.9.x-1002-g175c010b7a.

codecov bot commented Sep 26, 2025

Codecov Report

❌ Patch coverage is 93.33333% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 48.85%. Comparing base (bf99a92) to head (0c2abb2).
⚠️ Report is 21 commits behind head on master.

Files with missing lines Patch % Lines
...mon/networkflow/updatecomputer/transition_based.go 86.66% 2 Missing ⚠️
...ensor/common/networkflow/updatecomputer/testing.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #17040      +/-   ##
==========================================
- Coverage   48.86%   48.85%   -0.02%     
==========================================
  Files        2720     2720              
  Lines      203364   203304      -60     
==========================================
- Hits        99374    99319      -55     
+ Misses      96166    96163       -3     
+ Partials     7824     7822       -2     
Flag Coverage Δ
go-unit-tests 48.85% <93.33%> (-0.02%) ⬇️


@vikin91 vikin91 force-pushed the piotr/ROX-30941-no-deduping-processes branch from 0264519 to 26fee46 on October 1, 2025 08:01
@vikin91 vikin91 force-pushed the piotr/ROX-31012-less-memory-deduper branch from f636ddd to 425a08d on October 1, 2025 08:02
Base automatically changed from piotr/ROX-30941-no-deduping-processes to master on October 1, 2025 10:33
@vikin91 vikin91 force-pushed the piotr/ROX-31012-less-memory-deduper branch from 5e67e12 to 677d6ce on October 2, 2025 08:47
@vikin91 vikin91 marked this pull request as ready for review October 2, 2025 08:55
@vikin91 vikin91 requested a review from a team as a code owner October 2, 2025 08:55
@sourcery-ai sourcery-ai bot left a comment

Hey there - I've reviewed your changes and they look great!

@vikin91 vikin91 added the auto-retest PRs with this label will be automatically retested if prow checks fails label Oct 2, 2025

vikin91 commented Oct 2, 2025

/retest

@red-hat-konflux

This comment was marked as off-topic.

@janisz janisz requested a review from a team October 3, 2025 12:34
@vikin91 vikin91 force-pushed the piotr/ROX-31012-less-memory-deduper branch from 677d6ce to 175c010 on October 7, 2025 11:22
@rhacs-bot

/retest

@vikin91 vikin91 force-pushed the piotr/ROX-31012-less-memory-deduper branch from bf181fd to 41d6369 on October 14, 2025 13:09
@janisz janisz requested review from a team and johannes94 October 15, 2025 11:21

@janisz janisz left a comment


LGTM. There are potential perf optimisations, but those could be handled in a follow-up PR.
I've added @johannes94, the hashing expert, to find other optimisations.


johannes94 commented Oct 15, 2025

@janisz @vikin91

TBH I don't feel like a hashing expert, and it is hard for me to come up with good suggestions without data showing where the allocation and heap-usage hot spots are.

If I were to look for more optimizations, I would:

  • Write a benchmark that exercises the hash function, e.g. NetworkConn.Key, or even the code path that stores the keys
  • Run it with go test -bench=. -memprofile mem.prof
  • Analyze with go tool pprof mem.prof and go tool pprof --alloc_space mem.prof, and optimize the hot spots visible there
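The workflow above can be sketched as follows (illustrative only: the stdlib FNV hash stands in for xxhash, and keyOf is a hypothetical placeholder for something like NetworkConn.Key; testing.Benchmark just lets the sketch run outside go test):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"testing"
)

// keyOf is a hypothetical stand-in for the real key function under test.
func keyOf(src, dst string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(src))
	h.Write([]byte{0}) // delimiter between fields
	h.Write([]byte(dst))
	return h.Sum64()
}

func main() {
	// In the repo this body would live in a _test.go benchmark and be run
	// with: go test -bench=. -memprofile mem.prof
	res := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			_ = keyOf("deployment-1", "deployment-2")
		}
	})
	fmt.Printf("%d allocs/op, %d B/op\n",
		res.AllocsPerOp(), res.AllocedBytesPerOp())
}
```

The allocs/op and B/op numbers point at which conversions (e.g. []byte(s) copies) dominate, and the mem.prof from the real benchmark can then be drilled into with pprof.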

@vikin91 vikin91 requested a review from a team as a code owner October 15, 2025 13:04
@vikin91
Copy link
Contributor Author

vikin91 commented Oct 15, 2025

Naturally! I compared 41d6369 (Old) against 0c2abb2 (New).

Here are the results:

➜ go test -bench=. -benchmem -benchtime=10s ./sensor/common/networkflow/manager/indicator/
goos: darwin
goarch: arm64
pkg: github.com/stackrox/rox/sensor/common/networkflow/manager/indicator
cpu: Apple M3 Pro
BenchmarkNetworkConn_KeyGeneration_Old-12       100000000              119.5 ns/op           184 B/op          7 allocs/op
BenchmarkNetworkConn_KeyGeneration_New-12       172759272               67.48 ns/op          104 B/op          3 allocs/op
BenchmarkNetworkConn_Key_Old-12                 100000000              111.5 ns/op           184 B/op          7 allocs/op
BenchmarkNetworkConn_Key_New-12                 173108421               69.77 ns/op          104 B/op          3 allocs/op
BenchmarkContainerEndpoint_Old-12               100000000              101.0 ns/op           152 B/op          6 allocs/op
BenchmarkContainerEndpoint_New-12               335066269               35.69 ns/op           88 B/op          2 allocs/op
BenchmarkProcessListening_Old-12                65562798               182.5 ns/op           224 B/op         12 allocs/op
BenchmarkProcessListening_New-12                154173038               77.99 ns/op           88 B/op          2 allocs/op
BenchmarkHashPortAndProtocol_Old-12             630870724               18.92 ns/op            8 B/op          2 allocs/op
BenchmarkHashPortAndProtocol_New-12             1000000000              10.30 ns/op            8 B/op          1 allocs/op
BenchmarkHashStrings_Old-12                     231562418               51.60 ns/op           81 B/op          3 allocs/op
BenchmarkHashStrings_New-12                     670954704               17.79 ns/op            0 B/op          0 allocs/op
BenchmarkHashStrings_Old_5Strings-12            89847706               132.6 ns/op           136 B/op          9 allocs/op
BenchmarkHashStrings_New_5Strings-12            301258119               39.68 ns/op            0 B/op          0 allocs/op
PASS
ok      github.com/stackrox/rox/sensor/common/networkflow/manager/indicator     203.889s
Here is the benchmark code:
package indicator

import (
	"hash"
	"testing"

	"github.com/cespare/xxhash"
	"github.com/stackrox/rox/generated/storage"
	"github.com/stackrox/rox/pkg/networkgraph"
)

// Old implementations (before optimization) for comparison

func hashPortAndProtocolOld(h hash.Hash64, port uint16, protocol storage.L4Protocol) {
	// OLD: Two separate allocations
	portBytes := [2]byte{byte(port >> 8), byte(port)}
	_, _ = h.Write(portBytes[:])

	protocolBytes := [4]byte{
		byte(protocol >> 24), byte(protocol >> 16),
		byte(protocol >> 8), byte(protocol),
	}
	_, _ = h.Write(protocolBytes[:])
}

func hashStringsOld(h hash.Hash64, strs ...string) {
	// OLD: Allocates []byte for delimiter and for each string conversion
	for i, s := range strs {
		if i > 0 {
			_, _ = h.Write([]byte{0}) // Allocation per iteration
		}
		_, _ = h.Write([]byte(s)) // Allocation per string
	}
}

func keyHashOld(srcID, dstID string, port uint16, protocol storage.L4Protocol) string {
	h := xxhash.New()
	hashStringsOld(h, srcID, dstID)
	hashPortAndProtocolOld(h, port, protocol)
	return hashToHexString(h.Sum64())
}

// New implementations (current optimized versions) - using the actual functions

func keyHashNew(srcID, dstID string, port uint16, protocol storage.L4Protocol) string {
	h := xxhash.New()
	hashStrings(h, srcID, dstID)
	hashPortAndProtocol(h, port, protocol)
	return hashToHexString(h.Sum64())
}

// Benchmark NetworkConn key generation

func BenchmarkNetworkConn_KeyGeneration_Old(b *testing.B) {
	entity1 := networkgraph.Entity{Type: storage.NetworkEntityInfo_DEPLOYMENT, ID: "deployment-1-with-a-long-name"}
	entity2 := networkgraph.Entity{Type: storage.NetworkEntityInfo_DEPLOYMENT, ID: "deployment-2-with-another-long-name"}
	port := uint16(8080)
	protocol := storage.L4Protocol_L4_PROTOCOL_TCP

	b.ResetTimer()
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		_ = keyHashOld(entity1.ID, entity2.ID, port, protocol)
	}
}

func BenchmarkNetworkConn_KeyGeneration_New(b *testing.B) {
	entity1 := networkgraph.Entity{Type: storage.NetworkEntityInfo_DEPLOYMENT, ID: "deployment-1-with-a-long-name"}
	entity2 := networkgraph.Entity{Type: storage.NetworkEntityInfo_DEPLOYMENT, ID: "deployment-2-with-another-long-name"}
	port := uint16(8080)
	protocol := storage.L4Protocol_L4_PROTOCOL_TCP

	b.ResetTimer()
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		_ = keyHashNew(entity1.ID, entity2.ID, port, protocol)
	}
}

// Benchmark with real NetworkConn objects

func BenchmarkNetworkConn_Key_Old(b *testing.B) {
	entity1 := networkgraph.Entity{Type: storage.NetworkEntityInfo_DEPLOYMENT, ID: "deployment-1-with-a-long-name"}
	entity2 := networkgraph.Entity{Type: storage.NetworkEntityInfo_DEPLOYMENT, ID: "deployment-2-with-another-long-name"}

	conn := NetworkConn{
		SrcEntity: entity1,
		DstEntity: entity2,
		DstPort:   8080,
		Protocol:  storage.L4Protocol_L4_PROTOCOL_TCP,
	}

	b.ResetTimer()
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		_ = keyHashOld(conn.SrcEntity.ID, conn.DstEntity.ID, conn.DstPort, conn.Protocol)
	}
}

func BenchmarkNetworkConn_Key_New(b *testing.B) {
	entity1 := networkgraph.Entity{Type: storage.NetworkEntityInfo_DEPLOYMENT, ID: "deployment-1-with-a-long-name"}
	entity2 := networkgraph.Entity{Type: storage.NetworkEntityInfo_DEPLOYMENT, ID: "deployment-2-with-another-long-name"}

	conn := NetworkConn{
		SrcEntity: entity1,
		DstEntity: entity2,
		DstPort:   8080,
		Protocol:  storage.L4Protocol_L4_PROTOCOL_TCP,
	}

	b.ResetTimer()
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		_ = conn.Key()
	}
}

// Benchmark ContainerEndpoint key generation

func BenchmarkContainerEndpoint_Old(b *testing.B) {
	entity := networkgraph.Entity{Type: storage.NetworkEntityInfo_DEPLOYMENT, ID: "deployment-with-a-very-long-identifier"}
	port := uint16(443)
	protocol := storage.L4Protocol_L4_PROTOCOL_TCP

	b.ResetTimer()
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		_ = keyHashOld(entity.ID, "", port, protocol)
	}
}

func BenchmarkContainerEndpoint_New(b *testing.B) {
	entity := networkgraph.Entity{Type: storage.NetworkEntityInfo_DEPLOYMENT, ID: "deployment-with-a-very-long-identifier"}

	endpoint := ContainerEndpoint{
		Entity:   entity,
		Port:     443,
		Protocol: storage.L4Protocol_L4_PROTOCOL_TCP,
	}

	b.ResetTimer()
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		_ = endpoint.BinaryKey()
	}
}

// Benchmark ProcessListening key generation

func BenchmarkProcessListening_Old(b *testing.B) {
	podID := "nginx-pod-12345678-abcd"
	containerName := "nginx-container-with-long-name"
	processName := "nginx"
	processExec := "/usr/sbin/nginx"
	processArgs := "-g daemon off; -c /etc/nginx/nginx.conf"
	port := uint16(80)
	protocol := storage.L4Protocol_L4_PROTOCOL_TCP

	b.ResetTimer()
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		h := xxhash.New()
		hashStringsOld(h, podID, containerName, processName, processExec, processArgs)
		hashPortAndProtocolOld(h, port, protocol)
		_ = BinaryHash(h.Sum64())
	}
}

func BenchmarkProcessListening_New(b *testing.B) {
	process := ProcessListening{
		Process: ProcessInfo{
			ProcessName: "nginx",
			ProcessArgs: "-g daemon off; -c /etc/nginx/nginx.conf",
			ProcessExec: "/usr/sbin/nginx",
		},
		PodID:         "nginx-pod-12345678-abcd",
		ContainerName: "nginx-container-with-long-name",
		DeploymentID:  "nginx-deployment",
		PodUID:        "pod-uid-456",
		Namespace:     "default",
		Port:          80,
		Protocol:      storage.L4Protocol_L4_PROTOCOL_TCP,
	}

	b.ResetTimer()
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		_ = process.BinaryKey()
	}
}

// Benchmark individual optimization components

func BenchmarkHashPortAndProtocol_Old(b *testing.B) {
	h := xxhash.New()
	port := uint16(8080)
	protocol := storage.L4Protocol_L4_PROTOCOL_TCP

	b.ResetTimer()
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		hashPortAndProtocolOld(h, port, protocol)
		h.Reset()
	}
}

func BenchmarkHashPortAndProtocol_New(b *testing.B) {
	h := xxhash.New()
	port := uint16(8080)
	protocol := storage.L4Protocol_L4_PROTOCOL_TCP

	b.ResetTimer()
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		hashPortAndProtocol(h, port, protocol)
		h.Reset()
	}
}

func BenchmarkHashStrings_Old(b *testing.B) {
	h := xxhash.New()
	str1 := "deployment-1-with-a-long-name"
	str2 := "deployment-2-with-another-long-name"

	b.ResetTimer()
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		hashStringsOld(h, str1, str2)
		h.Reset()
	}
}

func BenchmarkHashStrings_New(b *testing.B) {
	h := xxhash.New()
	str1 := "deployment-1-with-a-long-name"
	str2 := "deployment-2-with-another-long-name"

	b.ResetTimer()
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		hashStrings(h, str1, str2)
		h.Reset()
	}
}

// Benchmark with varying string counts
func BenchmarkHashStrings_Old_5Strings(b *testing.B) {
	h := xxhash.New()
	strs := []string{
		"nginx-pod-12345678-abcd",
		"nginx-container-with-long-name",
		"nginx",
		"/usr/sbin/nginx",
		"-g daemon off; -c /etc/nginx/nginx.conf",
	}

	b.ResetTimer()
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		hashStringsOld(h, strs...)
		h.Reset()
	}
}

func BenchmarkHashStrings_New_5Strings(b *testing.B) {
	h := xxhash.New()
	strs := []string{
		"nginx-pod-12345678-abcd",
		"nginx-container-with-long-name",
		"nginx",
		"/usr/sbin/nginx",
		"-g daemon off; -c /etc/nginx/nginx.conf",
	}

	b.ResetTimer()
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		hashStrings(h, strs...)
		h.Reset()
	}
}

@rhacs-bot

/retest

@rhacs-bot

/retest



openshift-ci bot commented Oct 15, 2025

@vikin91: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/ocp-4-19-qa-e2e-tests 0c2abb2 link false /test ocp-4-19-qa-e2e-tests



@vikin91 vikin91 merged commit 03f121c into master Oct 15, 2025
97 of 98 checks passed
@vikin91 vikin91 deleted the piotr/ROX-31012-less-memory-deduper branch October 15, 2025 20:09

Labels

ai-review · area/sensor · auto-retest
