
ROX-31012: Optimize memory for UpdateComputer endpoint dedupers#17040

Merged
vikin91 merged 11 commits into master from piotr/ROX-31012-less-memory-deduper
Oct 15, 2025

Conversation


@vikin91 vikin91 commented Sep 26, 2025

(OPTIONAL FOR 4.9)

Description

This PR changes the key and value type of the endpoints deduper from string to a fixed-size [8]byte holding a uint64 hash. That significantly reduces the memory footprint of the deduper (in theory by 50%) without losing accuracy. The 8-byte type is used because of 64-bit hashing, which has a collision probability of roughly 0.027% for 100M entries*.
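For illustration, deriving the fixed-size key from a 64-bit hash sum could look like this (a minimal sketch; the stdlib FNV hash stands in for xxhash here, and binaryKey is a hypothetical helper, not the exact function in the PR):

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// binaryKey folds a 64-bit hash of the given fields into a fixed-size
// [8]byte value. Storing 8 raw bytes instead of a 16-character hex string
// halves the key size and avoids the string header overhead.
func binaryKey(parts ...string) [8]byte {
	h := fnv.New64a() // stand-in for xxhash in this sketch
	for i, s := range parts {
		if i > 0 {
			h.Write([]byte{0}) // delimiter between fields
		}
		h.Write([]byte(s))
	}
	var key [8]byte
	binary.BigEndian.PutUint64(key[:], h.Sum64())
	return key
}

func main() {
	k := binaryKey("deployment-1", "deployment-2")
	fmt.Printf("%x (len=%d)\n", k, len(k))
}
```

Because the key is a comparable fixed-size array, it can be used directly as a Go map key without any string conversion.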

Drawbacks:

  • It is no longer possible to switch the hashingAlgo between string and hash: the constant-length key type requires hashing.

*) A long-running cluster with 100 open endpoints per second generates roughly 16M entries in 3 days.
Increasing the hash length to 128 bits lowers the collision chance to below ~10⁻²⁰ while using the same amount of memory as the current map[string]string implementation.

User-facing documentation

Testing and quality

  • the change is production ready: the change is GA, or otherwise the functionality is gated by a feature flag
  • CI results are inspected

Automated testing

  • added unit tests
  • added e2e tests
  • added regression tests
  • added compatibility tests
  • modified existing tests

How I validated my change

Insights into deduper memory after 2 days on a long-running cluster with fake workloads


Size of single deduper entry

  • Before (code from piotr/ROX-30941-no-deduping-processes): cluster ran for over 6 days; value = 3.60 GB / 31.5 M deduper entries
  • After (code from this branch): cluster ran for a few hours; value = 11.3 MB / 393 K deduper entries
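The before/after figures above work out to roughly a 4x reduction in per-entry size, which a quick arithmetic check confirms:

```go
package main

import "fmt"

func main() {
	// Figures taken from the measurements quoted above.
	before := 3.60e9 / 31.5e6 // bytes per entry before
	after := 11.3e6 / 393e3   // bytes per entry after
	// prints: before: 114 B/entry, after: 29 B/entry, ratio: 4.0x
	fmt.Printf("before: %.0f B/entry, after: %.0f B/entry, ratio: %.1fx\n",
		before, after, before/after)
}
```

The measured reduction is larger than the theoretical 50% because the old map[string]string entries also carried string headers and heap-allocation overhead on top of the character data.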



openshift-ci bot commented Sep 26, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@vikin91 vikin91 changed the title from "ROX-31012:" to "ROX-31012: Optimize memory for UpdateComputer dedupers" Sep 26, 2025
@sourcery-ai sourcery-ai bot left a comment

Hey there - I've reviewed your changes and they look great!

@vikin91 vikin91 changed the title from "ROX-31012: Optimize memory for UpdateComputer dedupers" to "ROX-31012: Optimize memory for UpdateComputer endpoint dedupers" Sep 26, 2025

rhacs-bot commented Sep 26, 2025

Images are ready for the commit at 175c010.

To use with deploy scripts, first export MAIN_IMAGE_TAG=4.9.x-1002-g175c010b7a.

codecov bot commented Sep 26, 2025

Codecov Report

❌ Patch coverage is 93.33333% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 48.85%. Comparing base (bf99a92) to head (0c2abb2).
⚠️ Report is 21 commits behind head on master.

Files with missing lines Patch % Lines
...mon/networkflow/updatecomputer/transition_based.go 86.66% 2 Missing ⚠️
...ensor/common/networkflow/updatecomputer/testing.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #17040      +/-   ##
==========================================
- Coverage   48.86%   48.85%   -0.02%     
==========================================
  Files        2720     2720              
  Lines      203364   203304      -60     
==========================================
- Hits        99374    99319      -55     
+ Misses      96166    96163       -3     
+ Partials     7824     7822       -2     
Flag Coverage Δ
go-unit-tests 48.85% <93.33%> (-0.02%) ⬇️


@vikin91 vikin91 force-pushed the piotr/ROX-30941-no-deduping-processes branch from 0264519 to 26fee46 on October 1, 2025 08:01
@vikin91 vikin91 force-pushed the piotr/ROX-31012-less-memory-deduper branch from f636ddd to 425a08d on October 1, 2025 08:02
Base automatically changed from piotr/ROX-30941-no-deduping-processes to master on October 1, 2025 10:33
@vikin91 vikin91 force-pushed the piotr/ROX-31012-less-memory-deduper branch from 5e67e12 to 677d6ce on October 2, 2025 08:47
@vikin91 vikin91 marked this pull request as ready for review October 2, 2025 08:55
@vikin91 vikin91 requested a review from a team as a code owner October 2, 2025 08:55
@sourcery-ai sourcery-ai bot left a comment

Hey there - I've reviewed your changes and they look great!

@vikin91 vikin91 added the auto-retest PRs with this label will be automatically retested if prow checks fails label Oct 2, 2025

vikin91 commented Oct 2, 2025

/retest

@red-hat-konflux

This comment was marked as off-topic.

@janisz janisz requested a review from a team October 3, 2025 12:34
@vikin91 vikin91 force-pushed the piotr/ROX-31012-less-memory-deduper branch from 677d6ce to 175c010 on October 7, 2025 11:22
@rhacs-bot

/retest

@vikin91 vikin91 force-pushed the piotr/ROX-31012-less-memory-deduper branch from bf181fd to 41d6369 on October 14, 2025 13:09
@janisz janisz requested review from a team and johannes94 October 15, 2025 11:21

@janisz janisz left a comment


LGTM. There are potential perf optimisations, but those could be handled in a follow-up PR.
I've added @johannes94, the hashing expert, to find other optimisations.


johannes94 commented Oct 15, 2025

@janisz @vikin91

TBH I don't feel like a hashing expert, and it is hard for me to come up with good suggestions without data showing where the allocation and heap-usage hot spots are.

If I were to look for more optimizations, I would:

  • Write a benchmark that exercises the hash function, e.g. NetworkConn.Key, or even the code path that stores the keys
  • Run it with go test -bench=. -memprofile mem.prof
  • Analyze with go tool pprof mem.prof and go tool pprof --alloc_space mem.prof, and optimize the hot spots visible there
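The workflow above can be sketched as follows (illustrative only: the stdlib FNV hash stands in for xxhash, and keyOf is a hypothetical placeholder for something like NetworkConn.Key; testing.Benchmark just lets the sketch run outside go test):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"testing"
)

// keyOf is a hypothetical stand-in for the real key function under test.
func keyOf(src, dst string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(src))
	h.Write([]byte{0}) // delimiter between fields
	h.Write([]byte(dst))
	return h.Sum64()
}

func main() {
	// In the repo this body would live in a _test.go benchmark and be run
	// with: go test -bench=. -memprofile mem.prof
	res := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			_ = keyOf("deployment-1", "deployment-2")
		}
	})
	fmt.Printf("%d allocs/op, %d B/op\n",
		res.AllocsPerOp(), res.AllocedBytesPerOp())
}
```

The allocs/op and B/op numbers point at which conversions (e.g. []byte(s) copies) dominate, and the mem.prof from the real benchmark can then be drilled into with pprof.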

@vikin91 vikin91 requested a review from a team as a code owner October 15, 2025 13:04
@vikin91
Copy link
Contributor Author

vikin91 commented Oct 15, 2025

Naturally! I compared 41d6369 (Old) against 0c2abb2 (New).

Here are the results:

➜ go test -bench=. -benchmem -benchtime=10s ./sensor/common/networkflow/manager/indicator/
goos: darwin
goarch: arm64
pkg: github.com/stackrox/rox/sensor/common/networkflow/manager/indicator
cpu: Apple M3 Pro
BenchmarkNetworkConn_KeyGeneration_Old-12       100000000              119.5 ns/op           184 B/op          7 allocs/op
BenchmarkNetworkConn_KeyGeneration_New-12       172759272               67.48 ns/op          104 B/op          3 allocs/op
BenchmarkNetworkConn_Key_Old-12                 100000000              111.5 ns/op           184 B/op          7 allocs/op
BenchmarkNetworkConn_Key_New-12                 173108421               69.77 ns/op          104 B/op          3 allocs/op
BenchmarkContainerEndpoint_Old-12               100000000              101.0 ns/op           152 B/op          6 allocs/op
BenchmarkContainerEndpoint_New-12               335066269               35.69 ns/op           88 B/op          2 allocs/op
BenchmarkProcessListening_Old-12                65562798               182.5 ns/op           224 B/op         12 allocs/op
BenchmarkProcessListening_New-12                154173038               77.99 ns/op           88 B/op          2 allocs/op
BenchmarkHashPortAndProtocol_Old-12             630870724               18.92 ns/op            8 B/op          2 allocs/op
BenchmarkHashPortAndProtocol_New-12             1000000000              10.30 ns/op            8 B/op          1 allocs/op
BenchmarkHashStrings_Old-12                     231562418               51.60 ns/op           81 B/op          3 allocs/op
BenchmarkHashStrings_New-12                     670954704               17.79 ns/op            0 B/op          0 allocs/op
BenchmarkHashStrings_Old_5Strings-12            89847706               132.6 ns/op           136 B/op          9 allocs/op
BenchmarkHashStrings_New_5Strings-12            301258119               39.68 ns/op            0 B/op          0 allocs/op
PASS
ok      github.com/stackrox/rox/sensor/common/networkflow/manager/indicator     203.889s
Here is the benchmark code:
package indicator

import (
	"hash"
	"testing"

	"github.com/cespare/xxhash"
	"github.com/stackrox/rox/generated/storage"
	"github.com/stackrox/rox/pkg/networkgraph"
)

// Old implementations (before optimization) for comparison

func hashPortAndProtocolOld(h hash.Hash64, port uint16, protocol storage.L4Protocol) {
	// OLD: Two separate allocations
	portBytes := [2]byte{byte(port >> 8), byte(port)}
	_, _ = h.Write(portBytes[:])

	protocolBytes := [4]byte{
		byte(protocol >> 24), byte(protocol >> 16),
		byte(protocol >> 8), byte(protocol),
	}
	_, _ = h.Write(protocolBytes[:])
}

func hashStringsOld(h hash.Hash64, strs ...string) {
	// OLD: Allocates []byte for delimiter and for each string conversion
	for i, s := range strs {
		if i > 0 {
			_, _ = h.Write([]byte{0}) // Allocation per iteration
		}
		_, _ = h.Write([]byte(s)) // Allocation per string
	}
}

func keyHashOld(srcID, dstID string, port uint16, protocol storage.L4Protocol) string {
	h := xxhash.New()
	hashStringsOld(h, srcID, dstID)
	hashPortAndProtocolOld(h, port, protocol)
	return hashToHexString(h.Sum64())
}

// New implementations (current optimized versions) - using the actual functions

func keyHashNew(srcID, dstID string, port uint16, protocol storage.L4Protocol) string {
	h := xxhash.New()
	hashStrings(h, srcID, dstID)
	hashPortAndProtocol(h, port, protocol)
	return hashToHexString(h.Sum64())
}

// Benchmark NetworkConn key generation

func BenchmarkNetworkConn_KeyGeneration_Old(b *testing.B) {
	entity1 := networkgraph.Entity{Type: storage.NetworkEntityInfo_DEPLOYMENT, ID: "deployment-1-with-a-long-name"}
	entity2 := networkgraph.Entity{Type: storage.NetworkEntityInfo_DEPLOYMENT, ID: "deployment-2-with-another-long-name"}
	port := uint16(8080)
	protocol := storage.L4Protocol_L4_PROTOCOL_TCP

	b.ResetTimer()
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		_ = keyHashOld(entity1.ID, entity2.ID, port, protocol)
	}
}

func BenchmarkNetworkConn_KeyGeneration_New(b *testing.B) {
	entity1 := networkgraph.Entity{Type: storage.NetworkEntityInfo_DEPLOYMENT, ID: "deployment-1-with-a-long-name"}
	entity2 := networkgraph.Entity{Type: storage.NetworkEntityInfo_DEPLOYMENT, ID: "deployment-2-with-another-long-name"}
	port := uint16(8080)
	protocol := storage.L4Protocol_L4_PROTOCOL_TCP

	b.ResetTimer()
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		_ = keyHashNew(entity1.ID, entity2.ID, port, protocol)
	}
}

// Benchmark with real NetworkConn objects

func BenchmarkNetworkConn_Key_Old(b *testing.B) {
	entity1 := networkgraph.Entity{Type: storage.NetworkEntityInfo_DEPLOYMENT, ID: "deployment-1-with-a-long-name"}
	entity2 := networkgraph.Entity{Type: storage.NetworkEntityInfo_DEPLOYMENT, ID: "deployment-2-with-another-long-name"}

	conn := NetworkConn{
		SrcEntity: entity1,
		DstEntity: entity2,
		DstPort:   8080,
		Protocol:  storage.L4Protocol_L4_PROTOCOL_TCP,
	}

	b.ResetTimer()
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		_ = keyHashOld(conn.SrcEntity.ID, conn.DstEntity.ID, conn.DstPort, conn.Protocol)
	}
}

func BenchmarkNetworkConn_Key_New(b *testing.B) {
	entity1 := networkgraph.Entity{Type: storage.NetworkEntityInfo_DEPLOYMENT, ID: "deployment-1-with-a-long-name"}
	entity2 := networkgraph.Entity{Type: storage.NetworkEntityInfo_DEPLOYMENT, ID: "deployment-2-with-another-long-name"}

	conn := NetworkConn{
		SrcEntity: entity1,
		DstEntity: entity2,
		DstPort:   8080,
		Protocol:  storage.L4Protocol_L4_PROTOCOL_TCP,
	}

	b.ResetTimer()
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		_ = conn.Key()
	}
}

// Benchmark ContainerEndpoint key generation

func BenchmarkContainerEndpoint_Old(b *testing.B) {
	entity := networkgraph.Entity{Type: storage.NetworkEntityInfo_DEPLOYMENT, ID: "deployment-with-a-very-long-identifier"}
	port := uint16(443)
	protocol := storage.L4Protocol_L4_PROTOCOL_TCP

	b.ResetTimer()
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		_ = keyHashOld(entity.ID, "", port, protocol)
	}
}

func BenchmarkContainerEndpoint_New(b *testing.B) {
	entity := networkgraph.Entity{Type: storage.NetworkEntityInfo_DEPLOYMENT, ID: "deployment-with-a-very-long-identifier"}

	endpoint := ContainerEndpoint{
		Entity:   entity,
		Port:     443,
		Protocol: storage.L4Protocol_L4_PROTOCOL_TCP,
	}

	b.ResetTimer()
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		_ = endpoint.BinaryKey()
	}
}

// Benchmark ProcessListening key generation

func BenchmarkProcessListening_Old(b *testing.B) {
	podID := "nginx-pod-12345678-abcd"
	containerName := "nginx-container-with-long-name"
	processName := "nginx"
	processExec := "/usr/sbin/nginx"
	processArgs := "-g daemon off; -c /etc/nginx/nginx.conf"
	port := uint16(80)
	protocol := storage.L4Protocol_L4_PROTOCOL_TCP

	b.ResetTimer()
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		h := xxhash.New()
		hashStringsOld(h, podID, containerName, processName, processExec, processArgs)
		hashPortAndProtocolOld(h, port, protocol)
		_ = BinaryHash(h.Sum64())
	}
}

func BenchmarkProcessListening_New(b *testing.B) {
	process := ProcessListening{
		Process: ProcessInfo{
			ProcessName: "nginx",
			ProcessArgs: "-g daemon off; -c /etc/nginx/nginx.conf",
			ProcessExec: "/usr/sbin/nginx",
		},
		PodID:         "nginx-pod-12345678-abcd",
		ContainerName: "nginx-container-with-long-name",
		DeploymentID:  "nginx-deployment",
		PodUID:        "pod-uid-456",
		Namespace:     "default",
		Port:          80,
		Protocol:      storage.L4Protocol_L4_PROTOCOL_TCP,
	}

	b.ResetTimer()
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		_ = process.BinaryKey()
	}
}

// Benchmark individual optimization components

func BenchmarkHashPortAndProtocol_Old(b *testing.B) {
	h := xxhash.New()
	port := uint16(8080)
	protocol := storage.L4Protocol_L4_PROTOCOL_TCP

	b.ResetTimer()
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		hashPortAndProtocolOld(h, port, protocol)
		h.Reset()
	}
}

func BenchmarkHashPortAndProtocol_New(b *testing.B) {
	h := xxhash.New()
	port := uint16(8080)
	protocol := storage.L4Protocol_L4_PROTOCOL_TCP

	b.ResetTimer()
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		hashPortAndProtocol(h, port, protocol)
		h.Reset()
	}
}

func BenchmarkHashStrings_Old(b *testing.B) {
	h := xxhash.New()
	str1 := "deployment-1-with-a-long-name"
	str2 := "deployment-2-with-another-long-name"

	b.ResetTimer()
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		hashStringsOld(h, str1, str2)
		h.Reset()
	}
}

func BenchmarkHashStrings_New(b *testing.B) {
	h := xxhash.New()
	str1 := "deployment-1-with-a-long-name"
	str2 := "deployment-2-with-another-long-name"

	b.ResetTimer()
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		hashStrings(h, str1, str2)
		h.Reset()
	}
}

// Benchmark with varying string counts
func BenchmarkHashStrings_Old_5Strings(b *testing.B) {
	h := xxhash.New()
	strs := []string{
		"nginx-pod-12345678-abcd",
		"nginx-container-with-long-name",
		"nginx",
		"/usr/sbin/nginx",
		"-g daemon off; -c /etc/nginx/nginx.conf",
	}

	b.ResetTimer()
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		hashStringsOld(h, strs...)
		h.Reset()
	}
}

func BenchmarkHashStrings_New_5Strings(b *testing.B) {
	h := xxhash.New()
	strs := []string{
		"nginx-pod-12345678-abcd",
		"nginx-container-with-long-name",
		"nginx",
		"/usr/sbin/nginx",
		"-g daemon off; -c /etc/nginx/nginx.conf",
	}

	b.ResetTimer()
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		hashStrings(h, strs...)
		h.Reset()
	}
}

@rhacs-bot

/retest

@rhacs-bot

/retest



openshift-ci bot commented Oct 15, 2025

@vikin91: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/ocp-4-19-qa-e2e-tests 0c2abb2 link false /test ocp-4-19-qa-e2e-tests



@vikin91 vikin91 merged commit 03f121c into master Oct 15, 2025
97 of 98 checks passed
@vikin91 vikin91 deleted the piotr/ROX-31012-less-memory-deduper branch October 15, 2025 20:09

Labels

ai-review · area/sensor · auto-retest
