Conversation
User request: Reduce allocations in applySingleNoLock by applying pre-allocation and moving redundant assignments out of the loop.

Changes:
- Pre-allocate sets/maps with known capacity to avoid growth reallocations
- Move reverseEndpointMap assignment outside loop (was N redundant writes per call)
- Add comprehensive benchmarks for performance tracking

Benchmark results (10 iterations, benchstat):
- ApplySingleNoLock/20Endpoints: -39.73% time (22.46µs → 13.54µs)
- ApplySingleNoLock/50Endpoints: -37.31% time (59.48µs → 37.29µs)
- ApplyRepeated (realistic usage): -40.45% time (2.489ms → 1.482ms)

Memory improvements:
- Medium workloads: -5.09% bytes/op (552.8Ki → 524.7Ki), -0.74% allocs/op
- Large workloads: -7.39% bytes/op (2.271Mi → 2.103Mi), -0.78% allocs/op

Note: Code partially generated with AI assistance.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
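A minimal Go sketch of the two changes described above — the inner set is created once with a capacity hint, and the reverse-map assignment happens a single time outside the loop. The names are illustrative, not the actual store code:

```go
package main

import "fmt"

// applyEndpoints mimics the optimization: the per-deployment set is
// allocated once with a capacity hint, and the reverse-map write (which
// was previously repeated on every loop iteration) happens exactly once.
func applyEndpoints(deploymentID string, endpoints []string,
	reverse map[string]map[string]struct{}) {
	epSet, ok := reverse[deploymentID]
	if !ok {
		// Pre-allocate with known capacity to avoid growth reallocations.
		epSet = make(map[string]struct{}, len(endpoints))
		// Single assignment outside the loop (was N redundant writes).
		reverse[deploymentID] = epSet
	}
	for _, ep := range endpoints {
		epSet[ep] = struct{}{}
	}
}

func main() {
	reverse := make(map[string]map[string]struct{}, 1)
	applyEndpoints("dep-1", []string{"10.0.0.1:80", "10.0.0.2:443"}, reverse)
	fmt.Println(len(reverse["dep-1"])) // → 2
}
```

Sizing the map up front avoids the incremental bucket growth that `make(map...)` without a hint incurs as elements are inserted.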
Remove unconditional etiSet allocation on every loop iteration. Only allocate when actually needed based on map state.

Changes:
- Move etiSet allocation into conditional branches
- Retrieve etiSet from map after ensuring it exists
- Clearer logic flow with explicit cases

Performance improvement:
- Medium workloads: -5% memory (566KB → 537KB), -50 allocations
- Large workloads: -7.4% memory (2.38MB → 2.20MB), -300 allocations

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
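The conditional-allocation pattern can be sketched with a hypothetical `ensurePortSet` helper (not the real `etiSet` code): the inner map is allocated only when the outer map has no entry yet, never unconditionally per iteration:

```go
package main

import "fmt"

// ensurePortSet returns the inner set for key, allocating it only when the
// map has no entry yet. Callers that previously allocated a fresh set on
// every iteration now pay for at most one allocation per key.
func ensurePortSet(m map[uint64]map[uint16]struct{}, key uint64) map[uint16]struct{} {
	if s, ok := m[key]; ok {
		return s // existing entry: no allocation
	}
	s := make(map[uint16]struct{})
	m[key] = s
	return s
}

func main() {
	m := make(map[uint64]map[uint16]struct{})
	a := ensurePortSet(m, 42)
	b := ensurePortSet(m, 42) // second call reuses the same set
	a[80] = struct{}{}
	fmt.Println(len(b)) // → 1 (a and b are the same map)
}
```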
Pre-allocate historical maps with known capacity from current maps to avoid growth reallocations.

Changes:
- Pre-allocate historicalEndpoints[ep][deploymentID] with size from endpointMap
- Pre-allocate reverseHistoricalEndpoints[deploymentID] with size from reverseEndpointMap

This reduces allocations when moving endpoints to history during reset operations or deployment changes.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Tomasz Janiszewski <tomek@redhat.com>
Replace set.Set[EndpointTargetInfo] with set.Set[uint16] since the PortNames field was never used in production code. This simplifies the storage structure and achieves significant memory savings.

Changes:
- Store container ports directly as uint16 instead of EndpointTargetInfo structs
- Remove unused PortNames field from LookupResult (never read anywhere)
- Update all helper methods to work with uint16 ports
- Update tests to not expect PortNames

Benchmark results (benchstat baseline → Phase 1):

Memory (B/op):
- Large workloads: 1.937MB → 1.207MB (-37.67%)
- ApplyRepeated: 458.0KB → 270.6KB (-40.92%)
- Geometric mean: -28.01% memory reduction

Speed (sec/op):
- Medium_Incremental: 1.833ms → 1.454ms (-20.64%)
- Small_Incremental: 141.7µs → 125.7µs (-11.34%)

The ~37% memory reduction comes from replacing EndpointTargetInfo (24 bytes: 16B ContainerPort + 8B PortName pointer) with uint16 (2 bytes).

Related to ROX-33201

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Replace NumericEndpoint (56 bytes) with BinaryHash (uint64) as map keys, following the pattern from PR #17041. This enables runtime.mapassign_fast64 for map operations instead of the slower generic mapassign.

Implementation:
- Add BinaryHash type and BinaryKey() method to NumericEndpoint
- Use xxhash for fast, non-cryptographic hashing
- Update all endpoint maps to use BinaryHash keys
- Hash endpoints once before map operations
- Update debug/logging to format hashes as hex

Benchmark results (Phase 1 → Phase 2):

Speed (sec/op):
- Geometric mean: -51.66% (2x faster!)
- Large_Incremental: 7.764ms → 3.863ms (-50.24%)
- Medium_Incremental: 1.454ms → 834µs (-42.63%)
- ApplySingleNoLock: -63% to -68%

Memory (B/op):
- Large workloads: 1.24MB → 932KB (-24.63%)
- ApplyRepeated: 271KB → 196KB (-27.74%)

Trade-off:
- Allocations: +22% (hashing overhead)

Combined improvement (Baseline → Phase 2):

Speed:
- Geometric mean: -51.66% (2x faster!)
- Medium_Incremental: 1.833ms → 834µs (-54.47%)

Memory:
- Large workloads: 1.98MB → 932KB (-53.02%)
- ApplyRepeated: 458KB → 196KB (-57.31%)

Why this works:
- uint64 keys use mapassign_fast64 (single CMP instruction for comparison)
- NumericEndpoint keys use generic mapassign (56 bytes to hash and compare)
- Reduced map storage overhead (8 bytes vs 56 bytes per key)
- Better CPU cache utilization

Trade-offs:
- Public IP tracking optimization disabled (can't extract IP from hash)
- Hash collision probability: 0.027% for 100M endpoints (negligible for K8s scale)
- Debug output shows hashes instead of endpoint details

Related to ROX-33201

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
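The key idea — collapse a wide struct into a `uint64` so the runtime can take its fast 64-bit map path — can be sketched as follows. This uses the standard library's `hash/fnv` in place of xxhash so the example is self-contained, and `endpoint` is an illustrative type, not the real `NumericEndpoint`:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// endpoint is a stand-in for a wide map-key struct (IP + port + protocol).
type endpoint struct {
	ip    [16]byte
	port  uint16
	proto uint8
}

// binaryKey collapses the struct into a uint64 key, so map operations can
// use the runtime's fast path for 64-bit keys (mapassign_fast64) instead
// of hashing and comparing the full struct on every access.
func (e endpoint) binaryKey() uint64 {
	h := fnv.New64a()
	_, _ = h.Write(e.ip[:])
	var tail [3]byte
	binary.BigEndian.PutUint16(tail[:2], e.port)
	tail[2] = e.proto
	_, _ = h.Write(tail[:])
	return h.Sum64()
}

func main() {
	m := make(map[uint64][]string) // uint64 keys instead of wide structs
	ep := endpoint{port: 8080, proto: 6}
	m[ep.binaryKey()] = append(m[ep.binaryKey()], "dep-1")
	fmt.Println(len(m), ep.binaryKey() == ep.binaryKey()) // → 1 true
}
```

The trade-off named in the commit holds here too: once endpoints are stored by hash, the original IP and port can no longer be recovered from the key.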
Phase 3 of endpoint storage optimization eliminates allocation overhead from BinaryKey() hashing by copying IP address bytes to a reusable buffer before calling h.Write().

Problem: Phase 2 introduced allocations from:
1. data.bytes() method returning d[:] (array-to-slice conversion)
2. Slice escaping to heap when passed to h.Write() interface method
3. Attempts with unsafe.Slice also allocated due to interface boxing

Solution:
- Expand hashBuf from [8]byte to [16]byte to accommodate IPv6
- Type-switch on ipAddrData to get concrete type (ipv4data or ipv6data)
- Copy bytes to buffer: buf[0] = data[0], etc.
- Write from buffer: h.Write(buf[:4]) or h.Write(buf[:16])

Why this works:
- buf points to heap-allocated endpointsStore.hashBuf
- buf[:4] creates slice header pointing to existing memory (no allocation)
- Contrast with data.bytes() which returns slice of temporary value copy

Benchmark results (Baseline → Phase 3 Final):

Speed improvements:
- ApplySingleNoLock: -63.68% to -68.29% (3x faster!)
- Apply/Small_Incremental: -45.02%
- Apply/Medium_Full: -27.17%
- Apply/Large_Full: -33.32%
- Geometric mean: -43.98% (1.78x faster!)

Memory improvements:
- All Apply benchmarks: -52.72% to -54.89% (more than halved!)
- Geometric mean: -40.52%

Allocation improvements:
- ApplySingleNoLock: 0 allocs (same as baseline!)
- Apply/Large_Incremental: +0.00% (essentially identical to baseline)
- Apply/Small_Incremental: +0.20% (negligible)
- Geometric mean: +0.05% (effectively zero overhead)

This achieves the optimal balance:
✅ Near-zero allocation overhead (vs baseline)
✅ 1.78x faster overall
✅ 50%+ memory reduction
✅ Solves runtime.mapassign bottleneck

User request: "but there are still more allocs then on the baseline" / "look at my changes" [user experimented with buffer copy approach]

Implemented buffer copy pattern to eliminate IP address allocations, achieving near-zero allocation overhead while maintaining massive speed and memory improvements.
Partially generated by AI with extensive benchmarking and optimization. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Simplify code and comments:
- Use copy() builtin instead of manual byte assignments
- Remove redundant comments
- Simplify function documentation
- Remove unnecessary local variables

No functional changes, purely cosmetic cleanup.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Eliminate 141 MB of allocation overhead by optimizing how historical
endpoint data is tracked.
**Phase 1: Replace pointer with direct uint16 storage**
Changed historicalEndpoints and reverseHistoricalEndpoints from storing
*entityStatus pointers to storing uint16 values directly. This eliminates
~50-60 bytes of heap allocation overhead per 2-byte tick counter.
Before:
type entityStatus struct { ticksLeft uint16 } // 2 bytes
Stored as: *entityStatus // ~60 bytes with overhead
After:
Stored as: uint16 // 2 bytes
Changes:
- Changed historicalEndpoints value type from *entityStatus to uint16
- Changed reverseHistoricalEndpoints value type from *entityStatus to uint16
- Inlined recordTick() and IsExpired() logic (trivial decrement/comparison)
- Added prettyPrintHistoricalDataPtr() for stores still using pointers
- Updated debug.go to access uint16 values directly
Phase 1 Results:
- Memory: -19% to -27% reduction
- Allocations: -44% to -76% reduction
- Speed: neutral
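The Phase 1 change can be sketched as follows (simplified maps; `entityStatus` as described above). Storing the counter inline removes one heap object and one pointer chase per tracked endpoint:

```go
package main

import "fmt"

// Before: map values were *entityStatus pointers, each a separate heap
// object (~50-60 bytes with allocation overhead) guarding a 2-byte counter.
type entityStatus struct{ ticksLeft uint16 }

func tickPointers(m map[string]*entityStatus) {
	for _, s := range m {
		if s.ticksLeft > 0 {
			s.ticksLeft--
		}
	}
}

// After: the uint16 lives inline in the map bucket; decrementing is a
// plain read-modify-write with no pointer dereference.
func tickValues(m map[string]uint16) {
	for k, ticks := range m {
		if ticks > 0 {
			m[k] = ticks - 1
		}
	}
}

func main() {
	m := map[string]uint16{"ep-1": 2, "ep-2": 0}
	tickValues(m)
	fmt.Println(m["ep-1"], m["ep-2"]) // → 1 0
}
```

(Updating an existing key while ranging over a Go map is well-defined, so the value-based tick loop stays a single pass.)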
**Phase 2: Remove duplicate tracking map**
Removed reverseHistoricalEndpoints entirely as it duplicated data from
historicalEndpoints without providing additional value.
Analysis showed reverseHistoricalEndpoints was only used for:
- RecordTick: Updating tick counters (duplicates historicalEndpoints work)
- deleteFromHistory: Cleanup (can be done from historicalEndpoints alone)
No actual lookups used the reverse map - it was write-only overhead.
Changes:
- Removed reverseHistoricalEndpoints field
- Simplified RecordTick to single loop over historicalEndpoints
- Removed reverse map population in addToHistory
- Simplified deleteFromHistory and removeFromHistoryIfExpired
- Updated test to check historicalEndpoints instead
Phase 2 Results (incremental):
- Speed: 27% to 36% faster (significant improvement!)
- Memory: -10.9% to -12.5% reduction
- Allocations: -4.5% to -8.2% reduction
**Combined Results (Baseline → Phase 2)**:
- Speed: +27% to +36% faster ✅
- Memory: -28% to -33% reduction ✅
- Allocations: -49% to -77% reduction ✅
- Code complexity: Reduced (single source of truth)
Test results:
All tests pass:
ok github.com/stackrox/rox/sensor/common/clusterentities 0.087s
Benchmark comparison (Large Incremental):

name                     old time/op    new time/op    delta
Apply/Large_Incremental  6.23ms ± 21%   3.91ms ± 21%   -37.22%

name                     old alloc/op   new alloc/op   delta
Apply/Large_Incremental  895kB ± 0%     582kB ± 0%     -34.93%

name                     old allocs/op  new allocs/op  delta
Apply/Large_Incremental  37.8k ± 0%     8.6k ± 0%      -77.11%
Prompted by: Optimize History Tracking in Endpoint Storage plan
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Skipping CI for Draft Pull Request.
Hey - I've found 4 issues, and left some high-level feedback:
- In the benchmarking helpers (e.g., `generateTestUpdates`/`generateEntityData` in store_endpoints_bench_test.go), the loops use `for d := range numDeployments`/`for e := range numEndpoints`, which won’t compile for an `int`; these should be standard indexed loops like `for d := 0; d < numDeployments; d++ { ... }`.
- The BinaryKey implementation hashes L4Proto as an 8-byte big-endian value even though it appears to be a small enum; you can reduce work and simplify the code by writing a single byte for the protocol instead of 8 bytes.
## Individual Comments
### Comment 1
<location path="pkg/net/endpoint.go" line_range="140-154" />
<code_context>
+ buf[1] = byte(e.IPAndPort.Port)
+ _, _ = h.Write(buf[:2])
+
+ // Hash protocol (big-endian)
+ buf[0] = byte(e.L4Proto >> 56)
+ buf[1] = byte(e.L4Proto >> 48)
+ buf[2] = byte(e.L4Proto >> 40)
+ buf[3] = byte(e.L4Proto >> 32)
+ buf[4] = byte(e.L4Proto >> 24)
+ buf[5] = byte(e.L4Proto >> 16)
+ buf[6] = byte(e.L4Proto >> 8)
+ buf[7] = byte(e.L4Proto)
+ _, _ = h.Write(buf[:8])
+
</code_context>
<issue_to_address>
**suggestion:** Hashing L4Proto as 8 bytes is unnecessary and confusing if it is a small enum-like type.
Given `L4Proto` is used as an enum-like int (see `NumericEndpointCompare`), writing it as 8 bytes with bit shifts is unnecessary—all high bytes will be zero for typical definitions. This adds noise to the hash input and obscures intent. You can just write a single byte instead:
```go
buf[0] = byte(e.L4Proto)
_, _ = h.Write(buf[:1])
```
This preserves hash behavior while keeping the code simpler and more explicit about the field’s size.
```suggestion
// Hash port (big-endian)
buf[0] = byte(e.IPAndPort.Port >> 8)
buf[1] = byte(e.IPAndPort.Port)
_, _ = h.Write(buf[:2])
// Hash protocol as a single byte (enum-like)
buf[0] = byte(e.L4Proto)
_, _ = h.Write(buf[:1])
```
</issue_to_address>
### Comment 2
<location path="sensor/common/clusterentities/debug.go" line_range="137-140" />
<code_context>
}
- for ep, submap := range e.historicalEndpoints {
- dbg["historicalEndpoints"][ep.String()] = make(map[string]interface{})
+ for epHash, submap := range e.historicalEndpoints {
+ epKey := fmt.Sprintf("0x%x", epHash)
+ dbg["historicalEndpoints"][epKey] = make(map[string]interface{})
for deplID, targetInfoSetMap := range submap {
- for targetInfo, status := range targetInfoSetMap {
- dbg["historicalEndpoints"][ep.String()][deplID] = map[string]interface{}{
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Debug output for historicalEndpoints only keeps the last port per deployment.
In the inner loop, `dbg[
</issue_to_address>
### Comment 3
<location path="sensor/common/clusterentities/store_endpoints.go" line_range="30" />
<code_context>
+ // historicalEndpoints is mimicking endpointMap: endpoint hashes -> deployment id -> container port -> ticks remaining
+ historicalEndpoints map[net.BinaryHash]map[string]map[uint16]uint16
+
+ // h is the xxhash instance used for hashing endpoints
+ h hash.Hash64
+ // hashBuf is a reusable buffer for hashing (16 bytes for IPv6 addresses)
</code_context>
<issue_to_address>
**issue (complexity):** Consider simplifying the new hash-based endpoint storage by using a pure BinaryKey helper instead of a stateful hasher on endpointsStore and by replacing the generic Map[T] abstraction with a concrete doLookupEndpoint function.
You can keep the new memory layout but simplify a couple of the indirections without changing behavior.
### 1. Make `BinaryKey` pure and remove stateful hasher from `endpointsStore`
You don’t need a shared `hash.Hash64` and buffer on the store; `xxhash.Sum64` works on stack-local buffers and keeps hashing independent of the store/locks.
**Before:**
```go
type endpointsStore struct {
// ...
h hash.Hash64
hashBuf [16]byte
}
func newEndpointsStoreWithMemory(numTicks uint16) *endpointsStore {
store := &endpointsStore{
memorySize: numTicks,
h: xxhash.New(),
}
// ...
}
func (e *endpointsStore) applySingleNoLock(deploymentID string, data EntityData) {
// ...
for ep, targetInfos := range data.endpoints {
epHash := ep.BinaryKey(e.h, &e.hashBuf)
// ...
}
}
func (e *endpointsStore) lookupEndpoint(endpoint net.NumericEndpoint, netLookup netAddrLookupper) {
// ...
epHash := endpoint.BinaryKey(e.h, &e.hashBuf)
// ...
}
```
**After (example):**
1. Change `BinaryKey` to be a pure helper, e.g.:
```go
// in net.NumericEndpoint (or a helper in this package)
func (ep net.NumericEndpoint) BinaryKey() net.BinaryHash {
var buf [16]byte
// fill buf deterministically from ep.IPAndPort.Address and ep.IPAndPort.Port
// ...
return net.BinaryHash(xxhash.Sum64(buf[:]))
}
```
2. Drop hasher/buffer fields from the store and call the pure `BinaryKey`:
```go
type endpointsStore struct {
mutex sync.RWMutex
endpointMap map[net.BinaryHash]map[string]set.Set[uint16]
reverseEndpointMap map[string]set.Set[net.BinaryHash]
memorySize uint16
historicalEndpoints map[net.BinaryHash]map[string]map[uint16]uint16
}
func newEndpointsStoreWithMemory(numTicks uint16) *endpointsStore {
store := &endpointsStore{
memorySize: numTicks,
}
concurrency.WithLock(&store.mutex, func() {
store.initMapsNoLock()
})
return store
}
func (e *endpointsStore) applySingleNoLock(deploymentID string, data EntityData) {
// ...
for ep, targetInfos := range data.endpoints {
epHash := ep.BinaryKey()
// unchanged logic using epHash...
}
}
func (e *endpointsStore) lookupEndpoint(endpoint net.NumericEndpoint, netLookup netAddrLookupper) (current, historical, ipLookup, ipLookupHistorical []LookupResult) {
e.mutex.RLock()
defer e.mutex.RUnlock()
epHash := endpoint.BinaryKey()
current = doLookupEndpoint(epHash, e.endpointMap)
historical = doLookupEndpoint(epHash, e.historicalEndpoints)
// rest unchanged...
}
```
This keeps memory benefits but removes shared mutable hashing state and implicit coupling to the store’s lock discipline.
---
### 2. Replace generic `Map[T]` with a concrete helper
The generic `Map[T]` abstraction adds an extra cognitive layer without real reuse. You can make `doLookupEndpoint` concrete and keep the current behavior.
**Before:**
```go
type Map[T any] interface {
~map[uint16]T
}
func doLookupEndpoint[M Map[T], T any](epHash net.BinaryHash, src map[net.BinaryHash]map[string]M) (results []LookupResult) {
for deploymentID, portSet := range src[epHash] {
result := LookupResult{
Entity: networkgraph.EntityForDeployment(deploymentID),
ContainerPorts: make([]uint16, 0, len(portSet)),
}
for port := range portSet {
result.ContainerPorts = append(result.ContainerPorts, port)
}
results = append(results, result)
}
return results
}
```
**After (non-generic, same logic):**
```go
func doLookupEndpoint(epHash net.BinaryHash, src map[net.BinaryHash]map[string]set.Set[uint16]) (results []LookupResult) {
deployments := src[epHash]
if deployments == nil {
return nil
}
for deploymentID, portSet := range deployments {
result := LookupResult{
Entity: networkgraph.EntityForDeployment(deploymentID),
ContainerPorts: make([]uint16, 0, len(portSet)),
}
for port := range portSet {
result.ContainerPorts = append(result.ContainerPorts, port)
}
results = append(results, result)
}
return results
}
```
And keep the caller unchanged:
```go
current = doLookupEndpoint(epHash, e.endpointMap)
historical = doLookupEndpoint(epHash, e.historicalEndpoints)
```
You still get the same behavior and shapes, but the lookup path is much easier to read.
---
These two changes alone remove the shared mutable hasher/buffer and the generic map abstraction, which addresses a significant part of the “layers of indirection” concern while preserving the new hash-based/indexed representation and per-port tracking.
</issue_to_address>
### Comment 4
<location path="pkg/net/endpoint.go" line_range="121" />
<code_context>
+
+var hashDelimiter = []byte{0}
+
+// BinaryKey produces a binary hash for this endpoint.
+// Uses xxhash for fast, non-cryptographic hashing with low collision probability.
+// The buf parameter must be at least [16]byte to avoid allocations for IPv6 addresses.
</code_context>
<issue_to_address>
**issue (complexity):** Consider simplifying `BinaryKey` by making it a pure function that builds and hashes a well-defined local buffer instead of relying on shared hash state and manual byte manipulation.
You can keep the `BinaryHash` optimization while making `BinaryKey` simpler and stateless, and avoiding shared mutable state and manual byte-twiddling.
Consider:
1. Make `BinaryKey` pure: no `hash.Hash64` or external buffer parameters.
2. Build a single local buffer with a fixed layout.
3. Use `encoding/binary` (or `xxhash` directly) instead of manual shifts.
For example, something like this keeps the same idea but is much easier to reason about and use:
```go
import (
"encoding/binary"
"github.com/cespare/xxhash/v2"
)
func (e NumericEndpoint) BinaryKey() BinaryHash {
// Max: 16 bytes IP + 1 delimiter + 2 port + 8 proto = 27 bytes
var buf [27]byte
i := 0
if e.IPAndPort.Address.IsValid() {
switch data := e.IPAndPort.Address.data.(type) {
case ipv4data:
copy(buf[i:i+4], data[:])
i += 4
case ipv6data:
copy(buf[i:i+16], data[:])
i += 16
}
}
// delimiter
buf[i] = 0
i++
// port (big endian)
binary.BigEndian.PutUint16(buf[i:i+2], uint16(e.IPAndPort.Port))
i += 2
// protocol (big endian, 8 bytes to preserve current layout)
binary.BigEndian.PutUint64(buf[i:i+8], uint64(e.L4Proto))
i += 8
return BinaryHash(xxhash.Sum64(buf[:i]))
}
```
If changing the method signature is not acceptable, you can still reduce low-level complexity inside the current API by at least replacing the manual shifts with `binary.BigEndian` and removing the explicit 8-byte protocol write:
```go
import "encoding/binary"
func (e NumericEndpoint) BinaryKey(h hash.Hash64, buf *[16]byte) BinaryHash {
h.Reset()
if e.IPAndPort.Address.IsValid() {
switch data := e.IPAndPort.Address.data.(type) {
case ipv4data:
copy(buf[:4], data[:])
_, _ = h.Write(buf[:4])
case ipv6data:
copy(buf[:16], data[:])
_, _ = h.Write(buf[:16])
}
}
_, _ = h.Write(hashDelimiter)
// reuse buf for port + proto
var tmp [10]byte
binary.BigEndian.PutUint16(tmp[0:2], uint16(e.IPAndPort.Port))
binary.BigEndian.PutUint64(tmp[2:10], uint64(e.L4Proto))
_, _ = h.Write(tmp[:])
return BinaryHash(h.Sum64())
}
```
Both versions keep the memory benefits while making the data layout and encoding clearer and eliminating most of the manual offset/bit-shift logic.
</issue_to_address>
Images are ready for the commit at 1709451.
Codecov Report

Additional details and impacted files:

@@           Coverage Diff            @@
##           master   #19289    +/-   ##
==========================================
+ Coverage   49.54%   49.62%   +0.07%
==========================================
  Files        2674     2680       +6
  Lines      201755   202240     +485
==========================================
+ Hits        99956   100358     +402
- Misses      94340    94400      +60
- Partials     7459     7482      +23

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
## History Tracking Optimization in Endpoint Storage - Implementation Summary

### Overview

Optimized history tracking in the endpoint storage by eliminating 141 MB (89%) of allocation overhead through two strategic optimizations:

1. Replacing `*entityStatus` pointers with direct `uint16` storage
2. Removing the `reverseHistoricalEndpoints` map that duplicated data

This code is called in performance-critical paths:

### Implementation

#### Modified Files
- `sensor/common/clusterentities/entity_status.go` (11 lines)
  - Changed `newHistoricalEntity()` to return `uint16` directly instead of `*entityStatus`
  - Added `newHistoricalEntityPtr()` for backward compatibility with other stores
  - Kept the `entityStatus` struct and methods for stores still using the pointer approach
- `sensor/common/clusterentities/store_endpoints.go` (53 lines changed)
  - Changed the `historicalEndpoints` type from `map[...]*entityStatus` to `map[...]uint16`
  - Removed the `reverseHistoricalEndpoints` field entirely
  - Simplified `RecordTick()` to a single loop (removed dual-map maintenance)
  - Simplified `addToHistory()` by removing reverse map population
  - Simplified `deleteFromHistory()` to only handle a single map
  - Inlined `recordTick()` and `IsExpired()` logic (trivial decrement/comparison)
- `sensor/common/clusterentities/store.go` (16 lines)
  - Updated the `prettyPrintHistoricalData()` generic constraint to `uint16`
  - Added `prettyPrintHistoricalDataPtr()` for stores still using pointers
- `sensor/common/clusterentities/debug.go` (14 lines)
  - Access `uint16` values directly
  - Removed `reverseHistoricalEndpoints` from debug output
- `sensor/common/clusterentities/store_test.go` (2 lines)
  - Updated test to check `historicalEndpoints` instead of `reverseHistoricalEndpoints`
- `sensor/common/clusterentities/store_ips.go` (2 lines)
  - Use `newHistoricalEntityPtr()` for compatibility
- `sensor/common/clusterentities/store_container_id.go` (4 lines)
  - Use `newHistoricalEntityPtr()` for compatibility

### Performance Results
Benchmark Results (Intel i7-6700K @ 4.00GHz)
Overall Performance (Geometric Mean)
Detailed Results by Dataset Size
Small Incremental (5 deployments, 50 endpoints):
Medium Incremental (20 deployments, 200 endpoints):
Large Incremental (50 deployments, 500 endpoints):
Large Full (50 deployments, 500 endpoints, non-incremental):
ApplyRepeated (Realistic Workload):
Key Improvements
- `*entityStatus` required ~50-60 bytes (pointer + heap metadata) for a 2-byte value; `uint16` is stored inline in maps (2 bytes instead of 60+ bytes)
- `reverseHistoricalEndpoints` was write-only overhead, never used for lookups; only `historicalEndpoints` tracks historical data

Phase-by-Phase Breakdown
Phase 1: Direct uint16 Storage
Phase 2: Remove Duplicate Tracking
Testing & Verification
Test Results
Memory Profiling
Test Coverage
Code Quality
Production Impact
Performance-Critical Paths
sensor/common/clusterentities/store_endpoints.go:106 (
Apply)sensor/common/clusterentities/store_endpoints.go:90 (
RecordTick)sensor/common/clusterentities/store_endpoints.go:270 (
addToHistory)Estimated Throughput Improvement
Why the Improvement?
Before: Pointer-based tracking
Problems:
After: Direct value storage
Benefits:
Branch
ROX-33201/store_endpointsmasterBenchstat Output
Files Changed
Total: 7 files changed, 54 insertions(+), 48 deletions(-)
Conclusion
The history tracking optimization successfully:
This optimization is a high-impact, low-risk change that:
The optimization follows the principle of "make the common case fast" - history tracking happens on every entity update and tick cycle, making this a performance-critical path worth optimizing.