perf(sensor): add informer population timing metrics#19614
Conversation
Add instrumentation to track how long initial object population takes after informer cache sync completes. This is the handoff/processing phase that is separate from (and comes after) informer sync time.

Changes:
- New Prometheus histogram `informer_initial_object_population_duration_seconds`, recorded per informer after PopulateInitialObjects finishes.
- Periodic 15-second warnings during long initial-population waits, reporting remaining/total object counts for stuck-informer diagnosis.
- Log messages on interruption and completion of initial population.
- Track `initialSyncTotalIDs` and expose `initialSyncDebugState()` for the warning ticker.
- Add duration logging for enrichAndSend in the networkflow manager to identify slow enrichment ticks.

Partially generated by AI. Made-with: Cursor
This change is part of the following stack. Change managed by git-spice.
Skipping CI for Draft Pull Request.
Hey - I've found 1 issue, and left some high level feedback:
- The new `resourceEventHandlerImpl.name` field is never read; consider either wiring it into log/metric output for per-handler attribution or removing it to avoid confusion about its purpose.
- The new `networkFlowManager.enrichConnections` logging will emit a log line on every tick (debug or warn), which could be very noisy in busy clusters; consider gating these behind a higher log level, rate-limiting, or only logging when the duration exceeds a more conservative threshold.
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location path="sensor/common/networkflow/manager/manager_impl.go" line_range="320-324" />
<code_context>
case <-m.stopper.Flow().StopRequested():
return
case <-tickerC:
+ enrichStart := time.Now()
m.enrichAndSend()
+ enrichDuration := time.Since(enrichStart)
+ if enrichDuration < time.Second {
+ log.Debugf("networkflow tick: finished enrichAndSend in %s; calling clusterentities RecordTick", enrichDuration)
+ } else {
</code_context>
<issue_to_address>
**suggestion:** Warn-level logging on every slow tick may be too noisy for expected bursty behavior
This logs at Warnf for any tick taking ≥1s, even though the message notes this "is not necessarily an error" and can occur during normal bursty updates. In a busy cluster this could generate excessive, low‑value warnings that obscure real issues. Consider raising the threshold, lowering the log level (Info/Debug), or adding rate limiting/backoff to this warning.
</issue_to_address>
Images are ready for the commit at 22e7e58.
Codecov Report: ❌ Patch coverage is
Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master   #19614      +/-   ##
==========================================
+ Coverage   49.25%   49.27%    +0.01%
==========================================
  Files        2735     2735
  Lines      206138   206246      +108
==========================================
+ Hits       101539   101627       +88
- Misses      97051    97076       +25
+ Partials     7548     7543        -5
```
Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
Hey - I've found 1 issue, and left some high level feedback:
- The `informer_initial_object_population_duration_seconds` histogram currently uses `prometheus.DefBuckets`, which top out at 10s; given your example workloads in the tens of seconds, consider defining custom buckets that extend into the minutes range so the metric remains informative rather than accumulating mostly in the `+Inf` bucket.
- In the `warnTicker` loop, you might want to short-circuit logging when `missingCount == 0` (or otherwise detect completion) to avoid periodic "missing=0" log lines if `initialSyncDebugState` and `doneChannel` become momentarily out of sync under race conditions.
- The periodic 15s log uses `Infof` but the description calls them "warnings"; if these are intended as slow-path alerts, consider using `Warnf` or a distinct log key/field so operators can more easily distinguish them from normal informational logs.
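For the bucket concern, one option is exponential buckets spanning sub-second startups through multi-minute tails. The helper below reimplements the semantics of `prometheus.ExponentialBuckets(start, factor, count)` from client_golang as a dependency-free sketch, so the chosen boundaries can be inspected; the concrete start/factor/count values are illustrative, not a recommendation from the PR itself:

```go
package main

import "fmt"

// exponentialBuckets mimics prometheus.ExponentialBuckets(start, factor, count):
// count buckets, the first at start, each subsequent one multiplied by factor.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}

func main() {
	// 0.25s .. 512s (~8.5 min): covers both fast startups and pathological tails,
	// unlike prometheus.DefBuckets which stop at 10s.
	fmt.Println(exponentialBuckets(0.25, 2, 12))
}
```

Twelve buckets keep per-informer series cheap in the TSDB while still resolving the difference between a 30s and a 5-minute population phase.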
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location path="sensor/common/metrics/metrics.go" line_range="354-363" />
<code_context>
+ // phase (after cache sync -> PopulateInitialObjects completion).
+ // - Therefore, informerInitialObjectPopulationDurationSeconds does NOT include any time already
+ // counted in informerSyncDurationMs; the two metrics represent consecutive, non-overlapping phases.
+ informerInitialObjectPopulationDurationSeconds = prometheus.NewHistogramVec(prometheus.HistogramOpts{
+ Namespace: metrics.PrometheusNamespace,
+ Subsystem: metrics.SensorSubsystem.String(),
+ Name: "informer_initial_object_population_duration_seconds",
+ Help: "Time in seconds spent processing and dispatching the initial object snapshot after informer cache sync. " +
+ "Measured per informer from PopulateInitialObjects start until completion. " +
+ "High values indicate slow initial object processing, which can delay full listener readiness.",
+ Buckets: prometheus.DefBuckets,
+ }, []string{"informer"})
)
</code_context>
<issue_to_address>
**suggestion (performance):** Consider tuning histogram buckets for initial object population durations instead of using `prometheus.DefBuckets`.
The default ~5ms–10s buckets may not reflect the expected range for initial population (likely hundreds of ms to tens of seconds, varying with cluster size). Consider defining custom buckets (e.g., extending to 60s/5m or using coarser steps) to improve observability and avoid unnecessary TSDB pressure. If real-world timings are unknown, you could at least add a TODO to revisit this after observing metrics in production.
```suggestion
informerInitialObjectPopulationDurationSeconds = prometheus.NewHistogramVec(prometheus.HistogramOpts{
Namespace: metrics.PrometheusNamespace,
Subsystem: metrics.SensorSubsystem.String(),
Name: "informer_initial_object_population_duration_seconds",
Help: "Time in seconds spent processing and dispatching the initial object snapshot after informer cache sync. " +
"Measured per informer from PopulateInitialObjects start until completion. " +
"High values indicate slow initial object processing, which can delay full listener readiness.",
// NOTE: Initial population duration is typically on the order of hundreds of ms to tens of seconds,
// and can grow with cluster size. Use coarser, extended buckets to capture long tails without
// excessive TSDB cardinality.
// TODO(ROX-XXXX): Revisit bucket boundaries after observing this metric in production.
Buckets: []float64{
0.1, 0.25, 0.5, // sub-second startup
1, 2.5, 5, 10, // common expected range
30, 60, 120, 300, // slow / pathological cases up to 5 minutes
},
}, []string{"informer"})
)
```
</issue_to_address>
@sourcery-ai recheck
Sure! I'm generating a new review now.
Hey - I've found 1 issue
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location path="sensor/kubernetes/listener/resource_event_handler.go" line_range="559-584" />
<code_context>
- case <-doneChannel:
+ initialObjects := informer.GetIndexer().List()
+ doneChannel := handlerImpl.PopulateInitialObjects(initialObjects)
+ waitStarted := time.Now()
+ warnTicker := time.NewTicker(15 * time.Second)
+ defer warnTicker.Stop()
+ for {
+ select {
+ case <-stopSignal.Done():
+ log.Infof("Informer %q: initial object population wait interrupted after %s", name, time.Since(waitStarted).Truncate(time.Millisecond))
+ return
+ case <-doneChannel:
+ duration := time.Since(waitStarted)
+ sensorMetrics.ObserveInformerInitialObjectPopulationDuration(name, duration)
+ log.Debugf("Informer %q: initial object population completed in %s", name, duration.Truncate(time.Millisecond))
+ return
+ case <-warnTicker.C:
+ missingCount, totalCount := handlerImpl.initialSyncDebugState()
+ log.Infof(
</code_context>
<issue_to_address>
**suggestion:** Consider lowering the log level or rate for the periodic "still waiting" messages to avoid log noise in slow or pathological cases.
The `log.Infof("Informer %q: still waiting for initial object population ...")` within the 15s ticker will fire indefinitely for a stuck informer. Across many informers or in consistently slow cases this can generate a lot of Info-level noise. Consider either lowering the level (e.g., `Debugf`), only escalating to `Warnf` after a longer threshold, or adding simple rate limiting (log every Nth tick or only after some minimum wait) to keep the logs useful without being spammy.
```suggestion
initialObjects := informer.GetIndexer().List()
doneChannel := handlerImpl.PopulateInitialObjects(initialObjects)
waitStarted := time.Now()
warnTicker := time.NewTicker(15 * time.Second)
defer warnTicker.Stop()
// Rate-limit and tier logging for long-running initial populations:
// - Log at debug level, at most once per minute, for the first 5 minutes
// - After 5 minutes, escalate to warn level (still at most once per minute)
const (
initialPopulationLogIntervalTicks = 4 // 4 * 15s = 60s
initialPopulationWarnAfter = 5 * time.Minute // escalate after this duration
)
logTickCount := 0
for {
select {
case <-stopSignal.Done():
log.Infof("Informer %q: initial object population wait interrupted after %s", name, time.Since(waitStarted).Truncate(time.Millisecond))
return
case <-doneChannel:
duration := time.Since(waitStarted)
sensorMetrics.ObserveInformerInitialObjectPopulationDuration(name, duration)
log.Debugf("Informer %q: initial object population completed in %s", name, duration.Truncate(time.Millisecond))
return
case <-warnTicker.C:
logTickCount++
if logTickCount%initialPopulationLogIntervalTicks != 0 {
// Skip to limit log volume; we only log every Nth tick.
continue
}
missingCount, totalCount := handlerImpl.initialSyncDebugState()
elapsed := time.Since(waitStarted)
elapsedTrunc := elapsed.Truncate(time.Millisecond)
if elapsed >= initialPopulationWarnAfter {
log.Warnf(
"Informer %q: still waiting for initial object population after %s (missing=%d total=%d)",
name,
elapsedTrunc,
missingCount,
totalCount,
)
} else {
log.Debugf(
"Informer %q: still waiting for initial object population after %s (missing=%d total=%d)",
name,
elapsedTrunc,
missingCount,
totalCount,
)
}
}
}
```
</issue_to_address>
Hey @vikin91, I've posted a new review for you! |
Description
Add instrumentation to track how long initial object population takes after
informer cache sync completes. This is the handoff/processing phase that
follows informer sync time and is a separate, non-overlapping metric.
Despite the sync appearing complete, the objects that have now fully arrived at Sensor may still not be fully processed.
This PR adds periodic logs if the first stage of processing (after informer sync) takes longer than expected.
This may highlight problems with Sensor performance and hint the incident handler in the right direction.
Example:
(triggered by using fake workload with 5k deployments, each with 4 pods and 2 containers).
Key changes:
- New Prometheus histogram `informer_initial_object_population_duration_seconds`, recorded per informer after PopulateInitialObjects finishes.
- Periodic 15-second warnings during long initial-population waits, reporting remaining/total object counts for stuck-informer diagnosis.
- Track `initialSyncTotalIDs` and expose `initialSyncDebugState()` for diagnostics.
User-facing documentation
Testing and quality
Automated testing
How I validated my change
Ran local-sensor with large fake workloads and observed:
Looking at the metrics:
Difference from the metric introduced for the informer tracker:
K8s passed the objects to Sensor quickly (6ms, which is suspiciously fast in this case), but the initial processing of them took 49s.