
perf(sensor): add informer population timing metrics#19614

Open
vikin91 wants to merge 4 commits into master from piotr/informer-instrumentation

Conversation


@vikin91 vikin91 commented Mar 25, 2026

Description

Add instrumentation to track how long initial object population takes after
informer cache sync completes. This is the handoff/processing phase that
follows informer sync time and is a separate, non-overlapping metric.

Despite seeing:

informer_sync_tracker.go:234: Info: All 18 informers have synced successfully

the objects that have now fully arrived at Sensor may still not be fully processed.
This PR adds periodic logs when the first stage of processing (after informer sync) takes longer than expected.
This may highlight problems with Sensor performance and point the incident handler in the right direction.

Example:

# k8s starts ingesting objects through informers
resource_event_handler.go:576: Info: Informer "Pods": still waiting for initial object population after 15s (missing=18978 total=27500)
# ^^ New line; shows that processing the objects is taking at least 15s.

informer_sync_tracker.go:234: Info: All 18 informers have synced successfully
# ^^ this was there before. It means that k8s gave us all objects and now Sensor has them.

resource_event_handler.go:576: Info: Informer "Pods": still waiting for initial object population after 30s (missing=10655 total=27500)
# ^^ Again, processing takes long. Only 10k more pods to process.
resource_event_handler.go:576: Info: Informer "Pods": still waiting for initial object population after 45s (missing=2384 total=27500)

informer_sync_tracker.go:148: Info: Informer sync tracker stopped
# ^^ This line was there before. It tells us that there is nothing left to track. However, it is not obvious from the logs that one must wait for this line to know that initial object population has finished.

(Triggered with a fake workload of 5k deployments, each with 4 pods and 2 containers.)

Key changes:

  • New Prometheus histogram informer_initial_object_population_duration_seconds
    recorded per informer after PopulateInitialObjects finishes.
  • Periodic 15s warnings during long initial-population waits, reporting
    remaining/total object counts for stuck-informer diagnosis.
  • Log messages on interruption, completion, and aborted cache syncs.
  • Track initialSyncTotalIDs and expose initialSyncDebugState() for
    diagnostics.

User-facing documentation

Testing and quality

  • the change is production ready: the change is GA, or otherwise the functionality is gated by a feature flag
  • CI results are inspected

Automated testing

  • added unit tests
  • added e2e tests
  • added regression tests
  • added compatibility tests
  • modified existing tests

How I validated my change

Ran local-sensor with large fake workloads and observed:

  • Histogram metric emitted on Prometheus endpoint with per-informer labels
  • Warning logs appearing at 15s intervals during slow population

Looking at the metrics:

➜ curl -sS http://localhost:9090/metrics | grep informer_initial_object_population_duration_seconds | grep Pods | grep -v _bucket
rox_sensor_informer_initial_object_population_duration_seconds_sum{informer="Pods"} 49.6065285
rox_sensor_informer_initial_object_population_duration_seconds_count{informer="Pods"} 1

Difference from the metric introduced in the informer sync tracker:

rox_sensor_informer_sync_duration_ms{informer="Pods"} 6

K8s passed the objects to Sensor quickly (6ms, suspiciously fast in this case), but the initial processing of them took 49s.

Add instrumentation to track how long initial object population takes
after informer cache sync completes. This is the handoff/processing
phase that is separate from (and comes after) informer sync time.

Changes:
- New Prometheus histogram `informer_initial_object_population_duration_seconds`
  recorded per informer after PopulateInitialObjects finishes.
- Periodic 15-second warnings during long initial-population waits, reporting
  remaining/total object counts for stuck-informer diagnosis.
- Log messages on interruption and completion of initial population.
- Track `initialSyncTotalIDs` and expose `initialSyncDebugState()` for the
  warning ticker.
- Add duration logging for enrichAndSend in networkflow manager to identify
  slow enrichment ticks.

Partially generated by AI.

Made-with: Cursor

vikin91 commented Mar 25, 2026

This change is part of the following stack:

Change managed by git-spice.


openshift-ci bot commented Mar 25, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all


@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 1 issue, and left some high level feedback:

  • The new resourceEventHandlerImpl.name field is never read; consider either wiring it into log/metric output for per-handler attribution or removing it to avoid confusion about its purpose.
  • The new networkFlowManager.enrichConnections logging will emit a log line on every tick (debug or warn), which could be very noisy in busy clusters; consider gating these behind a higher log level, rate-limiting, or only logging when the duration exceeds a more conservative threshold.

## Individual Comments

### Comment 1
<location path="sensor/common/networkflow/manager/manager_impl.go" line_range="320-324" />
<code_context>
 		case <-m.stopper.Flow().StopRequested():
 			return
 		case <-tickerC:
+			enrichStart := time.Now()
 			m.enrichAndSend()
+			enrichDuration := time.Since(enrichStart)
+			if enrichDuration < time.Second {
+				log.Debugf("networkflow tick: finished enrichAndSend in %s; calling clusterentities RecordTick", enrichDuration)
+			} else {
</code_context>
<issue_to_address>
**suggestion:** Warn-level logging on every slow tick may be too noisy for expected bursty behavior

This logs at Warnf for any tick taking ≥1s, even though the message notes this "is not necessarily an error" and can occur during normal bursty updates. In a busy cluster this could generate excessive, low‑value warnings that obscure real issues. Consider raising the threshold, lowering the log level (Info/Debug), or adding rate limiting/backoff to this warning.
</issue_to_address>



rhacs-bot commented Mar 25, 2026

Images are ready for the commit at 22e7e58.

To use with deploy scripts, first export MAIN_IMAGE_TAG=4.11.x-433-g22e7e58397.


codecov bot commented Mar 25, 2026

Codecov Report

❌ Patch coverage is 2.85714% with 34 lines in your changes missing coverage. Please review.
✅ Project coverage is 49.27%. Comparing base (123b457) to head (22e7e58).
⚠️ Report is 17 commits behind head on master.

Files with missing lines Patch % Lines
...nsor/kubernetes/listener/resource_event_handler.go 0.00% 25 Missing ⚠️
...kubernetes/listener/resource_event_handler_impl.go 10.00% 9 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #19614      +/-   ##
==========================================
+ Coverage   49.25%   49.27%   +0.01%     
==========================================
  Files        2735     2735              
  Lines      206138   206246     +108     
==========================================
+ Hits       101539   101627      +88     
- Misses      97051    97076      +25     
+ Partials     7548     7543       -5     
Flag Coverage Δ
go-unit-tests 49.27% <2.85%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.


@vikin91 vikin91 marked this pull request as ready for review March 25, 2026 17:53
@vikin91 vikin91 requested a review from a team as a code owner March 25, 2026 17:53

@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 1 issue, and left some high level feedback:

  • The informer_initial_object_population_duration_seconds histogram currently uses prometheus.DefBuckets, which top out at 10s; given your example workloads in the tens of seconds, consider defining custom buckets that extend into the minutes range so the metric remains informative rather than accumulating mostly in the +Inf bucket.
  • In the warnTicker loop, you might want to short‑circuit logging when missingCount == 0 (or otherwise detect completion) to avoid periodic "missing=0" log lines if initialSyncDebugState and doneChannel become momentarily out of sync under race conditions.
  • The periodic 15s log uses Infof but the description calls them "warnings"; if these are intended as slow‑path alerts, consider using Warnf or a distinct log key/field so operators can more easily distinguish them from normal informational logs.

## Individual Comments

### Comment 1
<location path="sensor/common/metrics/metrics.go" line_range="354-363" />
<code_context>
+	//     phase (after cache sync -> PopulateInitialObjects completion).
+	//   - Therefore, informerInitialObjectPopulationDurationSeconds does NOT include any time already
+	//     counted in informerSyncDurationMs; the two metrics represent consecutive, non-overlapping phases.
+	informerInitialObjectPopulationDurationSeconds = prometheus.NewHistogramVec(prometheus.HistogramOpts{
+		Namespace: metrics.PrometheusNamespace,
+		Subsystem: metrics.SensorSubsystem.String(),
+		Name:      "informer_initial_object_population_duration_seconds",
+		Help: "Time in seconds spent processing and dispatching the initial object snapshot after informer cache sync. " +
+			"Measured per informer from PopulateInitialObjects start until completion. " +
+			"High values indicate slow initial object processing, which can delay full listener readiness.",
+		Buckets: prometheus.DefBuckets,
+	}, []string{"informer"})
 )
</code_context>
<issue_to_address>
**suggestion (performance):** Consider tuning histogram buckets for initial object population durations instead of using `prometheus.DefBuckets`.

The default ~5ms–10s buckets may not reflect the expected range for initial population (likely hundreds of ms to tens of seconds, varying with cluster size). Consider defining custom buckets (e.g., extending to 60s/5m or using coarser steps) to improve observability and avoid unnecessary TSDB pressure. If real-world timings are unknown, you could at least add a TODO to revisit this after observing metrics in production.

```suggestion
	informerInitialObjectPopulationDurationSeconds = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Namespace: metrics.PrometheusNamespace,
		Subsystem: metrics.SensorSubsystem.String(),
		Name:      "informer_initial_object_population_duration_seconds",
		Help: "Time in seconds spent processing and dispatching the initial object snapshot after informer cache sync. " +
			"Measured per informer from PopulateInitialObjects start until completion. " +
			"High values indicate slow initial object processing, which can delay full listener readiness.",
		// NOTE: Initial population duration is typically on the order of hundreds of ms to tens of seconds,
		// and can grow with cluster size. Use coarser, extended buckets to capture long tails without
		// excessive TSDB cardinality.
		// TODO(ROX-XXXX): Revisit bucket boundaries after observing this metric in production.
		Buckets: []float64{
			0.1, 0.25, 0.5, // sub-second startup
			1, 2.5, 5, 10,  // common expected range
			30, 60, 120, 300, // slow / pathological cases up to 5 minutes
		},
	}, []string{"informer"})
)
```
</issue_to_address>



vikin91 commented Mar 26, 2026

@sourcery-ai recheck


sourcery-ai bot commented Mar 26, 2026

Sure! I'm generating a new review now.


@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 1 issue


## Individual Comments

### Comment 1
<location path="sensor/kubernetes/listener/resource_event_handler.go" line_range="559-584" />
<code_context>
-		case <-doneChannel:
+		initialObjects := informer.GetIndexer().List()
+		doneChannel := handlerImpl.PopulateInitialObjects(initialObjects)
+		waitStarted := time.Now()
+		warnTicker := time.NewTicker(15 * time.Second)
+		defer warnTicker.Stop()
+		for {
+			select {
+			case <-stopSignal.Done():
+				log.Infof("Informer %q: initial object population wait interrupted after %s", name, time.Since(waitStarted).Truncate(time.Millisecond))
+				return
+			case <-doneChannel:
+				duration := time.Since(waitStarted)
+				sensorMetrics.ObserveInformerInitialObjectPopulationDuration(name, duration)
+				log.Debugf("Informer %q: initial object population completed in %s", name, duration.Truncate(time.Millisecond))
+				return
+			case <-warnTicker.C:
+				missingCount, totalCount := handlerImpl.initialSyncDebugState()
+				log.Infof(
</code_context>
<issue_to_address>
**suggestion:** Consider lowering the log level or rate for the periodic "still waiting" messages to avoid log noise in slow or pathological cases.

The `log.Infof("Informer %q: still waiting for initial object population ...")` within the 15s ticker will fire indefinitely for a stuck informer. Across many informers or in consistently slow cases this can generate a lot of Info-level noise. Consider either lowering the level (e.g., `Debugf`), only escalating to `Warnf` after a longer threshold, or adding simple rate limiting (log every Nth tick or only after some minimum wait) to keep the logs useful without being spammy.

```suggestion
		initialObjects := informer.GetIndexer().List()
		doneChannel := handlerImpl.PopulateInitialObjects(initialObjects)
		waitStarted := time.Now()
		warnTicker := time.NewTicker(15 * time.Second)
		defer warnTicker.Stop()

		// Rate-limit and tier logging for long-running initial populations:
		// - Log at debug level, at most once per minute, for the first 5 minutes
		// - After 5 minutes, escalate to warn level (still at most once per minute)
		const (
			initialPopulationLogIntervalTicks = 4                 // 4 * 15s = 60s
			initialPopulationWarnAfter        = 5 * time.Minute   // escalate after this duration
		)
		logTickCount := 0

		for {
			select {
			case <-stopSignal.Done():
				log.Infof("Informer %q: initial object population wait interrupted after %s", name, time.Since(waitStarted).Truncate(time.Millisecond))
				return
			case <-doneChannel:
				duration := time.Since(waitStarted)
				sensorMetrics.ObserveInformerInitialObjectPopulationDuration(name, duration)
				log.Debugf("Informer %q: initial object population completed in %s", name, duration.Truncate(time.Millisecond))
				return
			case <-warnTicker.C:
				logTickCount++
				if logTickCount%initialPopulationLogIntervalTicks != 0 {
					// Skip to limit log volume; we only log every Nth tick.
					continue
				}

				missingCount, totalCount := handlerImpl.initialSyncDebugState()
				elapsed := time.Since(waitStarted)
				elapsedTrunc := elapsed.Truncate(time.Millisecond)

				if elapsed >= initialPopulationWarnAfter {
					log.Warnf(
						"Informer %q: still waiting for initial object population after %s (missing=%d total=%d)",
						name,
						elapsedTrunc,
						missingCount,
						totalCount,
					)
				} else {
					log.Debugf(
						"Informer %q: still waiting for initial object population after %s (missing=%d total=%d)",
						name,
						elapsedTrunc,
						missingCount,
						totalCount,
					)
				}
			}
		}
```
</issue_to_address>



sourcery-ai bot commented Mar 26, 2026

Hey @vikin91, I've posted a new review for you!
