perf(sensor): add informer population timing metrics#19614
Conversation
Add instrumentation to track how long initial object population takes after informer cache sync completes. This is the handoff/processing phase that is separate from (and comes after) informer sync time.

Changes:
- New Prometheus histogram `informer_initial_object_population_duration_seconds`, recorded per informer after PopulateInitialObjects finishes.
- Periodic 15-second warnings during long initial-population waits, reporting remaining/total object counts for stuck-informer diagnosis.
- Log messages on interruption and completion of initial population.
- Track `initialSyncTotalIDs` and expose `initialSyncDebugState()` for the warning ticker.
- Add duration logging for enrichAndSend in the networkflow manager to identify slow enrichment ticks.

Partially generated by AI. Made-with: Cursor
This change is part of the following stack. Change managed by git-spice.
Skipping CI for Draft Pull Request.
Hey - I've found 1 issue, and left some high level feedback:
- The new `resourceEventHandlerImpl.name` field is never read; consider either wiring it into log/metric output for per-handler attribution or removing it to avoid confusion about its purpose.
- The new `networkFlowManager.enrichConnections` logging will emit a log line on every tick (debug or warn), which could be very noisy in busy clusters; consider gating these behind a higher log level, rate-limiting, or only logging when the duration exceeds a more conservative threshold.
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location path="sensor/common/networkflow/manager/manager_impl.go" line_range="320-324" />
<code_context>
case <-m.stopper.Flow().StopRequested():
return
case <-tickerC:
+ enrichStart := time.Now()
m.enrichAndSend()
+ enrichDuration := time.Since(enrichStart)
+ if enrichDuration < time.Second {
+ log.Debugf("networkflow tick: finished enrichAndSend in %s; calling clusterentities RecordTick", enrichDuration)
+ } else {
</code_context>
<issue_to_address>
**suggestion:** Warn-level logging on every slow tick may be too noisy for expected bursty behavior
This logs at Warnf for any tick taking ≥1s, even though the message notes this "is not necessarily an error" and can occur during normal bursty updates. In a busy cluster this could generate excessive, low‑value warnings that obscure real issues. Consider raising the threshold, lowering the log level (Info/Debug), or adding rate limiting/backoff to this warning.
</issue_to_address>
Images are ready for the commit at 22e7e58.
Codecov Report: ❌ Patch coverage is
Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master   #19614      +/-   ##
==========================================
+ Coverage   49.25%   49.27%    +0.01%
==========================================
  Files        2735     2735
  Lines      206138   206246      +108
==========================================
+ Hits       101539   101627       +88
- Misses      97051    97076       +25
+ Partials     7548     7543        -5
```
Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
Hey - I've found 1 issue, and left some high level feedback:
- The `informer_initial_object_population_duration_seconds` histogram currently uses `prometheus.DefBuckets`, which top out at 10s; given your example workloads in the tens of seconds, consider defining custom buckets that extend into the minutes range so the metric remains informative rather than accumulating mostly in the `+Inf` bucket.
- In the `warnTicker` loop, you might want to short-circuit logging when `missingCount == 0` (or otherwise detect completion) to avoid periodic "missing=0" log lines if `initialSyncDebugState` and `doneChannel` become momentarily out of sync under race conditions.
- The periodic 15s log uses `Infof` but the description calls them "warnings"; if these are intended as slow-path alerts, consider using `Warnf` or a distinct log key/field so operators can more easily distinguish them from normal informational logs.
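For the bucket concern, one option is exponential buckets spanning sub-second startups through multi-minute tails. The helper below reimplements the semantics of `prometheus.ExponentialBuckets(start, factor, count)` from client_golang as a dependency-free sketch, so the chosen boundaries can be inspected; the concrete start/factor/count values are illustrative, not a recommendation from the PR itself:

```go
package main

import "fmt"

// exponentialBuckets mimics prometheus.ExponentialBuckets(start, factor, count):
// count buckets, the first at start, each subsequent one multiplied by factor.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}

func main() {
	// 0.25s .. 512s (~8.5 min): covers both fast startups and pathological tails,
	// unlike prometheus.DefBuckets which stop at 10s.
	fmt.Println(exponentialBuckets(0.25, 2, 12))
}
```

Twelve buckets keep per-informer series cheap in the TSDB while still resolving the difference between a 30s and a 5-minute population phase.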
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location path="sensor/common/metrics/metrics.go" line_range="354-363" />
<code_context>
+ // phase (after cache sync -> PopulateInitialObjects completion).
+ // - Therefore, informerInitialObjectPopulationDurationSeconds does NOT include any time already
+ // counted in informerSyncDurationMs; the two metrics represent consecutive, non-overlapping phases.
+ informerInitialObjectPopulationDurationSeconds = prometheus.NewHistogramVec(prometheus.HistogramOpts{
+ Namespace: metrics.PrometheusNamespace,
+ Subsystem: metrics.SensorSubsystem.String(),
+ Name: "informer_initial_object_population_duration_seconds",
+ Help: "Time in seconds spent processing and dispatching the initial object snapshot after informer cache sync. " +
+ "Measured per informer from PopulateInitialObjects start until completion. " +
+ "High values indicate slow initial object processing, which can delay full listener readiness.",
+ Buckets: prometheus.DefBuckets,
+ }, []string{"informer"})
)
</code_context>
<issue_to_address>
**suggestion (performance):** Consider tuning histogram buckets for initial object population durations instead of using `prometheus.DefBuckets`.
The default ~5ms–10s buckets may not reflect the expected range for initial population (likely hundreds of ms to tens of seconds, varying with cluster size). Consider defining custom buckets (e.g., extending to 60s/5m or using coarser steps) to improve observability and avoid unnecessary TSDB pressure. If real-world timings are unknown, you could at least add a TODO to revisit this after observing metrics in production.
```suggestion
informerInitialObjectPopulationDurationSeconds = prometheus.NewHistogramVec(prometheus.HistogramOpts{
Namespace: metrics.PrometheusNamespace,
Subsystem: metrics.SensorSubsystem.String(),
Name: "informer_initial_object_population_duration_seconds",
Help: "Time in seconds spent processing and dispatching the initial object snapshot after informer cache sync. " +
"Measured per informer from PopulateInitialObjects start until completion. " +
"High values indicate slow initial object processing, which can delay full listener readiness.",
// NOTE: Initial population duration is typically on the order of hundreds of ms to tens of seconds,
// and can grow with cluster size. Use coarser, extended buckets to capture long tails without
// excessive TSDB cardinality.
// TODO(ROX-XXXX): Revisit bucket boundaries after observing this metric in production.
Buckets: []float64{
0.1, 0.25, 0.5, // sub-second startup
1, 2.5, 5, 10, // common expected range
30, 60, 120, 300, // slow / pathological cases up to 5 minutes
},
}, []string{"informer"})
)
```
</issue_to_address>
@sourcery-ai recheck
Sure! I'm generating a new review now.
Hey - I've found 1 issue
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location path="sensor/kubernetes/listener/resource_event_handler.go" line_range="559-584" />
<code_context>
- case <-doneChannel:
+ initialObjects := informer.GetIndexer().List()
+ doneChannel := handlerImpl.PopulateInitialObjects(initialObjects)
+ waitStarted := time.Now()
+ warnTicker := time.NewTicker(15 * time.Second)
+ defer warnTicker.Stop()
+ for {
+ select {
+ case <-stopSignal.Done():
+ log.Infof("Informer %q: initial object population wait interrupted after %s", name, time.Since(waitStarted).Truncate(time.Millisecond))
+ return
+ case <-doneChannel:
+ duration := time.Since(waitStarted)
+ sensorMetrics.ObserveInformerInitialObjectPopulationDuration(name, duration)
+ log.Debugf("Informer %q: initial object population completed in %s", name, duration.Truncate(time.Millisecond))
+ return
+ case <-warnTicker.C:
+ missingCount, totalCount := handlerImpl.initialSyncDebugState()
+ log.Infof(
</code_context>
<issue_to_address>
**suggestion:** Consider lowering the log level or rate for the periodic "still waiting" messages to avoid log noise in slow or pathological cases.
The `log.Infof("Informer %q: still waiting for initial object population ...")` within the 15s ticker will fire indefinitely for a stuck informer. Across many informers or in consistently slow cases this can generate a lot of Info-level noise. Consider either lowering the level (e.g., `Debugf`), only escalating to `Warnf` after a longer threshold, or adding simple rate limiting (log every Nth tick or only after some minimum wait) to keep the logs useful without being spammy.
```suggestion
initialObjects := informer.GetIndexer().List()
doneChannel := handlerImpl.PopulateInitialObjects(initialObjects)
waitStarted := time.Now()
warnTicker := time.NewTicker(15 * time.Second)
defer warnTicker.Stop()
// Rate-limit and tier logging for long-running initial populations:
// - Log at debug level, at most once per minute, for the first 5 minutes
// - After 5 minutes, escalate to warn level (still at most once per minute)
const (
initialPopulationLogIntervalTicks = 4 // 4 * 15s = 60s
initialPopulationWarnAfter = 5 * time.Minute // escalate after this duration
)
logTickCount := 0
for {
select {
case <-stopSignal.Done():
log.Infof("Informer %q: initial object population wait interrupted after %s", name, time.Since(waitStarted).Truncate(time.Millisecond))
return
case <-doneChannel:
duration := time.Since(waitStarted)
sensorMetrics.ObserveInformerInitialObjectPopulationDuration(name, duration)
log.Debugf("Informer %q: initial object population completed in %s", name, duration.Truncate(time.Millisecond))
return
case <-warnTicker.C:
logTickCount++
if logTickCount%initialPopulationLogIntervalTicks != 0 {
// Skip to limit log volume; we only log every Nth tick.
continue
}
missingCount, totalCount := handlerImpl.initialSyncDebugState()
elapsed := time.Since(waitStarted)
elapsedTrunc := elapsed.Truncate(time.Millisecond)
if elapsed >= initialPopulationWarnAfter {
log.Warnf(
"Informer %q: still waiting for initial object population after %s (missing=%d total=%d)",
name,
elapsedTrunc,
missingCount,
totalCount,
)
} else {
log.Debugf(
"Informer %q: still waiting for initial object population after %s (missing=%d total=%d)",
name,
elapsedTrunc,
missingCount,
totalCount,
)
}
}
}
```
</issue_to_address>
Hey @vikin91, I've posted a new review for you! |
Description
Add instrumentation to track how long initial object population takes after
informer cache sync completes. This is the handoff/processing phase that
follows informer sync time and is a separate, non-overlapping metric.
Despite the sync appearing complete, the objects that have now fully arrived at Sensor may still not be fully processed.
This PR adds periodic logs if the first stage of processing (after informer sync) takes longer than expected.
This may highlight problems with Sensor performance and hint the incident handler in the right direction.
Example:
(triggered by using fake workload with 5k deployments, each with 4 pods and 2 containers).
Key changes:
- New Prometheus histogram `informer_initial_object_population_duration_seconds`, recorded per informer after PopulateInitialObjects finishes.
- Periodic 15-second warnings during long initial-population waits, reporting remaining/total object counts for stuck-informer diagnosis.
- Track `initialSyncTotalIDs` and expose `initialSyncDebugState()` for diagnostics.
User-facing documentation
Testing and quality
Automated testing
How I validated my change
Ran local-sensor with large fake workloads and observed:
Looking at the metrics:
Difference from the metric introduced for the informer tracker:
K8s passed the objects to Sensor quickly (6ms, which is suspiciously fast in this case), but the initial processing of them took 49s.