chore(scale): add online/offline transition cycle #19611
Conversation
This change is part of a stack managed by git-spice.

Skipping CI for Draft Pull Request.
Hey - I've found 3 issues, and left some high-level feedback:
- In `main`, removing the generic `spyCentral.SetMessageRecording(!localConfig.SkipCentralOutput)` call and moving it into `createFakeCentralService` changes behavior for any non-fake Central connection where `spyCentral` is non-nil; double-check whether this was intentional or if real connections should still honor `SkipCentralOutput`.
- In `validateWorkload`, a negative `centralConnectionCrashCycle` is both clamped to 0 and returned as an error; if negative values are expected to be auto-corrected, consider returning a warning-level signal (or logging) instead of an error to avoid treating this as a hard validation failure.
## Individual Comments
### Comment 1
<location path="tools/local-sensor/main.go" line_range="522" />
<code_context>
return fakeConnectionFactory, centralclient.EmptyCertLoader(), spyCentral
}
+
+func startCentralCrashCycle(ctx context.Context, cycle time.Duration, sensor *commonSensor.Sensor) {
+ if cycle <= 0 || sensor == nil {
+ return
</code_context>
<issue_to_address>
**issue (bug_risk):** Consider using `RequestCentralConnectionRestart` instead of a synthetic state transition for a "connection crash cycle"
The current implementation only toggles component state via `TriggerOnlineOfflineTransition`, while the naming (`startCentralCrashCycle` / `CentralConnectionCrashCycle`) implies exercising actual connection restart/teardown behavior. If you intend to test connection retry under load, this should likely call `RequestCentralConnectionRestart`; if you only need state transitions, consider renaming to better reflect that behavior and avoid confusion for future readers.
</issue_to_address>
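To make the distinction concrete, here is a minimal, self-contained sketch of the cycle loop under discussion. The `transitioner` interface, `startCrashCycle`, `fakeSensor`, and `runCycleDemo` are hypothetical stand-ins for the real Sensor types, and the comment marks where a connection-restart call would replace the state toggle:

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// transitioner is a hypothetical stand-in for the subset of the real
// Sensor API that the cycle needs.
type transitioner interface {
	TriggerOnlineOfflineTransition()
}

// startCrashCycle toggles sensor state every `cycle` until ctx is cancelled.
// A true "connection crash" cycle would instead invoke a connection-restart
// method (e.g. RequestCentralConnectionRestart) at the marked line.
func startCrashCycle(ctx context.Context, cycle time.Duration, s transitioner) {
	if cycle <= 0 || s == nil {
		return
	}
	go func() {
		t := time.NewTicker(cycle)
		defer t.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-t.C:
				s.TriggerOnlineOfflineTransition() // state-only toggle, no teardown
			}
		}
	}()
}

// fakeSensor counts transitions so the cycle can be observed.
type fakeSensor struct{ n atomic.Int64 }

func (f *fakeSensor) TriggerOnlineOfflineTransition() { f.n.Add(1) }

// runCycleDemo drives the cycle briefly and reports how many transitions fired.
func runCycleDemo() int64 {
	f := &fakeSensor{}
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	startCrashCycle(ctx, 5*time.Millisecond, f)
	time.Sleep(60 * time.Millisecond)
	return f.n.Load()
}

func main() {
	fmt.Println("transitions fired:", runCycleDemo())
}
```

Whichever method is called inside the ticker loop, keeping the loop cancellable via the context (as above) avoids leaking the goroutine when local-sensor shuts down.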
### Comment 2
<location path="sensor/kubernetes/fake/fake.go" line_range="254" />
<code_context>
return w.client
}
+// CentralConnectionCrashCycle returns how often local-sensor should crash fake Central.
+func (w *WorkloadManager) CentralConnectionCrashCycle() time.Duration {
+ if w == nil || w.workload == nil {
</code_context>
<issue_to_address>
**issue:** Comment says "crash fake Central" but lifecycle implementation only toggles sensor state
The docstring for this getter implies the fake Central server/connection is being crashed and restarted, but `startCentralCrashCycle` only calls `Sensor.TriggerOnlineOfflineTransition`, which just flips the sensor’s online/offline state. Please either update the comment to describe the current behavior or adjust the implementation so it actually controls fake Central crashes, so the two stay aligned.
</issue_to_address>
### Comment 3
<location path="sensor/kubernetes/fake/fake.go" line_range="318-320" />
<code_context>
}
func validateWorkload(workload *Workload) error {
+ if workload.CentralConnectionCrashCycle < 0 {
+ workload.CentralConnectionCrashCycle = 0
+ return errors.New("negative centralConnectionCrashCycle, clamped to 0")
+ }
if workload.NetworkWorkload.OpenPortReuseProbability < 0.0 || workload.NetworkWorkload.OpenPortReuseProbability > 1.0 {
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Reconsider returning an error when the crash cycle is negative but successfully clamped
Here we correct a negative value to 0, but still return an error. That means callers may see failures or noisy logs even though the input has been made safe. If this is intended as a recoverable misconfiguration (like the open-port reuse probability case), consider either logging a warning and returning nil, or returning a non-fatal/wrapped error that callers can deliberately ignore. Clarifying whether this should be fatal vs advisory will make `validateWorkload`’s behavior more predictable for callers.
Suggested implementation:
```golang
func validateWorkload(workload *Workload) error {
if workload.CentralConnectionCrashCycle < 0 {
// Clamp to 0 but treat as recoverable misconfiguration: log advisory, do not fail validation.
workload.CentralConnectionCrashCycle = 0
log.Warnf("negative centralConnectionCrashCycle in workload; clamped to 0")
}
if workload.NetworkWorkload.OpenPortReuseProbability < 0.0 || workload.NetworkWorkload.OpenPortReuseProbability > 1.0 {
```
1. Ensure the logger used here (`log.Warnf`) matches the logging package already used in this file/package (e.g. `logrus`, `logging.LoggerForModule`, or Go's stdlib `log`). If this file does not yet have a logger, import and initialize it according to the project's conventions.
2. If `errors` was only used for this now-removed return value, remove the `errors` import to avoid an unused import error.
</issue_to_address>
Images are ready for the commit at 13ed82b. To use with deploy scripts, first …
Codecov Report: ❌ Patch coverage is …

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master   #19611      +/-   ##
==========================================
+ Coverage   49.25%   49.31%   +0.06%
==========================================
  Files        2735     2737       +2
  Lines      206138   206467     +329
==========================================
+ Hits       101539   101825     +286
- Misses      97051    97096      +45
+ Partials     7548     7546       -2
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Force-pushed e7e5ac8 to 13ed82b.
Description
Add a `centralConnectionCrashCycle` duration field to fake workload profiles that periodically triggers synthetic Sensor offline->online state transitions. This exercises the same state-handling paths that real Central connection drops would, without requiring actual network disruptions.
This was a required change to reproduce a real customer issue: ROX-33474.
Key changes:
- `createFakeCentralService` extracted from `setupCentralWithFakeConnection` for clarity and reuse.
- `TriggerOnlineOfflineTransition` on Sensor to drive state changes from outside the gRPC stream path.
- `RequestCentralConnectionRestart` method.
- `high-node-count.yaml` workload profile that uses the feature.

User-facing documentation
Testing and quality
Automated testing
How I validated my change
Scale test tooling change only. Ran local-sensor with the `high-node-count.yaml` profile and observed periodic offline/online state transitions in logs.