
chore(scale): add online/offline transition cycle #19611

Draft
vikin91 wants to merge 1 commit into master from piotr/fake-workload-offline

Conversation

Contributor

@vikin91 vikin91 commented Mar 25, 2026

Description

Add a centralConnectionCrashCycle duration field to fake workload profiles
that periodically triggers synthetic Sensor offline->online state transitions.
This exercises the same state-handling paths that real Central connection drops
would, without requiring actual network disruptions.

This change was required to reproduce a real customer issue: ROX-33474.

Key changes:

  • Extract createFakeCentralService from setupCentralWithFakeConnection
    for clarity and reuse.
  • Add TriggerOnlineOfflineTransition on Sensor to drive state changes
    from outside the gRPC stream path.
  • Refactor internal-message restart handler into the new public
    RequestCentralConnectionRestart method.
  • Wire a cycle timer into the local-sensor main loop.
  • Include high-node-count.yaml workload profile that uses the feature.

User-facing documentation

Testing and quality

  • the change is production ready: the change is GA, or otherwise the functionality is gated by a feature flag
  • CI results are inspected

Automated testing

  • added unit tests
  • added e2e tests
  • added regression tests
  • added compatibility tests
  • modified existing tests

How I validated my change

Scale test tooling change only. Ran local-sensor with high-node-count.yaml
profile and observed periodic offline/online state transitions in logs.

Contributor Author

vikin91 commented Mar 25, 2026

This change is part of the following stack:

Change managed by git-spice.


openshift-ci bot commented Mar 25, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Contributor

@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 3 issues and left some high-level feedback:

  • In main, removing the generic spyCentral.SetMessageRecording(!localConfig.SkipCentralOutput) and moving that call into createFakeCentralService changes behavior for any non-fake Central connection where spyCentral is non-nil; double-check whether this was intentional or if real connections should still honor SkipCentralOutput.
  • In validateWorkload, a negative centralConnectionCrashCycle is both clamped to 0 and returned as an error; if negative values are expected to be auto-corrected, consider returning a warning-level signal (or logging) instead of an error to avoid treating this as a hard validation failure.
## Individual Comments

### Comment 1
<location path="tools/local-sensor/main.go" line_range="522" />
<code_context>
 	return fakeConnectionFactory, centralclient.EmptyCertLoader(), spyCentral
 }
+
+func startCentralCrashCycle(ctx context.Context, cycle time.Duration, sensor *commonSensor.Sensor) {
+	if cycle <= 0 || sensor == nil {
+		return
</code_context>
<issue_to_address>
**issue (bug_risk):** Consider using `RequestCentralConnectionRestart` instead of a synthetic state transition for a "connection crash cycle"

The current implementation only toggles component state via `TriggerOnlineOfflineTransition`, while the naming (`startCentralCrashCycle` / `CentralConnectionCrashCycle`) implies exercising actual connection restart/teardown behavior. If you intend to test connection retry under load, this should likely call `RequestCentralConnectionRestart`; if you only need state transitions, consider renaming to better reflect that behavior and avoid confusion for future readers.
</issue_to_address>

### Comment 2
<location path="sensor/kubernetes/fake/fake.go" line_range="254" />
<code_context>
 	return w.client
 }

+// CentralConnectionCrashCycle returns how often local-sensor should crash fake Central.
+func (w *WorkloadManager) CentralConnectionCrashCycle() time.Duration {
+	if w == nil || w.workload == nil {
</code_context>
<issue_to_address>
**issue:** Comment says "crash fake Central" but lifecycle implementation only toggles sensor state

The docstring for this getter implies the fake Central server/connection is being crashed and restarted, but `startCentralCrashCycle` only calls `Sensor.TriggerOnlineOfflineTransition`, which just flips the sensor’s online/offline state. Please either update the comment to describe the current behavior or adjust the implementation so it actually controls fake Central crashes, so the two stay aligned.
</issue_to_address>

### Comment 3
<location path="sensor/kubernetes/fake/fake.go" line_range="318-320" />
<code_context>
 }

 func validateWorkload(workload *Workload) error {
+	if workload.CentralConnectionCrashCycle < 0 {
+		workload.CentralConnectionCrashCycle = 0
+		return errors.New("negative centralConnectionCrashCycle, clamped to 0")
+	}
 	if workload.NetworkWorkload.OpenPortReuseProbability < 0.0 || workload.NetworkWorkload.OpenPortReuseProbability > 1.0 {
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Reconsider returning an error when the crash cycle is negative but successfully clamped

Here we correct a negative value to 0, but still return an error. That means callers may see failures or noisy logs even though the input has been made safe. If this is intended as a recoverable misconfiguration (like the open-port reuse probability case), consider either logging a warning and returning nil, or returning a non-fatal/wrapped error that callers can deliberately ignore. Clarifying whether this should be fatal vs advisory will make `validateWorkload`’s behavior more predictable for callers.

Suggested implementation:

```golang
func validateWorkload(workload *Workload) error {
	if workload.CentralConnectionCrashCycle < 0 {
		// Clamp to 0 but treat as recoverable misconfiguration: log advisory, do not fail validation.
		workload.CentralConnectionCrashCycle = 0
		log.Warnf("negative centralConnectionCrashCycle in workload; clamped to 0")
	}
	if workload.NetworkWorkload.OpenPortReuseProbability < 0.0 || workload.NetworkWorkload.OpenPortReuseProbability > 1.0 {

```

1. Ensure the logger used here (`log.Warnf`) matches the logging package already used in this file/package (e.g. `logrus`, `logging.LoggerForModule`, or Go's stdlib `log`). If this file does not yet have a logger, import and initialize it according to the project's conventions.
2. If `errors` was only used for this now-removed return value, remove the `errors` import to avoid an unused import error.
</issue_to_address>



Contributor

rhacs-bot commented Mar 25, 2026

Images are ready for the commit at 13ed82b.

To use with deploy scripts, first export MAIN_IMAGE_TAG=4.11.x-430-g13ed82bbd3.


codecov bot commented Mar 25, 2026

Codecov Report

❌ Patch coverage is 0% with 22 lines in your changes missing coverage. Please review.
✅ Project coverage is 49.31%. Comparing base (123b457) to head (13ed82b).
⚠️ Report is 25 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| sensor/common/sensor/sensor.go | 0.00% | 13 Missing ⚠️ |
| sensor/kubernetes/fake/fake.go | 0.00% | 9 Missing ⚠️ |
Additional details and impacted files

```
@@            Coverage Diff             @@
##           master   #19611      +/-   ##
==========================================
+ Coverage   49.25%   49.31%   +0.06%
==========================================
  Files        2735     2737       +2
  Lines      206138   206467     +329
==========================================
+ Hits       101539   101825     +286
- Misses      97051    97096      +45
+ Partials     7548     7546       -2
```
| Flag | Coverage Δ |
| --- | --- |
| go-unit-tests | 49.31% <0.00%> (+0.06%) ⬆️ |

Flags with carried forward coverage won't be shown.


@vikin91 vikin91 force-pushed the piotr/fake-workload-offline branch from e7e5ac8 to 13ed82b on March 26, 2026 at 16:17