
fix: startup races #3172

Draft — alpe wants to merge 5 commits into main from alex/fix_startup

Conversation

@alpe (Contributor) commented Mar 17, 2026

Overview

E2E HA tests sometimes fail due to a race when a freshly started leader waits for P2P sync to complete.

@coderabbitai (Contributor)

coderabbitai bot commented Mar 17, 2026

Review skipped: draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

Run configuration: defaults · Review profile: CHILL · Plan: Pro · Run ID: 87aa1b97-706f-4b57-934b-7f5f82e8258f

@github-actions (Contributor)

github-actions bot commented Mar 17, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

| Build | Format | Lint | Breaking | Updated (UTC) |
|---|---|---|---|---|
| ✅ passed | ⏩ skipped | ✅ passed | ✅ passed | Mar 18, 2026, 10:05 AM |

@claude (Contributor)

claude bot commented Mar 17, 2026

Claude finished @alpe's task in 2m 44s.


PR Review: fix: startup races

  • Gather context and read changed files
  • Review pkg/sync/sync_service.go changes
  • Review pkg/sync/syncer_status.go and tests
  • Review node/failover.go changes
  • Review execution/evm/execution.go changes
  • Review test changes (e2e, sync_service_test)
  • Compile overall findings

Overview

This PR correctly identifies and addresses a class of startup races in the HA sequencer path. The core fix — replacing the atomic.Bool-based syncer state with a mutex-protected startOnce/stopIfStarted pattern — is sound and well-tested. The new StartForPublishing mode for the raft leader path elegantly avoids the deadlock where a fresh leader was waiting for P2P peers to serve height 1 (which nobody could, since it hadn't produced it yet).


Bugs / Issues

1. Timer goroutine leak in retryWithBackoffOnPayloadStatus (execution/evm/execution.go:133)

```go
select {
case <-ctx.Done():
    return ...
case <-time.After(backoff): // leaks a timer on context cancellation
    backoff *= 2
}
```

time.After allocates a new time.Timer on each call; if ctx is cancelled first, the timer is not released until it fires. Use the time.NewTimer + defer t.Stop() pattern instead.

2. Timer goroutine leak in waitForCatchup (node/failover.go:321)

```go
timeoutCh = time.After(f.catchupTimeout)
```

Same issue: if catch-up completes before the timeout fires, the underlying timer is not reclaimed until it expires. time.NewTimer with defer t.Stop() (or t.Stop() when caughtUp = true) is preferred.


Correctness Observations

3. Dead loop in TestHeaderSyncServiceRestart (pkg/sync/sync_service_test.go:183)

```go
for i := signedHeader.Height() + 1; i < 2; i++ {
```

After the first loop, signedHeader is at height 9, so signedHeader.Height() + 1 = 10. The loop condition 10 < 2 is immediately false, and the body never executes. The intent was probably i <= signedHeader.Height() or a fixed upper bound.
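A tiny runnable demonstration of why the body is dead, using the concrete height the review derives (9 after the first loop):

```go
package main

import "fmt"

func main() {
	height := uint64(9) // signedHeader.Height() after the first loop, per the review
	count := 0
	for i := height + 1; i < 2; i++ { // 10 < 2 is false on entry
		count++
	}
	fmt.Println(count) // 0: the loop body never runs
}
```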

4. CAS + initStore window vs. startSyncer (pkg/sync/sync_service.go:136-149)

After storeInitialized.CompareAndSwap(false, true) returns true but before initStore completes, a concurrent WriteToStoreAndBroadcast call sees storeInitialized=true, skips initStore, and proceeds directly to startSyncer. The syncer is then started against an incompletely-initialized store. In practice this is benign because the syncer tolerates an empty store, but the invariant that "syncer starts only after store is ready" is subtly violated under concurrent writes to block 1. Worth a comment noting the intent.


Design / Architecture Notes

5. startOnce holds mutex for duration of startFn

SyncerStatus.startOnce holds the mutex while calling startFn (i.e. while syncer.Start(ctx) is running, which may involve network I/O). All concurrent callers block waiting for the mutex until start completes. This is correct for the current use pattern, but it means a slow start (e.g. due to a hung context) will block all concurrent calls indefinitely. A brief comment in SyncerStatus about this trade-off would help future readers.
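A minimal sketch of the trade-off being described, assuming a simplified syncerStatus with the field and method names as illustrative stand-ins for the PR's SyncerStatus:

```go
package main

import (
	"fmt"
	"sync"
)

// syncerStatus holds its mutex for the whole startFn call, so concurrent
// callers block until the first start finishes; a hung startFn blocks them all.
type syncerStatus struct {
	mu      sync.Mutex
	started bool
}

func (s *syncerStatus) startOnce(startFn func() error) error {
	s.mu.Lock()
	defer s.mu.Unlock() // held across startFn: slow starts block every caller
	if s.started {
		return nil
	}
	if err := startFn(); err != nil {
		return err
	}
	s.started = true
	return nil
}

func main() {
	var s syncerStatus
	calls := 0 // safe: startFn only ever runs under s.mu
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.startOnce(func() error { calls++; return nil })
		}()
	}
	wg.Wait()
	fmt.Println(calls) // 1: startFn ran exactly once despite 5 concurrent callers
}
```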

6. Leadership may change between shouldStartSyncInPublisherMode and StartForPublishing completing

shouldStartSyncInPublisherMode is evaluated once at the top of Run, before the sync goroutines complete. If raft leadership changes during StartForPublishing, the node proceeds in publisher mode without being the leader. The downstream effect is that WriteToStoreAndBroadcast will be called only by the new leader, so no blocks would be produced by this node — which is safe. But it's a subtle time-of-check-time-of-use. Adding a raftNode.IsLeader() guard inside WriteToStoreAndBroadcast (or a comment on the TOCTOU) would make the design more robust.
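A hedged sketch of the suggested guard, re-checking leadership inside the write path to close the TOCTOU window. The node type, isLeader hook, and errNotLeader are illustrative assumptions, not the repository's actual API:

```go
package main

import (
	"errors"
	"fmt"
)

var errNotLeader = errors.New("not the raft leader")

type node struct{ isLeader func() bool }

// writeToStoreAndBroadcast re-checks leadership at write time, so a node
// that lost leadership after the initial mode check refuses to publish.
func (n *node) writeToStoreAndBroadcast(payload string) error {
	if !n.isLeader() {
		return errNotLeader
	}
	// ... persist and broadcast payload ...
	return nil
}

func main() {
	n := &node{isLeader: func() bool { return false }} // leadership moved on
	fmt.Println(n.writeToStoreAndBroadcast("block 1")) // not the raft leader
}
```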

7. Unreachable final return in retryWithBackoffOnPayloadStatus (execution/evm/execution.go:141)

The final return fmt.Errorf("max retries (%d) exceeded for %s", maxRetries, operation) after the loop can never be reached: when attempt == maxRetries, the check on line 128 returns early. The compiler won't flag this (not every code path returns from inside the loop), but it is dead code that could mislead readers.


Minor / Style

8. stopIfStarted doesn't reset started on stop failure

If stopFn returns an error, started remains true, so a retry of stopIfStarted would call stopFn again, which is the right behavior. However, the test (syncer_status_test.go) doesn't cover the "stop fails, retry" case, leaving this assumption untested.

9. TestHASequencerRollingRestartE2E restart retry loop

The maxRestartAttempts = 10 loop uses IsNodeUp (which polls for advancing block height) with nodeCheckTimeout = 5s per node. In the worst case — all 3 nodes non-responsive — each retry attempt can take 3 * 5s = 15s, making the retry loop up to 150s before failing. This is fine for correctness but worth documenting.

10. execution.go reconcileExecutionAtHeight tx-hash validation

The check !bytes.Equal(execMeta.TxHash, requestedTxHash) correctly guards against resuming execution for a different block. For the case where execMeta.TxHash == nil (stored before this fix, len(txs)==0 path), hashTxs now also returns nil for empty txs, so the equality check stays consistent across old and new metadata.


Summary

The fundamental race fix (startOnce/stopIfStarted + StartForPublishing) is well-designed and properly tested. The primary issues to address before merge are the timer goroutine leaks in retryWithBackoffOnPayloadStatus and waitForCatchup, and the dead test loop in TestHeaderSyncServiceRestart. The remaining items are lower priority.

@codecov

codecov bot commented Mar 17, 2026

Codecov Report

❌ Patch coverage is 54.65116% with 39 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.74%. Comparing base (6e0bd9a) to head (505bce1).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/sync/sync_service.go | 47.36% | 13 Missing and 7 partials ⚠️ |
| node/failover.go | 37.03% | 11 Missing and 6 partials ⚠️ |
| pkg/sync/syncer_status.go | 90.47% | 1 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3172      +/-   ##
==========================================
- Coverage   60.75%   60.74%   -0.01%     
==========================================
  Files         117      117              
  Lines       11988    12047      +59     
==========================================
+ Hits         7283     7318      +35     
- Misses       3886     3903      +17     
- Partials      819      826       +7     
| Flag | Coverage | Δ |
|---|---|---|
| combined | 60.74% <54.65%> | -0.01% ⬇️ |

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.
