
fix: startup races #3172

Draft — alpe wants to merge 5 commits into main from alex/fix_startup

Conversation

@alpe (Contributor) commented Mar 17, 2026

Overview

E2E HA tests sometimes fail due to a race when a freshly started leader waits for P2P sync to complete.

@coderabbitai (Contributor)

coderabbitai bot commented Mar 17, 2026

Review skipped: draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

Run configuration: defaults · Review profile: CHILL · Plan: Pro · Run ID: 87aa1b97-706f-4b57-934b-7f5f82e8258f

@github-actions (Contributor)

github-actions bot commented Mar 17, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

| Build | Format | Lint | Breaking | Updated (UTC) |
|---|---|---|---|---|
| ✅ passed | ⏩ skipped | ✅ passed | ✅ passed | Mar 18, 2026, 10:05 AM |

@claude (Contributor)

claude bot commented Mar 17, 2026

Claude finished @alpe's task in 2m 44s.


PR Review: fix: startup races

  • Gather context and read changed files
  • Review pkg/sync/sync_service.go changes
  • Review pkg/sync/syncer_status.go and tests
  • Review node/failover.go changes
  • Review execution/evm/execution.go changes
  • Review test changes (e2e, sync_service_test)
  • Compile overall findings

Overview

This PR correctly identifies and addresses a class of startup races in the HA sequencer path. The core fix — replacing the atomic.Bool-based syncer state with a mutex-protected startOnce/stopIfStarted pattern — is sound and well-tested. The new StartForPublishing mode for the raft leader path elegantly avoids the deadlock where a fresh leader was waiting for P2P peers to serve height 1 (which nobody could, since it hadn't produced it yet).


Bugs / Issues

1. Timer goroutine leak in retryWithBackoffOnPayloadStatus (execution/evm/execution.go:133)

```go
select {
case <-ctx.Done():
    return ...
case <-time.After(backoff): // leaks a timer on context cancellation
    backoff *= 2
}
```

time.After allocates a new time.Timer on each call; if ctx is cancelled first, the timer is not released until it fires. Use the time.NewTimer + defer t.Stop() pattern instead.

2. Timer goroutine leak in waitForCatchup (node/failover.go:321)

```go
timeoutCh = time.After(f.catchupTimeout)
```

Same issue: if catch-up completes before the timeout fires, the underlying timer is not reclaimed until it expires. time.NewTimer with defer t.Stop() (or t.Stop() when caughtUp = true) is preferred.


Correctness Observations

3. Dead loop in TestHeaderSyncServiceRestart (pkg/sync/sync_service_test.go:183)

```go
for i := signedHeader.Height() + 1; i < 2; i++ {
```

After the first loop, signedHeader is at height 9, so signedHeader.Height() + 1 = 10. The loop condition 10 < 2 is immediately false, and the body never executes. The intent was probably i <= signedHeader.Height() or a fixed upper bound.
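A tiny runnable demonstration of why the body is dead, using the concrete height the review derives (9 after the first loop):

```go
package main

import "fmt"

func main() {
	height := uint64(9) // signedHeader.Height() after the first loop, per the review
	count := 0
	for i := height + 1; i < 2; i++ { // 10 < 2 is false on entry
		count++
	}
	fmt.Println(count) // 0: the loop body never runs
}
```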

4. CAS + initStore window vs. startSyncer (pkg/sync/sync_service.go:136-149)

After storeInitialized.CompareAndSwap(false, true) returns true but before initStore completes, a concurrent WriteToStoreAndBroadcast call sees storeInitialized=true, skips initStore, and proceeds directly to startSyncer. The syncer is then started against an incompletely-initialized store. In practice this is benign because the syncer tolerates an empty store, but the invariant that "syncer starts only after store is ready" is subtly violated under concurrent writes to block 1. Worth a comment noting the intent.


Design / Architecture Notes

5. startOnce holds mutex for duration of startFn

SyncerStatus.startOnce holds the mutex while calling startFn (i.e. while syncer.Start(ctx) is running, which may involve network I/O). All concurrent callers block waiting for the mutex until start completes. This is correct for the current use pattern, but it means a slow start (e.g. due to a hung context) will block all concurrent calls indefinitely. A brief comment in SyncerStatus about this trade-off would help future readers.
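A minimal sketch of the trade-off being described, assuming a simplified syncerStatus with the field and method names as illustrative stand-ins for the PR's SyncerStatus:

```go
package main

import (
	"fmt"
	"sync"
)

// syncerStatus holds its mutex for the whole startFn call, so concurrent
// callers block until the first start finishes; a hung startFn blocks them all.
type syncerStatus struct {
	mu      sync.Mutex
	started bool
}

func (s *syncerStatus) startOnce(startFn func() error) error {
	s.mu.Lock()
	defer s.mu.Unlock() // held across startFn: slow starts block every caller
	if s.started {
		return nil
	}
	if err := startFn(); err != nil {
		return err
	}
	s.started = true
	return nil
}

func main() {
	var s syncerStatus
	calls := 0 // safe: startFn only ever runs under s.mu
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.startOnce(func() error { calls++; return nil })
		}()
	}
	wg.Wait()
	fmt.Println(calls) // 1: startFn ran exactly once despite 5 concurrent callers
}
```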

6. Leadership may change between shouldStartSyncInPublisherMode and StartForPublishing completing

shouldStartSyncInPublisherMode is evaluated once at the top of Run, before the sync goroutines complete. If raft leadership changes during StartForPublishing, the node proceeds in publisher mode without being the leader. The downstream effect is that WriteToStoreAndBroadcast will be called only by the new leader, so no blocks would be produced by this node — which is safe. But it's a subtle time-of-check-time-of-use. Adding a raftNode.IsLeader() guard inside WriteToStoreAndBroadcast (or a comment on the TOCTOU) would make the design more robust.
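A hedged sketch of the suggested guard, re-checking leadership inside the write path to close the TOCTOU window. The node type, isLeader hook, and errNotLeader are illustrative assumptions, not the repository's actual API:

```go
package main

import (
	"errors"
	"fmt"
)

var errNotLeader = errors.New("not the raft leader")

type node struct{ isLeader func() bool }

// writeToStoreAndBroadcast re-checks leadership at write time, so a node
// that lost leadership after the initial mode check refuses to publish.
func (n *node) writeToStoreAndBroadcast(payload string) error {
	if !n.isLeader() {
		return errNotLeader
	}
	// ... persist and broadcast payload ...
	return nil
}

func main() {
	n := &node{isLeader: func() bool { return false }} // leadership moved on
	fmt.Println(n.writeToStoreAndBroadcast("block 1")) // not the raft leader
}
```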

7. Unreachable final return in retryWithBackoffOnPayloadStatus (execution/evm/execution.go:141)

The final return fmt.Errorf("max retries (%d) exceeded for %s", maxRetries, operation) after the loop can never be reached: when attempt == maxRetries, the check on line 128 returns early. The compiler won't flag this (not every code path returns from inside the loop), but it is dead code that could mislead readers.


Minor / Style

8. stopIfStarted doesn't reset started on stop failure

If stopFn returns an error, started remains true, so a retry of stopIfStarted would call stopFn again, which is the right behavior. However, the test (syncer_status_test.go) doesn't cover the "stop fails, retry" case, leaving this assumption untested.

9. TestHASequencerRollingRestartE2E restart retry loop

The maxRestartAttempts = 10 loop uses IsNodeUp (which polls for advancing block height) with nodeCheckTimeout = 5s per node. In the worst case — all 3 nodes non-responsive — each retry attempt can take 3 * 5s = 15s, making the retry loop up to 150s before failing. This is fine for correctness but worth documenting.

10. execution.go reconcileExecutionAtHeight tx-hash validation

The check !bytes.Equal(execMeta.TxHash, requestedTxHash) correctly guards against resuming execution for a different block. For the case where execMeta.TxHash == nil (stored before this fix, len(txs)==0 path), hashTxs now also returns nil for empty txs, so the equality check stays consistent across old and new metadata.


Summary

The fundamental race fix (startOnce/stopIfStarted + StartForPublishing) is well-designed and properly tested. The primary issues to address before merge are the timer goroutine leaks in retryWithBackoffOnPayloadStatus and waitForCatchup, and the dead test loop in TestHeaderSyncServiceRestart. The remaining items are lower priority.

@codecov

codecov bot commented Mar 17, 2026

Codecov Report

❌ Patch coverage is 54.65116% with 39 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.74%. Comparing base (6e0bd9a) to head (505bce1).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/sync/sync_service.go | 47.36% | 13 Missing and 7 partials ⚠️ |
| node/failover.go | 37.03% | 11 Missing and 6 partials ⚠️ |
| pkg/sync/syncer_status.go | 90.47% | 1 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3172      +/-   ##
==========================================
- Coverage   60.75%   60.74%   -0.01%     
==========================================
  Files         117      117              
  Lines       11988    12047      +59     
==========================================
+ Hits         7283     7318      +35     
- Misses       3886     3903      +17     
- Partials      819      826       +7     
| Flag | Coverage | Δ |
|---|---|---|
| combined | 60.74% <54.65%> | -0.01% ⬇️ |

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.
