Skip to content

feat(benchmarking): [WIP] enable tests to run in dedicated environment or in docker#3157

Draft
chatton wants to merge 57 commits intomainfrom
cian/enable-tests-to-run-without-infra
Draft

feat(benchmarking): [WIP] enable tests to run in dedicated environment or in docker#3157
chatton wants to merge 57 commits intomainfrom
cian/enable-tests-to-run-without-infra

Conversation

@chatton
Copy link
Contributor

@chatton chatton commented Mar 12, 2026

Overview

…mark

- Create test/e2e/benchmark/ subpackage with SpamoorSuite (testify/suite)
- Move spamoor smoke test into suite as TestSpamoorSmoke
- Split helpers into focused files: traces.go, output.go, metrics.go
- Introduce resultWriter for defer-based benchmark JSON output
- Export shared symbols from evm_test_common.go for cross-package use
- Restructure CI to fan-out benchmark jobs and fan-in publishing
- Run benchmarks on PRs only when benchmark-related files change
Resolve conflicts keeping the benchmark suite refactoring:
- benchmark.yml: keep path filters and suite-style test command
- evm_spamoor_smoke_test.go: keep deleted (moved to benchmark pkg)
- evm_test_common.go: keep exported types, drop writeTraceBenchmarkJSON
  (now in benchmark/output.go)
go test sets the working directory to the package under test, so the
env var should be relative to test/e2e/benchmark/, not test/e2e/.
go test treats all arguments after an unknown flag (--evm-binary) as
test binary args, so ./benchmark/ was never recognized as a package
pattern.
go test sets the cwd to the package directory (test/e2e/benchmark/),
so the binary path needs an extra parent traversal.
The benchmark package doesn't define the --binary flag that test-e2e
passes. It has its own CI workflow so it doesn't need to run here.
…nfig

collectBlockMetrics hit reth's 20K FilterLogs limit at high tx volumes.
Replace with direct header iteration over [startBlock, endBlock] and add
Phase 1 metrics: non-empty ratio, block interval p50/p99, gas/block and
tx/block p50/p99.

Optimize spamoor configuration for 100ms block time:
- --slot-duration 100ms, --startup-delay 0 on daemon
- throughput=50 per 100ms slot (500 tx/s per spammer)
- max_pending=50000 to avoid 3s block poll backpressure
- 5 staggered spammers with 50K txs each

Results: 55 MGas/s, 1414 TPS, 19.8% non-empty blocks (up from 6%).
- Move startBlock capture after spammer creation to exclude warm-up
- Replace 20s drain sleep with smart poll (waitForDrain)
- Add deleteAllSpammers cleanup to handle stale spamoor DB entries
- Lower trace sample rate to 10% to prevent Jaeger OOM
- make reth tag configurable via EV_RETH_TAG env var (default pr-140)
- fix OTLP config: remove duplicate env vars, use http/protobuf protocol
- use require.Eventually for host readiness polling
- rename requireHTTP to requireHostUp
- use non-fatal logging in resultWriter.flush deferred context
- fix stale doc comment (setupCommonEVMEnv -> SetupCommonEVMEnv)
- rename loop variable to avoid shadowing testing.TB convention
- add block/internal/executing/** to CI path trigger
- remove unused require import from output.go
# Conflicts:
#	scripts/test.mk
# Conflicts:
#	test/e2e/benchmark/suite_test.go
# Conflicts:
#	test/e2e/benchmark/suite_test.go
move EV_RETH_TAG resolution and rpc connection limits into setupEnv
so all benchmark tests share the same reth configuration. lower ERC20
spammer count from 5 to 2 to reduce resource contention on local
hardware while keeping the loop for easy scaling on dedicated infra.
- add blockMetricsSummary with summarize(), log(), and entries() methods
- add evNodeOverhead() for computing ProduceBlock vs ExecuteTxs overhead
- add collectTraces() suite method to deduplicate trace collection pattern
- add addEntries() convenience method on resultWriter
- slim TestERC20Throughput from ~217 to ~119 lines
- reuse collectTraces in TestSpamoorSmoke
Bumps tastora to pick up host network support in the spamoor builder.
Spamoor in external mode now uses host networking so it can resolve
the same hostnames as the host machine.
Record startTime when the provider is created and use it as the lower
bound for trace queries, preventing spans from previous runs being
included in the analysis.
chatton added 25 commits March 4, 2026 12:26
Replace the Jaeger-compatible API with VictoriaTraces native LogsQL
endpoint which streams all results without the 1000 trace limit.
…run-without-infra

# Conflicts:
#	.github/workflows/benchmark.yml
#	test/e2e/benchmark/gasburner_test.go
#	test/e2e/benchmark/helpers.go
#	test/e2e/benchmark/spamoor_erc20_test.go
#	test/e2e/benchmark/suite_test.go
#	test/e2e/benchmark/traces.go
#	test/e2e/go.mod
#	test/e2e/go.sum
…metrics

- Make gas_units_to_burn, max_wallets, num_spammers, throughput, and
  warmup_txs configurable via BENCH_* env vars
- Add rethExecutionRate() for ev-reth GGas/s measurement
- Add engineSpanEntries() for ProduceBlock/GetPayload/NewPayload timing
- Switch local benchmarks from Jaeger to VictoriaTraces
- Add setupExternalEnv for running against pre-deployed infrastructure
- Update tastora to 2ee1b0a (victoriatraces support)
- benchConfig consolidates all BENCH_* env vars into a single struct
  constructed once per test via newBenchConfig(serviceName)
- benchmarkResult collects all output metrics (block summary, overhead,
  GGas/s, engine span timing, seconds_per_gigagas, span averages) and
  produces entries via a single entries() call
- Removes scattered envInt/envOrDefault calls from test files
- Removes manual entry-by-entry result assembly from each test
- Net reduction of ~129 lines across existing files
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 12, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 77b0c502-f8da-4578-8615-7fe10531f436

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch cian/enable-tests-to-run-without-infra
📝 Coding Plan for PR comments
  • Generate coding plan

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Contributor

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed⏩ skipped✅ passed✅ passedMar 12, 2026, 3:13 PM

@claude
Copy link
Contributor

claude bot commented Mar 12, 2026

Claude finished @chatton's task in 3m 15s —— View job


Code Review

  • Gather context and understand the PR
  • Read changed files
  • Analyze code quality, architecture, error handling, security, performance, testing
  • Post comprehensive review

Summary

This PR introduces a dual-mode benchmark harness: tests can run against locally-spun-up Docker infrastructure (the existing path) or against pre-deployed external infrastructure (new, via BENCH_ETH_RPC_URL). It also extracts benchmark config into benchConfig, adds a benchmarkResult type for aggregated metrics, and introduces rich span/flowchart rendering from VictoriaTraces.

Overall the architecture is clean and the goal is well-executed. There are a few actionable issues and several nits worth addressing before merge.


Medium Issues

1. TestERC20Throughput ignores benchConfig env vars — hardcoded constants

test/e2e/benchmark/spamoor_erc20_test.go:19-24

The test creates a benchConfig and calls setupEnv(cfg), but uses its own hardcoded constants for numSpammers, countPerSpammer, waitTimeout, and erc20Config["throughput"]. This means BENCH_NUM_SPAMMERS, BENCH_COUNT_PER_SPAMMER, BENCH_THROUGHPUT, and BENCH_WAIT_TIMEOUT have no effect on this test — inconsistent with TestGasBurner which uses cfg.* throughout. Fix this →

2. waitForMetricTarget doesn't respect context cancellation during sleep

test/e2e/benchmark/helpers.go:454-466

for time.Now().Before(deadline) {
    v, err := poll()
    if err == nil && v >= target {
        return
    }
    time.Sleep(2 * time.Second)  // blocks for 2s ignoring context cancellation
}

If the test context is cancelled while sleeping (e.g., test timeout), this loop won't wake up for up to 2 additional seconds. Use a time.NewTimer in a select to be responsive to cancellation. Fix this →

3. Dead code: truncateID defined but never called

test/e2e/benchmark/flowchart.go:289-294

func truncateID(id string) string { ... }

This function is defined but has no callers. Remove it or use it. Fix this →

4. fetchAllSpans and fetchAllRichSpans duplicate HTTP boilerplate

test/e2e/benchmark/traces.go:151-207 and :209-262

Both functions construct the same URL, create the same request, handle the same status check, and configure identical scanner buffers. The only differences are the JSON struct and the output type. A shared fetchLogStream(ctx, url) (io.ReadCloser, error) or generic helper would eliminate ~40 lines of duplication and a potential future divergence. Fix this →


Minor Issues

5. extractHostName allocates a full map per span — hot path inefficiency

test/e2e/benchmark/traces.go:282-293

func extractHostName(line []byte) string {
    var raw map[string]string  // allocates for every span
    if err := json.Unmarshal(line, &raw); err != nil { ... }

This deserializes the entire JSON line (already parsed once into logsqlRichSpan) just to find one non-standard field. For large trace collections this is wasteful. Consider a bytes.Contains pre-check or map[string]json.RawMessage on the original parse.

6. Fragile hardcoded string comparison in result.go

test/e2e/benchmark/result.go:80

if e.Name == r.prefix+" - ProduceBlock avg" {

This reconstructs a derived string at call-site to compare against engineSpanEntries's output. If the suffix format in engineSpanEntries changes, this silently stops matching. Use the same constant (spanProduceBlock) and a shared format function.

7. time.Sleep(3 * time.Second) after spammer creation is fragile

test/e2e/benchmark/gasburner_test.go:55-56

A fixed 3-second sleep is brittle under CI load. The subsequent assertSpammersRunning call only confirms spammers haven't already failed — a slow environment could still report status > 0 while the spammer is still in its startup phase. Consider polling assertSpammersRunning with retries or relying solely on the warmup tx check (which already waits for a meaningful number of sends).

8. Stale "Jaeger" comment

test/e2e/benchmark/spamoor_smoke_test.go:67

// so that the expected tracing spans appear in Jaeger.

The project has migrated to VictoriaTraces. Fix this →

9. envInt silently swallows parse errors

test/e2e/benchmark/config.go:68-77

If a user sets BENCH_NUM_SPAMMERS=two by mistake, it silently falls back to the default with no warning. This is acceptable but t.Logf isn't available at package init time — at minimum document the silent-fallback behaviour in a comment.

10. WaitTimeout is not configurable via env var

test/e2e/benchmark/config.go:46

All other benchConfig fields are env-var-driven, but WaitTimeout is hardcoded to 10 minutes. In external/dedicated environments with different network characteristics, this may need tuning. Consider envDuration("BENCH_WAIT_TIMEOUT", 10*time.Minute).


Architecture Observations

Dual-mode dispatch is clean. The setupEnv → setupLocalEnv | setupExternalEnv dispatch in suite_test.go:60-65 is minimal and easy to follow. The contract for external mode (requires BENCH_PRIVATE_KEY + BENCH_TRACE_QUERY_URL) is clearly guarded with s.Require().NotEmpty.

traceProvider / richSpanCollector interface split is appropriate. The optional upgrade via type assertion (if rc, ok := e.traces.(richSpanCollector)) is idiomatic Go and keeps the interface minimal.

buildTree multiple-root heuristic. flowchart.go:187-212 picks the longest-duration root when multiple spans have no parent in the set. This is a reasonable heuristic but silently drops orphan spans. A comment explaining this trade-off would help future readers.

SetupTest vs SetupSuite for Docker client. Each test in SpamoorSuite calls tastoradocker.Setup — creating a new Docker client and network per test. If tests are run individually this is fine, but if the full suite runs this creates unnecessary setup overhead. Worth considering SetupSuite + TearDownSuite if the Docker client is safe to share.


@codecov
Copy link

codecov bot commented Mar 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 60.20%. Comparing base (abb6b71) to head (87348de).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3157   +/-   ##
=======================================
  Coverage   60.20%   60.20%           
=======================================
  Files         115      115           
  Lines       11896    11896           
=======================================
  Hits         7162     7162           
  Misses       3923     3923           
  Partials      811      811           
Flag Coverage Δ
combined 60.20% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant