[BUG] Intermittent EOF errors calling S3 ListObjects from GitHub Actions due to Go HTTP keep-alive / HTTP2 interaction #599

@general-kroll-4-life

SUMMARY

StackQL intermittently fails when listing S3 bucket objects only in GitHub Actions, while the same credentials and queries work reliably from a local laptop.

Example failure:

Get "https://s3.ap-southeast-2.amazonaws.com/stackql-trial-bucket-02?max-keys=1000": EOF

This is not an AWS service defect and not a credentials issue.
The root cause is Go's default HTTP client connection reuse behavior (keep-alive / HTTP/2) interacting poorly with the GitHub Actions network environment when using raw HTTP + SigV4 signing (no AWS SDK).

ENVIRONMENT

StackQL using any-sdk HTTP client

AWS SigV4 signing only (no AWS SDK HTTP client usage)

Same AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY locally and in GitHub Actions

Region explicitly provided to SigV4 signer

Fails only in GitHub Actions

Error manifests as immediate EOF while reading response

WHY THIS HAPPENS

TCP connections opened from GitHub Actions runners are reused aggressively by the client, but often do not survive idle periods

NAT, proxying, and idle connection reaping are common

Server-side connection close is legal and frequent

Go net/http default transport enables:

HTTP keep-alives

HTTP/2

Idle connection reuse

A reused but half-dead connection results in:

Connection closed by server

Client receives EOF before response body

AWS S3 is behaving within specification

Early connection close is permitted

This is a client-side transport robustness issue

This does not reproduce locally

Local networks are more tolerant of connection reuse

CI networking is not

Retrying is not a viable solution

StackQL fans out across tens or hundreds of endpoints

Retrying on EOF causes request amplification and latency explosion

This is a transport correctness problem, not a transient API failure

EVIDENCE

Same credentials, same query, same endpoint succeed locally

Failures occur only in GitHub Actions

Disabling HTTP/2 via GODEBUG (for example GODEBUG=http2client=0) reduces failures

Eliminating connection reuse eliminates failures entirely

CORRECT FIX (VENDOR-AGNOSTIC)

Explicitly control the Go HTTP transport used by any-sdk / StackQL and disable connection reuse for AWS endpoints.

Proposed transport configuration:

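transport := &http.Transport{
	// One request per TCP connection; idle connections are never reused.
	DisableKeepAlives: true,
	// ForceAttemptHTTP2 only matters when a custom dialer or TLS config is set,
	// so false on its own does not turn HTTP/2 off.
	ForceAttemptHTTP2: false,
	// A non-nil, empty TLSNextProto map is the documented way to disable HTTP/2.
	// Requires importing crypto/tls.
	TLSNextProto: map[string]func(string, *tls.Conn) http.RoundTripper{},
}

client := &http.Client{
	Transport: transport,
}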
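One way to confirm which protocol a given transport actually negotiates is to inspect `resp.Proto`. The sketch below assumes a local httptest TLS server with HTTP/2 enabled; `negotiatedProto` is a hypothetical helper. Note that the test server needs a custom TLS config to trust its self-signed certificate, and Go already treats HTTP/2 conservatively in that case, so the "enabled" branch sets `ForceAttemptHTTP2: true` to mimic the default behavior against a real HTTPS endpoint.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// negotiatedProto starts a local TLS test server that offers HTTP/2 and
// returns the protocol string of the response. When disableH2 is true the
// client transport opts out of HTTP/2 via an empty TLSNextProto map.
func negotiatedProto(disableH2 bool) (string, error) {
	srv := httptest.NewUnstartedServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "ok")
	}))
	srv.EnableHTTP2 = true
	srv.StartTLS()
	defer srv.Close()

	// Trust the test server's self-signed certificate.
	pool := x509.NewCertPool()
	pool.AddCert(srv.Certificate())

	tr := &http.Transport{
		TLSClientConfig:   &tls.Config{RootCAs: pool},
		ForceAttemptHTTP2: !disableH2,
	}
	if disableH2 {
		tr.TLSNextProto = map[string]func(string, *tls.Conn) http.RoundTripper{}
	}
	defer tr.CloseIdleConnections()

	resp, err := (&http.Client{Transport: tr}).Get(srv.URL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body)
	return resp.Proto, nil
}

func main() {
	for _, disable := range []bool{false, true} {
		proto, err := negotiatedProto(disable)
		fmt.Printf("disableH2=%v proto=%q err=%v\n", disable, proto, err)
	}
}
```

Logging `resp.Proto` on real S3 responses in the failing workflow would be a cheap way to verify that the transport change took effect.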

Effects:

One request per TCP connection

No reuse of half-closed sockets

Deterministic behavior in CI

No AWS SDK dependency

No retries required

SCOPE

Apply to AWS providers (or make configurable per provider)

Does not require AWS environment variables

Preserves StackQL vendor-independence

Acceptable performance tradeoff for control-plane style queries
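Scoping the change to AWS endpoints could be done with a small routing RoundTripper. This is a sketch under stated assumptions: `hostRouter`, `newNoReuseTransport`, and `newRouter` are hypothetical names, matching on the `.amazonaws.com` suffix is only one possible rule (it ignores other AWS partitions), and any-sdk may prefer a per-provider configuration flag instead.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"strings"
)

// hostRouter is a hypothetical RoundTripper that sends requests for hosts
// ending in suffix through matched and everything else through fallback.
type hostRouter struct {
	suffix   string
	matched  http.RoundTripper
	fallback http.RoundTripper
}

// pick returns the transport that would handle the given host name.
func (h *hostRouter) pick(host string) http.RoundTripper {
	if strings.HasSuffix(host, h.suffix) {
		return h.matched
	}
	return h.fallback
}

// RoundTrip implements http.RoundTripper by delegating to the chosen transport.
func (h *hostRouter) RoundTrip(req *http.Request) (*http.Response, error) {
	return h.pick(req.URL.Hostname()).RoundTrip(req)
}

// newNoReuseTransport builds a transport with keep-alives and HTTP/2 disabled.
// Proxy is set explicitly because a zero-value Transport ignores proxy
// environment variables.
func newNoReuseTransport() *http.Transport {
	return &http.Transport{
		Proxy:             http.ProxyFromEnvironment,
		DisableKeepAlives: true,
		TLSNextProto:      map[string]func(string, *tls.Conn) http.RoundTripper{},
	}
}

// newRouter wires AWS hosts to the no-reuse transport and leaves all other
// hosts on Go's default transport.
func newRouter() *hostRouter {
	return &hostRouter{
		suffix:   ".amazonaws.com",
		matched:  newNoReuseTransport(),
		fallback: http.DefaultTransport,
	}
}

func main() {
	r := newRouter()
	client := &http.Client{Transport: r}
	_ = client // the client would be handed to the request layer
	for _, host := range []string{"s3.ap-southeast-2.amazonaws.com", "api.github.com"} {
		fmt.Printf("%s uses no-reuse transport: %v\n", host, r.pick(host) == r.matched)
	}
}
```

Keeping the default transport for other providers limits the extra TLS handshake cost to AWS calls only.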

NON-SOLUTIONS (INTENTIONALLY AVOIDED)

Adding retries on EOF

Introducing AWS SDK HTTP client

Requiring AWS_REGION / AWS_DEFAULT_REGION

CI-specific bash hacks

Treating this as an AWS outage or service bug

ACTION ITEMS

Add explicit HTTP transport ownership in any-sdk

Disable keep-alives and HTTP/2 for AWS providers

Document rationale (CI + NAT behavior)

Add CI regression coverage if possible

Metadata

Labels: bug (Something isn't working)

Status: Done