
ROX-33865: Scanner V4 retry initial DB connection#19761

Merged
dcaravel merged 1 commit into master from
dc/v4-restart
Apr 10, 2026

@dcaravel dcaravel commented Apr 2, 2026

Description

Fixes a crash during Scanner V4 Indexer and Matcher startup.

The upgrade to pgx5 in claircore changed postgres.Connect() so that it no longer establishes a DB connection. As a result, the Scanner V4 indexer and matcher crash at a point in the code where a DB connection is expected to be available (see the example below).

This was impacting ROX-27690 (enabling Scanner V4 in CI): jobs were failing due to 'unexpected pod restarts'.

Example

No certificates found in /usr/local/share/ca-certificates
'/etc/pki/injected-ca-trust/tls-ca-bundle.pem' -> '/etc/pki/ca-trust/source/anchors/tls-ca-bundle.pem'
pkg/memlimit: 2026/03/25 13:13:59.887533 memlimit.go:43: Info: ROX_MEMLIMIT set to 1.86Gi
2026/03/25 13:13:59 config_file "/run/secrets/stackrox.io/proxy-config/config.yaml" does not exist, continuing...
{"level":"info","host":"scanner-v4-matcher-54bcbf4564-hhcc6","component":"main","version":"4.11.x-443-g8494726004","build_flavor":"development","time":"2026-03-25T13:13:59Z","message":"starting scanner"}
pkg/metrics: 2026/03/25 13:14:00.009988 tls.go:176: Info: Updating secure metrics client CAs based on kube-system/extension-apiserver-authentication
{"level":"info","host":"scanner-v4-matcher-54bcbf4564-hhcc6","component":"main","time":"2026-03-25T13:14:00Z","message":"indexer is disabled"}
{"level":"info","host":"scanner-v4-matcher-54bcbf4564-hhcc6","component":"main","time":"2026-03-25T13:14:00Z","message":"matcher is enabled"}
{"level":"info","host":"scanner-v4-matcher-54bcbf4564-hhcc6","component":"main","time":"2026-03-25T13:14:00Z","message":"remote indexer is enabled"}

panic: migrate: failed to connect to `user=postgres database=`: 172.30.93.203:5432 (scanner-v4-db.stackrox.svc): dial error: dial tcp 172.30.93.203:5432: connect: connection refused
goroutine 1 [running]:
github.com/remind101/migrate.(*postgresLocker).do(0xc000134930, {0x47ee513?, 0xc00011a45c?})
	github.com/remind101/migrate@v0.0.0-20170729031349-52c1edff7319/migrate.go:116 +0x10c
github.com/remind101/migrate.(*postgresLocker).Lock(0x200c000089358?)
	github.com/remind101/migrate@v0.0.0-20170729031349-52c1edff7319/migrate.go:105 +0x1f
github.com/remind101/migrate.(*Migrator).Exec(0xc000bea2a0, 0x0, {0x79364e0, 0x10, 0x10})
	github.com/remind101/migrate@v0.0.0-20170729031349-52c1edff7319/migrate.go:140 +0x65
github.com/quay/claircore/datastore/postgres.InitPostgresMatcherStore({0xc000696c30?, 0xc0000896d0?}, 0xc0008ac8c0, 0x1)
	github.com/quay/claircore@v1.5.44/datastore/postgres/matcher_store.go:28 +0x153
github.com/stackrox/rox/scanner/datastore/postgres.InitPostgresMatcherStore({0x4ed1ec8?, 0xc0006aec30?}, 0xc0008ac8c0, 0xd9?)
	github.com/stackrox/rox/scanner/datastore/postgres/matcher_store.go:34 +0x2f
github.com/stackrox/rox/scanner/matcher.NewMatcher({0x4ed1ec8?, 0xc00084f500?}, {0x1, {{0xc0001801c0, 0xd9}, {0xc0000d90e0, 0x29}}, 0x1, {0xc0000d9110, 0x24}, ...})
	github.com/stackrox/rox/scanner/matcher/matcher.go:112 +0x1d4
main.createBackends({0x4ed1ec8, 0xc00084f500}, 0xc000b30000)
	github.com/stackrox/rox/scanner/cmd/scanner/main.go:245 +0x374
main.main()
	github.com/stackrox/rox/scanner/cmd/scanner/main.go:120 +0x865 

User-facing documentation

Testing and quality

  • the change is production ready: the change is GA, or otherwise the functionality is gated by a feature flag
  • CI results are inspected

Automated testing

No automated tests were added because Scanner V4 is not running in CI. Once Scanner V4 is enabled in CI, the pod crash will be caught by the existing pod restart checks.

How I validated my change

Manually, because Scanner V4 is not yet running in CI.

I scaled the DB pod to zero, restarted the indexer and matcher pods, and observed retry attempts in the logs:

{"level":"error","host":"scanner-v4-indexer-66487598f7-dcwc9","component":"scanner/backend/indexer.NewIndexer","error":"failed to connect to `user=postgres database=`: 172.30.187.242:5432 (scanner-v4-db.stackrox.svc): dial error: dial tcp 172.30.187.242:5432: connect: connection refused","time":"2026-04-02T04:00:47Z","message":"failed to connect to postgres database"}
{"level":"warn","host":"scanner-v4-indexer-66487598f7-dcwc9","component":"scanner/backend/indexer.NewIndexer","attempt":2,"time":"2026-04-02T04:00:47Z","message":"retrying connection to postgres database"}
{"level":"error","host":"scanner-v4-indexer-66487598f7-dcwc9","component":"scanner/backend/indexer.NewIndexer","error":"failed to connect to `user=postgres database=`: 172.30.187.242:5432 (scanner-v4-db.stackrox.svc): dial error: dial tcp 172.30.187.242:5432: connect: connection refused","time":"2026-04-02T04:00:57Z","message":"failed to connect to postgres database"}
{"level":"warn","host":"scanner-v4-indexer-66487598f7-dcwc9","component":"scanner/backend/indexer.NewIndexer","attempt":3,"time":"2026-04-02T04:00:57Z","message":"retrying connection to postgres database"}
{"level":"error","host":"scanner-v4-matcher-b7d8665cf-j89tc","component":"scanner/backend/matcher.NewMatcher","error":"failed to connect to `user=postgres database=`: 172.30.187.242:5432 (scanner-v4-db.stackrox.svc): dial error: dial tcp 172.30.187.242:5432: connect: connection refused","time":"2026-04-02T04:04:38Z","message":"failed to connect to postgres database"}
{"level":"warn","host":"scanner-v4-matcher-b7d8665cf-j89tc","component":"scanner/backend/matcher.NewMatcher","attempt":2,"time":"2026-04-02T04:04:38Z","message":"retrying connection to postgres database"}
{"level":"error","host":"scanner-v4-matcher-b7d8665cf-j89tc","component":"scanner/backend/matcher.NewMatcher","error":"failed to connect to `user=postgres database=`: 172.30.187.242:5432 (scanner-v4-db.stackrox.svc): dial error: dial tcp 172.30.187.242:5432: connect: connection refused","time":"2026-04-02T04:04:48Z","message":"failed to connect to postgres database"}
{"level":"warn","host":"scanner-v4-matcher-b7d8665cf-j89tc","component":"scanner/backend/matcher.NewMatcher","attempt":3,"time":"2026-04-02T04:04:48Z","message":"retrying connection to postgres database"}

openshift-ci bot commented Apr 2, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@dcaravel dcaravel marked this pull request as ready for review April 2, 2026 04:13
@dcaravel dcaravel requested a review from a team as a code owner April 2, 2026 04:13

codecov bot commented Apr 2, 2026

Codecov Report

❌ Patch coverage is 0% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 49.58%. Comparing base (f042134) to head (7324d70).
⚠️ Report is 94 commits behind head on master.

Files with missing lines Patch % Lines
scanner/datastore/postgres/connect.go 0.00% 9 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #19761      +/-   ##
==========================================
- Coverage   49.58%   49.58%   -0.01%     
==========================================
  Files        2761     2761              
  Lines      208140   208146       +6     
==========================================
  Hits       103214   103214              
- Misses      97260    97266       +6     
  Partials     7666     7666              
Flag Coverage Δ
go-unit-tests 49.58% <0.00%> (-0.01%) ⬇️

@rhacs-bot

Images are ready for the commit at 7324d70.

To use with deploy scripts, first export MAIN_IMAGE_TAG=4.11.x-529-g7324d709e0.

BradLugo commented Apr 7, 2026

The upgrade to pgx5 in quay/claircore#1530 changed postgres.Connect() so that it no longer established a DB connection.

The assumption here is that connection pools in pgx/v4 connected to the database eagerly, whereas pools in pgx/v5 do not. This is indeed the case: pgxpool.Connect() was removed in pgx/v5 and replaced with pgxpool.New(), which opens connections lazily.

@dcaravel dcaravel requested review from a team and daynewlee April 8, 2026 18:44

@BradLugo BradLugo left a comment

LGTM. Though it makes me wonder if some behavior should be changed in claircore.

@dcaravel dcaravel merged commit 0f13ca6 into master Apr 10, 2026
115 of 116 checks passed
@dcaravel dcaravel deleted the dc/v4-restart branch April 10, 2026 12:48

github-actions bot commented Apr 10, 2026

🚀 Build Images Ready

Images are ready for commit 0f13ca6. To use with deploy scripts:

export MAIN_IMAGE_TAG=4.11.x-622-g0f13ca6fe2
