ROX-33737: Create dashboards (Phase 0) #19604

Draft
ebensh wants to merge 10 commits into master from rox-33737-phase0-dashboards

ebensh commented Mar 25, 2026

No description provided.

ebensh and others added 10 commits March 25, 2026 15:05
Implements core Grafana dashboard generation using Go structs:
- Dashboard struct with UID, Title, Tags, Links, Rows, Templating
- Panel types: timeseries, stat, gauge, text, table
- Automatic gridPos calculation and panel ID assignment
- Gap annotation support for missing metrics
- Threshold configuration for gauges
- Uses datasource UID PBFA97CFB590B2093 (Prometheus)
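
A minimal sketch of how the struct-based layout listed above could work. It is not the PR's code: the field set is trimmed, and the `Layout` method, the fixed panel height, and the even split of Grafana's 24-column grid are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// GridPos mirrors Grafana's gridPos object (the grid is 24 columns wide).
type GridPos struct {
	H int `json:"h"`
	W int `json:"w"`
	X int `json:"x"`
	Y int `json:"y"`
}

// Panel is a hypothetical, trimmed-down panel model.
type Panel struct {
	ID      int     `json:"id"`
	Type    string  `json:"type"`
	Title   string  `json:"title"`
	GridPos GridPos `json:"gridPos"`
}

// Row groups panels that share one horizontal band.
type Row struct {
	Title  string
	Panels []Panel
}

// Dashboard holds metadata plus rows; rows are flattened before marshaling.
type Dashboard struct {
	UID   string   `json:"uid"`
	Title string   `json:"title"`
	Tags  []string `json:"tags"`
	Rows  []Row    `json:"-"`
}

const (
	gridWidth   = 24 // Grafana grid columns
	panelHeight = 8  // assumed fixed height per row
)

// Layout assigns sequential panel IDs and splits each row's width evenly.
func (d Dashboard) Layout() []Panel {
	var out []Panel
	id, y := 1, 0
	for _, row := range d.Rows {
		if len(row.Panels) == 0 {
			continue
		}
		w := gridWidth / len(row.Panels)
		for i, p := range row.Panels {
			p.ID = id
			p.GridPos = GridPos{H: panelHeight, W: w, X: i * w, Y: y}
			out = append(out, p)
			id++
		}
		y += panelHeight
	}
	return out
}

func main() {
	d := Dashboard{
		UID:   "example-dashboard", // hypothetical UID
		Title: "Example",
		Tags:  []string{"stackrox"},
		Rows: []Row{
			{Title: "Health", Panels: []Panel{{Type: "stat", Title: "Up"}, {Type: "gauge", Title: "CPU"}}},
			{Title: "Trends", Panels: []Panel{{Type: "timeseries", Title: "Memory"}}},
		},
	}
	data, err := json.MarshalIndent(d.Layout(), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(data))
}
```

Computing positions in one pass keeps panel IDs and coordinates deterministic, which makes the generated JSON easy to assert on in tests.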

All tests passing (11 test cases).

User request: Create a Go package that generates Grafana dashboard JSON 
from Go structs to avoid hand-writing verbose JSON and make dashboards 
maintainable and testable.

Note: Code partially generated by AI (Claude Sonnet 4.5)

Creates generate/main.go with:
- writeJSON helper for writing dashboard JSON files
- main() with configurable output directory
- Placeholder for dashboard generation (to be added in Tasks 2-6)

CLI builds successfully and is ready for dashboard definitions.

User request: Create CLI entrypoint that imports dashboard definitions 
and writes JSON files.

Note: Code partially generated by AI (Claude Sonnet 4.5)

Implements Task 2: Level 1 — Service Map Dashboard (StackRox Overview)

This commit adds the top-level overview dashboard for the StackRox monitoring
hierarchy. The dashboard provides a service map view with three main sections:

- Service Health: Central process metrics (up status, CPU, memory, goroutines, version)
- Connected Sensors: Cluster connectivity metrics (sensors, clusters, nodes, vCPUs)
- Database: PostgreSQL health metrics (connection status, size, active connections, available space)
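
As a rough sketch of what the Service Health section might query, the example below builds stat panels from the standard metrics exported by the Prometheus Go client. The `StatPanel` type, the `serviceHealthPanels` helper, and the `job` label value are assumptions; the datasource UID is the one quoted in the first commit message. The version panel is left out because its metric name is not given here.

```go
package main

import "fmt"

// Datasource identifies a Grafana datasource by type and UID.
type Datasource struct {
	Type string `json:"type"`
	UID  string `json:"uid"`
}

// StatPanel is a hypothetical, trimmed-down stat panel.
type StatPanel struct {
	Title      string
	Expr       string
	Datasource Datasource
}

// prometheusDS uses the datasource UID named in the first commit message.
var prometheusDS = Datasource{Type: "prometheus", UID: "PBFA97CFB590B2093"}

// serviceHealthPanels returns one panel per Service Health signal.
// The job label is an assumption about how Central is scraped.
func serviceHealthPanels(job string) []StatPanel {
	sel := fmt.Sprintf("{job=%q}", job)
	specs := []struct{ title, expr string }{
		{"Up", "up" + sel},
		{"CPU", "rate(process_cpu_seconds_total" + sel + "[5m])"},
		{"Memory", "process_resident_memory_bytes" + sel},
		{"Goroutines", "go_goroutines" + sel},
	}
	panels := make([]StatPanel, 0, len(specs))
	for _, s := range specs {
		panels = append(panels, StatPanel{Title: s.title, Expr: s.expr, Datasource: prometheusDS})
	}
	return panels
}

func main() {
	for _, p := range serviceHealthPanels("central") {
		fmt.Printf("%-10s %s\n", p.Title, p.Expr)
	}
}
```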

The dashboard includes a drill-down link to the Level 2 'Central Internals' dashboard
for detailed component analysis.

Implementation follows test-driven development methodology with comprehensive test
coverage verifying dashboard structure, panel configuration, and JSON validity.

Files:
- deploy/charts/monitoring/dashboards/generator/l1_overview.go: Dashboard definition
- deploy/charts/monitoring/dashboards/generator/l1_overview_test.go: Test suite
- deploy/charts/monitoring/dashboards/generator/generate/main.go: CLI integration

User request: Implement Task 2: Level 1 — Service Map Dashboard following TDD methodology.
Dashboard uses existing metrics from Central service (standard Go runtime metrics and
custom StackRox metrics for sensor connectivity and database health).

Note: Code partially generated with AI assistance (Claude Sonnet 4.5).

Implements the Level 2 dashboard that displays a grid view of all 10 logical
regions within Central. Each region includes headline metrics and a link to
its corresponding Level 3 detail dashboard.

Dashboard structure:
- UID: central-internals
- 10 rows (one per logical region): Sensor Ingestion, Deployment Processing,
  Vulnerability Enrichment, Detection & Alerts, Risk Calculation, Background
  Reprocessing, Pruning & GC, Network Analysis, Report Generation, API & UI
- Each row contains 3-4 metric panels plus a details link to Level 3
- Back-link to Level 1 StackRox Overview dashboard

Notable features:
- Identifies existing metric gaps with text panels (e.g., no alert generation
  rate metric, no Central-side report generation metrics)
- Uses histogram_quantile for p95 latency calculations
- Uses rate() for counter metrics to show per-second rates
- Follows StackRox dashboard generator patterns and testing conventions
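
The two PromQL patterns mentioned above can be sketched as small string builders. The helper names, the metric names, and the 5m window are hypothetical; only the use of `rate()` and `histogram_quantile` comes from the commit message.

```go
package main

import "fmt"

// rateExpr wraps a counter in rate() to get a per-second rate over window.
func rateExpr(counter, window string) string {
	return fmt.Sprintf("rate(%s[%s])", counter, window)
}

// quantileExpr estimates quantile q from a Prometheus histogram by
// aggregating the per-bucket rates by the "le" label.
func quantileExpr(q float64, histogram, window string) string {
	return fmt.Sprintf(
		"histogram_quantile(%g, sum by (le) (rate(%s_bucket[%s])))",
		q, histogram, window,
	)
}

func main() {
	// Metric names below are placeholders, not real StackRox metrics.
	fmt.Println(rateExpr("example_events_total", "5m"))
	fmt.Println(quantileExpr(0.95, "example_duration_seconds", "5m"))
}
```

Keeping the aggregation by `le` inside the helper avoids a common mistake, which is applying `histogram_quantile` to buckets that have been summed without preserving the bucket boundary label.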

Testing:
- Comprehensive test coverage in l2_central_test.go
- All 16 tests pass
- Validates metadata, row structure, panel queries, and JSON generation

CLI integration:
- Added L2 dashboard generation to generate/main.go
- Outputs central-internals.json alongside stackrox-overview.json

Implemented using TDD methodology: tests written first, then implementation,
following Go best practices and StackRox coding style.

User request: Implement Task 3: Level 2 — Central Internals Dashboard

Code partially generated with AI assistance (Claude Code).

Implement the highest-value Level 3 detail dashboard for Central's Sensor
Ingestion pipeline with comprehensive metrics breakdowns.

Dashboard structure (central-sensor-ingestion):
- Connection Status: sensor connectivity and state transitions
- Deduper: throughput, hit rate, hash store size, and operations
- Worker Queue: event processing, duration, with gap annotations for
  missing queue depth and in-flight metrics
- Pipeline Processing: resource processing, panics, K8s event latency,
  with gap annotation for per-fragment metrics
- Messages Not Sent: failed sends to sensor with type and reason

Key features:
- 3 gap annotation panels identify missing observability metrics
- Full metric breakdowns with proper legend formats
- Links back to Level 2 Central Internals dashboard
- Follows TDD: comprehensive test coverage for all rows and panels
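
One plausible shape for a gap annotation is a Grafana text panel rendered as markdown, sketched below. The `GapPanel` helper, its title prefix, and the content wording are assumptions rather than the PR's actual implementation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TextOptions mirrors the options block of a Grafana text panel.
type TextOptions struct {
	Mode    string `json:"mode"`
	Content string `json:"content"`
}

// TextPanel is a trimmed-down Grafana text panel.
type TextPanel struct {
	Type    string      `json:"type"`
	Title   string      `json:"title"`
	Options TextOptions `json:"options"`
}

// GapPanel (hypothetical helper) documents a metric that does not exist yet,
// so the hole in observability is visible on the dashboard itself.
func GapPanel(missing, why string) TextPanel {
	return TextPanel{
		Type:  "text",
		Title: "Gap: " + missing,
		Options: TextOptions{
			Mode:    "markdown",
			Content: fmt.Sprintf("**Missing metric:** %s\n\n%s", missing, why),
		},
	}
}

func main() {
	p := GapPanel("worker queue depth", "Needed to tell a slow consumer from a burst of events.")
	data, err := json.MarshalIndent(p, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(data))
}
```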

Generated files:
- deploy/charts/monitoring/dashboards/central-sensor-ingestion.json

Task: ROX-33737 Phase 0 - Dashboard Hierarchy Prototype

Partially AI-assisted.

Implements Task 5 from the Central metrics dashboard project.

This dashboard provides deep visibility into the vulnerability enrichment
pipeline with comprehensive metrics for:
- Scan semaphore utilization and queue management
- Image scanning performance (p50/p95/p99 latencies)
- Node scanning metrics
- Image deduplication statistics
- Registry client requests, latency, and timeout tracking

Includes 2 gap annotations identifying missing metrics:
- Enrichment request counter (for failure rate calculation)
- Node scan counter
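
The p50/p95/p99 latency panels could be generated from a single histogram, one query target per percentile. In this sketch the `Target` fields follow Grafana's Prometheus target model (`expr`, `legendFormat`, `refId`), while the `latencyTargets` helper, the metric name, and the 5m window are assumptions.

```go
package main

import "fmt"

// Target is one Prometheus query attached to a panel.
type Target struct {
	Expr         string `json:"expr"`
	LegendFormat string `json:"legendFormat"`
	RefID        string `json:"refId"`
}

// latencyTargets builds one histogram_quantile query per percentile,
// labelled p50, p95, ... and given refIds A, B, C, ...
func latencyTargets(histogram string, percentiles []int) []Target {
	targets := make([]Target, 0, len(percentiles))
	for i, p := range percentiles {
		q := float64(p) / 100
		targets = append(targets, Target{
			Expr: fmt.Sprintf(
				"histogram_quantile(%g, sum by (le) (rate(%s_bucket[5m])))",
				q, histogram,
			),
			LegendFormat: fmt.Sprintf("p%d", p),
			RefID:        string(rune('A' + i)),
		})
	}
	return targets
}

func main() {
	// Placeholder metric name, not a real StackRox metric.
	for _, t := range latencyTargets("example_scan_duration_seconds", []int{50, 95, 99}) {
		fmt.Printf("%s %-4s %s\n", t.RefID, t.LegendFormat, t.Expr)
	}
}
```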

Test-driven development: comprehensive test suite covering all rows,
panels, queries, and JSON generation.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Implements Task 6: Create stub dashboards for the 8 remaining logical regions
in Central. Each dashboard has real panels where metrics exist and prominent
gap annotations documenting missing metrics.

Dashboard UIDs:
- central-deployment-processing: Deployment/pod/namespace resource processing
- central-detection-alerts: Detection engine and alert generation
- central-risk-calculation: Risk scoring and reprocessing
- central-background-reprocessing: Background loops and batch processing
- central-pruning-gc: Process pruning and garbage collection
- central-network-analysis: Network flows and endpoint tracking
- central-report-generation: Compliance operator and vuln reports
- central-api-ui: GraphQL, gRPC, and API endpoint metrics

All dashboards:
- Include level-3 tag and back-link to central-internals dashboard
- Mix real metric panels (timeseries/stat) with gap annotation panels
- Document needed metrics for complete observability
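
A table-driven sketch of how the eight stubs might share their tag and back-link. The UIDs are the ones listed above and the `L3Stubs` name appears in this commit message; the struct shapes, the titles, and the `/d/<uid>` link URL are assumptions.

```go
package main

import "fmt"

// Link is a dashboard-level link as used in Grafana's dashboard JSON.
type Link struct {
	Title string `json:"title"`
	Type  string `json:"type"`
	URL   string `json:"url"`
}

// Dashboard is a trimmed-down model carrying only what the stubs share.
type Dashboard struct {
	UID   string   `json:"uid"`
	Title string   `json:"title"`
	Tags  []string `json:"tags"`
	Links []Link   `json:"links"`
}

// L3Stubs returns one stub per remaining region, each tagged level-3 and
// linked back to the Level 2 dashboard. Titles are illustrative guesses.
func L3Stubs() []Dashboard {
	specs := []struct{ uid, title string }{
		{"central-deployment-processing", "Deployment Processing"},
		{"central-detection-alerts", "Detection & Alerts"},
		{"central-risk-calculation", "Risk Calculation"},
		{"central-background-reprocessing", "Background Reprocessing"},
		{"central-pruning-gc", "Pruning & GC"},
		{"central-network-analysis", "Network Analysis"},
		{"central-report-generation", "Report Generation"},
		{"central-api-ui", "API & UI"},
	}
	back := Link{Title: "Central Internals", Type: "link", URL: "/d/central-internals"}
	out := make([]Dashboard, 0, len(specs))
	for _, s := range specs {
		out = append(out, Dashboard{
			UID:   s.uid,
			Title: s.title,
			Tags:  []string{"level-3"},
			Links: []Link{back},
		})
	}
	return out
}

func main() {
	for _, d := range L3Stubs() {
		fmt.Printf("%s -> %s\n", d.UID, d.Links[0].URL)
	}
}
```

Driving the stubs from one table means a new region only needs a new row, and the shared tag and back-link cannot drift between dashboards.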

Implementation follows TDD methodology:
1. Wrote comprehensive tests first (l3_stubs_test.go)
2. Implemented L3Stubs() function with 8 dashboard specs
3. Integrated into CLI generator (generate/main.go)
4. Verified all tests pass and JSON output is valid

Files:
- Created: deploy/charts/monitoring/dashboards/generator/l3_stubs.go
- Created: deploy/charts/monitoring/dashboards/generator/l3_stubs_test.go
- Modified: deploy/charts/monitoring/dashboards/generator/generate/main.go

Test results: All 77 tests pass.

AI-assisted implementation based on user specification.

openshift-ci bot commented Mar 25, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci bot commented Mar 25, 2026

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
