ROX-9476 ROX-8533: Show health info of a local scanner#1445
Conversation
|
Tag for build #499964 is 💻 For deploying this image using the dev scripts, run the following first: export MAIN_IMAGE_TAG='3.69.x-541-gead56670fc'🕹️ A |
9b8538d to
bfa0852
Compare
3437b07 to
ae61e40
Compare
RTann
left a comment
There was a problem hiding this comment.
Thank you for doing this! Just a few small things. Also, I'd recommend getting a reviewer from the UI team to take a look at the UI parts. Also, do you have screenshots of an example of this in action so we can see what it looks like?
CHANGELOG.md
Outdated
| - ROX-9435: Updated dryrun API to generate preview violations for disabled policies | ||
| - Support for security policies that do not have a policyVersion or have versions prior to 1.1 will be removed. If you have externally stored older policies, they cannot be imported. | ||
| - ROX-10021: RHCOS node support is dropped until major improvements are made in ROX-8944. | ||
| - ROX-8533: Show Cluster Health Info if Scanner is running on a secured cluster |
There was a problem hiding this comment.
I don't think this entry is needed since it's completely captured by a ticket. I tend to only add an entry if there is either no ticket and this adds something worth documenting (for the docs team) or there is a ticket and there is some nuance to the implementation that may not be clear in the ticket. I am guilty of the entry above this - ROX-10021: RHCOS node support is dropped until major improvements are made in ROX-8944., but the purpose of this entry it to let docs people know there are followup plans for this. Also, there is a followup entry which appends information about the UI (which is not really mentioned in the ticket) to the entry.
There was a problem hiding this comment.
Thanks for the explanation. Entry removed.
| message RawClusterHealthInfo { | ||
| storage.CollectorHealthInfo collector_health_info = 1; | ||
| storage.AdmissionControlHealthInfo admission_control_health_info = 2; | ||
| storage.LocalScannerHealthInfo local_scanner_health_info = 3; |
There was a problem hiding this comment.
Since I see it was adjusted in the proto below, should we align these variables, too?
There was a problem hiding this comment.
Aligned in the whole file
proto/storage/cluster.proto
Outdated
|
|
||
| HealthStatusLabel overall_health_status = 4 [(gogoproto.moretags) = "search:\"Cluster Status,store\""]; | ||
| HealthStatusLabel admission_control_health_status = 7 [(gogoproto.moretags) = "search:\"Admission Control Status,store\""]; | ||
| HealthStatusLabel local_scanner_health_status = 11 [(gogoproto.moretags) = "search:\"Local Scanner Status,store\""]; |
There was a problem hiding this comment.
can we add a comment to the top of this message to say which tag is next since the tags are out of order?
There was a problem hiding this comment.
WDYT about adding a comment at the bottom of the post instead? Since we usually add new fields at the bottom, the comment will be more likely to be read.
Also, the comment placed at the top will be propagated to the swagger definitions.
| if !env.LocalImageScanningEnabled.BooleanSetting() { | ||
| return nil | ||
| } | ||
| result := storage.LocalScannerHealthInfo{} |
There was a problem hiding this comment.
nit: var result storage.LocalScannerHealthInfo
| TotalReadyAnalyzerPods: localScanner.Status.ReadyReplicas, | ||
| } | ||
| } | ||
| localScannerDb, err := u.client.AppsV1().Deployments(u.namespace).Get(u.ctx(), localScannerDbDeploymentName, metav1.GetOptions{}) |
| admissionControlContainerName = "admission-control" | ||
|
|
||
| localScannerDeploymentName = "scanner" | ||
| localScannerDbDeploymentName = "scanner-db" |
There was a problem hiding this comment.
nit: localScannerDBDeploymentName
porridge
left a comment
There was a problem hiding this comment.
This looks awesome, but I think we should remove local in user-visible places, see inline.
Also a couple of questions:
- did you perhaps manage to simulate a broken local scanner cert refresher component
Startand see whether the error surfaces in the UI? Or is this PR just about pod health? - I also wanted to double-check what happens when local scanner is not supposed to be running? (Like in non-openshift clusters.) I think what should happen is that the scanner part of the health "pill" is simply not shown. A special sub-case is when secured cluster components run in the same cluster and namespace as central - in that case the local scanner does not get deployed, but there is a "central scanner" with the same name. We probably should not be showing its status together with other secured cluster components, right @RTann
CHANGELOG.md
Outdated
| - ROX-9435: Updated dryrun API to generate preview violations for disabled policies | ||
| - Support for security policies that do not have a policyVersion or have versions prior to 1.1 will be removed. If you have externally stored older policies, they cannot be imported. | ||
| - ROX-10021: RHCOS node support is dropped until major improvements are made in ROX-8944. | ||
| - ROX-8533: Show Cluster Health Info if Scanner is running on a secured cluster |
pkg/search/options.go
Outdated
| SensorStatus = newFieldLabel("Sensor Status") | ||
| CollectorStatus = newFieldLabel("Collector Status") | ||
| AdmissionControlStatus = newFieldLabel("Admission Control Status") | ||
| LocalScannerStatus = newFieldLabel("Local Scanner Status") |
There was a problem hiding this comment.
There was a product decision to show users just "Scanner", not "Local scanner" (here and elsewhere). I think the code-internal identifier names can stay as LocalScanner though.
There was a problem hiding this comment.
It turned out to be a bit tricky because the label has to match the tag defined in the proto file. I am going to remove the word 'Local' from there.
proto/storage/cluster.proto
Outdated
| int32 total_desired_analyzer_pods = 1; | ||
| } | ||
| oneof total_ready_analyzer_pods_opt { | ||
| int32 total_ready_analyzer_pods = 2; | ||
| } | ||
| oneof total_desired_db_pods_opt { | ||
| int32 total_desired_db_pods = 3; | ||
| } | ||
| oneof total_ready_db_pods_opt { | ||
| int32 total_ready_db_pods = 4; |
There was a problem hiding this comment.
The spacing after int32 seems inconsistent.
proto/storage/cluster.proto
Outdated
|
|
||
| // Collection of errors that occurred while trying to obtain local scanner health info. | ||
| repeated string status_errors = 5; | ||
|
|
There was a problem hiding this comment.
Remove this blank line (and perhaps the one in AdmissionControlHealthInfo as well).
| err = errors.Wrap(err, fmt.Sprintf("unable to find local scanner deployments in namespace %q", u.namespace)) | ||
| result.StatusErrors = append(result.StatusErrors, fmt.Sprintf("unable to find local scanner deployments in namespace %q: %v", u.namespace, err)) |
There was a problem hiding this comment.
| err = errors.Wrap(err, fmt.Sprintf("unable to find local scanner deployments in namespace %q", u.namespace)) | |
| result.StatusErrors = append(result.StatusErrors, fmt.Sprintf("unable to find local scanner deployments in namespace %q: %v", u.namespace, err)) | |
| err = errors.Wrap(err, fmt.Sprintf("unable to find scanner deployment in namespace %q", u.namespace)) | |
| result.StatusErrors = append(result.StatusErrors, fmt.Sprintf("unable to find scanner deployment in namespace %q: %v", u.namespace, err)) |
Similar below for the DB case.
| return &result | ||
| } | ||
|
|
||
| func (u *updaterImpl) getLocalScannerInfo() *storage.LocalScannerHealthInfo { |
There was a problem hiding this comment.
It's just itching to extract a generic function out of this once we have generics :-)
@RTann Sure, I also added a few more screens to the description |
To simplify the health check logic, I decided to leave only the health information of the scanner pods. If the cert refresher will not start successfully, the cert will expire and the corresponding pod will eventually become unhealthy. Of course, the indication of the component start failure will happen in advance and in some cases this may prevent an outage, but is it that critical?
That's exactly how it has been implemented. The corresponding part of the health "pill" will not be shown if the local scanner is disabled ( |
vjwilson
left a comment
There was a problem hiding this comment.
For the UI changes:
A couple of minor nits about code style.
More important is the lack of distinction in the design of "delayed" versus "unhealthy"
| storage.CollectorHealthInfo collector_health_info = 1; | ||
| storage.AdmissionControlHealthInfo admission_control_health_info = 2; | ||
| storage.ScannerHealthInfo scanner_health_info = 3; |
There was a problem hiding this comment.
Proto style reminds me of assembly ;)
LDX #$4000
BSR SAVEX * 7 CPU cycles (2 bytes)
RTS * 5 CPU cycles (1 byte)
* These two lines require 12 cycles and
* 3 bytes
...
SAVEX STX ,U++
RTS
| // Special case for Scanner when Sensor is UNHEALTHY or DELAYED. | ||
| export const delayedScannerStatusStyle = { | ||
| Icon: InfoCircleIcon, | ||
| bgColor: 'bg-base-200', | ||
| fgColor: 'text-base-700', |
There was a problem hiding this comment.
This is a bit confusing.
The icon and color choice are neutral, which seems fair for "delayed" status, but unhealthy is represented in other widgets with yellow or red. Here's we're conflating the two.
nit: If we are going to conflate them, delayed doesn't convey that. Hard to come up with name that captures both nuances... "unsatisfactory" maybe?
There was a problem hiding this comment.
This is used when the sensor status is delayed/unhealthy, not scanner.
When the sensor is unavailable, we can't distribute information about the status of the scanner, so we color it gray.
If the scanner is not available by itself, we color the status in red or green as in other cases.
TBH this was copy-pasted from the collector/admission controller health view, and it is probably worth using the same "delayed" style for all 3 cases, since they are exactly the same.
| import { DetailedTooltipOverlay, Tooltip } from '@stackrox/ui-components'; | ||
| import { ClusterHealthStatus } from '../../clusterTypes'; | ||
| import { | ||
| delayedScannerStatusStyle, | ||
| healthStatusStyles, | ||
| isDelayedSensorHealthStatus, | ||
| } from '../../cluster.helpers'; | ||
| import HealthLabelWithDelayed from '../HealthLabelWithDelayed'; | ||
| import { getDistanceStrictAsPhrase } from '../../../../utils/dateUtils'; |
There was a problem hiding this comment.
There are aliases for top-level directories, so you can avoid the deeply nested ancestor path for dateUtils by referencing it directly in the utils/ path.
(It also has to move up in the list, because of lint rule that "absolute" paths come first.)
| import { DetailedTooltipOverlay, Tooltip } from '@stackrox/ui-components'; | |
| import { ClusterHealthStatus } from '../../clusterTypes'; | |
| import { | |
| delayedScannerStatusStyle, | |
| healthStatusStyles, | |
| isDelayedSensorHealthStatus, | |
| } from '../../cluster.helpers'; | |
| import HealthLabelWithDelayed from '../HealthLabelWithDelayed'; | |
| import { getDistanceStrictAsPhrase } from '../../../../utils/dateUtils'; | |
| import { DetailedTooltipOverlay, Tooltip } from '@stackrox/ui-components'; | |
| import { getDistanceStrictAsPhrase } from 'utils/dateUtils'; | |
| import { ClusterHealthStatus } from '../../clusterTypes'; | |
| import { | |
| delayedScannerStatusStyle, | |
| healthStatusStyles, | |
| isDelayedSensorHealthStatus, | |
| } from '../../cluster.helpers'; | |
| import HealthLabelWithDelayed from '../HealthLabelWithDelayed'; |
There was a problem hiding this comment.
Thanks for the hint, fixed.
proto/storage/cluster.proto
Outdated
| AdmissionControlHealthInfo admission_control_health_info = 8; | ||
| // scanner_health_info is filled when the scanner is deployed on a secured cluster (so called "local scanner"). | ||
| // Please do not confuse this with the default scanner deployment on a central cluster. | ||
| ScannerHealthInfo scanner_health_info = 10; |
There was a problem hiding this comment.
nit: I think the = 10 needs to be further to the right.
| // derived from this data. | ||
| // Aggregated scanner health status is not included because it is derived in central and not in the component that | ||
| // first reports ScannerHealthInfo (sensor). | ||
| message ScannerHealthInfo { |
There was a problem hiding this comment.
Is this also solely for the "local" Scanner? If so, can you add that comment here?
| <tbody> | ||
| <tr className={trClassName} key="totalReadyPods"> | ||
| <th className={thClassName} scope="row"> | ||
| Scanner pods ready: |
There was a problem hiding this comment.
Suggestion: we can reduce verbosity and make things look a little nicer if we say Pods ready: instead of Scanner pods ready:, Version: instead of Collector version: and so on. What do you think?
It means you'd need to change titles for Collector and Admission Control but that shouldn't be hard.
There was a problem hiding this comment.
Good idea. I was thinking about simplifying even more, like: "Replicas: 3/3". In any case, it seems that this is out of the scope of this PR.
I'd say it's a prerequisite for closing ROX-9476 which was filed in response to my whining in that PR review.
👌🏻 |


Description
Added local scanner health information to a cluster details page.
Checklist
Testing Performed
Dev test on a local cluster
Screenshots
Secured cluster with
ROX_LOCAL_IMAGE_SCANNING_ENABLED=trueClusters screen:


Health Status widget:
Secured cluster with
ROX_LOCAL_IMAGE_SCANNING_ENABLED=false