Skip to content

ROX-9476 ROX-8533: Show health info of a local scanner#1445

Merged
kovayur merged 9 commits intomasterfrom
yury/ROX-9476-sensor-error-handling-2
May 5, 2022
Merged

ROX-9476 ROX-8533: Show health info of a local scanner#1445
kovayur merged 9 commits intomasterfrom
yury/ROX-9476-sensor-error-handling-2

Conversation

@kovayur
Copy link
Contributor

@kovayur kovayur commented Apr 27, 2022

Description

Added local scanner health information to a cluster details page.

Checklist

  • Investigated and inspected CI test results
  • Unit test and regression tests added
  • Evaluated and added CHANGELOG entry if required
  • Determined and documented upgrade steps
  • Documented user facing changes (create PR based on stackrox/openshift-docs and merge into rhacs-docs)

Testing Performed

Dev test on a local cluster

Screenshots

Secured cluster with ROX_LOCAL_IMAGE_SCANNING_ENABLED=true

Clusters screen:
image
Health Status widget:
image

Secured cluster with ROX_LOCAL_IMAGE_SCANNING_ENABLED=false

image

@ghost
Copy link

ghost commented Apr 27, 2022

Tag for build #499964 is 3.69.x-541-gead56670fc.

💻 For deploying this image using the dev scripts, run the following first:

export MAIN_IMAGE_TAG='3.69.x-541-gead56670fc'

🕹️ A roxctl binary can be downloaded from the CircleCI artifacts.

@kovayur kovayur force-pushed the yury/ROX-9476-sensor-error-handling-2 branch from 9b8538d to bfa0852 Compare April 27, 2022 11:45
@kovayur kovayur requested review from a team and RTann April 28, 2022 10:33
@kovayur kovayur marked this pull request as ready for review April 28, 2022 10:34
@kovayur kovayur force-pushed the yury/ROX-9476-sensor-error-handling-2 branch from 3437b07 to ae61e40 Compare April 28, 2022 10:42
Copy link
Contributor

@RTann RTann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for doing this! Just a few small things. Also, I'd recommend getting a reviewer from the UI team to take a look at the UI parts. Also, do you have screenshots of an example of this in action so we can see what it looks like?

CHANGELOG.md Outdated
- ROX-9435: Updated dryrun API to generate preview violations for disabled policies
- Support for security policies that do not have a policyVersion or have versions prior to 1.1 will be removed. If you have externally stored older policies, they cannot be imported.
- ROX-10021: RHCOS node support is dropped until major improvements are made in ROX-8944.
- ROX-8533: Show Cluster Health Info if Scanner is running on a secured cluster
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this entry is needed since it's completely captured by a ticket. I tend to only add an entry if there is either no ticket and this adds something worth documenting (for the docs team) or there is a ticket and there is some nuance to the implementation that may not be clear in the ticket. I am guilty of the entry above this - ROX-10021: RHCOS node support is dropped until major improvements are made in ROX-8944., but the purpose of this entry it to let docs people know there are followup plans for this. Also, there is a followup entry which appends information about the UI (which is not really mentioned in the ticket) to the entry.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation. Entry removed.

message RawClusterHealthInfo {
storage.CollectorHealthInfo collector_health_info = 1;
storage.AdmissionControlHealthInfo admission_control_health_info = 2;
storage.LocalScannerHealthInfo local_scanner_health_info = 3;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since I see it was adjusted in the proto below, should we align these variables, too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aligned in the whole file


HealthStatusLabel overall_health_status = 4 [(gogoproto.moretags) = "search:\"Cluster Status,store\""];
HealthStatusLabel admission_control_health_status = 7 [(gogoproto.moretags) = "search:\"Admission Control Status,store\""];
HealthStatusLabel local_scanner_health_status = 11 [(gogoproto.moretags) = "search:\"Local Scanner Status,store\""];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add a comment to the top of this message to say which tag is next since the tags are out of order?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WDYT about adding a comment at the bottom of the post instead? Since we usually add new fields at the bottom, the comment will be more likely to be read.
Also, the comment placed at the top will be propagated to the swagger definitions.

if !env.LocalImageScanningEnabled.BooleanSetting() {
return nil
}
result := storage.LocalScannerHealthInfo{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: var result storage.LocalScannerHealthInfo

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

TotalReadyAnalyzerPods: localScanner.Status.ReadyReplicas,
}
}
localScannerDb, err := u.client.AppsV1().Deployments(u.namespace).Get(u.ctx(), localScannerDbDeploymentName, metav1.GetOptions{})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: localScannerDB

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

admissionControlContainerName = "admission-control"

localScannerDeploymentName = "scanner"
localScannerDbDeploymentName = "scanner-db"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: localScannerDBDeploymentName

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Copy link
Contributor

@porridge porridge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks awesome, but I think we should remove local in user-visible places, see inline.
Also a couple of questions:

  • did you perhaps manage to simulate a broken local scanner cert refresher component Start and see whether the error surfaces in the UI? Or is this PR just about pod health?
  • I also wanted to double-check what happens when local scanner is not supposed to be running? (Like in non-openshift clusters.) I think what should happen is that the scanner part of the health "pill" is simply not shown. A special sub-case is when secured cluster components run in the same cluster and namespace as central - in that case the local scanner does not get deployed, but there is a "central scanner" with the same name. We probably should not be showing its status together with other secured cluster components, right @RTann

CHANGELOG.md Outdated
- ROX-9435: Updated dryrun API to generate preview violations for disabled policies
- Support for security policies that do not have a policyVersion or have versions prior to 1.1 will be removed. If you have externally stored older policies, they cannot be imported.
- ROX-10021: RHCOS node support is dropped until major improvements are made in ROX-8944.
- ROX-8533: Show Cluster Health Info if Scanner is running on a secured cluster
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

SensorStatus = newFieldLabel("Sensor Status")
CollectorStatus = newFieldLabel("Collector Status")
AdmissionControlStatus = newFieldLabel("Admission Control Status")
LocalScannerStatus = newFieldLabel("Local Scanner Status")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a product decision to show users just "Scanner", not "Local scanner" (here and elsewhere). I think the code-internal identifier names can stay as LocalScanner though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done as suggested.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turned out to be a bit tricky because the label has to match the tag defined in the proto file. I am going to remove the word 'Local' from there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines +359 to +368
int32 total_desired_analyzer_pods = 1;
}
oneof total_ready_analyzer_pods_opt {
int32 total_ready_analyzer_pods = 2;
}
oneof total_desired_db_pods_opt {
int32 total_desired_db_pods = 3;
}
oneof total_ready_db_pods_opt {
int32 total_ready_db_pods = 4;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The spacing after int32 seems inconsistent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.


// Collection of errors that occurred while trying to obtain local scanner health info.
repeated string status_errors = 5;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this blank line (and perhaps the one in AdmissionControlHealthInfo as well).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines +174 to +175
err = errors.Wrap(err, fmt.Sprintf("unable to find local scanner deployments in namespace %q", u.namespace))
result.StatusErrors = append(result.StatusErrors, fmt.Sprintf("unable to find local scanner deployments in namespace %q: %v", u.namespace, err))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
err = errors.Wrap(err, fmt.Sprintf("unable to find local scanner deployments in namespace %q", u.namespace))
result.StatusErrors = append(result.StatusErrors, fmt.Sprintf("unable to find local scanner deployments in namespace %q: %v", u.namespace, err))
err = errors.Wrap(err, fmt.Sprintf("unable to find scanner deployment in namespace %q", u.namespace))
result.StatusErrors = append(result.StatusErrors, fmt.Sprintf("unable to find scanner deployment in namespace %q: %v", u.namespace, err))

Similar below for the DB case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

return &result
}

func (u *updaterImpl) getLocalScannerInfo() *storage.LocalScannerHealthInfo {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just itching to extract a generic function out of this once we have generics :-)

@kovayur
Copy link
Contributor Author

kovayur commented Apr 29, 2022

Thank you for doing this! Just a few small things. Also, I'd recommend getting a reviewer from the UI team to take a look at the UI parts. Also, do you have screenshots of an example of this in action so we can see what it looks like?

@RTann Sure, I also added a few more screens to the description

@kovayur
Copy link
Contributor Author

kovayur commented Apr 29, 2022

This looks awesome, but I think we should remove local in user-visible places, see inline. Also a couple of questions:

  • did you perhaps manage to simulate a broken local scanner cert refresher component Start and see whether the error surfaces in the UI? Or is this PR just about pod health?

To simplify the health check logic, I decided to leave only the health information of the scanner pods. If the cert refresher will not start successfully, the cert will expire and the corresponding pod will eventually become unhealthy. Of course, the indication of the component start failure will happen in advance and in some cases this may prevent an outage, but is it that critical?

  • I also wanted to double-check what happens when local scanner is not supposed to be running? (Like in non-openshift clusters.) I think what should happen is that the scanner part of the health "pill" is simply not shown. A special sub-case is when secured cluster components run in the same cluster and namespace as central - in that case the local scanner does not get deployed, but there is a "central scanner" with the same name. We probably should not be showing its status together with other secured cluster components, right @RTann

That's exactly how it has been implemented. The corresponding part of the health "pill" will not be shown if the local scanner is disabled (ROX_LOCAL_IMAGE_SCANNING_ENABLED=false). Thus, the mentioned special case works as expected (health info is not shown).

@kovayur kovayur requested a review from porridge April 29, 2022 16:56
@vjwilson vjwilson requested a review from a team April 29, 2022 17:02
Copy link
Contributor

@vjwilson vjwilson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the UI changes:

A couple of minor nits about code style.

More important is the lack of distinction in the design of "delayed" versus "unhealthy"

Comment on lines +22 to +24
storage.CollectorHealthInfo collector_health_info = 1;
storage.AdmissionControlHealthInfo admission_control_health_info = 2;
storage.ScannerHealthInfo scanner_health_info = 3;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Proto style reminds me of assembly ;)

        LDX     #$4000
        BSR     SAVEX        * 7 CPU cycles (2 bytes)
        RTS                  * 5 CPU cycles (1 byte)
                             * These two lines require 12 cycles and
                             * 3 bytes
        ...
SAVEX   STX     ,U++
        RTS

Comment on lines +187 to +191
// Special case for Scanner when Sensor is UNHEALTHY or DELAYED.
export const delayedScannerStatusStyle = {
Icon: InfoCircleIcon,
bgColor: 'bg-base-200',
fgColor: 'text-base-700',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit confusing.
The icon and color choice are neutral, which seems fair for "delayed" status, but unhealthy is represented in other widgets with yellow or red. Here's we're conflating the two.

nit: If we are going to conflate them, delayed doesn't convey that. Hard to come up with name that captures both nuances... "unsatisfactory" maybe?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is used when the sensor status is delayed/unhealthy, not scanner.
When the sensor is unavailable, we can't distribute information about the status of the scanner, so we color it gray.
If the scanner is not available by itself, we color the status in red or green as in other cases.

TBH this was copy-pasted from the collector/admission controller health view, and it is probably worth using the same "delayed" style for all 3 cases, since they are exactly the same.

Comment on lines +3 to +11
import { DetailedTooltipOverlay, Tooltip } from '@stackrox/ui-components';
import { ClusterHealthStatus } from '../../clusterTypes';
import {
delayedScannerStatusStyle,
healthStatusStyles,
isDelayedSensorHealthStatus,
} from '../../cluster.helpers';
import HealthLabelWithDelayed from '../HealthLabelWithDelayed';
import { getDistanceStrictAsPhrase } from '../../../../utils/dateUtils';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are aliases for top-level directories, so you can avoid the deeply nested ancestor path for dateUtils by referencing it directly in the utils/ path.

(It also has to move up in the list, because of lint rule that "absolute" paths come first.)

Suggested change
import { DetailedTooltipOverlay, Tooltip } from '@stackrox/ui-components';
import { ClusterHealthStatus } from '../../clusterTypes';
import {
delayedScannerStatusStyle,
healthStatusStyles,
isDelayedSensorHealthStatus,
} from '../../cluster.helpers';
import HealthLabelWithDelayed from '../HealthLabelWithDelayed';
import { getDistanceStrictAsPhrase } from '../../../../utils/dateUtils';
import { DetailedTooltipOverlay, Tooltip } from '@stackrox/ui-components';
import { getDistanceStrictAsPhrase } from 'utils/dateUtils';
import { ClusterHealthStatus } from '../../clusterTypes';
import {
delayedScannerStatusStyle,
healthStatusStyles,
isDelayedSensorHealthStatus,
} from '../../cluster.helpers';
import HealthLabelWithDelayed from '../HealthLabelWithDelayed';

Copy link
Contributor Author

@kovayur kovayur May 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the hint, fixed.

Copy link
Contributor

@RTann RTann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few nits, but otherwise backend LGTM. Definitely make sure @porridge approves as well since I am probably not as familiar with the health status stuff

AdmissionControlHealthInfo admission_control_health_info = 8;
// scanner_health_info is filled when the scanner is deployed on a secured cluster (so called "local scanner").
// Please do not confuse this with the default scanner deployment on a central cluster.
ScannerHealthInfo scanner_health_info = 10;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think the = 10 needs to be further to the right.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

// derived from this data.
// Aggregated scanner health status is not included because it is derived in central and not in the component that
// first reports ScannerHealthInfo (sensor).
message ScannerHealthInfo {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this also solely for the "local" Scanner? If so, can you add that comment here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment added.

<tbody>
<tr className={trClassName} key="totalReadyPods">
<th className={thClassName} scope="row">
Scanner pods ready:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: we can reduce verbosity and make things look a little nicer if we say Pods ready: instead of Scanner pods ready:, Version: instead of Collector version: and so on. What do you think?
It means you'd need to change titles for Collector and Admission Control but that shouldn't be hard.

Compare before
image

and after
image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. I was thinking about simplifying even more, like: "Replicas: 3/3". In any case, it seems that this is out of the scope of this PR.

@porridge
Copy link
Contributor

porridge commented May 5, 2022

  • did you perhaps manage to simulate a broken local scanner cert refresher component Start and see whether the error surfaces in the UI? Or is this PR just about pod health?

To simplify the health check logic, I decided to leave only the health information of the scanner pods. If the cert refresher will not start successfully, the cert will expire and the corresponding pod will eventually become unhealthy. Of course, the indication of the component start failure will happen in advance and in some cases this may prevent an outage, but is it that critical?

I'd say it's a prerequisite for closing ROX-9476 which was filed in response to my whining in that PR review.

  • I also wanted to double-check what happens when local scanner is not supposed to be running? (Like in non-openshift clusters.) I think what should happen is that the scanner part of the health "pill" is simply not shown. A special sub-case is when secured cluster components run in the same cluster and namespace as central - in that case the local scanner does not get deployed, but there is a "central scanner" with the same name. We probably should not be showing its status together with other secured cluster components, right @RTann

That's exactly how it has been implemented.

👌🏻

Copy link
Contributor

@porridge porridge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but please don't let the bot close ROX-9476 yet as mentioned above.

@kovayur kovayur merged commit addf58c into master May 5, 2022
@kovayur kovayur deleted the yury/ROX-9476-sensor-error-handling-2 branch May 5, 2022 13:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants