
ROX-12834: Introduce fake nodescan generator #3399

Merged
Maddosaurus merged 37 commits into master from mm/ROX-12834-fake-nodescan
Oct 28, 2022

Conversation

@Maddosaurus
Contributor

@Maddosaurus Maddosaurus commented Oct 12, 2022

Description

This PR introduces a generator to send fake NodeScanV2 messages from Collector to Sensor.
These messages enable us to build out the full pipeline independently from implementing the NodeScanV2 logic.

  • This feature is hidden behind a feature flag. Use ROX_RHCOS_NODE_SCANNING=true when calling deploy.sh to enable it.
  • The scan interval is customizable through the ROX_NODE_RESCAN_INTERVAL env var, which defaults to 4 hours.
    • The interval can be decreased by setting this var on the compliance container in the Collector DaemonSet.

Checklist

  • Investigated and inspected CI test results
  • Unit test and regression tests added
  • Evaluated and added CHANGELOG entry if required
  • Determined and documented upgrade steps
  • Documented user facing changes (create PR based on openshift/openshift-docs and merge into rhacs-docs)

If any of these don't apply, please comment below.

Testing Performed

  1. Deploy (e.g. locally) with the feature flag enabled:
    ROX_RHCOS_NODE_SCANNING=true ./deploy/k8s/deploy-local.sh

  2. Observe the log message in the compliance container (in the collector pod) stating that the scanning interval is 4 hours.

  3. Observe a fake node scan message being logged in the Sensor logs once at start, then every 4 hours.

  4. Edit the Collector DaemonSet to add a custom interval as an env var on the compliance container, via kubectl: kubectl -n stackrox set env daemonsets/collector --containers="compliance" ROX_NODE_RESCAN_INTERVAL="20s"

  5. Observe the collector pod being re-instantiated and the compliance container log message stating an interval of 20 seconds.

  6. Finally, tear down everything and redeploy with deploy.sh without setting the feature flag.
    Observe that the collector log no longer states a node scan interval.

Additional testing: Resilience

We also tested error cases:

  1. Kill the Sensor pod; observe the compliance container re-establishing communication and continuing to send messages.
  2. Kill the Compliance pod; observe the compliance container re-establishing communication and continuing to send messages.

@ghost

ghost commented Oct 12, 2022

Images are ready for the commit at 707ccf2.

To use with deploy scripts, first export MAIN_IMAGE_TAG=3.72.x-461-g707ccf22c8.

@Maddosaurus Maddosaurus marked this pull request as draft October 12, 2022 13:39
@Maddosaurus Maddosaurus requested a review from RTann October 13, 2022 15:33
@Maddosaurus Maddosaurus marked this pull request as ready for review October 14, 2022 11:10
@Maddosaurus
Contributor Author

/retest

@Maddosaurus
Contributor Author

/retest

@Maddosaurus
Contributor Author

/retest

@Maddosaurus Maddosaurus force-pushed the mm/ROX-12834-fake-nodescan branch from 1635ca2 to ab700bb Compare October 20, 2022 14:25
Contributor

@vikin91 vikin91 left a comment

It looks really good! I was able to run it and see it working. Leaving my comments for now and will finish the review tomorrow (I want to give it a final look).

@Maddosaurus Maddosaurus requested a review from vikin91 October 21, 2022 13:03
Comment on lines -115 to -122
client, config, err := initializeStream(ctx, cli)
if err != nil {
if ctx.Err() != nil {
// continue and the <-ctx.Done() path should be taken next iteration
continue
}
log.Fatalf("error initializing stream to sensor: %v", err)
}
Contributor

I wonder why this was put into the default case in the first place 🤔 Keeping it here, the initialization would have been called one or more times; now we call it only once. Not sure whether this is a problem, but we should keep an eye on the stability of the stream.

For example: would we be able to re-establish the connection if we crash Sensor while Compliance is still working? That is something to check.

Contributor

Yes, the crash of sensor causes the connection to break and there is no self-healing. We need to fix that by reverting the change and merging the two loops by adding a new case to the for-switch loop.

Contributor

It seems to me that the code before the change would shut down Compliance when the connection to Sensor is lost.
Piotr, in your experiment, did you check the behavior of the new or of the unchanged code?

Contributor

@vikin91 vikin91 Oct 25, 2022

I checked the behavior after the change that I commented on. I also saw the fatal message, but I couldn't see Compliance crashing after Sensor was gone. I guess we would need to check this against the original code from before this PR.

Contributor Author

I have reverted to the original code and also implemented a similar mechanic for the NodeScan loop.
I tested this by killing the sensor pod while compliance is running and was able to verify that both loops connect to (the newly created) sensor again.

Contributor

@msugakov msugakov left a comment

Quick and shallow check, did not go through all code.

Comment on lines +204 to +206
if err != nil {
return errors.Wrap(err, "error scanning node")
}
Contributor

I don't think there's a need to wrap error another time (see the calling code)

Suggested change
if err != nil {
return errors.Wrap(err, "error scanning node")
}
if err != nil {
return err
}
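
The point about double wrapping can be seen in a minimal stand-alone sketch. It uses the standard library's %w wrapping rather than the pkg/errors package from the diff, and all names are illustrative; the effect is the same: when the calling code already adds the same prefix, wrapping at both levels just repeats it.

```go
package main

import (
	"errors"
	"fmt"
)

func scan() error { return errors.New("disk read failed") }

func scanNode() error {
	if err := scan(); err != nil {
		// wrap once here...
		return fmt.Errorf("error scanning node: %w", err)
	}
	return nil
}

func caller() error {
	if err := scanNode(); err != nil {
		// ...wrapping again at the call site only repeats the prefix
		return fmt.Errorf("error scanning node: %w", err)
	}
	return nil
}

func main() {
	// prints: error scanning node: error scanning node: disk read failed
	fmt.Println(caller())
}
```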

func VerifyAndUpdateDuration(duration time.Duration) time.Duration {
if (duration) <= 0 {
log.Warn("Negative or zero duration found. Setting to 4 hours.")
return time.Hour * 4
Contributor

time.Hour * 4 is very specific to the node rescan interval, so this function cannot be presented as a generic "util". I think you should simply define it as GetNodeRescanInterval in pkg/env/node_scan.go. Then, instead of doing this

rescanInterval := complianceUtils.VerifyAndUpdateDuration(env.NodeRescanInterval.DurationSetting())

you can simply do this

rescanInterval := env.GetNodeRescanInterval()

Contributor

I don't think we need to worry about this because DurationSetting() already handles this:

// DurationSetting returns the Duration object represented by the environment variable
func (d *DurationSetting) DurationSetting() time.Duration {
	val := os.Getenv(d.envVar)
	if val != "" {
		dur, err := time.ParseDuration(val)
		if err == nil {
			return dur
		}
		log.Warnf("%s is not a valid environment variable for %s, using default value: %s", val, d.envVar, d.defaultDuration.String())
	}
	return d.defaultDuration
}

That being said, in another PR, we should probably add a panic in the registerDurationSetting if the default is not valid

Contributor

Ah, actually time.ParseDuration does allow negative durations, so DurationSetting() does not handle this case. Perhaps we should include that in the PR above ^

Contributor Author

Thanks for linking the PR! We'll keep an eye on it and update our code as soon as your PR is merged.
For now, though, we want to get this PR merged rather quickly.

}
if err := runRecv(ctx, client, config); err != nil {
log.Errorf("error running recv: %v", err)
}
Contributor

@msugakov msugakov Oct 25, 2022

@vikin91 @Maddosaurus
Unless I'm reading the code incorrectly, the original manageStream function does not work the way it is supposed to.
I think its idea is to check whether the connection got closed (ctx.Done()) and in that case shut down the process (sig.Signal()), or (if the connection is still alive) receive a message from Sensor and respond (runRecv()). The problem is that runRecv() is a blocking function that runs an infinite loop of receive-reply logic. Once the code gets to it, it will run forever until an error happens or an unsupported message arrives.
While the code is in runRecv(), the shutdown logic of manageStream won't be executed normally.
Am I right, or am I missing something?

Maybe we should check other places like Sensor or Central for how the shutdown-receive-send logic is organized there and copy that logic here. It seems to me that we're having difficulties plugging in the scanner logic because the existing approach in Compliance was not designed for extensibility.

Contributor

I definitely want to look at sensor and see how it is done there.

For Compliance, I see that client.Recv() is called on a client that was created with the context (client, err := cli.Communicate(ctx)), so I hope that canceling the context would shut down the client and release the lock.
(I say "hope" because I have not verified this and do not know the internals of gRPC streams.)

I will do some more reading on this.

Contributor

So Sensor utilizes two clever objects to manage the stream to Central: a pointer *grpcUtil.LazyClientConn holding a lazy connection, and a centralclient.CentralConnectionFactory that sets up and monitors the connection and restarts it when it fails. All communication attempts block while the connection is down (no message is lost), and operations resume when the connection is back up.

We could apply this to compliance-sensor as well, but it may deserve a separate refactoring PR. Would it be okay to continue in the current shape and improve this in a separate PR, @msugakov?

Contributor

The code has now changed a lot. The issues from before are gone, but some new things still don't look 100% self-explanatory to me. I think we (I?) need to confidently understand what we are doing. I would not classify "understanding" as "improvement" or "optimization"; "understanding" is our core skill.

Contributor

Yes, sorry for adding so many changes after you started the review. I think they were needed to address some obvious drawbacks regarding sending without being sure that the connection is up.

While there are still some bits about the GRPC stream that I would like to understand better, I feel pretty confident that the code handling sending and receiving in this component follows the same principles as in other parts of our code.

Comment on lines +150 to +152
continue
}
t.Reset(rescanInterval)
Contributor

This could be an else instead of the continue

@vikin91
Contributor

vikin91 commented Oct 25, 2022

I also wanted to share my variant of the gRPC handling (rewritten to use channels): #3548 (a PR against this branch).

Let's pick this PR or #3548 as a short-term solution (they both suffer from the same drawback: a new sub-client will be generated for communication every 4 hours). In the long term, we could aim at a solution similar to the one used between Sensor and Central.

ScanTime: timestamp.TimestampNow(),
Components: &scannerV1.Components{
Namespace: "Testme OS",
OsComponents: []*scannerV1.OSComponent{
Contributor

Since we only care about RHCOS at this time, I'd leave this empty/nil

Version: "2:7.4.629-6.el7.x86_64",
Arch: "x86_64",
Module: "FakeMod",
Cpes: []string{"cpe:/a:redhat:enterprise_linux:7::baseos"},
Contributor

Not very important, but maybe use 8 instead of 7 due to RHCOS

RhelComponents: []*scannerV1.RHELComponent{
{
Name: "vim-minimal",
Namespace: "rhel:7",
Contributor

rhel:8 here as well

Module: "FakeMod",
Cpes: []string{"cpe:/a:redhat:enterprise_linux:7::baseos"},
AddedBy: "FakeLayer",
Executables: []*scannerV1.Executable{
Contributor

let's ignore executables as well since this is for Active Vuln mgmt, and I don't think we plan on having that in the first iteration


case <-ctx.Done():
return
case <-t.C:
client, _, err := initializeStream(ctx, cli)
Contributor

I wonder if we should throw this inside of manageStream so we can just reuse the same stream and use the same stop signals

Contributor

We tried this, and it is not trivial because runRecv blocks. I have a variation on this in #3548, but it also uses two loops: one for sending and one for receiving.

Contributor

In a35000a, we applied the idea from #3548, and we now have separate loops for sending and receiving.

@Maddosaurus Maddosaurus force-pushed the mm/ROX-12834-fake-nodescan branch from b10db84 to 707ccf2 Compare October 27, 2022 15:30
@Maddosaurus
Contributor Author

I have rebased the PR for a final time to benefit from the latest CI flake fixes.
I also put in another round of manual tests as described in the PR description to ensure that everything still works after the changes we introduced in the latest commits.

@msugakov msugakov added ci-all-qa-tests Tells CI to run all API tests (not just BAT). ci-race-tests Uses a `-race` build for all e2e tests labels Oct 28, 2022
@vikin91
Contributor

vikin91 commented Oct 28, 2022

@Maddosaurus shall we have another CI run with the new ci-labels that Misha added?

@vikin91
Contributor

vikin91 commented Oct 28, 2022

/test all

@Maddosaurus Maddosaurus merged commit 9551398 into master Oct 28, 2022
@Maddosaurus Maddosaurus deleted the mm/ROX-12834-fake-nodescan branch October 28, 2022 11:35
Comment on lines +117 to +124
client, _, err := initializeStream(ctx, cli)
if err != nil && ctx.Err() == nil {
// error even after retries
log.Fatalf("unable to establish send stream to sensor: %v", err)
}
if err := client.Send(sc); err != nil {
log.Errorf("failed sending nodeScanV2 to sensor: %v", err)
}
Contributor

I read up about gRPC yesterday and got a bit more dangerous ;-) although still not very educated.

And I'm definitely here to spoil the party that started (even though I haven't used the opportunity to do that earlier). It seems this code will open a stream each time there's something to send. I think gRPC maintains a pool of connections on the client, plus streams get multiplexed over HTTP/2 (but only when that's used), so opening a stream doesn't mean establishing a new connection, but I think it still has some overhead.

What is worse, I believe we reuse the same rpc Communicate as ordinary Compliance does. An effect of that might be that Sensor will push its messages over that stream too, but on the Compliance side nothing is receiving and acting on them for this stream. In case Sensor pushes a message to perform compliance checks (message TriggerRun) through this stream, it will get lost.

From looking at the Sensor code, it seems to me that the more recent connection (actually stream) replaces the previous one in the map keyed by node name. Therefore, I strongly suspect that what I described actually happens if the feature flag for node scans is turned on, i.e. compliance checks stop working properly.

I don't think it's an issue for now as the feature is gated but I think the logic should be reworked before we are ready to release.

WDYT?

Contributor

That is a very good find. We should definitely improve this before releasing - it deserves a ticket.

Applying the connection factory with a shared pointer to the stream (as seen in Sensor-Central) should fix the problem you described, but I am not saying it is the only viable solution. I will open a ticket.

Contributor

A thought about this:

In case Sensor pushes a message to perform Compliance checks (message TriggerRun) through this stream, it will get lost.

we may close the stream immediately after sending the scan results to notify sensor that this connection should not be used anymore.

Contributor

@msugakov msugakov Nov 1, 2022

we may close the stream immediately after sending the scan results to notify sensor that this connection should not be used anymore.

I'm afraid the older active connection established by the regular Compliance logic may not get restored in Sensor's connection map. It deserves some investigation.

I think the problem is that currently Sensor first calls Compliance to do something (after the initial handshake), whereas in our new flow Compliance will first call Sensor. I don't think gRPC bidirectional streams allow receiving and sending at the same time (i.e. full duplex). As a peer in the stream you can either receive or send (or do nothing), but not receive and send concurrently. Therefore the existing stream can only be reused if Sensor initiates scans (going back to one of your proposals, Piotr). We will need a different rpc if we instead want Compliance to initiate sending scan results (and Sensor to ack or nack them).
This may be confusing to read; it would probably be more efficient to show what I mean in code.

I think, your idea about Sensors initiating scans deserves to be reconsidered.

Contributor

As a peer in the stream you can either receive or send (or do nothing) but not receive and send concurrently.

This would be a limitation. I got the impression from Sensor<->Central that sending and receiving work concurrently, but I didn't confirm this experimentally.

Having a separate rpc in the proto file would also be an option (maybe even a much cleaner one), but I do not think it is necessary. The way we initiate the stream clients gives us the possibility to send without depending on whether something was received (it is not in the final PR, but we had this variant briefly in the branch).

I am thinking how to design a manual test to confirm or deny our theories 🤔

Contributor

Created https://issues.redhat.com/browse/ROX-13383 to not forget about the issue we are discussing here.

Contributor

@msugakov msugakov Nov 1, 2022

It turns out I was wrong about the lack of duplex support. Looking at the Sensor code, one stream is used for both sending and receiving concurrently.

s.receiver.Start(stream, s.Stop, s.sender.Stop)
s.sender.Start(stream, s.Stop, s.receiver.Stop)
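
The full-duplex pattern can be demonstrated without gRPC at all: two goroutines drive Send and Recv on the same stream object concurrently, mirroring how Sensor starts its receiver and sender on one stream. This is a toy stand-in with channels, not the actual gRPC stream; all names are illustrative.

```go
package main

import (
	"fmt"
	"sync"
)

// stream is a toy stand-in for a bidirectional gRPC stream: Send and Recv
// can safely be driven by separate goroutines at the same time.
type stream struct {
	toPeer   chan string
	fromPeer chan string
}

func (s *stream) Send(m string) { s.toPeer <- m }
func (s *stream) Recv() string  { return <-s.fromPeer }

func main() {
	s := &stream{toPeer: make(chan string), fromPeer: make(chan string)}

	// Fake peer: acknowledges every message it receives.
	go func() {
		for m := range s.toPeer {
			s.fromPeer <- "ack:" + m
		}
	}()

	var wg sync.WaitGroup
	wg.Add(2)
	// Sender loop and receiver loop run concurrently on the same stream.
	go func() {
		defer wg.Done()
		s.Send("nodescan-1")
		s.Send("nodescan-2")
	}()
	go func() {
		defer wg.Done()
		fmt.Println(s.Recv())
		fmt.Println(s.Recv())
	}()
	wg.Wait()
}
```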

Contributor Author

@Maddosaurus Maddosaurus Nov 2, 2022

In case Sensor pushes a message to perform Compliance checks (message TriggerRun) through this stream, it will get lost. (@msugakov)

A single manual test is nothing to go by, but I was able to successfully run a compliance check with the FF turned on, so it might be that the message isn't necessarily lost.
But we should definitely investigate this further! Thanks for creating the ticket @vikin91


Labels

area/sensor ci-all-qa-tests Tells CI to run all API tests (not just BAT). ci-race-tests Uses a `-race` build for all e2e tests team/maple
