ROX-8969: Add RetryTicker type for running a function periodically with retries by juanrh · Pull Request #350 · stackrox/stackrox

juanrh · 2022-01-18T18:28:23Z

Description

Create a RetryTicker type for running a function periodically with retries.
This is a part of ROX-8752 for the client side part of refreshing the local scanner certificates.
This in concerned only with the retry and coordination logic.

This and subsequent PRs are replacing #278, as that PR was getting too long. This is a addressing some suggestions from the comments in that PR, as well as missing features in that PR:

Creating a separate CertManager that is a generic component to handle refreshing certificates for any resource. I renamed this to RetryTicker because the type was getting very generic.
- Determining whether or not the secrets should be refreshed is out of scope for RetryTicker and should be handled by the function that is called on each tick, as part of ROX-9014: Draft TLSIssuer as a SensorComponent and CertificateIssuer #400. Same for secret resource ownership and creation.
Timeout when the certificate issue request is not responded on time.
Retry errors with exponential backoff.

Checklist

Investigated and inspected CI test results
Unit test and regression tests added
Evaluated and added CHANGELOG entry if required
Determined and documented upgrade steps

No CHANGELOG entry or upgrade instructions required.

Testing Performed

Added unit test. Test passing running as go test -count=500 -race -v -p=1 github.com/stackrox/rox/pkg/concurrency -run TestRetryTicker

ghost · 2022-01-18T18:47:31Z

Tag for build #173216 is 3.68.x-146-gd9fa4380a8.

💻 For deploying this image using the dev scripts, run the following first:

export MAIN_IMAGE_TAG='3.68.x-146-gd9fa4380a8'

📦 You can also generate an installation bundle with:

docker run -i --rm stackrox/main:3.68.x-146-gd9fa4380a8 central generate interactive > bundle.zip

🕹️ A roxctl binary artifact can be downloaded from CircleCI.

porridge · 2022-01-20T06:38:33Z

Firs of all, I have to be honest I got lost when trying to understand the responsibilities of this manager. 😟 An interaction diagram could be worth a thousand words, but even a short description of who calls what and when would be extremely helpful.

Secondly, certManagerImpl contains 14 fields. There is research that shows that 7 is about the highest number of things that a typical human brain can hold at a time. So it seems like this should be decomposed further.

Lastly, we already talked about this offline, but mentioning it here FTR: CertManager could be confused with https://cert-manager.io/ so perhaps CertRefreshManager or CertRefresher? Though I admit I might be getting the purpose wrong as mentioned in the first comment.

SimonBaeumer · 2022-01-20T08:16:55Z

sensor/kubernetes/certificates/cert_manager.go

What do you mean with kept constant?

I mean that these would be final or const if this was Java or C++. Because this is input configuration for the cert manager

SimonBaeumer · 2022-01-20T09:41:32Z

sensor/kubernetes/certificates/cert_manager.go

This goes already into the right direction but think we try to reinvent a bit here :)

The CertManager imho should be only be responsible for managing when a cert should refreshed. It is also responsible for issuing but this should be configurable.

Also the CertManager should manage the certificates independently. In this implementation the refresh timer can't be set for individual certificates which can lead to problems, especially when shared by all secrets.

Ideally the CertManager also does not need to know anything about kubernetes or stackrox internal APIs (like storage.* or the request ID from the Central <> Sensor connection.
Note: Handling the request ID in the CertManager means it knows how to handle bi-directional messaging synchronously, thus adding knowledge about Central<>Sensor communication.

From your implementation the CertManager needs to be splitted for this:

Extract request handling (i.e. everything related to see requestID. The code still need to be used but is passed into the function to be executed

Separate handling cert issue scheduling and registering the certs to have a "job" for every certificate added to the manager with an individual timer

Persisting the certs into k8s should be also done outside of this and configurable (like the strategy you've already implemented but added as an anonymus func to the cert struct)

The interfaces to interact could look like this:

type CertManager interface{ RegisterCertIssuer(issuer CertificateIssue) func() // Preferably and option for the constructor Start() Stop() } type CertRefresher interface { Stop() Start() } type certRefresherImpl struct { nextRefresh time.Duration certificateIssue CertificateIssue } type CertificateIssuer interface { // Handle Central<>Sensor connection here and issue certificate // also persist to Kubernets secret // If successfully issued persist it into the secret. Error is send to an error channel in the manager Issue() func(...) error // Parse certificate and return next issue duration // Could also named `SchedulNextCertIssue` or `GetNextIssuingDuration` or along that lines GetExpiration func() (time.Duration, error) }

To me this looks like we try to invent cron jobs which has already a go implementation here. The CertManager could just warp this for our use-case for certificates.
To me it looks like only an initial timer setting (i.e. after sensor restart) is a bit tricky, but can be solved by either exchanging the job at runtime, always re-issue certs on restart or upstream the feature (forking at first).

note: we have also a concurrent job processor but does not behave like a cron job.

Also the CertManager should manage the certificates independently. In this implementation the refresh timer can't be set for individual certificates which can lead to problems, especially when shared by all secrets.

The idea es that you can have a CertManager handling a single certificate or many, because*storage.TypedServiceCertificateSet can contain certificates for a single service or for many. That way it would work both for the interface based on IssueLocalScannerCertsResponse and IssueLocalScannerCertsRequest that returns 2 certificates, and also for other RCPs that return a single certificate.
In fact we could have 2 CertManager one for Scanner and another for ScannerDB, and passing a CertIssuanceFunc that uses IssueLocalScannerCertsRequest under the hood. That would allow generalising the RPC for IssueLocalScannerCertsResponse and IssueLocalScannerCertsRequest to individual certificates later on, if we want

The CertManager imho should be only be responsible for managing when a cert should refreshed. It is also responsible for issuing but this should be configurable.

I understand you mean that the CertManager shouldn't handle either issue certificate responses (and updating the k8s secrets) or request timeouts. I agree that would simplify this greatly. But that logic should be moved somewhere else, if we move that back to the sensor component TLSIssuer from #278 we'd be in a similar situation because then TLSIssuer would need to handle concurrent requests from 1) a CertManager asking to refresh the certificates, which implies setting a new request id; 2) Central responding with the certificates; 3) a self setup timer to account for requests timeouts, which is basically the 3 cases in loop. And the TLSIssuer would also handle the logic to determine whether or not we should keep these secrets fresh for local scanner, and the k8s secret ownership. But in any case this code is too complex, so I have to come up with something to spread this complexity across additional simpler types.

On the other hand, making CertManager able to handle several independent certificates with several CertRefresher actually increases the complexity, but I agree that makes sense if we move other complexity somewhere else.

pkg/concurrency/retry_ticker.go

pkg/concurrency/retry_ticker_test.go

SimonBaeumer

Only smaller changes, otherwise looks good overall to me!

pkg/concurrency/retry_ticker.go

pkg/concurrency/retry_ticker_test.go

SimonBaeumer · 2022-01-31T10:07:37Z

pkg/concurrency/retry_ticker_test.go

I think you don't need the mock if you already had the mock, instead you created a timeout with polling.
Imho you could use a WaitGroup and waiting for the execution to be finished. By doing that the waitGroup could be added to the testCase struct itself and the function definitions too.
With this the creation of the fixtures is moved outside of the test execution and more declarative.

I understand you mean that m.AssertExpectations(t) is redundant because we are already checking the expectations with the Flags. That right, here I use the mock to have a simple way to control what we return in f, so it is a stub instead of a mock. So I'm removing m.AssertExpectations(t) to make this easier.

Regarding using a waitGroup, we are asserting different times for each flag: the first flag should be completed immediately, to assert that the first tick happens as soon as we start the ticker; the second flag should be completed in the retry time in the case of a failure, or in the time returned by the first call to f in case of a success. AFAIK we cannot express that with a waitGroup because we'd waiting for the counter to get to 0, and we wouldn't distinguish between individual events for the counter decreasing.
However, I agree we could move more stuff to the test case, I've changed the code to:

testCases := map[string]struct { timeToSecondTick time.Duration firstErr error }{ "success": {timeToSecondTick: 2 * capTime, firstErr: nil}, "with error should retry": {timeToSecondTick: 0, firstErr: errors.New("forced")}, }

and I think the test it easier to understand now. Please tell me what you think.

I've also increased timers following misberner's suggestion in #463 that

100ms should be plenty for what is going on, but might be too short for CI, where tests are running in containers with low priority on oversubscribed machines

but please tell me if you think timeouts are too long now

I think Malte's point is fine, did not considered overbooked CI machines. Good that you pointed it out!

Regarding using a waitGroup, we are asserting different times for each flag: the first flag should be completed immediately, to assert that the first tick happens as soon as we start the ticker; the second flag should be completed in the retry time in the case of a failure, or in the time returned by the first call to f in case of a success. AFAIK we cannot express that with a waitGroup because we'd waiting for the counter to get to 0, and we wouldn't distinguish between individual events for the counter decreasing.

The test improved a lot, I understood it now. Added a comment on it.

porridge

This seems to still be in flux, please ping me for a final review pass once you're done addressing Simon's comments.

pkg/concurrency/retry_ticker.go

porridge · 2022-01-31T10:44:32Z

pkg/concurrency/retry_ticker.go

Please mention what is the relation between this method returning and a potential tick function execution time.

As this is based on time.AfterFunc this should inherit the documented behaviour. I'll add a comment along those lines

pkg/concurrency/retry_ticker_test.go

SimonBaeumer · 2022-01-31T21:36:56Z

pkg/concurrency/retry_ticker_test.go

Sorry for the confusion from my comments, I think I now understood it.

Here it looks like the test should assert the exponential backoff step duration but instead uses the duration as a timeout for polling.
From my understanding the done flag should be set after the expectTimeToSecondAttempt is expired + a timeout.

But to be clear, I think this assertion might not be necessary, instead we can validate if the backoff.Step() was called and assume that the library works.
Interesting here is that our function is called twice and allows us to configure the next execution manually.

I tried asserting on backoff.Step() but I cannot mock wait.backoff because it is a struct, not an interface. So I ended up making time.AfterFunc a field of retryTickerImpl so we can inject a mock during testing to act as a spy. I also made the test faster by signalling only in the second call to doTick and asserting based on the times passed to the mock for time.AfterFunc. Please et me know what you think

SimonBaeumer · 2022-01-31T21:43:27Z

pkg/concurrency/retry_ticker_test.go

Regarding using a waitGroup, we are asserting different times for each flag: the first flag should be completed immediately, to assert that the first tick happens as soon as we start the ticker; the second flag should be completed in the retry time in the case of a failure, or in the time returned by the first call to f in case of a success. AFAIK we cannot express that with a waitGroup because we'd waiting for the counter to get to 0, and we wouldn't distinguish between individual events for the counter decreasing.

The test improved a lot, I understood it now. Added a comment on it.

SimonBaeumer

Please revisit the test assertions for polling the next intervals and naming of the ticker func, otherwise it looks very good overall!

Type to make requests to a retryable source that has an asynchronous interface, with timeout per request, and configurable backoff

it is called from several goroutines in Start, Stop, and the goroutines created by time.AfterFunc

in order to prevent concurrent modifications after Start

juanrh · 2022-02-03T08:46:53Z

just rebased from master to fix CI build issue

SimonBaeumer

Good job!
I noticed that the ticker can be started twice, from the code it looks like we should prevent this. After fixing it can be 🚢 !

pkg/concurrency/retry_ticker.go

pkg/concurrency/retry_ticker_test.go

- starting an started ticker leads to timer inteferences - starting an stopped ticker is not safe even in the same goroutine because a call to scheduleTick from a first Start can continue after a second Start is called, and that leads to concurrent modifications of t.backoff Also fix memory leak in test due to non stopped tickers

NewRetryTicker already returns a started ticker. That ensures tickers can only be started once

This reverts commit b6f69a5.

…th retries (#350)

juanrh requested a review from SimonBaeumer January 18, 2022 18:28

github-actions bot added the area/sensor label Jan 18, 2022

juanrh mentioned this pull request Jan 18, 2022

ROX-8752: Keep local scanner certificates up to date #278

Closed

4 tasks

juanrh requested a review from porridge January 19, 2022 13:38

SimonBaeumer suggested changes Jan 20, 2022

View reviewed changes

juanrh changed the base branch from master to juanrh/ROX-9022 January 21, 2022 14:08

juanrh changed the title ~~ROX-8969: Skeleton of CertManager component to keep TLS certificates updated~~ ROX-8969: Skeleton of CertRefresher component to keep TLS certificates updated Jan 21, 2022

juanrh force-pushed the juanrh/ROX-8969 branch from bf6b48d to ebab0f7 Compare January 21, 2022 14:15

juanrh mentioned this pull request Jan 24, 2022

ROX-9022: add type to make requests to a retryable source with asynchronous interface #383

Closed

4 tasks

juanrh changed the base branch from juanrh/ROX-9022 to master January 25, 2022 17:06

github-actions bot removed the area/sensor label Jan 25, 2022

juanrh mentioned this pull request Jan 25, 2022

ROX-9014: Draft TLSIssuer as a SensorComponent and CertificateIssuer #400

Closed

4 tasks

juanrh changed the title ~~ROX-8969: Skeleton of CertRefresher component to keep TLS certificates updated~~ ROX-8969: Add RetryTicker type for running a function periodically with retries Jan 25, 2022

juanrh marked this pull request as ready for review January 25, 2022 19:15

porridge reviewed Jan 26, 2022

View reviewed changes

pkg/concurrency/retry_ticker.go Outdated Show resolved Hide resolved

pkg/concurrency/retry_ticker.go Outdated Show resolved Hide resolved

pkg/concurrency/retry_ticker_test.go Outdated Show resolved Hide resolved

pkg/concurrency/retry_ticker_test.go Outdated Show resolved Hide resolved

juanrh requested review from SimonBaeumer and porridge January 26, 2022 13:21

SimonBaeumer suggested changes Jan 27, 2022

View reviewed changes

juanrh requested a review from SimonBaeumer January 27, 2022 16:23

SimonBaeumer suggested changes Jan 31, 2022

View reviewed changes

porridge reviewed Jan 31, 2022

View reviewed changes

juanrh requested a review from SimonBaeumer January 31, 2022 13:37

SimonBaeumer reviewed Jan 31, 2022

View reviewed changes

SimonBaeumer approved these changes Jan 31, 2022

View reviewed changes

juanrh requested a review from SimonBaeumer February 1, 2022 11:57

Juan Rodriguez Hortala added 3 commits February 3, 2022 09:43

First version of RetryableSourceRetriever

681334b

Type to make requests to a retryable source that has an asynchronous interface, with timeout per request, and configurable backoff

fix style

b801d57

add error handler and validator

b8def21

Juan Rodriguez Hortala added 17 commits February 3, 2022 09:43

fix style

f0a9c32

cleanup

88f5254

make setTickTimer concurrently safe

fca9cd2

it is called from several goroutines in Start, Stop, and the goroutines created by time.AfterFunc

simplify test dropping suite

773bbfb

use builder pattern and make event handlers private

0f85a0b

in order to prevent concurrent modifications after Start

use import from pkg/sync

100099f

remove handler functions and builder

42af3b5

use terser field names

2c0954a

cleanup code lingering from previous change

62da8c7

rename backoffPrototype to initialBackoff

f216984

cleanup test and make it easier to read

aab2d51

try to make test more readable moving config to test case struct

5e80465

improve comments

905ba26

rename mock f function as "doTick"

8ce6819

make test runs faster by reducing waits

0969569

fix bug where stopped timer was restarted by scheduleTick

b4626cf

reset backoff on start

7f8d96e

juanrh force-pushed the juanrh/ROX-8969 branch from 265f5a1 to 7f8d96e Compare February 3, 2022 08:46

SimonBaeumer suggested changes Feb 3, 2022

View reviewed changes

Juan Rodriguez Hortala added 2 commits February 3, 2022 12:39

improve comments

262d3c0

juanrh requested a review from SimonBaeumer February 3, 2022 14:30

simplify ticker interface by removing Start method

b6f69a5

NewRetryTicker already returns a started ticker. That ensures tickers can only be started once

SimonBaeumer approved these changes Feb 3, 2022

View reviewed changes

Revert "simplify ticker interface by removing Start method"

d9fa438

This reverts commit b6f69a5.

SimonBaeumer approved these changes Feb 3, 2022

View reviewed changes

juanrh merged commit d370ad2 into master Feb 3, 2022

juanrh deleted the juanrh/ROX-8969 branch February 3, 2022 15:21

RTann pushed a commit that referenced this pull request Apr 6, 2022

ROX-8969: Add RetryTicker type for running a function periodically wi…

2c740a5

…th retries (#350)

Conversation

juanrh commented Jan 18, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Checklist

Testing Performed

Uh oh!

ghost commented Jan 18, 2022 • edited by ghost Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

porridge commented Jan 20, 2022

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

juanrh Jan 20, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

juanrh Jan 20, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

SimonBaeumer left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

porridge left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SimonBaeumer left a comment

Choose a reason for hiding this comment

Uh oh!

juanrh commented Feb 3, 2022

Uh oh!

SimonBaeumer left a comment

Choose a reason for hiding this comment

Uh oh!

juanrh commented Jan 18, 2022 •

edited

Loading

ghost commented Jan 18, 2022 •

edited by ghost

Loading

juanrh Jan 20, 2022 •

edited

Loading

juanrh Jan 20, 2022 •

edited

Loading