Skip to content

ROX-8969: Add RetryTicker type for running a function periodically with retries#350

Merged
juanrh merged 34 commits intomasterfrom
juanrh/ROX-8969
Feb 3, 2022
Merged

ROX-8969: Add RetryTicker type for running a function periodically with retries#350
juanrh merged 34 commits intomasterfrom
juanrh/ROX-8969

Conversation

@juanrh
Copy link
Contributor

@juanrh juanrh commented Jan 18, 2022

Description

Create a RetryTicker type for running a function periodically with retries.
This is a part of ROX-8752 for the client side part of refreshing the local scanner certificates.
This in concerned only with the retry and coordination logic.

This and subsequent PRs are replacing #278, as that PR was getting too long. This is a addressing some suggestions from the comments in that PR, as well as missing features in that PR:

  • Creating a separate CertManager that is a generic component to handle refreshing certificates for any resource. I renamed this to RetryTicker because the type was getting very generic.
  • Timeout when the certificate issue request is not responded on time.
  • Retry errors with exponential backoff.

Checklist

  • Investigated and inspected CI test results
  • Unit test and regression tests added
  • Evaluated and added CHANGELOG entry if required
  • Determined and documented upgrade steps

No CHANGELOG entry or upgrade instructions required.

Testing Performed

Added unit test. Test passing running as go test -count=500 -race -v -p=1 github.com/stackrox/rox/pkg/concurrency -run TestRetryTicker

@ghost
Copy link

ghost commented Jan 18, 2022

Tag for build #173216 is 3.68.x-146-gd9fa4380a8.

💻 For deploying this image using the dev scripts, run the following first:

export MAIN_IMAGE_TAG='3.68.x-146-gd9fa4380a8'

📦 You can also generate an installation bundle with:

docker run -i --rm stackrox/main:3.68.x-146-gd9fa4380a8 central generate interactive > bundle.zip

🕹️ A roxctl binary artifact can be downloaded from CircleCI.

@juanrh juanrh requested a review from porridge January 19, 2022 13:38
@porridge
Copy link
Contributor

Firs of all, I have to be honest I got lost when trying to understand the responsibilities of this manager. 😟 An interaction diagram could be worth a thousand words, but even a short description of who calls what and when would be extremely helpful.

Secondly, certManagerImpl contains 14 fields. There is research that shows that 7 is about the highest number of things that a typical human brain can hold at a time. So it seems like this should be decomposed further.

Lastly, we already talked about this offline, but mentioning it here FTR: CertManager could be confused with https://cert-manager.io/ so perhaps CertRefreshManager or CertRefresher? Though I admit I might be getting the purpose wrong as mentioned in the first comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean with kept constant?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean that these would be final or const if this was Java or C++. Because this is input configuration for the cert manager

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This goes already into the right direction but think we try to reinvent a bit here :)

The CertManager imho should be only be responsible for managing when a cert should refreshed. It is also responsible for issuing but this should be configurable.

Also the CertManager should manage the certificates independently. In this implementation the refresh timer can't be set for individual certificates which can lead to problems, especially when shared by all secrets.

Ideally the CertManager also does not need to know anything about kubernetes or stackrox internal APIs (like storage.* or the request ID from the Central <> Sensor connection.
Note: Handling the request ID in the CertManager means it knows how to handle bi-directional messaging synchronously, thus adding knowledge about Central<>Sensor communication.

From your implementation the CertManager needs to be splitted for this:

  • Extract request handling (i.e. everything related to see requestID. The code still need to be used but is passed into the function to be executed
  • Separate handling cert issue scheduling and registering the certs to have a "job" for every certificate added to the manager with an individual timer
  • Persisting the certs into k8s should be also done outside of this and configurable (like the strategy you've already implemented but added as an anonymus func to the cert struct)

The interfaces to interact could look like this:

type CertManager interface{
  RegisterCertIssuer(issuer CertificateIssue) func() // Preferably and option for the constructor
  Start()
  Stop()
}

type CertRefresher interface {
  Stop()
  Start()
}

type certRefresherImpl struct {
  nextRefresh time.Duration
  certificateIssue CertificateIssue
}

type CertificateIssuer interface {
  // Handle Central<>Sensor connection here and issue certificate
  // also persist to Kubernets secret
  // If successfully issued persist it into the secret. Error is send to an error channel in the manager
  Issue() func(...) error
  
  // Parse certificate and return next issue duration
  // Could also named `SchedulNextCertIssue` or `GetNextIssuingDuration` or along that lines
  GetExpiration func() (time.Duration, error)
}

To me this looks like we try to invent cron jobs which has already a go implementation here. The CertManager could just warp this for our use-case for certificates.
To me it looks like only an initial timer setting (i.e. after sensor restart) is a bit tricky, but can be solved by either exchanging the job at runtime, always re-issue certs on restart or upstream the feature (forking at first).

note: we have also a concurrent job processor but does not behave like a cron job.

Copy link
Contributor Author

@juanrh juanrh Jan 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also the CertManager should manage the certificates independently. In this implementation the refresh timer can't be set for individual certificates which can lead to problems, especially when shared by all secrets.

The idea es that you can have a CertManager handling a single certificate or many, because*storage.TypedServiceCertificateSet can contain certificates for a single service or for many. That way it would work both for the interface based on IssueLocalScannerCertsResponse and IssueLocalScannerCertsRequest that returns 2 certificates, and also for other RCPs that return a single certificate.
In fact we could have 2 CertManager one for Scanner and another for ScannerDB, and passing a CertIssuanceFunc that uses IssueLocalScannerCertsRequest under the hood. That would allow generalising the RPC for IssueLocalScannerCertsResponse and IssueLocalScannerCertsRequest to individual certificates later on, if we want

Copy link
Contributor Author

@juanrh juanrh Jan 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CertManager imho should be only be responsible for managing when a cert should refreshed. It is also responsible for issuing but this should be configurable.

I understand you mean that the CertManager shouldn't handle either issue certificate responses (and updating the k8s secrets) or request timeouts. I agree that would simplify this greatly. But that logic should be moved somewhere else, if we move that back to the sensor component TLSIssuer from #278 we'd be in a similar situation because then TLSIssuer would need to handle concurrent requests from 1) a CertManager asking to refresh the certificates, which implies setting a new request id; 2) Central responding with the certificates; 3) a self setup timer to account for requests timeouts, which is basically the 3 cases in loop. And the TLSIssuer would also handle the logic to determine whether or not we should keep these secrets fresh for local scanner, and the k8s secret ownership. But in any case this code is too complex, so I have to come up with something to spread this complexity across additional simpler types.

On the other hand, making CertManager able to handle several independent certificates with several CertRefresher actually increases the complexity, but I agree that makes sense if we move other complexity somewhere else.

@juanrh juanrh changed the base branch from master to juanrh/ROX-9022 January 21, 2022 14:08
@juanrh juanrh changed the title ROX-8969: Skeleton of CertManager component to keep TLS certificates updated ROX-8969: Skeleton of CertRefresher component to keep TLS certificates updated Jan 21, 2022
@juanrh juanrh changed the base branch from juanrh/ROX-9022 to master January 25, 2022 17:06
@juanrh juanrh changed the title ROX-8969: Skeleton of CertRefresher component to keep TLS certificates updated ROX-8969: Add RetryTicker type for running a function periodically with retries Jan 25, 2022
@juanrh juanrh marked this pull request as ready for review January 25, 2022 19:15
Copy link
Contributor

@SimonBaeumer SimonBaeumer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only smaller changes, otherwise looks good overall to me!

@juanrh juanrh requested a review from SimonBaeumer January 27, 2022 16:23
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you don't need the mock if you already had the mock, instead you created a timeout with polling.
Imho you could use a WaitGroup and waiting for the execution to be finished. By doing that the waitGroup could be added to the testCase struct itself and the function definitions too.
With this the creation of the fixtures is moved outside of the test execution and more declarative.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand you mean that m.AssertExpectations(t) is redundant because we are already checking the expectations with the Flags. That right, here I use the mock to have a simple way to control what we return in f, so it is a stub instead of a mock. So I'm removing m.AssertExpectations(t) to make this easier.

Regarding using a waitGroup, we are asserting different times for each flag: the first flag should be completed immediately, to assert that the first tick happens as soon as we start the ticker; the second flag should be completed in the retry time in the case of a failure, or in the time returned by the first call to f in case of a success. AFAIK we cannot express that with a waitGroup because we'd waiting for the counter to get to 0, and we wouldn't distinguish between individual events for the counter decreasing.
However, I agree we could move more stuff to the test case, I've changed the code to:

	testCases := map[string]struct {
		timeToSecondTick time.Duration
		firstErr         error
	}{
		"success":  {timeToSecondTick: 2 * capTime, firstErr: nil},
		"with error should retry": {timeToSecondTick: 0, firstErr: errors.New("forced")},
	}

and I think the test it easier to understand now. Please tell me what you think.

I've also increased timers following misberner's suggestion in #463 that

100ms should be plenty for what is going on, but might be too short for CI, where tests are running in containers with low priority on oversubscribed machines

but please tell me if you think timeouts are too long now

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Malte's point is fine, did not considered overbooked CI machines. Good that you pointed it out!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding using a waitGroup, we are asserting different times for each flag: the first flag should be completed immediately, to assert that the first tick happens as soon as we start the ticker; the second flag should be completed in the retry time in the case of a failure, or in the time returned by the first call to f in case of a success. AFAIK we cannot express that with a waitGroup because we'd waiting for the counter to get to 0, and we wouldn't distinguish between individual events for the counter decreasing.

The test improved a lot, I understood it now. Added a comment on it.

Copy link
Contributor

@porridge porridge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to still be in flux, please ping me for a final review pass once you're done addressing Simon's comments.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please mention what is the relation between this method returning and a potential tick function execution time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this is based on time.AfterFunc this should inherit the documented behaviour. I'll add a comment along those lines

@juanrh juanrh requested a review from SimonBaeumer January 31, 2022 13:37
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the confusion from my comments, I think I now understood it.

Here it looks like the test should assert the exponential backoff step duration but instead uses the duration as a timeout for polling.
From my understanding the done flag should be set after the expectTimeToSecondAttempt is expired + a timeout.

But to be clear, I think this assertion might not be necessary, instead we can validate if the backoff.Step() was called and assume that the library works.
Interesting here is that our function is called twice and allows us to configure the next execution manually.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried asserting on backoff.Step() but I cannot mock wait.backoff because it is a struct, not an interface. So I ended up making time.AfterFunc a field of retryTickerImpl so we can inject a mock during testing to act as a spy. I also made the test faster by signalling only in the second call to doTick and asserting based on the times passed to the mock for time.AfterFunc. Please et me know what you think

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding using a waitGroup, we are asserting different times for each flag: the first flag should be completed immediately, to assert that the first tick happens as soon as we start the ticker; the second flag should be completed in the retry time in the case of a failure, or in the time returned by the first call to f in case of a success. AFAIK we cannot express that with a waitGroup because we'd waiting for the counter to get to 0, and we wouldn't distinguish between individual events for the counter decreasing.

The test improved a lot, I understood it now. Added a comment on it.

Copy link
Contributor

@SimonBaeumer SimonBaeumer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please revisit the test assertions for polling the next intervals and naming of the ticker func, otherwise it looks very good overall!

@juanrh juanrh requested a review from SimonBaeumer February 1, 2022 11:57
Juan Rodriguez Hortala added 3 commits February 3, 2022 09:43
Type to make requests to a retryable source that has an
asynchronous interface, with timeout per request, and
configurable backoff
@juanrh
Copy link
Contributor Author

juanrh commented Feb 3, 2022

just rebased from master to fix CI build issue

Copy link
Contributor

@SimonBaeumer SimonBaeumer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job!
I noticed that the ticker can be started twice, from the code it looks like we should prevent this. After fixing it can be 🚢 !

Juan Rodriguez Hortala added 2 commits February 3, 2022 12:39
- starting an started ticker leads to timer inteferences
- starting an stopped ticker is not safe even in the same goroutine
because a call to scheduleTick from a first Start can continue
after a second Start is called, and that leads to concurrent
modifications of t.backoff

Also fix memory leak in test due to non stopped tickers
@juanrh juanrh requested a review from SimonBaeumer February 3, 2022 14:30
NewRetryTicker already returns a started ticker. That ensures
tickers can only be started once
@juanrh juanrh merged commit d370ad2 into master Feb 3, 2022
@juanrh juanrh deleted the juanrh/ROX-8969 branch February 3, 2022 15:21
RTann pushed a commit that referenced this pull request Apr 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants