ROX-8969: Add RetryTicker type for running a function periodically with retries#350
Conversation
Tag for build #173216 is `3.68.x-146-gd9fa4380a8`.
💻 For deploying this image using the dev scripts, run the following first: `export MAIN_IMAGE_TAG='3.68.x-146-gd9fa4380a8'`
📦 You can also generate an installation bundle with: `docker run -i --rm stackrox/main:3.68.x-146-gd9fa4380a8 central generate interactive > bundle.zip`
First of all, I have to be honest: I got lost when trying to understand the responsibilities of this manager. 😟 An interaction diagram could be worth a thousand words, but even a short description of who calls what and when would be extremely helpful. Lastly, we already talked about this offline, but mentioning it here FTR:
What do you mean by "kept constant"?
I mean that these would be final or const if this were Java or C++, because this is input configuration for the cert manager.
This is already going in the right direction, but I think we are reinventing things a bit here :)
The CertManager imho should only be responsible for managing when a cert should be refreshed. It is also responsible for issuing, but this should be configurable.
Also, the CertManager should manage the certificates independently. In this implementation the refresh timer can't be set for individual certificates, which can lead to problems, especially when the timer is shared by all secrets.
Ideally, the CertManager also does not need to know anything about Kubernetes or stackrox-internal APIs (like `storage.*` or the request ID from the Central <> Sensor connection).
Note: handling the request ID in the CertManager means it knows how to handle bi-directional messaging synchronously, thus adding knowledge about Central<>Sensor communication.
For this, your implementation of the CertManager needs to be split:
- Extract request handling (i.e. everything related to `requestID`). The code still needs to be used, but it is passed into the function to be executed.
- Separate the handling of cert issue scheduling from registering the certs, so that there is a "job" for every certificate added to the manager, each with its own timer.
- Persisting the certs into k8s should also be done outside of this and be configurable (like the `strategy` you've already implemented, but added as an anonymous func to the cert struct).
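The "one job per certificate, each with its own timer" point could be sketched like this. This is a minimal illustration, not the PR's actual code; `certJob` and `runJobs` are hypothetical names:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// certJob is a hypothetical per-certificate refresh job: every certificate
// registered with the manager gets its own timer, so one cert's refresh
// interval cannot interfere with another's.
type certJob struct {
	name    string
	refresh time.Duration
	stop    chan struct{}
}

func (j *certJob) run(wg *sync.WaitGroup, out chan<- string) {
	defer wg.Done()
	t := time.NewTimer(j.refresh)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			out <- j.name // the real manager would issue/refresh the cert here
			t.Reset(j.refresh)
		case <-j.stop:
			return
		}
	}
}

// runJobs starts one independent job per certificate, lets them tick for a
// while, and returns how often each one fired.
func runJobs() map[string]int {
	out := make(chan string, 16)
	var wg sync.WaitGroup
	jobs := []*certJob{
		{name: "scanner", refresh: 10 * time.Millisecond, stop: make(chan struct{})},
		{name: "scanner-db", refresh: 30 * time.Millisecond, stop: make(chan struct{})},
	}
	for _, j := range jobs {
		wg.Add(1)
		go j.run(&wg, out)
	}
	time.Sleep(65 * time.Millisecond)
	for _, j := range jobs {
		close(j.stop)
	}
	wg.Wait()
	close(out)
	counts := map[string]int{}
	for n := range out {
		counts[n]++
	}
	return counts
}

func main() {
	counts := runJobs()
	// The job with the shorter interval fires more often: the timers are
	// fully independent.
	fmt.Println(counts["scanner"] > 0, counts["scanner-db"] > 0)
}
```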
The interfaces to interact with could look like this:

```go
type CertManager interface {
	RegisterCertIssuer(issuer CertificateIssuer) func() // preferably an option for the constructor
	Start()
	Stop()
}

type CertRefresher interface {
	Start()
	Stop()
}

type certRefresherImpl struct {
	nextRefresh       time.Duration
	certificateIssuer CertificateIssuer
}

type CertificateIssuer interface {
	// Handle the Central<>Sensor connection here and issue the certificate,
	// also persisting it to a Kubernetes secret.
	// If successfully issued, persist it into the secret. Errors are sent
	// to an error channel in the manager.
	Issue(...) error
	// Parse the certificate and return the next issue duration.
	// Could also be named `ScheduleNextCertIssue` or `GetNextIssuingDuration`,
	// or along those lines.
	GetExpiration() (time.Duration, error)
}
```
To me this looks like we are trying to reinvent cron jobs, which already have a Go implementation. The CertManager could just wrap this for our certificate use case.
To me it looks like only the initial timer setting (i.e. after a sensor restart) is a bit tricky, but this can be solved by either exchanging the job at runtime, always re-issuing certs on restart, or upstreaming the feature (forking at first).
Note: we also have a concurrent job processor, but it does not behave like a cron job.
> Also the CertManager should manage the certificates independently. In this implementation the refresh timer can't be set for individual certificates which can lead to problems, especially when shared by all secrets.
The idea is that you can have a CertManager handling a single certificate or many, because `*storage.TypedServiceCertificateSet` can contain certificates for a single service or for many. That way it would work both for the interface based on IssueLocalScannerCertsResponse and IssueLocalScannerCertsRequest, which returns 2 certificates, and also for other RPCs that return a single certificate.
In fact, we could have 2 CertManagers, one for Scanner and another for ScannerDB, passing a CertIssuanceFunc that uses IssueLocalScannerCertsRequest under the hood. That would allow generalising the RPC for IssueLocalScannerCertsResponse and IssueLocalScannerCertsRequest to individual certificates later on, if we want to.
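The CertIssuanceFunc idea could look roughly like this. This is only a sketch of the injection pattern; the `Certificates` struct stands in for `*storage.TypedServiceCertificateSet`, and `refreshOnce` is a hypothetical helper, not the PR's code:

```go
package main

import "fmt"

// Certificates stands in for *storage.TypedServiceCertificateSet: it may
// hold certs for one service or for many.
type Certificates struct {
	Services []string
}

// CertIssuanceFunc is a pluggable issuance strategy: the same manager code
// can serve Scanner and ScannerDB by swapping the function.
type CertIssuanceFunc func() (Certificates, error)

// refreshOnce shows how a manager would use the injected issuance function
// without knowing which RPC backs it.
func refreshOnce(issue CertIssuanceFunc) ([]string, error) {
	certs, err := issue()
	if err != nil {
		return nil, err
	}
	return certs.Services, nil
}

func main() {
	// One issuance func can return certs for many services...
	both := func() (Certificates, error) {
		return Certificates{Services: []string{"scanner", "scanner-db"}}, nil
	}
	// ...or for a single service, with no change to the manager code.
	single := func() (Certificates, error) {
		return Certificates{Services: []string{"scanner"}}, nil
	}
	a, _ := refreshOnce(both)
	b, _ := refreshOnce(single)
	fmt.Println(len(a), len(b)) // 2 1
}
```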
> The CertManager imho should be only be responsible for managing when a cert should refreshed. It is also responsible for issuing but this should be configurable.
I understand you mean that the CertManager shouldn't handle issue-certificate responses (and updating the k8s secrets) or request timeouts. I agree that would simplify this greatly. But that logic has to be moved somewhere else: if we move it back to the sensor component TLSIssuer from #278, we'd be in a similar situation, because then TLSIssuer would need to handle concurrent requests from 1) a CertManager asking to refresh the certificates, which implies setting a new request id; 2) Central responding with the certificates; and 3) a self-setup timer to account for request timeouts, which is basically the 3 cases in loop. The TLSIssuer would also handle the logic to determine whether or not we should keep these secrets fresh for the local scanner, as well as the k8s secret ownership. In any case this code is too complex, so I have to come up with something to spread this complexity across additional, simpler types.
On the other hand, making CertManager able to handle several independent certificates with several CertRefreshers actually increases the complexity, but I agree it makes sense if we move other complexity somewhere else.
SimonBaeumer
left a comment
Only smaller changes, otherwise looks good overall to me!
pkg/concurrency/retry_ticker_test.go
Outdated
I think you don't need the mock if you already have the flags; instead you created a timeout with polling.
Imho you could use a WaitGroup and wait for the execution to be finished. By doing that, the WaitGroup could be added to the testCase struct itself, and the function definitions too.
With this, the creation of the fixtures is moved outside of the test execution and is more declarative.
I understand you mean that m.AssertExpectations(t) is redundant because we are already checking the expectations with the flags. That's right: here I use the mock just as a simple way to control what we return in f, so it is a stub rather than a mock. So I'm removing m.AssertExpectations(t) to make this easier.
Regarding using a WaitGroup: we are asserting different times for each flag. The first flag should be completed immediately, to assert that the first tick happens as soon as we start the ticker; the second flag should be completed within the retry time in the case of a failure, or within the time returned by the first call to f in the case of success. AFAIK we cannot express that with a WaitGroup, because we'd be waiting for the counter to get to 0, and we wouldn't distinguish between the individual events that decrease the counter.
However, I agree we could move more stuff to the test case; I've changed the code to:

```go
testCases := map[string]struct {
	timeToSecondTick time.Duration
	firstErr         error
}{
	"success":                 {timeToSecondTick: 2 * capTime, firstErr: nil},
	"with error should retry": {timeToSecondTick: 0, firstErr: errors.New("forced")},
}
```

and I think the test is easier to understand now. Please tell me what you think.
I've also increased the timers following misberner's suggestion in #463 that
> 100ms should be plenty for what is going on, but might be too short for CI, where tests are running in containers with low priority on oversubscribed machines

but please tell me if you think the timeouts are too long now.
I think Malte's point is fine; I had not considered overbooked CI machines. Good that you pointed it out!
> Regarding using a waitGroup, we are asserting different times for each flag: the first flag should be completed immediately, to assert that the first tick happens as soon as we start the ticker; the second flag should be completed in the retry time in the case of a failure, or in the time returned by the first call to f in case of a success. AFAIK we cannot express that with a waitGroup because we'd waiting for the counter to get to 0, and we wouldn't distinguish between individual events for the counter decreasing.
The test improved a lot, I understood it now. Added a comment on it.
porridge
left a comment
This seems to still be in flux, please ping me for a final review pass once you're done addressing Simon's comments.
pkg/concurrency/retry_ticker.go
Outdated
Please mention the relation between this method returning and a potential tick function execution time.
As this is based on time.AfterFunc, it should inherit the documented behaviour. I'll add a comment along those lines.
pkg/concurrency/retry_ticker_test.go
Outdated
Sorry for the confusion from my comments, I think I understand it now.
Here it looks like the test should assert the exponential backoff step duration, but instead it uses the duration as a timeout for polling.
From my understanding, the done flag should be set after expectTimeToSecondAttempt has expired, plus a timeout.
But to be clear, I think this assertion might not be necessary; instead we can validate that backoff.Step() was called and assume that the library works.
What is interesting here is that our function is called twice and allows us to configure the next execution manually.
I tried asserting on backoff.Step(), but I cannot mock wait.Backoff because it is a struct, not an interface. So I ended up making time.AfterFunc a field of retryTickerImpl, so we can inject a mock during testing to act as a spy. I also made the test faster by signalling only in the second call to doTick and asserting based on the times passed to the mock for time.AfterFunc. Please let me know what you think.
SimonBaeumer
left a comment
Please revisit the test assertions for polling the next intervals and the naming of the ticker func; otherwise it looks very good overall!
Type to make requests to a retryable source that has an asynchronous interface, with timeout per request, and configurable backoff
it is called from several goroutines in Start, Stop, and the goroutines created by time.AfterFunc, in order to prevent concurrent modifications after Start
just rebased from master to fix the CI build issue
SimonBaeumer
left a comment
Good job!
I noticed that the ticker can be started twice; from the code it looks like we should prevent this. After fixing it, this can be 🚢!
- starting a started ticker leads to timer interferences
- starting a stopped ticker is not safe, even in the same goroutine, because a call to scheduleTick from a first Start can continue after a second Start is called, and that leads to concurrent modifications of t.backoff

Also fix a memory leak in the test due to non-stopped tickers
NewRetryTicker already returns a started ticker. That ensures tickers can only be started once
This reverts commit b6f69a5.
Description
Create a `RetryTicker` type for running a function periodically with retries. This is part of ROX-8752, for the client-side part of refreshing the local scanner certificates.
This is concerned only with the retry and coordination logic.
This and subsequent PRs are replacing #278, as that PR was getting too long. This addresses some suggestions from the comments in that PR, as well as features that were missing in it:
- `CertManager` was a generic component to handle refreshing certificates for any resource. I renamed this to `RetryTicker` because the type was getting very generic.
- The certificate-specific logic is not part of `RetryTicker` and should be handled by the function that is called on each tick, as part of ROX-9014: Draft TLSIssuer as a SensorComponent and CertificateIssuer #400. Same for secret resource ownership and creation.

Checklist
No CHANGELOG entry or upgrade instructions required.
Testing Performed
Added unit test. The test passes when running:

```
go test -count=500 -race -v -p=1 github.com/stackrox/rox/pkg/concurrency -run TestRetryTicker
```