ROX-9148: Create cert refresher for business logic of local scanner certificate refresh#474
ROX-9148: Create cert refresher for business logic of local scanner certificate refresh#474
Conversation
|
Tag for build #216678 is 💻 For deploying this image using the dev scripts, run the following first: export MAIN_IMAGE_TAG='3.68.x-235-g3f4e8b1ec4'📦 You can also generate an installation bundle with: docker run -i --rm stackrox/main:3.68.x-235-g3f4e8b1ec4 central generate interactive > bundle.zip🕹️ A |
porridge
left a comment
There was a problem hiding this comment.
Looks nice so far, looking forward to a version rebased on all the dependent PRs.
There was a problem hiding this comment.
Well, this is a lie in case timeToRefresh was > 0, right?
There was a problem hiding this comment.
timeToRefresh is the time for the next refresh, we just refreshed the certificates in line 85 with r.refreshCertificates(ctx). I'll rename timeToRefresh to timeToNextRefresh to make this more clear
There was a problem hiding this comment.
What I mean is that if we landed here because refreshCertificates returned early, on line 108 then nothing got refreshed this time round, and this message is a lie. I think we should log this around line 123 instead, and even potentially rename refreshCertificates to maybeRefreshCertificates.
6c8635f to
0943901
Compare
5d5b3f4 to
16707ef
Compare
There was a problem hiding this comment.
What I mean is that if we landed here because refreshCertificates returned early, on line 108 then nothing got refreshed this time round, and this message is a lie. I think we should log this around line 123 instead, and even potentially rename refreshCertificates to maybeRefreshCertificates.
There was a problem hiding this comment.
This is not an error right after deployment, when it's expected, right?
There was a problem hiding this comment.
As discussed in #457 the repository should already populate the secret data, so this would be unexpected. Although I wonder if this could also happen even after the repository creates the secrets initialising the secret data with the certificates, due to k8s eventual consistency. If that is the case we'd have the certificates emitted twice in a row, do you think that could happen too?
For the same, I think scanner should be able to wait without crashing until the secrets are created and populated. Is that correct or am I missing something?
There was a problem hiding this comment.
I think we should make it such that when a secret is present, it contains expected data. In other words we should never create an empty secret.
There was a problem hiding this comment.
I understand you mean that in that case we should not treat ErrEmptyCertificate differently to other errors. The current proposal for #457 creates secrets with data populated with the certificates. So I think we can go two paths here:
- Treat errors in
getTimeToRefreshas unrecoverable: sogetTimeToRefreshjust returns0, errfor any error returned byr.getCertsRenewalTime(certificates). This implies an infinite loop of errors when the secret data is corrupted. The current proposal for ROX-9128: Create certSecretsRepo type to store certificates in k8s secrets #457 only creates a secret when it doesn't exist, so restarting Sensor wouldn't fix that - Treat errors in
getTimeToRefreshas recoverable, and recover by returning0, nilfor any error returned byr.getCertsRenewalTime(certificates). This can also leads to an infinite loop of requesting certificates, if the certificate renewal process always fails.
I tend to prefer option 2, what do you think?
There was a problem hiding this comment.
the repository should already populate the secret data, so this would be unexpected
I understand you mean that in that case we should not treat
ErrEmptyCertificatedifferently to other errors.
Yes, if we never create an empty secret then we should not be having a special error instance for an empty secret. Just treat it as invalid cert ripe for a refresh.
There was a problem hiding this comment.
I first modified getTimeToRefresh to never fail and always recover using 0 as timeToNextRefresh.
But then I realised that if we get an error in the second call to r.getTimeToRefresh in the body of ensureCertificatesAreFresh, for the new certificates returned by r.requestCertificates(ctx), then we should not return 0, nil to the ticker that is calling RefreshCertificates, because that would imply an immediate retry, thus hammering central with certificate requests just because the refresher doesn't understand the certificates. So I have added a recoverFromErr parameter to getTimeToRefresh to optionally ignore errors, so in ensureCertificatesAreFresh for the the first call we ignore errors and refresh the certificate immediately if we cannot parse the certificates stored in the secrets, but for the second call where we parse the new certificates we don't recover from an error so the backoff mechanism in the ticker can work.
What do you think?
There was a problem hiding this comment.
Since there are no other errs in this function:
| renewalTime, renewalTimeErr := r.getCertsRenewalTime(certificates) | |
| renewalTime, err := r.getCertsRenewalTime(certificates) |
There was a problem hiding this comment.
I'm not sure this interface is all that useful...
There was a problem hiding this comment.
this would be useful for testing the sensor component that launches the refresher, as it would allow to mock the refresher easily
There was a problem hiding this comment.
Right, but an interface should be defined in the user's package, not in the implementer's one.
https://blog.chewxy.com/2018/03/18/golang-interfaces/
Here it's the same package, but arguably the same principles should apply when we split things across files.
There was a problem hiding this comment.
What I'm understanding here is that the suggestion is moving the certRefresher interface to the sensor component that uses it, e.g. to file tls_issuer.go in https://github.com/stackrox/stackrox/pull/565/files#diff-81bd3808339e34da64a1fa5e66c27f0ec9d75a8a51956b1773c4e0c8b6295c08. I'm thus removing the interface from this PR
There was a problem hiding this comment.
| ticker := concurrency.NewRetryTicker(refresher.RefreshCertificates, timeout, backoff) | |
| return ticker | |
| return concurrency.NewRetryTicker(refresher.RefreshCertificates, timeout, backoff) |
There was a problem hiding this comment.
| // PutServiceCertificates TODO replace by serviceCertificatesRepo from ROX-9128 | |
| type PutServiceCertificates interface { | |
| // certificatesRepo TODO replace by serviceCertificatesRepo from ROX-9128 | |
| type certificatesRepo interface { |
cb3c49e to
b651192
Compare
There was a problem hiding this comment.
I have to admit that because the last line of parameter list is indented the same as the first line of code, I find it hard to tell them apart visually. WDYT about inserting a blank line as the first line of function body whenever there is a line break inside the parameter list?
There was a problem hiding this comment.
done, I agree that is more readable
There was a problem hiding this comment.
This is only passed as the function to the ticker and not accessed anywhere else (apart from unit tests), right?
In that case I guess it does not need to be exported? 🤔
There was a problem hiding this comment.
renamed to lower case
There was a problem hiding this comment.
If the entire API is just creating a ticker and delegating Stop and Start to it, then perhaps the refresher should not be wrapping it in the first place? Perhaps we can simplify the whole thing by having newCertRefresher return an appropriately configured ticker instead? WDYT?
There was a problem hiding this comment.
that makes sense, and it simplifies the code significantly, done
There was a problem hiding this comment.
the repository should already populate the secret data, so this would be unexpected
I understand you mean that in that case we should not treat
ErrEmptyCertificatedifferently to other errors.
Yes, if we never create an empty secret then we should not be having a special error instance for an empty secret. Just treat it as invalid cert ripe for a refresh.
porridge
left a comment
There was a problem hiding this comment.
LGTM, apart from a few nitpicks.
There was a problem hiding this comment.
Nit: corrupted might be too strong a word for the case where CA certs differ. Perhaps just inconsistent state?
There was a problem hiding this comment.
changed this to
log.Errorf("local scanner certificates are in an inconsistent state, will refresh certificates immediately: %s", getCertsErr)There was a problem hiding this comment.
| timeout time.Duration, backoff wait.Backoff) concurrency.RetryTicker { | |
| return concurrency.NewRetryTicker(func(ctx context.Context) (timeToNextTick time.Duration, err error) { | |
| timeout time.Duration, backoff wait.Backoff) concurrency.RetryTicker { | |
| return concurrency.NewRetryTicker(func(ctx context.Context) (timeToNextTick time.Duration, err error) { |
There was a problem hiding this comment.
Please use a consistent wrapping style for the same function prototypes (i.e. here vs refreshCertificates). For 4 parameters I'd prefer the style used here, but up to you as long as it's consistent.
There was a problem hiding this comment.
now using longer lines for refreshCertificates too
There was a problem hiding this comment.
| // recoverFromErr to so the ticker knows this is an error, and it retries with backoff. | |
| // recoverFromEr=false so the ticker knows this is an error, and it retries with backoff. |
There was a problem hiding this comment.
TBH my preference would be to inline this function. The extra recoverFromErr parameter seems to introduce more complexity than is solved by having this standalone function. At least I had a hard time figuring out what is going on :-)
There was a problem hiding this comment.
You are right, I like how your comments keep making the code smaller!
There was a problem hiding this comment.
| // ErrDifferentCAForDifferentServiceTypes TODO replace by ROX-9128 | |
| // ErrDifferentCAForDifferentServiceTypes TODO(ROX-9128): <description why> |
There was a problem hiding this comment.
references to ROX-9128 removed after rebase
There was a problem hiding this comment.
| ErrDifferentCAForDifferentServiceTypes = errors.New("found different CA PEM in secret Data for different service types") | |
| ErrDifferentCAForDifferentServiceTypes = errors.New("found different CA PEM in secret Data field for different service types") |
There was a problem hiding this comment.
| // ErrMissingSecretData TODO replace by ROX-9128 | |
| // ErrMissingSecretData TODO(ROX-9128): <description> |
There was a problem hiding this comment.
references to ROX-9128 removed after rebase
There was a problem hiding this comment.
This is a mock right? Can this be removed?
If you add temporary code try to mark it as TODO(do-not-merge): ... to prevent accidental merges.
There was a problem hiding this comment.
That is not a mock, per another review comment I moved the interface definition to the place where they are used, see #474 (comment) for context
There was a problem hiding this comment.
Does it mean we add the interfaces where they are needed?
Tbh I've never seen this before it and seems to not make sense anymore if we have another package using the same interface as a dependency (in structs or function signatures).
cc: @porridge
There was a problem hiding this comment.
I.e. the Reader is also defined in io.go where the implementation lives
There was a problem hiding this comment.
For this particular case, another option is putting all interfaces in a new file sensor/kubernetes/localscanner/types.go, what do you think?
There was a problem hiding this comment.
@porridge Do you have any production examples? Preferably in a well-known project?
Also took a look at the uber style guide where it is not mentioned.
There was a problem hiding this comment.
@SimonBaeumer here's the most authoritative source I could find: https://github.com/golang/go/wiki/CodeReviewComments#interfaces
This is also alluded to by people like Rob Pike in places like https://groups.google.com/g/golang-nuts/c/mg-8_jMyasY/m/lo-kDuEd540J which explains the philosophical background of why this rule was created in the first place.
I'm also reasonably sure I've seen a talk by one of the Go creators that mentioned this, but I cannot find it :-(
Some more places where it's mentioned by other people:
https://blog.logrocket.com/exploring-structs-interfaces-go/#other-interface-types-considerations
https://commandercoriander.net/blog/2018/03/30/go-interfaces/
Of course there are cases where this is hard to apply, but I don't think this is one of them.
There was a problem hiding this comment.
Not the talk I had in mind, but fun to watch: https://www.youtube.com/watch?v=PAAkCSZUG1c&t=318s
There was a problem hiding this comment.
Very interesting discussion. Some golang concepts like interfaces or methods are sneaky to me because they sound similar to Java or C++ concepts, but are actually quite different, so this material is really useful
There was a problem hiding this comment.
Interesting, thanks for sharing @porridge. I was not aware of this, need to dig deeper later 👍
There was a problem hiding this comment.
| // refreshCertificates determines refreshes the certificate secrets if needed, and returns the time | |
| // refreshCertificates refreshes the certificate secrets if needed, and returns the time |
There was a problem hiding this comment.
I think this needs to be replaced with ensureSecrets after the repo secrets PR is merged
|
I did not review tests as waiting for updating the interface for ensure secrets. |
it should be declared instead in the client code that is the sensor component that uses it
c1ce53b to
f8c139b
Compare
SimonBaeumer
left a comment
There was a problem hiding this comment.
Already looks very good.
Small remarks on the interface discussion and mocking (using generated mocks instead).
Good job here!
pkg/concurrency/retry_ticker_test.go
Outdated
| time.Sleep(capTime) | ||
| assert.Nil(t, ticker.getTickTimer()) |
There was a problem hiding this comment.
Do we have something like an assert.Until function?
Or retry this until either ticker.getTickerTimer==null or timeout exceeded?
I looks like a general pattern here. Would be great to have something like that here. (Must not be part of this PR, instead adding it as a separate ticket.)
There was a problem hiding this comment.
I haven't seen that in testify, but we can implement something like that with PollWithTimeout
func assertTickerEventuallyStops(t *testing.T, ticker *retryTickerImpl) {
ok := PollWithTimeout(func() bool {
return ticker.getTickTimer() == nil
}, 10 * time.Millisecond, capTime)
assert.True(t, ok, "ticker should eventually stop")
}I have checked that this catches the error for TestRetryTickerStop and TestRetryTickerStopsOnNonRecoverableErrors when changing scheduleTick to work wrong, so I'm using this instead.
Nice catch, this should make our tests slightly faster
| } | ||
|
|
||
| func (s *certRefresherSuite) TestRefreshCertificatesRequestCertificatesFailure() { | ||
| certificates := (*storage.TypedServiceCertificateSet)(nil) |
There was a problem hiding this comment.
| certificates := (*storage.TypedServiceCertificateSet)(nil) | |
| var certificates *storage.TypedServiceCertificateSet |
| } | ||
|
|
||
| func (s *certRefresherSuite) TestRefreshCertificatesRequestCertificatesResponseFailure() { | ||
| certificates := (*storage.TypedServiceCertificateSet)(nil) |
There was a problem hiding this comment.
| certificates := (*storage.TypedServiceCertificateSet)(nil) | |
| var certificates *storage.TypedServiceCertificateSet |
There was a problem hiding this comment.
Does it mean we add the interfaces where they are needed?
Tbh I've never seen this before it and seems to not make sense anymore if we have another package using the same interface as a dependency (in structs or function signatures).
cc: @porridge
There was a problem hiding this comment.
I.e. the Reader is also defined in io.go where the implementation lives
| return 0, nil | ||
| } | ||
| if k8sErrors.IsNotFound(getCertsErr) { | ||
| log.Warnf("local scanner certificates are missing, "+ |
There was a problem hiding this comment.
| log.Warnf("local scanner certificates are missing, "+ | |
| log.Warnf("local scanner certificates not found, " + |
SimonBaeumer
left a comment
There was a problem hiding this comment.
Besides the interface location this looks good. Imho if we don't have evidence to put in the cert_refresher let's put it back to implementation side.
Description
This is a fragment of #400 that has the business logic for the client side of keeping local scanner certificates updated, that is not related to a sensor component.
Checklist
No need to add a CHANGELOG entry or update steps.
Testing Performed
New unit tests