
ROX-9022: add type to make requests to a retryable source with asynchronous interface #383

Closed
juanrh wants to merge 4 commits into master from juanrh/ROX-9022

Conversation

Contributor

@juanrh juanrh commented Jan 20, 2022

Description

Create a type to make requests to a retryable source with an asynchronous interface, with a timeout per request.
This would be used by the CertManager (final name TBD) to reliably request the TLS certificates for a set of components.

Previous attempts

I started with this, but the problem is that it doesn't retry on timeouts, which is essential for using this to fetch values through the Communicate RPC in Central, which is bidirectional and has no timeouts or request failures implemented:

type result struct {
	v   interface{}
	err error
}

type RetryableSource interface {
	// Request asks the source for a result, and assumes the source will
	// perform the necessary cancellation procedure.
	Request()
	// ResponseC is where the source puts the result: we only care about
	// the first one.
	ResponseC() chan result
}

type backoffRequesterImpl struct {
	source  RetryableSource
	stopC   concurrency.ErrorSignal
	backoff wait.Backoff
}

func (r *backoffRequesterImpl) Request(parentCtx context.Context, timeout time.Duration, errReporter errorReporter) (interface{}, error) {
	r.source.Request()
	ctx, cancel := context.WithTimeout(parentCtx, timeout)
	defer cancel()
	for {
		select {
		case <-r.stopC.Done():
			return nil, errors.New("request cancelled")
		case <-ctx.Done():
			// FIXME retry
			return nil, errors.New("timeout")
		case response := <-r.source.ResponseC():
			if response.err != nil {
				if errReporter != nil {
					errReporter.Report(response.err)
				}
				time.AfterFunc(r.backoff.Step(), func() {
					r.source.Request()
				})
			} else {
				return response.v, nil
			}
		}
	}
}

I tried retrying timeouts, but this doesn't reset the timeout timer when a new request is sent.

// NOTE BLOCKING: loops forever or until you cancel
func (r *backoffRequesterImpl) Request(parentCtx context.Context, timeout time.Duration) (interface{}, error) {
	r.source.Request()
	cancelCtx, cancel := context.WithCancel(parentCtx)
	defer cancel()
	timeoutCtx, _ := context.WithTimeout(cancelCtx, timeout)
	for {
		select {
		case <-r.cancelC.Done():
			return nil, errors.New("request cancelled")
		case <-timeoutCtx.Done():
			r.handleError(errors.New("timeout"))
			timeoutCtx, _ = context.WithTimeout(cancelCtx, timeout)
		case response := <-r.source.ResponseC():
			err := response.err
			if err != nil {
				r.handleError(err)
			} else {
				return response.v, nil
			}
		}
	}
}

func (r *backoffRequesterImpl) handleError(err error) {
	time.AfterFunc(r.backoff.Step(), func() {
		r.source.Request()
	})
}

func (r *backoffRequesterImpl) Cancel() {
	r.cancelC.SignalWithError(errors.New("stop"))
}

This does reset the timeout timer on a new request, but I'm not sure it is safe to modify r.timeoutCtx, which is used in a case of the select, while the select is running. It doesn't look concurrency-safe, but I'm not sure; any insight is appreciated.

type backoffRequesterImpl struct {
	source     RetryableSource
	cancelC    concurrency.ErrorSignal
	backoff    wait.Backoff
	timeoutCtx context.Context
}

// NOTE BLOCKING: loops forever or until you cancel
func (r *backoffRequesterImpl) Request(parentCtx context.Context, timeout time.Duration) (interface{}, error) {
	r.source.Request()
	cancelCtx, cancel := context.WithCancel(parentCtx)
	defer cancel()
	r.timeoutCtx, _ = context.WithTimeout(cancelCtx, timeout)
	for {
		select {
		case <-r.cancelC.Done():
			// Assume the response will never come. We take the first value in
			// `r.source.ResponseC()` as the response, and assume `r.source.Request()`
			// does any cancellation required.
			return nil, errors.New("request cancelled")
		case <-r.timeoutCtx.Done():
			r.handleError(cancelCtx, timeout, errors.New("timeout"))
			// r.timeoutCtx, _ = context.WithTimeout(cancelCtx, timeout)
		case response := <-r.source.ResponseC():
			err := response.err
			if err != nil {
				r.handleError(cancelCtx, timeout, err)
			} else {
				return response.v, nil
			}
		}
	}
	// TODO: wait for parentCtx and Cancel if done; consider making that the way
	// to cancel instead, so no `cancelC concurrency.ErrorSignal` is needed.
}

func (r *backoffRequesterImpl) handleError(cancelCtx context.Context, timeout time.Duration, err error) {
	// BAD: starts timeout before request: `r.timeoutCtx, _ = context.WithTimeout(cancelCtx, timeout)`
	r.timeoutCtx = cancelCtx
	time.AfterFunc(r.backoff.Step(), func() {
		r.source.Request()
		// BAD: concurrently modifies the timeout; the Request loop might not catch it
		r.timeoutCtx, _ = context.WithTimeout(cancelCtx, timeout)
	})
}

func (r *backoffRequesterImpl) Cancel() {
	r.cancelC.SignalWithError(errors.New("stop"))
}

The current code uses a channel and a timer for timeouts instead of contexts. I also replaced the signal with a context for cancellation, and added a global timeout for the whole process.

Checklist

  • Investigated and inspected CI test results
  • Unit test and regression tests added
  • Evaluated and added CHANGELOG entry if required
  • Determined and documented upgrade steps

If any of these don't apply, please comment below.

Testing Performed

TODO(replace-me)

Juan Rodriguez Hortala added 2 commits January 20, 2022 18:43
Type to make requests to a retryable source that has an
asynchronous interface, with timeout per request, and
configurable backoff

ghost commented Jan 20, 2022

Tag for build #128932 is 3.68.x-6-g29f5cf6a2e.

💻 For deploying this image using the dev scripts, run the following first:

export MAIN_IMAGE_TAG='3.68.x-6-g29f5cf6a2e'

📦 You can also generate an installation bundle with:

docker run -i --rm stackrox/main:3.68.x-6-g29f5cf6a2e central generate interactive > bundle.zip

🕹️ A roxctl binary artifact can be downloaded from CircleCI.

"k8s.io/apimachinery/pkg/util/wait"
)

// RetryableSource is a value that allows asking for a result, and returns the
Contributor

Can you provide an existing example of something that will provide an interface similar to this?

It seems strange that we are:

  1. not propagating the context, and
  2. the usage model is reading from the channel and cracking the AskForResult whip from time to time?

If our notion of timeout does not match the source's notion, this could lead to strange situations I think...

It's a little hard to think about how the Retriever would work without well-defined source semantics 🤔

Contributor Author

@juanrh juanrh Jan 21, 2022

I've added a couple of commits that change the interface a bit.

The idea now is to call AskForResult to ask for a result and get a new channel just for that result, and to call Retry if the result is not obtained in time. I also added passing the context to AskForResult.

type RetryableSource interface {
	AskForResult(ctx context.Context) chan *Result
	Retry()
}

The idea is that the source can forget that you made a request. An example of a RetryableSource would be a SensorComponent that communicates with Central through the Communicate RPC. Sensor side, this is implemented by registering a bunch of SensorComponents that get messages to / from Central as follows:

  • Send to Central: each SensorComponent has a method ResponsesC() <-chan *central.MsgFromSensor that centralSenderImpl listens on, forwarding all messages it finds to Central.
  • Receive from Central: each SensorComponent has a method ProcessMessage(msg *central.MsgToSensor) error that centralReceiverImpl calls each time it gets a message from Central. Each message is broadcast to all SensorComponents.

Sending messages through Communicate is fire and forget, so this is what I've come up with to encapsulate timeouts and retries.
For the local scanner we'd have a SensorComponent that sends an IssueLocalScannerCertsRequest to ask Central for certificates. On AskForResult it would send that request to Central and create a new chan *Result, and on Communicate it would send the response message to that channel. On Retry it would send a new request to Central, and update an internal request ID that is used in IssueLocalScannerCertsRequest and IssueLocalScannerCertsResponse to correlate requests and responses.

Regarding

If our notion of timeout does not match the source's notion, this could lead to strange situations I think...

I don't get what you mean by that.

Contributor Author

As a somewhat related precedent, this is similar to Erlang's gen_server call, but with retries, although the language is quite different.

Contributor Author

See example of certificate source in draft PR #400

Contributor

By saying:

If our notion of timeout does not match the source's notion, this could lead to strange situations I think...

I meant that it is not clear what the relation is between the ctx and the recommended wait period between calls to Retry().

Also, unfortunately I got somewhat lost when reading #400, and the things that were most interesting to me, like how the contexts are propagated, do not seem to be finished yet.
Maybe I should just wait until both PRs are a bit more fleshed out...

Contributor Author

@juanrh juanrh Jan 24, 2022

I've been able to simplify #400 by applying @SimonBaeumer's suggestions. The code in this PR is not used anymore; please take a look when you have some time.
I'm also putting #350 on hold in favor of #400.

Contributor Author

juanrh commented Jan 25, 2022

closed in favor of #350

@juanrh juanrh closed this Jan 25, 2022
@msugakov msugakov deleted the juanrh/ROX-9022 branch September 16, 2025 18:41