# Retries
**Author**: @aat
<!-- Status of the document - draft, in-review, etc. - is conveyed via HackMD labels -->
## Description (what)
This is a proposal to add rudimentary retry support to FTL. Its initial use case is the [Distributed State Machine](/DWVZqIRsQRSG0w6pcYR7tw) design, but it will be extended in future to support more generalised use cases.
## Motivation (why, optional)
## Goals
- Rudimentary user-controlled retries for retrying state machine tasks.
### Non-Goals (optional)
- Complex retry logic.
- Automatic retries of anything but state machine calls.
- Cron jobs and retries are both examples of "timed" calls, so we should ponder unifying them under a single system at some point, but for now we'll defer that.
- We won't propagate retry policies to module code yet, but will at some point.
## Design (how)
Allow verbs to be customised with a retry annotation specifying the number of retries and the interval between retries. If two intervals are provided, exponential backoff is used, with the first interval being the lower bound and the second interval being the upper bound. If the retry count is omitted, the number of retries is unlimited.
Retries will only be attempted if a verb responds with a specific error type, `ftl.ErrRetry`. The error can be wrapped as usual to provide useful context, e.g. `fmt.Errorf("failed to make external API call: %w", ftl.ErrRetry)`.
The retry policy is utilised by clients. In the case of the FSM, a retry attempt will be tied to a particular execution of the state machine.
The grammar might be something like this:
```
//ftl:retry [<count>] <min> [<max>]
```
Some examples:
1. Retry every 5s:
```go
//ftl:retry 5s
```
2. Retry with exponential backoff starting at 5s and capped at 10m, i.e. 5s, 10s, 20s, 40s, 80s, etc.:
```go
//ftl:retry 5s 10m
```
3. Retry 10 times with a 5s interval:
```go
//ftl:retry 10 5s
```
The retries table will look something like this:
```sql
CREATE TABLE retries (
id BIGINT NOT NULL GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
created_at TIMESTAMPTZ NOT NULL DEFAULT (NOW() AT TIME ZONE 'utc'),
deployment_id BIGINT NOT NULL REFERENCES deployment(id),
max_attempts INT,
min_delay INTERVAL NOT NULL,
max_delay INTERVAL,
attempt INT NOT NULL,
next_attempt
ve
);
```
### Required changes (how)
- Introduce a retries table.
- Add a metadata entry to the schema for retries.
- Parse retry annotations into the schema metadata entry.
- A `bool retry` field will be added to the `CallResponse` protobuf message.