# Retries

**Author**: @mtoohey

## Description (what)

This is a proposal to add rudimentary retry support to FTL. The initial scope is retrying Finite State Machine transitions.

## Motivation

This will eventually be useful for:

- [Finite State Machines](/DWVZqIRsQRSG0w6pcYR7tw)
- [Pub/Sub](/JUIv11IgQ1q72mBKgfQkSg)
- Cronjobs
- Callbacks
- Synchronous FTL calls

With all these different use cases, it would be good to have a standard way of handling retries.

## Goals

- Rudimentary user-controlled retries for state machine calls.

### Non-Goals

- Complex retry logic.
- Special handling when reaching the end of the retry count (the FSM goes to the failed state).
- Automatic retries of anything but state machine calls.

## Design

We will build retries on top of async calls so that they can eventually be used by other features built on top of async calls.

Retry behaviour is defined on the FSM transition verb, with this pattern:

```go
//ftl:retry [<count>] <min> [<max>=1hr]
```

Some examples:

1. Retry every 5s:

```go
// An example FSM transition
//
//ftl:verb
//ftl:retry 5s
func Created(ctx context.Context, in OnlinePaymentCreated) error {
  ...
}
```

2. Retry with exponential backoff starting at 5s up to 10m, i.e. 5s, 10s, 20s, 40s, 80s, etc.

```go
//ftl:retry 5s 10m
```

3. Retry 10 times with a 5s interval:

```go
//ftl:retry 10 5s
```

Other behaviours:

- If no retry directive is included for a transition, no retries will occur.
- There is no differentiation regarding which errors are retriable (out of scope).

### Required changes

We will add the following to the existing `async_calls` table:

```sql
CREATE TABLE async_calls (
  [...]
  scheduled_at TIMESTAMPTZ NOT NULL DEFAULT (NOW() AT TIME ZONE 'utc'),

  -- retry state
  remaining_attempts INT NOT NULL,
  backoff INTERVAL NOT NULL,
  max_backoff INTERVAL NOT NULL
);
```

When an async call is created:

- The new columns in `async_calls` will be populated based on the verb's retry directive in the schema.

When controllers attempt to start async calls:

- They will filter the db query with `scheduled_at <= (NOW() AT TIME ZONE 'utc')`, so that only calls whose scheduled time has arrived are picked up.

When an async call fails:

- All within a single transaction:
    - As already implemented, the `async_calls` record will be updated so that:
        - `state = error`
        - `error` is filled in with the error text
    - If there are remaining retries to attempt, a new `async_calls` record will be inserted, with:
        - `scheduled_at = NOW() + failedCall.backoff`
        - `backoff = MIN(failedCall.backoff * 2, failedCall.max_backoff)`
        - `remaining_attempts = failedCall.remaining_attempts - 1`
    - Otherwise, the `fsm_instance` record needs to be updated to `state = error`.

## Longer term considerations

#### Where do retry policies belong?

One possible way of thinking about this:

- In asynchronous use cases (cron, FSM, Pub/Sub), the receiving verb dictates the retry policy.
- Whereas in synchronous use cases (`ftl.Call(...)`), the caller is the appropriate place.
    - Callees would need to opt in to retries, with something like a directive `ftl:idempotent` / `ftl:verb idempotent`.

## Rejected Alternatives

- [Retries table](https://hackmd.io/Cbvh60deTPiW6p8B9op9JQ)