Try   HackMD

Distributed Finite State Machines in FTL

Author: @aat

Description

FTL's ability to compose functions lends itself well to defining both distributed workflows and FSM's.

This document deals with an initial implementation of FSM's.

What is the difference between FSM's and workflows? To quote this page:

In a workflow engine, transitioning to the next step occurs when a previous action is completed, whilst a state machine needs an external event that will cause branching to the next activity. In other words, state machine is event driven and workflow engine is not.

Goals

  • Support DFA's only

Non-goals

  • We will not support workflows initially
  • We wil not support NFA's initially - it would be more complex to reason about, but support can be added in the future relatively simply.

Prerequisites

  1. This design requires support for leases to prevent duplicate execution.
  2. Support for retries.
  3. It also benefits from sum types to allow multiple events to transition into a single state.

Design

Given an FSM is a set of transitions in the form (from, event, to) we can model each state as a function, and the event as the input to the to function. As an example, given these verbs:

verb from(From) Unit
verb to(To) Unit

The transition would be (from, To, to). That is, when an event of type To is received when in state from, a transition would occur to state to.

FSMs will be defined in code (see below), and represented in the schema (also below).

Each "execution" of a state machine definition must be given a unique key that is used to reference all future events. The first event sent with a new unique key will create a new execution. That is, each send of an event must be in the form (key, event).

Required changes

  • Add FSM to schema.
  • A new database table to track FSM execution.
  • Go API in go-runtime/ftl for constructing an FSM.
  • New endpoint VerbService.FSMSend() for sending an event to an FSM, including retry behaviour.
  • Add a scheduler to backend/controller/fsm for controlling the FSM and performing retries.

Schema evolution

With traditional schema evolution it would be very difficult to reason about changes to an FSM over time. Once versioning lands however, it will be much more straightforward: an FSM will continue execution using the versions of modules/verbs/types it referenced at deploy time. As with normal FTL versioning, extra constraints will need to be applied to persistent storage - databases, queues, etc.

Go API

Given the example FSM from here visualised as follows:

OnlinePaymentCreated
OfflineRepaymentPaid
OnlineRepaymentPaid
OnlineRepaymentFailed
PaymentRegistered
PaymentCompleted
PaymentCompleted
PaymentRegistered
Created
Paid
Failed
Registered
Completed

The FTL representation of this in Go might look something like:

//ftl:sumtype
type PaymentPaid interface { paymentPaidSumType() }

type OnlinePaymentPaid struct {}
func (OnlinePaymentPaid) paymentPaidSumType() {}

type OfflinePaymentPaid struct {}
func (OfflinePaymentPaid) paymentPaidSumType() {}

//ftl:export
func Created(ctx context.Context, in OnlinePaymentCreated) error { ... }

//ftl:export
func Paid(ctx context.Context, in PaymentPaid) error { ... }

//ftl:export
func Failed(ctx context.Context, in OlinePaymentFailed) error { ... }

//ftl:export
func Registered(ctx context.Context, in PaymentRegistered) error { ... }

// Note: requires sum-type support to be complete
//ftl:sumtype
type CompletedIn interface { completedSumType() }

type PaymentCompleted struct {}
func (PaymentCompleted) completedSumType() {}

type PaymentRegistered struct {}
func (PaymentRegistered) completedSumType() {}

//ftl:export
func Completed(ctx context.Context, in CompletedIn) error { ... }

var paymentFSM = ftl.StateMachine(
  "payment",
  // Two start states.
  ftl.Start(Created),
  ftl.Start(Paid),
  ftl.Transition(Created, Paid),
  ftl.Transition(Created, Failed),
  ftl.Transition(Paid, Registered),
  ftl.Transition(Paid, Completed),
  ftl.Transition(Completed, Completed),
)

//ftl:ingress http POST /payment
func StartOnlinePayment(ctx context.Context, in OnlinePayment) error {
  // Sending an event to a start state will create a new execution of the state
  // machine. All sends require a unique key to identify the execution, in this
  // case it's the invoice ID.
  return paymentFSM.Send(ctx, in.InvoiceID, OnlinePaymentCreated{
    InvoiceID: in.InvoiceID,
    Balance: in.Balance,
    ...,
  })
}

The FTL schema extracted from this will look something like:

fsm payment {
  start created
  start paid
  transition created to paid
  transition created to failed
  transition paid to registered
  transition paid to completed
  transition completed to completed
}

SQL Changes

The execution of state machines will be modelled in a single table tracking each "execution". Each execution is associated with a unique user-provided key, and created on-demand. Each execution also references a specific deployment in order to support versioning.

CREATE TYPE fsm_status AS ENUM ('running', 'completed', 'failed');

CREATE TABLE fsm_executions (
  id BIGINT NOT NULL GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
  created_at TIMESTAMPTZ NOT NULL DEFAULT (NOW() AT TIME ZONE 'utc'),
  -- Unique user-provided key identifying the run. In the example above this would be the "invoice ID".
  key TEXT UNIQUE NOT NULL,
  deployment_id BIGINT NOT NULL REFERENCES deployment(id),
  name TEXT NOT NULL,
  status fsm_status NOT NULL DEFAULT 'running'::fsm_status,
  -- The current state of the state machine.
  current_state TEXT NOT NULL,
);

CREATE INDEX idx_fsm_executions_key ON fsm_executions(key);
CREATE INDEX idx_fsm_executions_state ON fsm_executions(status);

We'll also introduce a retries table. This will initially be added purley so that states can retry, but at some point in the future we'll extend this to be more flexible and generalised.