Leases

Author: @aat, @mtoohey

Description

FTL support for leases for use internally and by modules.

Motivation

Some kind of distributed locking mechanism is a very common requirement when building distributed systems. Leases are a (relatively) simple solution to this problem.

Internally, the FTL Controller has implemented ad-hoc leases in a number of places, so a general purpose leasing system would be very useful internally as well. Specific examples include:

Goals

  • Allow modules to acquire a lease on a resource
  • Safely enforce expiry of leases

Non-Goals

  • No hierarchy in namespaces (a possible future feature)
  • Re: Module Leases
    • Recursive leases (A possible future feature)

Terminology

  • namespace: a list of strings (["a", "b"]), which can also be represented as a period-separated string (a.b)
  • key: provided at runtime to obtain a lease. A key and namespace are used to look for competing leases
    • TODO: rename as value? confusing with leasekey
  • scope: namespace + key
  • lease key: a unique key for each lease obtained

Design

Each controller will have a LeaseCoordinator to manage leases.

func (c *LeaseCoordinator) Acquire(lease Lease, wait time.Duration) error
func (c *LeaseCoordinator) Heartbeat(lease Lease) error
func (c *LeaseCoordinator) ReleaseOk(lease Lease[S], success S) error
func (c *LeaseCoordinator) ReleaseFail(lease Lease, err error) error

func (c *LeaseCoordinator) ReapStaleLeases(namespaces []Namespace, includeChildren bool, reap func(LeaseKey, Scope, *dal.Tx) error) error

Lease is a protocol (a Go interface) that each use case implements:

type Namespace []string

type Scope struct {
    Namespace Namespace
    Key string
}

type Lease[Success any] interface {
    Key() LeaseKey
    Scope() Scope
    
    Duration() time.Duration
    SetDeadline() time.Time

    // May be called multiple times until a transaction succeeds
    WillInsert(ctx context.Context, tx *dal.Tx, deadline time.Time) error
    
    WillHeartbeat(ctx context.Context, tx *dal.Tx, deadline time.Time) error
    WillReleaseOk(ctx context.Context, tx *dal.Tx, success Success) error
    WillReleaseFail(ctx context.Context, tx *dal.Tx, failure error) error
}

Leases have a scope made of a namespace (a list of strings) and a key (e.g. a user ID).

  • . is an invalid character in a namespace segment, so that the list can be stored as a single concatenated string in the db. This could be checked at validation time if literals are used, and at runtime for safety
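
A minimal sketch of the namespace rule above, assuming the concatenated form is simply the segments joined with a period. The Validate, String and ParseNamespace helpers are hypothetical names used for illustration, not part of the proposed API.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

type Namespace []string

// Validate rejects namespaces that could not be round-tripped through the
// period-separated form: empty namespaces, empty segments, and segments
// that contain ".". (Hypothetical helper.)
func (n Namespace) Validate() error {
	if len(n) == 0 {
		return errors.New("namespace must have at least one segment")
	}
	for _, seg := range n {
		if seg == "" || strings.Contains(seg, ".") {
			return fmt.Errorf("invalid namespace segment %q", seg)
		}
	}
	return nil
}

// String returns the concatenated form that would be stored in the db.
func (n Namespace) String() string {
	return strings.Join(n, ".")
}

// ParseNamespace converts the stored string back into a list of segments.
func ParseNamespace(s string) (Namespace, error) {
	n := Namespace(strings.Split(s, "."))
	if err := n.Validate(); err != nil {
		return nil, err
	}
	return n, nil
}

func main() {
	ns := Namespace{"runner", "reserve"}
	fmt.Println(ns.String(), ns.Validate())

	// A segment containing "." would be ambiguous once concatenated.
	fmt.Println(Namespace{"a.b"}.Validate())
}
```

Because no segment may contain a period, splitting the stored string always yields the original list.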

There will be a grace period: owners treat their lease as expired slightly before the controllers consider the lease timed out.

Acquiring and releasing a lease

To acquire a lease, the following happens:

  • Create a new struct (a use-case-specific struct conforming to Lease)
  • Acquire the lease by calling LeaseCoordinator.Acquire(lease, waitingTimeout)
    • If you want an immediate lease, and to fail otherwise, pass a zero waiting timeout. Otherwise pass in a reasonable timeout.
    • Coordinator starts a transaction
    • Coordinator attempts to insert the lease into the database. It then calls WillInsert() on the lease to allow the lease to do any custom db changes
    • Coordinator then commits the transaction.
      • If it succeeds, it returns the new deadline to the caller.
      • If it fails, the lease coordinator waits until the db notifies it that a lease which could block this lease has been released, and then retries. It gives up once the waiting timeout has been reached.

Heartbeating (aka renewing) & expiring a lease

  • Leases can be renewed to extend their deadline
  • It is expected that leases are renewed more than once per lease duration, so that if an attempt to renew fails, there's a chance to recover and maintain the lease.
  • It is the responsibility of the lease to attempt renewals when appropriate.
  • Renewals update the deadline to Now() + Duration
  • It is up to the lease to stop its work by the deadline. The coordinator will not notify the lease at the time of the deadline
  • Controllers are responsible for finding expired leases in the db and reaping them
    • TODO: define how this responsibility is shared across controllers
    • TODO: Is this just a periodic job, or is it maintained like cronjobs and polled less often?
    • TODO: custom logic per usecase?

Reaping

The lease coordinator cannot be the sole component responsible for reaping, because in some use cases the lease table is not the only table that needs updating.
Instead, it is up to each use case to reap its own leases through the lease coordinator.
Each use case will call ReapStaleLeases() on the lease coordinator, passing its namespaces and a function that is called for each reaped lease so that other db updates can be made.

DB tables and queries

// TODO
table needs:
leasekey
deadline
state => [active, ] ?
namespace
key

db events / channel

queries:
acquire
release
renew
upsert?

Use case: Controllers

[proposed refactor of existing code]

Use case: Runners

[proposed refactor of existing code]

  • Namespace: runner.???
  • Controllers already have a stream from the runner, which they use to upsert the latest info for the runner into the db, including its last_seen time.
  • We will remove the last_seen time from the runners db table
  • When a controller receives a streamed message from the runner, it will upsert (obtain / renew) a lease for the runner
  • TODO: reaping, any custom logic?

Use case: Runner Reservation

[proposed refactor of existing code]

  • Namespace: runner.reserve
  • TODO: remove some columns from runner table
  • When a controller wants to reserve a runner, it will acquire a lease for that runner
  • While the controller waits for the module server to respond it will heartbeat the lease
  • The controller will release the lease once the runner completes reservation

Use case: Async calls

[no existing code for this yet]

  • Async calls will have their own table of scheduled calls
  • To execute a call, an async call coordinator will attempt to get a lease for the call
  • If it gets the lease, it can execute the call. At the end of the call it will complete the async call and release the lease in the same db transaction
  • As the call continues, the controller will heartbeat the lease

Use Case: Modules

[new feature]
Namespace: module.<...namespace provided by module...>

To define a lease with a namespace
In go:

var userLease = ftl.Lease("user", "update")

We will not include leases in the schema for now.
We may want to in the future, as it could be nice to see in the console which leases a module/verb could be contending for.

Acquiring and releasing:

To obtain a lease and then release a lease at the end of a verb:

//ftl:export
func UpdateUser(ctx context.Context, req UpdateUserRequest) error {
    lease, err := userLease.Acquire(ctx, req.UserId)
    if err != nil {
        return err
    }
    defer lease.Release()

    // process data, update db, etc
    return nil
}
// TODO: include VerbService change
  • The module will make a gRPC call to the controller
  • The controller will attempt to acquire a lease with a reasonable waiting time
  • In case of an error:
    • Controller responds to module with an error, module server returns an error back to user code
  • In case of success:
    • Controller responds to module with a lease key and a deadline
    • The module then returns to user code with an acquired lease

Renewing and expiry:

Module code does not need to worry about renewing leases. The ftl package will make periodic calls back to the controller to renew active leases.

  • The ftl package will track the deadline and, if it is reached, will call the context's cancel handler
    • This will kill the verb execution if it does not finish in time. The same will occur for any subsequent calls which are tied to the lease.

Propagation:

If a call with an acquired lease makes another call, we need to make sure that all of these calls are terminated at expiry time.

  • This should happen due to how contexts and gRPC are set up. But we should test to confirm.

Required changes

  • Module server needs to obtain a context with a cancel handler before executing a verb. This allows ftl to cancel the context if a lease deadline is reached, and avoids needing to fiddle with ctx within a verb when obtaining/releasing leases

Known Issues

  • After a lease is acquired, a db transaction may still be in flight when the lease expires. A new lease may be granted while that transaction is ongoing, making the lease unsafe.

Rejected Alternatives

  • We could use Go's context timeout instead of panic()
    • This made the API more finicky by needing to set ctx to the new context
    • This also made it harder to remove the timeout once a lease was released
  • We could avoid the need of a prior declaration of a lease variable, and instead have code call ftl.ObtainLease("user", "update", userId)
    • Not a bad approach, but declaring leases obtained by a module might be a pattern we want to encourage, as it helps people new to the module understand how the module is ensuring safety.