FTL support for leases for use internally and by modules.
Motivation
Some kind of distributed locking mechanism is a very common requirement when building distributed systems. Leases are a (relatively) simple solution to this problem.
Internally, the FTL Controller has implemented ad-hoc leases in a number of places, so a general purpose leasing system would be very useful internally as well. Specific examples include:
type Namespace []string

type Scope struct {
	Namespace Namespace
	Key       string
}

type Lease[Success any] interface {
	Key() LeaseKey
	Scope() Scope
	Duration() time.Duration
	SetDeadline() time.Time

	// May be called multiple times until a transaction succeeds
	WillInsert(ctx context.Context, tx *dal.Tx, deadline time.Time)
	WillHeartbeat(ctx context.Context, tx *dal.Tx, deadline time.Time) error
	WillReleaseOk(ctx context.Context, tx *dal.Tx, success Success) error
	WillReleaseFail(ctx context.Context, tx *dal.Tx, failure error) error
}
Leases have a scope made of a namespace (a list of strings) and a key (eg: a user id).
"." is an invalid character in a namespace element, so that the list can be stored in the db as a single concatenated string. This could be checked at validation time if literals are used, and at runtime for safety.
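A minimal sketch of the runtime check and the concatenated form, assuming "." is also used as the separator and that the key is validated the same way as the namespace elements. Encode and ErrInvalidElement are hypothetical names, and rejecting empty elements is an extra assumption not stated above.

```go
package main

import (
	"errors"
	"strings"
)

// Namespace and Scope mirror the types in the proposal.
type Namespace []string

type Scope struct {
	Namespace Namespace
	Key       string
}

// ErrInvalidElement is a hypothetical error for this sketch.
var ErrInvalidElement = errors.New(`scope elements must be non-empty and must not contain "."`)

// Encode validates every element and joins the namespace and key with ".".
// Validating the key and rejecting empty elements are assumptions.
func (s Scope) Encode() (string, error) {
	parts := append(append([]string{}, s.Namespace...), s.Key)
	for _, p := range parts {
		if p == "" || strings.Contains(p, ".") {
			return "", ErrInvalidElement
		}
	}
	return strings.Join(parts, "."), nil
}
```

With this encoding, the scope with namespace ("runner", "reserve") and key "abc" is stored as runner.reserve.abc, and a namespace element such as "a.b" is rejected.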
There will be a grace period: owners treat their leases as expired before the controllers consider them timed out.
Acquiring and releasing a lease
To acquire a lease, the following happens:
Create a new struct (a use case specific struct conforming to Lease)
Acquire the lease by calling LeaseCoordinator.Acquire(lease, waitingTimeout)
If you want an immediate lease, and to fail otherwise, pass a zero waiting timeout. Otherwise, pass in a reasonable timeout.
Coordinator starts a transaction
Coordinator attempts to insert the lease into the database. It then calls WillInsert() on the lease to allow the lease to do any custom db changes
Coordinator then commits the transaction.
If it succeeds, it returns the new deadline to the caller.
If it fails, the lease coordinator waits until the db notifies it that a lease blocking this one has been released, and then retries. It gives up once the waiting timeout has been reached.
Heartbeating (aka renewing) & expiring a lease
Leases can be renewed to extend their deadline
It is expected that leases are renewed more often than the lease duration, so that if an attempt to renew fails, there is still a chance to recover and maintain the lease.
It is the responsibility of the lease to attempt renewals when appropriate.
Renewals update the deadline to Now() + Duration
It is up to the lease to stop its work by the deadline. The coordinator will not notify the lease at the time of the deadline.
Controllers are responsible for finding expired leases in the db and reaping them
TODO: define how this responsibility is shared across controllers
TODO: Is this just a periodic job, or is it maintained like cronjobs and polled less often?
TODO: custom logic per usecase?
Reaping
The lease coordinator cannot be the sole component responsible for reaping, because in some use cases the lease table is not the only table that needs updating. Instead, each use case reaps its own leases through the lease coordinator. ReapStaleLeases() will need to be called on the lease coordinator with the namespaces to reap and a function to call for each reaped lease, so that other db updates can be made.
DB tables and queries
// TODO table needs: leasekey deadline state => [active, …] ? namespace key
db events / channel
queries: acquire release renew upsert?
Use case: Controllers
[proposed refactor of existing code]
Use case: Runners
[proposed refactor of existing code]
Namespace: runner.???
Each controller already has a stream from the runner, which it uses to upsert the latest info for the runner into the db, including its last_seen time.
We will remove the last_seen time from the runners db table
When a controller receives a streamed message from the runner, it will upsert (obtain / renew) a lease for the runner
TODO: reaping, any custom logic?
Use case: Runner Reservation
[proposed refactor of existing code]
Namespace: runner.reserve
TODO: remove some columns from runner table
When a controller wants to reserve a runner, it will acquire a lease for that runner
While the controller waits for the module server to respond it will heartbeat the lease
The controller will release the lease once the runner completes reservation
Use case: Async calls
[no existing code for this yet]
Async calls will have their own table of scheduled calls
To execute a call, an async call coordinator will attempt to get a lease for the call
If it gets the lease, it can execute the call.
As the call continues, the controller will heartbeat the lease.
At the end of the call it will complete the async call and release the lease in the same db transaction.
Use Case: Modules
[new feature] Namespace: module.<...namespace provided by module...>
To define a lease with a namespace in Go:
var userLease = ftl.Lease("user", "update")
We will not include leases in the schema for now. We may want to in the future, as it could be nice to see what a module/verb could be in contention with for leases in the console.
Acquiring and releasing:
To obtain a lease and then release a lease at the end of a verb:
The module will make a gRPC call to the controller
The controller will attempt to acquire a lease with a reasonable waiting time
In case of an error:
Controller responds to module with an error, module server returns an error back to user code
In case of success:
Controller responds to module with a lease key and a deadline
Module then returns back to user code with an acquired lease
Renewing and expiry:
Module code does not need to worry about renewing leases. The ftl package will make periodic calls back to the controller to renew active leases.
The ftl package will track the deadline and, if it is reached, will call the context's cancel handler.
This will kill the verb execution if it does not finish in time. The same will occur for any subsequent calls which are tied to the lease.
Propagation:
If a call with an acquired lease makes another call, we need to make sure that at expiry time all of these calls are terminated.
This should happen due to how contexts and gRPC are set up. But we should test to confirm.
Required changes
Module server needs to obtain a context with a cancel handler before executing a verb. This allows ftl to cancel the context if a lease deadline is reached, and avoids needing to fiddle with ctx within a verb when obtaining/releasing leases.
…
Known Issues
A lease can be acquired, and a db transaction may then be in progress when the lease expires. A new lease may be granted while that db transaction is still ongoing, making the lease unsafe.
Rejected Alternatives
We could use Go's context timeout instead of panic()
This made the API more finicky by needing to set ctx to the new context
This also made it harder to remove the timeout once a lease was released
We could avoid the need of a prior declaration of a lease variable, and instead have code call ftl.ObtainLease("user", "update", userId)
Not a bad approach, but declaring leases obtained by a module might be a pattern we want to encourage, as it helps people new to the module understand how the module is ensuring safety.