# Leases

**Author**: @aat, @mtoohey

<!-- Status of the document - draft, in-review, etc. - is conveyed via HackMD labels -->

## Description

FTL support for [leases](https://martinfowler.com/articles/patterns-of-distributed-systems/lease.html) for use internally and by modules.

## Motivation

Some kind of distributed locking mechanism is a very common requirement when building distributed systems. Leases are a (relatively) simple solution to this problem.

Internally, the FTL Controller has implemented ad-hoc leases in a number of places, so a general-purpose leasing system would be very useful internally as well. Specific examples include:

- [Finite state machines](https://hackmd.io/@ftl/BJA1l21bR)
- [Async functions](/r8Q3GHWuQFKJ2Bm1dEnd7g)
- Assignment of modules to Runners

## Goals

- Allow modules to acquire a lease on a resource
- Safely enforce expiry of leases

### Non-Goals

- No hierarchy in namespaces (a possible future feature)
- Re: Module Leases
- Recursive leases (a possible future feature)

## Terminology

- namespace: a list of strings (`["a", "b"]`), which can also be represented as a period-separated string (`a.b`)
- key: provided at runtime to obtain a lease. A key and namespace are used to look for competing leases
    - TODO: rename as value? confusing with leasekey
- scope: namespace + key
- lease key: a unique key for each lease obtained

## Design

Each controller will have a `LeaseCoordinator` to manage leases.
```go
// Sketch of the coordinator API (signatures only).
func (c *LeaseCoordinator) Acquire(lease Lease, wait time.Duration) error
func (c *LeaseCoordinator) Heartbeat(lease Lease) error
func (c *LeaseCoordinator) ReleaseOk(lease Lease[S], success S) error
func (c *LeaseCoordinator) ReleaseFail(lease Lease, err error) error
func (c *LeaseCoordinator) ReapStaleLeases(namespaces []Namespace, includeChildren bool, reap func(LeaseKey, Scope, *dal.Tx) error) error
```

Lease is a protocol

```go
type Namespace []string

type Scope struct {
	Namespace Namespace
	Key       string
}

type Lease[Success any] interface {
	Key() LeaseKey
	Scope() Scope
	Duration() time.Duration
	SetDeadline() time.Time

	// May be called multiple times until a transaction succeeds
	WillInsert(ctx context.Context, tx *dal.Tx, deadline time.Time)
	WillHeartbeat(ctx context.Context, tx *dal.Tx, deadline time.Time) error
	WillReleaseOk(ctx context.Context, tx *dal.Tx, success Success) error
	WillReleaseFail(ctx context.Context, tx *dal.Tx, failure error) error
}
```

Leases have a scope made of a namespace (list of strings) and a key (e.g. user id).

- `.` is an invalid character in namespace strings, so that we can store this list concatenated in the db. This could be checked at validation time if literals are used, and at runtime for safety.

There will be a grace period: owners treat their lease as expired before the controllers consider it timed out.

#### Acquiring and releasing a lease

To acquire a lease, the following happens:

- Create a new struct (a use-case-specific struct conforming to `Lease`)
- Acquire the lease by calling `LeaseCoordinator.Acquire(lease, waitingTimeout)`
    - If you want an immediate lease, and to fail otherwise, pass a zero waiting timeout. Otherwise pass in a reasonable timeout.
- Coordinator starts a transaction
- Coordinator attempts to insert the lease into the database. It then calls `WillInsert()` on the lease to allow the lease to make any custom db changes
- Coordinator then commits the transaction
- If it succeeds, it returns the new deadline to the caller.
- If it fails, the lease coordinator waits until the db notifies it that a lease that could block this lease has been released, and retries. It gives up once the waiting timeout has been reached.

#### Heartbeating (aka renewing) & expiring a lease

- Leases can be renewed to extend their deadline
    - It is expected that leases are renewed at an interval shorter than the lease duration, so that if an attempt to renew fails, there's a chance to recover and maintain the lease.
    - It is the responsibility of the lease to attempt renewals when appropriate.
    - Renewals update the deadline to `Now() + Duration`
- It is up to the lease to stop its work by the deadline. The coordinator will not notify the lease at the time of the deadline
- Controllers are responsible for finding expired leases in the db and reaping them
    - TODO: define how this responsibility is shared across controllers
    - TODO: Is this just a periodic job, or is it maintained like cronjobs and polled less often?
    - TODO: custom logic per use case?

#### Reaping

The lease coordinator cannot be the sole component responsible for reaping, because in some use cases the lease table is not the only table that needs updating. Instead, it is up to each use case to reap its own leases via the lease coordinator.

`ReapStaleLeases()` will need to be called on the lease coordinator along with the namespaces and a function to call when reaping a lease, to allow other db updates.

#### DB tables and queries

// TODO

table needs:

- leasekey
- deadline
- state => [active, ...] ?
- namespace
- key

db events / channel

queries:

- acquire
- release
- renew
- upsert?

### Use case: Controllers

[proposed refactor of existing code]

### Use case: Runners

[proposed refactor of existing code]

- Namespace: `runner.???`
- Controllers already have a stream from the runner, which they use to upsert the latest info for the runner into the db, including its last_seen time.
- We will remove the last_seen time from the runners db table
- When a controller receives a streamed message from the runner, it will upsert (obtain / renew) a lease for the runner
- TODO: reaping, any custom logic?

### Use case: Runner Reservation

[proposed refactor of existing code]

- Namespace: `runner.reserve`
- TODO: remove some columns from runner table
- When a controller wants to reserve a runner, it will acquire a lease for that runner
- While the controller waits for the module server to respond, it will heartbeat the lease
- The controller will release the lease once the runner completes reservation

### Use case: Async calls

[no existing code for this yet]

- Async calls will have their own table of scheduled calls
- To execute a call, an async call coordinator will attempt to get a lease for the call
- If it gets the lease, it can execute the call
- As the call continues, the controller will heartbeat the lease
- At the end of the call it will complete the async call and release the lease in the same db transaction

### Use case: Modules

[new feature]

Namespace: `module.<...namespace provided by module...>`

To define a lease with a namespace in Go:

```go
var userLease = ftl.Lease("user", "update")
```

We will not include leases in the schema for now. We may want to in the future, as it could be nice to see in the console what a module/verb could be in contention with for leases.

#### Acquiring and releasing:

To obtain a lease and then release it at the end of a verb:

```go
//ftl:export
func UpdateUser(ctx context.Context, req UpdateUserRequest) error {
	// Acquire a lease scoped to this user; fails if one cannot be obtained.
	lease, err := userLease.Acquire(ctx, req.UserId)
	if err != nil {
		return err
	}
	defer lease.Release()

	// process data, update db, etc
	return nil
}
```

```protobuf!
// TODO: include VerbService change
```

- The module will make a gRPC call to the controller
- The controller will attempt to acquire a lease with a reasonable waiting time
- In case of an error:
    - Controller responds to module with an error, module server returns an error back to user code
- In case of success:
    - Controller responds to module with a lease key and a deadline
    - Module then returns to user code with an acquired lease

#### Renewing and expiry:

Module code does not need to worry about renewing leases. The ftl package will make periodic calls back to the controller to renew active leases.

- The ftl package will track the deadline and, if it is reached, will call the context's cancel handler
    - This will kill the verb execution if it does not finish in time. The same will occur for any subsequent calls which are tied to the lease.

#### Propagation:

If a call with an acquired lease makes another call, we need to make sure that at expiry time all of these calls are terminated.

- This should happen due to how contexts and gRPC are set up. But we should test to confirm.

### Required changes

- Module server needs to obtain a context with a cancel handler before executing a verb. This allows ftl to cancel the context if a lease deadline is reached and avoids needing to fiddle with ctx within a verb when obtaining/releasing leases
- ...

### Known Issues

- A lease can be acquired, and a db transaction may then be in progress when the lease expires. A new lease may be granted while that db transaction is still ongoing, making the lease unsafe.
## Rejected Alternatives

- We could use Go's context timeout instead of `panic()`
    - This made the API more finicky by needing to set `ctx` to the new context
    - This also made it harder to remove the timeout once a lease was released
- We could avoid the need for a prior declaration of a lease variable, and instead have code call `ftl.ObtainLease("user", "update", userId)`
    - Not a bad approach, but declaring leases obtained by a module might be a pattern we want to encourage, as it helps people new to the module understand how the module is ensuring safety.