# Cron Jobs
**Author**: Matt Toohey
## Description
Cron jobs are verbs that run automatically on a recurring schedule.
From here on, this doc refers to them as "jobs" for simplicity.
## Motivation (optional)
This is useful for kicking off periodic work (e.g. clean-up, batching, reporting).
## Goals
- Allow verbs to declare themselves as jobs with a schedule
## Non-Goals (optional)
- Specific handling of errors from scheduled verbs (retry, etc)
- Configuring schedules per environment: dev, staging and prod all use the same job schedule
## Design
### Jobs are verbs with annotations
Go:
```
//ftl:cron 0 0 * * *
func ExampleJob(ctx context.Context) error { … }
```
Kotlin:
```
@Export
@Cron("0 0 * * *")
```
These verbs must be empty (no request or response parameters); anything else is a schema error.
There is no need to also include `//ftl:export` above these verbs, as the cron directive is clear enough.
In the schema, verbs will be annotated with cron details.
```
verb exampleJob(Unit) Unit
+cron * * * * * * *
```
When deploying a module, cron jobs are extracted from the schema and inserted into the `cron_jobs` table.
### Supported cron features
We will support the following patterns in cron:
- These variations:
  - 5 fields: `<minutes> <hours> <day-of-month> <month> <day-of-week>`
  - 6 fields: `<seconds> <minutes> <hours> <day-of-month> <month> <day-of-week>`
  - 7 fields: `<seconds> <minutes> <hours> <day-of-month> <month> <day-of-week> <year>`
- These features:
  - Ranges: `x-y`
  - Unrestricted ranges: `*`
  - Steps: `x/y`
  - Lists: `1,3,7`
- Not supported:
  - Special characters: `?`, `L`, `W`, `LW`, `#`
  - Special words:
    - `@hourly`, `@daily`, ...
    - `SUN`, `MON`, ...
    - `JAN`, `FEB`, ...
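
The rules above can be sketched as a small pre-check. This is only an illustration, not the parser the design will use: `validateCronSchedule` is a hypothetical helper that checks the field count and the allowed character set, and it deliberately ignores numeric bounds (for example, it would accept `99` as an hour).

```go
package main

import (
	"fmt"
	"strings"
)

// validateCronSchedule is a hypothetical helper. It accepts 5, 6 or 7
// whitespace-separated fields made up only of digits, '*', '-', '/' and ','.
// Special characters (?, L, W, #) and words (@daily, MON, JAN) are rejected.
func validateCronSchedule(schedule string) error {
	fields := strings.Fields(schedule)
	if n := len(fields); n < 5 || n > 7 {
		return fmt.Errorf("expected 5, 6 or 7 fields, got %d", n)
	}
	for i, field := range fields {
		for _, r := range field {
			switch {
			case r >= '0' && r <= '9':
			case r == '*', r == '-', r == '/', r == ',':
			default:
				return fmt.Errorf("field %d (%q): unsupported character %q", i+1, field, r)
			}
		}
	}
	return nil
}

func main() {
	for _, s := range []string{"0 0 * * *", "*/15 0 1,15 * 1-5", "0 0 * * MON", "@daily"} {
		fmt.Printf("%q -> %v\n", s, validateCronSchedule(s))
	}
}
```

A check like this could run when the schema is built, so that an unsupported schedule is reported as a schema error rather than at deployment time.
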
### Data Model
```sql
CREATE TYPE job_state AS ENUM (
'idle',
'executing'
);
CREATE TABLE cron_jobs
(
id BIGINT NOT NULL GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
deployment_id BIGINT NOT NULL REFERENCES deployments (id) ON DELETE CASCADE,
verb VARCHAR NOT NULL,
schedule VARCHAR NOT NULL,
start_time TIMESTAMPTZ NOT NULL,
next_execution TIMESTAMPTZ NOT NULL,
state job_state NOT NULL DEFAULT 'idle',
-- Some denormalisation for performance. Without this we need to do a two table join.
module_name VARCHAR NOT NULL
);
```
#### GetCronJobs
```sql
SELECT j.id as id, d.key as deployment_key, j.module_name as module, j.verb, j.schedule, j.start_time, j.next_execution, j.state
FROM cron_jobs j
INNER JOIN deployments d on j.deployment_id = d.id
WHERE d.min_replicas > 0;
```
#### CreateCronJob
Creates the row and returns its full representation (including the joined deployment key).
```sql
WITH j AS (
INSERT INTO cron_jobs (deployment_id, module_name, verb, schedule, start_time, next_execution)
VALUES ((SELECT id FROM deployments WHERE key = sqlc.arg('deployment_key')::deployment_key LIMIT 1),
sqlc.arg('module_name')::TEXT,
sqlc.arg('verb')::TEXT,
sqlc.arg('schedule')::TEXT,
sqlc.arg('start_time')::TIMESTAMPTZ,
sqlc.arg('next_execution')::TIMESTAMPTZ)
RETURNING *
)
SELECT j.id as id, d.key as deployment_key, j.module_name as module, j.verb, j.schedule, j.start_time, j.next_execution, j.state
FROM j
INNER JOIN deployments d on j.deployment_id = d.id
LIMIT 1;
```
#### StartCronJobs
- Attempts to start multiple jobs in the db
- Returns rows for all attempted jobs so the caller knows their current state, with extra columns indicating whether:
  - the job was successfully updated to `executing` (`updated`)
  - the deployment has been scaled down to `min_replicas` = 0 (`has_min_replicas` is false)
```sql
WITH updates AS (
UPDATE cron_jobs
SET state = 'executing',
start_time = (NOW() AT TIME ZONE 'utc')::TIMESTAMPTZ
WHERE id = ANY (sqlc.arg('ids'))
AND state = 'idle'
AND start_time < next_execution
AND (next_execution AT TIME ZONE 'utc') < (NOW() AT TIME ZONE 'utc')::TIMESTAMPTZ
RETURNING id, state, start_time, next_execution)
SELECT j.id as id, d.key as deployment_key, j.module_name as module, j.verb, j.schedule,
COALESCE(u.start_time, j.start_time) as start_time,
COALESCE(u.next_execution, j.next_execution) as next_execution,
COALESCE(u.state, j.state) as state,
d.min_replicas > 0 as has_min_replicas,
CASE WHEN u.id IS NULL THEN FALSE ELSE TRUE END as updated
FROM cron_jobs j
INNER JOIN deployments d on j.deployment_id = d.id
LEFT JOIN updates u on j.id = u.id
WHERE j.id = ANY (sqlc.arg('ids'));
```
#### EndCronJob
- Used when finishing or timing out a job
```sql
-- name: EndCronJob :one
WITH j AS (
UPDATE cron_jobs
SET state = 'idle',
next_execution = sqlc.arg('next_execution')::TIMESTAMPTZ
WHERE id = sqlc.arg('id')::BIGINT
AND state = 'executing'
AND start_time = sqlc.arg('start_time')::TIMESTAMPTZ
RETURNING *
)
SELECT j.id as id, d.key as deployment_key, j.module_name as module, j.verb, j.schedule, j.start_time, j.next_execution, j.state
FROM j
INNER JOIN deployments d on j.deployment_id = d.id
LIMIT 1;
```
#### [TBD] Indexes
### Controllers & Coordination
Each controller will have a cronjob service, which is responsible for maintaining the state of cron jobs and triggering their execution.
What jobs is each controller responsible for executing?
- Jobs will be assigned to multiple (2) controllers using a hash ring
- Exception: when a deployment is brand new, other controllers will not know of it until their next reset (see below). Only the controller which created the deployment knows about the newly created jobs.
  - This controller will treat these jobs as its responsibility until it resets its list of jobs
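
As an illustration of the assignment, here is a minimal sketch of picking two controllers for a job from a hash ring. The hash function (FNV-1a), the key format and the "walk clockwise from the job's hash" rule are all assumptions made for the example; `responsibleControllers` is a hypothetical name, not part of the controller.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// responsibleControllers returns up to `replicas` distinct controllers for a
// job key. Controller keys are assumed to be unique.
func responsibleControllers(controllers []string, jobKey string, replicas int) []string {
	if len(controllers) == 0 {
		return nil
	}
	// Order controllers by their position on the ring.
	ring := append([]string(nil), controllers...)
	sort.Slice(ring, func(i, j int) bool {
		hi, hj := hashKey(ring[i]), hashKey(ring[j])
		if hi != hj {
			return hi < hj
		}
		return ring[i] < ring[j]
	})
	// Find the first controller at or after the job's hash, wrapping around.
	target := hashKey(jobKey)
	start := sort.Search(len(ring), func(i int) bool { return hashKey(ring[i]) >= target })
	if replicas > len(ring) {
		replicas = len(ring)
	}
	out := make([]string, 0, replicas)
	for i := 0; i < replicas; i++ {
		out = append(out, ring[(start+i)%len(ring)])
	}
	return out
}

func main() {
	controllers := []string{"controller-a", "controller-b", "controller-c"}
	fmt.Println(responsibleControllers(controllers, "time.cleanup", 2))
}
```

Because every controller computes the same ring from the same membership list, each can decide locally whether a job is its responsibility without extra coordination.
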
The cronjob service will be notified by the controller in the following cases:
- Created deployment:
  - Happens when the controller creates a deployment (it is not notified when other controllers create a deployment)
  - Finds verbs with cronjob metadata, adds them to the db, and updates the known list of cronjobs
- Killed/Replaced deployment:
  - Only the controller executing this change can respond to it; others will wait for the next reset, or will find out about the change when trying to execute a relevant job (see below)
The cronjob service will respond to the following internal events:
- Reset (every 1 min):
  - Refetches all cronjobs from the db
- Hash ring updated (5s max):
  - May cause scheduling changes if the set of cronjobs the controller is responsible for changes
- One or more cronjobs are ready to be attempted:
  - Tries to update the db (`StartCronJobs`), and if the cronjob row was successfully updated to `executing`, triggers an FTL call to the verb
  - All cronjobs that were attempted receive the latest version from the db, so the service can update its known cronjob list
  - Synchronously waits for the call to finish, then updates the db (`EndCronJob`) and triggers a FinishedJob event
- Job finished:
  - Updates the list of known cronjobs with the newer versions
### Detecting timeouts:
- Soft timeout:
  - When a controller executes a job, it uses a context with a timeout. If execution does not complete within that time, the call should end with an error
- Hard timeout:
  - Controllers also query the db for any jobs which have overrun the timeout plus a grace period, and reset their state to idle. This handles cases where the controller which started the call is no longer around.
- Timeouts are set to 5 mins by default but can be overridden with configuration
### Scheduled jobs may be skipped
- If a schedule is frequent, jobs will not be triggered while a previous execution is still active
  - e.g. a schedule of `* * * * *` for a verb that takes 5 mins will skip the next 4 triggers
- Timeouts can affect this as well
### TBD:
- write up details of history table (link to call table, create the row when starting a call)
- Multiple deployments of same module... If currently running old and new versions of a module, do we want multiple runs of cronjobs? Is safety around this out of scope?
## Rejected Alternatives (optional)
- Every controller attempts every job:
  - Rejected: too much db load
- One controller attempts every job:
  - Rejected: too much load on one controller