
# Cron Jobs

**Author**: Matt Toohey

## Description

Verbs can be scheduled to run at set times as cron jobs. From here on, this doc refers to these as "jobs" for simplicity.

## Motivation (optional)

This is useful for kicking off periodic work (e.g. clean-up, batching, reporting).

## Goals

- Allow verbs to declare themselves as jobs with a schedule

## Non-Goals (optional)

- Specific handling of errors from scheduled verbs (retry, etc)
- Configuring schedules based on environment. dev/staging/prod all use the same job schedule

## Design

### Jobs are verbs with annotations

Go:

```
//ftl:cron 0 0 * * *
func ExampleJob(ctx context.Context) error {
	…
}
```

Kotlin:

```
@Export
@Cron("0 0 * * *")
```

These verbs must be empty (no request/response parameters); anything else is a schema error. There is no need to also include `//tbd:export` above these verbs, as the new directive is clear enough.

In the schema, verbs will be annotated with cron details.

```
verb exampleJob(Unit) Unit
    +cron * * * * * * *
```

When deploying a module, cron jobs are extracted from the schema and inserted into the `cron_jobs` table.

### Supported cron features

We will support the following patterns in cron:

- These variations:
    - 5 fields: `<minutes> <hours> <day-of-month> <month> <day-of-week>`
    - 6 fields: `<seconds> <minutes> <hours> <day-of-month> <month> <day-of-week>`
    - 7 fields: `<seconds> <minutes> <hours> <day-of-month> <month> <day-of-week> <year>`
- These features:
    - Ranges: `x-y`
    - Unrestricted ranges: `*`
    - Steps: `x/y`
    - Lists: `1,3,7`
- Not supported:
    - Special characters: `?`, `L`, `W`, `LW`, `#`
    - Special words:
        - `@hourly`, `@daily`, ...
        - `SUN`, `MON`, ...
        - `JAN`, `FEB`, ...
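
The supported and unsupported patterns listed above can be checked syntactically before a schedule is stored. The following is a minimal Go sketch, not the actual FTL parser: the names `ValidatePattern`, `validateField`, and `parseNumber` are hypothetical, it assumes a step may follow `*`, a single value, or a range, and it does not check per-field bounds (for example that minutes stay within 0-59).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseNumber accepts only plain decimal digits (no sign, no names like MON).
func parseNumber(s string) (int, bool) {
	if s == "" {
		return 0, false
	}
	for _, r := range s {
		if r < '0' || r > '9' {
			return 0, false
		}
	}
	n, err := strconv.Atoi(s)
	return n, err == nil
}

// validateField checks one field against the supported syntax:
// "*", single values, ranges "x-y", steps "x/y", and comma-separated lists.
func validateField(field string) error {
	for _, item := range strings.Split(field, ",") {
		base, step, hasStep := strings.Cut(item, "/")
		if hasStep {
			if n, ok := parseNumber(step); !ok || n == 0 {
				return fmt.Errorf("invalid step in %q", item)
			}
		}
		if base == "*" {
			continue
		}
		lo, hi, isRange := strings.Cut(base, "-")
		start, ok := parseNumber(lo)
		if !ok {
			return fmt.Errorf("unsupported value %q", item)
		}
		if isRange {
			end, ok := parseNumber(hi)
			if !ok || end < start {
				return fmt.Errorf("invalid range %q", item)
			}
		}
	}
	return nil
}

// ValidatePattern accepts 5, 6, or 7 whitespace-separated fields and
// rejects special characters and special words.
func ValidatePattern(pattern string) error {
	fields := strings.Fields(pattern)
	if n := len(fields); n < 5 || n > 7 {
		return fmt.Errorf("expected 5, 6 or 7 fields, got %d", n)
	}
	for i, f := range fields {
		if err := validateField(f); err != nil {
			return fmt.Errorf("field %d: %w", i+1, err)
		}
	}
	return nil
}

func main() {
	for _, p := range []string{"0 0 * * *", "*/5 0-12 1,3,7 * * *", "0 0 ? * MON", "@daily"} {
		fmt.Printf("%q valid=%v\n", p, ValidatePattern(p) == nil)
	}
}
```

Rejecting unsupported syntax at this stage would let a bad schedule surface as a schema error at deploy time rather than as a job that silently never runs.
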
### Data Model

```sql
CREATE TYPE job_state AS ENUM (
  'idle',
  'executing'
);

CREATE TABLE cron_jobs (
  id BIGINT NOT NULL GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
  deployment_id BIGINT NOT NULL REFERENCES deployments (id) ON DELETE CASCADE,
  verb VARCHAR NOT NULL,
  schedule VARCHAR NOT NULL,
  start_time TIMESTAMPTZ NOT NULL,
  next_execution TIMESTAMPTZ NOT NULL,
  state job_state NOT NULL DEFAULT 'idle',

  -- Some denormalisation for performance. Without this we need to do a two table join.
  module_name VARCHAR NOT NULL
);
```

#### GetCronJobs

```sql
SELECT j.id as id, d.key as deployment_key, j.module_name as module, j.verb, j.schedule, j.start_time, j.next_execution, j.state
FROM cron_jobs j
  INNER JOIN deployments d on j.deployment_id = d.id
WHERE d.min_replicas > 0;
```

#### CreateCronJob

Creates and returns the full representation of the row (inc joins)

```sql
WITH j AS (
  INSERT INTO cron_jobs (deployment_id, module_name, verb, schedule, start_time, next_execution)
  VALUES (
    (SELECT id FROM deployments WHERE key = sqlc.arg('deployment_key')::deployment_key LIMIT 1),
    sqlc.arg('module_name')::TEXT,
    sqlc.arg('verb')::TEXT,
    sqlc.arg('schedule')::TEXT,
    sqlc.arg('start_time')::TIMESTAMPTZ,
    sqlc.arg('next_execution')::TIMESTAMPTZ)
  RETURNING *
)
SELECT j.id as id, d.key as deployment_key, j.module_name as module, j.verb, j.schedule, j.start_time, j.next_execution, j.state
FROM j
  INNER JOIN deployments d on j.deployment_id = d.id
LIMIT 1;
```

#### StartCronJobs

- Attempts to start multiple jobs in the db
- Returns rows for all jobs attempted so caller knows the current state, with extra columns indicating if:
    - job was successfully updated to `executing`
    - deployment has been set to `minReplicas` == 0

```sql
WITH updates AS (
  UPDATE cron_jobs
  SET state = 'executing',
    start_time = (NOW() AT TIME ZONE 'utc')::TIMESTAMPTZ
  WHERE id = ANY (sqlc.arg('ids'))
    AND state = 'idle'
    AND start_time < next_execution
    AND (next_execution AT TIME ZONE 'utc') < (NOW() AT TIME ZONE 'utc')::TIMESTAMPTZ
  RETURNING id, state, start_time, next_execution)
SELECT j.id as id, d.key as deployment_key, j.module_name as module, j.verb, j.schedule,
  COALESCE(u.start_time, j.start_time) as start_time,
  COALESCE(u.next_execution, j.next_execution) as next_execution,
  COALESCE(u.state, j.state) as state,
  d.min_replicas > 0 as has_min_replicas,
  CASE WHEN u.id IS NULL THEN FALSE ELSE TRUE END as updated
FROM cron_jobs j
  INNER JOIN deployments d on j.deployment_id = d.id
  LEFT JOIN updates u on j.id = u.id
WHERE j.id = ANY (sqlc.arg('ids'));
```

#### EndCronJob

- Used when finishing or timing out a job

```sql
-- name: EndCronJob :exec
WITH j AS (
  UPDATE cron_jobs
  SET state = 'idle',
    next_execution = sqlc.arg('next_execution')::TIMESTAMPTZ
  WHERE id = sqlc.arg('id')::BIGINT
    AND state = 'executing'
    AND start_time = sqlc.arg('start_time')::TIMESTAMPTZ
  RETURNING *
)
SELECT j.id as id, d.key as deployment_key, j.module_name as module, j.verb, j.schedule, j.start_time, j.next_execution, j.state
FROM j
  INNER JOIN deployments d on j.deployment_id = d.id
LIMIT 1;
```

#### [TBD] Indexes

### Controllers & Coordination

Each controller will have a cronjob service, which is responsible for maintaining the state of cron jobs and triggering their execution.

What jobs is each controller responsible for executing?

- Jobs will be assigned to multiple (2) controllers using a hashring
    - Exception: When a deployment is brand new, controllers will not know of the deployment until the next reset (see below). Only the controller which created the deployment knows about the newly created jobs.
        - This controller will treat these jobs as its responsibility until it resets its list of jobs

The cronjob service will be notified by the controller of the following cases:

- Created deployment:
    - Happens when this controller creates a deployment (it is not notified when other controllers create a deployment)
    - Finds verbs with cronjob metadata, adds them to the db, and updates the known list of cronjobs
- Killed/Replaced deployment:
    - Only the controller executing this change can respond to it. Others will wait for the next reset, or will find out about the change when trying to execute a relevant job (see below)

The cronjob service will respond to the following internal events:

- Reset (every 1 min):
    - Refetch all cronjobs from db
- Hash ring updated (5s max):
    - May cause scheduling changes if the controller changes which cronjobs it is responsible for
- One or more cronjobs are ready to be attempted:
    - Tries to update the db (`StartCronJobs`), and if the cronjob row was successfully updated to `executing`, triggers an FTL call to the verb
    - All cronjobs that were attempted receive the version from the db, so the service can update the known cronjob list
    - Synchronously waits for the call to finish, then updates the db (`EndCronJob`) and triggers a FinishedJob event
- Job finished:
    - Updates the list of known cronjobs with the newer versions

### Detecting timeouts

- Soft timeout:
    - When a controller executes a job, it uses a context with a timeout. If execution does not complete within the expected time, the call should end with an error
- Hard timeout:
    - Controllers also query the db for any jobs which have overrun the timeout plus a grace period, and reset their state to idle. This handles cases where the controller which started the call is no longer around.
- Timeouts are set to 5 mins by default but can be overridden with configuration

### Scheduled jobs may be skipped

- If a schedule is frequent, jobs will not be triggered while previous executions are still active
    - e.g. A schedule of `* * * * *` for a verb that takes 5 mins will skip the next 4 triggers
- Timeouts can affect this as well

### TBD

- Write up details of history table (link to call table, create the row when starting a call)
- Multiple deployments of same module... If currently running old and new versions of a module, do we want multiple runs of cronjobs? Is safety around this out of scope?

## Rejected Alternatives (optional)

- Every controller attempts every job:
    - Too much db load
- One controller attempts every job:
    - Too much load on one controller
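
For contrast with the rejected alternatives, the chosen design assigns each job to two controllers via a hash ring, so db and controller load are both spread out. The following is a minimal Go sketch of that kind of assignment, not the controller's actual implementation: the `Ring` type, `NewRing`, `Responsible`, the controller keys, and the job keys are all hypothetical, and it assumes one ring position per controller (no virtual nodes) with FNV-1a as the hash.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// hash32 maps a key to a position on the ring.
func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// point is one controller's position on the ring.
type point struct {
	pos uint32
	key string
}

// Ring is a minimal consistent hash ring over distinct controller keys.
type Ring struct{ points []point }

// NewRing places each controller on the ring, sorted by position.
func NewRing(controllers []string) *Ring {
	r := &Ring{}
	for _, c := range controllers {
		r.points = append(r.points, point{hash32(c), c})
	}
	sort.Slice(r.points, func(i, j int) bool {
		a, b := r.points[i], r.points[j]
		if a.pos != b.pos {
			return a.pos < b.pos
		}
		return a.key < b.key // tie-break so ordering is deterministic
	})
	return r
}

// Responsible returns up to n controllers for a job: the first controller
// at or after the job's ring position, then its clockwise neighbours.
func (r *Ring) Responsible(jobKey string, n int) []string {
	if len(r.points) == 0 {
		return nil
	}
	h := hash32(jobKey)
	start := sort.Search(len(r.points), func(i int) bool { return r.points[i].pos >= h })
	var out []string
	for i := 0; i < len(r.points) && len(out) < n; i++ {
		out = append(out, r.points[(start+i)%len(r.points)].key)
	}
	return out
}

func main() {
	ring := NewRing([]string{"controller-a", "controller-b", "controller-c"})
	for _, job := range []string{"billing.cleanup", "reports.daily"} {
		fmt.Println(job, "->", ring.Responsible(job, 2))
	}
}
```

Because every controller can compute the same assignment locally from the current controller list, no extra coordination is needed beyond the `StartCronJobs` update, which still decides which of the two responsible controllers actually runs the job.
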