# Zero-Downtime Policy for Migrations

###### tags: `PulpCon 2020` `Migrations`

## Motivation

* Downtime is not desirable in a lot of environments.
* Clustered installations require that all parts of the cluster are stopped before any server is upgraded. This extends the downtime of an upgrade.
* Deployments on Kubernetes that are managed by the Pulp operator need to be able to support a mixture of containers that are at pulpcore 3.y and 3.y+1.

## What does the upgrade process look like now?

1) Stop all services
1) Upgrade code
1) Run migrations
1) Start all services

## What does the Zero-downtime upgrade process look like?

When upgrading from 3.y.z to 3.y+1, the following is possible:

1) Upgrade code
1) Run migrations
1) Restart all services (in a cluster this would be a "rolling restart")

## How can we get there?

* Plugin writers need documentation on how they can introduce gradual changes that will allow users to upgrade part of their Pulp infrastructure at a time.
* Changes that require a change in the database schema need to be split into two migrations.
    * The first migration, delivered with 3.y+1, needs to add a new column that 3.y+1 will use.
    * The second migration, delivered with 3.y+2, needs to remove the old column that is no longer needed in 3.y+1 and 3.y+2.
* CI that checks that migrations are compatible with the latest 3.y.z release.

## Resources

* https://pypi.org/project/django-pg-zero-downtime-migrations/
* CI checks to look for risky migrations https://wxweekly.com/zero-downtime-deploys-a-tale-of-django-migrations-7a040f425e4a
* Pulp ticket https://pulp.plan.io/issues/7120

## Next steps

* Add CI that looks at our migrations and provides analysis similar to the article on wxweekly.com [dkliban]
* Send out an email to pulp-dev list announcing the plan to start working toward this goal [dkliban]
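
## Sketch: CI check for risky migrations

The CI check mentioned under the next steps could start as something like the sketch below. It is not the approach from the linked article and not an existing Pulp tool. It only walks the syntax tree of a migration file and reports operations that remove or rename schema objects, since those are the ones most likely to break services still running 3.y during a rolling restart. The operation names are real classes in `django.db.migrations`, but the choice of which ones count as "risky", the function name `find_risky_operations`, and the example migration (model `Task`, fields `state` and `new_state`) are assumptions made for illustration.

```python
import ast

# Operation names from django.db.migrations that remove or rename schema
# objects. Treating exactly these as "risky" is an assumption of this sketch.
RISKY_OPERATIONS = {"RemoveField", "RenameField", "DeleteModel", "RenameModel"}


def find_risky_operations(migration_source):
    """Return sorted (line number, operation name) pairs for risky calls."""
    findings = []
    for node in ast.walk(ast.parse(migration_source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Matches both `migrations.RemoveField(...)` and a bare `RemoveField(...)`.
        if isinstance(func, ast.Attribute):
            name = func.attr
        else:
            name = getattr(func, "id", None)
        if name in RISKY_OPERATIONS:
            findings.append((node.lineno, name))
    return sorted(findings)


# Hypothetical migration used only to exercise the check. It is parsed as
# text, so Django does not need to be installed to run this sketch.
EXAMPLE_MIGRATION = '''
from django.db import migrations, models


class Migration(migrations.Migration):
    dependencies = [("core", "0001_initial")]
    operations = [
        migrations.AddField("Task", "new_state", models.TextField(null=True)),
        migrations.RemoveField("Task", "state"),
    ]
'''

if __name__ == "__main__":
    for lineno, name in find_risky_operations(EXAMPLE_MIGRATION):
        print(f"line {lineno}: {name} may break services still running 3.y")
```

In the example, adding `new_state` passes the check while removing `state` in the same migration is flagged. That matches the two-migration split described above: the removal would belong in a later release. A real check would likely also need to consider `AlterField` and raw SQL operations, which this sketch ignores.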