
Zero Downtime Working Group

https://discourse.pulpproject.org/t/support-zero-downtime-updates/645

Upcoming

  • Maybe a great example to exercise ZDU migrations: https://github.com/pulp/pulpcore/issues/3400
    • Labels are persisted as a large join table, but maybe an HStoreField is better suited and more performant. This would give us a rather intricate migration to start learning with.

Oct 20, 2022

Attendees: dkliban, ipanova, mdellweg, bmbouter

Oct 27

Attendees: awcrosby, ggainey, ipanova, mdellweg, bmbouter

Nov 14

Attendees: bmbouter, dkliban, mdellweg, ggainey

  • When to meet next? - Dec 01 1100 EST
  • Since last meeting, an epic has been opened
  • reviewed the above
    • worker-serializer could/should include version-info desired
    • workers have a minimal/maximal?
    • do we really need 3365?
      • discussion ensues
      • maybe not necessary?
      • this was to let migrations decide "this might be bad" - really only need info from the db
  • discussion on 3368
    • tasks must be backwards-compatible?
    • "if you're going to add a new, required, parameter, the policy is 'you need to have a new method', not change the old one"
    • do we need version-info at task-added-to-queue time? And have workers know "their" acceptable versions?
    • how can worker decide "I can't pick up this task?"
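
A minimal sketch of that policy, with hypothetical task names and parameters: rather than changing an existing task's signature, a new task is added alongside it, so workers still running old code can keep executing invocations that are already queued.

```python
# Minimal sketch of the "add a new method, don't change the old one" policy.
# Task names and parameters are hypothetical.

def sync(remote_pk):
    """Existing task. Its signature stays untouched so workers still running
    old code can execute invocations that are already queued."""
    ...


def sync_v2(remote_pk, mirror_mode):
    """New task with an additional required parameter. Only new code dispatches
    it, so old workers never have to understand the new argument."""
    ...
```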

Problem Statement Roundup

Problem: today, migrations need to run while Pulp is offline.
Solution: run them while Pulp stays online.

Problems running migrations while Pulp is online:

  • Old application code is not compatible with new data format

Migrations running with long transactions lead to deadlock issues.

  • "real" deadlocks? Or just "it's taking too long, there might be a problem somewhere"
  • for whole-table updates, it might be better to have the migration take an explicit lock on the table before updating it (see the sketch after this list)
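
A sketch of that idea as a Django data migration; the app label, table name, the SQL itself, and the lock_timeout value are made up for illustration:

```python
# Sketch only: take an explicit table lock for a whole-table rewrite instead of
# letting row locks pile up inside a long transaction. Names are hypothetical.
from django.db import migrations


class Migration(migrations.Migration):

    dependencies = [("core", "0001_initial")]

    operations = [
        migrations.RunSQL(
            sql=[
                # Fail fast rather than queueing behind other sessions forever.
                "SET LOCAL lock_timeout = '5s';",
                "LOCK TABLE core_task IN ACCESS EXCLUSIVE MODE;",
                "UPDATE core_task SET state = lower(state);",
            ],
            reverse_sql=migrations.RunSQL.noop,
        ),
    ]
```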

Plugin migrations depend on pulpcore migrations.

Upgrade Path Migration Problems

Plugins cannot upgrade many releases at a time

Examples

  • A database table is added
  • A database table is renamed
  • A database table is removed
  • A database column is renamed
  • A database column is removed
  • A database column is added
  • The data of a column is reformatted, e.g. a new choice is added to a choice field

Hypothetical Zero Downtime Upgrade Steps

  1. Run the migrations while old pulp code is online (old code, new data format)
  2. Rolling restart old code to become new code
    • old and new code are running at the same time!
  3. Apply post-upgrade migrations

What if

  • Breaking change migrations specify which versions of which components "must be" running before this change can be allowed
  • Such migrations start by querying the DB for the current status (via heartbeat-msgs) of everything running in the system/cluster
  • If anything is running at lower-than-required, the migration is DENIED and the upgrade stops (see the sketch after this list)
  • In this situation, the "old" components and the versions they need to be running are identified, and the admin is expected to upgrade or shut down any pods with old code
  • At that point, the upgrade can be restarted
  • AI: All components MUST have a heartbeat-checkin, for this to work
    • API-workers prob need this
    • gunicorn processes?
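
A rough sketch of the "deny the migration" idea as a pre-flight data migration. The heartbeat model, its fields, and the required version are assumptions for illustration, not existing pulpcore API:

```python
# Rough sketch of a "pre-flight" migration that refuses to run while old
# components are still online. The ComponentHeartbeat model and its fields are
# invented for illustration; they are not the current pulpcore API.
from django.db import migrations

REQUIRED_CORE = (3, 22, 0)  # minimum version every running component must report


def _as_tuple(version):
    return tuple(int(part) for part in version.split("."))


def ensure_cluster_is_new_enough(apps, schema_editor):
    Heartbeat = apps.get_model("core", "ComponentHeartbeat")  # hypothetical model
    for beat in Heartbeat.objects.all():
        running = beat.versions.get("core")  # hypothetical JSON field, e.g. {"core": "3.21.5"}
        if running is None or _as_tuple(running) < REQUIRED_CORE:
            raise RuntimeError(
                f"{beat.name} reports core {running}; upgrade or shut it down, "
                "then restart the upgrade."
            )


class Migration(migrations.Migration):

    dependencies = [("core", "0001_initial")]

    operations = [
        migrations.RunPython(ensure_cluster_is_new_enough, migrations.RunPython.noop),
    ]
```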

Detailed Example

Assume the tasking system has RUNNING and CANCELED but not CANCELING. The change is to add CANCELING in the new application code. Assume the original column was named "states".

The migration would:

  • Create a new column named "new_states" in PostgreSQL; the ORM can keep referring to it by the same Python attribute name. It would have all 3 states.
    • Maybe invent a versioning scheme: "states@v2".
    • You can tell the Django ORM to use a different name for the db column than the Python attribute (see the sketch after this list).
  • A trigger would be set up so that a write to the "new_states" column also sets the value on "states" as part of the write.
    • If RUNNING is written, write RUNNING to "states"
    • If CANCELED is written, write CANCELED to "states"
    • If CANCELING is written, do not write anything to "states"; the old application code is not prepared to deal with this state
  • A second trigger would be set up so that a write to the "states" column also sets the value on "new_states" as part of the write.
    • If RUNNING is written, write RUNNING to "new_states"
    • If CANCELED is written, write CANCELED to "new_states"
  • Data from the existing column would be copied to the new column transactionally, in batches, as the migration runs
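
A sketch of the above in Django terms. The app label, the core_task table, the primary-key column, the trigger SQL, and the batch size are all illustrative, not a proposal for the exact implementation:

```python
# Sketch of the "states" / "new_states" dance described above.

# --- models.py (new application code) --------------------------------------
from django.db import models


class Task(models.Model):
    # New code keeps using the attribute name "state", but it is backed by the
    # new DB column, so old code can keep reading/writing the old "states" column.
    state = models.TextField(db_column="new_states")


# --- migrations/00xx_new_states.py ------------------------------------------
from django.db import migrations

TRIGGERS = """
-- New code writes new_states; mirror values the old code understands into states.
CREATE FUNCTION copy_new_states_to_states() RETURNS trigger AS $$
BEGIN
    IF NEW.new_states IN ('RUNNING', 'CANCELED') THEN
        NEW.states := NEW.new_states;
    END IF;  -- CANCELING is deliberately not mirrored; old code cannot handle it
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER sync_new_states BEFORE INSERT OR UPDATE OF new_states ON core_task
    FOR EACH ROW EXECUTE FUNCTION copy_new_states_to_states();

-- Old code writes states; mirror every value into new_states.
CREATE FUNCTION copy_states_to_new_states() RETURNS trigger AS $$
BEGIN
    NEW.new_states := NEW.states;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER sync_states BEFORE INSERT OR UPDATE OF states ON core_task
    FOR EACH ROW EXECUTE FUNCTION copy_states_to_new_states();
"""


def backfill_in_batches(apps, schema_editor):
    # Copy existing rows over in small chunks so no single transaction holds
    # locks on the whole table.
    with schema_editor.connection.cursor() as cursor:
        while True:
            cursor.execute(
                "UPDATE core_task SET new_states = states "
                "WHERE pulp_id IN ("
                "    SELECT pulp_id FROM core_task WHERE new_states IS NULL LIMIT 1000"
                ")"
            )
            if cursor.rowcount == 0:
                break


class Migration(migrations.Migration):

    atomic = False  # let the batched backfill commit independently

    dependencies = [("core", "0001_initial")]

    operations = [
        migrations.RunSQL(TRIGGERS, reverse_sql=migrations.RunSQL.noop),
        migrations.RunPython(backfill_in_batches, migrations.RunPython.noop),
    ]
```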

Questions

  • What happens when the new column has an initial condition that is new?
    • db-triggers have to be thought about REALLY HARD in terms of "what to do" in this case
  • How about "migration-dependent app-code paths"?
    • Python code queries "is migration 001-foo applied to the db?" and runs the new code path if so, else the old code path (see the sketch after this list)
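
A small sketch of such a migration-dependent code path using Django's MigrationRecorder; the migration name and the cancel helper are hypothetical:

```python
# Sketch of a migration-dependent code path: pick the new behavior only once a
# given migration is known to be applied. The migration name is hypothetical.
from django.db import connection
from django.db.migrations.recorder import MigrationRecorder


def canceling_state_available():
    """Return True if the (hypothetical) 0042_new_states migration is applied."""
    applied = MigrationRecorder(connection).applied_migrations()
    return ("core", "0042_new_states") in applied


def cancel_task(task):
    if canceling_state_available():
        task.state = "canceling"   # new code path
    else:
        task.state = "canceled"    # old code path, safe for the old schema
    task.save()
```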

Relevant tickets

See the EPIC here: https://github.com/pulp/pulpcore/issues/3278

Scratchpad Notes

  • Zero downtime does not come as a free gift. We cannot just add a library that takes care of it; we need to be very careful to write migrations in a way that both new and old code can work with, and we may need a lot of database-version-aware code.
  • In any case, it is worth shutting down workers while migrating.
  • Maybe we can introduce a maintenance mode where no costly tasks can be dispatched, but at least the content app stays online, to reduce the amount of code that needs this awareness.
  • Our plugin infrastructure makes things additionally complex.
  • We cannot upgrade from anywhere without downtime.
  • We need to select windows for zero downtime upgrades. (single commits; whole releases?)
  • We probably need to align destructive database migrations with the deprecation policy.
  • Not allowing migrations in z-stream releases was a very wise decision.