<style>
.reveal {
font-size: 24px;
}
</style>
# DEADLOCKS:
## Fear and Loathing
Presenter: Grant Gainey
Event: PulpCon 2022
URL: https://hackmd.io/@ggainey/deadlocks_f_and_l
---
# Agenda
* Introduction
* Problem Space
* Postgres locks
* Alerting
* Prevention
* Code examples
* Conclusions
* Q&A
---
# Introduction
* Who am I?
* Grant Gainey, Principal Software Engineer, Red Hat
* What are we talking about?
* Postgres database deadlocks in Pulp
* What are we NOT talking about?
* Application-level lock issues
* Deep dive into low-level Postgres locking details
---
# Problem Space
* What are deadlocks?
* Whose fault are they?
* How can one deal with them?
* Why are they so hard to test for/fix?
----
<!-- .slide: style="font-size: 18px;" -->
## What is a deadlock?
* Two (or more!) processes all waiting for the others to finish, forever
* Four Requirements
* Mutual exclusion
* a Thing can only be held by one process at a time
* shared-resources are Nifty!
* Hold and Wait
* Once I have a Thing and need another, I'm going to hold onto My Thing till I get what I want
* No Preemption
* I can't **force** you to give me what I'm waiting for
* Circular Wait
* I have a fork and need a knife; you have the knife and need the fork; we both starve.
* See ["The Dining Philosophers" problem](https://en.wikipedia.org/wiki/Dining_philosophers_problem)
![](https://i.imgur.com/buyZnym.png)
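All four conditions can be reproduced in a few lines of plain Python, no database required. A minimal sketch (the `acquire` timeout stands in for Postgres' deadlock detector; the barrier just guarantees the threads collide):

```python
import threading

fork, knife = threading.Lock(), threading.Lock()
barrier = threading.Barrier(2)
outcomes = []

def philosopher(first, second, name):
    with first:                       # mutual exclusion + hold-and-wait
        barrier.wait()                # ensure both threads hold their first lock
        # No preemption: all we can do is wait. The timeout stands in for
        # Postgres detecting the deadlock and killing a philosopher.
        if second.acquire(timeout=0.5):
            outcomes.append(f"{name} ate")
            second.release()
        else:
            outcomes.append(f"{name} starved")

# Circular wait: each thread wants exactly what the other one holds.
t1 = threading.Thread(target=philosopher, args=(fork, knife, "t1"))
t2 = threading.Thread(target=philosopher, args=(knife, fork, "t2"))
for t in (t1, t2):
    t.start()
for t in (t1, t2):
    t.join()
print(outcomes)  # at least one philosopher starves
```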
----
## Whose fault is it?
### The First Rule of Deadlocks:
Deadlocks are **application programming errors**. Always.
* It's not Postgres' fault.
* It's not Django's fault.
* It's not the user's workflow's fault.
You can decide not to fix a deadlock, for good reasons - but that doesn't mean your code isn't in error.
----
## What ways are there to deal with them?
* Ignore
* Without forced-failure - whole system can back up
* see "Olde Tasking System" in Pulp
* With forced-failure (eg current Postgres) - someone fails
* 2 syncs, deadlock happens, one sync fails
* user has to retry, is sad/outraged, we get bug reports
* Detect/retry
* On failure, completely restart the failing task (possibly repeatedly)
* Resetting to initial-conditions can lead to interesting code-complexity
* Prevent
* What we're going to be talking about here
* Requires code that is aware of concurrency and Takes Steps to protect itself
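As a sketch of the detect/retry approach - this is a hypothetical helper, not Pulp's actual mechanism - a decorator that re-runs a callable when the database reports a deadlock might look like:

```python
import functools
import time

def retry_on_deadlock(max_attempts=3, base_delay=0.1):
    """Hypothetical helper: rerun a callable that fails with a deadlock.

    In real Django code you would catch django.db.OperationalError;
    RuntimeError keeps this sketch standalone.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except RuntimeError as exc:
                    if "deadlock detected" not in str(exc) or attempt == max_attempts:
                        raise
                    # back off, then reset to initial conditions and retry
                    time.sleep(base_delay * attempt)
        return wrapper
    return decorator

attempts = []

@retry_on_deadlock()
def flaky_task():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("deadlock detected")
    return "ok"

result = flaky_task()  # succeeds on the third attempt
print(result)
```

Note the "interesting code-complexity" caveat above: this only works if `flaky_task` really is safe to run from the top again.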
----
## Why are they so hard to test for/fix?
* Data dependent
* No overlapping data? No problem!
* **Example**: sync 15 independent repositories simultaneously - no problem!
* Timing-window dependent
* locks may be held for very short time periods
* **Example**: update one field on 100 records - microseconds of locking
* May need hundreds of times through flawed code before hitting the problem
* Concurrency-dependent
* Getting the data and timing to line up can take significant concurrency
* **Example**: could **only** be recreated "in the wild" when:
* immediate-sync five or more copies of the **exact same repository**
* at the **exact** same time,
* with ~14K pieces of content,
* into a Pulp instance **that had never seen the data before**
---
# Postgres lock details
* Lock types/hierarchy
* Table locks
* Row locks
----
## Lock types/hierarchy
* There are many lock types
* Here are [Gory Details](https://www.postgresql.org/docs/current/explicit-locking.html)
* 13.3.1. Table-Level Locks
* 13.3.2. Row-Level Locks
* "The best defense against deadlocks is generally to avoid them by being certain that all applications using a database acquire locks on multiple objects in a consistent order."
* As a general rule, you shouldn't know or care at the Python level
* All rules have exceptions, because Software
----
## Table locks
* Can put a lock on an entire table
* Ruins concurrency for everyone touching that data model
* If EXCLUSIVE - nobody else can touch that table until released
* This is a Big Hammer, and almost never works as well as you're hoping
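A toy illustration of the cost - plain Python threading locks, not Postgres, but the shape is the same: one "table lock" serializes every writer, while per-"row" locks let disjoint updates run in parallel.

```python
import threading
import time

rows = {i: 0 for i in range(4)}
table_lock = threading.Lock()
row_locks = {i: threading.Lock() for i in rows}

def update_with_table_lock(i):
    with table_lock:        # every writer queues behind one lock
        time.sleep(0.05)    # pretend this is real work
        rows[i] += 1

def update_with_row_lock(i):
    with row_locks[i]:      # disjoint rows don't contend at all
        time.sleep(0.05)
        rows[i] += 1

def run(fn):
    start = time.time()
    threads = [threading.Thread(target=fn, args=(i,)) for i in rows]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.time() - start

coarse = run(update_with_table_lock)  # roughly 4 x 0.05s - fully serialized
fine = run(update_with_row_lock)      # roughly 0.05s - fully parallel
print(f"table-lock: {coarse:.2f}s, row-lock: {fine:.2f}s")
```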
----
## Row locks
* Locks an individual row or rows in a table
* No concurrency impact as long as the data doesn't overlap
* "update" locks rows - but in **arbitrary order**
* from Django, one can deliberately use `select_for_update` to enforce lock order at the SQL level
* example later
---
# Alerting
## How do you know you have a deadlock problem?
* You will **NOT** have "tasks all stuck waiting"
* Postgres detects deadlocks, picks a philosopher to kill
* **Pulp task-error-field** tells you 'deadlock detected'
* **journalctl** tells you the same
* **Postgres log** tells you the same, with SQL specifics
* grep logs for "deadlock detected" and/or "ShareLock"
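A quick triage sketch for that grep: scan log lines for the telltale strings. The sample lines here are made up, patterned on the real output shown later.

```python
import re

def find_deadlocks(lines):
    """Return (line_number, line) pairs that mention a deadlock or ShareLock."""
    pattern = re.compile(r"deadlock detected|ShareLock")
    return [(n, line) for n, line in enumerate(lines, 1) if pattern.search(line)]

# Fabricated sample lines, shaped like the Postgres output on later slides:
sample = [
    "LOG:  checkpoint starting: time",
    "ERROR:  deadlock detected",
    "DETAIL:  Process 21577 waits for ShareLock on transaction 13163681",
]
print(find_deadlocks(sample))  # flags lines 2 and 3
```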
----
## Which log will tell me what?
* Postgres log
* will show the db-processes that are colliding
* can show the exact SQL involved
* log_statement = 'all' in `data/postgresql.conf`
* knows nothing about tasks or python objects
* `/var/lib/pgsql/data/log/postgresql-DDD.log` (Fedora)
* `/var/opt/rh/rh-postgresql12/lib/pgsql/data/log/postgresql-DDD.log` (RHEL7)
* journalctl log
* will show the Pulp failures
* will show some of the Postgres errors (e.g., not the SQL)
* will show the **context** of the failure
* knowing what was going on when the deadlock happened is critical
* START WITH THE LOGS
* assuming you know what's happening leads only to tears
----
<!-- .slide: style="font-size: 12px;" -->
## Some Sample Output
* Pulp task output
```json
{'child_tasks': [],
'created_resources': [],
'error': {'description': 'deadlock detected\n'
'DETAIL: Process 31559 waits for ShareLock on '
'transaction 4972396; blocked by process 31883.\n'
'Process 31883 waits for ShareLock on transaction '
'4972398; blocked by process 31559.\n'
'HINT: See server log for query details.\n'
'CONTEXT: while inserting index tuple (49588,3) in '
'relation "rpm_package_pkgId_key"\n',
'traceback': '...'
}
}
```
* Postgres log output
```
127.0.0.1(43873):60c9b8ea.5449:2021-06-16 04:40:17.065 EDT:app=[unknown]:pid=21577:ERROR: deadlock detected
127.0.0.1(43873):60c9b8ea.5449:2021-06-16 04:40:17.065 EDT:app=[unknown]:pid=21577:DETAIL: Process 21577 waits for ShareLock on transaction 13163681; blocked by process 21565.
Process 21565 waits for ShareLock on transaction 13163680; blocked by process 21577.
Process 21577: INSERT INTO "core_artifact" ("pulp_id", "pulp_created", "pulp_last_updated", "file", "size", "md5", "sha1", "sha224", "sha256", "sha384", "sha512") VALUES (all-the-values-here)
Process 21565: INSERT INTO "core_artifact" ("pulp_id", "pulp_created", "pulp_last_updated", "file", "size", "md5", "sha1", "sha224", "sha256", "sha384", "sha512") VALUES (all-the-values-here)
127.0.0.1(43873):60c9b8ea.5449:2021-06-16 04:40:17.065 EDT:app=[unknown]:pid=21577:HINT: See server log for query details.
127.0.0.1(43873):60c9b8ea.5449:2021-06-16 04:40:17.065 EDT:app=[unknown]:pid=21577:CONTEXT: while inserting index tuple (10762,2) in relation "core_artifact_sha256_key"
127.0.0.1(43873):60c9b8ea.5449:2021-06-16 04:40:17.065 EDT:app=[unknown]:pid=21577:STATEMENT: INSERT INTO "core_artifact" ("pulp_id", "pulp_created", "pulp_last_updated", "file", "size", "md5", "sha1", "sha224", "sha256", "sha384", "sha512") VALUES ('15b7946a-1e75-461c-99d9-c2244eea44e0'::uuid, '2021-06-16T08:40:16.019791+00:00'::timestamptz, '2021-06-16T08:40:16.019827+00:00'::timestamptz, 'artifact/44/414c7788cd8a1722906fb34bb019290d1269bb24bbf3ce9f1d63fe15464354', 83110084, NULL, 'ca43066416a794b8cfb822eff9ebf9f28cb202cd', '6699ad24ccbecc3508bc22d5890f489ea047518d4e43c4766a39e33a', '44414c7788cd8a1722906fb34bb019290d1269bb24bbf3ce9f1d63fe15464354', 'dbff1aaf5bec37bbe77a028826fa2df26a1cc23185d9fb8e32160a2fe626bda835c81232ba10092cec9d40e175cd1b8d', '0e2bb59c3bcb8304e7879ed1a7a6f099be74cde005a814ea57df1e8cd670f19f0638cfaddecfd1804632bbac599a34729c23d6a9b0bdfdc5ab87f29bfe040194'), ('eb61e401-00aa-405d-9746-788e0cb8171e'::uuid, '2021-06-16T08:40:16.025499+00:00'::timestamptz, '2021-06-16T08:40:16.025541+00:00'::timestamptz, 'artifact/aa/71e111378ebf86acc7cc3be9342bd531686e7da57c4ba9d11dce28568356fd', 83759432, NULL, '2144350cd1442c7cb26ca6bc9d8d62723cb885de', '822899be4cb83ecb84951913d66b6e77c9b75d9d17e610b835ab4445', 'aa71e111378ebf86acc7cc3be9342bd531686e7da57c4ba9d11dce28568356fd', 'fe2f747706bdb6e59fd6e072a4ef4de240f4aaaa5fee3230a5a8386d9efdfc45561c066dfb55577eb10c150db5189aee', '28e0923bce01f818c174868538c3fb321c469458485978d8a0dff93ea2efbddbcb2de9e671a9924197fd3e74b7c99d43d1bf8a0744e3277ea1f7e5827ddc358f'), ('6ad9c70e-1b89-4141-80d8-5c08c62a6391'::uuid, '2021-06-16T08:40:16.026218+00:00'::timestamptz, '2021-06-16T08:40:16.026239+00:00'::timestamptz, 'artifact/01/41012c79a64e19eb7226de0c92f7132aaf85aba98c90fde4190a912e646bbf', 83113088, NULL, 'e916686c96edb4552054fb6de000ccfcb6bf3740', 'fe480371f7cd2a95c36d0e1b42f8d6179838f5bd6f8e97de9b5fcd8c', '0141012c79a64e19eb7226de0c92f7132aaf85aba98c90fde4190a912e646bbf', 
'5523d2af63f3f39676e86776871f13fbcd45c540eec274a7bd2df31bba1aaf9619d1e6abfa175ebcf16c0aa69642ffe0', 'cec5e688a31602bab11595f5316e5c0ce7278271a612737d9b41a67f1b9873713781b2f1bdec8b0f7d6a0e27728a831714bd99d79cec7916e3d7751f09976369'), ('52688fd7-23ef-40da-8f0e-680ecf79af9e'::uuid, '2021-06-16T08:40:16.026752+00:00'::timestamptz, '2021-06-16T08:40:16.026772+00:00'::timestamptz, 'artifact/2a/20884fea781c0e27b39ea3def7da7e5c829c0c26fe9cca8a9a9a7317adc580', 81421384, NULL, '63e540ef02aad8dc10395a31c1e8c52f62e9c25b', 'c93578e511e306afde1a50307eacd61647ce30020b4f5d36c11983e4', '2a20884fea781c0e27b39ea3def7da7e5c829c0c26fe9cca8a9a9a7317adc580', 'b438fc8d7efcfdebe356ab35bedf4ac61fabbff4cddb09885a570c7fc95aca1f594e2b32e8eefed03c9274c35eadd60d', '799fdc774c33e1c9227c5c5a9d50cc421ab053899de630e65ae48420d5af204839c4674f7a2e5c474a056d2ad64f65a5506825fab94db838e7cef1e4de032be7') RETURNING "core_artifact"."pulp_id"
```
* Journalctl output
```
Aug 10 15:51:24 host pulpcore-api[54517]: pulp [88649f1c-3393-4693-aab7-d73b62eeda62]: - - [10/Aug/2022:19:51:24 +0000] "GET /pulp/api/v3/tasks/06d88a9c-2048-4aad-9967-f5012aafbf3f/ HTTP/1.1" 200 578 "-" "OpenAPI-Generator/3.14.15/ruby"
Aug 10 15:51:30 host pulpcore-worker-5[54158]: pulp [88649f1c-3393-4693-aab7-d73b62eeda62]: pulpcore.tasking.pulpcore_worker:INFO: Task 407bb67b-65d0-4d65-b9c8-b1aa1f2c87fd failed (deadlock detected
Aug 10 15:51:30 host pulpcore-worker-5[54158]: DETAIL: Process 55740 waits for ShareLock on transaction 61803; blocked by process 55746.
Aug 10 15:51:30 host pulpcore-worker-5[54158]: Process 55746 waits for ShareLock on transaction 61805; blocked by process 55740.
Aug 10 15:51:30 host pulpcore-worker-5[54158]: HINT: See server log for query details.
Aug 10 15:51:30 host pulpcore-worker-5[54158]: CONTEXT: while locking tuple (209,51) in relation "core_contentartifact"
Aug 10 15:51:30 host pulpcore-worker-5[54158]: )
Aug 10 15:51:30 host pulpcore-worker-5[54158]: pulp [88649f1c-3393-4693-aab7-d73b62eeda62]: pulpcore.tasking.pulpcore_worker:INFO: File "/usr/lib/python3.6/site-packages/pulpcore/tasking/pulpcore_worker.py", line 342, in _perform_task
Aug 10 15:51:30 host pulpcore-worker-5[54158]: result = func(*args, **kwargs)
...
```
---
# How can I make my deadlock happen?!?
* Large, inconsistent reproducers can make debug/fix cycle take...weeks
* Example: "Sync a 12K entity file-repo into 15 repos in a clean Pulp simultaneously."
* **Q:** How can one force a problem "artificially"?
* **A:** Get creative, and use `pulpcore-manager shell`!
* Example follows from the "fix the deadlock in `touch()`" experiments:
----
## FAIL
<!-- .slide: style="font-size: 18px;" -->
```python
from threading import Thread
from pulpcore.plugin.models import Content, RepositoryVersion
from django.utils.timezone import now

durations = []

def update_timestamp(index):
    content_q = Content.objects.filter().order_by("pk").values_list("pk", flat=True)[0:30000]
    Content.objects.filter(
        pk__in=content_q
    ).select_for_update().update(timestamp_of_interest=now())
    print(">>>done")

for r in range(3):
    threads = []
    for i in range(5):
        threads.append(Thread(target=update_timestamp, args=(i,)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```
----
## SUCCESS
<!-- .slide: style="font-size: 18px;" -->
```python
from threading import Thread
import statistics
import time
from pulpcore.plugin.models import Content, RepositoryVersion
from django.db import transaction, connection
from django.utils.timezone import now

durations = []

def update_timestamp(index):
    print(">>>in update_timestamp index {}".format(index))
    start = time.time()
    with transaction.atomic():
        sub_q = Content.objects.filter().order_by("pk").select_for_update()
        Content.objects.filter(pk__in=sub_q).update(timestamp_of_interest=now())
    end = time.time()
    durations.append(end - start)
    print(">>>done")

for r in range(3):
    threads = []
    for i in range(5):
        threads.append(Thread(target=update_timestamp, args=(i,)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()

print("Avg time : {}".format(sum(durations) / len(durations)))
print("Median time : {}".format(statistics.median(durations)))
print("StdDev : {}".format(statistics.stdev(durations)))
```
* note: `select_for_update(skip_locked=True)` worked **for touch** and was WAY faster
---
# Prevention
* avoid operations on overlapping content
* consistently order updates in Python
* pre-locking bulk objects ("select for update")
* locking entire tables (NO!)
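The "consistently order" rule works outside the database, too. A minimal sketch with plain threading locks: sort the locks into one global order before acquiring, and callers who pass them in opposite order can no longer circular-wait.

```python
import threading
import time

lock_a, lock_b = threading.Lock(), threading.Lock()
finished = []

def worker(first, second, name):
    # Acquire in one global order (here: by id()) regardless of the order
    # the caller passed them in - this breaks the circular-wait condition.
    lo, hi = sorted((first, second), key=id)
    with lo:
        time.sleep(0.05)  # the window where unordered code would deadlock
        with hi:
            finished.append(name)

# Same opposite-order callers as the philosophers - but nobody starves.
t1 = threading.Thread(target=worker, args=(lock_a, lock_b, "t1"))
t2 = threading.Thread(target=worker, args=(lock_b, lock_a, "t2"))
for t in (t1, t2):
    t.start()
for t in (t1, t2):
    t.join()
print(sorted(finished))  # → ['t1', 't2'] - both complete, no deadlock
```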
---
# Specific code examples from Pulp3
* sync pipeline, ordering
* [PR 1637](https://github.com/pulp/pulpcore/pull/1637/files)
* touch(), ordering and "Somebody Else's Problem" (!)
* [PR 2159](https://github.com/ggainey/pulpcore/blob/fb1edee7686ff39d8cf493d08d61502f259107d2/pulpcore/app/models/content.py#L104-L145) raw-SQL select_for_update()
* [PR 2230](https://github.com/ggainey/pulpcore/blob/e0c782969fbbe81c02c4f65cbbb734e23d52eb3a/pulpcore/app/models/content.py#L103-L116) Fixed to use Django correctly
* more sync pipeline (avoid)
* [PR 3286](https://github.com/pulp/pulpcore/pull/3286/files)
---
# Conclusions
* Think about concurrency when you're coding
* Think harder about it when you're testing
* When you're doing more than one thing at a time - ORDER MATTERS
* ordering costs performance, but not as much as Being Wrong
* Anything with the word "bulk" in the name should be viewed with Deepest Suspicion
* We might decide not to fix a deadlock - but it's still **OUR BUG**
---
# Questions?
* Answers are not guaranteed
* Sympathy is
---
# Bibliography/Suggested Reading
* Postgres, [Gory Locking Details](https://www.postgresql.org/docs/current/explicit-locking.html)
* Pulp, [touch() deadlock investigation](https://hackmd.io/@pulp/touch_deadlock)
* [redmine-9441](https://pulp.plan.io/issues/9441)
* Ordering, [PR 1644](https://github.com/pulp/pulpcore/pull/1644)
* Raw SQL, [PR 2159](https://github.com/pulp/pulpcore/pull/2159)
* Using Django correctly, [PR 2230](https://github.com/pulp/pulpcore/pull/2230)
* Pulp, [sync-pipeline deadlocks](https://hackmd.io/@ggainey/fun_w_deadlocks)
* [redmine-8750](https://pulp.plan.io/issues/8750)
* [PR 1637](https://github.com/pulp/pulpcore/pull/1637)
* Wikipedia, [Deadlock](https://en.wikipedia.org/wiki/Deadlock)
* Wikipedia, [Dining Philosophers](https://en.wikipedia.org/wiki/Dining_philosophers_problem)
---
# Thanks!