---
title: DEADLOCKS
tags: presentation, PulpCon 2022
slideOptions:
  center: false
  transition: fade
  slideNumber: true
---

<style>
.reveal {
  font-size: 24px;
}
</style>

# DEADLOCKS:
## Fear and Loathing

Presenter: Grant Gainey
Event: PulpCon 2022
URL: https://hackmd.io/@ggainey/deadlocks_f_and_l

---

# Agenda

* Introduction
* Problem Space
* Postgres locks
* Alerting
* Prevention
* Code examples
* Conclusions
* Q&A

---

# Introduction

* Who am I?
  * Grant Gainey, Principal Software Engineer, Red Hat
* What are we talking about?
  * Postgres database deadlocks in Pulp
* What are we NOT talking about?
  * Application-level lock issues
  * Deep dives into low-level Postgres locking details

---

# Problem Space

* What are deadlocks?
* Whose fault are they?
* How can one deal with them?
* Why are they so hard to test for/fix?

----

<!-- .slide: style="font-size: 18px;" -->
## What is a deadlock?

* Two (or more!) processes, all waiting for the others to finish, forever
* Four Requirements
  * Mutual exclusion
    * a Thing can only be held by one process at a time
    * shared-resources are Nifty!
  * Hold and Wait
    * Once I have a Thing and need another, I'm going to hold onto My Thing till I get what I want
  * No Preemption
    * I can't **force** you to give me what I'm waiting for
  * Circular Wait
    * I have a fork and need a knife; you have the knife and need the fork; we both starve.
* See ["The Dining Philosophers" problem](https://en.wikipedia.org/wiki/Dining_philosophers_problem)

![](https://i.imgur.com/buyZnym.png)

----

## Whose fault is it?

### The First Rule of Deadlocks:

Deadlocks are **application programming errors**. Always.

* It's not Postgres' fault.
* It's not Django's fault.
* It's not the user's workflow's fault.

You can decide not to fix a deadlock, for good reasons - but that doesn't mean your code isn't in error.

----

## What ways are there to deal with them?

* Ignore
  * Without forced-failure - the whole system can back up
    * see "Olde Tasking System" in Pulp
  * With forced-failure (e.g., current Postgres) - someone fails
    * 2 syncs, deadlock happens, one sync fails
    * user has to retry, is sad/outraged, we get bug reports
* Detect/retry
  * On failure, completely restart the failing task (possibly repeatedly)
  * Resetting to initial-conditions can lead to interesting code-complexity
* Prevent
  * What we're going to be talking about here
  * Requires code that is aware of concurrency, and Takes Steps to protect itself

----

## Why are they so hard to test for/fix?

* Data-dependent
  * No overlapping data? No problem!
  * **Example**: sync 15 independent repositories simultaneously - no problem!
* Timing-window-dependent
  * locks may be held for very short time periods
  * **Example**: update one field on 100 records - microseconds of locking
  * May need hundreds of passes through flawed code before hitting the problem
* Concurrency-dependent
  * Getting the data and timing to line up can take significant concurrency
  * **Example**: could **only** be recreated "in the wild" when you:
    * immediate-sync five or more copies of the **exact same repository**
    * at the **exact** same time,
    * with ~14K pieces of content,
    * into a Pulp instance **that had never seen the data before**

---

# Postgres lock details

* Lock types/hierarchy
* Table locks
* Row locks

----

## Lock types/hierarchy

* There are many lock types
* Here are the [Gory Details](https://www.postgresql.org/docs/current/explicit-locking.html)
  * 13.3.1. Table-Level Locks
  * 13.3.2. Row-Level Locks
* "The best defense against deadlocks is generally to avoid them by being certain that all applications using a database acquire locks on multiple objects in a consistent order."
  * (the next slide sketches that rule in a dozen lines of plain Python)
* As a general rule, you shouldn't know or care at the Python level
* All rules have exceptions, because Software
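----

<!-- .slide: style="font-size: 18px;" -->
## "Consistent order", in miniature

A minimal sketch of Circular Wait and the ordering rule - plain Python locks, nothing Pulp- or Postgres-specific; `fork`, `knife`, and `philosopher` are illustrative names only:

```python
from threading import Lock, Thread

fork, knife = Lock(), Lock()


def philosopher(name, first, second):
    with first:
        # Hold-and-Wait: we keep `first` while asking for `second`.
        # The timeout stands in for Postgres' deadlock detector - with a
        # bare acquire(), both threads would be stuck here forever.
        if second.acquire(timeout=2):
            second.release()
            print(name, "ate happily")
        else:
            print(name, "starved - circular wait")


# Opposite acquisition order == deadlock-prone.
a = Thread(target=philosopher, args=("A", fork, knife))
b = Thread(target=philosopher, args=("B", knife, fork))
for t in (a, b):
    t.start()
for t in (a, b):
    t.join()
```

* Hand both philosophers the locks in the **same** order, and the deadlock is impossible - that's the whole rule.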
----

## Table locks

* Can put a lock on an entire table
* Ruins concurrency for use of entire data models
* If EXCLUSIVE - nobody else can touch that table until released
* This is a Big Hammer, and almost never works as well as you're hoping
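For the record, the Big Hammer looks something like this from Django - a hedged sketch, shown so you can recognize it rather than copy it; `core_content` is just an illustrative table name:

```python
from django.db import connection, transaction

with transaction.atomic():
    with connection.cursor() as cursor:
        # ACCESS EXCLUSIVE: every other reader/writer of this table now
        # queues behind this transaction until COMMIT/ROLLBACK.
        cursor.execute("LOCK TABLE core_content IN ACCESS EXCLUSIVE MODE")
        ...  # do the "protected" work, while all concurrency dies
```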
----

## Row locks

* Locks an individual row or rows in a table
* No concurrency impact as long as the data doesn't overlap
* "update" locks rows - but in **arbitrary order**
  * one can deliberately "select_for_update" in a way that enforces order at the SQL level, from Django
  * example later

---

# Alerting

## How do you know you have a deadlock problem?

* You will **NOT** have "tasks all stuck waiting"
  * Postgres detects deadlocks, picks a philosopher to kill
* **Pulp task-error-field** tells you 'deadlock detected'
* **journalctl** tells you the same
* **Postgres log** tells you the same, with SQL specifics
* grep logs for "deadlock detected" and/or "ShareLock"

----

## Which log will tell me what?

* Postgres log
  * will show the db-processes that are colliding
  * can show the exact SQL involved
    * log_statement = 'all' in `data/postgresql.conf`
  * knows nothing about tasks or python objects
  * `/var/lib/pgsql/data/log/postgresql-DDD.log` (Fedora)
  * `/var/opt/rh/rh-postgresql12/lib/pgsql/data/log/postgresql-DDD.log` (RHEL7)
* journalctl log
  * will show the Pulp failures
  * will show some of the Postgres errors (though not, e.g., the SQL)
  * will show the **context** of the failure
    * knowing what was going on when the deadlock happened is critical
* START WITH THE LOGS
  * assuming you know what's happening leads only to tears

----

<!-- .slide: style="font-size: 12px;" -->
## Some Sample Output

* Pulp task output

```python
{'child_tasks': [],
 'created_resources': [],
 'error': {'description': 'deadlock detected\n'
                          'DETAIL: Process 31559 waits for ShareLock on '
                          'transaction 4972396; blocked by process 31883.\n'
                          'Process 31883 waits for ShareLock on transaction '
                          '4972398; blocked by process 31559.\n'
                          'HINT: See server log for query details.\n'
                          'CONTEXT: while inserting index tuple (49588,3) in '
                          'relation "rpm_package_pkgId_key"\n',
           'traceback': '...'
           }
}
```

* Postgres log output

```
127.0.0.1(43873):60c9b8ea.5449:2021-06-16 04:40:17.065 EDT:app=[unknown]:pid=21577:ERROR: deadlock detected
127.0.0.1(43873):60c9b8ea.5449:2021-06-16 04:40:17.065 EDT:app=[unknown]:pid=21577:DETAIL: Process 21577 waits for ShareLock on transaction 13163681; blocked by process 21565.
Process 21565 waits for ShareLock on transaction 13163680; blocked by process 21577.
Process 21577: INSERT INTO "core_artifact" ("pulp_id", "pulp_created", "pulp_last_updated", "file", "size", "md5", "sha1", "sha224", "sha256", "sha384", "sha512") VALUES (all-the-values-here)
Process 21565: INSERT INTO "core_artifact" ("pulp_id", "pulp_created", "pulp_last_updated", "file", "size", "md5", "sha1", "sha224", "sha256", "sha384", "sha512") VALUES (all-the-values-here)
127.0.0.1(43873):60c9b8ea.5449:2021-06-16 04:40:17.065 EDT:app=[unknown]:pid=21577:HINT: See server log for query details.
127.0.0.1(43873):60c9b8ea.5449:2021-06-16 04:40:17.065 EDT:app=[unknown]:pid=21577:CONTEXT: while inserting index tuple (10762,2) in relation "core_artifact_sha256_key"
127.0.0.1(43873):60c9b8ea.5449:2021-06-16 04:40:17.065 EDT:app=[unknown]:pid=21577:STATEMENT: INSERT INTO "core_artifact" ("pulp_id", "pulp_created", "pulp_last_updated", "file", "size", "md5", "sha1", "sha224", "sha256", "sha384", "sha512") VALUES ('15b7946a-1e75-461c-99d9-c2244eea44e0'::uuid, '2021-06-16T08:40:16.019791+00:00'::timestamptz, '2021-06-16T08:40:16.019827+00:00'::timestamptz, 'artifact/44/414c7788cd8a1722906fb34bb019290d1269bb24bbf3ce9f1d63fe15464354', 83110084, NULL, 'ca43066416a794b8cfb822eff9ebf9f28cb202cd', '6699ad24ccbecc3508bc22d5890f489ea047518d4e43c4766a39e33a', '44414c7788cd8a1722906fb34bb019290d1269bb24bbf3ce9f1d63fe15464354', 'dbff1aaf5bec37bbe77a028826fa2df26a1cc23185d9fb8e32160a2fe626bda835c81232ba10092cec9d40e175cd1b8d', '0e2bb59c3bcb8304e7879ed1a7a6f099be74cde005a814ea57df1e8cd670f19f0638cfaddecfd1804632bbac599a34729c23d6a9b0bdfdc5ab87f29bfe040194'), ('eb61e401-00aa-405d-9746-788e0cb8171e'::uuid, '2021-06-16T08:40:16.025499+00:00'::timestamptz, '2021-06-16T08:40:16.025541+00:00'::timestamptz, 'artifact/aa/71e111378ebf86acc7cc3be9342bd531686e7da57c4ba9d11dce28568356fd', 83759432, NULL, '2144350cd1442c7cb26ca6bc9d8d62723cb885de', '822899be4cb83ecb84951913d66b6e77c9b75d9d17e610b835ab4445', 'aa71e111378ebf86acc7cc3be9342bd531686e7da57c4ba9d11dce28568356fd', 'fe2f747706bdb6e59fd6e072a4ef4de240f4aaaa5fee3230a5a8386d9efdfc45561c066dfb55577eb10c150db5189aee', '28e0923bce01f818c174868538c3fb321c469458485978d8a0dff93ea2efbddbcb2de9e671a9924197fd3e74b7c99d43d1bf8a0744e3277ea1f7e5827ddc358f'), ('6ad9c70e-1b89-4141-80d8-5c08c62a6391'::uuid, '2021-06-16T08:40:16.026218+00:00'::timestamptz, '2021-06-16T08:40:16.026239+00:00'::timestamptz, 'artifact/01/41012c79a64e19eb7226de0c92f7132aaf85aba98c90fde4190a912e646bbf', 83113088, NULL, 'e916686c96edb4552054fb6de000ccfcb6bf3740', 'fe480371f7cd2a95c36d0e1b42f8d6179838f5bd6f8e97de9b5fcd8c', '0141012c79a64e19eb7226de0c92f7132aaf85aba98c90fde4190a912e646bbf', '5523d2af63f3f39676e86776871f13fbcd45c540eec274a7bd2df31bba1aaf9619d1e6abfa175ebcf16c0aa69642ffe0', 'cec5e688a31602bab11595f5316e5c0ce7278271a612737d9b41a67f1b9873713781b2f1bdec8b0f7d6a0e27728a831714bd99d79cec7916e3d7751f09976369'), ('52688fd7-23ef-40da-8f0e-680ecf79af9e'::uuid, '2021-06-16T08:40:16.026752+00:00'::timestamptz, '2021-06-16T08:40:16.026772+00:00'::timestamptz, 'artifact/2a/20884fea781c0e27b39ea3def7da7e5c829c0c26fe9cca8a9a9a7317adc580', 81421384, NULL, '63e540ef02aad8dc10395a31c1e8c52f62e9c25b', 'c93578e511e306afde1a50307eacd61647ce30020b4f5d36c11983e4', '2a20884fea781c0e27b39ea3def7da7e5c829c0c26fe9cca8a9a9a7317adc580', 'b438fc8d7efcfdebe356ab35bedf4ac61fabbff4cddb09885a570c7fc95aca1f594e2b32e8eefed03c9274c35eadd60d', '799fdc774c33e1c9227c5c5a9d50cc421ab053899de630e65ae48420d5af204839c4674f7a2e5c474a056d2ad64f65a5506825fab94db838e7cef1e4de032be7') RETURNING "core_artifact"."pulp_id"
```

* Journalctl output

```
Aug 10 15:51:24 host pulpcore-api[54517]: pulp [88649f1c-3393-4693-aab7-d73b62eeda62]: - - [10/Aug/2022:19:51:24 +0000] "GET /pulp/api/v3/tasks/06d88a9c-2048-4aad-9967-f5012aafbf3f/ HTTP/1.1" 200 578 "-" "OpenAPI-Generator/3.14.15/ruby"
Aug 10 15:51:30 host pulpcore-worker-5[54158]: pulp [88649f1c-3393-4693-aab7-d73b62eeda62]: pulpcore.tasking.pulpcore_worker:INFO: Task 407bb67b-65d0-4d65-b9c8-b1aa1f2c87fd failed (deadlock detected
Aug 10 15:51:30 host pulpcore-worker-5[54158]: DETAIL:
Process 55740 waits for ShareLock on transaction 61803; blocked by process 55746.
Aug 10 15:51:30 host pulpcore-worker-5[54158]: Process 55746 waits for ShareLock on transaction 61805; blocked by process 55740.
Aug 10 15:51:30 host pulpcore-worker-5[54158]: HINT: See server log for query details.
Aug 10 15:51:30 host pulpcore-worker-5[54158]: CONTEXT: while locking tuple (209,51) in relation "core_contentartifact"
Aug 10 15:51:30 host pulpcore-worker-5[54158]: )
Aug 10 15:51:30 host pulpcore-worker-5[54158]: pulp [88649f1c-3393-4693-aab7-d73b62eeda62]: pulpcore.tasking.pulpcore_worker:INFO: File "/usr/lib/python3.6/site-packages/pulpcore/tasking/pulpcore_worker.py", line 342, in _perform_task
Aug 10 15:51:30 host pulpcore-worker-5[54158]: result = func(*args, **kwargs)
...
```
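----

<!-- .slide: style="font-size: 18px;" -->
## Sidebar: what detect/retry looks like

Once the logs confirm it really is a deadlock, the "Detect/retry" option from Problem Space can be sketched as below. Hedged: `run_with_deadlock_retry` and `do_work` are illustrative names, not pulpcore API; the real constraint is that `do_work` must be safely restartable from initial conditions.

```python
from django.db import transaction
from django.db.utils import OperationalError


def run_with_deadlock_retry(do_work, attempts=3):
    for attempt in range(attempts):
        try:
            with transaction.atomic():
                return do_work()
        except OperationalError as exc:
            # Postgres' chosen victim surfaces in Django as
            # OperationalError("deadlock detected ..."); anything else
            # is not ours to swallow.
            if "deadlock detected" not in str(exc) or attempt == attempts - 1:
                raise
            # atomic() has already rolled everything back - go around again.
```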
---

# How can I make my deadlock happen?!?

* Large, inconsistent reproducers can make the debug/fix cycle take...weeks
  * Example: "Sync a 12K-entity file-repo into 15 repos in a clean Pulp, simultaneously."
* **Q:** How can one force the problem "artificially"?
* **A:** Get creative, and use `pulpcore-manager shell`!
* Examples follow, from the "fix the deadlock in `touch()`" experiments:

----

## FAIL

<!-- .slide: style="font-size: 18px;" -->

```python
from threading import Thread

from django.utils.timezone import now
from pulpcore.plugin.models import Content


def update_timestamp(index):
    content_q = Content.objects.filter().order_by("pk").values_list("pk", flat=True)[0:30000]
    # One big UPDATE; Postgres takes the row-locks in whatever order it
    # visits the rows, so concurrent threads collide - and deadlock.
    Content.objects.filter(
        pk__in=content_q
    ).select_for_update().update(timestamp_of_interest=now())
    print(">>>done")


for r in range(3):
    threads = []
    for i in range(5):
        threads.append(Thread(target=update_timestamp, args=(i,)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

----

## SUCCESS

<!-- .slide: style="font-size: 18px;" -->

```python
from threading import Thread
import statistics
import time

from django.db import transaction
from django.utils.timezone import now
from pulpcore.plugin.models import Content

durations = []


def update_timestamp(index):
    print(">>>in update_timestamp index {}".format(index))
    start = time.time()
    with transaction.atomic():
        # SELECT ... ORDER BY pk FOR UPDATE: every thread queues for the
        # row-locks in the same order, so a circular wait can't form.
        sub_q = Content.objects.filter().order_by("pk").select_for_update()
        Content.objects.filter(pk__in=sub_q).update(timestamp_of_interest=now())
    end = time.time()
    durations.append(end - start)
    print(">>>done")


for r in range(3):
    threads = []
    for i in range(5):
        threads.append(Thread(target=update_timestamp, args=(i,)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()

print("Avg time : {}".format(sum(durations) / len(durations)))
print("Median time : {}".format(statistics.median(durations)))
print("StdDev : {}".format(statistics.stdev(durations)))
```

* note: select_for_update(skip_locked=True) worked **for touch** and was WAY faster - sketched on the next slide
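----

<!-- .slide: style="font-size: 18px;" -->
## SUCCESS, skip_locked variant

A hedged sketch of that skip_locked note - `touch_content` and `pks` are illustrative, not the shipped pulpcore code:

```python
from django.db import transaction
from django.utils.timezone import now
from pulpcore.plugin.models import Content


def touch_content(pks):
    with transaction.atomic():
        # Rows a concurrent caller has already locked get *skipped* instead
        # of waited on - no waiting, no circular wait, far less contention.
        sub_q = (
            Content.objects.filter(pk__in=pks)
            .order_by("pk")
            .select_for_update(skip_locked=True)
        )
        Content.objects.filter(pk__in=sub_q).update(timestamp_of_interest=now())
```

* The trade: "every caller definitely updates every row" becomes "somebody updates it, nobody deadlocks" - fine for a last-touched timestamp, wrong for (say) a counter.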
---

# Prevention

* avoid operations on overlapping content
* consistently order updates in Python
* pre-locking bulk objects ("select for update")
* locking entire tables (NO!)

---

# Specific code examples from Pulp3

* sync pipeline, ordering
  * [PR 1637](https://github.com/pulp/pulpcore/pull/1637/files)
* touch(), ordering and "Somebody Else's Problem" (!)
  * [PR 2159](https://github.com/ggainey/pulpcore/blob/fb1edee7686ff39d8cf493d08d61502f259107d2/pulpcore/app/models/content.py#L104-L145) raw-SQL select_for_update()
  * [PR 2230](https://github.com/ggainey/pulpcore/blob/e0c782969fbbe81c02c4f65cbbb734e23d52eb3a/pulpcore/app/models/content.py#L103-L116) Fixed to use Django correctly
* more sync pipeline (avoid)
  * [PR 3286](https://github.com/pulp/pulpcore/pull/3286/files)

---

# Conclusions

* Think about concurrency when you're coding
  * Think harder about it when you're testing
* When you're doing more than one thing at a time - ORDER MATTERS
  * ordering costs performance, but not as much as Being Wrong
* Anything with the word "bulk" in the name should be viewed with Deepest Suspicion
* We might decide not to fix a deadlock - but it's still **OUR BUG**

---

# Questions?

* Answers are not guaranteed
* Sympathy is

---

# Bibliography/Suggested Reading

* Postgres, [Gory Locking Details](https://www.postgresql.org/docs/current/explicit-locking.html)
* Pulp, [touch() deadlock investigation](https://hackmd.io/@pulp/touch_deadlock)
  * [redmine-9441](https://pulp.plan.io/issues/9441)
  * Ordering, [PR 1644](https://github.com/pulp/pulpcore/pull/1644)
  * Raw SQL, [PR 2159](https://github.com/pulp/pulpcore/pull/2159)
  * Using Django correctly, [PR 2230](https://github.com/pulp/pulpcore/pull/2230)
* Pulp, [sync-pipeline deadlocks](https://hackmd.io/@ggainey/fun_w_deadlocks)
  * [redmine-8750](https://pulp.plan.io/issues/8750)
  * [PR 1637](https://github.com/pulp/pulpcore/pull/1637)
* Wikipedia, [Deadlock](https://en.wikipedia.org/wiki/Deadlock)
* Wikipedia, [Dining Philosophers](https://en.wikipedia.org/wiki/Dining_philosophers_problem)

---

# Thanks!
{"title":"DEADLOCKS","tags":"presentation, PulpCon 2022","slideOptions":{"center":false,"transition":"fade","slideNumber":true}}