# Reserved-Resource debugging

## Why?

It's possible to kill pulp-workers **so hard** that they disappear without being able to clean up. In this case, Pulp thinks they're still around. This can cause several problems. Symptoms include:

* Some tasks never leave 'waiting', even though nothing is running on the system. This happens when the tasks are assigned to a zombie worker - it's never going to come back, but Pulp keeps giving it things to do because it looks like it's still alive.
* NOTHING happens - everything is sitting in 'waiting', nothing in 'running'. This happens when there's a ReservedResource that is never going to be released, and every task is backing up waiting for it, or for something that is waiting for it.

Debugging what exactly is going on can be hard. This doc is a braindump of commands we've used to investigate when things go Awry.

Tools used in the sample commands below include:

* pulp-cli for the `pulp` command
* httpie for the `http` command
* `jq` for massaging JSON output
* `rq` for querying the RQ queues
* `pulpcore-manager` for gathering data directly from Django without having to crawl inside the database
    * running in a katello/pulp env can have some gotchas
        * run pulpcore-manager as the pulp user, from a directory pulp can read
        * OR, apply this patch: https://github.com/rochacbruno/dynaconf/pull/570/files
    * invoke as follows:
        * `sudo -u pulp PULP_SETTINGS='/etc/pulp/settings.py' DJANGO_SETTINGS_MODULE='pulpcore.app.settings' pulpcore-manager`

## Gimme answer RIGHT NOW!

The following will tell you if you have a zombie worker holding a resource:

```
sudo -u pulp PULP_SETTINGS='/etc/pulp/settings.py' \
DJANGO_SETTINGS_MODULE='pulpcore.app.settings' \
pulpcore-manager shell <<EOF
from pulpcore.app.models import ReservedResource, Worker
# Map each worker to a ReservedResource it holds
# (only one resource is remembered per worker).
worker_to_res = {}
for rr in ReservedResource.objects.all():
    worker_to_res[rr.worker_id] = rr.pulp_id
# Workers that Pulp currently considers online
workers = [w.pulp_id for w in Worker.objects.online_workers()]
for rwork in worker_to_res:
    if rwork not in workers:
        print(f'Worker {rwork} owns ReservedResource {worker_to_res[rwork]} and is not in online_workers!!')
EOF
```

Example output when Something Is Wrong:

```
Worker f346bf07-eff3-4039-8507-41cb571c0e54 owns ReservedResource 0dbea5c6-b1e4-4463-9cde-7447e0f48911 and is not in online_workers!!
```

To delete the stuck ReservedResource and mark the zombie worker as cleaned up, you can use this:

```
sudo -u pulp PULP_SETTINGS='/etc/pulp/settings.py' \
DJANGO_SETTINGS_MODULE='pulpcore.app.settings' \
pulpcore-manager shell <<EOF
from pulpcore.app.models import ReservedResource, Worker
# Map each worker to a ReservedResource it holds
# (only one resource is remembered per worker, so rerun until there is no output).
worker_to_res = {}
for rr in ReservedResource.objects.all():
    worker_to_res[rr.worker_id] = rr.pulp_id
# Workers that Pulp currently considers online
workers = [w.pulp_id for w in Worker.objects.online_workers()]
for rwork in worker_to_res:
    if rwork not in workers:
        print(f'Worker {rwork} owns ReservedResource {worker_to_res[rwork]} and is not in online_workers!!')
        print('Cleaning up...')
        # Release the stuck resource, then flag the worker so it gets no more work
        ReservedResource.objects.get(pk=worker_to_res[rwork]).delete()
        w = Worker.objects.get(pulp_id=rwork)
        w.cleaned_up = True
        w.save()
EOF
```

Output:

```
Worker f346bf07-eff3-4039-8507-41cb571c0e54 owns ReservedResource 24975e48-ccb2-4d48-9bbe-f3c50d43b608 and is not in online_workers!!
Cleaning up...
```

## Details, Details

### Task questions

* How many tasks are waiting?
  `pulp task list --state=waiting | jq length` OR `http :/pulp/api/v3/tasks/?state=waiting | jq '.count'`
* Is anybody running?
  `pulp task list --state=running | jq length` OR `http :/pulp/api/v3/tasks/?state=running | jq '.count'`
* How many have failed?
  `pulp task list --state=failed | jq length` OR `http :/pulp/api/v3/tasks/?state=failed | jq '.count'`

### RQ Stuff

* Show RQ stats:
  `rq info` or, on a Sat6 system, `rq info -u redis://localhost:6379/8`

  Should show >0 workers and queues - if either is 0, Something Has Gone Horribly Wrong with RQ
* Empty queues when things have Gone Awry:
  `rq empty -a`

### Finding/removing 'stuck' ReservedResource

* Find online-workers:
  `sudo -u pulp PULP_SETTINGS='/etc/pulp/settings.py' DJANGO_SETTINGS_MODULE='pulpcore.app.settings' pulpcore-manager shell -c "from pulpcore.app.models import Worker; print(Worker.objects.online_workers())"`
  OR
  `pulp status | jq '.online_workers | .[] | {name, pulp_href}'`
* Find held resource(s):
    * ALL: `pulpcore-manager shell -c "from pulpcore.app.models import ReservedResource; print(ReservedResource.objects.all())"`
    * SPECIFIC RESOURCE: `pulpcore-manager shell -c "from pulpcore.app.models import ReservedResource; print(ReservedResource.objects.get(pk='9762ffab-a473-42b5-85af-5150f3673812').__dict__)"`
* Find worker that is holding a reserved resource:
  `pulpcore-manager shell -c "from pulpcore.app.models import ReservedResource; print(ReservedResource.objects.get(pk='9762ffab-a473-42b5-85af-5150f3673812').worker)"`

  If a Worker, above, isn't in the online_workers - that's your problem. Be sad. That reserved-resource Has To Go.
* Clean up reserved-resource:
  `pulpcore-manager shell -c "from pulpcore.app.models import ReservedResource; ReservedResource.objects.get(pk='9762ffab-a473-42b5-85af-5150f3673812').delete()"`
* Update state of zombie-worker so we stop giving it work to do after it's dead:
  `pulpcore-manager shell -c "from pulpcore.app.models import Worker; w=Worker.objects.get(name='worker@pulp-worker-5976996b5d-r5hwc'); w.cleaned_up=True; w.save()"`

## Questions

* What happens if a worker is gracefully stopped but still somehow holds a ReservedResource?
* Rewrite commands to use "all workers that aren't online"
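
The zombie check used throughout this doc boils down to "group every reservation by its holder, then keep the holders that aren't online". Here is a minimal sketch of just that matching step in plain Python, so it can be tried without a Pulp install. The function name `find_orphaned_reservations` and the sample IDs are hypothetical, not part of pulpcore. On a real system the inputs would presumably come from something like `ReservedResource.objects.values_list('worker_id', 'pulp_id')` and the `pulp_id`s of `Worker.objects.online_workers()`, which is an assumption based on the fields used in the commands above. Unlike the dict used in the quick-answer script, this version keeps every resource a worker holds, not only the last one seen.

```python
from collections import defaultdict


def find_orphaned_reservations(reservations, online_worker_ids):
    """Group reserved resources by holder, keeping only holders that are not online.

    reservations: iterable of (worker_id, resource_id) pairs.
    online_worker_ids: iterable of IDs of workers considered online.
    Returns a dict mapping each offline worker_id to a list of its resource_ids.
    """
    online = set(online_worker_ids)
    orphans = defaultdict(list)
    for worker_id, resource_id in reservations:
        if worker_id not in online:
            # This holder is not online, so the reservation will never be released.
            orphans[worker_id].append(resource_id)
    return dict(orphans)


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sample = [
        ("worker-a", "res-1"),
        ("worker-b", "res-2"),
        ("worker-b", "res-3"),
    ]
    for worker_id, resource_ids in find_orphaned_reservations(sample, {"worker-a"}).items():
        print(f"Worker {worker_id} is not online but holds: {', '.join(resource_ids)}")
```

Returning a list per worker means a cleanup pass could delete all of a zombie's reservations in one go instead of needing reruns.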