Reserved-Resource debugging

Why?

It's possible to kill pulp-workers so hard that they disappear without being able to clean up. In this case, Pulp thinks they're still around.

This can cause several problems. Symptoms include:

Some tasks never leave 'waiting', even though nothing is running on the system. This happens when the tasks are assigned to a zombie worker - it's never going to come back, but Pulp keeps giving it things to do because it looks like it's still alive.
NOTHING happens - everything is sitting in 'waiting', nothing in 'running'. This happens when there's a ReservedResource that is never going to be released, and every task is backed up waiting for it, or for something that is waiting for it.

Debugging what exactly is going on can be hard. This doc is a braindump of commands we've used to investigate when things go Awry.

Tools used in the sample commands below include:

pulp-cli for the pulp command
httpie for the http command
jq for massaging json output
rq for querying the RQ queues
pulpcore-manager for gathering data directly from Django without having to crawl inside the database
- running in a katello/pulp env can have some gotchas
  - run pulpcore-manager as the pulp user, from a directory pulp can read
  - OR, apply this patch : https://github.com/rochacbruno/dynaconf/pull/570/files
  - invoke as follows:
    - sudo -u pulp PULP_SETTINGS='/etc/pulp/settings.py' DJANGO_SETTINGS_MODULE='pulpcore.app.settings' pulpcore-manager

Gimme answer RIGHT NOW!

The following will tell you if you have a zombie worker holding a resource:

sudo -u pulp PULP_SETTINGS='/etc/pulp/settings.py' \
   DJANGO_SETTINGS_MODULE='pulpcore.app.settings' \
   pulpcore-manager shell <<EOF
from pulpcore.app.models import ReservedResource, Worker
# ids of every worker Pulp currently considers online
online = {w.pulp_id for w in Worker.objects.online_workers()}
# check every reservation, so a worker holding several resources reports all of them
for rr in ReservedResource.objects.all():
  if rr.worker_id not in online:
    print(f'Worker {rr.worker_id} owns ReservedResource {rr.pulp_id} and is not in online_workers!!')
EOF

Example output when Something Is Wrong:

Worker f346bf07-eff3-4039-8507-41cb571c0e54 owns ReservedResource 0dbea5c6-b1e4-4463-9cde-7447e0f48911 and is not in online_workers!!

To clean up the locked ReservedResource and clean up the zombie, you can use this:

sudo -u pulp PULP_SETTINGS='/etc/pulp/settings.py' \
   DJANGO_SETTINGS_MODULE='pulpcore.app.settings' \
   pulpcore-manager shell <<EOF
from pulpcore.app.models import ReservedResource, Worker
# ids of every worker Pulp currently considers online
online = {w.pulp_id for w in Worker.objects.online_workers()}
# check every reservation, so a worker holding several resources has all of them released
for rr in ReservedResource.objects.all():
  if rr.worker_id not in online:
    zombie_id = rr.worker_id
    print(f'Worker {zombie_id} owns ReservedResource {rr.pulp_id} and is not in online_workers!!')
    print('Cleaning up...')
    rr.delete()
    # mark the zombie as cleaned up so Pulp stops handing it work
    w = Worker.objects.get(pulp_id=zombie_id)
    w.cleaned_up = True
    w.save()
EOF

Output:

Worker f346bf07-eff3-4039-8507-41cb571c0e54 owns ReservedResource 24975e48-ccb2-4d48-9bbe-f3c50d43b608 and is not in online_workers!!
Cleaning up...

Details, Details

Task questions

How many tasks are waiting?
pulp task list --state=waiting | jq length
OR
http :/pulp/api/v3/tasks/?state=waiting | jq '.count'
Is anybody running?
pulp task list --state=running | jq length
OR
http :/pulp/api/v3/tasks/?state=running | jq '.count'
How many have failed?
pulp task list --state=failed | jq length
OR
http :/pulp/api/v3/tasks/?state=failed | jq '.count'
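The counts above map onto the two symptoms described earlier. As a rough sketch (plain Python, not part of Pulp or pulp-cli), the response bodies from the `http` commands can be read and compared like this. The function names `task_count` and `likely_symptom` are made up for illustration, and the only field assumed is the `count` key that the `jq '.count'` filters above already rely on.

```python
import json


def task_count(response_text):
    """Return the total from a paginated task-list response body."""
    return json.loads(response_text)["count"]


def likely_symptom(waiting, running):
    """Map waiting/running totals onto the symptoms described in this doc."""
    if waiting == 0:
        return "nothing waiting"
    if running == 0:
        # Everything is queued and nobody is working.
        return "everything waiting, nothing running: suspect a stuck ReservedResource"
    # Work is happening, but some tasks may still be parked on a dead worker.
    return "some tasks waiting while others run: may be normal, or a zombie worker"


if __name__ == "__main__":
    # Hypothetical response bodies, trimmed to the one field that matters here.
    waiting = task_count('{"count": 12, "results": []}')
    running = task_count('{"count": 0, "results": []}')
    print(likely_symptom(waiting, running))
```

This only narrows things down; the ReservedResource checks elsewhere in this doc are still what confirm the diagnosis.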

RQ Stuff

Show RQ stats:
rq info
or, on a Sat6 system
rq info -u redis://localhost:6379/8
Should show >0 workers and queues - if either is 0, Something Has Gone Horribly Wrong with RQ
Empty queues when things have Gone Awry:
rq empty -a

Finding/removing 'stuck' ReservedResource

Find online-workers:
sudo -u pulp PULP_SETTINGS='/etc/pulp/settings.py' \
   DJANGO_SETTINGS_MODULE='pulpcore.app.settings' \
   pulpcore-manager shell -c "from pulpcore.app.models import Worker; print(Worker.objects.online_workers())"
OR
pulp status | jq '.online_workers | .[] | {name, pulp_href}'
Find held resource(s):
ALL:
pulpcore-manager shell -c "from pulpcore.app.models import ReservedResource; print(ReservedResource.objects.all())"
SPECIFIC RESOURCE:
pulpcore-manager shell -c "from pulpcore.app.models import ReservedResource; print(ReservedResource.objects.get(pk='9762ffab-a473-42b5-85af-5150f3673812').__dict__)"
Find worker that is holding a reserved resource:
pulpcore-manager shell -c "from pulpcore.app.models import ReservedResource; print(ReservedResource.objects.get(pk='9762ffab-a473-42b5-85af-5150f3673812').worker)"
If a Worker, above, isn't in the online_workers - that's your problem. Be sad. That reserved-resource Has To Go.
Clean up reserved-resource:
pulpcore-manager shell -c "from pulpcore.app.models import ReservedResource; ReservedResource.objects.get(pk='9762ffab-a473-42b5-85af-5150f3673812').delete()"
Update state of zombie-worker so we stop giving it work to do after it's dead:
pulpcore-manager shell -c "from pulpcore.app.models import Worker; w=Worker.objects.get(name='worker@pulp-worker-5976996b5d-r5hwc'); w.cleaned_up=True; w.save()"
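Before deleting anything, the "is this worker actually online?" check can be scripted against saved `pulp status` output. This is a minimal sketch in plain Python. It assumes only what the `jq` filter above already uses: an `online_workers` list whose entries have a `name`. The helper names and the online worker name in the sample are hypothetical.

```python
import json


def online_worker_names(status_text):
    """Collect the names listed under online_workers in `pulp status` output."""
    status = json.loads(status_text)
    return {w["name"] for w in status.get("online_workers", [])}


def is_offline(worker_name, status_text):
    """True when the named worker is absent from online_workers."""
    return worker_name not in online_worker_names(status_text)


if __name__ == "__main__":
    # Hypothetical status payload with a single online worker.
    sample = json.dumps(
        {"online_workers": [{"name": "worker@example-online", "pulp_href": "/example/"}]}
    )
    holder = "worker@pulp-worker-5976996b5d-r5hwc"
    print(f"{holder} offline: {is_offline(holder, sample)}")
```

One way to feed it real data would be `pulp status > status.json` and reading that file in place of `sample`.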

Questions

What happens if a worker is gracefully stopped but still somehow holds a ReservedResource?
Rewrite commands to use "all workers that aren't online"
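One possible shape for the "all workers that aren't online" rewrite, sketched in plain Python so the logic can be tried without Django. The function name `stale_reservations` is made up. Inside a pulpcore-manager shell, the pairs could presumably come from something like `ReservedResource.objects.values_list('worker_id', 'pulp_id')` and the online ids from `Worker.objects.online_workers()`, but that wiring is an assumption and untested here.

```python
from collections import defaultdict


def stale_reservations(reservations, online_worker_ids):
    """Group reserved-resource ids by owning worker, for workers that are not online.

    reservations: iterable of (worker_id, resource_id) pairs
    online_worker_ids: iterable of ids for workers currently online
    """
    online = set(online_worker_ids)
    stale = defaultdict(list)
    for worker_id, resource_id in reservations:
        if worker_id not in online:
            stale[worker_id].append(resource_id)
    return dict(stale)


if __name__ == "__main__":
    # Hypothetical ids: w2 is offline and holds two resources.
    pairs = [("w1", "r1"), ("w2", "r2"), ("w2", "r3")]
    for worker_id, resources in stale_reservations(pairs, ["w1"]).items():
        print(f"Worker {worker_id} is not online but holds {resources}")
```

Grouping by worker keeps every resource a dead worker holds in view, so none is left behind when the worker is marked cleaned up.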
