Hey folks, a few days ago I accidentally brought down a 6-node Kubernetes cluster (all nodes were force-rebooted simultaneously). Since the cluster came back, PVC volumes can no longer mount to pods properly. I can keep restarting the pods, and there is a < 5% chance that the PVC mounts properly; most pods in the cluster that need PVCs just fail with the error `an operation with the given Volume ID 0001-0009... already exists`. I have tried really hard to understand what it means, but I couldn't. This cluster had been running for almost 4 months without issue. I use the exact same Ceph settings on 3 of my other on-prem clusters and they have been working fine. I have never hit this kind of issue with a Rook Ceph deployment before, and I have been using Rook since late 2019.

### Basic info:

- rook version: v1.8.8
- ceph version: quay.io/ceph/ceph:v16.2.7-20220317
- CSI versions: the defaults shipped with rook v1.8.8
- k8s version: v1.22.5
- cloud or onprem: onprem (kubespray)
- CNI: cilium v1.10.5 (status all OK)
- similar issue I found on GitHub: https://github.com/rook/rook/issues/4896

### sample pod describe

```bash
# kubectl describe pod minio-3 -n minio
......
Node:         node6/10.195.137.35
......
Events:
  Type     Reason       Age                   From     Message
  ----     ------       ----                  ----     -------
  Warning  FailedMount  59m (x4 over 75m)     kubelet  Unable to attach or mount volumes: unmounted volumes=[export], unattached volumes=[kube-api-access-rv86w export]: timed out waiting for the condition
  Warning  FailedMount  9m20s (x23 over 72m)  kubelet  Unable to attach or mount volumes: unmounted volumes=[export], unattached volumes=[export kube-api-access-rv86w]: timed out waiting for the condition
  Warning  FailedMount  5m (x41 over 74m)     kubelet  MountVolume.MountDevice failed for volume "pvc-876c02ef-5425-4dd2-a229-0123c26101ff" : rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0009-rook-ceph-0000000000000002-e303347b-7e36-11ec-a17d-de98ef639746 already exists
```

### rook ceph cluster status (provisioned by rook, not an external ceph cluster):

```bash
# ceph status
  cluster:
    id:     64ffe70f-1485-4fd2-907c-d56774186339
    health: HEALTH_WARN
            Reduced data availability: 3 pgs inactive
            76 daemons have recently crashed
            154 slow ops, oldest one blocked for 4232 sec, daemons [osd.11,osd.2] have slow ops.

  services:
    mon: 3 daemons, quorum d,e,f (age 69m)
    mgr: a(active, since 71m)
    osd: 18 osds: 18 up (since 69m), 18 in (since 5h)

  data:
    pools:   2 pools, 129 pgs
    objects: 320.42k objects, 1.2 TiB
    usage:   3.4 TiB used, 6.4 TiB / 9.8 TiB avail
    pgs:     2.326% pgs unknown
             126 active+clean
             3   unknown

  io:
    client:   4.7 KiB/s wr, 0 op/s rd, 0 op/s wr

  progress:
    Global Recovery Event (69m)
      [===========================.] (remaining: 99s)
```

All 3 unknown PGs were caused by the forced reboot and report 0 objects; I still have not figured out how to fix them.
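For the record, this is how I have been poking at the 3 unknown PGs from the rook-ceph-tools pod; `<pgid>` stands for whatever IDs `ceph health detail` reports. I have not run the commented-out force-create step, since as far as I know it recreates the PG empty and discards whatever it held:

```bash
# from the rook-ceph-tools pod
ceph health detail            # lists the exact PGs that are inactive/unknown
ceph pg dump_stuck inactive   # the unknown PGs should show up here
ceph pg ls | grep unknown     # cross-check state and object counts
ceph pg <pgid> query          # may hang or error while no OSD owns the PG, but worth a try
# last resort, only if the PG really holds no objects:
# ceph osd force-create-pg <pgid> --yes-i-really-mean-it
```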
### csi-rbdplugin-provisioner -> csi-provisioner

```
I0409 08:07:05.590779       1 controller.go:1471] delete "pvc-988f8d19-5b31-4291-a8da-7c4338dd6014": started
I0409 08:07:05.591053       1 controller.go:1471] delete "pvc-7de43f06-685e-41cc-a35a-8f5ade2cf796": started
E0409 08:07:05.596313       1 controller.go:1481] delete "pvc-7de43f06-685e-41cc-a35a-8f5ade2cf796": volume deletion failed: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0009-rook-ceph-0000000000000002-2284fda2-7786-11ec-a17d-de98ef639746 already exists
E0409 08:07:05.596299       1 controller.go:1481] delete "pvc-988f8d19-5b31-4291-a8da-7c4338dd6014": volume deletion failed: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0009-rook-ceph-0000000000000002-be158406-7470-11ec-a17d-de98ef639746 already exists
```

but I can't find these PVCs anywhere:

```bash
# kubectl get pvc -A | grep pvc-988f8d19-5b31-4291-a8da-7c4338dd6014
# kubectl get pvc -A | grep pvc-7de43f06-685e-41cc-a35a-8f5ade2cf796
```

### csi-rbdplugin-provisioner -> csi-rbdplugin

```
E0409 08:13:07.305869       1 controllerserver.go:827] ID: 134 Req-ID: 0001-0009-rook-ceph-0000000000000002-be158406-7470-11ec-a17d-de98ef639746 an operation with the given Volume ID 0001-0009-rook-ceph-0000000000000002-be158406-7470-11ec-a17d-de98ef639746 already exists
E0409 08:13:07.305933       1 controllerserver.go:827] ID: 133 Req-ID: 0001-0009-rook-ceph-0000000000000002-2284fda2-7786-11ec-a17d-de98ef639746 an operation with the given Volume ID 0001-0009-rook-ceph-0000000000000002-2284fda2-7786-11ec-a17d-de98ef639746 already exists
E0409 08:13:07.305959       1 utils.go:200] ID: 134 Req-ID: 0001-0009-rook-ceph-0000000000000002-be158406-7470-11ec-a17d-de98ef639746 GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0009-rook-ceph-0000000000000002-be158406-7470-11ec-a17d-de98ef639746 already exists
E0409 08:13:07.306005       1 utils.go:200] ID: 133 Req-ID: 0001-0009-rook-ceph-0000000000000002-2284fda2-7786-11ec-a17d-de98ef639746 GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0009-rook-ceph-0000000000000002-2284fda2-7786-11ec-a17d-de98ef639746 already exists
```

### csi-rbdplugin -> driver-registrar

```
I0409 07:05:04.714299    2484 main.go:109] "Kubelet registration probe created" path="/var/lib/kubelet/plugins/rook-ceph.rbd.csi.ceph.com/registration"
I0409 07:05:15.781589    2484 main.go:120] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
```

### csi-rbdplugin -> csi-rbdplugin

```
E0409 08:16:30.776309    2549 utils.go:200] ID: 1431 Req-ID: 0001-0009-rook-ceph-0000000000000002-00f4764b-a0c6-11ec-a1ba-da2817286f1c GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0009-rook-ceph-0000000000000002-00f4764b-a0c6-11ec-a1ba-da2817286f1c already exists
E0409 08:16:37.161811    2549 nodeserver.go:279] ID: 1434 Req-ID: 0001-0009-rook-ceph-0000000000000002-77f676f2-886a-11ec-a17d-de98ef639746 an operation with the given Volume ID 0001-0009-rook-ceph-0000000000000002-77f676f2-886a-11ec-a17d-de98ef639746 already exists
E0409 08:16:37.161886    2549 utils.go:200] ID: 1434 Req-ID: 0001-0009-rook-ceph-0000000000000002-77f676f2-886a-11ec-a17d-de98ef639746 GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0009-rook-ceph-0000000000000002-77f676f2-886a-11ec-a17d-de98ef639746 already exists
E0409 08:16:38.983931    2549 nodeserver.go:279] ID: 1437 Req-ID: 0001-0009-rook-ceph-0000000000000002-2ad307d9-61e8-11ec-af47-5a3527526f98 an operation with the given Volume ID 0001-0009-rook-ceph-0000000000000002-2ad307d9-61e8-11ec-af47-5a3527526f98 already exists
E0409 08:16:38.983999    2549 utils.go:200] ID: 1437 Req-ID: 0001-0009-rook-ceph-0000000000000002-2ad307d9-61e8-11ec-af47-5a3527526f98 GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0009-rook-ceph-0000000000000002-2ad307d9-61e8-11ec-af47-5a3527526f98 already exists
E0409 08:16:46.687122    2549 nodeserver.go:279] ID: 1440 Req-ID: 0001-0009-rook-ceph-0000000000000002-e303347b-7e36-11ec-a17d-de98ef639746 an operation with the given Volume ID 0001-0009-rook-ceph-0000000000000002-e303347b-7e36-11ec-a17d-de98ef639746 already exists
E0409 08:16:46.687197    2549 utils.go:200] ID: 1440 Req-ID: 0001-0009-rook-ceph-0000000000000002-e303347b-7e36-11ec-a17d-de98ef639746 GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0009-rook-ceph-0000000000000002-e303347b-7e36-11ec-a17d-de98ef639746 already exists
```
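I also want to map the volume handles from these errors back to RBD images. As far as I understand, the trailing UUID in the handle is the image suffix, so the image should be named `csi-vol-<uuid>`; the pool name `replicapool` below is only a guess on my part (the handle only encodes the pool ID), so substitute the real pool from the CephBlockPool CR. This is the cross-check I plan to run:

```bash
# from the rook-ceph-tools pod (pool name is an assumption, adjust as needed)
rbd ls replicapool | grep e303347b-7e36-11ec-a17d-de98ef639746
# does some node still hold a watcher on the image?
rbd status replicapool/csi-vol-e303347b-7e36-11ec-a17d-de98ef639746
# does kubernetes still think the volume is attached somewhere?
kubectl get volumeattachment | grep pvc-876c02ef-5425-4dd2-a229-0123c26101ff
```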
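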
### target node kubelet logs (node6)

(note: the issue happens on all 6 nodes of my cluster, not just this one)

```
Apr 9 04:20:23 odc-dbkb06p kubelet[1980]: I0409 04:20:23.170402    1980 reconciler.go:254] "Starting operationExecutor.MountVolume for volume \"pvc-876c02ef-5425-4dd2-a229-0123c26101ff\" (UniqueName: \"kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000002-e303347b-7e36-11ec-a17d-de98ef639746\") pod \"minio-3\" (UID: \"d2947dc6-e3e5-4bee-b224-f547fe9297af\") "
Apr 9 04:20:23 odc-dbkb06p kubelet[1980]: I0409 04:20:23.170542    1980 reconciler.go:254] "Starting operationExecutor.MountVolume for volume \"pvc-fefae570-f973-44d3-8f4e-7c56783bc244\" (UniqueName: \"kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000002-2ad307d9-61e8-11ec-af47-5a3527526f98\") pod \"vault-0\" (UID: \"42b90945-edb6-4f60-98d8-8927fbbad9e4\") "
```

### other kubelet symptoms:

I don't know if this is related, but there is a very high chance that kubelet becomes a zombie process when I run `systemctl restart kubelet` or `systemctl stop kubelet`.

### some osd prepare pods getting stuck, and sometimes after a reboot they start up magically (stuck == running indefinitely)

Rarely, after a cluster reboot (yes, I sometimes reboot all 6 nodes in order to get this fixed with magic), all the osd prepare jobs complete without issues; usually, though, some of them hang like this:

```
rook-ceph-osd-prepare-node1--1-ttqqm   0/1   Completed   0   86m
rook-ceph-osd-prepare-node2--1-6wq8l   1/1   Running     0   86m
rook-ceph-osd-prepare-node3--1-z6snr   0/1   Completed   0   86m
rook-ceph-osd-prepare-node4--1-bg8sv   1/1   Running     0   86m
rook-ceph-osd-prepare-node5--1-886dc   1/1   Running     0   86m
rook-ceph-osd-prepare-node6--1-x7p4f   0/1   Completed   0   86m
```

When they are stuck, they are all stuck at the exact same command:

```
2022-04-09 07:17:35.460753 D | exec: Running command: udevadm info --query=property /dev/sda2
2022-04-09 07:17:35.467824 D | exec: Running command: lsblk /dev/sda2 --bytes --nodeps --pairs --paths --output SIZE,ROTA,RO,TYPE,PKNAME,NAME,KNAME
2022-04-09 07:17:35.470148 D | exec: Running command: ceph-volume inventory --format json /dev/sda2
```

where /dev/sda2 is not even included in the `deviceFilter` of the CephCluster CR, because it is on my root device:

```yaml
storage:
  deviceFilter: ^sd[b-z]
```

When I check the processes on the affected node, the ceph-volume process is hung:

```bash
# ps -ef | grep ceph-volume
root       25444   25304  0 03:17 ?        00:00:00 /usr/libexec/platform-python -s /usr/sbin/ceph-volume inventory --format json /dev/sda2
root      140658  140572  0 04:45 pts/0    00:00:00 grep --color=auto ceph-volume
```

When I execute the ceph-volume inventory command manually inside the csi-rbdplugin pods, my manual command hangs as well. Also, /dev/sda2 is a less-than-1K partition:

```bash
# lsblk --all
sda                         8:0    0 278.9G  0 disk
|-sda1                      8:1    0   512M  0 part
|-sda2                      8:2    0     1K  0 part
`-sda5                      8:5    0 278.4G  0 part
  |-vgodc--dbkb02p-root
  |                       253:2    0 277.4G  0 lvm  /
  `-vgodc--dbkb02p-swap_1
                          253:3    0   980M  0 lvm
```

### rbd image bouncing between nodes:

Sometimes when a pod that was originally scheduled on node4 gets rescheduled to node6, the rbd image stays mapped on node4. After manual intervention (umount/unmap) on node4, it gets mapped on node6, and then this 'operation already exists' error occurs. And after some time, the rbd image slips back to node4 again, even though the pod has already been rescheduled to node6. I see this when checking `rbd device list` in the csi-rbdplugin daemonset pods, roughly as sketched below.
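For reference, the manual intervention on node4 looks roughly like this; the device name and mountpoint are placeholders, and I only reach for `-o force` when the normal unmap refuses and I am sure nothing is still writing to the image:

```bash
# inside the csi-rbdplugin pod on the old node: which images are still mapped, and to which /dev/rbdX?
rbd device list
# on node4 itself: tear down the stale mounts for that device, then unmap it
mount | grep /dev/rbd0          # /dev/rbd0 is a placeholder, use the device from the list above
umount <stale mountpoint>       # repeat for every mountpoint of the device
rbd unmap /dev/rbd0             # or, as a last resort: rbd unmap -o force /dev/rbd0
```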
### kubelet log spamming:

```
Apr 9 04:59:21 odc-dbkb06p kubelet[1980]: I0409 04:59:21.920439    1980 reconciler.go:254] "Starting operationExecutor.MountVolume for volume \"kube-api-access-mfxmv\" (UniqueName: \"kubernetes.io/projected/965032ee-b012-4203-9fff-c7ad0a5ee899-kube-api-access-mfxmv\") pod \"cnx-css-api-scheduling-5c4bb97955-qlwd6\" (UID: \"965032ee-b012-4203-9fff-c7ad0a5ee899\") Volume is already mounted to pod, but remount was requested."
Apr 9 04:59:21 odc-dbkb06p kubelet[1980]: I0409 04:59:21.920562    1980 reconciler.go:254] "Starting operationExecutor.MountVolume for volume \"pvc-e652fa97-b8f6-42f5-88e0-e8d5c356b9ce\" (UniqueName: \"kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000002-00f4764b-a0c6-11ec-a1ba-da2817286f1c\") pod \"my-cluster-kafka-1\" (UID: \"e1eebd60-3525-4dde-9065-6eda19e4f827\") "
Apr 9 04:59:21 odc-dbkb06p kubelet[1980]: I0409 04:59:21.920644    1980 reconciler.go:254] "Starting operationExecutor.MountVolume for volume \"pvc-8e73afef-f889-486e-aa56-6e83daec8175\" (UniqueName: \"kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000002-e56a9aa1-a0c5-11ec-a1ba-da2817286f1c\") pod \"my-cluster-zookeeper-1\" (UID: \"74c59474-c9fa-4cbf-ab07-d080f5357b6f\") "
Apr 9 04:59:22 odc-dbkb06p kubelet[1980]: I0409 04:59:22.021233    1980 reconciler.go:254] "Starting operationExecutor.MountVolume for volume \"pvc-fefae570-f973-44d3-8f4e-7c56783bc244\" (UniqueName: \"kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000002-2ad307d9-61e8-11ec-af47-5a3527526f98\") pod \"vault-0\" (UID: \"42b90945-edb6-4f60-98d8-8927fbbad9e4\") "
Apr 9 04:59:22 odc-dbkb06p kubelet[1980]: I0409 04:59:22.021304    1980 reconciler.go:254] "Starting operationExecutor.MountVolume for volume \"pvc-404a7757-6533-4d11-bfce-091fa28cff48\" (UniqueName: \"kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000002-77f676f2-886a-11ec-a17d-de98ef639746\") pod \"vault-0\" (UID: \"42b90945-edb6-4f60-98d8-8927fbbad9e4\") "
```

kubelet spams hundreds of lines like this every second, as if it is stuck in an infinite loop.

### kubelet orphaned pods after the reboot

I saw some GitHub issues claiming this is related. I had some orphaned pods too after the forced reboot, but removing them did not help with the issue.

### on target node, some rbd processes are stuck:

```bash
# kubectl exec -it csi-rbdplugin-7jrvk -c csi-rbdplugin -- /bin/bash
[root@odc-dbkb06p /]# ps -ef | grep rbd
root        2484    2414  0 07:04 ?        00:00:00 /csi-node-driver-registrar --v=0 --csi-address=/csi/csi.sock --kubelet-registration-path=/var/lib/kubelet/plugins/rook-ceph.rbd.csi.ceph.com/csi.sock
root        2549    2414  0 07:04 ?        00:00:15 /usr/local/bin/cephcsi --nodeid=node6 --endpoint=unix:///csi/csi.sock --v=0 --type=rbd --nodeserver=true --drivername=rook-ceph.rbd.csi.ceph.com --pidlimit=-1 --metricsport=9092 --metricspath=/metrics --enablegrpcmetrics=true --stagingpath=/var/lib/kubelet/plugins/kubernetes.io/csi/pv/
root        2685       2  0 07:04 ?        00:00:00 [rbd]
root        9793    8510  0 07:07 ?        00:00:06 /usr/local/bin/cephcsi --nodeid=node6 --endpoint=unix:///csi/csi-provisioner.sock --v=0 --type=rbd --controllerserver=true --drivername=rook-ceph.rbd.csi.ceph.com --pidlimit=-1 --metricsport=9092 --metricspath=/metrics --enablegrpcmetrics=true
root       29539       2  0 07:16 ?        00:00:00 [rbd0-tasks]
root       29540       2  0 07:16 ?        00:00:00 [rbd1-tasks]
root       29541       2  0 07:16 ?        00:00:00 [rbd2-tasks]
root       29558    2549  0 07:16 ?        00:00:00 fsck -a /dev/rbd2
root       29560   29558  0 07:16 ?        00:00:00 fsck.ext4 -a /dev/rbd2
root       29566       2  0 07:16 ?        00:00:00 [jbd2/rbd0-8]
root       29571       2  0 07:16 ?        00:00:00 [jbd2/rbd1-8]
root       29955       2  0 07:17 ?        00:00:00 [rbd3-tasks]
root       29957       2  0 07:17 ?        00:00:00 [rbd4-tasks]
root       29959       2  0 07:17 ?        00:00:00 [rbd5-tasks]
root       29986    2549  0 07:17 ?        00:00:00 fsck -a /dev/rbd4
root       29987   29986  0 07:17 ?        00:00:00 fsck.ext4 -a /dev/rbd4
root       30008       2  0 07:17 ?        00:00:00 [rbd6-tasks]
root       30048       2  0 07:17 ?        00:00:00 [rbd7-tasks]
root       30063       2  0 07:17 ?        00:00:00 [jbd2/rbd3-8]
root       30103       2  0 07:17 ?        00:00:00 [jbd2/rbd7-8]
root       30294       2  0 07:17 ?        00:00:00 [rbd8-tasks]
root       30429       2  0 07:17 ?        00:00:00 [rbd9-tasks]
root       30504       2  0 07:17 ?        00:00:00 [jbd2/rbd9-8]
root       30509    2549  0 07:17 ?        00:00:00 fsck -a /dev/rbd5
root       30511    2549  0 07:17 ?        00:00:00 fsck -a /dev/rbd6
root       30518   30509  0 07:17 ?        00:00:00 fsck.ext4 -a /dev/rbd5
root       30519   30511  0 07:17 ?        00:00:00 fsck.ext4 -a /dev/rbd6
root       30568       2  0 07:17 ?        00:00:00 [jbd2/rbd8-8]
root      464309  463976  0 09:13 pts/1    00:00:00 grep --color=auto rbd
```

### What other debug steps I have done:

1. Pod network connectivity to the mons is working fine (checked from both csi-rbdplugin and csi-rbdplugin-provisioner pods).
2. In the hope that magic happens, I upgraded rook from v1.5.3 to v1.8.8 and ceph from v15.2.7 to v16.2.7-20220317. I am pretty sure all the errors remained unchanged. Side note: while the operator was upgrading the osd deployments, every osd was reported as not ok-to-stop, even though when I checked from the ceph toolbox they all had 3 replicas and were all ok to stop (see the commands after this list), similar to what was described in https://github.com/rook/rook/issues/5243#issuecomment-618899510, so I had to delete the osd deployments manually to force the rollout.
3. The firewall is not active on these nodes.
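Regarding point 2, the ok-to-stop check and the forced rollout looked roughly like this (osd id 0 and the deployment name are just examples from my layout; repeat per OSD):

```bash
# from the rook-ceph-tools pod: the cluster itself says the OSD can be stopped
ceph osd ok-to-stop 0
# what I ended up doing to push the upgrade through: delete the old deployment
# and let the rook operator recreate it with the new ceph image on its next reconcile
kubectl -n rook-ceph delete deployment rook-ceph-osd-0
```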