# CephCSI part 2: Troubleshooting

1. Verify Ceph health: proceed only if it is **"health_ok"**.
2. Common errors:
    * `GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0009-rook-ceph-0000000000000002-186f0c50-7ee2-4ef2-a937-37a14998b1f6 already exists`:
        * Check Ceph health.
        * Check for slow ops/IOPS in the cluster.
        * Check network connectivity from that pod to the Ceph cluster: `$ ceph -s --id=csi-rbd-node -m=10.111.136.166:6789 --key=AQDpIQhg+v83EhAAgLboWIbl+FL/nThJzoI3Fg==`
        * Collect a must-gather (MG) and restart that pod.
3. Provisioner-level errors (Create/Delete/Resize/ControllerPublish of a Volume/Snapshot):
    * Execute `oc describe pvc/vs <name>`.
    * Find the corresponding errors in the pod with label `app=csi-rbdplugin-provisioner -c csi-rbdplugin` or `app=csi-cephfsplugin-provisioner -c csi-cephfsplugin` (see the provisioner log-collection sketch at the end of this page).
    * If no error is found there, check the csi-[provisioner|snapshotter|resizer] sidecar or the snapshot-controller, as the case may be, and proceed (search for the error among known issues in OCP storage). [You may need to set `CSI_SIDECAR_LOG_LEVEL: "5"` in `rook-ceph-operator-config`.]
    * The action depends on the error. Use the error text to search for existing bugs/issues.
    * If none are found, report a BZ with the MG attached.
4. Node-level errors:
    * Execute `oc describe pod <name>` for the pod that is having trouble mounting/unmounting, and determine the node where it is scheduled.
    * Find the corresponding errors in the CSI pod on the same node with label `app=csi-rbdplugin -c csi-rbdplugin` or `app=csi-cephfsplugin -c csi-cephfsplugin` (see the node-plugin log-collection sketch at the end of this page).
    * If no error is found there, it is coming from the kubelet (search for the error among known issues in OCP storage). Common known issues: SELinux relabeling, fsGroup, CSIDriver not found in the list.
    * The action depends on the error. Use the error text to search for existing bugs/issues.
    * If none are found, report a BZ with the MG attached.
5. Over-provisioning of the Ceph cluster:
    * RBD flattening and CephFS cloning may increase storage consumption even when there is no application activity.
    * Use `ceph rbd task ls -p <pool_name>` to check for flattening activity.
    * Use `ceph fs subvolumegroup info <vol_name> <group_name>` to check whether any clone is in-progress/pending.
    * Both of the above are captured in the must-gather.
6. QnA

---

Steps to troubleshoot CSI-Addons:

1. Obtain `-o yaml` of the corresponding CR.
2. Check the logs in the csi-addons-controller-manager pod.
3. Determine the CSI pod to which the request was sent.
4. Check the logs on the corresponding CSI pod. (A sketch of this flow appears below.)
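---

A minimal sketch for step 3 (provisioner-level errors), tracing a stuck PVC to the provisioner sidecar logs. It assumes an ODF/Rook deployment with the CSI pods in the `openshift-storage` namespace; the PVC name and namespaces are placeholders.

```sh
APP_NS=myapp          # hypothetical namespace of the PVC
PVC=mypvc             # hypothetical PVC name
CSI_NS=openshift-storage

# Events on the PVC usually name the failing operation.
oc -n "$APP_NS" describe pvc "$PVC"

# The PV name (pvc-<uid>) embeds the PVC UID, so the UID is a good log filter.
PVC_UID=$(oc -n "$APP_NS" get pvc "$PVC" -o jsonpath='{.metadata.uid}')

# Search the RBD provisioner sidecar logs for that UID
# (swap in the cephfs labels/containers for CephFS volumes).
oc -n "$CSI_NS" logs -l app=csi-rbdplugin-provisioner -c csi-rbdplugin --tail=-1 | grep -i "$PVC_UID"
```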
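A minimal sketch for step 4 (node-level errors), locating the node-plugin logs for a pod stuck mounting or unmounting a volume. The pod name and namespaces are placeholders.

```sh
APP_NS=myapp          # hypothetical namespace of the affected pod
POD=mypod             # hypothetical pod name
CSI_NS=openshift-storage

# Mount/attach failures show up as events on the pod.
oc -n "$APP_NS" describe pod "$POD"

# Find the node the pod is scheduled on.
NODE=$(oc -n "$APP_NS" get pod "$POD" -o jsonpath='{.spec.nodeName}')

# Find the RBD node-plugin pod on that node and read its logs
# (use app=csi-cephfsplugin -c csi-cephfsplugin for CephFS).
oc -n "$CSI_NS" get pod -l app=csi-rbdplugin --field-selector spec.nodeName="$NODE"
oc -n "$CSI_NS" logs <csi-rbdplugin-pod-from-above> -c csi-rbdplugin
```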
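A minimal sketch of the CSI-Addons flow above. The namespace and the `csi-addons` sidecar container name are assumptions; adjust them to match your deployment.

```sh
NS=openshift-storage

# 1. Dump the CR; its status/conditions usually describe the failure.
oc -n "$NS" get <cr-kind> <cr-name> -o yaml

# 2. The controller-manager logs show which CSI pod the request was sent to.
oc -n "$NS" logs deploy/csi-addons-controller-manager

# 3/4. Read the csi-addons sidecar logs in that CSI pod
# (container name assumed to be csi-addons).
oc -n "$NS" logs <csi-pod> -c csi-addons
```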