Azure Disk CSI driver assigns duplicate LUNs, causing data loss
Environment
- Azure Red Hat OpenShift (ARO)
- 4
Issue
- After an ARO node reboot, the contents of Azure Disk volumes, including the GID settings, are swapped between the PVs of two pods.
Resolution
- The fixed driver is included in ARO 4.13.z.
- For ARO 4.12.z, the fix is being backported in OCPBUGS-22832 and is planned for 4.12.44.
Root Cause
- The issue is caused by a race condition bug in the Azure Disk CSI driver.
- The bug is fixed by https://github.com/kubernetes-sigs/cloud-provider-azure/pull/2805, which is included in release v1.25.0 of kubernetes-sigs/azuredisk-csi-driver.
Diagnostic Steps
- In the CSI controller logs, the same LUN number appears more than once on the same node. In the example below, pods c1 and a1 both use LUN 1 on node c:
node c
pod c0 LUN 0
pod c1 LUN 1 <- trouble
pod a1 LUN 1 <- trouble
pod c2 LUN 2
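
The check above can be sketched in Python once the node, pod, and LUN values have been collected from the CSI controller logs. This is only an illustration: the log line format is not shown here, so the sketch assumes the values have already been transcribed by hand into (node, pod, LUN) tuples, and the function name find_duplicate_luns is hypothetical.

```python
from collections import defaultdict


def find_duplicate_luns(attachments):
    """Return {(node, lun): [pods]} for each LUN used by more than one pod on a node."""
    by_node_lun = defaultdict(list)
    for node, pod, lun in attachments:
        by_node_lun[(node, lun)].append(pod)
    # Keep only the (node, LUN) pairs that are shared by several pods.
    return {key: pods for key, pods in by_node_lun.items() if len(pods) > 1}


# Assumption: records transcribed by hand from the example above.
attachments = [
    ("c", "c0", 0),
    ("c", "c1", 1),
    ("c", "a1", 1),
    ("c", "c2", 2),
]

duplicates = find_duplicate_luns(attachments)
for (node, lun), pods in sorted(duplicates.items()):
    print(f"node {node}: LUN {lun} shared by {', '.join(pods)}")
```

With the example data, the sketch reports LUN 1 on node c as shared by pods c1 and a1. An empty result means no duplicate LUN was found in the records supplied.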
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.