Pod startup failure on alternate nodes during node shutdown in vSphere environments due to VM and disk snapshots

Solution Unverified - Updated -

Issue

  • In an OpenShift cluster deployed on vSphere, when a node fails, its Pods are rescheduled to a different node but fail to start there. The events show that the volume is still attached to the original node, so it cannot be attached to the new node and the Pod stays in the Init state.
$ oc get pod -n openshift-monitoring -owide
NAME                                                    READY   STATUS     RESTARTS   AGE   IP             NODE         NOMINATED NODE   READINESS GATES
prometheus-k8s-0                                        0/6     Init:0/1   0          1h    <none>         node03   <none>           <none>

$ oc get events -n openshift-monitoring
LAST SEEN   TYPE      REASON                    OBJECT                                                       MESSAGE
1h34m       Warning   FailedAttachVolume        pod/prometheus-k8s-0                                         Multi-Attach error for volume "prometheus-0-pv" Volume is already exclusively attached to one node and can't be attached to another
1h32m       Warning   FailedAttachVolume        pod/prometheus-k8s-0                                         AttachVolume.Attach failed for volume "prometheus-0-pv" : rpc error: code = Internal desc = failed to attach disk: "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" with node: "yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy" err failed to attach cns volume: "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" to node vm: "VirtualMachine:vm-01 [VirtualCenterHost: node01.example.com, UUID: yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy, Datacenter: Datacenter [Datacenter: Datacenter:datacenter-01, VirtualCenterHost: node01.example.com]]". fault: "(*types.LocalizedMethodFault)(0xc000bbf900)({\n DynamicData: (types.DynamicData) {\n },\n Fault: (*types.ResourceInUse)(0xc001007b40)({\n  VimFault: (types.VimFault) {\n   MethodFault: (types.MethodFault) {\n    FaultCause: (*types.LocalizedMethodFault)(<nil>),\n    FaultMessage: ([]types.LocalizableMessage) <nil>\n   }\n  },\n  Type: (string) \"\",\n  Name: (string) (len=6) \"volume\"\n }),\n LocalizedMessage: (string) (len=32) \"The resource 'volume' is in use.\"\n})\n". opId: "49301d98"
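
To confirm which node still holds the volume, the cluster's VolumeAttachment objects can be listed and filtered by PV name. The following is a minimal sketch, not a verified procedure from this solution: `pv_attachments` is a hypothetical helper name, and the PV name is assumed to be the one reported in the events above.

```shell
# Hypothetical helper: print "<node> <attached>" for every VolumeAttachment
# that references the given PersistentVolume.
# Assumes `oc` is logged in with permission to read volumeattachments.
pv_attachments() {
  pv="$1"
  oc get volumeattachment --no-headers \
    -o custom-columns=NAME:.metadata.name,PV:.spec.source.persistentVolumeName,NODE:.spec.nodeName,ATTACHED:.status.attached \
  | awk -v pv="$pv" '$2 == pv { print $3, $4 }'
}

# Usage (PV name taken from the FailedAttachVolume event):
#   pv_attachments prometheus-0-pv
```

If the output still lists the original node with `true` while the Pod is scheduled elsewhere, that matches the Multi-Attach error shown in the events.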

Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 4
  • vSphere
