Why are pods mounting volumes with a huge number of files failing to start after upgrading to OpenShift Data Foundation 4.12?

Solution In Progress

Issue

  • After upgrading from ODF 4.11 to 4.12, pods attaching volumes that contain a very large number of files (on the order of millions) fail to start with a timeout. Some sample events are shown below. In this example, the affected pod is named simple-app-67dfcff4c8-v7gxv; note the "timed out waiting for the condition" messages:

    $ oc describe pod simple-app-67dfcff4c8-v7gxv
    
        Type     Reason                  Age                    From                                                   Message
        ----     ------                  ----                   ----                                                   -------
        Normal   Scheduled               <unknown>                                                                     Successfully assigned test/simple-app-67dfcff4c8-v7gxv to worker-2.example.com
        Normal   SuccessfulAttachVolume  30m                    attachdetach-controller                                AttachVolume.Attach succeeded for volume "pvc-0ea2d69a-8e9d-41b2-bfa5-85d0ede6211b"
        Warning  FailedMount             23m (x2 over 25m)      kubelet, worker-2.example.com  Unable to attach or mount volumes: unmounted volumes=[volume-wz7wf], unattached volumes=[volume-wz7wf kube-api-access-hmlzb]: timed out waiting for the condition
        Warning  FailedMount             19m (x3 over 28m)      kubelet, worker-2.example.com  Unable to attach or mount volumes: unmounted volumes=[volume-wz7wf], unattached volumes=[kube-api-access-hmlzb volume-wz7wf]: timed out waiting for the condition
        Warning  FailedMount             4m32s (x5 over 15m)    kubelet, worker-2.example.com  Unable to attach or mount volumes: unmounted volumes=[volume-wz7wf], unattached volumes=[volume-wz7wf kube-api-access-hmlzb]: timed out waiting for the condition
        Warning  FailedMount             2m15s (x2 over 6m46s)  kubelet, worker-2.example.com  Unable to attach or mount volumes: unmounted volumes=[volume-wz7wf], unattached volumes=[kube-api-access-hmlzb volume-wz7wf]: timed out waiting for the condition
    
  • The volume is correctly mounted on the node hosting the pod. In this example, it is a CephFS volume:

    $ oc debug node/worker-2.example.com
    # chroot /host
    sh-4.4# mount -l | grep cephfs
      <mon-ip-1>:6789,<mon-ip-2>:6789,<mon-ip-3>:6789,<mon-ip-4>:6789,<mon-ip-5>:6789:/volumes/csi/csi-vol-5c8b10be-6760-11ee-ada8-0a580a800215/c7feed46-8956-45c2-b71f-906ae4ad4718 on /host/var/lib/kubelet/plugins/kubernetes.io/csi/openshift-storage.cephfs.csi.ceph.com/c7dcb34afe060d6cd58e994fc5c10868624970393d6415e2c085b8c6630532b0/globalmount type ceph (rw,relatime,seclabel,name=csi-cephfs-node,secret=<hidden>,acl,mds_namespace=my-filesystem)
      <mon-ip-1>:6789,<mon-ip-2>:6789,<mon-ip-3>:6789,<mon-ip-4>:6789,<mon-ip-5>:6789:/volumes/csi/csi-vol-5c8b10be-6760-11ee-ada8-0a580a800215/c7feed46-8956-45c2-b71f-906ae4ad4718 on /host/var/lib/kubelet/pods/029e640c-2db9-49dd-ae9e-5215de6b11f7/volumes/kubernetes.io~csi/pvc-0ea2d69a-8e9d-41b2-bfa5-85d0ede6211b/mount type ceph (rw,relatime,seclabel,name=csi-cephfs-node,secret=<hidden>,acl,mds_namespace=my-filesystem)
    

    This mount point is also writable (a verification sketch is included below).

  • Why is this issue occurring? How can this problem be prevented?
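
  • To confirm from the node itself that the volume really contains a very large number of files and that the mount is writable, a debug shell on the node can be used. The sketch below reuses the node name, pod UID, and PVC name from the example output above; adjust them to the affected environment before running it:

    $ oc debug node/worker-2.example.com
    # chroot /host
    sh-4.4# VOL=/var/lib/kubelet/pods/029e640c-2db9-49dd-ae9e-5215de6b11f7/volumes/kubernetes.io~csi/pvc-0ea2d69a-8e9d-41b2-bfa5-85d0ede6211b/mount
    sh-4.4# # Counting the entries can itself take a long time on a volume with millions of files
    sh-4.4# find "$VOL" | wc -l
    sh-4.4# # Verify the mount is writable by creating and then removing a test file
    sh-4.4# touch "$VOL/.write-test" && rm "$VOL/.write-test"

    If both checks succeed while the pod still reports "timed out waiting for the condition", the behavior matches the issue described in this article.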

Environment

  • Red Hat OpenShift Data Foundation, versions:
    • v4.12
    • v4.13
