OCS/ODF: Linux shutdown / reboot hangs due to CephFS kernel client, "libceph: connect (1)192.168.xx.76:6801 error -101"

Solution Verified

Issue

Linux shutdown / reboot of a node hangs due to the CephFS kernel client, and the kernel log repeats messages such as "libceph: connect (1)192.168.xx.76:6801 error -101".

This is a snippet of dmesg -T from the node stuck in Linux shutdown / reboot:

[Thu Apr 27 17:10:56 2023] libceph: osd9 (1)192.168.xx.51:6801 connect error
[Thu Apr 27 17:10:56 2023] libceph: connect (1)192.168.xx.62:6801 error -101
[Thu Apr 27 17:10:56 2023] libceph: osd0 (1)192.168.xx.62:6801 connect error
[Thu Apr 27 17:10:56 2023] libceph: connect (1)192.168.xx.72:6801 error -101
[Thu Apr 27 17:10:56 2023] libceph: osd16 (1)192.168.xx.72:6801 connect error
[Thu Apr 27 17:10:57 2023] libceph: connect (1)192.168.xx.46:6801 error -101
[Thu Apr 27 17:10:57 2023] libceph: osd2 (1)192.168.xx.46:6801 connect error
[Thu Apr 27 17:10:57 2023] libceph: connect (1)192.168.xx.76:6801 error -101
[Thu Apr 27 17:10:57 2023] libceph: osd22 (1)192.168.xx.76:6801 connect error
[Thu Apr 27 17:10:58 2023] libceph: connect (1)192.168.xx.77:6801 error -101
[Thu Apr 27 17:10:58 2023] libceph: osd3 (1)192.168.xx.77:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.xx.74:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd14 (1)192.168.xx.74:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.xx.52:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd12 (1)192.168.xx.52:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.xx.38:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd18 (1)192.168.xx.38:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.xx.49:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd8 (1)192.168.xx.49:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.xx.56:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd23 (1)192.168.xx.56:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.xx.66:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd6 (1)192.168.xx.66:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.xx.44:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd7 (1)192.168.xx.44:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.xx.11:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd19 (1)192.168.xx.11:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.xx.3:6801 error -101
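The number after "error" in these messages is a negated Linux errno value; 101 is ENETUNREACH (network is unreachable). As a minimal sketch for triaging a longer log, the following Python snippet counts the connect errors by errno name and lists the OSDs that reported a connect error. The function name summarize and the small errno table are illustrative assumptions, not part of any Ceph tooling, and the regular expressions are based only on the line format shown above.

```python
import re
from collections import Counter

# Linux errno values, hardcoded so the result does not depend on the host OS.
LINUX_ERRNO = {
    101: "ENETUNREACH",   # network is unreachable
    110: "ETIMEDOUT",     # connection timed out
    111: "ECONNREFUSED",  # connection refused
    113: "EHOSTUNREACH",  # no route to host
}

# Patterns assume the two libceph line shapes seen in the dmesg snippet.
CONNECT_RE = re.compile(
    r"libceph: connect \(\d+\)(?P<addr>\S+):(?P<port>\d+) error -(?P<errno>\d+)"
)
OSD_RE = re.compile(
    r"libceph: (?P<osd>osd\d+) \(\d+\)(?P<addr>\S+):(?P<port>\d+) connect error"
)


def summarize(lines):
    """Return (error counts by errno name, mapping of OSD name to address)."""
    errors = Counter()
    osds = {}
    for line in lines:
        m = CONNECT_RE.search(line)
        if m:
            code = int(m.group("errno"))
            # Fall back to the raw number for codes not in the table.
            errors[LINUX_ERRNO.get(code, str(code))] += 1
            continue
        m = OSD_RE.search(line)
        if m:
            osds[m.group("osd")] = m.group("addr")
    return errors, osds
```

For example, the output of dmesg -T can be saved to a file and passed in as a list of lines; masked addresses such as 192.168.xx.62 are kept as-is.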

Steps to reproduce, from the upstream tracker:

1)  Mount CephFS on a client.
2)  Shut down a node containing an OSD+MON, or otherwise make that node unreachable.
3)  While the client is accessing the mount (a simple ls on a directory), reboot the CephFS client node.
4)  Observe that the client node stays stuck until it can reach the Ceph nodes again or a hard reset is performed.
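Since the hang only affects nodes that hold a CephFS kernel mount, it can help to check for such mounts before rebooting a node. The sketch below assumes the standard /proc/mounts layout (device, mount point, filesystem type, options, ...) and that CephFS kernel client mounts appear with filesystem type "ceph"; the function name cephfs_kernel_mounts is hypothetical.

```python
def cephfs_kernel_mounts(mounts_text):
    """Return the mount points whose filesystem type is 'ceph'.

    mounts_text is the content of /proc/mounts. Each line has
    whitespace-separated fields: device, mount point, fstype, options, ...
    """
    mounts = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Field 2 is the filesystem type; skip short or malformed lines.
        if len(fields) >= 3 and fields[2] == "ceph":
            mounts.append(fields[1])
    return mounts
```

On a live node, the input can be read with open("/proc/mounts").read(). An empty result means no CephFS kernel mounts were found by this check; it does not cover ceph-fuse mounts, which use a different filesystem type.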

Environment

Red Hat OpenShift Container Platform (OCP) 4.x
Red Hat OpenShift Container Storage (OCS) 4.x
Red Hat OpenShift Data Foundation (ODF) 4.x
Red Hat Ceph Storage (RHCS) 4.x
Red Hat Ceph Storage (RHCS) 5.x
Red Hat Ceph Storage (RHCS) 6.x
Red Hat Ceph Storage (RHCS) 7.x
Ceph File System (CephFS)
