Ceph - PG query hangs and doesn't return

Solution Verified

Environment

  • Red Hat Ceph Storage

Issue

  • One of the placement groups (PGs) is incomplete, and running ceph pg <pg id> query hangs without returning.
  • How can the PG be queried so that troubleshooting can continue?

Resolution

  • When a PG has not completed peering, the pg query command can hang. There is a short window during OSD startup in which the PG can still be queried.
  • Because that window is narrow, a small shell loop helps to catch it.
  • In a terminal (terminal 1), start the following command loop:
while true
do
   ceph pg <pg id> query >> /tmp/query.txt
done
  • Open a new terminal (terminal 2) and restart the primary OSD of that PG with systemctl restart ceph-osd@<OSD_number>.
  • In terminal 1, where the loop is running continuously, press Ctrl-C about once per second. This interrupts the command if it hangs, and the loop immediately issues a new one.
  • The procedure has succeeded once /tmp/query.txt contains a successful pg query output; you can check the file from terminal 2.
  • Once this is done, you can simply close the first terminal and review the PG query output to continue troubleshooting.
  • You might also be able to use the timeout command to limit pg query execution, but this has not yet been tested:
while ! timeout 1 ceph pg <pg id> query; do echo -n .; done
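
The timeout idea above can be extended so that the output is saved to a file and the loop gives up after a fixed number of attempts. The following is a minimal sketch only, and like the one-liner above it has not been tested against a live cluster. The helper name retry_with_timeout, the attempt limit, and the temporary file suffix are assumptions made for illustration; they are not part of Ceph. The primary OSD still has to be restarted from a second terminal while the helper is running.

```shell
#!/usr/bin/env bash
# Sketch: rerun a command under a time limit until it succeeds once.
# retry_with_timeout is a hypothetical helper name, not a Ceph command.
#
# Arguments: <seconds per attempt> <max attempts> <output file> <command...>
retry_with_timeout() {
    local limit="$1" max_attempts="$2" outfile="$3"
    shift 3
    local attempt
    for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
        # timeout kills the command if it runs longer than $limit seconds.
        # Output goes to a temporary file first, so partial output from a
        # hung attempt never ends up in $outfile.
        if timeout "$limit" "$@" > "$outfile.tmp" 2>/dev/null; then
            mv "$outfile.tmp" "$outfile"
            echo "succeeded on attempt $attempt" >&2
            return 0
        fi
        printf '.' >&2   # progress marker for each failed or hung attempt
    done
    rm -f "$outfile.tmp"
    return 1
}

# Example call (replace <pg id> with the real PG ID before running):
# retry_with_timeout 1 600 /tmp/query.txt ceph pg <pg id> query
```

Unlike the manual loop, this variant overwrites the output file instead of appending to it, so /tmp/query.txt holds exactly one complete query result when the helper returns 0.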

Root Cause

  • Querying the placement group can sometimes hang if it is stuck in peering.

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
