The GFS2 filesystem took a long time to mount or to replay the journal on a fenced cluster node in RHEL 5 or 6

Solution Verified

Issue

  • The GFS2 filesystem took a long time to replay the journal on a fenced cluster node:
Nov  9 20:46:45 node1 clurgmgrd[15745]: <info> Waiting for node #2 to be fenced 
Nov  9 20:51:40 node1 fenced[9253]: node2 not a cluster member after 300 sec post_fail_delay
Nov  9 20:51:40 node1 fenced[9253]: fencing node "node2"
Nov  9 20:51:59 node1 fenced[9253]: fence "node2" success
Nov  9 20:51:59 node1 kernel: GFS2: fsid=prod:gfs1.0: jid=1: Trying to acquire journal lock...
....
Nov  9 21:41:27 node1 kernel: GFS2: fsid=prod:gfsprod01.0: jid=1: Looking at journal...
Nov  9 21:41:27 node1 kernel: GFS2: fsid=prod:gfsprod01.0: jid=1: Acquiring the transaction lock...
Nov  9 21:41:27 node1 kernel: GFS2: fsid=prod:gfsprod01.0: jid=1: Replaying journal...
Nov  9 21:41:27 node1 kernel: GFS2: fsid=prod:gfsprod01.0: jid=1: Replayed 364 of 365 blocks
Nov  9 21:41:27 node1 kernel: GFS2: fsid=prod:gfsprod01.0: jid=1: Found 1 revoke tags
Nov  9 21:41:27 node1 kernel: GFS2: fsid=prod:gfsprod01.0: jid=1: Journal replayed in 1s
Nov  9 21:41:27 node1 kernel: GFS2: fsid=prod:gfsprod01.0: jid=1: Done
  • After a node was fenced, we could not access the GFS2 file system for a long time
  • After fencing, it took an excessive amount of time for the file system to become available
  • The node was 'evicted' and fenced moments later, but its services were not restarted on any other node until roughly 20 minutes later.
  • When a cluster node mounts a GFS2 filesystem with mount.gfs2, the mount takes an unusually long time to complete. The backtrace of the mount.gfs2 process shows it waiting on DLM:
Jun 16 23:06:03 node42 kernel: GFS2: fsid=: Trying to join cluster "lock_dlm", "Cluster5:SpaceTravel1"
Jun 16 23:06:04 node42 kernel: GFS2: fsid=Cluster5:SpaceTravel1.7: Joined cluster. Now mounting FS...
[....]
Jun 16 23:09:50 node42 kernel: INFO: task mount.gfs2:4427 blocked for more than 120 seconds.
Jun 16 23:09:50 node42 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 16 23:09:50 node42 kernel: mount.gfs2    D ffffffff801546d1     0  4427      1          5414  5131 (NOTLB)
Jun 16 23:09:50 node42 kernel:  ffff811077c0d8e8 0000000000000082 ffff811077c0d8e8 ffff81107f05a3a0
Jun 16 23:09:50 node42 kernel:  0000000000000010 0000000000000008 ffff81107a665860 ffff81189c2557e0
Jun 16 23:09:50 node42 kernel:  000018e6504c9fb4 0000000000005fca ffff81107a665a48 000000097fffffff
Jun 16 23:09:50 node42 kernel: Call Trace:
Jun 16 23:09:50 node42 kernel:  [<ffffffff8006467c>] __down_read+0x7a/0x92
Jun 16 23:09:50 node42 kernel:  [<ffffffff887d3c9c>] :dlm:dlm_lock+0x4b/0x129
Jun 16 23:09:50 node42 kernel:  [<ffffffff8887885a>] :lock_dlm:gdlm_do_lock+0x6c/0xd7
Jun 16 23:09:50 node42 kernel:  [<ffffffff888784c0>] :lock_dlm:gdlm_ast+0x0/0x32e
Jun 16 23:09:50 node42 kernel:  [<ffffffff88878a1c>] :lock_dlm:gdlm_bast+0x0/0xdd
Jun 16 23:09:50 node42 kernel:  [<ffffffff887fe1a9>] :gfs2:do_xmote+0x161/0x1c1
Jun 16 23:09:50 node42 kernel:  [<ffffffff887fe685>] :gfs2:gfs2_glock_nq+0x264/0x28f
Jun 16 23:09:50 node42 kernel:  [<ffffffff887fe7e7>] :gfs2:gfs2_glock_nq_num+0x43/0x68
Jun 16 23:09:50 node42 kernel:  [<ffffffff88809b5f>] :gfs2:init_locking+0x2e/0x14b
Jun 16 23:09:50 node42 kernel:  [<ffffffff8880a848>] :gfs2:fill_super+0x51c/0xab8
Jun 16 23:09:50 node42 kernel:  [<ffffffff8006457b>] __down_write_nested+0x12/0x92
Jun 16 23:09:50 node42 kernel:  [<ffffffff887fe7df>] :gfs2:gfs2_glock_nq_num+0x3b/0x68
Jun 16 23:09:50 node42 kernel:  [<ffffffff800e6493>] set_bdev_super+0x0/0xf
Jun 16 23:09:50 node42 kernel:  [<ffffffff800e64a2>] test_bdev_super+0x0/0xd
Jun 16 23:09:50 node42 kernel:  [<ffffffff8880a32c>] :gfs2:fill_super+0x0/0xab8
Jun 16 23:09:50 node42 kernel:  [<ffffffff800e7461>] get_sb_bdev+0x10a/0x16c
Jun 16 23:09:50 node42 kernel:  [<ffffffff80130c2b>] selinux_sb_copy_data+0x1a1/0x1c5
Jun 16 23:09:50 node42 kernel:  [<ffffffff800e6dfe>] vfs_kern_mount+0x93/0x11a
Jun 16 23:09:50 node42 kernel:  [<ffffffff800e6ec7>] do_kern_mount+0x36/0x4d
Jun 16 23:09:50 node42 kernel:  [<ffffffff800f18c5>] do_mount+0x6a9/0x719
Jun 16 23:09:50 node42 kernel:  [<ffffffff80045ad3>] do_sock_read+0xcf/0x110
Jun 16 23:09:50 node42 kernel:  [<ffffffff8022c620>] sock_aio_read+0x4f/0x5e
Jun 16 23:09:50 node42 kernel:  [<ffffffff8000cfdf>] do_sync_read+0xc7/0x104
Jun 16 23:09:50 node42 kernel:  [<ffffffff800ceeb4>] zone_statistics+0x3e/0x6d
Jun 16 23:09:50 node42 kernel:  [<ffffffff8000f470>] __alloc_pages+0x78/0x308
Jun 16 23:09:50 node42 kernel:  [<ffffffff8004c0df>] sys_mount+0x8a/0xcd
Jun 16 23:09:50 node42 kernel:  [<ffffffff8005d116>] system_call+0x7e/0x83
[....]
Jun 16 23:14:55 node42 kernel: GFS2: fsid=Cluster5:SpaceTravel1.7: jid=7, already locked for use
Jun 16 23:14:55 node42 kernel: GFS2: fsid=Cluster5:SpaceTravel1.7: jid=7: Looking at journal...
Jun 16 23:14:55 node42 kernel: GFS2: fsid=Cluster5:SpaceTravel1.7: jid=7: Done
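The roughly 300-second gap between the node failure and the fence action in the first log corresponds to the fenced post_fail_delay setting in cluster.conf. As a minimal sketch (the inline config fragment below is a stand-in for a real /etc/cluster/cluster.conf), the configured delay can be read out like this:

```shell
# Minimal sketch: extract post_fail_delay from a cluster.conf-style
# <fence_daemon/> line. On a real cluster node, inspect
# /etc/cluster/cluster.conf instead of this hypothetical fragment.
conf='<fence_daemon post_fail_delay="300" post_join_delay="3"/>'

delay=$(printf '%s\n' "$conf" | sed -n 's/.*post_fail_delay="\([0-9]*\)".*/\1/p')
echo "fenced waits ${delay}s after a node failure before fencing it"
```

On a live node, the group state can be checked with the tools shipped in the cman/dlm packages: `group_tool ls` (RHEL 5/6) or `dlm_tool ls` (RHEL 6) shows whether a fence, DLM, or GFS2 group is stuck in a state transition while journal recovery waits.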

Environment

  • Red Hat Enterprise Linux (RHEL) 5 or 6 with clustering
  • GFS2 filesystem
