The GFS2 filesystem took a long time to successfully mount or replay the journal on a fenced cluster node in RHEL 5 or 6
Issue
- The GFS2 file system took a long time to replay the journal on a fenced cluster node (a check of the fence delay configuration is sketched after this list):
Nov 9 20:46:45 node1 clurgmgrd[15745]: <info> Waiting for node #2 to be fenced
Nov 9 20:51:40 node1 fenced[9253]: node2 not a cluster member after 300 sec post_fail_delay
Nov 9 20:51:40 node1 fenced[9253]: fencing node "node2"
Nov 9 20:51:59 node1 fenced[9253]: fence "node2" success
Nov 9 20:51:59 node1 kernel: GFS2: fsid=prod:gfs1.0: jid=1: Trying to acquire journal lock...
....
Nov 9 21:41:27 node1 kernel: GFS2: fsid=prod:gfsprod01.0: jid=1: Looking at journal...
Nov 9 21:41:27 node1 kernel: GFS2: fsid=prod:gfsprod01.0: jid=1: Acquiring the transaction lock...
Nov 9 21:41:27 node1 kernel: GFS2: fsid=prod:gfsprod01.0: jid=1: Replaying journal...
Nov 9 21:41:27 node1 kernel: GFS2: fsid=prod:gfsprod01.0: jid=1: Replayed 364 of 365 blocks
Nov 9 21:41:27 node1 kernel: GFS2: fsid=prod:gfsprod01.0: jid=1: Found 1 revoke tags
Nov 9 21:41:27 node1 kernel: GFS2: fsid=prod:gfsprod01.0: jid=1: Journal replayed in 1s
Nov 9 21:41:27 node1 kernel: GFS2: fsid=prod:gfsprod01.0: jid=1: Done
- After a node was fenced, we could not access the GFS2 file system for a long time
- After fencing, it took an excessive amount of time for the file system to become available
- A node was 'evicted' and fenced moments later, but its services were not restarted on any other node until roughly 20 minutes later.
- When a cluster node mounts a GFS2 filesystem with mount.gfs2, the mount takes an unusually long time to complete. The backtrace of the mount.gfs2 process shows it waiting on DLM (a way to collect such a backtrace on demand is sketched after this list):
Jun 16 23:06:03 node42 kernel: GFS2: fsid=: Trying to join cluster "lock_dlm", "Cluster5:SpaceTravel1"
Jun 16 23:06:04 node42 kernel: GFS2: fsid=Cluster5:SpaceTravel1.7: Joined cluster. Now mounting FS...
[....]
Jun 16 23:09:50 node42 kernel: INFO: task mount.gfs2:4427 blocked for more than 120 seconds.
Jun 16 23:09:50 node42 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 16 23:09:50 node42 kernel: mount.gfs2 D ffffffff801546d1 0 4427 1 5414 5131 (NOTLB)
Jun 16 23:09:50 node42 kernel: ffff811077c0d8e8 0000000000000082 ffff811077c0d8e8 ffff81107f05a3a0
Jun 16 23:09:50 node42 kernel: 0000000000000010 0000000000000008 ffff81107a665860 ffff81189c2557e0
Jun 16 23:09:50 node42 kernel: 000018e6504c9fb4 0000000000005fca ffff81107a665a48 000000097fffffff
Jun 16 23:09:50 node42 kernel: Call Trace:
Jun 16 23:09:50 node42 kernel: [<ffffffff8006467c>] __down_read+0x7a/0x92
Jun 16 23:09:50 node42 kernel: [<ffffffff887d3c9c>] :dlm:dlm_lock+0x4b/0x129
Jun 16 23:09:50 node42 kernel: [<ffffffff8887885a>] :lock_dlm:gdlm_do_lock+0x6c/0xd7
Jun 16 23:09:50 node42 kernel: [<ffffffff888784c0>] :lock_dlm:gdlm_ast+0x0/0x32e
Jun 16 23:09:50 node42 kernel: [<ffffffff88878a1c>] :lock_dlm:gdlm_bast+0x0/0xdd
Jun 16 23:09:50 node42 kernel: [<ffffffff887fe1a9>] :gfs2:do_xmote+0x161/0x1c1
Jun 16 23:09:50 node42 kernel: [<ffffffff887fe685>] :gfs2:gfs2_glock_nq+0x264/0x28f
Jun 16 23:09:50 node42 kernel: [<ffffffff887fe7e7>] :gfs2:gfs2_glock_nq_num+0x43/0x68
Jun 16 23:09:50 node42 kernel: [<ffffffff88809b5f>] :gfs2:init_locking+0x2e/0x14b
Jun 16 23:09:50 node42 kernel: [<ffffffff8880a848>] :gfs2:fill_super+0x51c/0xab8
Jun 16 23:09:50 node42 kernel: [<ffffffff8006457b>] __down_write_nested+0x12/0x92
Jun 16 23:09:50 node42 kernel: [<ffffffff887fe7df>] :gfs2:gfs2_glock_nq_num+0x3b/0x68
Jun 16 23:09:50 node42 kernel: [<ffffffff800e6493>] set_bdev_super+0x0/0xf
Jun 16 23:09:50 node42 kernel: [<ffffffff800e64a2>] test_bdev_super+0x0/0xd
Jun 16 23:09:50 node42 kernel: [<ffffffff8880a32c>] :gfs2:fill_super+0x0/0xab8
Jun 16 23:09:50 node42 kernel: [<ffffffff800e7461>] get_sb_bdev+0x10a/0x16c
Jun 16 23:09:50 node42 kernel: [<ffffffff80130c2b>] selinux_sb_copy_data+0x1a1/0x1c5
Jun 16 23:09:50 node42 kernel: [<ffffffff800e6dfe>] vfs_kern_mount+0x93/0x11a
Jun 16 23:09:50 node42 kernel: [<ffffffff800e6ec7>] do_kern_mount+0x36/0x4d
Jun 16 23:09:50 node42 kernel: [<ffffffff800f18c5>] do_mount+0x6a9/0x719
Jun 16 23:09:50 node42 kernel: [<ffffffff80045ad3>] do_sock_read+0xcf/0x110
Jun 16 23:09:50 node42 kernel: [<ffffffff8022c620>] sock_aio_read+0x4f/0x5e
Jun 16 23:09:50 node42 kernel: [<ffffffff8000cfdf>] do_sync_read+0xc7/0x104
Jun 16 23:09:50 node42 kernel: [<ffffffff800ceeb4>] zone_statistics+0x3e/0x6d
Jun 16 23:09:50 node42 kernel: [<ffffffff8000f470>] __alloc_pages+0x78/0x308
Jun 16 23:09:50 node42 kernel: [<ffffffff8004c0df>] sys_mount+0x8a/0xcd
Jun 16 23:09:50 node42 kernel: [<ffffffff8005d116>] system_call+0x7e/0x83
[....]
Jun 16 23:14:55 node42 kernel: GFS2: fsid=Cluster5:SpaceTravel1.7: jid=7, already locked for use
Jun 16 23:14:55 node42 kernel: GFS2: fsid=Cluster5:SpaceTravel1.7: jid=7: Looking at journal...
Jun 16 23:14:55 node42 kernel: GFS2: fsid=Cluster5:SpaceTravel1.7: jid=7: Done
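In the first log excerpt, fenced waited 300 seconds before fencing node2 because of its post_fail_delay setting. As a minimal sketch, assuming the standard cluster configuration file /etc/cluster/cluster.conf, the configured delay can be checked like this:

# Show the fence_daemon settings; post_fail_delay is in seconds
# and defaults to 0, so a value of 300 adds a five-minute wait
# between the node failure and the fence action.
grep fence_daemon /etc/cluster/cluster.conf
# Illustrative output matching the logs above:
#   <fence_daemon post_fail_delay="300" post_join_delay="3"/>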
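The hung-task messages above only appear after a task has been blocked for 120 seconds. A minimal sketch for collecting the same kind of backtrace on demand, assuming root access and the magic SysRq interface on the affected node:

# Enable the magic SysRq interface (it may already be enabled)
echo 1 > /proc/sys/kernel/sysrq
# Ask the kernel to dump stack traces of all uninterruptible
# (D-state) tasks, such as a stuck mount.gfs2, to the kernel log
echo w > /proc/sysrq-trigger
# Read the traces back from the kernel ring buffer
dmesg | tail -n 100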
Environment
- Red Hat Enterprise Linux (RHEL) 5 with the Resilient Storage Add On
  - kernel release prior to 2.6.18-308.13.1.el5 in RHEL 5 Update 8, or prior to kernel-2.6.18-348.el5
- Red Hat Enterprise Linux (RHEL) 6 with the Resilient Storage Add On
  - kernel release prior to kernel-2.6.32-279.el6
- GFS2
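To confirm whether a node is running one of the affected kernels, compare the running release with the versions listed above; for example:

# Report the running kernel release; compare it with the fixed
# versions above (2.6.18-308.13.1.el5 / 2.6.18-348.el5 on RHEL 5,
# 2.6.32-279.el6 on RHEL 6)
uname -r
# List every installed kernel package, in case a newer kernel
# is installed but not yet booted
rpm -q kernel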