Cluster services and/or GFS2 become blocked when a node fails or leaves in a RHEL 6 High Availability cluster running Trend Micro Server Protect
Issue
- GFS2 cluster requires all servers to be rebooted at the same time.
- If one node in the cluster reboots, the other node blocks and never fences it, and the surviving node's GFS2 file systems become blocked.
- I see the kernel reporting that corosync has blocked for more than 120 seconds after a TOTEM processor failure, and other GFS2-related processes block as well:
Jun 23 14:36:23 node2 corosync[2341]: [TOTEM ] A processor failed, forming new configuration.
Jun 23 14:40:23 node2 kernel: INFO: task corosync:2341 blocked for more than 120 seconds.
Jun 23 14:40:23 node2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 23 14:40:23 node2 kernel: corosync D 0000000000000003 0 2341 1 0x00000080
Jun 23 14:40:23 node2 kernel: ffff880c577f7e28 0000000000000086 ffff880c577f7d98 ffffffff81090c06
Jun 23 14:40:23 node2 kernel: ffff880c5c8fb380 ffffffffa0459da0 ffff880c577f7de8 ffffffff8104c9a9
Jun 23 14:40:23 node2 kernel: ffff880c5c3ed038 ffff880c577f7fd8 000000000000f4e8 ffff880c5c3ed038
Jun 23 14:40:23 node2 kernel: Call Trace:
Jun 23 14:40:23 node2 kernel: [<ffffffff81090c06>] ? autoremove_wake_function+0x16/0x40
Jun 23 14:40:23 node2 kernel: [<ffffffff8104c9a9>] ? __wake_up_common+0x59/0x90
Jun 23 14:40:23 node2 kernel: [<ffffffff81090ede>] ? prepare_to_wait+0x4e/0x80
Jun 23 14:40:23 node2 kernel: [<ffffffffa04478a5>] closeHook+0x615/0xbe0 [splxmod]
Jun 23 14:40:23 node2 kernel: [<ffffffff81176652>] ? vfs_write+0x132/0x1a0
Jun 23 14:40:23 node2 kernel: [<ffffffff81090bf0>] ? autoremove_wake_function+0x0/0x40
Jun 23 14:40:23 node2 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
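To check whether a node's logs show this pattern, one option is to look for kernel hung-task reports whose call trace includes a frame from the splxmod module, as the closeHook frame does above. Below is a minimal Python sketch that assumes the syslog format shown in the excerpt. The function find_hung_tasks and its regular expressions are hypothetical helpers written for illustration. They are not part of RHEL, the cluster stack, or Trend Micro's tools.

```python
import re

# Hypothetical helper: matches the kernel hung-task warning, e.g.
# "INFO: task corosync:2341 blocked for more than 120 seconds."
HUNG_RE = re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds")

# Matches a call-trace frame that is tagged with a module name, e.g.
# "[<ffffffffa04478a5>] closeHook+0x615/0xbe0 [splxmod]"
# The optional "? " covers frames the kernel marks as unreliable.
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+(?:\?\s+)?(\S+)\s+\[(\w+)\]")


def find_hung_tasks(lines, module="splxmod"):
    """Return (task, pid, seconds, frames) for each hung-task report whose
    following call-trace lines contain at least one frame from `module`.

    Heuristic: module frames are attributed to the most recent hung-task
    report seen, so unrelated traces later in the log could be misattributed.
    """
    reports = []
    current = None
    for line in lines:
        hung = HUNG_RE.search(line)
        if hung:
            current = {
                "task": hung.group(1),
                "pid": int(hung.group(2)),
                "seconds": int(hung.group(3)),
                "frames": [],
            }
            reports.append(current)
            continue
        if current is None:
            continue  # no hung-task report seen yet
        frame = FRAME_RE.search(line)
        if frame and frame.group(2) == module:
            current["frames"].append(frame.group(1))
    # Keep only reports whose trace actually passed through the module.
    return [
        (r["task"], r["pid"], r["seconds"], r["frames"])
        for r in reports
        if r["frames"]
    ]
```

For example, iterating over an open /var/log/messages and passing it to find_hung_tasks would list each blocked task, such as corosync, together with the splxmod functions in its trace. Frames without a module tag, such as vfs_write, are ignored.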
Environment
- Red Hat Enterprise Linux (RHEL) 6 with the High Availability Add On
- Trend Micro Server Protect