Fencing fails in a RHEL 7, 8, or 9 High Availability cluster because systemd initiates a graceful shutdown
Issue
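- Fencing fails because systemd-logind handles the "power button" signal and starts a graceful shutdown instead of an immediate power-off.
- When one node fences another, the fenced node reacts as if its power button had been pressed and begins shutting down. Meanwhile, fencing fails on the other node, apparently because the operation takes too long.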
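- Do we need to disable acpi / acpid in a RHEL 7 cluster, as in earlier releases?
- Besides disabling ACPI on RHEL 7 cluster nodes, is anything else required to avoid a soft shutdown? For example: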
Aug 13 21:07:22 node01 systemd-logind: Power key pressed.
Aug 13 21:07:22 node01 systemd-logind: Powering Off...
Aug 13 21:07:22 node01 systemd-logind: System is powering down.
Aug 13 21:07:42 node02 stonith-ng[2803]: notice: log_operation: Operation 'reboot' [3114] for device 'node01-ilo' returned: -62 (Timer expired)
- In RHEL 7, a cluster node is rebooted gracefully instead of being hard-killed:
Nov 2 10:57:01 node41 stonith-ng[8161]: notice: Operation reboot of node42 by node42 for crmd.20238@uxplpsgrd03.8b66209c: OK
Nov 2 10:57:01 node42 crmd[20238]: crit: We were allegedly just fenced by node41 for node42!
Nov 2 10:57:01 node42 stonith-ng[20234]: notice: Operation reboot of node42 by node41 for crmd.20238@node42.8b66209c: OK
Nov 2 10:57:01 node42 systemd-logind: Power key pressed.
- In RHEL 8, a cluster node is rebooted gracefully instead of being hard-killed:
Sep 18 16:19:11 rhel8-1 stonith-ng[8161]: notice: Operation reboot of rhel8-1 by rhel8-2 for crmd.20238@uxplpsgrd03.8b66209c: OK
Sep 18 16:19:11 rhel8-1 crmd[20238]: crit: We were allegedly just fenced by rhel8-1 for rhel8-2!
Sep 18 16:19:11 rhel8-1 systemd-logind[792]: Session 1 logged out. Waiting for processes to exit.
Sep 18 16:19:11 rhel8-1 systemd-logind[792]: Removed session 1.
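
The log excerpts above share two recognizable signatures: a systemd-logind "Power key pressed" message on the node being fenced, and a stonith-ng "Timer expired" result on the node doing the fencing. As a rough aid for spotting this pattern in collected logs, the following Python sketch scans syslog-style lines for both. The function name and the regular expressions are illustrative assumptions derived only from the sample lines in this article; message formats may differ between releases.

```python
import re

# Assumed patterns, modeled on the sample log lines in this article only.
# Syslog prefix: "<Mon> <day> <HH:MM:SS> <host> <daemon>[pid]: <message>"
POWER_KEY_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(?P<host>\S+)\s+"
    r"systemd-logind(?:\[\d+\])?:\s+Power key pressed"
)
FENCE_TIMEOUT_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(?P<host>\S+)\s+"
    r"stonith-ng(?:\[\d+\])?:.*Operation '(?P<op>\w+)'"
    r".*returned: -62 \(Timer expired\)"
)


def find_fencing_symptoms(lines):
    """Return (power_key_hosts, fence_timeouts) found in the given log lines.

    power_key_hosts: hosts where systemd-logind logged a power key press.
    fence_timeouts:  (host, operation) pairs where stonith-ng reported
                     a fence operation that ended with "Timer expired".
    """
    power_key_hosts = []
    fence_timeouts = []
    for line in lines:
        match = POWER_KEY_RE.match(line)
        if match:
            power_key_hosts.append(match.group("host"))
            continue
        match = FENCE_TIMEOUT_RE.match(line)
        if match:
            fence_timeouts.append((match.group("host"), match.group("op")))
    return power_key_hosts, fence_timeouts


if __name__ == "__main__":
    sample = [
        "Aug 13 21:07:22 node01 systemd-logind: Power key pressed.",
        "Aug 13 21:07:22 node01 systemd-logind: Powering Off...",
        "Aug 13 21:07:42 node02 stonith-ng[2803]: notice: log_operation: "
        "Operation 'reboot' [3114] for device 'node01-ilo' returned: -62 (Timer expired)",
    ]
    hosts, timeouts = find_fencing_symptoms(sample)
    print("Power key handled on:", hosts)
    print("Fence operations timed out:", timeouts)
```

Seeing both signatures close together in time, on the fenced node and the fencing node respectively, is consistent with the issue described here, though it does not by itself prove the cause.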
Environment
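- Red Hat Enterprise Linux (RHEL) 7, 8, and 9 with the High Availability Add-On
- One or more pacemaker cluster nodes (or pacemaker remote nodes) associated with a stonith device that uses a power-based method and connects to a BMC or system management controller (such as iLO, RSA, DRAC, or iDRAC).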