The node does not reboot although the fence operation completes successfully

A manual fence of node1 is issued from node2:

[root@clus2 ~]# pcs stonith fence clus1
Node: clus1 fenced
[root@clus2 ~]# 

Meanwhile, the logs on node1 show:

Mar 02 07:27:58 clus1 pacemaker-fenced[1018]:  notice: scsi is eligible to fence (reboot) clus1: static-list
Mar 02 07:27:58 clus1 pacemaker-fenced[1018]:  notice: Operation 'reboot' targeting clus1 by clus2 for stonith_admin.1598@clus2: OK (complete)
Mar 02 07:27:58 clus1 pacemaker-controld[1023]:  crit: We were allegedly just fenced by clus2 for clus2!
Mar 02 07:27:58 clus1 pacemaker-execd[1019]:  warning: new_event_notification (/dev/shm/qb-1019-1023-9-ThT8CY/qb): Bad file descriptor (9)
Mar 02 07:27:58 clus1 pacemaker-execd[1019]:  warning: Could not notify client crmd: Bad file descriptor
Mar 02 07:27:58 clus1 pacemakerd[1016]:  warning: Shutting cluster down because pacemaker-controld[1023] had fatal failure
Mar 02 07:27:58 clus1 pacemakerd[1016]:  notice: Shutting down Pacemaker
Mar 02 07:27:58 clus1 pacemakerd[1016]:  notice: Stopping pacemaker-schedulerd
Mar 02 07:27:58 clus1 pacemaker-schedulerd[1021]:  notice: Caught 'Terminated' signal
Mar 02 07:27:58 clus1 pacemakerd[1016]:  notice: Stopping pacemaker-attrd
Mar 02 07:27:58 clus1 pacemaker-attrd[1020]:  notice: Caught 'Terminated' signal
Mar 02 07:27:58 clus1 pacemakerd[1016]:  notice: Stopping pacemaker-execd
Mar 02 07:27:58 clus1 pacemaker-execd[1019]:  notice: Caught 'Terminated' signal
Mar 02 07:27:58 clus1 pacemakerd[1016]:  notice: Stopping pacemaker-fenced
Mar 02 07:27:58 clus1 pacemaker-fenced[1018]:  notice: Caught 'Terminated' signal
Mar 02 07:27:58 clus1 pacemakerd[1016]:  notice: Stopping pacemaker-based
Mar 02 07:27:58 clus1 pacemaker-based[1017]:  notice: Caught 'Terminated' signal
Mar 02 07:27:58 clus1 pacemaker-based[1017]:  notice: Disconnected from Corosync
Mar 02 07:27:58 clus1 pacemaker-based[1017]:  notice: Disconnected from Corosync
Mar 02 07:27:58 clus1 pacemakerd[1016]:  notice: Shutdown complete
Mar 02 07:27:58 clus1 pacemakerd[1016]:  notice: Shutting down and staying down after fatal error
Mar 02 07:27:58 clus1 corosync[896]:   [CFG   ] Node 1 was shut down by sysadmin
Mar 02 07:27:58 clus1 corosync[896]:   [SERV  ] Unloading all Corosync service engines.
Mar 02 07:27:58 clus1 systemd[1]: pacemaker.service: Succeeded.
Mar 02 07:27:58 clus1 corosync[896]:   [QB    ] withdrawing server sockets
Mar 02 07:27:58 clus1 corosync[896]:   [SERV  ] Service engine unloaded: corosync vote quorum service v1.0
Mar 02 07:27:58 clus1 corosync[896]:   [QB    ] withdrawing server sockets
Mar 02 07:27:58 clus1 corosync[896]:   [SERV  ] Service engine unloaded: corosync configuration map access
Mar 02 07:27:58 clus1 corosync[896]:   [QB    ] withdrawing server sockets
Mar 02 07:27:58 clus1 corosync[896]:   [SERV  ] Service engine unloaded: corosync configuration service
Mar 02 07:27:58 clus1 corosync[896]:   [QB    ] withdrawing server sockets
Mar 02 07:27:58 clus1 corosync[896]:   [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
Mar 02 07:27:58 clus1 corosync[896]:   [QB    ] withdrawing server sockets
Mar 02 07:27:58 clus1 corosync[896]:   [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
Mar 02 07:27:58 clus1 corosync[896]:   [SERV  ] Service engine unloaded: corosync profile loading service
Mar 02 07:27:59 clus1 corosync[896]:   [MAIN  ] Corosync Cluster Engine exiting normally
Mar 02 07:27:59 clus1 systemd[1]: corosync.service: Control process exited, code=exited status=1
Mar 02 07:27:59 clus1 systemd[1]: corosync.service: Failed with result 'exit-code'.

The reboot then fails: node1 stays alive after these log messages.
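
To confirm that the fence actually revoked clus1's registration on the shared device, the remaining reservation keys can be read from node2 (a sketch assuming sg3_utils is installed; clus1's key should no longer be listed):

[root@clus2 ~]# sg_persist --in --read-keys --device=/dev/sda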

This is the stonith configuration:

[root@clus2 ~]# pcs stonith config 
Resource: scsi (class=stonith type=fence_scsi)
  Attributes: scsi-instance_attributes
    debug_file=/root/fence.debug
    devices=/dev/sda
    pcmk_host_list="clus1 clus2"
    verbose=yes
  Meta Attributes: scsi-meta_attributes
    provides=unfencing
  Operations:
    monitor: scsi-monitor-interval-60s
      interval=60s
[root@clus2 ~]#

Note that pcmk_reboot_action=on is not configured.
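
For context, fence_scsi is a storage-based agent: a successful fence only removes the target's registration key from the shared device; it does not power-cycle the node by itself. Below is a minimal sketch of the watchdog integration that is commonly paired with fence_scsi so that a fenced node reboots itself; the package name and the /usr/share/cluster/fence_scsi_check path are assumptions based on the fence-agents-scsi packaging and may differ between distributions. It would be run on each cluster node:

[root@clus1 ~]# dnf install watchdog                                      # watchdog daemon (package name assumed)
[root@clus1 ~]# cp /usr/share/cluster/fence_scsi_check /etc/watchdog.d/   # script path assumed from fence-agents-scsi
[root@clus1 ~]# systemctl enable --now watchdog

With this in place, the watchdog daemon runs fence_scsi_check periodically and reboots the node once its key has been removed from the shared device. If pcmk_reboot_action were to be set instead, the pcs syntax would be, for example, pcs stonith update scsi pcmk_reboot_action=off.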

Responses