The node does not reboot although the fence operation completes successfully

A manual fence of node1 is issued from node2:

[root@clus2 ~]# pcs stonith fence clus1
Node: clus1 fenced
[root@clus2 ~]# 

Meanwhile, the logs on node1 show:

Mar 02 07:27:58 clus1 pacemaker-fenced[1018]:  notice: scsi is eligible to fence (reboot) clus1: static-list
Mar 02 07:27:58 clus1 pacemaker-fenced[1018]:  notice: Operation 'reboot' targeting clus1 by clus2 for stonith_admin.1598@clus2: OK (complete)
Mar 02 07:27:58 clus1 pacemaker-controld[1023]:  crit: We were allegedly just fenced by clus2 for clus2!
Mar 02 07:27:58 clus1 pacemaker-execd[1019]:  warning: new_event_notification (/dev/shm/qb-1019-1023-9-ThT8CY/qb): Bad file descriptor (9)
Mar 02 07:27:58 clus1 pacemaker-execd[1019]:  warning: Could not notify client crmd: Bad file descriptor
Mar 02 07:27:58 clus1 pacemakerd[1016]:  warning: Shutting cluster down because pacemaker-controld[1023] had fatal failure
Mar 02 07:27:58 clus1 pacemakerd[1016]:  notice: Shutting down Pacemaker
Mar 02 07:27:58 clus1 pacemakerd[1016]:  notice: Stopping pacemaker-schedulerd
Mar 02 07:27:58 clus1 pacemaker-schedulerd[1021]:  notice: Caught 'Terminated' signal
Mar 02 07:27:58 clus1 pacemakerd[1016]:  notice: Stopping pacemaker-attrd
Mar 02 07:27:58 clus1 pacemaker-attrd[1020]:  notice: Caught 'Terminated' signal
Mar 02 07:27:58 clus1 pacemakerd[1016]:  notice: Stopping pacemaker-execd
Mar 02 07:27:58 clus1 pacemaker-execd[1019]:  notice: Caught 'Terminated' signal
Mar 02 07:27:58 clus1 pacemakerd[1016]:  notice: Stopping pacemaker-fenced
Mar 02 07:27:58 clus1 pacemaker-fenced[1018]:  notice: Caught 'Terminated' signal
Mar 02 07:27:58 clus1 pacemakerd[1016]:  notice: Stopping pacemaker-based
Mar 02 07:27:58 clus1 pacemaker-based[1017]:  notice: Caught 'Terminated' signal
Mar 02 07:27:58 clus1 pacemaker-based[1017]:  notice: Disconnected from Corosync
Mar 02 07:27:58 clus1 pacemaker-based[1017]:  notice: Disconnected from Corosync
Mar 02 07:27:58 clus1 pacemakerd[1016]:  notice: Shutdown complete
Mar 02 07:27:58 clus1 pacemakerd[1016]:  notice: Shutting down and staying down after fatal error
Mar 02 07:27:58 clus1 corosync[896]:   [CFG   ] Node 1 was shut down by sysadmin
Mar 02 07:27:58 clus1 corosync[896]:   [SERV  ] Unloading all Corosync service engines.
Mar 02 07:27:58 clus1 systemd[1]: pacemaker.service: Succeeded.
Mar 02 07:27:58 clus1 corosync[896]:   [QB    ] withdrawing server sockets
Mar 02 07:27:58 clus1 corosync[896]:   [SERV  ] Service engine unloaded: corosync vote quorum service v1.0
Mar 02 07:27:58 clus1 corosync[896]:   [QB    ] withdrawing server sockets
Mar 02 07:27:58 clus1 corosync[896]:   [SERV  ] Service engine unloaded: corosync configuration map access
Mar 02 07:27:58 clus1 corosync[896]:   [QB    ] withdrawing server sockets
Mar 02 07:27:58 clus1 corosync[896]:   [SERV  ] Service engine unloaded: corosync configuration service
Mar 02 07:27:58 clus1 corosync[896]:   [QB    ] withdrawing server sockets
Mar 02 07:27:58 clus1 corosync[896]:   [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
Mar 02 07:27:58 clus1 corosync[896]:   [QB    ] withdrawing server sockets
Mar 02 07:27:58 clus1 corosync[896]:   [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
Mar 02 07:27:58 clus1 corosync[896]:   [SERV  ] Service engine unloaded: corosync profile loading service
Mar 02 07:27:59 clus1 corosync[896]:   [MAIN  ] Corosync Cluster Engine exiting normally
Mar 02 07:27:59 clus1 systemd[1]: corosync.service: Control process exited, code=exited status=1
Mar 02 07:27:59 clus1 systemd[1]: corosync.service: Failed with result 'exit-code'.

The reboot then fails: node1 stays alive after these log messages.
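
To confirm that the fence actually revoked clus1's registration on the shared device, the remaining reservation keys can be read from node2 (a sketch assuming sg3_utils is installed; clus1's key should no longer be listed):

[root@clus2 ~]# sg_persist --in --read-keys --device=/dev/sda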

This is the stonith configuration:

[root@clus2 ~]# pcs stonith config 
Resource: scsi (class=stonith type=fence_scsi)
  Attributes: scsi-instance_attributes
    debug_file=/root/fence.debug
    devices=/dev/sda
    pcmk_host_list="clus1 clus2"
    verbose=yes
  Meta Attributes: scsi-meta_attributes
    provides=unfencing
  Operations:
    monitor: scsi-monitor-interval-60s
      interval=60s
[root@clus2 ~]#

Note that pcmk_reboot_action=on is not configured.
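
For context, fence_scsi is a storage-based agent: a successful fence only removes the target's registration key from the shared device; it does not power-cycle the node by itself. Below is a minimal sketch of the watchdog integration that is commonly paired with fence_scsi so that a fenced node reboots itself; the package name and the /usr/share/cluster/fence_scsi_check path are assumptions based on the fence-agents-scsi packaging and may differ between distributions. It would be run on each cluster node:

[root@clus1 ~]# dnf install watchdog                                      # watchdog daemon (package name assumed)
[root@clus1 ~]# cp /usr/share/cluster/fence_scsi_check /etc/watchdog.d/   # script path assumed from fence-agents-scsi
[root@clus1 ~]# systemctl enable --now watchdog

With this in place, the watchdog daemon runs fence_scsi_check periodically and reboots the node once its key has been removed from the shared device. If pcmk_reboot_action were to be set instead, the pcs syntax would be, for example, pcs stonith update scsi pcmk_reboot_action=off.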

Responses