Cluster does not respond after one of the 3 controllers is powered off
Issue
- If only one of the three controllers in the HA cluster goes down, the stack is no longer usable: all OpenStack commands fail.
- An incorrect stonith configuration causes all OpenStack commands to fail.
- The following messages are logged in /var/log/cluster/corosync.log:
Delay(delay)[1312]: 2016/01/15_16:52:41 INFO: Delay is running OK
Jan 15 16:52:59 [4011] crmd: info: do_lrm_rsc_op: Performing key=497:16:0:55c1b34e-e757-4e7e-84c5-bed2227dc2ff op=redis_notify_0
Jan 15 16:52:59 [3982] stonith-ng: notice: can_fence_host_with_device: my-ipmilan-for--controller-0 can not fence (reboot) 1-controller-2: static-list
Jan 15 16:52:59 [3982] stonith-ng: notice: can_fence_host_with_device: my-ipmilan-for-1-controller-2 can fence (reboot) 1-controller-2: static-list
Jan 15 16:52:59 [4006] lrmd: info: log_execute: executing - rsc:redis action:notify call_id:315
Jan 15 16:52:59 [4006] lrmd: info: log_finished: finished - rsc:redis action:notify call_id:315 pid:2256 exit-code:0 exec-time:47ms queue-time:0ms
Jan 15 16:52:59 [4011] crmd: notice: process_lrm_event: Operation redis_notify_0: ok (node=-controller-1, call=315, rc=0, cib-update=0, confirmed=true)
Jan 15 16:53:00 [3982] stonith-ng: notice: can_fence_host_with_device: my-ipmilan-for-controller-0 can not fence (reboot) -controller-2: static-list
Jan 15 16:53:00 [3982] stonith-ng: notice: can_fence_host_with_device: my-ipmilan-for-jcc1-controller-2 can fence (reboot) controller-2: static-list
Jan 15 16:53:00 [3982] stonith-ng: info: stonith_fence_get_devices_cb: Found 1 matching devices for '1-controller-2'
Jan 15 16:53:00 [3982] stonith-ng: info: internal_stonith_action_execute: Attempt 2 to execute fence_ipmilan (reboot). remaining timeout is 60
Delay(delay)[2071]: 2016/01/15_16:53:01 INFO: Delay is running OK
Jan 15 16:53:01 [3982] stonith-ng: info: update_remaining_timeout: Attempted to execute agent fence_ipmilan (reboot) the maximum number of times (2) allowed
Jan 15 16:53:01 [3982] stonith-ng: error: log_operation: Operation 'reboot' [2506] (call 18 from crmd.3973) for host 'controller-2' with device 'my-ipmilan-for--controller-2' returned: -201 (Generic Pacemaker error)
Jan 15 16:53:01 [3982] stonith-ng: warning: log_operation: my-ipmilan-for--controller-2:2506 [ Failed: Unable to obtain correct plug status or plug is not available ]
Jan 15 16:53:01 [3982] stonith-ng: warning: log_operation: my-ipmilan-for--controller-2:2506 [ ]
Jan 15 16:53:01 [3982] stonith-ng: warning: log_operation: my-ipmilan-for--controller-2:2506 [ ]
Jan 15 16:53:01 [3982] stonith-ng: notice: remote_op_done: Operation reboot of -controller-2 by <no-one> for crmd.3973@-controller-0.4cdcdb7b: No route to host
Jan 15 16:53:01 [4011] crmd: notice: tengine_stonith_notify: Peer -controller-2 was not terminated (reboot) by <anyone> for -controller-0: No route to host (ref=4cdcdb7b-3c9b-4dcb-a591-29cdaf6399ac) by client crmd.3973
Delay(delay)[2932]: 2016/01/15_16:53:21 INFO: Delay is running OK
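The failed fencing attempt in the excerpt above can be picked out of a long log automatically. The following is a minimal sketch, not an official tool: the function name find_fence_failures and the regular expression are illustrative assumptions, written only against the stonith-ng "log_operation" error line format shown in this excerpt. Other Pacemaker versions may format the line differently.

```python
import re

# Assumption: the pattern is derived solely from the stonith-ng error line
# shown in the excerpt; it is not an official Pacemaker log grammar.
FENCE_FAILURE = re.compile(
    r"stonith-ng:\s+error:\s+log_operation:\s+Operation '(?P<action>\w+)'.*?"
    r"for host '(?P<host>[^']*)' with device '(?P<device>[^']*)' "
    r"returned: (?P<rc>-?\d+) \((?P<reason>[^)]*)\)"
)


def find_fence_failures(lines):
    """Return one dict per failed fencing operation found in the given lines."""
    failures = []
    for line in lines:
        match = FENCE_FAILURE.search(line)
        if match:
            failures.append(match.groupdict())
    return failures


# Two lines copied from the excerpt: one fencing error, one unrelated message.
sample = [
    "Jan 15 16:53:01 [3982] stonith-ng: error: log_operation: Operation 'reboot' "
    "[2506] (call 18 from crmd.3973) for host 'controller-2' with device "
    "'my-ipmilan-for--controller-2' returned: -201 (Generic Pacemaker error)",
    "Jan 15 16:52:59 [4006] lrmd: info: log_execute: executing - rsc:redis "
    "action:notify call_id:315",
]

failures = find_fence_failures(sample)
for failure in failures:
    print(failure["host"], failure["device"], failure["rc"], failure["reason"])
```

To scan a real log, an open file object can be passed instead of the sample list, for example find_fence_failures(open("/var/log/cluster/corosync.log")). Each result names the host that could not be fenced and the stonith device that was tried, which is the device whose configuration should be reviewed.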
Environment
- Red Hat OpenStack 7.0