Cluster does not respond after one of the 3 controllers is powered off

Solution In Progress - Updated -

Issue

  • If only one of the three controllers in the HA cluster goes down, the stack is no longer usable (all OpenStack commands fail).
  • With an incorrect stonith configuration, the failed controller cannot be fenced, and all OpenStack commands fail.

  • /var/log/cluster/corosync.log shows the fencing attempt failing:

Delay(delay)[1312]:     2016/01/15_16:52:41 INFO: Delay is running OK
Jan 15 16:52:59 [4011]        crmd:     info: do_lrm_rsc_op:      Performing key=497:16:0:55c1b34e-e757-4e7e-84c5-bed2227dc2ff op=redis_notify_0
Jan 15 16:52:59 [3982]  stonith-ng:   notice: can_fence_host_with_device: my-ipmilan-for--controller-0 can not fence (reboot) 1-controller-2: static-list
Jan 15 16:52:59 [3982]  stonith-ng:   notice: can_fence_host_with_device: my-ipmilan-for-1-controller-2 can fence (reboot) 1-controller-2: static-list
Jan 15 16:52:59 [4006]        lrmd:     info: log_execute:        executing - rsc:redis action:notify call_id:315
Jan 15 16:52:59 [4006]        lrmd:     info: log_finished:       finished - rsc:redis action:notify call_id:315 pid:2256 exit-code:0 exec-time:47ms queue-time:0ms
Jan 15 16:52:59 [4011]       crmd:   notice: process_lrm_event:  Operation redis_notify_0: ok (node=-controller-1, call=315, rc=0, cib-update=0, confirmed=true)
Jan 15 16:53:00 [3982]  stonith-ng:   notice: can_fence_host_with_device: my-ipmilan-for-controller-0 can not fence (reboot) -controller-2: static-list
Jan 15 16:53:00 [3982] stonith-ng:   notice: can_fence_host_with_device: my-ipmilan-for-jcc1-controller-2 can fence (reboot) controller-2: static-list
Jan 15 16:53:00 [3982]  stonith-ng:     info: stonith_fence_get_devices_cb:       Found 1 matching devices for '1-controller-2'
Jan 15 16:53:00 [3982]  stonith-ng:     info: internal_stonith_action_execute:    Attempt 2 to execute fence_ipmilan (reboot). remaining timeout is 60
Delay(delay)[2071]:     2016/01/15_16:53:01 INFO: Delay is running OK
Jan 15 16:53:01 [3982]  stonith-ng:     info: update_remaining_timeout:   Attempted to execute agent fence_ipmilan (reboot) the maximum number of times (2) allowed
Jan 15 16:53:01 [3982] stonith-ng:    error: log_operation:      Operation 'reboot' [2506] (call 18 from crmd.3973) for host 'controller-2' with device 'my-ipmilan-for--controller-2' returned: -201 (Generic Pacemaker error)
Jan 15 16:53:01 [3982]  stonith-ng:  warning: log_operation:      my-ipmilan-for--controller-2:2506 [ Failed: Unable to obtain correct plug status or plug is not available ]
Jan 15 16:53:01 [3982]  stonith-ng:  warning: log_operation:      my-ipmilan-for--controller-2:2506 [  ]
Jan 15 16:53:01 [3982]  stonith-ng:  warning: log_operation:      my-ipmilan-for--controller-2:2506 [  ]
Jan 15 16:53:01 [3982] stonith-ng:   notice: remote_op_done:     Operation reboot of -controller-2 by <no-one> for crmd.3973@-controller-0.4cdcdb7b: No route to host
Jan 15 16:53:01 [4011]        crmd:   notice: tengine_stonith_notify:     Peer -controller-2 was not terminated (reboot) by <anyone> for -controller-0: No route to host (ref=4cdcdb7b-3c9b-4dcb-a591-29cdaf6399ac) by client crmd.3973
Delay(delay)[2932]:     2016/01/15_16:53:21 INFO: Delay is running OK
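
The failed fencing operation in the excerpt above can be picked out of a longer log with a short script. The following is a minimal sketch, assuming the stonith-ng error lines keep the same layout as in this excerpt; the function name find_failed_fence_ops and the regular expression are illustrative and not part of Pacemaker or any Red Hat tool.

```python
import re

# Matches stonith-ng "log_operation" error lines such as the one logged at
# 16:53:01 in the excerpt. The pattern is an assumption based on that single
# line; other Pacemaker versions may format the message differently.
FAILED_OP = re.compile(
    r"stonith-ng:\s+error: log_operation:\s+"
    r"Operation '(?P<action>\w+)' \[\d+\] "
    r"\(call \d+ from [^)]+\) "
    r"for host '(?P<host>[^']+)' "
    r"with device '(?P<device>[^']+)' "
    r"returned: (?P<rc>-?\d+) \((?P<reason>[^)]*)\)"
)


def find_failed_fence_ops(lines):
    """Return one dict per failed fencing operation found in the log lines."""
    failures = []
    for line in lines:
        match = FAILED_OP.search(line)
        if match:
            # groupdict() yields action, host, device, rc and reason as strings
            failures.append(match.groupdict())
    return failures
```

To scan a real log, open /var/log/cluster/corosync.log and pass the file object to find_failed_fence_ops; each returned entry names the host that could not be fenced, the stonith device that was tried, and the return code.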

Environment

  • Red Hat OpenStack 7.0
