fence_ipmilan reports "Connection timed out" in a RHEL 7 High Availability cluster

Environment

  • Red Hat Enterprise Linux (RHEL) 7 with the High Availability Add-On
  • One or more stonith devices configured to use fence_ipmilan as the agent
  • fence-agents-ipmilan-4.0.11-11.el7 or later

Issue

  • My fence_ipmilan device is reporting "Connection timed out"
Jun 01 11:44:45 [2824] node1.example.com stonith-ng:   notice: log_operation:   Operation 'monitor' [12742] for device 'fence_node1' returned: -110 (Connection timed out)
Jun 01 11:44:45 [2824] node1.example.com stonith-ng:  warning: log_operation:   fence_node1:12742 [ Connection timed out ]
Jun 01 11:44:45 [2824] node1.example.com stonith-ng:  warning: log_operation:   fence_node1:12742 [  ]
Jun 01 11:44:45 [2824] node1.example.com stonith-ng:  warning: log_operation:   fence_node1:12742 [  ]
Jun 01 11:44:45 [2825] node1.example.com       lrmd:     info: log_finished:    finished - rsc:fence_node1 action:start call_id:312  exit-code:1 exec-time:41116ms queue-time:1ms
Jun 01 11:44:46 [2828] node1.example.com       crmd:    error: process_lrm_event:   Operation fence_node1_start_0 (node=node1.example.com, call=312, status=4, cib-update=803, confirmed=true) Error
  • My stonith devices fail to start and fail their monitor operations. I have reconfigured them several times with different timeout settings, such as pcmk_monitor_timeout=120s and stonith-timeout, but nothing has helped.
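
To confirm which devices are affected, the log can be searched for operations that returned -110. The following is a minimal sketch, not part of the original solution: the function name timed_out_devices is hypothetical, and it assumes the stonith-ng messages use exactly the format shown in the excerpt above.

```shell
# Hypothetical helper: read cluster log lines on stdin and print the name
# of each stonith device whose operation returned -110 (Connection timed out).
# Assumes the stonith-ng log_operation format shown in the excerpt above.
timed_out_devices() {
    sed -n "s/.*for device '\([^']*\)' returned: -110 .*/\1/p" | sort -u
}
```

It could be fed from whichever file holds the Pacemaker log on the node, for example `timed_out_devices < /var/log/cluster/corosync.log`; that path is an assumption and depends on the logging configuration.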

Resolution

Set a power_timeout value in the device's attributes that is higher than the default 20 seconds.

# # Syntax when creating a new device:
# #   pcs stonith create <device> <agent> <attributes> power_timeout=<seconds>
# pcs stonith create node1_ipmi fence_ipmilan ipaddr=node1-ipmi.example.com lanplus=1 login=admin password='a2@7czD44#pQrs7UX.' power_timeout=60
# # Syntax when updating an existing device:
# #   pcs stonith update <device> power_timeout=<seconds>
# pcs stonith update node1_ipmi power_timeout=60

Root Cause

In RHEL 7 Update 1 (fence-agents-ipmilan-4.0.11-11.el7), the fence_ipmilan agent was replaced with a new implementation built on the shared fencing library that many other fence agents use. "Connection timed out" is one of the standard errors that agents using this library can report. For fence_ipmilan, it means that the ipmitool command the agent spawned did not return within the time allocated for it. That limit is controlled by the power_timeout attribute, which defaults to 20 seconds, so raising it gives the device more time to complete the command and may avoid the error.
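
One way to choose a suitable power_timeout is to measure how long the IPMI device takes to answer a status request. The following is a rough sketch under stated assumptions: the helper name elapsed_seconds is hypothetical, and the ipmitool invocation in the comment only approximates what the agent runs; the exact arguments fence_ipmilan passes are not given in this solution.

```shell
# Hypothetical helper: run the given command, discard its output, and
# print the number of whole seconds it took to return.
elapsed_seconds() {
    local start end
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Assumed usage, reusing the host and login from the example device above
# (replace <password> with the real one):
#   elapsed_seconds ipmitool -I lanplus -H node1-ipmi.example.com \
#       -U admin -P '<password>' chassis power status
```

If the reported time approaches or exceeds 20 seconds, the default power_timeout is probably too short for that device, and a value comfortably above the measured time is a reasonable starting point.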

For information on resolving timeout errors more generally, see this related solution.

