Remote Nodes fence unexpectedly

Solution In Progress

Issue

Compute nodes in an HA OpenStack deployment are unexpectedly fenced, and their Pacemaker Remote connections are abruptly closed.

The following errors may be observed on the controllers when this issue occurs:

$ cat /var/log/pacemaker/pacemaker.log
-------------------------------------->8---------------------------------------
Aug 16 02:39:05 controller1 pacemaker-controld  [5669] (lrmd_tls_dispatch)     info: Lost remote-node10 executor connection while reading data
Aug 16 02:39:05 controller1 pacemaker-controld  [5669] (lrmd_tls_connection_destroy)     info: TLS connection destroyed
Aug 16 02:39:05 controller1 pacemaker-controld  [5669] (remote_lrm_op_callback)     error: Lost connection to Pacemaker Remote node remote-node10
Aug 16 02:39:05 controller1 pacemaker-controld  [5669] (process_lrm_event)     error: Result of monitor operation for remote-node10 on controller1: Error | call=394 key=remote-node10_monitor_60000 confirmed=false status=4 cib-update=8646

$ cat /var/log/pacemaker/pacemaker.log
-------------------------------------->8---------------------------------------
Aug 18 08:46:15 controller3 pacemaker-schedulerd[5824] (unpack_rsc_op_failure)   warning: Unexpected result (error) was recorded for monitor of remote-node5 on controller1 at Aug 18 08:46:15 2023 | rc=1 id=remote-node5_last_failure_0
Aug 18 08:46:15 controller3 pacemaker-schedulerd[5824] (unpack_rsc_op_failure)   notice: remote-node5 will not be started under current conditions
Aug 18 08:46:15 controller3 pacemaker-schedulerd[5824] (pe_fence_node)   warning: Remote node remote-node5 will be fenced: remote connection is unrecoverable

$ cat /var/log/pacemaker/pacemaker.log
-------------------------------------->8---------------------------------------
Aug 21 02:10:06 controller3 pacemaker-controld  [5826] (monitor_timeout_cb)      info: Timed out waiting for remote poke response from remote-node20
Aug 21 02:10:06 controller3 pacemaker-based     [5804] (cib_process_request)     info: Forwarding cib_modify operation for section status to all (origin=local/crmd/205783)
Aug 21 02:10:06 controller3 pacemaker-controld  [5826] (process_lrm_event)       error: Result of monitor operation for remote-node20 on controller3: Timed Out | call=491 key=remote-node20_monitor_60000 timeout=300000ms
Aug 21 02:10:06 controller3 pacemaker-controld  [5826] (lrmd_api_disconnect)     info: Disconnecting TLS remote-node20 executor connection
Aug 21 02:10:06 controller3 pacemaker-controld  [5826] (lrmd_tls_connection_destroy)     info: TLS connection destroyed
Aug 21 02:10:06 controller3 pacemaker-controld  [5826] (remote_lrm_op_callback)  error: Lost connection to Pacemaker Remote node remote-node20
Aug 21 02:10:06 controller3 pacemaker-controld  [5826] (lrmd_api_disconnect)     info: Disconnecting TLS remote-node20 executor connection
-------------------------------------->8---------------------------------------
Aug 21 02:10:07 controller3 pacemaker-schedulerd[5824] (unpack_rsc_op_failure)   warning: Unexpected result (error) was recorded for monitor of remote-node20 on controller3 at Aug 21 02:10:06 2023 | rc=1 id=remote-node20_last_failure_0
Aug 21 02:10:07 controller3 pacemaker-schedulerd[5824] (unpack_rsc_op_failure)   notice: remote-node20 will not be started under current conditions
Aug 21 02:10:07 controller3 pacemaker-schedulerd[5824] (pe_fence_node)   warning: Remote node remote-node20 will be fenced: remote connection is unrecoverable
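
To check whether a controller log contains the same sequence, the three message types shown above (poke timeout, lost connection, scheduled fence) can be collected per remote node. The following is a minimal Python sketch, not part of Pacemaker: the function name `summarize_remote_events`, the event labels, and the `SAMPLE` list are hypothetical, and the regular expressions assume the message text matches the excerpts above exactly, which may differ between Pacemaker versions.

```python
import re
from collections import defaultdict

# Patterns are derived only from the log excerpts in this article (assumption:
# other Pacemaker versions may word or space these messages differently).
PATTERNS = (
    ("poke-timeout", re.compile(
        r"\(monitor_timeout_cb\)\s+info: Timed out waiting for remote poke response from (\S+)")),
    ("connection-lost", re.compile(
        r"\(remote_lrm_op_callback\)\s+error: Lost connection to Pacemaker Remote node (\S+)")),
    ("fence-scheduled", re.compile(
        r"\(pe_fence_node\)\s+warning: Remote node (\S+) will be fenced")),
)

# Three lines taken from the controller3 excerpt above.
SAMPLE = [
    "Aug 21 02:10:06 controller3 pacemaker-controld  [5826] (monitor_timeout_cb)      info: Timed out waiting for remote poke response from remote-node20",
    "Aug 21 02:10:06 controller3 pacemaker-controld  [5826] (remote_lrm_op_callback)  error: Lost connection to Pacemaker Remote node remote-node20",
    "Aug 21 02:10:07 controller3 pacemaker-schedulerd[5824] (pe_fence_node)   warning: Remote node remote-node20 will be fenced: remote connection is unrecoverable",
]


def summarize_remote_events(lines):
    """Return {node_name: [(timestamp, label), ...]} in order of appearance."""
    events = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        for label, pattern in PATTERNS:
            match = pattern.search(line)
            if match:
                # The syslog-style timestamp occupies the first 15 characters.
                events[match.group(1)].append((line[:15], label))
    return dict(events)


if __name__ == "__main__":
    # For a real log, pass an open file instead of SAMPLE, for example:
    #   with open("/var/log/pacemaker/pacemaker.log") as f: summarize_remote_events(f)
    for node, node_events in summarize_remote_events(SAMPLE).items():
        print(node, node_events)
```

A node whose list ends in `fence-scheduled` shortly after `connection-lost` matches the pattern described in this article. The sketch only matches text and makes no statement about the cause of the lost connection.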

The following errors may be observed on the remote nodes. They are seen less often because the node is likely to be fenced before the errors are logged:

Aug 16 17:45:17 remote-node1 pacemaker-remoted[4763]: notice: Cleaning up after remote client pacemaker-remote-10.0.0.10:3121 disconnected
Aug 16 17:45:17 remote-node1 pacemaker-remoted[4763]: error: Could not send remote message: Software caused connection abort
Aug 16 17:45:17 remote-node1 pacemaker-remoted[4763]: warning: Could not notify client pacemaker-remote-10.0.0.10:3121/91a2fb48-4044-29bt-bc30-a5217b3cfe125: Software caused connection abort
Aug 16 17:45:17 remote-node1 pacemaker-remoted[4763]: warning: Could not notify client pacemaker-remote-10.0.0.10:3121/91a2fb48-4044-29bt-bc30-a5217b3cfe125: Software caused connection abort
Aug 16 17:47:07 remote-node1 kernel: Linux version 4.18.0-305.65.1.el8_4.x86_64 (mockbuild@x86-vm-09.build.eng.bos.redhat.com) (gcc version 8.4.1 20200928 (Red Hat 8.4.1-1) (GCC)) #1 SMP Thu Sep 22 08:28:21 EDT 2022 <--- System startup
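
In the excerpt above, the kernel boot banner follows the last `pacemaker-remoted` message by a short interval, which is consistent with the node having been fenced. A rough Python sketch for measuring that interval in a remote node's log is shown below. The helper names and `SAMPLE` are hypothetical, the kernel line in `SAMPLE` is shortened from the excerpt, and the year is an assumption because these log lines do not carry one.

```python
from datetime import datetime

# Assumption: the log lines carry no year, so one must be supplied to parse them.
ASSUMED_YEAR = 2023

# Two lines from the remote-node1 excerpt above (kernel line shortened).
SAMPLE = [
    "Aug 16 17:45:17 remote-node1 pacemaker-remoted[4763]: error: Could not send remote message: Software caused connection abort",
    "Aug 16 17:47:07 remote-node1 kernel: Linux version 4.18.0-305.65.1.el8_4.x86_64",
]


def parse_syslog_time(line, year=ASSUMED_YEAR):
    """Parse the 15-character syslog-style timestamp at the start of a line."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def seconds_until_reboot(lines):
    """Seconds from the last pacemaker-remoted message to the next boot banner.

    Returns None if no boot banner follows a pacemaker-remoted message.
    """
    last_remoted = None
    for line in lines:
        if " pacemaker-remoted[" in line:
            last_remoted = parse_syslog_time(line)
        elif " kernel: Linux version " in line and last_remoted is not None:
            return (parse_syslog_time(line) - last_remoted).total_seconds()
    return None


if __name__ == "__main__":
    print(seconds_until_reboot(SAMPLE))
```

A short interval only suggests that the reboot was a fence action. Confirming it requires the corresponding `pe_fence_node` entry in the controller logs.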

Environment

  • Red Hat Enterprise Linux 8 (with the High Availability Add-On)
  • Red Hat OpenStack Platform
  • pacemaker
