Galera bundle resource is fenced during resource cleanup, and the error reappears in the Failed Resource Actions

Solution In Progress - Updated

Environment

  • Red Hat OpenStack Platform 16.2

Issue

  • The failed galera-bundle-podman-x container has already recovered and is shown running normally as Master in the pcs status output.
  • The pcs resource cleanup command is unable to clear the galera-bundle-x_monitor_30000 error from the Failed Resource Actions section of the pcs status output: after pcs resource cleanup is executed, the galera-bundle-x_monitor_30000 error reappears within 60 seconds (a quick way to confirm this is sketched below).
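  • The recurrence can be observed by comparing the last-rc-change timestamp of the failed action before and after the cleanup. This is only a rough check, not taken from the original report; the grep pattern and the 60-second wait are illustrative.

    # pcs status | grep -A 3 'Failed Resource Actions'
    # pcs resource cleanup galera-bundle
    # sleep 60
    # pcs status | grep -A 3 'Failed Resource Actions'   # a newer last-rc-change indicates the error recurred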

Resolution

  • When a Pacemaker bundle resource has a monitor operation error in the Failed Resource Actions, Pacemaker fences (restarts) the affected bundle replica to recover it.
    While the replica is being restarted by this fence, the monitor operation inside the container is still executed, and a new galera-bundle-x_monitor_30000 error is recorded in the Failed Resource Actions.
  • As a workaround, when cleaning up the galera-bundle resource, the monitor operation during the fence can be avoided by unmanaging the resource before the cleanup and managing it again afterwards, as follows:

    # pcs resource unmanage galera-bundle
    # pcs resource cleanup galera-bundle
    # pcs resource manage galera-bundle 
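
  • After the resource is managed again, it is worth confirming that the failed action does not come back and that no is-managed=false meta attribute was left behind. This verification step is not part of the documented workaround; it is only a suggested check.

    # pcs status | grep -A 3 'Failed Resource Actions'
    # pcs resource config galera-bundle | grep is-managed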
    

Root Cause

  • The pcs resource cleanup process triggers a fence of the resources in the galera-bundle-podman container, and the monitor operation is executed while that fence is in progress, which records a new failure.
  • The galera-bundle-podman container itself has already returned to normal operation, but the issue of resources being restarted during pcs resource cleanup is discussed in Bug 1650754 and has not been fixed at this time.

Diagnostic Steps

Executed podman kill galera-bundle-podman-1 on controller-0 to terminate the galera-bundle-podman-1 container
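
A minimal sketch of this reproduction step, assuming the galera-bundle-1 replica runs on controller-0; the podman ps filter is only illustrative:

[root@controller-0 ~]# podman ps --format '{{.Names}}' | grep galera
[root@controller-0 ~]# podman kill galera-bundle-podman-1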

Apr 08 03:48:07 controller-0 pacemaker-controld  [26856] (process_lrm_event)    error: Result of monitor operation for galera-bundle-1 on controller-0: Error | call=79 key=galera-bundle-1_monitor_30000 confirmed=false status=4 cib-update=802

Which caused the fencing (reboot) of the guest node galera-bundle-1

Apr 08 03:48:07 controller-1 pacemaker-schedulerd[27028] (pe_fence_node)        warning: Guest node galera-bundle-1 will be fenced (by recovering its guest resource galera-bundle-podman-1): galera:1 is thought to be active there
Apr 08 03:48:07 controller-1 pacemaker-schedulerd[27028] (native_stop_constraints)      notice: Stop of failed resource galera:1 is implicit after galera-bundle-1 is fenced
Apr 08 03:48:07 controller-1 pacemaker-schedulerd[27028] (LogNodeActions)       notice:  * Fence (reboot) galera-bundle-1 (resource: galera-bundle-podman-1) 'guest is unclean'

galera-bundle-podman-1 was then restarted and recovered

Apr 08 03:48:12 controller-0 pacemaker-controld  [26856] (process_lrm_event)    notice: Result of monitor operation for galera-bundle-1 on controller-0: ok | rc=0 call=82 key=galera-bundle-1_monitor_30000 confirmed=false cib-update=819
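
At this point the container can be confirmed to be running again; a rough check (the filter and format strings are illustrative):

[root@controller-0 ~]# podman ps --filter name=galera-bundle-podman-1 --format '{{.Names}} {{.Status}}'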

Which left the following failed action in pcs status

[root@controller-0 ~]# date; pcs status
Fri Apr  8 03:48:41 UTC 2022
Cluster name: tripleo_cluster
Cluster Summary:
  * Stack: corosync
  * Current DC: controller-1 (version 2.0.5-9.el8_4.3-ba59be7122) - partition with quorum
...
Full List of Resources:
...
  * Container bundle set: galera-bundle [cluster.common.tag/rhosp16-openstack-mariadb:pcmklatest]:
    * galera-bundle-0   (ocf::heartbeat:galera):     Master controller-2
    * galera-bundle-1   (ocf::heartbeat:galera):     Master controller-0
    * galera-bundle-2   (ocf::heartbeat:galera):     Master controller-1
...
Failed Resource Actions:
  * galera-bundle-1_monitor_30000 on controller-0 'error' (1): call=79, status='Error', exitreason='', last-rc-change='2022-04-08 03:48:07Z', queued=0ms, exec=0ms
...
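
Before running the cleanup, the recorded fail count for the replica can also be inspected. This is an optional check; the resource and node names passed to crm_failcount are assumptions based on the output above:

[root@controller-0 ~]# pcs resource failcount show
[root@controller-0 ~]# crm_failcount --query -r galera-bundle-1 -N controller-0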

Executed pcs resource cleanup galera-bundle

Apr 08 03:49:36 controller-0 pacemaker-controld  [26856] (controld_delete_resource_history)     info: Clearing resource history for galera-bundle-1 on controller-0 (via CIB call 828) | xpath=//node_state[@uname='controller-0']/lrm/lrm_resources/lrm_resource[@id='galera-bundle-1']

Which caused the fencing (reboot) of the guest node galera-bundle-1 again

Apr 08 03:49:38 controller-1 pacemaker-schedulerd[27028] (pe_fence_node)        warning: Guest node galera-bundle-1 will be fenced (by recovering its guest resource galera-bundle-podman-1): galera:1 is thought to be active there
Apr 08 03:49:38 controller-1 pacemaker-schedulerd[27028] (native_stop_constraints)      notice: Stop of failed resource galera:1 is implicit after galera-bundle-1 is fenced
Apr 08 03:49:38 controller-1 pacemaker-schedulerd[27028] (LogNodeActions)       notice:  * Fence (reboot) galera-bundle-1 (resource: galera-bundle-podman-1) 'guest is unclean'

Which left the following failed action again

Apr 08 03:51:28 controller-0 pacemaker-controld  [26856] (process_lrm_event)    error: Result of monitor operation for galera-bundle-1 on controller-0: Error | call=82 key=galera-bundle-1_monitor_30000 confirmed=false status=4 cib-update=833
[root@controller-0 ~]# date; pcs status
Fri Apr  8 03:52:28 UTC 2022
Cluster name: tripleo_cluster
Cluster Summary:
  * Stack: corosync
  * Current DC: controller-1 (version 2.0.5-9.el8_4.3-ba59be7122) - partition with quorum
...
Full List of Resources:
...
  * Container bundle set: galera-bundle [cluster.common.tag/rhosp16-openstack-mariadb:pcmklatest]:
    * galera-bundle-0   (ocf::heartbeat:galera):     Master controller-2
    * galera-bundle-1   (ocf::heartbeat:galera):     Master controller-0
    * galera-bundle-2   (ocf::heartbeat:galera):     Master controller-1
...
Failed Resource Actions:
  * galera-bundle-1_monitor_30000 on controller-0 'error' (1): call=82, status='Error', exitreason='', last-rc-change='2022-04-08 03:51:28Z', queued=0ms, exec=0ms
...
