RHOSP: Replacing a controller node fails

Solution Verified

Environment

  • Red Hat OpenStack Platform 17.1

Issue

  • Replacing a controller node sometimes fails with the following messages:
2023-07-06 00:04:40.041332 | 525400ef-e043-9ecc-40fe-000000009544 |      FATAL | Run init bundle puppet on the host for haproxy | osp17-1r1-controller-0 | error={"changed": false, "cmd": "puppet apply  --detailed-exitcodes --summarize --color=false --modulepath '/etc/puppet/modules:/opt/stack/puppet-modules:/usr/share/openstack-puppet/modules' --tags 'pacemaker::resource::bundle,pacemaker::property,pacemaker::resource::ip,pacemaker::resource::ocf,pacemaker::constraint::order,pacemaker::constraint::colocation' -e 'include ::tripleo::profile::base::pacemaker; include ::tripleo::profile::pacemaker::haproxy_bundle'\n", "delta": "0:04:26.557015", "end": "2023-07-06 13:04:39.982311", "failed_when_result": true, "msg": "non-zero return code", "rc": 4, "start": "2023-07-06 13:00:13.425296", "stderr": "Warning: /etc/puppet/hiera.yaml: Use of 'hiera.yaml' version 3 is deprecated. It should be converted to version 5\n   (file: /etc/puppet/hiera.yaml)\nWarning: Undefined variable '::deploy_config_name'; \n   (file & line not available)\nWarning: The function 'hiera' is deprecated in favor of using 'lookup'. See https://puppet.com/docs/puppet/7.10/deprecated_language.html\n   (file & line not available)\nWarning: Unknown variable: '::deployment_type'. (file: /etc/puppet/modules/tripleo/manifests/fencing.pp, line: 124, column: 8)\nWarning: Scope(Haproxy::Config[haproxy]): haproxy: The $merge_options parameter will default to true in the next major release. Please review the documentation regarding the implications.\nDeprecation Warning: This command is deprecated and will be removed. Please use 'pcs property config' instead.\nDeprecation Warning: This command is deprecated and will be removed. Please use 'pcs property config' instead.\nError: pcs -f /var/lib/pacemaker/cib/puppet-cib-backup20230706-943108-rbsfio node attribute osp17-1r1-controller-2 haproxy-role=true failed: Error: unable to set attribute haproxy-role. 
Too many tries\nError: /Stage[main]/Tripleo::Profile::Pacemaker::Haproxy_bundle/Pacemaker::Property[haproxy-role-osp17-1r1-controller-2]/Pcmk_property[property-osp17-1r1-controller-2-haproxy-role]/ensure: change from 'absent' to 'present' failed: pcs -f /var/lib/pacemaker/cib/puppet-cib-backup20230706-943108-rbsfio node attribute osp17-1r1-controller-2 haproxy-role=true failed: Error: unable to set attribute haproxy-role. Too many tries\nDeprecation Warning: This command is deprecated and will be removed. Please use 'pcs constraint config' instead.\nDeprecation Warning: This command is deprecated and will be removed. Please use 'pcs constraint config' instead.\nDeprecation Warning: This command is deprecated and will be removed. Please use 'pcs constraint config' instead.\nDeprecation Warning: This command is deprecated and will be removed. Please use 'pcs constraint config' instead.\nDeprecation Warning: This command is deprecated and will be removed. Please use 'pcs constraint config' instead.\nDeprecation Warning: This command is deprecated and will be removed. Please use 'pcs constraint config' instead.\nWarning: /Stage[main]/Tripleo::Profile::Pacemaker::Haproxy_bundle/Pacemaker::Resource::Bundle[haproxy-bundle]/Pcmk_bundle[haproxy-bundle]: Skipping because of failed dependencies\nWarning: 

Resolution

Red Hat Enterprise Linux 8

  • The issue (Bugzilla bug 2225670) is resolved by erratum RHBA-2023:4966, in pacemaker-2.0.5-9.el8_4.7 or later for RHEL 8.4.z.
  • The issue (Bugzilla bug 2225669) is resolved by erratum RHBA-2023:4795, in pacemaker-2.1.2-4.el8_6.7 or later for RHEL 8.6.z.
  • The issue (Bugzilla bug 2225668) is resolved by erratum RHBA-2023:5261, in pacemaker-2.1.5-9.3.el8_8 or later for RHEL 8.8.z.
  • The issue (Bugzilla bug 2225631) is resolved by erratum RHEA-2023:6970, in pacemaker-2.1.6-8.el8 or later.

Red Hat Enterprise Linux 9

  • The issue (Bugzilla bug 2237465) is resolved by erratum RHBA-2023:5600, in pacemaker-2.1.2-4.el9_0.5 or later for RHEL 9.0.z.
  • The issue (Bugzilla bug 2225671) is resolved by erratum RHBA-2023:5090, in pacemaker-2.1.5-9.el9_2.3 or later for RHEL 9.2.z.
  • The issue (Bugzilla bug 2221084) is resolved by erratum RHEA-2023:6314, in pacemaker-2.1.6-9.el9 or later.
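
To see whether a controller already carries a fixed build, the installed pacemaker version-release can be compared with the minimum listed above for its RHEL stream. The following is a minimal sketch and not part of the original solution: version_at_least is a hypothetical helper name, and it relies on GNU sort -V, which only approximates RPM's own version comparison. Treat the result as a hint and compare within the same RHEL stream only.

```shell
# version_at_least INSTALLED MINIMUM
# Returns 0 when INSTALLED sorts at or above MINIMUM under `sort -V`.
# Assumption: `sort -V` ordering is close enough to RPM ordering for
# version-release strings taken from the same RHEL stream.
version_at_least() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Example use on a controller (not executed here):
#   installed=$(rpm -q --qf '%{VERSION}-%{RELEASE}' pacemaker)
#   version_at_least "$installed" "2.1.6-9.el9" && echo "fixed build installed"
```

The minimum passed as the second argument should be the package version from the bullet that matches the node's RHEL minor release.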

Workaround

Check the cluster status with the crm_node -l command.
If the removed node is reported in the "lost" state, run pcs cluster node clear to remove the node again.
Repeat these steps until the node no longer appears in the output of crm_node -l.

For example, when the replacement of controller3 has failed:

$ ssh tripleo-admin@$controller1 "sudo crm_node -l"
Warning: Permanently added 'xxx.xxx.xxx.xxx' (ED25519) to the list of known hosts.
1 controller1 member
2 controller2 member
3 controller3 lost
$ ssh tripleo-admin@$controller1 "sudo pcs cluster node clear controller3"
Warning: Permanently added 'xxx.xxx.xxx.xxx' (ED25519) to the list of known hosts.
$ ssh tripleo-admin@$controller1 "sudo crm_node -l"
Warning: Permanently added 'xxx.xxx.xxx.xxx' (ED25519) to the list of known hosts.
1 controller1 member
2 controller2 member
3 controller3 lost
...
$ ssh tripleo-admin@$controller1 "sudo pcs cluster node clear controller3"   # repeat several times
Warning: Permanently added 'xxx.xxx.xxx.xxx' (ED25519) to the list of known hosts.
$ ssh tripleo-admin@$controller1 "sudo crm_node -l"
Warning: Permanently added 'xxx.xxx.xxx.xxx' (ED25519) to the list of known hosts.
1 controller1 member
2 controller2 member                                                         # OK
$
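
The check-and-clear cycle above can be scripted. The following is a minimal sketch under stated assumptions, meant to run on a surviving controller: the function names (list_nodes, clear_node, node_is_listed, clear_until_gone), the default retry limit of 10, and the RETRY_DELAY variable are all hypothetical choices for illustration, not part of the original solution.

```shell
# Thin wrappers around the two commands used in the workaround.
list_nodes() { sudo crm_node -l; }
clear_node() { sudo pcs cluster node clear "$1"; }

# node_is_listed NAME
# Reads `crm_node -l` output ("<id> <name> <state>") on stdin and
# returns 0 if NAME appears in the name column, 1 otherwise.
node_is_listed() {
    awk -v n="$1" '$2 == n { found = 1 } END { exit !found }'
}

# clear_until_gone NAME [MAX_TRIES]
# Repeats `pcs cluster node clear NAME` until NAME no longer appears
# in `crm_node -l`, giving up after MAX_TRIES attempts (default 10).
clear_until_gone() {
    node=$1
    max_tries=${2:-10}
    try=0
    while list_nodes | node_is_listed "$node"; do
        try=$((try + 1))
        if [ "$try" -gt "$max_tries" ]; then
            echo "$node is still listed after $max_tries attempts" >&2
            return 1
        fi
        clear_node "$node"
        sleep "${RETRY_DELAY:-5}"   # pause between attempts (assumed value)
    done
    return 0
}
```

With these definitions loaded, clear_until_gone controller3 would mirror the manual steps shown in the example. The retry limit keeps the loop from running forever if the node never disappears.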

Root Cause

Pacemaker sometimes fails to remove a node when the pcs cluster node remove command is run.
In this case, the removed node is still listed, in the "lost" state.

$ ssh tripleo-admin@controller1 "sudo pcs cluster node remove controller3 --skip-offline --force"
Warning: Permanently added 'xxx.xxx.xxx.xxx' (ED25519) to the list of known hosts.
Warning: Omitting node 'controller3'
Warning: Unable to connect to controller3 (Failed to connect to controller3 port 2224: No route to host)
Warning: Unable to determine whether this action will cause a loss of the quorum
Destroying cluster on hosts: 'controller3'...
Warning: Unable to connect to controller3 (Failed to connect to controller3 port 2224: No route to host)
Warning: Removed node 'controller3' could not be reached and subsequently deconfigured. Run 'pcs cluster destroy' on the unreachable node.
Sending updated corosync.conf to nodes...
controller2: Succeeded
controller1: Succeeded
controller1: Corosync configuration reloaded
$ ssh tripleo-admin@$controller1 "sudo crm_node -l"
Warning: Permanently added 'xxx.xxx.xxx.xxx' (ED25519) to the list of known hosts.
1 controller1 member
2 controller2 member
3 controller3 lost

This is a bug in Pacemaker.

Diagnostic Steps

After removing the failed node with the pcs cluster node remove command, check whether the node has disappeared from the crm_node -l output. If it is still listed as "lost", as below, the issue is present.

$ ssh tripleo-admin@controller1 "sudo crm_node -l"
Warning: Permanently added 'xxx.xxx.xxx.xxx' (ED25519) to the list of known hosts.
1 controller1 member
2 controller2 member
3 controller3 lost
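
The same check can be done by filtering the crm_node -l output for entries in the "lost" state. This is a minimal sketch that assumes the three-column "<id> <name> <state>" layout shown above; lost_nodes is a hypothetical helper name.

```shell
# lost_nodes
# Reads `crm_node -l` output on stdin and prints the name of every
# node whose state column is "lost", one per line.
lost_nodes() {
    awk '$3 == "lost" { print $2 }'
}

# Example use (not executed here):
#   ssh tripleo-admin@controller1 "sudo crm_node -l" | lost_nodes
```

Empty output means that no removed node is left behind in the "lost" state.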

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
