Chapter 24. Clustering

Pacemaker correctly interprets systemd responses and systemd services are stopped in proper order at cluster shutdown

Previously, when a Pacemaker cluster was configured with systemd resources and the cluster was stopped, Pacemaker could mistakenly assume that a systemd service had stopped before it had actually done so. As a consequence, services could be stopped out of order, potentially leading to stop failures. With this update, Pacemaker correctly interprets systemd responses, and systemd services are stopped in the proper order at cluster shutdown. (BZ#1286316)

Pacemaker now distinguishes transient failures from fatal failures when loading systemd units

Previously, Pacemaker treated all errors loading a systemd unit as fatal. As a consequence, Pacemaker would not start a systemd resource on a node where it could not load the systemd unit, even if the load failed due to transient conditions such as CPU load. With this update, Pacemaker now distinguishes transient failures from fatal failures when loading systemd units. Logs and cluster status now show more appropriate messages, and the resource can start on the node once the transient error clears. (BZ#1346726)
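The distinction described above can be illustrated with a minimal sketch. It is not Pacemaker's implementation: the error names in TRANSIENT_ERRORS and the functions classify_load_error() and node_is_eligible() are hypothetical and exist only to show the idea that a transient load failure should not permanently rule out a node.

```python
from enum import Enum


class LoadResult(Enum):
    LOADED = "loaded"
    TRANSIENT = "transient"
    FATAL = "fatal"


# Hypothetical error names for illustration only; these are not
# actual systemd or Pacemaker identifiers.
TRANSIENT_ERRORS = {"timeout", "busy"}


def classify_load_error(error):
    """Map a unit-load outcome to LOADED, TRANSIENT, or FATAL."""
    if error is None:
        return LoadResult.LOADED
    if error in TRANSIENT_ERRORS:
        return LoadResult.TRANSIENT
    return LoadResult.FATAL


def node_is_eligible(error):
    """A node stays eligible for the resource unless the failure is fatal."""
    return classify_load_error(error) is not LoadResult.FATAL
```

With the old behavior, every non-empty error would have mapped to FATAL; in this sketch only errors outside the transient set do.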

Pacemaker now removes node attributes from its memory when purging a node that has been removed from the cluster

Previously, Pacemaker's node attribute manager removed attribute values from its memory but not the attributes themselves when purging a node that had been removed from the cluster. As a result, if a new node was later added to the cluster with the same node ID, attributes that existed on the original node could not be set for the new node. With this update, Pacemaker now purges the attributes themselves when removing a node and a new node with the same ID encounters no problems with setting attributes. (BZ#1338623)

Pacemaker now correctly determines expected results for resources that are in a group or depend on a clone

Previously, when restarting a service, Pacemaker's crm_resource tool (and thus the pcs resource restart command) could fail to properly determine when affected resources successfully started. As a result, the command could fail to restart a resource that is a member of a group, or the command could hang indefinitely if the restarted resource depended on a cloned resource that moved to another node. With this update, the command now properly determines expected results for resources that are in a group or depend on a clone. The desired service is restarted, and the command returns. (BZ#1337688)

Fencing now occurs when DLM requires it, even when the cluster itself does not

Previously, DLM could require fencing due to quorum issues, even when the cluster itself did not require fencing, but would be unable to initiate it. As a consequence, DLM and DLM-based services could hang waiting for fencing that never happened. With this fix, the ocf:pacemaker:controld resource agent now checks whether DLM is in this state, and requests fencing if so. Fencing now occurs in this situation, allowing DLM to recover. (BZ#1268313)

The DLM now detects and reports connection problems

Previously, the Distributed Lock Manager (DLM) used for cluster communications expected TCP/IP packet delivery and waited for responses indefinitely. As a consequence, if a DLM connection was lost, there was no notification of the problem. With this update, the DLM detects and reports when cluster communications are lost. As a result, DLM communication problems can be identified, and cluster nodes that become unresponsive can be restarted once the problems are resolved. (BZ#1267339)

High Availability instances created by non-admin users are now evacuated when a compute instance is turned off

Previously, the fence_compute agent searched only for compute instances created by admin users. As a consequence, instances created by non-admin users were not evacuated when a compute instance was turned off. This update makes sure that fence_compute searches for instances run as any user, and compute instances are evacuated to new compute nodes as expected. (BZ#1313561)

Starting the nfsserver resource no longer fails

Previously, the nfs-idmapd service failed to start when the var-lib-nfs-rpc_pipefs.mount process was active, which it is by default. Consequently, starting the nfsserver resource failed. With this update, var-lib-nfs-rpc_pipefs.mount stops in this situation and does not prevent nfs-idmapd from starting. As a result, nfsserver starts as expected. (BZ#1325453)

lrmd logs errors as expected and no longer crashes

Previously, Pacemaker's Local Resource Management Daemon (lrmd) used an invalid format string when logging certain rare systemd errors. As a consequence, lrmd could terminate unexpectedly with a segmentation fault. A patch has been applied to fix the format string. As a result, lrmd no longer crashes and logs the aforementioned rare error messages as intended. (BZ#1284069)

stonithd now properly distinguishes attribute removals from device removals

Prior to this update, if a user deleted an attribute from a fence device, Pacemaker's stonithd service sometimes mistakenly removed the entire device. Consequently, the cluster would no longer use the fence device. The underlying source code has been modified to fix this bug, and stonithd now properly distinguishes attribute removals from device removals. As a result, deleting a fence device attribute no longer removes the device itself. (BZ#1287315)

HealthCPU now correctly measures CPU usage

Previously, the ocf:pacemaker:HealthCPU resource parsed the output of the top command incorrectly on Red Hat Enterprise Linux 7. As a consequence, the HealthCPU resource did not work. With this update, the resource agent correctly parses the output of later versions of top. As a result, HealthCPU now correctly measures CPU usage. (BZ#1287868)
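The parsing problem can be sketched as follows. The actual resource agent is a shell script and its code is not reproduced here; this Python sketch only assumes that the two layouts of the CPU summary line in top differ in where the percent sign appears, as in the sample lines in the comments. The function names idle_percent() and used_percent() are hypothetical.

```python
import re

# Assumed layouts of the CPU summary line (illustrative samples):
#   older top:  "Cpu(s):  1.0%us,  0.5%sy,  0.0%ni, 98.5%id,  0.0%wa"
#   later top:  "%Cpu(s):  1.0 us,  0.5 sy,  0.0 ni, 98.5 id,  0.0 wa"
# The pattern accepts an optional "%" between the number and "id".
IDLE_RE = re.compile(r"Cpu\(s\):.*?([\d.]+)\s*%?\s*id")


def idle_percent(top_output):
    """Return the idle CPU percentage from top output, or None if absent."""
    for line in top_output.splitlines():
        match = IDLE_RE.search(line)
        if match:
            return float(match.group(1))
    return None


def used_percent(top_output):
    """Return CPU usage as 100 minus the idle percentage, or None."""
    idle = idle_percent(top_output)
    return None if idle is None else 100.0 - idle
```

A parser written only for the older layout would find no idle value in the later one, which matches the symptom described above of the resource not working.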

Pacemaker now checks all collected files when stripping sensitive information

Pacemaker can strip sensitive information that matches a given pattern when submitting system information with bug reports, whether directly by Pacemaker's crm_report tool or indirectly via sosreport. However, Pacemaker previously checked only certain collected files, not log file extracts. Because of this, sensitive information could remain in log file extracts. With this fix, Pacemaker now checks all collected files when stripping sensitive information, so matching information no longer remains in the report. (BZ#1219188)
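The difference between checking some files and checking all of them can be sketched as below. This is not the crm_report implementation: the functions sanitize_text() and sanitize_tree(), the name=value matching rule, and the mask string are assumptions made for illustration.

```python
import re
from pathlib import Path


def sanitize_text(text, key_pattern, mask="****"):
    """Mask the value of every name=value pair whose name matches key_pattern."""
    regex = re.compile(r"(\b(?:%s)\s*=\s*)\S+" % key_pattern)
    # A function replacement avoids interpreting the mask as a backreference.
    return regex.sub(lambda m: m.group(1) + mask, text)


def sanitize_tree(directory, key_pattern, mask="****"):
    """Sanitize every regular file under directory, log extracts included.

    Returns the list of files that were changed.
    """
    changed = []
    for path in sorted(Path(directory).rglob("*")):
        if not path.is_file():
            continue
        original = path.read_text(errors="replace")
        cleaned = sanitize_text(original, key_pattern, mask)
        if cleaned != original:
            path.write_text(cleaned)
            changed.append(path)
    return changed
```

Walking the whole directory tree, rather than a fixed list of file names, is what ensures that log file extracts are covered in this sketch.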

The corosync memory footprint no longer increases on every node rejoin

Previously, when a node rejoined the cluster, some buffers in corosync were not freed, and memory consumption grew with each rejoin. With this fix, no memory is leaked and the memory footprint no longer increases on every node rejoin. (BZ#1306349)

Corosync starts correctly when configured to use IPv4 and DNS is set to return both IPv4 and IPv6 addresses

Previously, when a pcs-generated corosync.conf file used host names instead of IP addresses with Internet Protocol version 4 (IPv4), and the DNS server was set to return both IPv4 and IPv6 addresses, corosync failed to start. With this fix, when Corosync is configured to use IPv4, it uses only IPv4 addresses. As a result, corosync starts as expected in the described circumstances. (BZ#1289169)
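The principle behind the fix, restricting name resolution to the configured address family, can be sketched with Python's standard socket module. This is an illustration only, not Corosync's code; the function resolve() and its ip_version and resolver parameters are hypothetical names chosen for the example.

```python
import socket


def resolve(hostname, ip_version="ipv4", resolver=socket.getaddrinfo):
    """Return only the addresses of the configured family for hostname.

    The resolver parameter is injectable so the function can be
    exercised without a live DNS server.
    """
    family = socket.AF_INET if ip_version == "ipv4" else socket.AF_INET6
    # Passing the family to the resolver means records of the other
    # family are never returned, even if DNS holds both.
    infos = resolver(hostname, None, family, socket.SOCK_DGRAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string.
    return [info[4][0] for info in infos]
```

Resolving with an unspecified family and taking the first result is the kind of approach that can pick an IPv6 address on a dual-stack DNS setup; requesting the family explicitly avoids that.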

The corosync-cmapctl utility correctly handles errors in the print_key() function

Previously, the corosync-cmapctl utility did not handle corosync errors in the print_key() function correctly. Consequently, corosync-cmapctl could enter an infinite loop if the corosync utility was killed. The provided fix makes sure all errors returned when Corosync exits are handled correctly. As a result, corosync-cmapctl leaves the loop and displays a relevant error message in this scenario. (BZ#1336462)