SREMachineHealthCheckRemediationRateHigh in Azure Red Hat OpenShift (ARO)

Solution Verified - Updated -

Environment

  • Azure Red Hat OpenShift (ARO) 4
  • OpenShift Container Platform

Issue

  • Alertmanager alerts titled SREMachineHealthCheckRemediationRateHigh are firing in my Azure Red Hat OpenShift (ARO) cluster.

Resolution

  • This alert fires when worker nodes become NotReady and the MachineHealthCheck has remediated them at least twice within a one-hour period. The MachineHealthCheck does not attempt to remediate master nodes.
  • A node can become NotReady for many reasons. When this alert fires, first confirm that your applications follow high-availability practices: run multiple replicas and configure affinity/anti-affinity rules so that load is not concentrated on individual cluster nodes.
  • Check that worker nodes aren't overloaded by running oc adm top node, and consider scaling the cluster up to accommodate workload resource demands.
  • Check that the cluster isn't stuck in an upgrade.
  • Review the documentation regarding node management, limit ranges, and container management.
  • To view the status of the MachineHealthCheck itself, use the following:
oc describe mhc -n openshift-machine-api aro-machinehealthcheck
  • If the issue persists, open a support case for investigation.
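
The read-only checks in the list above can be collected into a short triage pass. The following is a minimal sketch, not an official procedure: the function names not_ready_nodes and mhc_triage are hypothetical, and the sketch assumes a logged-in oc session with permission to read nodes, the cluster version, and the openshift-machine-api namespace.

```shell
#!/usr/bin/env bash
# Triage sketch (hypothetical helper names; read-only commands only).

# Print the names of nodes whose STATUS column starts with NotReady.
# Expects `oc get nodes --no-headers` output on stdin, where the columns
# are NAME STATUS ROLES AGE VERSION.
not_ready_nodes() {
  awk '$2 ~ /^NotReady/ { print $1 }'
}

# Run the checks from the Resolution list in order.
mhc_triage() {
  # Which nodes are currently NotReady?
  oc get nodes --no-headers | not_ready_nodes
  # Are worker nodes overloaded?
  oc adm top node
  # Is an upgrade in progress or stuck?
  oc get clusterversion
  # Which machines back the nodes, and were any recreated recently?
  oc get machines -n openshift-machine-api
  # What does the MachineHealthCheck itself report?
  oc describe mhc -n openshift-machine-api aro-machinehealthcheck
}
```

After sourcing the file, run mhc_triage. A machine whose age is much lower than its peers in the oc get machines output is a likely candidate for a recent remediation.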

Root Cause

The SREMachineHealthCheckRemediationRateHigh alert is configured by ARO SREs to maintain and monitor the health of the ARO cluster. This alert is meant to monitor the MachineHealthCheck functionality within the cluster. For more details regarding the MachineHealthCheck, see the official documentation.

This alert is deployed as a PrometheusRule in all ARO clusters as follows:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: mhc-remediation-alert
  namespace: openshift-machine-api
  labels:
    prometheus: k8s
    role: alert-rules
spec:
  groups:
  - name: sre-mhc-remediation-alert
    rules:
    - alert: SREMachineHealthCheckRemediationRateHigh
      expr: increase(mapi_machinehealthcheck_remediation_success_total [60m]) > 1
      annotations:
        message: worker nodes have been remediated 2 or more times in the last hour this may indicate an unstable workload running on the cluster
      labels:
        severity: warning

This alert indicates that the MachineHealthCheck has recreated worker nodes 2 or more times in the past hour, which can point to excessive resource consumption or another issue inside the cluster. The MachineHealthCheck will not remediate when more than one worker node is in the NotReady state.
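
To compare what is deployed in a cluster with the rule and the remediation limit described above, both resources can be read back without changing them. This is a sketch under assumptions: the function names are hypothetical, and it assumes the remediation limit is expressed through the standard spec.maxUnhealthy field of the MachineHealthCheck resource, which the article does not state explicitly.

```shell
#!/usr/bin/env bash
# Read-only inspection sketch (hypothetical helper names).

# Print the MachineHealthCheck's limit on unhealthy machines.
# Assumption: the limit is stored in spec.maxUnhealthy.
mhc_max_unhealthy() {
  oc get mhc aro-machinehealthcheck -n openshift-machine-api -o jsonpath='{.spec.maxUnhealthy}'
}

# Print the deployed alert rule, using the name and namespace shown above.
mhc_alert_rule() {
  oc get prometheusrule mhc-remediation-alert -n openshift-machine-api -o yaml
}
```

Both functions only read cluster state, which keeps them within the supported boundary: the resources can be inspected but should not be modified or deleted.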

For details on how to mitigate the alert, see the Resolution section above. Modifying or deleting this PrometheusRule or the MachineHealthCheck resource is not supported.

This document applies only to Azure Red Hat OpenShift clusters. It does not apply to OpenShift clusters hosted in Azure using an IPI or UPI installation method, or to OpenShift Dedicated environments.

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
