SREMachineHealthCheckRemediationRateHigh in Azure Red Hat OpenShift (ARO)
Environment
- Azure Red Hat OpenShift (ARO) 4
- OpenShift Container Platform
Issue
- Alertmanager alerts titled SREMachineHealthCheckRemediationRateHigh are firing in my Azure Red Hat OpenShift (ARO) cluster.
Resolution
- This alert fires when worker nodes become NotReady and the MachineHealthCheck has remediated them at least twice within a one-hour period. The MachineHealthCheck does not attempt to remediate master nodes.
- A node can become NotReady for many reasons. If this alert fires, first make sure your applications follow high-availability practices: run multiple replicas and configure affinity/anti-affinity rules so that individual cluster nodes are not strained.
- Check that worker nodes aren't overloaded by running oc adm top node, and consider scaling the cluster up to accommodate workload resource demands.
- Check that the cluster isn't stuck upgrading.
- Review the documentation regarding node management, limit ranges, and container management.
- To view the status of the MachineHealthCheck itself, use the following:
oc describe mhc -n openshift-machine-api aro-machinehealthcheck
- If the issue persists, open a support case for investigation.
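The checks in this list can be gathered into a single pass. The following is a minimal sketch, not a supported tool. The helper name not_ready_nodes is hypothetical, and it assumes the default column layout of oc get nodes (NAME first, STATUS second). The oc commands run only when the oc client is on the PATH, and they assume a logged-in session that is allowed to read nodes and the openshift-machine-api namespace.

```shell
#!/usr/bin/env bash

# Hypothetical helper: reads `oc get nodes` output on stdin and prints the
# names of nodes whose STATUS column does not start with "Ready".
not_ready_nodes() {
  awk 'NR > 1 && $2 !~ /^Ready/ { print $1 }'
}

# Run the cluster checks only when the oc client is available.
if command -v oc >/dev/null 2>&1; then
  oc get nodes | not_ready_nodes              # nodes that are currently NotReady
  oc adm top node                             # CPU and memory usage per node
  oc get clusterversion                       # shows whether an upgrade is progressing
  oc get machines -n openshift-machine-api    # machines backing the nodes
  oc describe mhc -n openshift-machine-api aro-machinehealthcheck
fi
```

An empty result from the first command means no node is NotReady at that moment. Because remediation replaces nodes, it can help to repeat the check while the alert is firing.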
Root Cause
The SREMachineHealthCheckRemediationRateHigh alert is configured by ARO SREs to maintain and monitor the health of the ARO cluster. This alert is meant to monitor the MachineHealthCheck functionality within the cluster. For more details regarding the MachineHealthCheck, see the official documentation.
This alert is deployed as a PrometheusRule in all ARO clusters as follows:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: mhc-remediation-alert
  namespace: openshift-machine-api
  labels:
    prometheus: k8s
    role: alert-rules
spec:
  groups:
  - name: sre-mhc-remediation-alert
    rules:
    - alert: SREMachineHealthCheckRemediationRateHigh
      expr: increase(mapi_machinehealthcheck_remediation_success_total[60m]) > 1
      annotations:
        message: worker nodes have been remediated 2 or more times in the last hour this may indicate an unstable workload running on the cluster
      labels:
        severity: warning
This alert indicates that the MachineHealthCheck has recreated a node 2 or more times in the past hour, which can point to a resource consumption problem or another issue inside the cluster. The MachineHealthCheck will not remediate when more than one worker node is in the NotReady state.
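To make the threshold concrete: the rule compares the growth of the remediation counter over the last 60 minutes against 1, so the condition holds once the counter has grown by 2 or more. The sketch below mimics that comparison for two counter readings. The function name would_fire and the sample values are hypothetical, and the sketch deliberately ignores the counter-reset handling and extrapolation that the real PromQL increase() function performs.

```shell
#!/usr/bin/env bash

# Hypothetical illustration of the alert condition "increase > 1".
# $1 = counter value one hour ago, $2 = counter value now.
# Returns success (exit 0) when the counter grew by more than 1.
would_fire() {
  awk -v start="$1" -v end="$2" 'BEGIN { exit !((end - start) > 1) }'
}

# One remediation in the hour: condition is not met.
if would_fire 3 4; then echo "firing"; else echo "not firing"; fi
# Two remediations in the hour: condition is met.
if would_fire 3 5; then echo "firing"; else echo "not firing"; fi
```

With these sample values the first call prints "not firing" and the second prints "firing", which matches the "2 or more times" wording in the alert message.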
For details on how to mitigate the alert, see the Resolution section above. Modifying or deleting this PrometheusRule or the MachineHealthCheck resource is not supported.
This document applies only to Azure Red Hat OpenShift clusters. It does not apply to other OpenShift clusters, such as clusters hosted in Azure that were installed using the IPI or UPI method, or to OpenShift Dedicated environments.