Node Health Check with Self-Remediation services keep restarting nodes when doing ETCD replacement creating many issues

Solution Verified - Updated 2025-02-05T15:19:02+00:00

Issue

When the Node Health Check (NHC) operator is installed and an NHC CR has been created, following the official documentation to replace a failed etcd node leaves the cluster unstable or degraded, because the self-remediation service force-reboots all nodes.

Pausing or deleting the NHC CR does not appear to help either, and the BMC on the affected nodes logs:

Wed Nov 20 2024 08:42:38 : The watchdog timer reset the system.
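
To check how often a node was reset by its watchdog, an exported BMC event log can be scanned for entries like the one above. The following is only a minimal sketch: it assumes every entry follows the exact format of the single sample line shown here, and the function name `watchdog_resets` is hypothetical. Real BMC log formats vary by vendor, so the pattern will likely need adjusting.

```python
import re
from datetime import datetime

# Assumed entry format, taken from the single sample line above:
#   "Wed Nov 20 2024 08:42:38 : The watchdog timer reset the system."
WATCHDOG_LINE = re.compile(
    r"^(?P<ts>\w{3} \w{3} +\d{1,2} \d{4} \d{2}:\d{2}:\d{2}) : "
    r"The watchdog timer reset the system\.$"
)


def watchdog_resets(lines):
    """Return the timestamps of all watchdog-reset entries in the given lines."""
    resets = []
    for line in lines:
        match = WATCHDOG_LINE.match(line.strip())
        if match:
            # Collapse repeated spaces (e.g. a padded day of month) before parsing.
            ts = " ".join(match.group("ts").split())
            resets.append(datetime.strptime(ts, "%a %b %d %Y %H:%M:%S"))
    return resets
```

Passing the lines of each node's exported log to `watchdog_resets` gives a list of reset times that can be compared against the time window of the etcd replacement.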

Environment

Red Hat OpenShift Container Platform 4.14+ IPI Bare Metal
Workload Availability for Red Hat OpenShift 23.x and 24.x

NOTE: Three-node clusters, with or without additional nodes, are the most affected, but the issue is not exclusive to them.
