Worker node fails to join an ARO cluster, stays NotReady, and is deleted; the cycle repeats with every new node
Environment
- Red Hat OpenShift Container Platform (RHOCP) 4
- Azure Red Hat OpenShift (ARO) 4
- Machine Health Check (MHC)
- DNS forwarding
- EgressNetworkPolicies
Issue
- When a new node is created in ARO, either manually or automatically by the Cluster Autoscaler, it becomes NotReady. After some time it is deleted by the Machine Health Check (MHC) and a new one is created, and the same behavior repeats every time.
- The following event is shown on the NotReady node before it is deleted:
  KubeletNotReady container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
- The following message is shown in the SDN pod of the failing node:
  Error adding EgressNetworkPolicy DNSName rule: IP address not found for domain "xxxxxxxxx": read udp xxxxxxxxxxxxxxxxxxxxxxxxx
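Because the MHC is what removes the NotReady node, it can help to check how long it tolerates an unhealthy node before acting. The following is a minimal bash sketch, assuming the MachineHealthCheck objects live in the `openshift-machine-api` namespace; the function name `show_mhc_timeouts` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (assumes MHCs are in openshift-machine-api): print each
# MachineHealthCheck with its nodeStartupTimeout and its unhealthyConditions
# (type=status:timeout), which determine when an unhealthy node's Machine is deleted.
show_mhc_timeouts() {
  printf 'NAME\tNODE_STARTUP_TIMEOUT\tUNHEALTHY_CONDITIONS\n'
  oc get machinehealthcheck -n openshift-machine-api \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeStartupTimeout}{"\t"}{range .spec.unhealthyConditions[*]}{.type}={.status}:{.timeout}{" "}{end}{"\n"}{end}'
}

# Usage: show_mhc_timeouts
```

Comparing these timeouts with how long the SDN pod takes to process all EgressNetworkPolicy DNSName rules may explain why the node is replaced before it ever becomes Ready.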
Resolution
Check whether any EgressNetworkPolicies contain URLs (dnsName rules) that are only resolvable through the DNS forwarding configuration, and remove them. The URLs used in EgressNetworkPolicies must be reachable directly from the nodes, not only via DNS forwarding.
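The check above can be partly scripted. The following is a minimal bash sketch, not an official tool: it lists every `dnsName` rule in the EgressNetworkPolicies and flags those that fall inside a zone forwarded by the cluster DNS operator (`dns.operator/default`, field `spec.servers[].zones`). The function names are hypothetical, and the sketch assumes the DNS forwarding is configured in the DNS operator rather than somewhere else (for example only on custom DNS servers of the Azure VNet).

```shell
#!/usr/bin/env bash
# Sketch (hypothetical helpers, bash 4+, logged-in `oc` assumed): flag
# EgressNetworkPolicy dnsName rules that fall inside a DNS-operator forwarded zone.

# Print "namespace<TAB>policy<TAB>dnsName" for every dnsName rule.
list_egress_dns_names() {
  oc get egressnetworkpolicy -A -o go-template='{{range .items}}{{$ns := .metadata.namespace}}{{$p := .metadata.name}}{{range .spec.egress}}{{if .to.dnsName}}{{$ns}}{{"\t"}}{{$p}}{{"\t"}}{{.to.dnsName}}{{"\n"}}{{end}}{{end}}{{end}}'
}

# Print one forwarded zone per line from the DNS operator configuration.
list_forwarded_zones() {
  oc get dns.operator/default -o go-template='{{range .spec.servers}}{{range .zones}}{{.}}{{"\n"}}{{end}}{{end}}'
}

# Succeed if $1 equals, or is a subdomain of, any zone in $2..$n (case-insensitive).
in_forwarded_zone() {
  local name="${1,,}" zone
  name="${name%.}"
  shift
  for zone in "$@"; do
    zone="${zone,,}"
    zone="${zone%.}"
    if [[ "$name" == "$zone" || "$name" == *".$zone" ]]; then
      return 0
    fi
  done
  return 1
}

# Report every dnsName rule whose name lies in a forwarded zone.
report_forwarded_dns_rules() {
  local -a zones
  local ns policy name
  mapfile -t zones < <(list_forwarded_zones)
  while IFS=$'\t' read -r ns policy name; do
    if in_forwarded_zone "$name" "${zones[@]}"; then
      printf '%s/%s: dnsName %s is in a forwarded zone\n' "$ns" "$policy" "$name"
    fi
  done < <(list_egress_dns_names)
}

# Usage: report_forwarded_dns_rules
```

Each flagged entry is a candidate for removal from its EgressNetworkPolicy, or for making the name resolvable directly from the nodes.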
Root Cause
Some URLs configured in the EgressNetworkPolicies are only reachable via DNS forwarding. The SDN pods use the node network, not the configured DNS forwarding, so the SDN pod fails to add the EgressNetworkPolicy DNSName rule. Each failure takes several seconds, and when there are several EgressNetworkPolicies, the Machine Health Check (MHC) ends up deleting the node. A new node is then created, which goes through the same procedure and fails in the same way.
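To confirm this root cause, one can test whether the node's own resolver, which is what the node-network SDN pods depend on, can resolve the names used in the policies. A minimal bash sketch, assuming `oc debug node/<node>` can start a debug pod on the chosen node and that the debug pod's status messages are written to stderr. Using a healthy worker is usually easier than a node that is about to be deleted, assuming all workers share the same upstream DNS configuration. `check_node_resolution` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: resolve each name with the node's host resolver
# (chroot /host getent hosts), bypassing the cluster DNS and its forwarding.
check_node_resolution() {
  local node="$1" name answer
  shift
  for name in "$@"; do
    # Assumption: debug pod status messages go to stderr; getent prints matches on stdout.
    answer="$(oc debug "node/${node}" -- chroot /host getent hosts "$name" 2>/dev/null)"
    if [[ -n "$answer" ]]; then
      printf '%s: resolvable from node %s\n' "$name" "$node"
    else
      printf '%s: NOT resolvable from node %s\n' "$name" "$node"
    fi
  done
}

# Usage: check_node_resolution <node-name> <dnsName> [<dnsName>...]
```

A name that resolves through the cluster DNS but is reported as not resolvable here matches the failure described above.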
Diagnostic Steps
Check the SDN pod running on the affected node for errors similar to "Error adding EgressNetworkPolicy DNSName rule: IP address not found for domain":
$ oc get nodes | grep NotReady
[...]
$ oc get pods -n openshift-sdn -o wide
[...]
$ oc logs [sdn-pod_name_for_failing_node] -n openshift-sdn | grep "Error adding EgressNetworkPolicy DNSName rule"
Error adding EgressNetworkPolicy DNSName rule: IP address not found for domain "xxxxxxxxx": read udp xxxxxxxxxxxxxxxxxxxxxxxxx
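The three commands above can be combined into one loop over the NotReady nodes. A minimal bash sketch; it assumes the SDN pods carry the label `app=sdn` and that their main container is named `sdn`, both of which should be verified on the cluster (for example with `oc get pods -n openshift-sdn --show-labels`). The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: find NotReady nodes and grep the logs of the SDN pod on each.

# Print the names of nodes whose STATUS column contains NotReady.
notready_nodes() {
  oc get nodes --no-headers | awk '$2 ~ /NotReady/ {print $1}'
}

# Print the SDN pod (as pod/<name>) scheduled on node $1.
# Assumption: SDN pods are labelled app=sdn.
sdn_pod_for_node() {
  oc get pods -n openshift-sdn -l app=sdn --field-selector "spec.nodeName=$1" -o name
}

scan_sdn_logs() {
  local node pod
  for node in $(notready_nodes); do
    pod="$(sdn_pod_for_node "$node")"
    if [[ -z "$pod" ]]; then
      printf '%s: no SDN pod found\n' "$node"
      continue
    fi
    printf '== %s (%s) ==\n' "$node" "$pod"
    # Assumption: the SDN container is named "sdn".
    oc logs -n openshift-sdn "$pod" -c sdn | grep "Error adding EgressNetworkPolicy DNSName rule"
  done
}

# Usage: scan_sdn_logs
```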
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.