Worker node cannot join an ARO cluster, becomes NotReady, and is deleted; the process then repeats with new nodes


Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 4
  • Azure Red Hat OpenShift (ARO)
    • 4
  • Machine Health Check (MHC)
  • DNS forwarding
  • EgressNetworkPolicies

Issue

  • When a new node is created in ARO, either manually or automatically by the Cluster Autoscaler, it becomes NotReady. After some time, the Machine Health Check deletes it and a new node is created. The same behavior repeats indefinitely.
  • The following event is shown for the NotReady node before it is deleted:

    KubeletNotReady              container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
    
  • The following message is shown in the logs of the SDN pod on the failing node:

    Error adding EgressNetworkPolicy DNSName rule: IP address not found for domain "xxxxxxxxx": read udp xxxxxxxxxxxxxxxxxxxxxxxxx
    

Resolution

Check whether any EgressNetworkPolicies reference URLs (DNS names) that are only reachable via the DNS forwarding configuration, and remove them. Those URLs must be directly accessible from the nodes, not only via DNS forwarding.
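As a starting point for that check, the DNS names referenced by each EgressNetworkPolicy can be listed across all namespaces. The following is a minimal sketch, not a verified procedure: the function names `list_egress_dns_names` and `only_with_dns_names` are hypothetical helpers introduced here, and the sketch assumes that the DNS-based rules are stored under `spec.egress[].to.dnsName` and that `oc` is logged in to the affected cluster.

```shell
# Hypothetical helper: prints "namespace<TAB>policy<TAB>dnsNames" for every
# EgressNetworkPolicy in the cluster. Assumes DNS-based rules are stored in
# spec.egress[].to.dnsName; rules without a dnsName contribute nothing.
list_egress_dns_names() {
  oc get egressnetworkpolicy --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.egress[*].to.dnsName}{"\n"}{end}'
}

# Hypothetical helper: reads the tab-separated lines produced above on stdin
# and keeps only the policies that have at least one dnsName rule.
only_with_dns_names() {
  awk -F '\t' 'NF >= 3 && $3 != ""'
}
```

Running `list_egress_dns_names | only_with_dns_names` would then show the policies to review. Which of the listed names are only reachable via DNS forwarding still has to be decided by someone who knows the DNS forwarding configuration.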

Root Cause

Some URLs configured in the EgressNetworkPolicies are only reachable via DNS forwarding. Because the SDN pods use the node network and not the configured DNS forwarding, the SDN pod fails to add the EgressNetworkPolicy DNSName rule. Each failure takes several seconds, and if there are several EgressNetworkPolicies, the Machine Health Check (MHC) ends up deleting the node. A new node is then created, which goes through the same procedure and fails in the same way.

Diagnostic Steps

Check the logs of the SDN pod on the affected node for errors similar to "Error adding EgressNetworkPolicy DNSName rule: IP address not found for domain":

$ oc get nodes | grep NotReady
[...]
$ oc get pods -n openshift-sdn -o wide
[...]
$ oc logs [sdn-pod_name_for_failing_node] -n openshift-sdn | grep "Error adding EgressNetworkPolicy DNSName rule"
  Error adding EgressNetworkPolicy DNSName rule: IP address not found for domain "xxxxxxxxx": read udp xxxxxxxxxxxxxxxxxxxxxxxxx
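The output of the commands above can be reduced to just the node names and the failing domains. The following is a minimal sketch under stated assumptions: `notready_nodes` and `extract_failed_domains` are hypothetical helper names introduced here, the first assumes the default column layout of `oc get nodes` (NAME in column 1, STATUS in column 2), and the second assumes the log message has the format shown above.

```shell
# Hypothetical helper: reads `oc get nodes` output on stdin and prints the
# names of nodes whose STATUS column contains NotReady (header line skipped).
notready_nodes() {
  awk 'NR > 1 && $2 ~ /NotReady/ { print $1 }'
}

# Hypothetical helper: reads SDN pod log text on stdin and prints each
# distinct domain named in a failed DNSName rule, one per line.
extract_failed_domains() {
  grep 'Error adding EgressNetworkPolicy DNSName rule' \
    | sed -n 's/.*IP address not found for domain "\([^"]*\)".*/\1/p' \
    | sort -u
}
```

For example, `oc get nodes | notready_nodes` lists the affected nodes, and `oc logs [sdn-pod_name_for_failing_node] -n openshift-sdn | extract_failed_domains` lists the domains that the SDN pod could not resolve. Those domains are the candidates to compare against the EgressNetworkPolicies.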

