Upgrading an OCP cluster to 4.6.37 or 4.7.18 and higher causes the network operator to be degraded or stuck progressing


Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 4
  • Azure Red Hat OpenShift (ARO)
    • 4
  • OpenShift Managed (Azure)
  • Microsoft Azure

Issue

  • Upgrading an ARO cluster or an OCP cluster in Azure to 4.6.37 or higher gets stuck on the network operator.
  • Upgrading an ARO cluster or an OCP cluster in Azure to 4.7.18 or higher gets stuck on the network operator.
  • The sdn pod is not ready because of the drop-icmp container:

    containers with unready status: [drop-icmp]
    
  • The drop-icmp containers in the sdn pods are failing with the following error:

    error: You must be logged in to the server (Unauthorized)
    

Resolution

A fix for a similar issue was already rolled out for ARO 4.6 and ARO 4.7. For OCP clusters in Azure, the workaround from KCS 5252831 needs to be removed.

In some cases, after an oc login has been executed on a node, the drop-icmp container within the sdn pod tries to use the generated kubeconfig file. If that file is not valid, the container fails, causing the network clusteroperator to report a degraded state.

Check if a /root/.kube/config file exists on the failing node and remove it:

$ oc get nodes
[...]

$ oc debug node/[node_name] -- ls -ltrh /host/root/.kube/
-rw-------. 1 root root 709 Jul 12 06:42 config

$ oc debug node/[node_name] -- rm -rf /host/root/.kube/config
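
If more than one node is affected, the same check and cleanup can be scripted across all nodes. This is a minimal sketch, assuming cluster-admin access and that deleting the leftover kubeconfig is acceptable on every node where it is found:

$ for node in $(oc get nodes -o jsonpath='{.items[*].metadata.name}'); do
    echo "== ${node}"
    oc debug node/"${node}" -- ls /host/root/.kube/config 2>/dev/null \
      && oc debug node/"${node}" -- rm -f /host/root/.kube/config
  done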

Root Cause

From OpenShift 4.6.37 and 4.7.18 onward, a fix is included that conflicts with the workaround for OCP clusters in Azure, and with a fix applied independently in ARO several months earlier as a workaround for the same network issue, as referenced in KCS 5252831 and BZ 1979312. The fix for that issue was already rolled out for ARO 4.6 and ARO 4.7 clusters.

With that fix, if an oc login was executed on a node, the generated kubeconfig file is used by the drop-icmp container in the sdn pod, and if it is not valid, the container fails.
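
For illustration, this is typically how such a file ends up on a node: running oc login as the root user directly on the node (for example from an SSH session) writes the credentials to /root/.kube/config by default. The API URL and user below are placeholders:

# oc login https://api.<cluster_domain>:6443 -u <user>
# ls /root/.kube/config
/root/.kube/config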

Diagnostic Steps

Check if the network clusteroperator is stuck in progressing or degraded state:

$ oc get clusteroperator/network
NAME      VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
network   4.6.36    True        True          True       70d
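
To confirm that the cluster upgrade itself is blocked on the network operator, the overall ClusterVersion status can also be checked. The output below is only illustrative:

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.36    True        True          70d     Working towards 4.6.37 [...]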

Check the status of the network clusteroperator:

$ oc get co network -o yaml
[...]
status:
  conditions:
  - lastTransitionTime: '2021-08-01T01:01:59Z'
    message: 'DaemonSet "openshift-sdn/sdn" rollout is not making progress - pod sdn-xxxxx
      is in CrashLoopBackOff State

      DaemonSet "openshift-sdn/sdn" rollout is not making progress - last change 2021-08-01T01:00:24Z'
    reason: RolloutHung
    status: 'True'
    type: Degraded
[...]
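
The same Degraded condition can also be extracted directly with jq, similar to the checks used later in this article:

$ oc get co network -o json | jq -r '.status.conditions[] | select(.type=="Degraded")'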

Check the sdn pods; in some pods only 2 of the 3 containers are running:

$ oc get pods -n openshift-sdn
NAME                  READY  STATUS   RESTARTS  AGE
sdn-xxxxx             3/3    Running  0         1d
sdn-yyyyy             2/3    Running  236       1d
sdn-zzzzz             3/3    Running  0         1d
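
On larger clusters it can help to list only the sdn pods that are not fully ready. A minimal sketch using jq, matching on the pod Ready condition:

$ oc get pods -n openshift-sdn -l app=sdn -o json \
    | jq -r '.items[] | select(.status.conditions[] | select(.type=="Ready" and .status=="False")) | .metadata.name'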

Check the pod status to see the failing container:

$ oc get pod sdn-yyyyy -n openshift-sdn -o json | jq -r '.status.conditions'
[...]
  {
    "lastProbeTime": null,
    "lastTransitionTime": "2021-08-03T08:31:32Z",
    "message": "containers with unready status: [drop-icmp]",
    "reason": "ContainersNotReady",
    "status": "False",
    "type": "Ready"
  },
  {
    "lastProbeTime": null,
    "lastTransitionTime": "2021-08-03T08:31:32Z",
    "message": "containers with unready status: [drop-icmp]",
    "reason": "ContainersNotReady",
    "status": "False",
    "type": "ContainersReady"
  },
[...]

Search for the following error message in the drop-icmp container of the failing sdn pod:

$ oc logs -n openshift-sdn -c drop-icmp sdn-yyyyy
[...]
2021-08-01T01:01:59.858068835Z + oc observe pods -n openshift-sdn -l app=sdn -a '{ .status.hostIP }' -- /var/run/add_iptables.sh
2021-08-01T01:01:59.975123335Z Flag --argument has been deprecated, and will be removed in a future release. Use --template instead.
2021-08-01T01:02:00.033301633Z error: You must be logged in to the server (Unauthorized)

Check the kubeconfig used by the oc command inside the drop-icmp container. It should be empty, similar to the following example, and should not show a message such as Config loaded from file: /root/.kube/config:

$ oc rsh -n openshift-sdn -c drop-icmp pod/sdn-yyyyy oc config view -v 6
apiVersion: v1
clusters: null
contexts: null
current-context: ""
kind: Config
preferences: {}
users: null
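
To identify the node where a failing pod is scheduled, which is needed for the next step, the node name can be read from the pod spec (sdn-yyyyy is the example pod name used above):

$ oc get pod sdn-yyyyy -n openshift-sdn -o jsonpath='{.spec.nodeName}{"\n"}'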

Check if a /root/.kube/config file exists on the failing node:

$ oc get nodes
[...]

$ oc debug node/[node_name] -- ls -ltrh /host/root/.kube/
-rw-------. 1 root root 709 Jul 12 06:42 config

$ oc debug node/[node_name] -- cat /host/root/.kube/config
[...]
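
If it is not clear which nodes are affected, all nodes can be swept with a read-only variant of the loop shown in the Resolution section:

$ for node in $(oc get nodes -o jsonpath='{.items[*].metadata.name}'); do
    echo "== ${node}"
    oc debug node/"${node}" -- ls -l /host/root/.kube/config 2>/dev/null
  done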

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
