Upgrading an OCP cluster to 4.6.37 or 4.7.18 and higher causes the network operator to be degraded or stuck progressing
Environment
- Red Hat OpenShift Container Platform (RHOCP) 4
- Azure Red Hat OpenShift (ARO) 4
- OpenShift Managed (Azure)
- Microsoft Azure
Issue
- Upgrading an ARO cluster or an OCP cluster in Azure to 4.6.37 or higher gets stuck on the network operator.
- Upgrading an ARO cluster or an OCP cluster in Azure to 4.7.18 or higher gets stuck on the network operator.
- The sdn pod is failing because of the drop-icmp container: containers with unready status: [drop-icmp]
- The drop-icmp containers in the sdn pods are failing with the following error: error: You must be logged in to the server (Unauthorized)
Resolution
A fix for a similar issue has already been rolled out for ARO 4.6 and ARO 4.7. For OCP clusters in Azure, the workaround from KCS 5252831 needs to be removed.
In some cases, after an oc login is executed on a node, the drop-icmp container within the sdn pod will try to use the generated kubeconfig file; if that file is not valid, the container will fail, leaving the network clusteroperator in a degraded state.
Check if a /root/.kube/config file exists on the failing node and remove it:
$ oc get nodes
[...]
$ oc debug node/[node_name] -- ls -ltrh /host/root/.kube/
-rw-------. 1 root root 709 Jul 12 06:42 config
$ oc debug node/[node_name] -- rm -rf /host/root/.kube/config
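If the affected sdn pod remains in CrashLoopBackOff after the file is removed, deleting it so that the DaemonSet recreates it and then re-checking the operator is a reasonable follow-up (not part of the original steps; sdn-yyyyy below is a placeholder for the failing pod name):
$ oc delete pod sdn-yyyyy -n openshift-sdn
$ oc get clusteroperator/network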
Root Cause
From OpenShift 4.6.37 and 4.7.18 onward, a fix is included that conflicts with a workaround applied to OCP clusters in Azure, and with a fix made independently in ARO several months earlier as a workaround for the same network issue, as referenced in KCS 5252831 and BZ 1979312. The fix for that issue has already been rolled out to ARO 4.6 and ARO 4.7 clusters.
With that fix, if an oc login was executed on a node, the generated kubeconfig file will be picked up by the drop-icmp container in the sdn pod; if that file is not valid, the container will fail.
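As an illustration of the mechanism, the following is a minimal sketch of the usual kubeconfig lookup order for oc, assuming the host's /root directory is visible inside the drop-icmp container (this ordering is standard client behavior, not taken from the sdn sources):
# Typical kubeconfig resolution order for oc/kubectl (sketch):
#   1. the file(s) pointed to by $KUBECONFIG, if set
#   2. $HOME/.kube/config - here, the stale file left by a manual `oc login` on the node
#   3. in-cluster service account credentials
# A stale file at step 2 shadows the valid in-cluster credentials, so the
# `oc observe` command in the drop-icmp container fails with "Unauthorized".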
Diagnostic Steps
Check if the network clusteroperator is stuck in a Progressing or Degraded state:
$ oc get clusteroperator/network
NAME      VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
network   4.6.36    True        True          True       70d
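The stuck network operator typically blocks the overall upgrade as well; this can be confirmed from the cluster version status (shown here as an additional check, output varies per cluster):
$ oc get clusterversion
$ oc adm upgrade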
Check the status of the network clusteroperator:
$ oc get co network -o yaml
[...]
status:
  conditions:
  - lastTransitionTime: '2021-08-01T01:01:59Z'
    message: 'DaemonSet "openshift-sdn/sdn" rollout is not making progress - pod sdn-xxxxx
      is in CrashLoopBackOff State
      DaemonSet "openshift-sdn/sdn" rollout is not making progress - last change 2021-08-01T01:00:24Z'
    reason: RolloutHung
    status: 'True'
    type: Degraded
[...]
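To show only the Degraded condition instead of the full status, a jq filter such as the following can be used (assuming jq is available on the workstation):
$ oc get co network -o json | jq '.status.conditions[] | select(.type=="Degraded")'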
Check the sdn pods; in some pods, only 2 of the 3 containers are running:
$ oc get pods -n openshift-sdn
NAME        READY   STATUS    RESTARTS   AGE
sdn-xxxxx   3/3     Running   0          1d
sdn-yyyyy   2/3     Running   236        1d
sdn-zzzzz   3/3     Running   0          1d
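To identify the node hosting the failing pod, which is needed later in the Resolution steps, the wide output can be used (sdn-yyyyy is a placeholder):
$ oc get pods -n openshift-sdn -o wide | grep sdn-yyyyy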
Check the pod status to see the failing container:
$ oc get pod sdn-yyyyy -n openshift-sdn -o json | jq -r '.status.conditions'
[...]
  {
    "lastProbeTime": null,
    "lastTransitionTime": "2021-08-03T08:31:32Z",
    "message": "containers with unready status: [drop-icmp]",
    "reason": "ContainersNotReady",
    "status": "False",
    "type": "Ready"
  },
  {
    "lastProbeTime": null,
    "lastTransitionTime": "2021-08-03T08:31:32Z",
    "message": "containers with unready status: [drop-icmp]",
    "reason": "ContainersNotReady",
    "status": "False",
    "type": "ContainersReady"
  },
[...]
Search for the following error message in the drop-icmp container logs of the sdn pods:
$ oc logs -n openshift-sdn -c drop-icmp sdn-yyyyy
[...]
2021-08-01T01:01:59.858068835Z + oc observe pods -n openshift-sdn -l app=sdn -a '{ .status.hostIP }' -- /var/run/add_iptables.sh
2021-08-01T01:01:59.975123335Z Flag --argument has been deprecated, and will be removed in a future release. Use --template instead.
2021-08-01T01:02:00.033301633Z error: You must be logged in to the server (Unauthorized)
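To check all sdn pods for the same error in one pass, a loop such as the following can be used (a sketch, not from the original article; adjust as needed):
$ for pod in $(oc get pods -n openshift-sdn -l app=sdn -o name); do echo "== $pod"; oc logs -n openshift-sdn -c drop-icmp "$pod" | grep -i unauthorized; done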
Check the kubeconfig used by the oc command in the drop-icmp container. It should be empty, similar to the following example, and should not show a message like Config loaded from file: /root/.kube/config:
$ oc rsh -n openshift-sdn -c drop-icmp pod/sdn-yyyyy oc config view -v 6
apiVersion: v1
clusters: null
contexts: null
current-context: ""
kind: Config
preferences: {}
users: null
Check if a /root/.kube/config file exists on the failing node:
$ oc get nodes
[...]
$ oc debug node/[node_name] -- ls -ltrh /host/root/.kube/
-rw-------. 1 root root 709 Jul 12 06:42 config
$ oc debug node/[node_name] -- cat /host/root/.kube/config
[...]
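To check every node at once for a leftover kubeconfig, a loop like the following can be used (a sketch; nodes without the file will show an ls error or no output):
$ for node in $(oc get nodes -o name); do echo "== $node"; oc debug "$node" -- ls -l /host/root/.kube/config 2>/dev/null; done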