Upgrading an OCP cluster to 4.6.37 or 4.7.18 and higher causes the network operator to be degraded or stuck progressing
Environment
- Red Hat OpenShift Container Platform (RHOCP) 4
- Azure Red Hat OpenShift (ARO) 4
- OpenShift Managed (Azure)
- Microsoft Azure
Issue
- Upgrading an ARO cluster or an OCP cluster in Azure to 4.6.37 or higher gets stuck on the network operator.
- Upgrading an ARO cluster or an OCP cluster in Azure to 4.7.18 or higher gets stuck on the network operator.
- The sdn pod is failing because of the drop-icmp container: containers with unready status: [drop-icmp]
- The drop-icmp containers in the sdn pods are failing with the following error: error: You must be logged in to the server (Unauthorized)
Resolution
A fix for a similar issue has already been rolled out for ARO 4.6 and ARO 4.7. For OCP clusters in Azure, the workaround from KCS 5252831 must be removed.
In some cases, after an oc login is executed on a node, the drop-icmp container within the sdn pod will try to use the generated kubeconfig file, and if it is not valid, the container will fail, causing the network clusteroperator to be in a degraded state.
Check whether a /root/.kube/config file exists on the failing node and remove it:
$ oc get nodes
[...]
$ oc debug node/[node_name] -- ls -ltrh /host/root/.kube/
-rw-------. 1 root root 709 Jul 12 06:42 config
$ oc debug node/[node_name] -- rm -rf /host/root/.kube/config
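The per-node check and removal above can be combined into a single sweep over all nodes. This is a sketch, not part of the original solution; the function name is hypothetical and it only automates the manual steps shown above:

```shell
# Sketch: remove a leftover /root/.kube/config from every node where it
# exists, so the drop-icmp container falls back to in-cluster credentials.
# The function name is hypothetical.
remove_kubeconfig_leftovers() {
  for node in $(oc get nodes -o name); do
    # The node filesystem is mounted under /host inside the debug pod.
    if oc debug "${node}" -- ls /host/root/.kube/config >/dev/null 2>&1; then
      echo "removing /root/.kube/config on ${node}"
      oc debug "${node}" -- rm -f /host/root/.kube/config
    fi
  done
}
```

After the file is removed, the failing sdn pod should become ready again once its drop-icmp container restarts.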
Root Cause
OpenShift 4.6.37 and 4.7.18 introduced a fix that conflicts with the workaround applied to OCP clusters in Azure, and with a fix rolled out independently in ARO several months earlier, for the same network issue referenced in KCS 5252831 and BZ 1979312. The fix for that issue has already been rolled out to ARO 4.6 and ARO 4.7 clusters.
With that fix, if an oc login was executed on a node, a kubeconfig file is generated and will be used by the drop-icmp container in the sdn pod; if it is not valid, the container will fail.
Diagnostic Steps
Check if the network clusteroperator is stuck in a progressing or degraded state:
$ oc get clusteroperator/network
NAME      VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
network   4.6.36    True        True        True       70d
Check the status of the network clusteroperator:
$ oc get co network -o yaml
[...]
status:
  conditions:
  - lastTransitionTime: '2021-08-01T01:01:59Z'
    message: 'DaemonSet "openshift-sdn/sdn" rollout is not making progress - pod sdn-xxxxx
      is in CrashLoopBackOff State
      DaemonSet "openshift-sdn/sdn" rollout is not making progress - last change 2021-08-01T01:00:24Z'
    reason: RolloutHung
    status: 'True'
    type: Degraded
[...]
Check the sdn pods; in some pods only 2 of the 3 containers are ready:
$ oc get pods -n openshift-sdn
NAME        READY   STATUS    RESTARTS   AGE
sdn-xxxxx   3/3     Running   0          1d
sdn-yyyyy   2/3     Running   236        1d
sdn-zzzzz   3/3     Running   0          1d
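When there are many sdn pods, the ones that are not fully ready can be picked out with a short awk filter. The snippet below runs against sample output mimicking the listing above (the pod names are placeholders); on a live cluster, pipe the real oc get pods -n openshift-sdn output into the same awk command instead:

```shell
# Sample `oc get pods -n openshift-sdn` output; pod names are placeholders.
sample='NAME        READY   STATUS    RESTARTS   AGE
sdn-xxxxx   3/3     Running   0          1d
sdn-yyyyy   2/3     Running   236        1d
sdn-zzzzz   3/3     Running   0          1d'

# Print pods whose READY column (ready/total) is not complete.
echo "$sample" | awk 'NR > 1 { split($2, r, "/"); if (r[1] != r[2]) print $1 }'
# -> sdn-yyyyy
```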
Check the pod status to see the failing container:
$ oc get pod sdn-yyyyy -n openshift-sdn -o json | jq -r '.status.conditions'
[...]
  {
    "lastProbeTime": null,
    "lastTransitionTime": "2021-08-03T08:31:32Z",
    "message": "containers with unready status: [drop-icmp]",
    "reason": "ContainersNotReady",
    "status": "False",
    "type": "Ready"
  },
  {
    "lastProbeTime": null,
    "lastTransitionTime": "2021-08-03T08:31:32Z",
    "message": "containers with unready status: [drop-icmp]",
    "reason": "ContainersNotReady",
    "status": "False",
    "type": "ContainersReady"
  },
[...]
Search for the following error message in the drop-icmp containers of the sdn pods:
$ oc logs -n openshift-sdn -c drop-icmp sdn-yyyyy
[...]
2021-08-01T01:01:59.858068835Z + oc observe pods -n openshift-sdn -l app=sdn -a '{ .status.hostIP }' -- /var/run/add_iptables.sh
2021-08-01T01:01:59.975123335Z Flag --argument has been deprecated, and will be removed in a future release. Use --template instead.
2021-08-01T01:02:00.033301633Z error: You must be logged in to the server (Unauthorized)
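To find every affected pod at once, the same log check can be looped over all sdn pods. This is a sketch with a hypothetical function name; it assumes the sdn pods carry the app=sdn label, the same selector used by the drop-icmp script in the log above:

```shell
# Sketch: report every sdn pod whose drop-icmp container logs the
# "Unauthorized" error. The function name is hypothetical.
find_unauthorized_sdn_pods() {
  for pod in $(oc get pods -n openshift-sdn -l app=sdn -o name); do
    if oc logs -n openshift-sdn -c drop-icmp "${pod}" 2>/dev/null \
        | grep -q 'You must be logged in to the server'; then
      echo "${pod}"
    fi
  done
}
```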
Check the config used by the oc command in the drop-icmp container. It should be empty, similar to the following example, and should not show a message like Config loaded from file: /root/.kube/config:
$ oc rsh -n openshift-sdn -c drop-icmp pod/sdn-yyyyy oc config view -v 6
apiVersion: v1
clusters: null
contexts: null
current-context: ""
kind: Config
preferences: {}
users: null
Check if a /root/.kube/config file exists on the failing node:
$ oc get nodes
[...]
$ oc debug node/[node_name] -- ls -ltrh /host/root/.kube/
-rw-------. 1 root root 709 Jul 12 06:42 config
$ oc debug node/[node_name] -- cat /host/root/.kube/config
[...]
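Rather than checking nodes one by one, all of them can be swept with a read-only loop. This is a sketch with a hypothetical function name; it only repeats the per-node check above for every node and removes nothing:

```shell
# Sketch: list every node that still has a /root/.kube/config file.
# Read-only; nothing is removed here. The function name is hypothetical.
nodes_with_kubeconfig() {
  for node in $(oc get nodes -o name); do
    # The node filesystem is mounted under /host inside the debug pod.
    if oc debug "${node}" -- ls /host/root/.kube/config >/dev/null 2>&1; then
      echo "${node}"
    fi
  done
}
```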
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.