Keepalived-monitor container keeps crashing due to readinessProbe failure
Environment
OCP 4.6 Bare Metal IPI
Note: vSphere IPI deploys the same components in the openshift-vsphere-infra namespace, so this might apply there as well, but it has not been tested and the issue has not been reproduced with OCP on vSphere.
Issue
- After upgrading OCP 4.6 between z-stream releases, the keepalived pods are running, but the keepalived-monitor container in the pods on worker nodes keeps crashing with readinessProbe failures.
- A similar issue happens after installation on Bare Metal using the IPI method.
Resolution
Since these are static pods started by the kubelet on startup, create a MachineConfig that overwrites the static pod file on the worker nodes and removes the readinessProbe section.
First gather the file from the current 00-worker MachineConfig, which is already in the encoded format and has the infrastructure VIPs already configured:
$ oc get mc/00-worker -o yaml | \
grep '/etc/kubernetes/manifests/keepalived.yaml' -B4 | \
grep "source: data:" | cut -d',' -f2-
Decode this output with a URL decode tool to get a more human-readable view.
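As an alternative to an online decoder, the extraction and decoding can be sketched locally in Python. This is a minimal sketch, not part of the original procedure: it assumes the MachineConfig was saved as JSON (for example with `oc get mc/00-worker -o json`) and that the file contents use the plain percent-encoded `data:,` form. The function name `extract_manifest` is hypothetical.

```python
import json
from urllib.parse import unquote

MANIFEST_PATH = "/etc/kubernetes/manifests/keepalived.yaml"


def extract_manifest(mc: dict, path: str = MANIFEST_PATH) -> str:
    """Return the decoded contents of `path` from a MachineConfig object."""
    for entry in mc["spec"]["config"]["storage"]["files"]:
        if entry.get("path") != path:
            continue
        source = entry["contents"]["source"]
        prefix = "data:,"
        # Assumption: only the percent-encoded "data:,<payload>" form is handled.
        if not source.startswith(prefix):
            raise ValueError("unsupported data URL form: " + source[:30])
        return unquote(source[len(prefix):])
    raise KeyError(path)
```

For example, after `oc get mc/00-worker -o json > 00-worker.json`, calling `extract_manifest(json.load(open("00-worker.json")))` returns the static pod manifest as readable YAML text.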
The attached file (keepalived-pod.yml) is a template showing what is extracted and how it should look after the changes. To get there, edit the decoded file and remove these lines from the keepalived-monitor container:
readinessProbe:
  httpGet:
    path: /readyz
    port: 6443
    scheme: HTTPS
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 1
  failureThreshold: 3
  timeoutSeconds: 10
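The removal can also be scripted instead of edited by hand. The sketch below is an illustration only: it works on indentation rather than parsing YAML, and it removes every `readinessProbe:` block it finds, so it assumes the manifest defines that probe only on the keepalived-monitor container. The function name `strip_yaml_block` is hypothetical, and the result should be reviewed before it is encoded.

```python
def strip_yaml_block(text: str, key: str = "readinessProbe") -> str:
    """Drop every `key:` mapping and the more-indented lines nested under it."""
    kept = []
    skip_indent = None  # indentation of the key whose block is being skipped
    for line in text.splitlines():
        stripped = line.lstrip(" ")
        indent = len(line) - len(stripped)
        if skip_indent is not None:
            # Blank lines and deeper-indented lines still belong to the block.
            if not stripped or indent > skip_indent:
                continue
            skip_indent = None
        if stripped.rstrip() == key + ":":
            skip_indent = indent
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"
```

Comparing the input and output with a diff tool is a quick way to confirm that only the probe lines were removed.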
Using the same tool in Encode mode, convert the edited file back into the URL-encoded format that the MachineConfig expects (an example of the encoded format is attached), then create the new MachineConfig template file:
$ cat << EOF > keepalived-pod_mc.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 02-keepalived-pod-worker-manifest
spec:
  config:
    ignition:
      config: {}
      security:
        tls: {}
      timeouts: {}
      version: 3.1.0
    networkd: {}
    passwd: {}
    storage:
      files:
      - contents:
          source: data:,<ENCODED_DATA_GOES_HERE>
        filesystem: root
        mode: 420
        overwrite: true
        path: /etc/kubernetes/manifests/keepalived.yaml
    systemd: {}
EOF
$ oc create -f keepalived-pod_mc.yaml
$ oc get mc
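The encoding step and the template above can also be produced in one pass. The following is a rough sketch under stated assumptions, not the method used in the original article: it percent-encodes the edited manifest with Python's `urllib.parse.quote` and builds a MachineConfig as a dictionary that keeps only a minimal subset of the fields shown in the template. The function name `build_machineconfig` is hypothetical.

```python
import json
from urllib.parse import quote


def build_machineconfig(manifest_text: str) -> dict:
    """Wrap the edited static pod manifest in a worker MachineConfig."""
    # Encode every reserved character so the payload is a valid data URL.
    source = "data:," + quote(manifest_text, safe="")
    return {
        "apiVersion": "machineconfiguration.openshift.io/v1",
        "kind": "MachineConfig",
        "metadata": {
            "name": "02-keepalived-pod-worker-manifest",
            "labels": {"machineconfiguration.openshift.io/role": "worker"},
        },
        "spec": {"config": {
            "ignition": {"version": "3.1.0"},
            "storage": {"files": [{
                "path": "/etc/kubernetes/manifests/keepalived.yaml",
                "mode": 420,  # decimal form of octal 0644
                "overwrite": True,
                "contents": {"source": source},
            }]},
        }},
    }
```

Writing `json.dumps(build_machineconfig(text), indent=2)` to a file gives a JSON manifest, which `oc create -f` accepts in the same way as YAML.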
The MCO will create a new rendered-worker-xxx MachineConfig and roll it out to the worker nodes, overwriting the current static pod file on each one. Once that process is done and the kubelet has restarted, check the pods to confirm that they started:
$ oc get pods -o wide -n openshift-kni-infra
$ oc get pod <keepalived-pod-on-worker> -o yaml -n openshift-kni-infra
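When checking many workers, reading the full pod YAML for each one gets tedious. As a small sketch, assuming the pod was fetched as JSON (`-o json` instead of `-o yaml`), the per-container readiness and restart counts can be summarized from the standard `status.containerStatuses` fields. The function name `container_report` is hypothetical.

```python
def container_report(pod: dict) -> dict:
    """Map each container name to (ready, restartCount) for one Pod object."""
    statuses = pod.get("status", {}).get("containerStatuses", [])
    return {s["name"]: (s["ready"], s["restartCount"]) for s in statuses}
```

After the fix, the keepalived-monitor entry would be expected to report ready with a restart count that no longer grows.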
Root Cause
Both containers of the keepalived pod start only on masters, because the readinessProbe on keepalived-monitor checks the /readyz path on localhost port 6443. This is the kube-apiserver port, which is available on the masters, and the probe is meant to avoid overlapping issues when querying the public API VIP that the keepalived container monitors. On workers nothing listens on this port, since those nodes only connect to the API through the api-int VIP, so the readinessProbe fails every time.
This seems to have been a mistake in the bootstrap or upgrade process, and it has already been reported in Bugzilla 1940594.
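The "connection refused" behavior described above can be illustrated with a plain TCP check. This is only a sketch of the first thing the probe needs, a listener on the port. It does not perform the HTTPS request to /readyz, and the function name `port_accepts_connections` is hypothetical.

```python
import socket


def port_accepts_connections(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Run on a node, `port_accepts_connections("localhost", 6443)` would be expected to succeed on a master and fail on a worker, which matches the probe results described in this section.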
Diagnostic Steps
In the openshift-kni-infra project, check the events and pods for messages similar to this one:
Readiness probe failed: Get "https://nodeIP:6443/readyz": dial tcp nodeIP:6443: connect: connection refused...
Similar events can also be seen on the streaming events in the web console.
$ oc get pods -o wide -n openshift-kni-infra
$ oc get events -n openshift-kni-infra
$ oc describe pod <keepalived-running-on-workers> -n openshift-kni-infra
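To pick the relevant messages out of a long event list, the events can be fetched as JSON (`oc get events -n openshift-kni-infra -o json`) and filtered. The sketch below assumes the keepalived static pod names start with `keepalived`. That prefix and the function name `readiness_failures` are assumptions made for illustration.

```python
def readiness_failures(event_list: dict, pod_prefix: str = "keepalived") -> list:
    """Return (pod name, message) pairs for readiness probe failures."""
    hits = []
    for event in event_list.get("items", []):
        obj = event.get("involvedObject", {})
        message = event.get("message", "")
        if (obj.get("kind") == "Pod"
                and obj.get("name", "").startswith(pod_prefix)
                and "Readiness probe failed" in message):
            hits.append((obj["name"], message))
    return hits
```

An empty result on the worker pods after the MachineConfig has rolled out suggests that the probe failures have stopped, although older events may remain in the list for a while.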
Confirm that the readinessProbe section is present in the file on the nodes, either by connecting over SSH and reading /etc/kubernetes/manifests/keepalived.yaml or by using the debug option:
$ oc debug node/<worker-node-name> -- chroot /host cat /etc/kubernetes/manifests/keepalived.yaml
Attachments