HAProxy pods are crashing due to NodePort conflicts
Environment
Red Hat OpenShift Container Platform 4
- 4.8
- 4.9
- 4.10
Issue
- HAProxy pods are stuck in a crash loop: CrashLoopBackOff.
$ oc get pods -A | grep haproxy
openshift-kni-infra   haproxy-master-0   1/2   CrashLoopBackOff   153   10m
openshift-kni-infra   haproxy-master-1   1/2   CrashLoopBackOff   156   7m21s
openshift-kni-infra   haproxy-master-2   1/2   CrashLoopBackOff   153   5m11s
The error appears because HAProxy is trying to use ports reserved for OpenShift NodePorts.
This can cause failures of the liveness probes:
$ oc get events -n openshift-kni-infra
121m Warning Unhealthy pod/haproxy-master-0 Liveness probe failed: Get "http://10.131.120.71:30936/haproxy_ready": dial tcp 10.131.120.71:30936: i/o timeout (Client.Timeout exceeded while awaiting headers)
121m Normal Pulled pod/haproxy-master-0 Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:08b511e4046223ca2198587d378cdca5a20e2a7390da8b298eb165e80dc54a1a" already present on machine
121m Normal Created pod/haproxy-master-0 Created container haproxy
121m Normal Started pod/haproxy-master-0 Started container haproxy
113m Warning ProbeError pod/haproxy-master-0 Liveness probe error: Get "http://10.131.120.72:30936/haproxy_ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
body:
Or failures binding the HAProxy LBStats listening port. Check the HAProxy logs to find the binding errors:
[NOTICE] 173/133217 (9) : haproxy version is 2.2.13-5f3eb59
[NOTICE] 173/133217 (9) : path to executable is /usr/sbin/haproxy
[ALERT] 173/133217 (9) : Starting proxy stats: cannot bind socket [::1:30000]
In these two examples, both 30000 and 30936 fall inside the reserved NodePort range.
Resolution
Since these are static pods started by the kubelet on boot, create a MachineConfig that overwrites the static pod files on the master nodes to change the conflicting ports.
First, gather the file contents from the current 00-master MachineConfig, where they are stored URL-encoded and contain the port configuration for HAProxy. The content of two different files is needed.
Get the content for '/etc/kubernetes/manifests/haproxy.yaml':
$> oc get mc/00-master -o yaml | \
grep '/etc/kubernetes/manifests/haproxy.yaml' -B4 | \
grep 'source: data:' | cut -d',' -f2
and get the content for '/etc/kubernetes/static-pod-resources/haproxy/haproxy.cfg.tmpl':
$> oc get mc/00-master -o yaml | \
grep '/etc/kubernetes/static-pod-resources/haproxy/haproxy.cfg.tmpl' -B4 | \
grep 'source: data:' | cut -d',' -f2
In both cases, URL-decode the output to get a more human-readable view, for example with a web-based URL decode tool. In the decoded content, look for ports in the NodePort range.
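Alternatively, the decoding can be done locally. A minimal sketch with python3 (the output file names are just examples, and are reused in the encoding step below):
$> oc get mc/00-master -o yaml | \
   grep '/etc/kubernetes/manifests/haproxy.yaml' -B4 | \
   grep 'source: data:' | cut -d',' -f2 | \
   python3 -c 'import sys, urllib.parse; print(urllib.parse.unquote(sys.stdin.read()))' > haproxy.yaml
$> oc get mc/00-master -o yaml | \
   grep '/etc/kubernetes/static-pod-resources/haproxy/haproxy.cfg.tmpl' -B4 | \
   grep 'source: data:' | cut -d',' -f2 | \
   python3 -c 'import sys, urllib.parse; print(urllib.parse.unquote(sys.stdin.read()))' > haproxy.cfg.tmpl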
Depending on the OCP version the conflicts can be different; the following changes fix the known conflicts:
- The 30936 port for the liveness probes has to become 9444
- The 30000 port for LBStats has to become 29445
# in the haproxy.yaml
livenessProbe:
  initialDelaySeconds: 50
  httpGet:
    path: /haproxy_ready
    port: 30936 => 9444
# in the haproxy.cfg.tmpl
listen health_check_http_url
  bind :::30936 v4v6 => 9444
  mode http
  monitor-uri /haproxy_ready
  option dontlognull
....
...
listen stats
  bind localhost:{{ .LBConfig.StatPort }} => bind localhost:29445
  mode http
  stats enable
  stats hide-version
With the two manifests readable and the ports changed, URL-encode each of them separately, this time with a URL encode tool.
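As with the decoding, this can also be done locally. A minimal sketch with python3, assuming the edited files were saved as haproxy.yaml and haproxy.cfg.tmpl:
$> python3 -c 'import sys, urllib.parse; print(urllib.parse.quote(sys.stdin.read()))' < haproxy.yaml
$> python3 -c 'import sys, urllib.parse; print(urllib.parse.quote(sys.stdin.read()))' < haproxy.cfg.tmpl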
Then place the newly encoded content into the following MachineConfig, substituting each encoded output in the proper section:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 99-master-haproxy-fix-nodeport-conflict
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - contents:
          source: data:,<ENCODED_DATA_FOR_haproxy.yaml>
        mode: 420
        overwrite: true
        path: /etc/kubernetes/manifests/haproxy.yaml
      - contents:
          source: data:,<ENCODED_DATA_FOR_haproxy.cfg.tmpl>
        mode: 420
        overwrite: true
        path: /etc/kubernetes/static-pod-resources/haproxy/haproxy.cfg.tmpl
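Apply the MachineConfig; for example, assuming the manifest above was saved as 99-master-haproxy-fix-nodeport-conflict.yaml:
$> oc apply -f 99-master-haproxy-fix-nodeport-conflict.yaml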
The MachineConfig rollout reboots the master nodes and restarts the HAProxy pods. After a while, check that the port has changed:
$> oc -n openshift-kni-infra get pod haproxy-master-0.el8k-ztp-1.hpecloud.org -o jsonpath={.spec.containers[].livenessProbe.httpGet.port}
9444
Or:
$ oc debug node/master-0.el8k-ztp-1.hpecloud.org -- chroot /host sh -c "cat /etc/haproxy/haproxy.cfg | grep bind"
Starting pod/master-0el8k-ztp-1hpecloudorg-debug ...
To use host binaries, run `chroot /host`
bind :::9445 v4v6
bind :::9444 v4v6
bind localhost:29445
Root Cause
HAProxy is using a port in the NodePort range:
oc -n openshift-kni-infra get pod haproxy-openshift-master-2.hub-virtual.lab -o jsonpath={.spec.containers[].livenessProbe}|jq .
{
"failureThreshold": 3,
"httpGet": {
"path": "/haproxy_ready",
"port": 30936,
"scheme": "HTTP"
},
"initialDelaySeconds": 50,
"periodSeconds": 10,
"successThreshold": 1,
"timeoutSeconds": 1
}
Or:
oc debug node/master-0.el8k-ztp-1.hpecloud.org -- chroot /host sh -c "cat /etc/haproxy/haproxy.cfg | grep bind"
Starting pod/master-0el8k-ztp-1hpecloudorg-debug ...
To use host binaries, run `chroot /host`
bind :::9445 v4v6
bind :::9444 v4v6
bind localhost:30000
Get the configured NodePort range:
$ oc get configmaps -n openshift-kube-apiserver config \
-o jsonpath="{.data['config\.yaml']}" | \
grep -Eo '"service-node-port-range":\["[[:digit:]]+-[[:digit:]]+"\]'
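On a cluster with the default configuration this returns:
"service-node-port-range":["30000-32767"]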
OCP/Kubernetes is not aware that this port is already in use. Therefore, the same port can be assigned to any SVC that asks for a NodePort. When this happens, there is a port conflict, causing an infinite loop of restarts until the port becomes available again.
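As an illustration only (do not run this against an affected cluster), a hypothetical Service can even request the conflicting port explicitly, and the API server will grant it because it keeps no record of HAProxy's binding:
$ oc create service nodeport conflicting-svc --tcp=8080 --node-port=30936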
This issue is fixed in OCP 4.11 with this commit, and the fix is backported to 4.8, 4.9 and 4.10. However, there are different port conflict combinations in the different OCP releases, mainly around the liveness probes and the LBStats port.
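Since the affected ports depend on the release, it can help to confirm which version the cluster is running:
$ oc get clusterversion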
Diagnostic Steps
Check the port used by the HAProxy liveness probe:
$> oc -n openshift-kni-infra get pod haproxy-master.<xyz> -o jsonpath={.spec.containers[].livenessProbe.httpGet.port}
30936
Check the events in the openshift-kni-infra namespace:
$> oc -n openshift-kni-infra get events
...(more)...
121m Warning Unhealthy pod/haproxy-master-0 Liveness probe failed: Get "http://10.131.120.71:30936/haproxy_ready": dial tcp 10.131.120.71:30936: i/o timeout (Client.Timeout exceeded while awaiting headers)
121m Normal Pulled pod/haproxy-master-0 Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:08b511e4046223ca2198587d378cdca5a20e2a7390da8b298eb165e80dc54a1a" already present on machine
121m Normal Created pod/haproxy-master-0 Created container haproxy
121m Normal Started pod/haproxy-master-0 Started container haproxy
113m Warning ProbeError pod/haproxy-master-0 Liveness probe error: Get "http://10.131.120.72:30936/haproxy_ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
body:
...(more)...
Or maybe the stats binding port is in conflict. Get the logs from a failing HAProxy pod:
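For example, with the pod and container names shown in the events above:
$ oc -n openshift-kni-infra logs haproxy-master-0 -c haproxy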
[NOTICE] 173/133217 (9) : haproxy version is 2.2.13-5f3eb59
[NOTICE] 173/133217 (9) : path to executable is /usr/sbin/haproxy
[ALERT] 173/133217 (9) : Starting proxy stats: cannot bind socket [::1:30000]
Finally, oc get svc -A | grep <CONFLICT_PORT> returns the Service that is using the same port and creating the conflict.
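For example, with 30936 as the conflicting port (the namespace and Service name here are hypothetical):
$ oc get svc -A | grep 30936
example-ns   example-app   NodePort   172.30.45.12   <none>   8080:30936/TCP   3d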