HAProxy pods are crashing due to NodePort conflicts


Environment

Red Hat OpenShift Container Platform 4

  • 4.8
  • 4.9
  • 4.10

Issue

  • HAProxy pods are crash-looping: CrashLoopBackOff.
$ oc get pods -A | grep haproxy
openshift-kni-infra                                master-0                    1/2     CrashLoopBackOff   153        10m
openshift-kni-infra                                master-1                    1/2     CrashLoopBackOff   156        7m21s
openshift-kni-infra                                master-2                    1/2     CrashLoopBackOff   153        5m11s

The error appears because HAProxy is trying to use ports reserved for OpenShift NodePorts.
This can cause liveness probe errors:

$ oc get events -n openshift-kni-infra
121m Warning Unhealthy pod/haproxy-master-0 Liveness probe failed: Get "http://10.131.120.71:30936/haproxy_ready": dial tcp 10.131.120.71:30936: i/o timeout (Client.Timeout exceeded while awaiting headers)
121m Normal Pulled pod/haproxy-master-0 Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:08b511e4046223ca2198587d378cdca5a20e2a7390da8b298eb165e80dc54a1a" already present on machine
121m Normal Created pod/haproxy-master-0 Created container haproxy
121m Normal Started pod/haproxy-master-0 Started container haproxy
113m Warning ProbeError pod/haproxy-master-0 Liveness probe error: Get "http://10.131.120.72:30936/haproxy_ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
body:

Or errors about the HAProxy listening port for LB stats. Check the HAProxy logs to find binding errors:

[NOTICE] 173/133217 (9) : haproxy version is 2.2.13-5f3eb59
[NOTICE] 173/133217 (9) : path to executable is /usr/sbin/haproxy
[ALERT] 173/133217 (9) : Starting proxy stats: cannot bind socket [::1:30000]

In these two examples, 30000 and 30936 are ports inside the reserved NodePort range.

Resolution

Since these are static pods started by the kubelet at boot, create a MachineConfig that overwrites the static pod files on the master nodes to change the conflicting ports (the liveness probe port among them).
First, gather the files from the current 00-master MachineConfig, which already contains the HAProxy port configuration in URL-encoded form. The content of two different files is needed.
Get the content for '/etc/kubernetes/manifests/haproxy.yaml':

$> oc get mc/00-master -o yaml | \
    grep '/etc/kubernetes/manifests/haproxy.yaml' -B4 | \
    grep 'source: data:' | cut -d',' -f2

and get the content for '/etc/kubernetes/static-pod-resources/haproxy/haproxy.cfg.tmpl':

$> oc get mc/00-master -o yaml | \
    grep '/etc/kubernetes/static-pod-resources/haproxy/haproxy.cfg.tmpl' -B4 | \
    grep 'source: data:' | cut -d',' -f2

In both cases, decode the output into a more human-readable form with a URL-decode tool. In the decoded content, look for ports in the NodePort range.
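
If a web tool is not at hand, a Python one-liner does the same decoding locally (a sketch, reusing the extraction pipeline above):

$> oc get mc/00-master -o yaml | \
    grep '/etc/kubernetes/manifests/haproxy.yaml' -B4 | \
    grep 'source: data:' | cut -d',' -f2 | \
    python3 -c 'import sys,urllib.parse; print(urllib.parse.unquote(sys.stdin.read()))'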
Depending on the OCP version the conflicts can differ; the following changes fix the known ones:
- liveness probe port 30936 has to become 9444
- LB stats port 30000 has to become 29445

# in the haproxy.yaml 
    livenessProbe:
      initialDelaySeconds: 50
      httpGet:
        path: /haproxy_ready
        port:  30936  => 9444

# in the haproxy.cfg.tmpl
listen health_check_http_url
  bind :::30936 v4v6  => 9444
  mode http
  monitor-uri /haproxy_ready
  option dontlognull
....
...
listen stats
  bind localhost:{{ .LBConfig.StatPort }} => bind localhost:29445
  mode http
  stats enable
  stats hide-version

With the two manifests readable and the ports changed, encode each of them again, this time with a URL-encode tool.
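
The encoding can also be done locally; a sketch assuming the edited manifest was saved as haproxy.yaml (safe='' forces reserved characters such as ',' to be percent-encoded):

$> python3 -c 'import sys,urllib.parse; print(urllib.parse.quote(sys.stdin.read(), safe=""))' < haproxy.yaml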
Then substitute the re-encoded outputs into the proper sections of the following MachineConfig:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 99-master-haproxy-fix-nodeport-conflict
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - contents:
          source: data:,<ENCODED_DATA_FOR_haproxy.yaml>
        mode: 420
        overwrite: true
        path: /etc/kubernetes/manifests/haproxy.yaml
      - contents:
          source: data:,<ENCODED_DATA_FOR_haproxy.cfg.tmpl>
        mode: 420
        overwrite: true
        path: /etc/kubernetes/static-pod-resources/haproxy/haproxy.cfg.tmpl

Apply the MachineConfig; the Machine Config Operator will roll it out and reboot the master nodes one by one, restarting the HAProxy static pods.
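
A minimal sketch (the file name is simply whatever the manifest above was saved as):

$> oc apply -f 99-master-haproxy-fix-nodeport-conflict.yaml
$> oc get mcp master -w    # wait until the master pool reports UPDATED=True

Once the nodes are back up, check that the port has changed: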

$> oc -n openshift-kni-infra get pod haproxy-master-0.el8k-ztp-1.hpecloud.org -o jsonpath={.spec.containers[].livenessProbe.httpGet.port}                                                               
9444

Or:

$ oc debug node/master-0.el8k-ztp-1.hpecloud.org -- chroot /host sh -c "cat /etc/haproxy/haproxy.cfg | grep bind"
Starting pod/master-0el8k-ztp-1hpecloudorg-debug ...
To use host binaries, run `chroot /host`
  bind :::9445 v4v6
  bind :::9444 v4v6
  bind localhost:29445

Root Cause

HAProxy is using a port in the NodePort range:

oc -n openshift-kni-infra get pod haproxy-openshift-master-2.hub-virtual.lab -o jsonpath={.spec.containers[].livenessProbe}|jq .
{
  "failureThreshold": 3,
  "httpGet": {
    "path": "/haproxy_ready",
    "port": 30936,
    "scheme": "HTTP"
  },
  "initialDelaySeconds": 50,
  "periodSeconds": 10,
  "successThreshold": 1,
  "timeoutSeconds": 1
}

Or:

oc debug node/master-0.el8k-ztp-1.hpecloud.org -- chroot /host sh -c "cat /etc/haproxy/haproxy.cfg | grep bind"
Starting pod/master-0el8k-ztp-1hpecloudorg-debug ...
To use host binaries, run `chroot /host`
  bind :::9445 v4v6
  bind :::9444 v4v6
  bind localhost:30000

Get the configured NodePort range:

$ oc get configmaps -n openshift-kube-apiserver config \
  -o jsonpath="{.data['config\.yaml']}" | \
  grep -Eo '"service-node-port-range":\["[[:digit:]]+-[[:digit:]]+"\]'
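
On a default cluster this typically shows the standard Kubernetes range:

"service-node-port-range":["30000-32767"]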

OCP/Kubernetes is not aware that this port is in use, so the same port can be assigned to any Service that requests a NodePort. When that happens there is a port conflict, causing an infinite loop of restarts until the port becomes available again.
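
For illustration, a hypothetical Service like the following would be accepted by the API server and would collide with the probe port:

apiVersion: v1
kind: Service
metadata:
  name: example-svc            # hypothetical name
spec:
  type: NodePort
  selector:
    app: example
  ports:
  - port: 8080
    nodePort: 30936            # same port the HAProxy static pod binds on the host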

This issue is fixed in OCP 4.11 with this commit.

The bug is fixed in OCP 4.11 and backported to 4.8, 4.9, and 4.10, but different OCP releases can show different combinations of port conflicts, mainly around the liveness probes and the LB stats port.

Diagnostic Steps

Check the port on which the HAProxy liveness probe runs:

$> oc -n openshift-kni-infra get pod haproxy-master.<xyz>  -o jsonpath={.spec.containers[].livenessProbe.httpGet.port}                                                               
30936

Check the events in the openshift-kni-infra namespace:

$> oc -n openshift-kni-infra get events
        ...(more)...
121m Warning Unhealthy pod/haproxy-master-0 Liveness probe failed: Get "http://10.131.120.71:30936/haproxy_ready": dial tcp 10.131.120.71:30936: i/o timeout (Client.Timeout exceeded while awaiting headers)
121m Normal Pulled pod/haproxy-master-0 Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:08b511e4046223ca2198587d378cdca5a20e2a7390da8b298eb165e80dc54a1a" already present on machine
121m Normal Created pod/haproxy-master-0 Created container haproxy
121m Normal Started pod/haproxy-master-0 Started container haproxy
113m Warning ProbeError pod/haproxy-master-0 Liveness probe error: Get "http://10.131.120.72:30936/haproxy_ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
body:
    ...(more)...

Or maybe the stats bind port is in conflict. Get the logs from a failing HAProxy pod:
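A sketch, assuming the haproxy container name shown in the events above (--previous dumps the last crashed instance):

$> oc -n openshift-kni-infra logs haproxy-master-0 -c haproxy --previous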

[NOTICE] 173/133217 (9) : haproxy version is 2.2.13-5f3eb59
[NOTICE] 173/133217 (9) : path to executable is /usr/sbin/haproxy
[ALERT] 173/133217 (9) : Starting proxy stats: cannot bind socket [::1:30000]

Finally, oc get svc -A | grep <CONFLICT_PORT> returns the Service that requested the same NodePort and is causing the conflict, as in the hypothetical example below.
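
For instance (hypothetical output; names and addresses will differ):

$> oc get svc -A | grep 30936
example-ns    example-svc    NodePort    172.30.78.12    <none>    8080:30936/TCP    3d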
