Upgrade process fails in OpenShift Container Platform 4 because the Ingress Operator is degraded: CanaryChecksRepetitiveFailures


Environment

  • OpenShift Container Platform 4.x

Issue

  • The upgrade process for OCP 4.x gets stuck because the Ingress Operator is in a degraded state, with the following errors appearing in the Ingress Operator pod logs:
2021-03-17T10:01:26.023229058Z 2021-03-17T10:01:26.023Z    ERROR    operator.canary_controller    wait/wait.go:155    error performing canary route check    {"error": "error sending canary HTTP Request: Timeout: Get \"http://canary-openshift-ingress-canary.apps.<domain>\": dial tcp $IP:80: i/o timeout (Client.Timeout exceeded while awaiting headers)"}
2021-03-17T10:01:28.984572747Z 2021-03-17T10:01:28.984Z    INFO    operator.ingress_controller    controller/controller.go:244    reconciling    {"request": "openshift-ingress-operator/default"}
  • The upgrade process for OCP 4.x gets stuck because the Ingress Operator is in a degraded state, with the following error appearing in the ingress ClusterOperator status:
ingresscontroller "default" is degraded: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)

Resolution

  1. Test whether each canary pod can curl itself on port 8080. Every curl request should return a 200 status code; if the status code is not 200, the pod is not yet operational.

    # for i in `oc get po  -n openshift-ingress-canary | grep -vi name | awk '{print $1}' ` ; do echo -e "\t---- $i ----" ; oc exec -n openshift-ingress-canary $i  --  curl -s -D - http://localhost:8080/ ; done
    
    • If the issue started recently, wait a couple of minutes (one or two) and check again.
    • If the issue persists, restart the affected pod (see the sketch below).
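    A minimal sketch for restarting the canary pods; this assumes the ingress-canary DaemonSet (referenced in step 2) recreates deleted pods automatically:

    # for i in `oc get po -n openshift-ingress-canary | grep -vi name | awk '{print $1}'` ; do oc delete po -n openshift-ingress-canary $i ; done
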
  2. Check whether the pods can curl the service endpoint. Every curl request should return a 200 status code; if the status code is not 200, the service IP is not able to route incoming traffic.

    # SVC_IP=$(oc get svc -n openshift-ingress-canary -ojsonpath={..clusterIP})
    # for i in `oc get po -n openshift-ingress-operator | grep -v NAME| awk '{print $1}' ` ; do oc exec -n openshift-ingress-operator -c ingress-operator $i -- curl http://${SVC_IP}:8080 -s -D - ; done
    
    • This service is managed by the ingress-canary DaemonSet in the openshift-ingress-canary namespace. If the status code is not 200, restart the cluster-version-operator pod and then that DaemonSet (see the sketch below).
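    A sketch of those restarts; <cluster-version-operator-pod> is a placeholder for the pod name shown by the first command, and ingress-canary is the DaemonSet name mentioned above:

    # oc get pods -n openshift-cluster-version
    # oc delete pod -n openshift-cluster-version <cluster-version-operator-pod>
    # oc -n openshift-ingress-canary rollout restart daemonset/ingress-canary
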
  3. Check the connection to the canary route. The curl request should return a 302 response code.

    # ROUTE=$(oc get route -n openshift-ingress-canary -ojsonpath={..host})
    # for i in `oc get po -n openshift-ingress-operator | grep -v NAME| awk '{print $1}' ` ; do oc exec -n openshift-ingress-operator -c ingress-operator $i -- curl http://${ROUTE} -sS -k -D - ; done
    
    • Check whether the load balancer is routing traffic to the router pods.
    • If the route is not present or the status code is not 302, restart the cluster-version-operator pod and then the router pods in the openshift-ingress namespace (see the sketch below).
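    A sketch of those restarts, assuming the default IngressController deployment in openshift-ingress is named router-default (as the pod listing in step 5 suggests):

    # oc delete pod -n openshift-cluster-version <cluster-version-operator-pod>
    # oc -n openshift-ingress rollout restart deployment/router-default
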
  4. Confirm that DNS resolution is working properly from within the openshift-ingress-operator pod:

    # ROUTE=$(oc get route -n openshift-ingress-canary -ojsonpath={..host})
    # for i in `oc get po -n openshift-ingress-operator | grep -v NAME| awk '{print $1}' ` ; do oc exec -n openshift-ingress-operator -c ingress-operator $i -- dig ${ROUTE} +nocmd +noall +answer ; done
    ## This should return an A record that points to the IP of your load balancer for *.apps.<yourcluster>.<yourdomain>
    
    • A failure here may indicate that DNS is not working locally on the platform (the openshift-dns pods), or that the upstream nameserver is failing to return results for the *.apps wildcard record. A quick DNS-stack check is sketched below.
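    A quick sketch for checking the in-cluster DNS stack referenced above:

    # oc get clusteroperator dns
    # oc get pods -n openshift-dns -o wide
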
  5. Confirm that your load balancer is forwarding traffic appropriately by using curl --resolve from your bastion to bypass the load balancer and resolve the route hostname directly to a router pod:

    $ oc get pods -o wide -n openshift-ingress
    NAME                              READY   STATUS    RESTARTS   AGE   IP            NODE                                                      
    router-default-5fc6f8b969-jhcnd   1/1     Running   0          9h    10.0.90.154   worker-2 
    router-default-5fc6f8b969-kfq45   1/1     Running   0          19h   10.0.89.132   worker-0 
    
    ## While on the bastion, run the following commands against one of the router pod IPs:
    $ curl -kv --resolve {ROUTE-URL}:80:{INFRA-NODE-IP-ADDRESS} http://{ROUTE-URL}
    $ curl -kv --resolve {ROUTE-URL}:443:{INFRA-NODE-IP-ADDRESS} https://{ROUTE-URL}
    

    Note: In this example, {INFRA-NODE-IP-ADDRESS} can be 10.0.89.132 or 10.0.90.154

    • Check in the curl output that it confirms you connected directly to the router pod:
    * Connected to canary-openshift-ingress-canary.apps.<mycluster>.<mydomain> (10.0.90.154) port 443 (#0)
    
    • The router pods should serve the request and return a valid response code for this route. If this direct check succeeds while the canary checks still fail, your load balancer may not be forwarding traffic, or the DNS nameserver may not be resolving the ingress wildcard A record (*.apps.<yourcluster>.<yourdomain>) to the load balancer IP. A worked example follows below.
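    A worked instance of the generic syntax above, assuming the canary route host from step 3 and the router pod IP 10.0.90.154 from the listing:

    $ curl -kv --resolve canary-openshift-ingress-canary.apps.<mycluster>.<mydomain>:443:10.0.90.154 https://canary-openshift-ingress-canary.apps.<mycluster>.<mydomain>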

If the issue persists after executing the above steps, open a support case with the curl outputs, the dig results, and an up-to-date must-gather.
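
For reference, an up-to-date must-gather for the whole cluster can be collected with:
$ oc adm must-gather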

Root Cause

  • The Ingress Operator pod periodically sends requests to the canary route to verify that ingress traffic is reachable. When these checks repeatedly fail, the operator reports a degraded state.
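The objects involved in this check and the current canary condition can be inspected with the following sketch (the CanaryChecksSucceeding condition type is taken from the error message above):
$ oc get route,svc,pods -n openshift-ingress-canary
$ oc get ingresscontroller default -n openshift-ingress-operator -o jsonpath='{.status.conditions[?(@.type=="CanaryChecksSucceeding")]}'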

Diagnostic Steps

  • Check the Ingress Operator pod logs for the CanaryChecksRepetitiveFailures error.
$ oc logs -n openshift-ingress-operator ingress-operator-b449dcfc4-btwvq -c ingress-operator
2021-03-17T10:01:26.023229058Z 2021-03-17T10:01:26.023Z    ERROR    operator.canary_controller    wait/wait.go:155    error performing canary route check    {"error": "error sending canary HTTP Request: Timeout: Get \"http://canary-openshift-ingress-canary.apps.<domain>\": dial tcp $IP:80: i/o timeout (Client.Timeout exceeded while awaiting headers)"}
2021-03-17T10:01:28.984572747Z 2021-03-17T10:01:28.984Z    INFO    operator.ingress_controller    controller/controller.go:244    reconciling    {"request": "openshift-ingress-operator/default"}
2021-03-17T10:01:29.051564141Z 2021-03-17T10:01:29.051Z    ERROR    operator.ingress_controller    controller/controller.go:244    got retryable error; requeueing    {"after": "1m0s", "error": "IngressController is degraded: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)"}
2021-03-17T10:02:29.052556682Z 2021-03-17T10:02:29.051Z    INFO    operator.ingress_controller    controller/controller.go:244    reconciling    {"request": "openshift-ingress-operator/default"}
2021-03-17T10:02:29.120898099Z 2021-03-17T10:02:29.120Z    ERROR    operator.ingress_controller    controller/controller.go:244    got retryable error; requeueing    {"after": "1m0s", "error": "IngressController is degraded: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)"}
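If the exact pod name is not known, the same logs can also be pulled through the Deployment (a convenience sketch):
$ oc logs -n openshift-ingress-operator deployment/ingress-operator -c ingress-operator | grep -i canary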
  • Check the Ingress Cluster Operator for the CanaryChecksRepetitiveFailures error.
$ oc get clusteroperators ingress -oyaml
ingresscontroller "default" is degraded: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
