Troubleshooting CanaryChecksRepetitiveFailures in the Ingress operator on OpenShift 4


Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 4
  • Ingress Operator
  • Canary

Issue

  • The upgrade process for OpenShift 4 gets stuck because the Ingress Operator is in a degraded state, with the following errors appearing in the ingress operator pod logs:

    ERROR    operator.canary_controller    wait/wait.go:155    error performing canary route check    {"error": "error sending canary HTTP Request: Timeout: Get \"http://canary-openshift-ingress-canary.apps.<domain>\": dial tcp $IP:80: i/o timeout (Client.Timeout exceeded while awaiting headers)"}
    INFO    operator.ingress_controller    controller/controller.go:244    reconciling    {"request": "openshift-ingress-operator/default"}
    
  • The upgrade process for OCP 4 gets stuck because the Ingress Operator is in a degraded state, with the following error appearing in the ingress clusteroperator status:

    ingresscontroller "default" is degraded: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
    

Resolution

The CanaryChecksRepetitiveFailures error can have several causes, some of which have been identified as bugs and fixed in later releases. Refer to the following solutions to check whether the issue matches an already known issue (other solutions may also reference the same error):

If none of the above is the cause of the current issue, the following steps outline a methodology for diagnosing the problem and finding its root cause. Each step is designed to show where in the communication flow the breakdown occurs, so that the underlying cause of the outage can then be identified and resolved.

The CanaryChecksRepetitiveFailures error explicitly reports that the following connection flow is failing:

openshift-ingress-operator pod --> DNS lookup for *.apps (A-record for loadbalancer) --> Loadbalancer IP --> router-default pod --> openshift-ingress-canary-pod (200 OK).
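The troubleshooting steps that follow walk this flow hop by hop. As a sketch of the overall methodology (first_broken_hop is a hypothetical helper, not part of any Red Hat tooling; the real per-hop checks are the commands in the steps below), the idea is to run one check per hop and stop at the first failure:

```shell
#!/bin/sh
# Hypothetical sketch: walk the canary connection flow hop by hop and
# report the first hop whose check command fails. Substitute the real
# checks from the troubleshooting steps below for each hop.
first_broken_hop() {
  # Arguments: alternating "hop name" and "check command" pairs.
  while [ "$#" -ge 2 ]; do
    hop=$1
    check=$2
    shift 2
    if ! sh -c "$check" >/dev/null 2>&1; then
      echo "first failing hop: $hop"
      return 1
    fi
  done
  echo "all hops passed"
}

# Example invocation (commands shortened for illustration):
# first_broken_hop \
#   "pod self-check" "oc exec -n openshift-ingress-canary <pod> -- curl -sf http://localhost:8080/" \
#   "service"        "oc exec -n openshift-ingress-operator <pod> -- curl -sf http://<svc-ip>:8080" \
#   "route via LB"   "curl -skf https://<canary-route>"
```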

Troubleshooting steps

If any portion of that flow fails, this error is shown. The steps below help identify which part of the flow is broken so the issue can be resolved:

  1. Test whether each canary pod can curl itself on port 8080. Every curl request should return a 200 status code; if the status code is not 200, the pod is not yet operational:

    $ for i in `oc get po  -n openshift-ingress-canary | grep -vi name | awk '{print $1}' ` ; do echo -e "\t---- $i ----" ; oc exec -n openshift-ingress-canary $i  --  curl -s -D - http://localhost:8080/ ; done
    

    Note: in OCP 4.17 and later releases, change the curl in the above command to curl -sk -D - https://localhost:8888/

    • If the issue started recently, wait a minute or two and check again.
    • If the issue persists, restart the affected pod.
  2. Check whether pods can curl the service endpoint. Every curl request should return a 200 status code; if the status code is not 200, the service IP is not able to route incoming traffic.

    $ SVC_IP=$(oc get svc -n openshift-ingress-canary -ojsonpath={..clusterIP})
    $ for i in `oc get po -n openshift-ingress-operator | grep -v NAME| awk '{print $1}' ` ; do oc exec -n openshift-ingress-operator -c ingress-operator $i -- curl -s -D - http://${SVC_IP}:8080 ; done
    

    Note: in OCP 4.17 and later releases, change the curl in the above command to curl -sk -D - https://${SVC_IP}:8888

    • This service is managed by the ingress-canary daemonset in the openshift-ingress-canary namespace. If the status code is not 200, restart the cluster-version-operator pod and then the daemonset.
  3. Check the connection to the route. The curl request should return a 302 response code:

    $ ROUTE=$(oc get route -n openshift-ingress-canary -ojsonpath={..host})
    $ for i in `oc get po -n openshift-ingress-operator | grep -v NAME| awk '{print $1}' ` ; do oc exec -n openshift-ingress-operator -c ingress-operator $i -- curl http://${ROUTE} -sS -k -D - ; done
    
    • Check whether the load balancer is routing traffic to the router pods.
    • If the route is not present or the status code is not 302, restart the cluster-version-operator pod and then the router pods in the openshift-ingress namespace.
  4. Confirm DNS is resolving properly from within the openshift-ingress-operator pod:

    $ ROUTE=$(oc get route -n openshift-ingress-canary -ojsonpath={..host})
    $ for i in `oc get po -n openshift-ingress-operator | grep -v NAME| awk '{print $1}' ` ; do oc exec -n openshift-ingress-operator -c ingress-operator $i -- dig ${ROUTE} +nocmd +noall +answer ; done
    # It should return an A record pointing to the load balancer IP for *.apps.<yourcluster>.<yourdomain>
    
    • A failure here may indicate that DNS is not working locally on the platform (openshift-dns pods), or that the upstream nameserver is failing to return results for *.apps...
  5. Confirm the load balancer is forwarding traffic appropriately by using curl --resolve to bypass DNS and the load balancer and connect directly to a router pod from your bastion:

    $ oc get pods -o wide -n openshift-ingress
    NAME                              READY   STATUS    RESTARTS   AGE   IP            NODE                                                      
    router-default-5fc6f8b969-jhcnd   1/1     Running   0          9h    10.0.90.154   worker-2 
    router-default-5fc6f8b969-kfq45   1/1     Running   0          19h   10.0.89.132   worker-0 
    
    $ ROUTE=$(oc get route -n openshift-ingress-canary -ojsonpath={..host})
    $ ROUTER=$(oc get pod -n openshift-ingress -o wide | grep -v NAME | grep Running | grep router-default | awk {'print $6'} | head -n 1)
    $ curl -k -v --noproxy '*' --resolve ${ROUTE}:443:${ROUTER} https://${ROUTE}
    

    Note: in this example, ${ROUTER} can be 10.0.89.132 or 10.0.90.154; the command selects the first available router pod in the Running state for this test.

    Note: when copying the curl command to your command line, the syntax may change to include backslashes (https://$\{ROUTE\}); remove these, or the curl will fail.

    • Check in the curl output that you did indeed resolve to the router pod and NOT the load balancer IP (the result below shows that 10.0.90.154 was called, which is the IP of the router pod/infra host):

      * Connected to canary-openshift-ingress-canary.apps.<mycluster>.<mydomain> (10.0.90.154) port 443 (#0)
      
    • The router pods should be able to serve the request and return a valid response code for this route. If this direct test succeeds while the normal canary check fails, the load balancer may not be forwarding traffic, or the DNS nameserver may not be resolving the ingress wildcard A record (*.apps.<yourcluster>.<yourdomain>) to the load balancer IP.

    • If the ingress is still degraded after several minutes, delete the canary route so the operator recreates it:

      $ oc delete route canary -n openshift-ingress-canary
      

NOTE: if the issue persists after executing the above steps, open a support case with the curl outputs, dig results, and an up-to-date must-gather.

Root Cause

The ingress operator pod periodically requests the canary route to verify that the ingress network path is reachable end to end. When these checks repeatedly fail, the operator reports a degraded state.
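
The condition is threshold-based: a single transient failure does not degrade the operator; only repeated consecutive failures do. A minimal sketch of that behavior (the threshold of 5 consecutive failures is an assumption for illustration, not necessarily the operator's exact value):

```shell
#!/bin/sh
# Sketch of the repetitive-failure logic: count consecutive failed canary
# probes and report a degraded condition once a threshold is crossed.
# THRESHOLD=5 is an assumed value for illustration.
THRESHOLD=5

canary_condition() {
  # Arguments: a sequence of probe results, each "ok" or "fail".
  consecutive=0
  for result in "$@"; do
    if [ "$result" = "fail" ]; then
      consecutive=$((consecutive + 1))
    else
      consecutive=0
    fi
    if [ "$consecutive" -ge "$THRESHOLD" ]; then
      echo "CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures)"
      return 0
    fi
  done
  echo "CanaryChecksSucceeding=True"
}
```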

Diagnostic Steps

  • Check the Ingress Operator Pod Logs for the CanaryChecksRepetitiveFailures error.

    $ oc logs -n openshift-ingress-operator ingress-operator-b449dcfc4-btwvq -c ingress-operator
    [...]
    2021-03-17T10:01:26.023229058Z 2021-03-17T10:01:26.023Z    ERROR    operator.canary_controller    wait/wait.go:155    error performing canary route check    {"error": "error sending canary HTTP Request: Timeout: Get \"http://canary-openshift-ingress-canary.apps.<domain>\": dial tcp $IP:80: i/o timeout (Client.Timeout exceeded while awaiting headers)"}
    2021-03-17T10:01:28.984572747Z 2021-03-17T10:01:28.984Z    INFO    operator.ingress_controller    controller/controller.go:244    reconciling    {"request": "openshift-ingress-operator/default"}
    2021-03-17T10:01:29.051564141Z 2021-03-17T10:01:29.051Z    ERROR    operator.ingress_controller    controller/controller.go:244    got retryable error; requeueing    {"after": "1m0s", "error": "IngressController is degraded: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)"}
    2021-03-17T10:02:29.052556682Z 2021-03-17T10:02:29.051Z    INFO    operator.ingress_controller    controller/controller.go:244    reconciling    {"request": "openshift-ingress-operator/default"}
    2021-03-17T10:02:29.120898099Z 2021-03-17T10:02:29.120Z    ERROR    operator.ingress_controller    controller/controller.go:244    got retryable error; requeueing    {"after": "1m0s", "error": "IngressController is degraded: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)"}
    
  • Check the Ingress Cluster Operator for the CanaryChecksRepetitiveFailures error.

    $ oc get clusteroperators ingress -oyaml
    [...]
    ingresscontroller "default" is degraded: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
    
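
To quantify how frequently the check is failing, the operator log can be filtered for the error message shown above (count_canary_failures is an illustrative helper name, not an existing command):

```shell
#!/bin/sh
# Hypothetical helper: count canary check failures in log output read
# from stdin, matching the error message shown in the logs above.
count_canary_failures() {
  grep -c 'error performing canary route check'
}

# Example: pipe the ingress operator log through the helper.
# oc logs -n openshift-ingress-operator <ingress-operator-pod> -c ingress-operator \
#   | count_canary_failures
```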

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
