Troubleshooting CanaryChecksRepetitiveFailures in the Ingress operator on OpenShift 4
Environment
- Red Hat OpenShift Container Platform (RHOCP) 4
- Ingress Operator
- Canary
Issue
- The upgrade process for OpenShift 4 gets stuck because the ingress operator is in a degraded state, with the following errors appearing in the ingress-operator pod:

  ERROR operator.canary_controller wait/wait.go:155 error performing canary route check {"error": "error sending canary HTTP Request: Timeout: Get \"http://canary-openshift-ingress-canary.apps.<domain>\": dial tcp $IP:80: i/o timeout (Client.Timeout exceeded while awaiting headers)"}
  INFO operator.ingress_controller controller/controller.go:244 reconciling {"request": "openshift-ingress-operator/default"}

- The upgrade process for OCP 4 gets stuck because the ingress operator is in a degraded state, with the following error appearing in the ingress clusteroperator:

  ingresscontroller "default" is degraded: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
Resolution
The CanaryChecksRepetitiveFailures error can have several causes, some of which have been identified as bugs and fixed in later releases. Refer to the following solutions to check whether the issue matches an already known issue (other solutions may also be related to the same error):
- Ingress Operator degraded with CanaryChecksRepetitiveFailures.
- Ingress Operator failed: CanaryChecksRepetitiveFailures during upgrade when ingress sharding is used.
- Canary route checks failing or canary route deployed with two routerCanonicalHostname entries with different domains.
- Failed to update canary route openshift-ingress-canary/canary in OpenShift 4.
- CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing in OpenShift 4.17.
If none of the above is the cause of the current issue, the following steps outline a methodology for diagnosing the problem and finding its root cause. Each step is designed to show where in the communication flow the breakdown occurs, so that you can then identify and resolve the outage.
The canary route check repetitive failures error explicitly reports that the following connection flow is failing:
openshift-ingress-operator pod --> DNS lookup for *.apps (A-record for loadbalancer) --> Loadbalancer IP --> router-default pod --> openshift-ingress-canary-pod (200 OK).
Troubleshooting steps
If any portion of this curl request flow fails, this error is shown. The steps below help find which part of the flow is broken in order to solve the issue:
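Several of the checks below compare the HTTP status code returned by curl -s -D -. The code can be extracted with a small helper like the following (a minimal sketch, assuming bash; the function name http_status is illustrative, not part of OpenShift tooling):

```shell
# http_status reads raw HTTP response headers, as produced by `curl -s -D -`,
# on stdin and prints only the numeric status code from the status line
# (e.g. "HTTP/1.1 200 OK" -> "200").
http_status() {
  head -n 1 | awk '{print $2}' | tr -d '\r'
}

# Hypothetical usage against a canary pod (pod name is a placeholder):
#   oc exec -n openshift-ingress-canary <pod> -- curl -s -D - http://localhost:8080/ | http_status
```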
- Test whether each canary pod can curl itself on port 8080. All curl requests should return a 200 status code; if the status code is != 200, the pod is not yet operational:

  $ for i in `oc get po -n openshift-ingress-canary | grep -vi name | awk '{print $1}'` ; do echo -e "\t---- $i ----" ; oc exec -n openshift-ingress-canary $i -- curl -s -D - http://localhost:8080/ ; done

  Note: in OCP 4.17 and newer releases, the curl in the above command needs to be changed to curl -sk -D - https://localhost:8888/

  - If the issue is recent, wait a couple of minutes (1 or 2) and check again.
  - If the issue is persistent, restart the pod.
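To scan all canary pods at once and print only the failing ones, the loop from the step above can be combined with a small filter. This is a sketch assuming bash; the function name flag_failures and the one-pod-per-line wiring are our own illustration:

```shell
# flag_failures reads lines of the form "<pod> <status-code>" on stdin and
# prints only the pods whose status code is not 200.
flag_failures() {
  awk '$2 != 200 { print $1 " returned HTTP " $2 }'
}

# Hypothetical wiring with the commands from the step above (not run here):
#   for i in $(oc get po -n openshift-ingress-canary --no-headers | awk '{print $1}'); do
#     code=$(oc exec -n openshift-ingress-canary "$i" -- \
#       curl -s -o /dev/null -w '%{http_code}' http://localhost:8080/)
#     echo "$i $code"
#   done | flag_failures
```

An empty result means every pod answered with 200.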
- Check whether the ingress-operator pods can curl the service endpoint. All curl requests should return a 200 status code; if the status code is != 200, the service IP is not able to route incoming traffic:

  $ SVC_IP=$(oc get svc -n openshift-ingress-canary -ojsonpath={..clusterIP})
  $ for i in `oc get po -n openshift-ingress-operator | grep -v NAME | awk '{print $1}'` ; do oc exec -n openshift-ingress-operator -c ingress-operator $i -- curl -s -D - http://${SVC_IP}:8080 ; done

  Note: in OCP 4.17 and newer releases, the curl in the above command needs to be changed to curl -sk -D - https://${SVC_IP}:8888

  - This service is managed by the ingress-canary daemonset in the openshift-ingress-canary namespace. If the status code is != 200, restart the cluster-version-operator pod and then the mentioned daemonset.
- Check the connection to the route. The curl request should return a 302 response code:

  $ ROUTE=$(oc get route -n openshift-ingress-canary -ojsonpath={..host})
  $ for i in `oc get po -n openshift-ingress-operator | grep -v NAME | awk '{print $1}'` ; do oc exec -n openshift-ingress-operator -c ingress-operator $i -- curl http://${ROUTE} -sS -k -D - ; done

  - Check whether the load balancer is routing traffic to the router pods.
  - If the route is not present or the status code is != 302, restart the cluster-version-operator pod and then the router pods from the openshift-ingress namespace.
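The 302 expectation in this step can be checked mechanically. The following sketch (bash assumed; is_redirect is an illustrative name, not OpenShift tooling) succeeds only when the status line reports 302, and prints the Location target the router redirects to:

```shell
# is_redirect reads raw HTTP response headers (as produced by `curl -D -`)
# on stdin, returns non-zero unless the status line reports 302, and prints
# the Location header the route redirects to.
is_redirect() {
  read -r status_line
  code=$(printf '%s\n' "$status_line" | awk '{print $2}')
  [ "$code" = "302" ] || return 1
  grep -i '^location:' | awk '{print $2}' | tr -d '\r'
}

# Hypothetical usage with the loop from the step above:
#   curl http://${ROUTE} -sS -k -D - -o /dev/null | is_redirect
```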
- Confirm DNS resolution is working from within the openshift-ingress-operator pod:

  $ ROUTE=$(oc get route -n openshift-ingress-canary -ojsonpath={..host})
  $ for i in `oc get po -n openshift-ingress-operator | grep -v NAME | awk '{print $1}'` ; do oc exec -n openshift-ingress-operator -c ingress-operator $i -- dig ${ROUTE} +nocmd +noall +answer ; done
  ## it should return an A record that points to the IP of the load balancer for *.apps.<yourcluster>.<yourdomain>

  - A failure here may indicate that DNS is not working locally on the platform (openshift-dns pods), or that the upstream nameserver is failing to return results for *.apps.
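A quick way to validate the dig output is to pull out just the A-record addresses. A sketch assuming bash (the function name extract_a_records is ours):

```shell
# extract_a_records reads `dig +noall +answer` output on stdin and prints the
# IPv4 address of every A record in it. Answer lines have the form:
#   <name> <ttl> IN A <address>
extract_a_records() {
  awk '$3 == "IN" && $4 == "A" { print $5 }'
}

# Hypothetical usage:
#   dig ${ROUTE} +nocmd +noall +answer | extract_a_records
# An empty result means no A record was returned for the canary route.
```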
- Confirm the load balancer is forwarding traffic appropriately by using curl --resolve to bypass the LB and terminate the lookup directly at a router pod from your bastion:

  $ oc get pods -o wide -n openshift-ingress
  NAME                              READY   STATUS    RESTARTS   AGE   IP            NODE
  router-default-5fc6f8b969-jhcnd   1/1     Running   0          9h    10.0.90.154   worker-2
  router-default-5fc6f8b969-kfq45   1/1     Running   0          19h   10.0.89.132   worker-0
  $ ROUTE=$(oc get route -n openshift-ingress-canary -ojsonpath={..host})
  $ ROUTER=$(oc get pod -n openshift-ingress -o wide | grep -v NAME | grep Running | grep router-default | awk '{print $6}' | head -n 1)
  $ curl -k -v --noproxy '*' --resolve ${ROUTE}:443:${ROUTER} https://${ROUTE}

  Note: in this example, ${ROUTER} can be 10.0.89.132 or 10.0.90.154; the first available router pod in the Running state is selected for this test.

  Note: when copying the curl command to your command line, the syntax may change to include backslashes (https://$\{ROUTE\}); remove them, as the curl will otherwise fail.

  - Check in the curl result that you did indeed resolve at the router pod and NOT at the load balancer IP. The result below indicates 10.0.90.154 was called, which is the IP of the router pod/infra host:

    * Connected to canary-openshift-ingress-canary.apps.<mycluster>.<mydomain> (10.0.90.154) port 443 (#0)

  - The router pods should be able to resolve the request and return a valid response code from this route. A success here may imply that your load balancer is not able to forward traffic, or that the DNS nameserver is not resolving the ingress VIP A record (*.apps.yourcluster.yourdomain) to the load balancer IP.
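Checking which IP curl actually connected to can also be scripted. The following sketch (bash assumed; connected_ip is an illustrative name) extracts the address from the verbose curl output so it can be compared against the router pod IP:

```shell
# connected_ip reads `curl -v` output on stdin and prints the IP address from
# the first "* Connected to <host> (<ip>) port ..." line.
connected_ip() {
  sed -n 's/^\* Connected to .* (\([0-9.]*\)) port.*/\1/p' | head -n 1
}

# Hypothetical usage:
#   curl -k -v --noproxy '*' --resolve ${ROUTE}:443:${ROUTER} https://${ROUTE} 2>&1 | connected_ip
# The printed address should match ${ROUTER}, not the load balancer IP.
```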
- After several minutes, if the ingress operator is still degraded, try deleting the canary route so that it is recreated:

  $ oc delete route canary -n openshift-ingress-canary
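After deleting the route, whether the ingress clusteroperator has recovered can be polled from the tabular oc output. A sketch assuming bash (degraded_status is our own helper name):

```shell
# degraded_status reads `oc get clusteroperators ingress` output on stdin and
# prints the DEGRADED column (True/False) of the ingress line. The columns of
# `oc get clusteroperators` are: NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE.
degraded_status() {
  awk '$1 == "ingress" { print $5 }'
}

# Hypothetical polling loop (not run here):
#   while [ "$(oc get clusteroperators ingress | degraded_status)" = "True" ]; do
#     sleep 30
#   done
```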
- NOTE: if the issue persists after executing the above steps, open a support case with the curl outputs, the dig results, and an up-to-date must-gather.
Root Cause
The ingress operator pod tries to reach the canary pods through the canary route to verify that the network is reachable. When this check repeatedly fails, the operator goes into a degraded state.
Diagnostic Steps
- Check the ingress operator pod logs for the CanaryChecksRepetitiveFailures error:

  $ oc logs -n openshift-ingress-operator ingress-operator-b449dcfc4-btwvq -c ingress-operator
  [...]
  2021-03-17T10:01:26.023229058Z 2021-03-17T10:01:26.023Z ERROR operator.canary_controller wait/wait.go:155 error performing canary route check {"error": "error sending canary HTTP Request: Timeout: Get \"http://canary-openshift-ingress-canary.apps.<domain>\": dial tcp $IP:80: i/o timeout (Client.Timeout exceeded while awaiting headers)"}
  2021-03-17T10:01:28.984572747Z 2021-03-17T10:01:28.984Z INFO operator.ingress_controller controller/controller.go:244 reconciling {"request": "openshift-ingress-operator/default"}
  2021-03-17T10:01:29.051564141Z 2021-03-17T10:01:29.051Z ERROR operator.ingress_controller controller/controller.go:244 got retryable error; requeueing {"after": "1m0s", "error": "IngressController is degraded: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)"}
  2021-03-17T10:02:29.052556682Z 2021-03-17T10:02:29.051Z INFO operator.ingress_controller controller/controller.go:244 reconciling {"request": "openshift-ingress-operator/default"}
  2021-03-17T10:02:29.120898099Z 2021-03-17T10:02:29.120Z ERROR operator.ingress_controller controller/controller.go:244 got retryable error; requeueing {"after": "1m0s", "error": "IngressController is degraded: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)"}
- Check the ingress clusteroperator for the CanaryChecksRepetitiveFailures error:

  $ oc get clusteroperators ingress -oyaml
  [...]
  ingresscontroller "default" is degraded: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
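To quantify how often the canary check is failing, the operator log can be filtered for the error message shown above. A sketch assuming bash (count_canary_errors is an illustrative name):

```shell
# count_canary_errors reads ingress-operator log lines on stdin and prints how
# many of them mention CanaryChecksRepetitiveFailures (0 when none match).
count_canary_errors() {
  grep -c 'CanaryChecksRepetitiveFailures' || true
}

# Hypothetical usage:
#   oc logs -n openshift-ingress-operator deployment/ingress-operator -c ingress-operator | count_canary_errors
```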
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.