Timeouts observed in CoreDNS leading to repeated SERVFAILs on Azure Red Hat OpenShift (ARO) cluster

Environment

  • Azure Red Hat OpenShift (ARO)
    • 4.x

Issue

  • Customer workloads observe repeated SERVFAIL responses for the same DNS query on a single host. Other queries continue to work, and after approximately 40 seconds the failing query succeeds again.

  • Repeated i/o timeout error messages are observed in the CoreDNS logs.

Resolution

CoreDNS's own timeout and retry logic can be more forgiving for customer workloads than dnsmasq's. To take advantage of this, configure the DNS Operator to forward queries for the affected domains directly to the Azure recursive resolvers, or to resolvers of the customer's choice, bypassing dnsmasq.

To configure forwarding in the DNS Operator, modify the DNS Operator object named dns.operator/default. Add a servers section similar to the following, listing the desired domains under zones. This example uses the Azure default recursive resolver at 168.63.129.16, but customer-provided recursive DNS IPs can be used instead.

  servers:
  - name: override
    zones: 
    - example.com
    - example.net
    forwardPlugin:
      policy: Random 
      upstreams: 
      - 168.63.129.16
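
One possible way to apply this change is sketched below; example.com, example.net, and the upstream IP are the placeholders from the snippet above, and a merge patch replaces the entire spec.servers list, so include every desired entry. The second command checks that the operator rendered the new zones into the CoreDNS Corefile, which it publishes in the dns-default config map:

$ oc patch dns.operator/default --type=merge \
    -p '{"spec":{"servers":[{"name":"override","zones":["example.com","example.net"],"forwardPlugin":{"policy":"Random","upstreams":["168.63.129.16"]}}]}}'

$ oc get configmap/dns-default -n openshift-dns -o yaml | grep -A 6 example.com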

For more information about configuring the OCP DNS Operator, refer to the OCP documentation.

Important: On clusters with the egress lockdown feature enabled, certain domains must not be forwarded through the DNS Operator. ARO overrides the DNS records for these domains in order to implement egress lockdown, so forwarding them in the DNS Operator will cause the cluster to lose access to them. Some of the domains to avoid forwarding in this manner are:

  • arosvc.azurecr.io
  • arosvc.<REGION>.data.azurecr.io
  • blob.core.windows.net
  • ingest.monitor.core.windows.net
  • login.microsoftonline.com
  • management.azure.com
  • monitoring.core.windows.net
  • servicebus.windows.net

For a full list of egress lockdown domains, refer to "Control egress traffic for your Azure Red Hat OpenShift (ARO) cluster" in the Microsoft documentation, or check the cluster directly using:

$ oc get cluster cluster -o jsonpath="{.spec.gatewayDomains}"

Example output (the exact list varies by cluster and region):

["agentimagestorewus01.blob.core.windows.net","agentimagestorecus01.blob.core.windows.net","agentimagestoreeus01.blob.core.windows.net","agentimagestoreweu01.blob.core.windows.net","agentimagestoreeas01.blob.core.windows.net","eastus-shared.prod.warm.ingest.monitor.core.windows.net","gcs.prod.monitoring.core.windows.net","randomstring.servicebus.windows.net","randomstring.blob.core.windows.net","randomstring.servicebus.windows.net","randomstring.blob.core.windows.net","randomstring.servicebus.windows.net","randomstring.blob.core.windows.net","maupdateaccount.blob.core.windows.net","maupdateaccount2.blob.core.windows.net","maupdateaccount3.blob.core.windows.net","maupdateaccount4.blob.core.windows.net","production.diagnostics.monitoring.core.windows.net","qos.prod.warm.ingest.monitor.core.windows.net","login.microsoftonline.com","management.azure.com","arosvc.azurecr.io","arosvc.eastus.data.azurecr.io","imageregistry6rmpk.blob.core.windows.net"]

Root Cause

The underlying issue is a remote recursive DNS resolver failing to respond to a query, typically because the DNS infrastructure is overloaded or a packet was lost.

ARO routes all outbound DNS queries through dnsmasq on each node. Dnsmasq forwards received queries to the recursive DNS resolvers configured on the Azure virtual network, which can be either the default Azure resolver at 168.63.129.16 or customer-provided resolvers. When dnsmasq receives a request, it forwards it to the upstream resolvers, but queues any further requests for that specific DNS name (and type) until the upstream resolver responds to the original request. If that forwarded request is lost, dnsmasq times out after 40 seconds and drops the requests queued for that name.

CoreDNS is configured to forward upstream requests to dnsmasq via the SystemResolvConf mechanism. CoreDNS applies a 6-second timeout to forwarded requests, after which it responds to the client with SERVFAIL. Because of the queueing behavior in dnsmasq, if a forwarded request is lost, CoreDNS will answer every query for that name and type with SERVFAIL for the next 40 seconds. As a result, the client workload can observe a large number of DNS failures for a 40-second period each time an upstream request is lost.
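
The forwarding configuration itself can be inspected in the Corefile that the DNS Operator publishes in the dns-default config map in the openshift-dns namespace. As a rough sketch (the exact plugin options vary by OpenShift version), the relevant stanza forwards to the nameservers in /etc/resolv.conf, which on an ARO node point at dnsmasq:

$ oc get configmap/dns-default -n openshift-dns -o jsonpath='{.data.Corefile}' | grep -A 3 'forward \.'

    forward . /etc/resolv.conf {
        policy sequential
    }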

Diagnostic Steps

Examine the logs and their timestamps for the dns-default pods in the openshift-dns namespace.

$ oc get pods -n openshift-dns | grep dns-default

$ oc logs -n openshift-dns [dns-pod_name] --timestamps

The logs will contain a series of nearly identical lines similar to the following:

2023-05-23T12:25:33.973073008Z [INFO] 10.129.2.11:56970 - 36977 "A IN db-prod-postgres.postgres.database.azure.com. udp 114 false 512" - - 0 6.001935614s
2023-05-23T12:25:33.973073008Z [ERROR] plugin/errors: 2 db-prod-postgres.postgres.database.azure.com. A: read udp 10.1.2.3:50863->10.0.2.1:53: i/o timeout
2023-05-23T12:25:35.513536797Z [INFO] 10.129.2.11:56970 - 36977 "A IN db-prod-postgres.postgres.database.azure.com. udp 114 false 512" - - 0 6.001935614s
2023-05-23T12:25:35.513536797Z [ERROR] plugin/errors: 2 db-prod-postgres.postgres.database.azure.com. A: read udp 10.1.2.3:50863->10.0.2.1:53: i/o timeout

Within a given pod, the failing name and record type (e.g. A or AAAA) will be consistent. Comparing the timestamps will show that the errors last only about 40 seconds at a time. Other pods may show similar sequences of errors, each likewise consistent for a specific DNS name and type and each lasting only about 40 seconds.

The [ERROR] level message records that the failure was an i/o timeout, while the [INFO] level message records the query as lasting approximately 6 seconds, which corresponds to the CoreDNS forwarding timeout described in the Root Cause section above.
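
To confirm that the failures are concentrated on a single name and type, the timeout errors can be tallied per query. A minimal sketch, assuming the [ERROR] log format shown above (the awk field positions depend on that format):

$ oc logs -n openshift-dns [dns-pod_name] --timestamps \
    | grep 'i/o timeout' \
    | awk '{print $5, $6}' | sort | uniq -c | sort -rn | head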

When examining the logs, there may also be a large number of failures for queries constructed from the default search domain (for example, names ending in randomstring.bx.internal.cloudapp.net.). While these failures may share the same cause, such names are not expected to resolve in the first place, so their failure is not impactful.

If timeouts are observed consistently across all nodes at the same time, this problem is not the likely cause.
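
One way to compare nodes is to list the timeout errors from every dns-default pod and compare their timestamps; bursts that appear on one pod at a time match this issue, while identical bursts on all pods at once point elsewhere. A minimal sketch:

$ for pod in $(oc get pods -n openshift-dns -o name | grep dns-default); do
      echo "== $pod =="
      oc logs -n openshift-dns "$pod" --timestamps | grep 'i/o timeout' | head -n 5
  done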

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
