DNS queries hang when sent over TCP

Solution Verified

Environment

  • Red Hat Enterprise Linux
  • Custom DNS server

Issue

  • DNS queries hang when sent over TCP but not UDP

Resolution

Root Cause

DNS queries whose responses are too large for UDP require the DNS server and client to implement TCP retries at a minimum. Optionally, the DNS client and server may implement EDNS to advertise much larger UDP payload sizes. Failure to implement TCP retries in the DNS server or client may result in unexpected behavior.

  • The original specifications for the Domain Name System in 1987 defined an upper limit of 512 bytes for a UDP-based message response. Should a DNS query response over UDP exceed this limit, the "Truncated" (TC) header bit must be set to inform the client that the server could not fit the entire response in the reply packet.
  • The specifications were later clarified in 1997 to note that, should a reply include the TC bit (i.e. the DNS query response has too many records to fit in the 512-byte packet), the client "should ignore that response, and query again, using a mechanism, such as a TCP connection, that will permit larger replies." [sic]
  • DNS specifications were later updated in 2010 to explicitly note "that support for TCP is henceforth a REQUIRED part of a full DNS protocol implementation" and "when the client receives [a UDP response with the TC bit set], it takes the TC flag as an indication that it should retry over TCP instead."
  • Extension Mechanisms for DNS (EDNS) were standardized in 2013, allowing clients and servers to advertise the capability to handle UDP DNS response payloads larger than 512 bytes, up to the recommended 4096 bytes (an upper limit of 64 KB via reassembled datagrams is possible but not recommended).
  • If an application stack does not follow either of the above specification details when a UDP DNS response has the TC bit set, the application may exhibit undefined or unexpected behavior, ranging from completing the request successfully up to and including an application hang or crash.
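The retry behavior described above can be sketched with only the Python standard library. This is a minimal illustration, not a production resolver: the message layout (header flags, length-prefixed QNAME labels, and the two-byte length prefix used for TCP) follows RFC 1035, and the server address and hostname are placeholders matching the reproduction steps below.

```python
import socket
import struct

def build_query(name, qid=0x1234):
    """Build a minimal standard DNS query for an A record (RFC 1035)."""
    # Header: ID, flags (RD=1), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack(">HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    # QNAME: length-prefixed labels terminated by a zero byte
    qname = b"".join(bytes([len(label)]) + label.encode() for label in name.split("."))
    return header + qname + b"\x00" + struct.pack(">HH", 1, 1)  # QTYPE=A, QCLASS=IN

def tc_bit_set(response):
    """The TC flag is bit 0x02 of the third header byte (RFC 1035 section 4.1.1)."""
    return bool(response[2] & 0x02)

def resolve(name, server, timeout=5.0):
    query = build_query(name)
    # First attempt: UDP, honoring the classic 512-byte limit
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(query, (server, 53))
        response, _ = s.recvfrom(512)
    if not tc_bit_set(response):
        return response
    # Truncated: retry over TCP, prefixing the message with its two-byte length
    with socket.create_connection((server, 53), timeout=timeout) as s:
        s.sendall(struct.pack(">H", len(query)) + query)
        length = struct.unpack(">H", s.recv(2))[0]
        data = b""
        while len(data) < length:
            chunk = s.recv(length - len(data))
            if not chunk:
                break
            data += chunk
        return data
```

Run against the reproduction environment below (e.g. `resolve("me.mydomain.net", "192.168.122.1")`), the TCP retry would hit the same iptables DROP rule and time out, mirroring the behavior the diagnostic steps capture with strace.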

Diagnostic Steps

To reproduce the issue

  • This issue can be replicated in a libvirt environment with Python, Red Hat Enterprise Linux as a virtual machine, dnsmasq as the DNS server on the hypervisor, and libvirtd operating as the hypervisor provider:

    1. Produce a purposefully large name resolution set for dnsmasq to respond with on the hypervisor:

      # for h in `seq 1 254`; do echo "10.0.0.$h me.mydomain.net" >> /var/lib/libvirt/dnsmasq/default.addnhosts; done
      
    2. Force dnsmasq to re-read its host resolution files on the hypervisor:

      # kill -1 `pidof dnsmasq`
      
    3. In the virtual machine, set iptables to drop egress TCP packets destined for the hypervisor on port 53:

      # iptables -A OUTPUT -d 192.168.122.1 -p tcp --dport 53 -j DROP    # NOTE change the destination IP address if the virtual bridge does not match default configurations
      
    4. Reproduce the issue in the virtual machine; here python is used to lookup the name of the example hostname generated in step 1:

      # python -c "import socket;print socket.getaddrinfo(\"me.mydomain.net\",3268,socket.AF_UNSPEC,socket.SOCK_STREAM)[0][4][0]"   # Python 2 in RHEL 7 and below
      # python3 -c "import socket;print(socket.getaddrinfo(\"me.mydomain.net\",3268,socket.AF_UNSPEC,socket.SOCK_STREAM)[0][4][0])"    # Python 3 in RHEL 8 and above
      
    5. After about 2 minutes, either example above will time out:

      Traceback (most recent call last):
        File "<string>", line 1, in <module>
        File "/usr/lib64/python3.6/socket.py", line 745, in getaddrinfo
          for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
      socket.gaierror: [Errno -2] Name or service not known
      

What to look for

  • The application making DNS queries can be traced to determine whether it indeed makes a UDP DNS query and then retries over TCP. Below, the Python examples above are traced to show exactly this behavior:

    1. Reproduce steps 1-3 in the section above, then trace the Python examples:

      # strace -fTTvyy -o /tmp/example.strace -s 1024 -- python -c "import socket;print socket.getaddrinfo(\"me.mydomain.net\",3268,socket.AF_UNSPEC,socket.SOCK_STREAM)[0][4][0]"   # Python 2 in RHEL 7 and below
      # strace -fTTvyy -o /tmp/example.strace -s 1024 -- python3 -c "import socket;print(socket.getaddrinfo(\"me.mydomain.net\",3268,socket.AF_UNSPEC,socket.SOCK_STREAM)[0][4][0])"    # Python 3 in RHEL 8 and above
      
    2. The strace log can be reviewed either during the hang or after the timeout (note that the IP address of the hypervisor's virtual bridge is 192.168.122.1 and the virtual machine's IP address is 192.168.122.119 in this example):

      # tail -f /tmp/example.strace 
      1009 3475845 socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 3<UDP:[87621801]> <0.000021>
      1010 3475845 setsockopt(3<UDP:[87621801]>, SOL_IP, IP_RECVERR, [1], 4) = 0 <0.000010>
      1011 3475845 connect(3<UDP:[87621801]>, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.122.1")}, 16) = 0 <0.000014>
      1012 3475845 poll([{fd=3<UDP:[192.168.122.119:33933->192.168.122.1:53]>, events=POLLOUT}], 1, 0) = 1 ([{fd=3, revents=POLLOUT}]) <0.000012>
      1013 3475845 sendmmsg(3<UDP:[192.168.122.119:33933->192.168.122.1:53]>, [{msg_hdr={msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\245\346\1\0\0\1\0\0\0\0\0\0\2me\10mydomain\3net\0\0\1\0\1", iov_len=33}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, msg_len=33}, {msg_hdr={msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="}\345\1\0\0\1\0\0\0\0\0\0\2me\10mydomain\3net\0\0\34\0\1", iov_len=33}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, msg_len=33}], 2, MSG_NOSIGNAL) = 2 <0.000196> 
      1014 3475845 poll([{fd=3<UDP:[192.168.122.119:33933->192.168.122.1:53]>, events=POLLIN}], 1, 5000) = 1 ([{fd=3, revents=POLLIN}]) <0.000010>
      1015 3475845 ioctl(3<UDP:[192.168.122.119:33933->192.168.122.1:53]>, FIONREAD, [497]) = 0 <0.000008>
      1016 3475845 recvfrom(3<UDP:[192.168.122.119:33933->192.168.122.1:53]>, "\245\346\207\200\0\1\0\35\0\0\0\0\2me\10mydomain\3net\0\0\1\0\1\300\f\0\1\0\1\0\0\0\0\0\4\n\0\0\f\300\f\0\1\0\1\0\0\0\0\0\4\n\0\0\r\300\f\0\1\0\1\0\0\0\0\0\4\n\0\0\16\300\f\0\1\0\1\0\0\0\0\0\4\n\0\0\17\300\f\0\1\0\1\0\0\0\0\0\4\n\0\0\20\300\f\0\1\0\1\0\0\0\0\0\4\n\0\0\21\300\f\0\1\0\1\0\0\0\0\0\4\n\0\0\22\300\f\0\1\0\1\0\0\0\0\0\4\n\0\0\23\300\f\0\1\0\1\0\0\0\0\0\4\n\0\0\24\300\f\0\1\0\1\0\0\0\0\0\4\n\0\0\25\300\f\0\1\0\1\0\0\0\0\0\4\n\0\0\26\300\f\0\1\0\1\0\0\0\0\0\4\n\0\0\27\300\f\0\1\0\1\0\0\0\0\0\4\n\0\0\30\300\f\0\1\0\1\0\0\0\0\0\4\n\0\0\31\300\f\0\1\0\1\0\0\0\0\0\4\n\0\0\32\300\f\0\1\0\1\0\0\0\0\0\4\n\0\0\33\300\f\0\1\0\1\0\0\0\0\0\4\n\0\0\34\300\f\0\1\0\1\0\0\0\0\0\4\n\0\0\35\300\f\0\1\0\1\0\0\0\0\0\4\n\0\0\36\300\f\0\1\0\1\0\0\0\0\0\4\n\0\0\37\300\f\0\1\0\1\0\0\0\0\0\4\n\0\0 \300\f\0\1\0\1\0\0\0\0\0\4\n\0\0!\300\f\0\1\0\1\0\0\0\0\0\4\n\0\0\"\300\f\0\1\0\1\0\0\0\0\0\4\n\0\0#\300\f\0\1\0\1\0\0\0\0\0\4\n\0\0$\300\f\0\1\0\1\0\0\0\0\0\4\n\0\0%\300\f\0\1\0\1\0\0\0\0\0\4\n\0\0&\300\f\0\1\0\1\0\0\0\0\0\4\n\0\0'\300\f\0\1\0\1\0\0\0\0\0\4\n\0\0(", 2048, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.122.1")}, [28->16]) = 497 <0.000009>
      1017 3475845 close(3<UDP:[192.168.122.119:33933->192.168.122.1:53]>) = 0 <0.000036>
      1018 3475845 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 3<TCP:[87621802]> <0.000255>
      1019 3475845 connect(3<TCP:[87621802]>, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.122.1")}, 16) = -1 ETIMEDOUT (Connection timed out) <130.902843>
      
      • Line 1009: The UDP socket is created
      • Line 1011: The UDP socket is used to connect to the hypervisor
      • Line 1013: We request the lookup for me.mydomain.net
      • Line 1016: We receive the lookup response
      • Line 1017: The python application responds appropriately by closing the UDP socket
      • Line 1018: A TCP socket is created
      • Line 1019: We attempt to connect the TCP socket, but the host never responds, causing a timeout
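In a large strace log, the hanging call can be hard to spot by eye. The hypothetical helper below scans for connect() calls with long durations, assuming the -T timing format shown above (the duration in angle brackets at the end of each line):

```python
import re

# Matches a connect() line and captures the trailing "<seconds>" duration
# that strace -T appends to each syscall
CONNECT_RE = re.compile(r'connect\(.*<([\d.]+)>\s*$')

def slow_connects(lines, threshold=5.0):
    """Return strace lines whose connect() call took longer than threshold seconds."""
    slow = []
    for line in lines:
        m = CONNECT_RE.search(line)
        if m and float(m.group(1)) > threshold:
            slow.append(line.rstrip())
    return slow

# Example: scan the log produced by the strace commands above
# with open("/tmp/example.strace") as f:
#     for line in slow_connects(f):
#         print(line)
```

Applied to the log above, this would surface only line 1019, the TCP connect that timed out after roughly 131 seconds, while skipping the sub-millisecond UDP connect on line 1011.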
  • The network traffic can also be traced to determine if the TC bit is being set.

    1. Install tcpdump on the virtual machine via the article that most closely matches the environment in question.

    2. Begin a packet capture via tcpdump. Note that for systems with a large amount of network traffic, setting capture filters may be appropriate.

      # tcpdump -i any -w /tmp/example.pcap    # to listen for any traffic over any network interface 
      # tcpdump -i any -w /tmp/example.pcap port 53    # to listen for traffic over port 53 over any network interface 
      
    3. Let tcpdump record until the issue occurs. From there, check for instances of the TC bit being set (the output below is from a packet capture taken during one of the Python examples above):

      # tshark -r /tmp/example.pcap -Y "dns.flags.truncated == 1"
         16   2.074623 192.168.122.1 → 192.168.122.119 DNS 541 Standard query response 0x18fb A me.mydomain.net A 10.0.0.14 A 10.0.0.15 A 10.0.0.16 A 10.0.0.17 A 10.0.0.18 A 10.0.0.19 A 10.0.0.20 A 10.0.0.21 A 10.0.0.22 A 10.0.0.23 A 10.0.0.24 A 10.0.0.25 A 10.0.0.26 A 10.0.0.27 A 10.0.0.28 A 10.0.0.29 A 10.0.0.30 A 10.0.0.31 A 10.0.0.32 A 10.0.0.33 A 10.0.0.34 A 10.0.0.35 A 10.0.0.36 A 10.0.0.37 A 10.0.0.38 A 10.0.0.39 A 10.0.0.40 A 10.0.0.41 A 10.0.0.42
      
      • In the above output, a DNS response from the DNS server on the hypervisor to the client was truncated because there are too many address records for the hostname to fit in the reply packet.
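As noted in the Root Cause, EDNS avoids truncation in the first place by advertising a larger UDP payload size. The sketch below, a minimal illustration rather than a complete EDNS implementation, appends an OPT pseudo-record to an existing query. The record layout (root name, TYPE=41, advertised size carried in the CLASS field, RDLEN=0) follows RFC 6891, and 4096 bytes is the recommended advertised size:

```python
import struct

def edns_opt(udp_payload_size=4096):
    """OPT pseudo-record (RFC 6891): root name, TYPE=41 (OPT),
    CLASS=requestor's UDP payload size, TTL=extended RCODE/flags, RDLEN=0."""
    return b"\x00" + struct.pack(">HHIH", 41, udp_payload_size, 0, 0)

def add_edns(query):
    """Append an OPT record to a raw DNS query whose additional section is empty."""
    qid, flags, qd, an, ns, ar = struct.unpack(">HHHHHH", query[:12])
    # Bump ARCOUNT so the OPT record is counted in the additional section
    header = struct.pack(">HHHHHH", qid, flags, qd, an, ns, ar + 1)
    return header + query[12:] + edns_opt()
```

A client sending such a query tells the server it can accept UDP responses up to 4096 bytes, so responses like the 541-byte one captured above would no longer need the TC bit; note this complements, but does not replace, the required TCP retry path.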

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
