Reliability/speed of DNS lookups


Hi,

Quick DNS lookups are critical for many applications. I am considering changing the default timeout option in
/etc/resolv.conf from 5 seconds to something less (maybe 1 second?) to reduce the impact of an
unresponsive DNS server. The "rotate" option will also help a bit.
Still, even with these changes, a 1 second delay for at least 50% of the lookups is very slow and will
affect application performance a lot. Normally a reply is received within about 10 milliseconds
(only lookups within the organization are performed, with a fast LAN/WAN between the resolver and the
DNS servers).
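
The arithmetic behind the "at least 50%" figure can be sketched as follows. This is a simplified model, not a description of how glibc actually schedules queries: it assumes that "rotate" spreads the first query evenly across all listed servers and that a dead first server costs exactly one timeout. The function name and parameters are made up for illustration.

```python
def mean_added_delay(timeout_s, n_servers, n_dead=1):
    """Average extra seconds per lookup under the simplified model.

    Assumption: the first server tried is picked uniformly at random,
    and hitting a dead server first costs one full timeout. With more
    than one dead server the real cost can be higher, so treat the
    result as a lower bound.
    """
    if n_servers < 1 or not 0 <= n_dead <= n_servers:
        raise ValueError("need n_servers >= 1 and 0 <= n_dead <= n_servers")
    p_first_is_dead = n_dead / n_servers
    return p_first_is_dead * timeout_s


if __name__ == "__main__":
    for timeout in (5.0, 1.0):
        extra = mean_added_delay(timeout, n_servers=2)
        print(f"timeout={timeout}s, 2 servers, 1 dead: +{extra}s per lookup on average")
```

Under these assumptions, two servers with one dead and timeout:1 still add half a second per lookup on average, compared with a normal reply time of about 10 milliseconds.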

I'm a bit surprised that the resolver in libc is not more sophisticated.
Wouldn't it be quite simple to implement some sort of blacklist of non-responding DNS servers?
For instance, if the first DNS server in resolv.conf did not reply within the configured timeout, the
resolver could send the next queries directly to the second and third DNS servers in resolv.conf.
After a predefined number of seconds the first one could be tried again (maybe increasing the number of seconds every time,
up to a maximum of, for instance, 3600).
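
The blacklist idea described above could look roughly like this. It is a sketch of the proposal only and not something the libc resolver does; the class name, method names and the 1 second starting delay are assumptions, while the 3600 second cap comes from the suggestion above.

```python
import time


class ServerBlacklist:
    """Skip servers that timed out, retrying each after a growing delay."""

    def __init__(self, servers, base_delay=1.0, max_delay=3600.0,
                 clock=time.monotonic):
        self._servers = list(servers)
        self._base = base_delay
        self._max = max_delay
        self._clock = clock          # injectable so the logic can be tested
        self._state = {}             # server -> (retry_at, current_delay)

    def candidates(self):
        """Servers to query now, in resolv.conf order."""
        now = self._clock()
        usable = [s for s in self._servers
                  if s not in self._state or self._state[s][0] <= now]
        # If everything is blacklisted, fall back to the full list
        # rather than failing every lookup outright.
        return usable or list(self._servers)

    def report_timeout(self, server):
        """Blacklist a server, doubling its delay on repeated failures."""
        previous = self._state.get(server)
        if previous is None:
            delay = self._base
        else:
            delay = min(previous[1] * 2, self._max)
        self._state[server] = (self._clock() + delay, delay)

    def report_success(self, server):
        """A reply clears the blacklist entry and resets the delay."""
        self._state.pop(server, None)
```

A resolver using it would query `candidates()` in order, call `report_timeout()` for each server that does not answer in time, and call `report_success()` for the one that does.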

To avoid problems like this I see that people suggest many solutions, such as nscd, unbound, or load balancing/failover of
DNS servers, but those may not be easy to implement in all cases.
A bit more robustness in the libc resolver would probably have been better/safer in many cases.

How do you solve this?

Best regards,

Erling Ringen Elvsrud

Responses

Hi,

I have had similar frustrations in the past when failing RHEL VMs over between sites. I tried the same resolv.conf settings as you and found they have very little impact on DNS-sensitive applications that start on boot.

My solution (workaround / hack) was to write a startup script that starts immediately after the networking comes up on boot and carries out the following:

  1. Checks that the primary NIC is up (i.e. the NIC used to reach the DNS servers)
  2. Returns a list of the DNS servers listed in /etc/resolv.conf (commented out or not; this makes sense in the next step)
  3. Loops through each server in the list and determines whether it is up (using your chosen method, e.g. ping)
    3a. If it is up, makes sure the DNS server is uncommented in /etc/resolv.conf
    3b. If it is down, comments it out of /etc/resolv.conf
  4. If fewer than the minimum number of DNS servers (configurable) are reported as 'up', repeats the loop for X tries before giving up
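
Steps 2 to 4 above can be sketched as follows. This is not the original startup script, which was not posted. All function names, the default retry count and pause, and the choice to return None (leave the file untouched) when giving up are assumptions. The NIC check from step 1 is left out.

```python
import re
import subprocess
import time

# Matches active and commented-out nameserver lines, capturing the address.
NAMESERVER_RE = re.compile(r"^\s*#?\s*nameserver\s+(\S+)\s*$")


def ping_once(server, wait_s=1):
    """One possible 'is it up' check: a single ICMP ping (Linux ping flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(wait_s), server],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def rewrite_resolv_conf(text, is_up):
    """Uncomment reachable nameservers and comment out unreachable ones.

    Returns the new file content and the number of servers found up.
    """
    out = []
    up_count = 0
    for line in text.splitlines():
        match = NAMESERVER_RE.match(line)
        if match is None:
            out.append(line)            # search, options, comments: unchanged
            continue
        server = match.group(1)
        if is_up(server):
            out.append(f"nameserver {server}")
            up_count += 1
        else:
            out.append(f"#nameserver {server}")
    return "\n".join(out) + "\n", up_count


def settle_resolv_conf(text, is_up=ping_once, min_up=1, tries=3,
                       pause_s=2.0, sleep=time.sleep):
    """Repeat the check until at least min_up servers respond.

    Returns the rewritten content, or None after the last failed try
    (assumption: the caller then leaves the existing file alone).
    """
    for attempt in range(tries):
        new_text, up_count = rewrite_resolv_conf(text, is_up)
        if up_count >= min_up:
            return new_text
        if attempt < tries - 1:
            sleep(pause_s)
    return None
```

A boot-time wrapper would read /etc/resolv.conf, pass its content to `settle_resolv_conf()`, and write the result back only if it is not None.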

Although this sounds long-winded, it solved the problem for me: the server could quickly remove from resolv.conf any DNS servers it couldn't reach during boot, which in turn meant that applications starting later in the boot process weren't at the mercy of an unreachable entry in resolv.conf.

Thanks for your reply. Your solution might work in certain scenarios.
I have not yet decided what to do in my case. The best solution would have been to improve the resolver as
this is a potential problem for all RHEL users.
All other solutions I have seen add complexity and might also be unreliable.

Ideally I'd like the option to query multiple DNS servers in parallel and then just use the quickest response.

Replying to myself... from 4 years ago... helping myself out?

I did end up coming up with a far more robust solution that queries the DNS servers in parallel and uses the quickest response to answer the query.

The basic process is:
1. Install dnsmasq

yum install -y dnsmasq

2. Create a resolv file for dnsmasq that lists your DNS servers (e.g. /etc/resolv.dnsmasq)

nameserver 10.0.0.2
nameserver 10.0.0.3
nameserver 10.0.0.4

3. Configure dnsmasq (in /etc/dnsmasq.conf) to use your dnsmasq-specific resolv file

resolv-file=/etc/resolv.dnsmasq

4. Configure resolv.conf to use the dnsmasq resolver on localhost (127.0.0.1)

search pixeldrift.local
nameserver 127.0.0.1

5. This is the critical step: create a replacement systemd service file that passes an extra parameter to dnsmasq on startup. This changes the behaviour of dnsmasq so that it queries all DNS servers concurrently and uses the first answer.

[Service]
# An empty ExecStart= clears the packaged command if this is used as a drop-in override
ExecStart=
ExecStart=/usr/sbin/dnsmasq -k --all-servers

Note: This will increase your DNS traffic, because dnsmasq sends every request to every server in your resolv.dnsmasq. Caching in dnsmasq can reduce this.
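
The idea behind --all-servers (send the query to every upstream server and use whichever answers first) can be illustrated with a small sketch. This is not how dnsmasq is implemented. `first_answer` and the `query` callable are hypothetical names, and `query` stands in for whatever function performs a lookup against one server.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def first_answer(servers, query, timeout_s=1.0):
    """Ask every server at once and return (server, answer) for the
    first one that replies successfully.

    query(server) is a caller-supplied function (an assumption here)
    that returns an answer or raises on failure.
    """
    if not servers:
        raise ValueError("no servers given")
    pool = ThreadPoolExecutor(max_workers=len(servers))
    try:
        futures = {pool.submit(query, s): s for s in servers}
        # as_completed yields futures in the order they finish and
        # raises a timeout error if timeout_s passes first.
        for future in as_completed(futures, timeout=timeout_s):
            if future.exception() is None:
                return futures[future], future.result()
        raise LookupError("all servers failed")
    finally:
        # Do not wait for the slower servers; their replies are dropped.
        pool.shutdown(wait=False)
```

The trade-off is the one noted above: every lookup costs one query per server, so the latency gain is paid for in traffic.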

Hi guys, has this topic been concluded? I would also like to know a possible way to fail over from the primary DNS server to the secondary. I am facing production outages because failover takes more than 10 seconds.

Hi Arun,

If you are facing an outage I suggest you open a support case.

A 4-year-old discussion will probably not be re-opened.

Regards,

Jan Gerrit

Not an outage currently, but we recently faced a service outage because of this exact issue.

Will adding the entries below to the client's /etc/resolv.conf file help here or not? Any ideas? options timeout:1 attempts:1

This is the dilemma of RHEL, which by default has no DNS caching. Anyway, in my testing, options attempts:1 timeout:1 works.
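
The resolv.conf(5) man page writes these options as name:value with a colon, and states that timeout is silently capped at 30 and attempts at 5. A small checker along those lines might look like this. The function name is made up, and raising an error on a malformed value is a deliberate choice for this sketch, not a claim about how the resolver reacts.

```python
def parse_resolver_options(line):
    """Parse one 'options ...' line from resolv.conf into a dict.

    Numeric options use the name:value form; bare words such as
    'rotate' are treated as flags.
    """
    words = line.split()
    if not words or words[0] != "options":
        raise ValueError("not an options line")
    caps = {"timeout": 30, "attempts": 5}   # caps documented in resolv.conf(5)
    parsed = {}
    for word in words[1:]:
        name, sep, value = word.partition(":")
        key = name.split("=")[0]            # catches 'timeout=1' style typos
        if key in caps:
            if sep != ":" or not value.isdigit():
                raise ValueError(f"expected {key}:<number>, got {word!r}")
            parsed[key] = min(int(value), caps[key])
        elif sep and value.isdigit():
            parsed[name] = int(value)       # other numeric options, e.g. ndots
        else:
            parsed[name] = True             # flag such as rotate
    return parsed


if __name__ == "__main__":
    print(parse_resolver_options("options timeout:1 attempts:1 rotate"))
```

Running such a check against the options line before rebooting a production host is cheap insurance against a typo in the file.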

Has anyone tried 0s as the timeout value? What is the implication? I tested this value, but does it mean the resolver will aggressively move on to the next nameserver? Our current DNS servers resolve quickly, in milliseconds. With 0s, does that mean that a DNS query taking 1s or longer will time out and go to the next resolver?

Anyone landing here: this discussion is from 2014 and was resurrected in 2018 and again in 2021.

Perry, it seems you found something useful, I'd agree it needs more testing.

Here is a possible additional reference: this Red Hat solution on adjusting timeouts between the resolvers listed in /etc/resolv.conf. Let us know how it goes. Our resolvers do not seem to have an issue, so our tests might be moot at this stage of a very old discussion from 2014 that occasionally garners interest over the years and may still matter to those having this issue. NOTE: each person's environment may be different, so there could be actual network issues or actual DNS issues contributing to this (google "DNS Haiku") that are unique to each person's situation.

Regards,
RJ