Bind 9.9.4-74.el7_6.1 breaking zone transfers

I updated a working BIND installation to this release, which addresses the vulnerability CVE-2018-5743.

Before the update, zone transfers to other systems were working fine. Once updated, the zones fail to transfer. The errors look like this:

xfer-out: error: client AAA.BBB.CCC.DDD #61783 (example.com): view internal: transfer of 'example.com/IN': aborted: timed out
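These show up in the named log, which on our box goes to syslog (the exact path depends on your logging configuration; /var/log/messages is just the RHEL default), so a quick way to pull them out is:

grep 'xfer-out' /var/log/messages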

lsof -Pni | grep named | grep WAIT | wc -l

typically shows 90 connections to the remote NS servers in CLOSE_WAIT state.

If I roll back to the prior version of bind, 9.9.4-73.el7_6, zone transfers work properly again.

We've tried the option "transfers-out 100;" and it makes no difference in the current release, 9.9.4-74.el7_6.1.
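For reference, this is where we set it, in the global options block of named.conf, applied with 'rndc reconfig' (the value shown is just what we tried, not a recommendation):

options {
    transfers-out 100;    // cap on concurrent outbound zone transfers
};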

dig with +trace confirms that newly added records are not being resolved by the remote DNS servers.
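For example, a check along these lines (newhost.example.com is just a placeholder for a freshly added record, and ns2.example.com stands in for one of the remote slaves):

dig +trace newhost.example.com A
dig @ns2.example.com newhost.example.com A +norecurse

If the slave is still serving the stale copy of the zone, the second query returns no answer while the master answers correctly.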

I may have to run this server on the older release until we figure out a better configuration or a bug fix arrives. Is anyone else seeing this problem? It isn't obvious unless you query one of the remote DNS servers for a freshly added record.

Responses

Yes, we also have this problem with large zones. Our current workaround is to point clients at the zone's master or to transfer the slave zone manually.
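In case it helps, forcing the transfer by hand on the slave looks something like this (example.com is the zone name from the original post, and master-ip is a placeholder for the master's address):

rndc retransfer example.com
dig @master-ip example.com AXFR

The retransfer makes the slave re-fetch the zone regardless of the serial, and the AXFR against the master confirms the master side can still serve the zone.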

Our zone files add up to about 70MB. We are also running the previous BIND release until Red Hat can fix it. We have not heard back on our Red Hat support ticket yet, but with a severity rating of low they have a few days before they respond.

The zone where we first discovered the failure is ~100MB, but we also saw it on our main signed zone, which is "only" 6MB. I think Red Hat should treat this as critical.

Running 'yum downgrade bind bind-libs bind-libs-lite bind-license bind-utils' on the master did the trick; 9.9.4-RedHat-9.9.4-73.el7_6 works.
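We also verified the rollback and kept yum from pulling the broken build back in on the next run (the exclude is just one way to do it; the versionlock plugin works too):

rpm -q bind
yum update --exclude='bind*'

The rpm query should report bind-9.9.4-73.el7_6 after the downgrade.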

Yes, having the same problem here. Initially it seemed to only affect using dig to AXFR a zone, but it is now affecting many normal master-to-slave zone transfers. We tried increasing tcp-clients, which is the setting the patch changes, but higher values do not help. The patch is clearly flawed.

I have some responses from RH support. They reference this ISC KB article, which is linked from the RH Bugzilla report:

https://kb.isc.org/docs/how-does-tcp-clients-work

The tcp-clients count now includes the number of listening interfaces, so some sites need to raise tcp-clients above the default of 100. 'rndc status' can confirm your usage.
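For anyone checking their own server, this is roughly what that looks like (200 is just an example value):

options {
    tcp-clients 200;    // default is 100; listening sockets now count against it
};

rndc status | grep -i 'tcp clients'

The 'tcp clients' line in the status output shows current use against the limit.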

In my case, the tcp clients count is not exceeded, and raising the limit to 200 has no effect. Every time we do an 'rndc reload' the number of hung transfers grows, which can be observed by watching 'lsof -Pni | grep named | grep TCP'.

It is looking like a bug to me. The transfer failures appear immediately after upgrading to the current BIND package, 9.9.4-RedHat-9.9.4-74.el7_6.1.

I received a response from a senior tech support person at Red Hat. The troubleshooting they want me to do is extensive, with tcpdump running at both the server and client ends. I don't have time for this debugging work, and I've already supplied them with enough information that they should be able to replicate the issue. If anyone has time to spend on this, I suggest opening a support ticket or bug report; perhaps this issue will get resolved some day.

I have experienced the same issue, which started right after yum updated bind to 9.9.4-74.el7_6.1. I read the CVE articles to find out about the change in how TCP clients are handled, and even tried changing my tcp-clients to 5000. It didn't help. This seems to affect nsupdate as well. Glad to find that this is a known issue. I will open my own ticket with Red Hat; perhaps that will help validate that this is a larger issue.

For anyone interested, a bug has been opened for this:
https://bugzilla.redhat.com/show_bug.cgi?id=1720703#
I also opened a ticket, and the bug was updated concurrently with a workaround that seems to work. I haven't tested it myself because I downgraded last night. The workaround is to set the following in the global options block: keep-response-order { any; };
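For clarity, this is where that goes in named.conf, based on the Bugzilla comment (again, I haven't tried it; apply with 'rndc reconfig' or a restart):

options {
    keep-response-order { any; };
};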

The workaround Kenneth mentions does work. It is only a workaround, though, and it may no longer be needed once the next patch fixes this.

Any update on when a new bind package that fixes this issue will be released?