rpc.statd hangs while starting, and repeatedly logs the message 'select: Bad file descriptor'
Environment
RHEL 7
nfs-utils
rpc-statd.service
Issue
In rare circumstances, a bug in rpc.statd may result in the service failing to find an available reserved udp port during startup, and becoming stuck in a loop, repeatedly attempting to use a port that it closed. When this happens, rpc.statd does not complete its startup process, and outputs the message 'rpc.statd[]: my_svc_run() - select: Bad file descriptor' to the system log.
Resolution
This bug has not been resolved in RHEL 7, but has been corrected for RHEL 8.
Red Hat considers the bug extremely unlikely to occur, and recommends that any customers concerned that they may experience the bug implement a workaround.
Suggested workarounds include:
-
remove udp services from the
/etc/services
file, particularly between port 604 and port 658 - this will allowrpc.statd
to use those ports, andrpc.statd
will find an available udp port for its use -
use NFS v4, which performs NFS locking through the same channel as NFS traffic, and does not require the separate
rpc.statd
for file locking -
update to RHEL 8, where the bug is fixed
-
monitor system messages for the
rpc.statd
error, and restart therpc-statd.service
when the errors are seen -
use
rsyslog
to detect therpc.statd
errors and restartrpc-statd.service
automatically
/etc/rsyslog.d/fix-rpc.statd.conf
#####
module(load="omprog")
if ($programname == "rpc.statd" and $msg contains "select: Bad file descriptor") then {
action(
type="omprog"
# execute this script when the condition is met
binary="/etc/rsyslog.d/fix-rpc.statd"
# rpc.statd outputs many errors very quickly, so
# only execute the script every 10 seconds
action.execOnlyOnceEveryInterval="10"
)
& stop
}
# discard a spurious message generated by executing the script
if ($programname == "rsyslogd" and $msg contains "has terminated, reaped by main-loop") then stop
#####
and the script can stop and restart the rpc-statd.service, logging a message to that effect:
/etc/rsyslog.d/fix-rpc.statd
#####
#!/bin/bash
echo "rpc.statd may have failed to find an available port... restarting the service" | logger -p daemon.notice -t fix-rpc.statd
sleep 0.25
(
systemctl stop rpc-statd.service 2>&1
sleep 0.25
systemctl start rpc-statd.service 2>&1
echo "Restarted rpc-statd.service"
) | logger -p daemon.notice -t fix-rpc.statd
#####
then restart rsyslog:
# systemctl restart rsyslog.service
Root Cause
The bug can occur because of the way that rpc.statd
chooses a udp port during startup.
rpc.statd
calls bindresvport()
to obtain an available reserved udp port (port between 600 and 1024). It then checks the /etc/services
file to determine whether any other services are likely to attempt to claim that reserved port. If the udp port is defined in the /etc/services
file, rpc.statd again calls bindresvport()
and tries again. If rpc.statd
gets 100 reserved ports, but all are defined in /etc/services
, or if no more reserved ports are available, rpc.statd
then closes the ports it has opened, with the intent to use the first port it opened.
However, due to the bug, rpc.statd also closes the port that it first received from bindresvport()
. When the program later attempts to use the closed port, it is invalid, and rpc.statd
repeatedly logs the 'select: Bad file descriptor' error.
The bug is resolved in the upstream nfs-utils package with the following commit:
commit 3a5732152c60f8cefaa804db0b81e424e96ee657
Author: Chuck Lever <chuck.lever@oracle.com>
Date: 2015-09-16 11:18:02 -0400
statd: statd_get_socket() should return open fds
Tastky <tastky@gmail.com> reports:
> There appears to be a bug in nfs-utils exposed by musl, which
> makes rpc.statd loop with:
>
> my_svc_run() - select: Bad file descriptor
OpenGroup says getservbyport(3) is supposed to return NULL when
no entry exists for the specified port. But musl's getservbyport(3)
never returns NULL (likely a bug).
Thus statd_get_socket() tries bindresvport(3) 100 times, then gives
up and returns the last socket it created. This should work fine,
but there's a bug in the retry loop:
Rich Felker <dalias@libc.org> says:
> The logic bug is the count-down loop that closes all the temp
> sockets. In the case where the loop terminates via break, it
> leaves the last one open and only closes the extras. But in the
> case where where the loop terminates via the end condition in the
> for statement, the close loop closes all the sockets _including_
> the one it intends to use.
(emphasis mine). The closed socket fd is then passed to select(2).
See also: http://www.openwall.com/lists/musl/2015/08
The fix is to perform the loop termination test before adding sockfd
to the set of fds to be closed. As additional clean ups, remove the
use of the variable-length stack array, and switch to variable names
that better document the purpose of this logic.
Reported-by: Tastky <tastky@gmail.com>
Fixes: eb8229338f06 ("rpc.statd: Fix socket binding loop.")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Steve Dickson <steved@redhat.com>
Diagnostic Steps
messages such as the following are repeatedly logged in the system's message log:
server.example.com rpc.statd[2960]: my_svc_run() - select: Bad file descriptor
server.example.com rpc.statd[2960]: my_svc_run() - select: Bad file descriptor
server.example.com rpc.statd[2960]: my_svc_run() - select: Bad file descriptor
server.example.com rpc.statd[2960]: my_svc_run() - select: Bad file descriptor
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
Comments