rpc.statd hangs while starting, and repeatedly logs the message 'select: Bad file descriptor'

Solution In Progress - Updated -

Environment

RHEL 7
nfs-utils
rpc-statd.service

Issue

In rare circumstances, a bug in rpc.statd may result in the service failing to find an available reserved udp port during startup, and becoming stuck in a loop, repeatedly attempting to use a port that it closed. When this happens, rpc.statd does not complete its startup process, and outputs the message 'rpc.statd[]: my_svc_run() - select: Bad file descriptor' to the system log.

Resolution

This bug has not been resolved in RHEL 7, but has been corrected for RHEL 8.

Red Hat considers the bug extremely unlikely to occur, and recommends that any customers concerned that they may experience the bug implement a workaround.

Suggested workarounds include:

  • remove udp services from the /etc/services file, particularly between port 604 and port 658 - this will allow rpc.statd to use those ports, and rpc.statd will find an available udp port for its use

  • use NFS v4, which performs NFS locking through the same channel as NFS traffic, and does not require the separate rpc.statd for file locking

  • update to RHEL 8, where the bug is fixed

  • monitor system messages for the rpc.statd error, and restart the rpc-statd.service when the errors are seen

  • use rsyslog to detect the rpc.statd errors and restart rpc-statd.service automatically

/etc/rsyslog.d/fix-rpc.statd.conf
#####
module(load="omprog")

if ($programname == "rpc.statd" and $msg contains "select: Bad file descriptor") then {
    action(
        type="omprog"

        # execute this script when the condition is met
        binary="/etc/rsyslog.d/fix-rpc.statd"

        # rpc.statd outputs many errors very quickly, so
        # only execute the script every 10 seconds
        action.execOnlyOnceEveryInterval="10"
    )
    & stop
}
# discard a spurious message generated by executing the script
if ($programname == "rsyslogd" and $msg contains "has terminated, reaped by main-loop") then stop
#####


and the script can stop and restart the rpc-statd.service, logging a message to that effect:

/etc/rsyslog.d/fix-rpc.statd
#####
#!/bin/bash

echo "rpc.statd may have failed to find an available port... restarting the service" | logger -p daemon.notice -t fix-rpc.statd
sleep 0.25
(
    systemctl stop rpc-statd.service 2>&1
    sleep 0.25
    systemctl start rpc-statd.service 2>&1
    echo "Restarted rpc-statd.service"
) | logger -p daemon.notice -t fix-rpc.statd
#####


then restart rsyslog:
# systemctl restart rsyslog.service

Root Cause

The bug can occur because of the way that rpc.statd chooses a udp port during startup.

rpc.statd calls bindresvport() to obtain an available reserved udp port (port between 600 and 1024). It then checks the /etc/services file to determine whether any other services are likely to attempt to claim that reserved port. If the udp port is defined in the /etc/services file, rpc.statd again calls bindresvport() and tries again. If rpc.statd gets 100 reserved ports, but all are defined in /etc/services, or if no more reserved ports are available, rpc.statd then closes the ports it has opened, with the intent to use the first port it opened.

However, due to the bug, rpc.statd also closes the port that it first received from bindresvport(). When the program later attempts to use the closed port, it is invalid, and rpc.statd repeatedly logs the 'select: Bad file descriptor' error.

The bug is resolved in the upstream nfs-utils package with the following commit:

commit 3a5732152c60f8cefaa804db0b81e424e96ee657
Author: Chuck Lever <chuck.lever@oracle.com>
Date:   2015-09-16 11:18:02 -0400

    statd: statd_get_socket() should return open fds

    Tastky <tastky@gmail.com> reports:
    > There appears to be a bug in nfs-utils exposed by musl, which
    > makes rpc.statd loop with:
    >
    > my_svc_run() - select: Bad file descriptor

    OpenGroup says getservbyport(3) is supposed to return NULL when
    no entry exists for the specified port. But musl's getservbyport(3)
    never returns NULL (likely a bug).

    Thus statd_get_socket() tries bindresvport(3) 100 times, then gives
    up and returns the last socket it created. This should work fine,
    but there's a bug in the retry loop:

    Rich Felker <dalias@libc.org> says:
    > The logic bug is the count-down loop that closes all the temp
    > sockets. In the case where the loop terminates via break, it
    > leaves the last one open and only closes the extras. But in the
    > case where where the loop terminates via the end condition in the
    > for statement, the close loop closes all the sockets _including_
    > the one it intends to use.

    (emphasis mine). The closed socket fd is then passed to select(2).

    See also: http://www.openwall.com/lists/musl/2015/08

    The fix is to perform the loop termination test before adding sockfd
    to the set of fds to be closed. As additional clean ups, remove the
    use of the variable-length stack array, and switch to variable names
    that better document the purpose of this logic.

    Reported-by: Tastky <tastky@gmail.com>
    Fixes: eb8229338f06 ("rpc.statd: Fix socket binding loop.")
    Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
    Signed-off-by: Steve Dickson <steved@redhat.com>

Diagnostic Steps

messages such as the following are repeatedly logged in the system's message log:

server.example.com rpc.statd[2960]: my_svc_run() - select: Bad file descriptor
server.example.com rpc.statd[2960]: my_svc_run() - select: Bad file descriptor
server.example.com rpc.statd[2960]: my_svc_run() - select: Bad file descriptor
server.example.com rpc.statd[2960]: my_svc_run() - select: Bad file descriptor

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.

Comments