RHEL NFS client cannot see all the files in a directory served by a third-party NFS server; the message "NFS: readdir reply truncated!" appears in the logs if rpc_debug is enabled or when using an older kernel

Solution Verified

Environment

  • RHEL 4 used as NFS client (other RHEL versions are likely to be affected as well)
  • Third-party NFS server (some unspecified versions of the Microsoft Windows NFS server are known to trigger the problem)
  • NFSv3 (other versions might be affected as well)

Issue

  • The NFS client cannot find some files in specific directories served by the server. With kernels affected by bug 443655, the following message may appear in the logs:
NFS: readdir reply truncated!
  • With kernels not affected by bug 443655, the message does not appear unless rpc_debug is enabled (for example with "echo 32767 > /proc/sys/sunrpc/rpc_debug").

  • Modifying the content of the directory resolves the issue temporarily.

  • Disabling caching with the NFS mount options "noac" or "lookupcache=none" does not resolve the issue.
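
The check for the log message can be sketched as a small shell helper. This is only an illustration: the function name count_readdir_truncated, the default log path /var/log/messages and the directory path in the comments are assumptions, not part of the original report. The rpc_debug value is the one quoted above.

```shell
#!/bin/sh
# Count occurrences of the truncated-READDIR message in a log file.
# The default path /var/log/messages is an assumption; pass another file
# if kernel messages are logged elsewhere on the client.
count_readdir_truncated() {
    logfile="${1:-/var/log/messages}"
    # -F matches the fixed string, -c prints the number of matching lines
    grep -c -F 'NFS: readdir reply truncated!' "$logfile"
}

# Possible use on the client, as root (not executed by this script):
#   echo 32767 > /proc/sys/sunrpc/rpc_debug   # enable RPC debugging
#   ls /path/to/nfs/directory                 # hypothetical path; reproduce the issue
#   count_readdir_truncated                   # count messages logged so far
#   echo 0 > /proc/sys/sunrpc/rpc_debug       # disable RPC debugging again
```

RPC debugging is verbose, so it is probably best left enabled only for the time needed to reproduce the problem.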

Resolution

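  • Fix the NFS server so that it sets the EOF flag when it ends a READDIR or READDIRPLUS listing, including when the last reply is an empty list.

  • Workaround: keep modifying the content of the directory (for example by creating and removing a file with "touch" and "rm").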
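
The workaround can be sketched as follows. This is a minimal sketch, assuming it runs on a host that has write access to the affected directory; the function name refresh_dir, the scratch file prefix, the directory path and the 60-second interval are all hypothetical choices, and how often the directory needs to be modified is not stated in the original report.

```shell
#!/bin/sh
# Modify a directory by creating and then removing a scratch file in it.
# The ".nfs-refresh" prefix is a hypothetical name; $$ (the shell's PID)
# makes a clash with an existing file unlikely.
refresh_dir() {
    dir="$1"
    scratch="$dir/.nfs-refresh.$$"
    # rm only runs if touch succeeded, so a failure is reported to the caller
    touch "$scratch" && rm -f "$scratch"
}

# Possible use (not executed by this script): refresh a hypothetical
# directory once a minute until interrupted.
#   while true; do refresh_dir /mnt/nfs/affected-dir; sleep 60; done
```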

Root Cause

  • As described in kernel commit 643f81115baca3630e544f6874567648b605efae, some NFS servers end their READDIR or READDIRPLUS directory listings with an empty list without setting the EOF flag (see RFC 1813 section 3.3.16 for the description of READDIR). The Linux kernel considers this a problem on the server side and adds the EOF itself. However, it seems that some NFS servers behave this way without actually ending the listing, apparently expecting the client to request further listing data later. This behaviour is ambiguous: the server should not send an empty list without setting EOF. Red Hat Support is happy to discuss this directly with any NFS server vendor who disagrees and insists on having READDIR return an empty list without actually ending the listing.
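
The three cases a client can face when it receives a READDIR or READDIRPLUS reply can be summarised in a small sketch. This is an illustration of the reasoning above, not the kernel's code; the function name classify_readdir_reply and the labels it prints are hypothetical.

```shell
#!/bin/sh
# Classify a READDIR/READDIRPLUS reply from two observations:
#   $1 = number of directory entries in the reply
#   $2 = EOF flag (1 = set, 0 = unset)
classify_readdir_reply() {
    entries="$1"
    eof="$2"
    if [ "$eof" -eq 1 ]; then
        echo "end-of-listing"          # server states that the listing is complete
    elif [ "$entries" -gt 0 ]; then
        echo "more-entries-expected"   # client asks again, starting at the last cookie
    else
        echo "empty-without-eof"       # ambiguous case: the Linux client treats it as EOF
    fi
}
```

Only the last case is problematic: the client has no new cookie to continue from, so it can either stop or repeat the same request.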

Diagnostic Steps

Capture the network traffic while reproducing the issue. When the client attempts to list the directory, the capture shows something like this (example from CRM#2025844):

$ /usr/sbin/tshark -V -ta -r nfs-splitted.pcap15

Frame 199162 (210 bytes on wire, 210 bytes captured)
    Arrival Time: Jun  2, 2010 10:33:51.205093000
[...]
Internet Protocol, Src: 192.168.250.20 (192.168.250.20), Dst: 192.168.250.32 (192.168.250.32)
[...]
Network File System, READDIRPLUS Call FH:0xd0449fa5
    [Program Version: 3]
    [V3 Procedure: READDIRPLUS (17)]
    dir
        [...]
        [hash: 0xd0449fa5]
        [...]
    cookie: 0
    [...]
    dircount: 512
    maxcount: 4096

Frame 199163 (1414 bytes on wire, 1414 bytes captured)
    Arrival Time: Jun  2, 2010 10:33:51.205235000
[...]
Internet Protocol, Src: 192.168.250.32 (192.168.250.32), Dst: 192.168.250.20 (192.168.250.20)
[...]
Network File System, READDIRPLUS Reply
    [Program Version: 3]
    [V3 Procedure: READDIRPLUS (17)]
    Status: NFS3_OK (0)
    [...]
    Value Follows: Yes
    Entry: name .
[...]
    Entry: name ACT_CONF__100513152311.log
        [...]
    Value Follows: No
    EOF: 0

[...more READDIR(PLUS) calls and their replies...]

Frame 199460 (210 bytes on wire, 210 bytes captured)
    Arrival Time: Jun  2, 2010 10:33:51.228304000
[...]
Internet Protocol, Src: 192.168.250.20 (192.168.250.20), Dst: 192.168.250.32 (192.168.250.32)
[...]
Network File System, READDIRPLUS Call FH:0xd0449fa5
    [Program Version: 3]
    [V3 Procedure: READDIRPLUS (17)]
    dir
        length: 32
        [hash: 0xd0449fa5]
        decode type as: unknown
        filehandle: D6000000000001001D00000000000100494C4FE152080000...
    cookie: 118354389
    Verifier: Opaque Data
    dircount: 512
    maxcount: 4096

Frame 199461 (202 bytes on wire, 202 bytes captured)
    Arrival Time: Jun  2, 2010 10:33:51.228396000
[...]
Internet Protocol, Src: 192.168.250.32 (192.168.250.32), Dst: 192.168.250.20 (192.168.250.20)
[...]
Network File System, READDIRPLUS Reply
    [Program Version: 3]
    [V3 Procedure: READDIRPLUS (17)]
    Status: NFS3_OK (0)
    dir_attributes  Directory mode:0700 uid:-2 gid:-2
        [...]
    Verifier: Opaque Data
    Value Follows: No
    EOF: 0

We can see that the listing works fine for a number of files until frame 199461 (the last one in this excerpt). In that frame, the returned list is empty but the EOF marker is unset. The NFS client then treats this as the end of the listing.
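
In a large capture, the offending replies can be located automatically. The following is a rough sketch that scans the verbose text output of tshark for READDIR or READDIRPLUS replies that contain no entry and have EOF set to 0. The function name find_empty_readdir_replies and the capture file name in the comment are hypothetical, and the line patterns are taken from the excerpt above; other tshark versions may format these lines differently, in which case the patterns need adjusting.

```shell
#!/bin/sh
# Print the frame number of every READDIR/READDIRPLUS reply that carries
# no directory entry and has the EOF flag unset.
# Input: output of "tshark -V", read from the given files or from stdin.
find_empty_readdir_replies() {
    awk '
        # Report the reply parsed so far if it was empty with EOF unset,
        # then reset the per-frame state.
        function report() {
            if (in_reply && entries == 0 && eof == "0") print frame
            in_reply = 0; entries = 0; eof = ""
        }
        # A new frame starts: report the previous one, remember the number
        # (a trailing colon, if present, is stripped).
        /^Frame [0-9]+/ { report(); frame = $2; sub(/:$/, "", frame) }
        /^Network File System, READDIR(PLUS)? Reply/ { in_reply = 1 }
        in_reply && /^[ \t]*Entry: name / { entries++ }
        in_reply && /^[ \t]*EOF: / { eof = $2 }
        END { report() }
    ' "$@"
}

# Possible use (not executed by this script), with a hypothetical file name:
#   tshark -V -r capture.pcap | find_empty_readdir_replies
```

Applied to the excerpt above, this reports frame 199461 only, since the reply in frame 199163 contains entries.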

This situation is described in fs/nfs/nfs3xdr.c and in commit 643f81115baca3630e544f6874567648b605efae (included in RHEL 4 as the resolution of bug 443655). That commit slightly modifies the behaviour of the client when this happens, but it does not actually fix the problem.

