NFS hangs during mount on MPI launch

Solution Unverified - Updated -

Issue

  • RHEL 6 machines fail to start orted on MPI launch.
  • NFS hangs appear to be causing the orted startup to time out.
  • Reportedly, this does not happen on RHEL 5.
  • When around 100 nodes simultaneously attempt to automount directories over NFS using TCP, a small number of the nodes stall while fetching required binaries and libraries.
  • The main effect appears when an MPI job is launched and each node attempts to locate and execute the orted daemon from the NFS share. Several of the orted daemons fail to start, and examining the affected nodes shows the mount.nfs binary sleeping in the wchan "rpc_wait_bit_killable".
  • Switching the mounts to the UDP protocol makes the problem disappear until around 800 nodes attempt the mount simultaneously.
  • Tests on RHEL 5 based clients using the same server and network infrastructure have not yet shown this behavior.
  • Error rates on the relevant networks appear to be within normal ranges. This behavior is seen on several RHEL 6 platforms with different client InfiniBand hardware, which suggests the issue is unlikely to be caused by vendor-specific IB driver code.
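
The wchan check described above can be sketched as a small script to run on a suspect node. This is a minimal illustration, not part of the original report: the function names `read_name` and `find_stalled_mounts` are hypothetical, and the script assumes the wait channel is reported as the kernel symbol `rpc_wait_bit_killable` and that `/proc/<pid>/wchan` is readable by the user running it.

```python
import os


def read_name(status_path):
    """Return the process name from the 'Name:' line of /proc/<pid>/status."""
    with open(status_path) as f:
        for line in f:
            if line.startswith("Name:"):
                return line.split(":", 1)[1].strip()
    return ""


def find_stalled_mounts(proc_root="/proc", comm_name="mount.nfs",
                        wchan_prefix="rpc_wait_bit_killable"):
    """Return a sorted list of (pid, wchan) for matching sleeping processes.

    proc_root is a parameter only so the function can be exercised against
    a fake directory tree; on a real node it stays at /proc.
    """
    stalled = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'self' or 'meminfo'
        base = os.path.join(proc_root, entry)
        try:
            name = read_name(os.path.join(base, "status"))
            with open(os.path.join(base, "wchan")) as f:
                wchan = f.read().strip()
        except (IOError, OSError):
            continue  # process exited or file not readable; ignore it
        if name == comm_name and wchan.startswith(wchan_prefix):
            stalled.append((int(entry), wchan))
    return sorted(stalled)


if __name__ == "__main__":
    for pid, wchan in find_stalled_mounts():
        print("%d\t%s" % (pid, wchan))
```

Running this across the nodes of a failed job (for example through whatever parallel shell the cluster already uses) would list the PIDs of mount.nfs processes stuck in the RPC wait, which may help correlate stalled mounts with the orted daemons that failed to start.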

Environment

  • Red Hat Enterprise Linux 6
