RHEL7.1: NFS4.1 client stuck in a loop with clustered NetApp NFS server, READ / WRITEs completing with NFS4ERR_BAD_STATEID, while TEST_STATEID completes with NFS4_OK

Solution Unverified

Issue

pNFS/NFS4.1: High amount of unexplained NFS I/O on compute node

I have deployed a Highly Available (HA) OpenStack cloud using the RHEL OSP Installer. The cloud consists of three controller nodes and two compute nodes, deployed with the default options: Neutron networking, RabbitMQ, and RHEL OSP 6 on RHEL 7.
Cinder is configured to use the NetApp driver with pNFS (NFS4.1). Storage Family selected is Clustered Data ONTAP.

One of the compute nodes is performing a high amount of unexplained NFS I/O. Note that the NFS storage is functioning correctly from an OpenStack perspective; however, the high volume of unexplained NFS I/O is itself a problem.
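To see which NFS operations account for the unexplained I/O, the per-mount, per-operation counters in `/proc/self/mountstats` can be summarized. The sketch below is illustrative only: the device line, counter values, and mount path are made-up sample data, and the field layout shown matches RHEL 7 era kernels (op name, then ops count first).

```python
# Sketch: summarize per-operation NFS counters from /proc/self/mountstats.
# Counter line layout on RHEL 7 kernels: OP: ops trans timeouts bytes_sent
# bytes_recv cumulative_queue cumulative_rtt cumulative_execute
# SAMPLE below is fabricated data standing in for the real file.
SAMPLE = """\
device 10.0.0.5:/vol0 mounted on /mnt/nfs with fstype nfs4 statvers=1.1
        READ: 148231 148231 0 23716960 607154176 120 98211 99540
        WRITE: 52110 52110 0 213442560 8337600 310 45120 46900
        TEST_STATEID: 96417 96417 0 15426720 7713360 5 8021 8100
"""

def per_op_counts(mountstats_text):
    """Return {operation: ops_count} for every per-op counter line."""
    counts = {}
    for line in mountstats_text.splitlines():
        fields = line.split()
        # Per-op lines start with "OPNAME:" followed by numeric counters.
        if len(fields) >= 2 and fields[0].endswith(":") and fields[1].isdigit():
            counts[fields[0].rstrip(":")] = int(fields[1])
    return counts

# In practice, read open("/proc/self/mountstats") instead of SAMPLE.
for op, n in sorted(per_op_counts(SAMPLE).items(), key=lambda kv: -kv[1]):
    print(op, n)
```

A TEST_STATEID count of the same order as READ/WRITE, and climbing while no workload change explains it, would be consistent with the stuck recovery loop described in the title.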

Please note that the issue is intermittent. For example, it has sometimes occurred in the past after a backend operation on the NetApp NFS server, e.g. a LIF failover or a controller failover. Also worth noting: both of our OpenStack clouds have been running fine under constant use since the last reported incident (September 7th), with no sign of the issue.
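The loop in the title can be illustrated with a minimal simulation (hypothetical code, not the actual RHEL client logic): the server rejects READ/WRITE with NFS4ERR_BAD_STATEID, but TEST_STATEID on the same stateid returns NFS4_OK, so the client's recovery path concludes the stateid is still valid and retries the I/O indefinitely.

```python
# Illustration only: why the client gets stuck. The server class models the
# observed inconsistent NetApp behaviour for one stale stateid.

NFS4_OK = 0
NFS4ERR_BAD_STATEID = 10025  # error value from RFC 5661

class InconsistentServer:
    """I/O path and state-checking path disagree about the same stateid."""
    def read(self, stateid):
        return NFS4ERR_BAD_STATEID   # READ/WRITE reject the stateid
    def test_stateid(self, stateid):
        return NFS4_OK               # TEST_STATEID says it is valid

def client_read(server, stateid, max_rounds=5):
    """Simplified recovery loop; a real client has no max_rounds cap."""
    for attempt in range(1, max_rounds + 1):
        if server.read(stateid) == NFS4_OK:
            return attempt           # I/O would normally succeed here
        # BAD_STATEID -> verify the stateid before discarding/recovering it
        if server.test_stateid(stateid) == NFS4_OK:
            continue                 # server says stateid is fine: retry I/O
        break                        # would trigger real state recovery
    return None                      # still looping when the cap is hit

print(client_read(InconsistentServer(), stateid=0xBEEF))  # prints None
```

Because neither leg of the loop ever changes the server's answers, the client re-issues the same READ/TEST_STATEID pair forever, which matches the high, unexplained NFS I/O seen on the compute node.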

Environment

  • Red Hat Enterprise Linux 7
    • seen on kernel-3.10.0-229.4.2.el7
  • Seen on NetApp Clustered Data ONTAP version 8.2.1P1 (NFS server)
  • NFS4.1
