Database instance crashes with ESTALE(116) disk read error


After running for a few hours, the database instance crashes. According to the Db2 diagnostic log, the crash is caused by an ESTALE (116) disk read error: Db2 could not access a file on the mount.

Kernel versions from three different target hosts that experience this issue:

ivapp1251440.devin1.ms.com /var/msdb2 1$ uname -a
Linux ivapp1251440.devin1.ms.com 3.10.0-693.47.2.el7.x86_64 #1 SMP Fri Apr 26 05:55:48 EDT 2019 x86_64 x86_64 x86_64 GNU/Linux

ivapp1192155.devin1.ms.com /var/msdb2 1$ uname -a
Linux ivapp1192155.devin1.ms.com 3.10.0-957.38.1.el7.x86_64 #1 SMP Thu Sep 26 12:15:44 EDT 2019 x86_64 x86_64 x86_64 GNU/Linux

ivapp1249822.devin1.ms.com /var/msdb2 1$ uname -a
Linux ivapp1249822.devin1.ms.com 3.10.0-693.47.2.el7.x86_64 #1 SMP Fri Apr 26 05:55:48 EDT 2019 x86_64 x86_64 x86_64 GNU/Linux

Syslog shows pdflush activity at precisely the same time that our pwrite fails in Db2:

Jan 13 20:53:49 ivapp1249822.devin1.ms.com kernel: 5e1a904fD OUTPUT packets: IN= OUT=eth0 SRC=10.85.37.20 DST=10.195.72.141 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=48528 DF PROTO=TCP SPT=44594 DPT=17263 WINDOW=29200 RES=0x00 SYN URGP=0 UID=0 GID=0
Jan 13 20:53:51 ivapp1249822.devin1.ms.com db2_command_exec.ksh[9336]: BatchJob -- Success
Jan 13 20:53:51 ivapp1249822.devin1.ms.com kernel: [135908.706401] nr_pdflush_threads exported in /proc is scheduled for removal
Jan 13 20:53:51 ivapp1249822.devin1.ms.com kernel: nr_pdflush_threads exported in /proc is scheduled for removal
Jan 13 20:53:51 ivapp1249822.devin1.ms.com db2_command_exec.ksh[9752]: BatchJob -- Success
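
To check this correlation on the other hosts, something like the following could pull every syslog line that falls within a few seconds of the failure time reported by Db2. This is only a minimal sketch: the function name `lines_near` and the default window are made up for illustration, and it assumes the traditional syslog timestamp format seen in the syslog excerpt above, which carries no year, so the year is taken from the failure time passed in.

```python
import re
from datetime import datetime, timedelta

# Traditional syslog timestamp, e.g. "Jan 13 20:53:51". A leading viewer
# line number before the timestamp is tolerated because we search, not match.
_STAMP = re.compile(r"\b([A-Z][a-z]{2})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})\b")


def lines_near(lines, failure_time, window_seconds=5):
    """Return the syslog lines stamped within window_seconds of failure_time.

    failure_time is a datetime taken from the Db2 diagnostic log; its year is
    reused for the syslog lines because syslog stamps omit the year.
    """
    window = timedelta(seconds=window_seconds)
    matches = []
    for line in lines:
        found = _STAMP.search(line)
        if not found:
            continue  # not a timestamped syslog line
        month, day, clock = found.groups()
        try:
            stamp = datetime.strptime(
                f"{failure_time.year} {month} {day} {clock}",
                "%Y %b %d %H:%M:%S",
            )
        except ValueError:
            continue  # looked like a timestamp but was not a valid date
        if abs(stamp - failure_time) <= window:
            matches.append(line)
    return matches
```

It would be called with the lines of /var/log/messages (or wherever syslog is written on the host) and the failure timestamp copied from the Db2 diagnostic log.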

We have also raised an IBM Db2 support case to investigate the issue from the database side. IBM responded with the following comments:

Please note that Db2 just received an error from operating system and reported this error in database logs.
If you see entry with LEVEL: Error (OS), this means Db2 just reports errors from underlying layers like file systems, disks etc.
In this case Db2 was not able to write to a file due to ESTALE error.
That means Db2 was just a victim of the issue that happened on the layer below.
You can check what this error means using command
man errno

Errors are defined in errno.h header file in Linux/Unix and error ESTALE means:

#define ESTALE 116 /* Stale NFS file handle */

As Db2 is just a victim of the issue on the underlying layer we cannot investigate what happened here.
You cannot investigate from the application level what happened on the file system.
You need to do that checking operating system logs. For that you can discuss this with OS admin.
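
The errno lookup mentioned in the comments above can also be done from a script, which is handy when checking several hosts. A minimal sketch, assuming Python is available on the host; the helper name `describe_errno` is made up for illustration. Numeric errno values are platform specific: 116 is the Linux value for ESTALE, and the message text varies with the C library version.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, OS message) for a numeric errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


# 116 is the value reported in the Db2 diagnostic log on our Linux hosts.
name, message = describe_errno(116)
print(f"errno 116: {name} ({message})")
```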

We also want to capture the following traces/logs. Please let us know the best way to capture them:

  1. A packet trace of NFS-related activity.
  2. An OS-level audit log (what commands were run at around that time).

Also, please suggest what we can do on Red Hat to trace NFS at the kernel level. Would something like the following be appropriate?
rpcdebug -m nfs -s all
rpcdebug -m rpc -s call

Responses