RHEL6: System hangs when an application uses direct I/O on XFS
Issue
- We seem to be hitting an issue similar to bug #695827 with RHEL 6 and XFS, where direct I/O writes from the database become blocked.
- I can see many XFS-related kernel trace messages in the messages file (attached). The blocked process does not recover for a long time, and we mostly had to reboot the server.
- The same program runs fine with RHEL6 and ext4.
- The same program runs fine on RHEL5.6 and xfs.
- Workload / test which triggers hang
- The IO workload is "4k, random, write only, 12 threads, directio"
- Workload is 99% writes, random 4k (page writes) across files and within the same file, with parallelism
- Heavy parallelism, including concurrent writes to the same file (gut feel is that the bug is related to parallelism)
- log files: opened in non-direct mode, append / read (fairly small writes 64k); append to the end, read chunks from the middle
- non-log files: opened in DIRECT mode; random 4k writes, often to same file, a lot of parallelism
- Reproducibility
- can reproduce it after about an hour 9 times out of 10
- unable to write a simplified, synthetic test program to trigger the hang
- Reproduced on a local SSD (volume size probably 2 TB), without multipath, on different storage, etc. The same test runs fine with ext4.
- In all cases a fairly large striped LVM logical volume sat underneath XFS, but the storage varied
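For illustration only, the non-log part of the workload described above (12 threads issuing random 4k direct I/O writes into the same file) can be sketched as follows. This is not a reproducer: as noted, no simplified synthetic test is known to trigger the hang. The function names, the `direct` switch, and the default write count are assumptions made for this sketch, and the buffered log-file appends and reads are left out.

```python
import mmap
import os
import random
import threading

BLOCK = 4096  # 4k writes, as in the reported workload


def aligned_offset(rng, file_size, block=BLOCK):
    """Pick a random block-aligned offset that lies inside the file."""
    return rng.randrange(file_size // block) * block


def writer(path, file_size, n_writes, direct, counts, idx):
    """One worker thread: random 4k writes into the shared file."""
    flags = os.O_WRONLY
    if direct:
        flags |= os.O_DIRECT  # Linux-only; bypasses the page cache
    fd = os.open(path, flags)
    # An anonymous mmap is page-aligned, which O_DIRECT requires.
    buf = mmap.mmap(-1, BLOCK)
    buf.write(bytes([idx % 256]) * BLOCK)
    rng = random.Random(idx)  # per-thread seed (hypothetical choice)
    try:
        for _ in range(n_writes):
            counts[idx] += os.pwrite(fd, buf, aligned_offset(rng, file_size))
    finally:
        buf.close()
        os.close(fd)


def run_workload(path, file_size, threads=12, n_writes=1000, direct=True):
    """Run all writers against one file; return total bytes written."""
    # Size the target file (sparse) so every generated offset is valid.
    with open(path, "wb") as f:
        f.truncate(file_size)
    counts = [0] * threads
    workers = [
        threading.Thread(target=writer,
                         args=(path, file_size, n_writes, direct, counts, i))
        for i in range(threads)
    ]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    return sum(counts)
```

A call such as `run_workload("/mnt/xfs/testfile", 1 << 30)` (the path is hypothetical) would target a 1 GiB file on the file system under test. Passing `direct=False` uses buffered writes, which is useful on file systems that reject `O_DIRECT`.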
Environment
- Red Hat Enterprise Linux 6.2 - 6.3
- Any kernel prior to 2.6.32-279.19.1.el6
- Seen on 2.6.32-220.2.1.el6, 2.6.32-279.5.2.el6
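A rough way to check whether a given RHEL 6 kernel release string falls in the range listed above is sketched below. It is a simplified numeric comparison, not RPM's full version-comparison algorithm, and it is only meaningful for RHEL 6 (`el6`) kernels. The names `BOUNDARY`, `release_key`, and `is_affected` are made up for this example.

```python
import re

# Boundary taken from the Environment list: kernels prior to this are affected.
BOUNDARY = "2.6.32-279.19.1.el6"


def release_key(release):
    """Turn '2.6.32-279.5.2.el6.x86_64' into a tuple of ints for comparison."""
    # Keep only the numeric fields that precede the '.el6' dist tag.
    head = release.split(".el", 1)[0]
    return tuple(int(n) for n in re.split(r"[.-]", head))


def is_affected(release, boundary=BOUNDARY):
    """True if the release sorts before the boundary kernel."""
    return release_key(release) < release_key(boundary)
```

On a live system the running kernel's release string is available from `platform.release()`, so `is_affected(platform.release())` would apply the check to the current kernel.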