
XFS in RHEL7 gonna be a good experience?

It is kind of interesting that it appears XFS will be the default filesystem in RHEL7?!?! Until recent fixes in RHEL 6.3 and later, there were many, many problems with XFS on RHEL6. Even now, things are not good when a filesystem holds many small files.

Wheeler Billion Files PDF

I would assume Red Hat knows what it is doing by making this change, but geez, I hope it actually works!

For a taste of XFS fun in RHEL6, see this

bugzilla 813137

I assume Red Hat is testing RHEL7 with an XFS filesystem that has a billion files on it!?!?

Responses

Hi Daryl,

Ric Wheeler (who wrote the paper) is the architect/manager for Red Hat's file system team and was personally involved in choosing XFS as the default. In addition, Red Hat Storage (when Ric managed that team) followed the lead of many storage partners and, more than a year ago, made XFS the only file system supported in Red Hat Storage 2.0.

Regarding the billion files testing, Dave Chinner gave a talk several years back where he crushed the billion file problem:

http://lwn.net/Articles/476263/

As to the specific BZ mentioned, it was found during internal testing by Red Hat's Quality Engineering team (no customer cases reported against it) and fixed a few weeks later.

All file systems have bugs in them; Red Hat does routine power-failure testing on all of our file systems and has full confidence in XFS.

Of course, in RHEL 7 users will be able to stay on ext4 if that works better for them, and it will continue to be fully supported.

Hope this helps!

Andrius.

Hi Andrius,

Thanks for taking the time to respond. Regarding the BZ mentioned, I was reproducing the issue and was thankful when the errata came out for it.

my note of the troubles on rhel6 list

We are an academic customer, so we don't have RHEL support for things outside of our RHN satellite or proxy server.

daryl

PS. I never got notification of your post, even though I am subscribed to it. Will report that issue.

Thanks Daryl. I'll investigate the notifications issue as discussed in our email exchange.

It may be that you don't have notifications enabled. If you click through to your profile you should be able to see a Notifications tab, which will help you identify which content you have enabled email notifications on. To subscribe to notifications on a particular piece of content you can click the "Notify me when this document is updated or commented" checkbox below the comment entry window, or you can click the "Follow" link to the right of the page, which provides more detailed options, allowing you to subscribe to all discussions, particular authors, and so on.

Thanks. I emailed you screenshots of my notification settings.

As sort of an illustration: on a single SATA drive with ~32 million files (650 GB total), on a close-to-fully-updated RHEL 6.4 system with XFS and default mount options,

chattr -R +d directory

took nearly 4 hours, pegged system %iowait at times, and emitted a number of kernel tracebacks, including:

xfsaild/sdb1: page allocation failure. order:3, mode:0x20
Pid: 2759, comm: xfsaild/sdb1 Not tainted 2.6.32-358.14.1.el6.x86_64 #1
Call Trace:
 <IRQ>  [<ffffffff8112c197>] ? __alloc_pages_nodemask+0x757/0x8d0
 [<ffffffff81384886>] ? ata_sg_clean+0x66/0xd0
 [<ffffffff81166b42>] ? kmem_getpages+0x62/0x170
 [<ffffffff8116775a>] ? fallback_alloc+0x1ba/0x270
 [<ffffffff811671af>] ? cache_grow+0x2cf/0x320
 [<ffffffff811674d9>] ? ____cache_alloc_node+0x99/0x160
 [<ffffffff811686a0>] ? kmem_cache_alloc_node_trace+0x90/0x200
...snip...

I don't see these types of things with ext4.

Hi Daryl,

I think at this point it might be best if you open a support case so that our file system experts can take a closer look at this.

Of course if any other community members have any comments - these are also welcome! :-)

Andrius.

Thanks Andrius,

We are academic customers and have not purchased that level of support. I just boggle when I hit XFS issues doing basic stuff and wonder why others do not seem to see this or perhaps they do not notice :)

daryl

Just out of curiosity: what is it you're attempting to accomplish by recursively setting the no-dump attribute?

I wanted to start making backups on a filesystem, but wanted to omit a certain folder. Just setting that attribute on the base folder only impacts new files and folders created inside of it. I needed it set for the folder and everything inside of it.
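
As a quick illustration (the path below is a placeholder; lsattr comes from e2fsprogs but reads the same flag on XFS):

chattr -R +d /data/nobackup      # set the no-dump flag on the directory and everything already in it
lsattr -R /data/nobackup | head  # existing files should now show the 'd' flag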

Yep, it should have been fine. It's even documented in the manpage:

       The chattr(1) command can be used to set this
       attribute on individual files or entire subtrees.

       To tag an individual file for exclusion from the dump:

            $ chattr +d file

       To tag all files in a subtree for exclusion from the dump:

            $ chattr -R +d directory

Cool. I haven't used dump/xfsdump-based backups in a dog's age.

Attempting to xfsdump on another machine:

Overriding dump level with 0
xfsdump: using file dump (drive_simple) strategy
xfsdump: version 3.0.4 (dump format 3.0) - Running single-threaded
xfsdump: level 0 dump of ....iastate.edu:/export/brick1
xfsdump: dump date: Tue Sep 10 15:14:05 2013
xfsdump: session id: 679d5cd7-b3fa-4e4e-ba87-328e8d9d5bee
xfsdump: session label: ""
xfsdump: ino map phase 1: constructing initial dump list
xfsdump: ino map phase 2: skipping (no pruning necessary)
xfsdump: ino map phase 3: skipping (only one dump stream)
xfsdump: ino map construction complete
xfsdump: estimated dump size: 38556819264 bytes
xfsdump: /var/lib/xfsdump/inventory created
xfsdump: creating dump session media file 0 (media 0, file 0)
xfsdump: dumping ino map
xfsdump: dumping directories
xfsdump: dumping non-directory files
xfsdump: WARNING: could not get list of non-root attributes for nondir ino 2149391481: Cannot allocate memory (12)
xfsdump: WARNING: could not get list of non-root attributes for nondir ino 2149391539: Cannot allocate memory (12)
xfsdump: page allocation failure. order:4, mode:0xd0
Pid: 29263, comm: xfsdump Not tainted 2.6.32-393.el6.snip.x86_64 #1
Call Trace:
 [<ffffffff8112d737>] ? __alloc_pages_nodemask+0x757/0x8d0
 [<ffffffff81168132>] ? kmem_getpages+0x62/0x170
 [<ffffffff81168d4a>] ? fallback_alloc+0x1ba/0x270
 [<ffffffff8116879f>] ? cache_grow+0x2cf/0x320
..snip...

These things never give me warm fuzzies about RHEL and XFS.

Hi Daryl - unfortunately your ...snip...s are often strategically placed to remove useful information. ;)

xfsdump tends to do some large allocations in the kernel, and if there's not a lot of memory or it has become fragmented, they can fail. For these larger allocations we can fall back to vmalloc, and most (all?) of these paths have been fixed upstream and in RHEL7. From your backtrace above we can't see what path it was on, but I think your RHEL6 kernel is still missing a fallback vmalloc for the xfs_attrlist_by_handle() path, which was just fixed upstream (by yours truly) in 3.10 and is on its way to RHEL6.
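
As a rough way to see that kind of fragmentation (just an illustration, not a diagnosis of this particular box): /proc/buddyinfo lists the free physically contiguous blocks in each zone, one column per order, and when the higher-order columns approach zero, order-3 and order-4 allocations like the ones in these tracebacks can start to fail.

cat /proc/buddyinfo   # columns run from order 0 (4 KiB blocks) up to order 10 (4 MiB blocks)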

Unfortunately, if you're not in a position to file bugs w/ support, it's harder for us to know about the issues you've run into, and to fix them. AFAIK that path hasn't been reported by other customers, so there may be something unique about your usecase.

Anyway, I'm delving into support here, maybe not the right venue. But I will look into the chattr -R case too, that one is interesting; generally XFS is far superior at those sorts of traversal workloads.

Hi Eric,

Thanks for the detailed response. If useful, I'd be happy to open bugzillas on these. For posterity, here's the full kernel message for the xfsdump case (I'm running a test 6.5 kernel for the bz shown in this case):

xfsdump: page allocation failure. order:4, mode:0xd0
Pid: 29263, comm: xfsdump Not tainted 2.6.32-393.el6.bz973122.x86_64 #1
Call Trace:
 [<ffffffff8112d737>] ? __alloc_pages_nodemask+0x757/0x8d0
 [<ffffffff81168132>] ? kmem_getpages+0x62/0x170
 [<ffffffff81168d4a>] ? fallback_alloc+0x1ba/0x270
 [<ffffffff8116879f>] ? cache_grow+0x2cf/0x320
 [<ffffffff81168ac9>] ? ____cache_alloc_node+0x99/0x160
 [<ffffffffa04c1055>] ? xfs_attrlist_by_handle+0xb5/0x120 [xfs]
 [<ffffffff81169899>] ? __kmalloc+0x189/0x220
 [<ffffffffa04c1055>] ? xfs_attrlist_by_handle+0xb5/0x120 [xfs]
 [<ffffffffa04c1eab>] ? xfs_file_ioctl+0x67b/0x970 [xfs]
 [<ffffffff8151c2c6>] ? down_read+0x16/0x30
 [<ffffffffa049779d>] ? xfs_iunlock+0x9d/0xd0 [xfs]
 [<ffffffffa04b5a0f>] ? xfs_free_eofblocks+0xef/0x2e0 [xfs]
 [<ffffffff8100b9ce>] ? common_interrupt+0xe/0x13
 [<ffffffff8100b9ce>] ? common_interrupt+0xe/0x13
 [<ffffffff81197102>] ? vfs_ioctl+0x22/0xa0
 [<ffffffff81197289>] ? do_vfs_ioctl+0x69/0x580
 [<ffffffff811972a4>] ? do_vfs_ioctl+0x84/0x580
 [<ffffffff81184271>] ? __fput+0x1a1/0x210
 [<ffffffff81197821>] ? sys_ioctl+0x81/0xa0
 [<ffffffff810dd9fe>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Mem-Info:
Node 0 DMA per-cpu:
CPU    0: hi:    0, btch:   1 usd:   0
CPU    1: hi:    0, btch:   1 usd:   0
Node 0 DMA32 per-cpu:
CPU    0: hi:  186, btch:  31 usd: 159
CPU    1: hi:  186, btch:  31 usd:   0
Node 0 Normal per-cpu:
CPU    0: hi:  186, btch:  31 usd: 184
CPU    1: hi:  186, btch:  31 usd:   0
active_anon:12474 inactive_anon:38158 isolated_anon:0
 active_file:186514 inactive_file:555460 isolated_file:0
 unevictable:0 dirty:8 writeback:0 unstable:0
 free:66159 slab_reclaimable:27959 slab_unreclaimable:71525
 mapped:3600 shmem:41 pagetables:1871 bounce:0
Node 0 DMA free:15740kB min:248kB low:308kB high:372kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15356kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 3254 4012 4012
Node 0 DMA32 free:132276kB min:54620kB low:68272kB high:81928kB active_anon:21188kB inactive_anon:88772kB active_file:601268kB inactive_file:1945956kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3333032kB mlocked:0kB dirty:16kB writeback:0kB mapped:7016kB shmem:8kB slab_reclaimable:89572kB slab_unreclaimable:233208kB kernel_stack:272kB pagetables:2100kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 757 757
Node 0 Normal free:116620kB min:12708kB low:15884kB high:19060kB active_anon:28708kB inactive_anon:63860kB active_file:144788kB inactive_file:275884kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:775680kB mlocked:0kB dirty:16kB writeback:0kB mapped:7384kB shmem:156kB slab_reclaimable:22264kB slab_unreclaimable:52892kB kernel_stack:1872kB pagetables:5384kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 3*4kB 2*8kB 2*16kB 2*32kB 2*64kB 1*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15740kB
Node 0 DMA32: 18541*4kB 148*8kB 52*16kB 29*32kB 36*64kB 339*128kB 37*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 132276kB
Node 0 Normal: 16899*4kB 4680*8kB 568*16kB 12*32kB 1*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 116620kB
742499 total pagecache pages
435 pages in swap cache
Swap cache stats: add 2937, delete 2502, find 160381/160487
Free swap  = 3974856kB
Total swap = 3981304kB
1048575 pages RAM
68254 pages reserved
158226 pages shared
767324 pages non-shared

Here's the chattr case:

xfsaild/sdb1: page allocation failure. order:3, mode:0x20
Pid: 2759, comm: xfsaild/sdb1 Not tainted 2.6.32-358.14.1.el6.x86_64 #1
Call Trace:
 <IRQ>  [<ffffffff8112c197>] ? __alloc_pages_nodemask+0x757/0x8d0
 [<ffffffff81384886>] ? ata_sg_clean+0x66/0xd0
 [<ffffffff81166b42>] ? kmem_getpages+0x62/0x170
 [<ffffffff8116775a>] ? fallback_alloc+0x1ba/0x270
 [<ffffffff811671af>] ? cache_grow+0x2cf/0x320
 [<ffffffff811674d9>] ? ____cache_alloc_node+0x99/0x160
 [<ffffffff811686a0>] ? kmem_cache_alloc_node_trace+0x90/0x200
 [<ffffffff811688bd>] ? __kmalloc_node+0x4d/0x60
 [<ffffffff8143db5d>] ? __alloc_skb+0x6d/0x190
 [<ffffffff8143ec80>] ? skb_copy+0x40/0xb0
 [<ffffffffa028f27c>] ? tg3_start_xmit+0xa8c/0xd50 [tg3]
 [<ffffffff81449168>] ? dev_hard_start_xmit+0x308/0x530
 [<ffffffff814674ea>] ? sch_direct_xmit+0x15a/0x1c0
 [<ffffffff8144ce70>] ? dev_queue_xmit+0x3b0/0x550
 [<ffffffff814855a8>] ? ip_finish_output+0x148/0x310
 [<ffffffff81485828>] ? ip_output+0xb8/0xc0
 [<ffffffff81484aef>] ? __ip_local_out+0x9f/0xb0
 [<ffffffff81484b25>] ? ip_local_out+0x25/0x30
 [<ffffffff81485000>] ? ip_queue_xmit+0x190/0x420
 [<ffffffff81499d0e>] ? tcp_transmit_skb+0x40e/0x7b0
 [<ffffffff8149c11b>] ? tcp_write_xmit+0x1fb/0xa20
 [<ffffffff8149cad0>] ? __tcp_push_pending_frames+0x30/0xe0
 [<ffffffff81494563>] ? tcp_data_snd_check+0x33/0x100
 [<ffffffff814981ad>] ? tcp_rcv_established+0x3ed/0x800
 [<ffffffff814a01a3>] ? tcp_v4_do_rcv+0x2e3/0x430
 [<ffffffffa0385557>] ? ipv4_confirm+0x87/0x1d0 [nf_conntrack_ipv4]
 [<ffffffff814a1a2e>] ? tcp_v4_rcv+0x4fe/0x8d0
 [<ffffffff8147f6d0>] ? ip_local_deliver_finish+0x0/0x2d0
 [<ffffffff8147f7ad>] ? ip_local_deliver_finish+0xdd/0x2d0
 [<ffffffff8147fa38>] ? ip_local_deliver+0x98/0xa0
 [<ffffffff8147eefd>] ? ip_rcv_finish+0x12d/0x440
 [<ffffffff8147f485>] ? ip_rcv+0x275/0x350
 [<ffffffff8144865b>] ? __netif_receive_skb+0x4ab/0x750
 [<ffffffff8149ed4a>] ? tcp4_gro_receive+0x5a/0xd0
 [<ffffffff8144aa38>] ? netif_receive_skb+0x58/0x60
 [<ffffffff8144ab40>] ? napi_skb_finish+0x50/0x70
 [<ffffffff8144d0e9>] ? napi_gro_receive+0x39/0x50
 [<ffffffffa028bc14>] ? tg3_poll_work+0x784/0xe50 [tg3]
 [<ffffffffa029a370>] ? tg3_poll+0x70/0x440 [tg3]
 [<ffffffff81398587>] ? ata_sff_hsm_move+0x197/0x740
 [<ffffffff8144d203>] ? net_rx_action+0x103/0x2f0
 [<ffffffff81398bed>] ? ata_sff_host_intr+0xbd/0x1a0
 [<ffffffff81076fd1>] ? __do_softirq+0xc1/0x1e0
 [<ffffffff810e1690>] ? handle_IRQ_event+0x60/0x170
 [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
 [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
 [<ffffffff81076db5>] ? irq_exit+0x85/0x90
 [<ffffffff81517335>] ? do_IRQ+0x75/0xf0
 [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
 <EOI>  [<ffffffffa046678d>] ? xfs_iunlock+0x9d/0xd0 [xfs]
 [<ffffffffa046caa7>] ? xfs_inode_item_pushbuf+0x87/0xe0 [xfs]
 [<ffffffffa046cc8f>] ? xfs_inode_item_trylock+0x5f/0xa0 [xfs]
 [<ffffffffa0480f3e>] ? xfsaild_push+0x2ce/0x5e0 [xfs]
 [<ffffffff8150eb7a>] ? schedule_timeout+0x19a/0x2e0
 [<ffffffffa049495a>] ? xfsaild+0x9a/0xf0 [xfs]
 [<ffffffffa04948c0>] ? xfsaild+0x0/0xf0 [xfs]
 [<ffffffff81096956>] ? kthread+0x96/0xa0
 [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
 [<ffffffff810968c0>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

daryl

So for xfsdump, my hunch was right; it's on the xfs_attrlist_by_handle() path. Fix is upstream, & pending for RHEL6.

I'll take a further look at the chattr backtrace. And yes, I suppose filing a bugzilla for that would be good - even if you can't file a support case, at least it logs the issue somewhere permanent & more findable than the community forums. :)

edit: The xfsaild / chattr allocation failure actually seems to be down the tg3 networking stack, taken on an interrupt, rather than from xfs though, FWIW.

Thanks,
-Eric

Hi,
I've just been searching through the RHEL forums for info on the XFS filesystem and found this topic.

In regard to the topic, I'd like to ask:

  1. Why the move from ext4 to XFS? I've been googling it and can't find a clear answer. Was it because of technical issues with ext4?

  2. Do you have links to Red Hat papers on the proper deployment of XFS, and is there a performance comparison of ext4 vs. XFS?

  3. Is it safe to go with XFS for the '/' (root) FS? How about FS recovery when the server crashes and '/' is not accessible?

I didn't work with XFS previously and don't know much about it except hearsay or what I've read in different articles, papers, etc.

I simply need to soak up as much info as possible before deciding whether XFS can go on production servers, and if so, which workloads it is best suited for: small files, big files, root FS, non-root FSs, etc. Many questions there...

Thanks, guys

So I sit here this evening on XFS on RHEL 7.1 with a 2 TB filesystem, attempting to remove a 16 GB file tree with ~120,000 files (rm -rf mydir). The process has taken 2 hours so far, system iowait is quite high, and the system is very sluggish. This is about the same experience I have seen with RHEL6 and earlier. It appears the process will finish within the next hour or so, but uffties!

The only filesystem mount option I have set is noatime, and I have not touched system tuning at this point. I'll google around for a while looking for workarounds, but it is just frustrating to see such a basic thing struggle like this on stock RHEL.
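
If it helps anyone reproduce or observe this, something along these lines (run while the rm is going; watch the device backing the XFS filesystem) shows where the time is spent - a rough illustration only:

iostat -xm 5   # extended per-device stats every 5 seconds: watch %util, await and w/s on the XFS device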

Not specifically related to the thread, but filesystem related:

Slides from the Linux Foundation Vault 2015 conference are now posted here:
http://events.linuxfoundation.org/events/vault

Interesting slides on history/future of XFS and scaling ext4.

I've been purposely avoiding using XFS for root (/) and /var in RHEL7, even though I've been a long-standing XFS on Linux advocate (since 2001, especially for data volumes >1TiB), because of my experiences with system utilization, especially /var.

Unfortunately I'm at a new client/partner where XFS was used for /var on a RHEL7 system, running Satellite 6 no less. If you don't know much about Satellite 6, just understand that some components create a crapload of tiny files. Coming from Satellite 5, I have always advocated breaking up various /var subdirectories for Satellite 6 as well, but in this case the client/partner used a single /var for everything. Pulp isn't the main issue, because RPMs vary enough in size that XFS's extent-based approach works fairly well. The larger issue is the other components that only use 10-20 GiB but have millions of small files, which is why I prefer to put even those directories on their own file systems (I know some Red Hat SMEs disagree).
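
Just to illustrate the kind of split I mean (device names, directories and fs types below are only examples, not a recommendation for any particular layout), the fstab entries end up looking something like:

# illustration only - placeholders throughout
/dev/vg_sat/lv_pulp      /var/lib/pulp       xfs    defaults   0 0
/dev/vg_sat/lv_mongodb   /var/lib/mongodb    xfs    defaults   0 0
/dev/vg_sat/lv_pgsql     /var/lib/pgsql      xfs    defaults   0 0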

Now we cannot mount /var, and are left with the dreaded xfs_repair -L as the only remaining option, after xfs_repair -n pointed out the obvious. I'm still doing my best to get the file system to mount so we can get a clean log for repair, but it's not looking good.
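
For anyone who hits the same thing, the usual order of operations is roughly this (device name is a placeholder; -L really is the last resort, since it throws away whatever is still in the journal):

mount -o ro /dev/vg_sys/lv_var /var    # a successful mount replays the log cleanly
xfs_repair -n /dev/vg_sys/lv_var       # dry run: report problems, change nothing on disk
xfs_repair -L /dev/vg_sys/lv_var       # last resort: zero the log, then repair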

Bryan, please contact your RHEL support representative so they can help you with this matter by triaging the problem and filing a bug if appropriate.

Unfortunately we're running the upstream components of Satellite 6 (largely for features, as Sat 6.2 didn't hit until just a few weeks ago) and not the Red Hat release (again, long story), which complicates the matter, even though the platform itself is RHEL7 Server. I'm currently trying to do all I can to get the file system to mount, so we'll at least recover the logs.

I continue to struggle with XFS and small files. I recently gave up on a 700 GB XFS filesystem with a few hundred thousand files in a tree; a simple find operation would take 4 hours to complete. On ext4, it takes a few minutes. I have both filesystems on the same VM now for a side-by-side comparison, although the old XFS one gets no production load. Here's an example on a tree with 46,042 files spread over 72 subfolders.

# time find /xfs/mydir -type f -print > /dev/null

real    4m34.052s
user    0m0.436s
sys 0m7.076s

# time find /ext4/mydir -type f -print > /dev/null

real    0m0.590s
user    0m0.029s
sys 0m0.033s

I recently found out that our organization now has a TAM and I was encouraged to open a support request on this. So I'll try that and see what happens.

This is because your XFS filesystem does not store the filetype in the directory entries, so every inode in the tree must be stat'd (read) to determine the filetype when you use the "-type f" qualifier. This is much slower than just reading directory information.

In RHEL 7.3, mkfs.xfs will enable filetypes by default. You can do so today with "mkfs.xfs -n ftype=1".

Thanks,
-Eric
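
To illustrate (mount point and device below are placeholders), you can check whether an existing filesystem was created with ftype, and make a new one with it enabled, roughly like this:

xfs_info /xfs                   # look for ftype=1 at the end of the "naming" line
mkfs.xfs -n ftype=1 /dev/sdc1   # enable directory filetype on a new filesystem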

Hello Eric. Oh wow, thank you. I will definitely try this and report back with how it goes.
