Extremely slow GFS2 performance


I have run into another huge problem with my cluster build. This is RHEL 5.6, and I'm using GFS2 for the shared filesystems. The filesystems are only mounted on one server at a time. When we run the SQL statement "CREATE TABLESPACE" to create a simple 16GB tablespace, it takes 14 minutes for the command to complete. We then ran the same SQL statement to create the same-size tablespace on a non-shared ext3 filesystem, and it took only 15 seconds.
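
For reference, the statement we are timing is roughly the following (the tablespace name and datafile path are placeholders here, not our real ones):

    time sqlplus / as sysdba <<'EOF'
    CREATE TABLESPACE ts_test
      DATAFILE '/u01/oradata/ORCL/ts_test01.dbf' SIZE 16G;
    EOF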


I then ran across this article in the RHEL Knowledgebase:


Slow write performance to GFS2 filesystem


Well, that is the exact problem I am running into. My kernel is vmlinuz-2.6.18-238.el5, and the article says the fix is in kernel 2.6.18-274 or higher.


My question is: are my symptoms a match for this issue? If so, I have run across yet another bug that has hugely impacted my work and slowed everything down. I have another identical cluster on which I have converted the filesystems from GFS2 to ext3, and I am in the process of installing Oracle and testing the same CREATE TABLESPACE statement.
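
For anyone curious, the "conversion" is really a recreate, since there is no in-place conversion from GFS2 to ext3: on that cluster I am unmounting, re-making the filesystem as ext3, and restoring the data from backup. Roughly, with example volume and mount point names:

    umount /u01/oradata
    mkfs.ext3 -L oradata /dev/vg_ora/lv_oradata
    mount -t ext3 /dev/vg_ora/lv_oradata /u01/oradata
    # then restore the Oracle datafiles from backup before re-testing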


I guess I should never use gfs2.


Mark

Responses

I found an additional article that roughly matches your stated performance issue:

https://access.redhat.com/kb/docs/DOC-61735

If the cluster is not yet in production, I would suggest yum updating to the latest kernel, then re-executing the SQL statement to verify proper operation.
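
On RHEL 5 that amounts to something like the following, run on each node in turn (this assumes the nodes are registered to RHN or a local repo that carries the 5.6 errata kernels):

    yum update kernel     # pull the latest errata kernel
    shutdown -r now       # reboot into the new kernel
    uname -r              # confirm you are now at or past the level the article calls out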

Hope you are able to get everything squared away.

Mark,


We have an almost identical environment (see this thread: https://access.redhat.com/discussion/identical-rhel-clusters-perform-differently-io-sizes-and-iops-do-not-match-during-testing) and were experiencing GFS2-related issues, and we came across the same article. We bumped the kernel level as the article recommended, and it resolved a number of the GFS2 locking issues that we were seeing. So I would definitely recommend attempting it.


-Michael

There are several Cases/Bugzillas regarding DLM and GFS2 locking and related contention in EL5.5 and EL5.6.  Highly recommend updating to all errata on EL5.6 to receive various fixes.  I believe kernel -238.1.12.el5 is an absolute minimum.
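
A quick sanity check against that minimum (run on every node):

    uname -r                           # kernel currently running
    rpm -q kernel --last | head -1     # newest kernel installed
    # both should report 2.6.18-238.1.12.el5 or later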


The engineering team inside of Red Hat took extensive time and care to ensure many of the fixes integrated well with one another and did not cause any regressions. I know there were at least two fixes during EL5.5 that solved one issue but made performance on another more problematic.


That's the problem with shared file systems in general. You go for a performance fix in one area, and you end up causing a different issue in another. There will always be trade-offs in performance when you have to deal with coherency between systems. And if you forget to address coherency, you just get inconsistent systems (and, likely, corruption).


This is unlike a local-only file system, where the system can cache various meta-data in memory because it does not have to maintain coherency with any other system. But I do have to say that I've never seen Red Hat implement something new, or a performance tweak, without addressing coherency. So in the worst case, it is performance that suffers until it is addressed, not coherency.

My organization has decided that GFS2, and the kernel version carrying the fixes, has not been fully certified for use in our environment. Therefore, I had to revert to ext3 filesystems.


Maybe we'll get to use gfs2 in the future.


Thanks for the comments.


Mark

If Ext3 is working for you, ask yourself: do you really need GFS2?


You stated your requirement was that the file system be mounted on only one system at a time. If that is the case, you can use Red Hat Cluster Suite with Ext3 (or XFS, for that matter).


You merely define the storage in your service profile, and the appropriate node mounts the local file system solely for itself. If the service crashes, it will be brought up on another node (fencing the failed node first if required).
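
As a rough sketch only (the node, device, mount point, and service names here are made up), the relevant piece of /etc/cluster/cluster.conf for an rgmanager-managed Ext3 mount looks something like this:

    <rm>
      <failoverdomains>
        <failoverdomain name="oradomain" ordered="1" restricted="1">
          <failoverdomainnode name="node1" priority="1"/>
          <failoverdomainnode name="node2" priority="2"/>
        </failoverdomain>
      </failoverdomains>
      <resources>
        <fs name="oradata" device="/dev/mapper/vg_ora-lv_oradata"
            mountpoint="/u01/oradata" fstype="ext3" force_unmount="1"/>
      </resources>
      <service name="oracle-db" domain="oradomain" autostart="1" recovery="relocate">
        <fs ref="oradata"/>
      </service>
    </rm>

Only the node that currently owns the service mounts the filesystem, so Ext3's single-writer assumption is never violated.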


This is an active-standby configuration, which works well with local file systems. GFS2 is for when you need an active-active configuration, with shared storage mounted on multiple nodes simultaneously.


As I mentioned prior, coherency is the key detail when it comes to shared file systems.  If a file system is local only, like Ext3 and XFS, the kernel can cache much of the meta-data and other aspects of a file system in memory without worrying if another system has changed it.


Memory is in nanoseconds; network and disk latency (and even NAND EEPROM "flash") is still in the micro- to millisecond range. The more contention you have between nodes for the same meta-data and storage areas, the more the coherency messaging is going to adversely affect performance, because instead of the local kernel just having to look up some meta-data in memory, it may have to send a message to another node.

Having suffered through all the performance problems for several years, I have to agree with Mark. GFS2 is _not_ enterprise-scale production ready.


We're still using GFS2 and having to use an increasing number of workarounds/kludges to allow it to keep up with the demands of serving hundreds of millions of files across hundreds of TB of storage. The manpower costs directly attributable to having to nursemaid the systems, or to deal with things running incredibly slowly (GFS issues), easily outstrip the cost of licensing and hardware. It becomes difficult to justify throwing several hundred GB of RAM at servers in order to let them do what ext3/4 systems do well with 24GB - and do a lot faster.
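
To give one concrete example of the kind of workaround I mean (the device and mount point below are illustrative, not our real ones): simply mounting with atime updates disabled, so reads stop generating lock and journal traffic:

    # /etc/fstab
    /dev/vg_san/lv_data   /data   gfs2   defaults,noatime,nodiratime   0 0

That sort of thing reduces the pain, but it doesn't fix the underlying scaling problems.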


Before anyone flames: GFS2 works well in some environments, but it doesn't scale and it's fairly broken for NFS fileserving. That's why I specified enterprise-scale deployment in the first paragraph.