clusterfs does not unmount when service is stopped

OK, I have another cluster problem that maybe someone can help me with. I want my clustered filesystems to unmount when the service is stopped and then get mounted on the other node when the service is started there. This is not occurring, and I get the following error:

 

Sep 22 18:24:56 etvfdpd3 clurgmgrd[5837]: <notice> Stopping service service:etvfdpd3svc
Sep 22 18:24:56 etvfdpd3 clurgmgrd: [5837]: <info> Removing IPv4 address 166.68.70.157/24 from bond0
Sep 22 18:25:06 etvfdpd3 clurgmgrd: [5837]: <debug> Not unmounting clusterfs:db2 - still in use by 1 other service(s)
 

It also appears that a service is still using the clustered filesystem:

 

root@etvfdpd3:/etc/cluster # cman_tool services
type             level name           id       state

dlm              1     lvdb2          00150001 none
[1]

gfs              2     lvdb2          00140001 none
[1]
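
To see which other services rgmanager thinks are running (and so might still be holding a reference to the filesystem), clustat gives a quick summary:

root@etvfdpd3:/etc/cluster # clustat

Any other running service that also references clusterfs:db2 would keep that reference count above zero.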
 

What's interesting is that using rg_test seems to work just fine:

 

root@etvfdpd3:/etc/cluster # rg_test test cluster.conf start service etvfdpd3svc
Running in test mode.
Starting etvfdpd3svc...

<debug>  mount -t gfs2  /dev/mapper/vgdb2-lvdb2 /db2
<info>   Adding IPv4 address 166.68.70.157/24 to bond0
<debug>  Pinging addr 166.68.70.157 from dev bond0
<debug>  Sending gratuitous ARP: 166.68.70.157 f0:4d:a2:3d:cb:c8 brd ff:ff:ff:ff:ff:ff
Start of etvfdpd3svc complete
 

root@etvfdpd3:/etc/cluster # df -h /db2
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vgdb2-lvdb2
                      1.4T  2.1G  1.4T   1% /db2
 

root@etvfdpd3:/etc/cluster # rg_test test cluster.conf stop service etvfdpd3svc
Running in test mode.
Stopping etvfdpd3svc...
<info>   Removing IPv4 address 166.68.70.157/24 from bond0
<info>   unmounting /dev/mapper/vgdb2-lvdb2 (/db2)
Stop of etvfdpd3svc complete

 

And I see this in the message log:

 

Sep 22 18:29:06 etvfdpd3 rg_test: [13420]: <info> unmounting /dev/mapper/vgdb2-lvdb2 (/db2)
 

I have my clusterfs set up in the service section like this:

 

    <service autostart="1" domain="etvfdpd3dom" exclusive="0" max_restarts="0" name="etvfdpd3svc" recovery="restart" restart_expire_time="0">
       <ip ref="166.68.70.157"/>
       <clusterfs name="db2" device="/dev/mapper/vgdb2-lvdb2" mountpoint="/db2" fstype="gfs2" force_unmount="1"/>
    </service>

NOTE the force_unmount="1"
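
As I understand it, force_unmount tells the agent to kill off whatever is still holding the mountpoint before unmounting, roughly the equivalent of the following (a sketch of the behavior, not the agent's actual code):

# Kill any process with open files under /db2, then unmount it.
fuser -kvm /db2
umount /db2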

 

Once again I am stumped and would appreciate any help.

 

Mark

Responses

This is a bug in 5.6!!!

I got this to work just fine on a 5.5 cluster, so I downgraded the following RPMs that came with 5.6 to the ones that came with 5.5, and force_unmount now works.

Here are the 5.6 RPMs that don't work:

root@etvfdpd3:/etc/cluster # yum list installed lvm2 cman lvm2-cluster openais rgmanager
Loaded plugins: rhnplugin, security
This system is not registered with RHN.
RHN support will be disabled.
Installed Packages
cman.x86_64 2.0.115-68.el5 installed
lvm2.x86_64 2.02.74-5.el5 installed
lvm2-cluster.x86_64 2.02.74-3.el5 installed
openais.x86_64 0.80.6-28.el5 installed
rgmanager.x86_64 2.0.52-9.el5 installed

root@etvfdpd3:/etc/cluster # yum list installed gfs-utils gfs2-utils kmod-gfs
Loaded plugins: rhnplugin, security
This system is not registered with RHN.
RHN support will be disabled.
Installed Packages
gfs-utils.x86_64 0.1.20-8.el5 installed
gfs2-utils.x86_64 0.1.62-28.el5 installed
kmod-gfs.x86_64 0.1.34-15.el5 installed

Here are the 5.5 RPMs that I installed and DO work:

root@etvfdpd3:/etc/cluster # yum list installed lvm2 cman lvm2-cluster openais rgmanager
Loaded plugins: rhnplugin, security
This system is not registered with RHN.
RHN support will be disabled.
Installed Packages
cman.x86_64 2.0.115-68.el5 installed
lvm2.x86_64 2.02.74-5.el5 installed
lvm2-cluster.x86_64 2.02.74-3.el5 installed
openais.x86_64 0.80.6-28.el5 installed
rgmanager.x86_64 2.0.52-9.el5 installed

root@etvfdpd3:/etc/cluster # yum list installed gfs-utils gfs2-utils kmod-gfs
Loaded plugins: rhnplugin, security
This system is not registered with RHN.
RHN support will be disabled.
Installed Packages
gfs-utils.x86_64 0.1.20-8.el5 installed
gfs2-utils.x86_64 0.1.62-28.el5 installed
kmod-gfs.x86_64 0.1.34-15.el5 installed
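
For anyone wanting to reproduce the downgrade: rpm can replace a newer package with an older one via --oldpackage. A sketch, with an illustrative filename (substitute the actual 5.5 packages you have on hand):

# --oldpackage lets rpm install an older version over a newer one.
rpm -Uvh --oldpackage rgmanager-2.0.52-6.el5.x86_64.rpm
service rgmanager restart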
 

Hi Mark,

There was a bug in rgmanager that could result in the problem you described.  Specifically, when starting or stopping resources, rgmanager provides them with a reference count: the number of other services using that same resource.  Previous versions of rgmanager calculated this number incorrectly, so an inaccurate reference count was passed to clusterfs when it stopped.  The clusterfs resource agent has code that detects a greater-than-zero reference count and skips unmounting the filesystem in that case.
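
The guard in question looks roughly like this (paraphrased from clusterfs.sh; the exact variable names can vary between releases):

# Paraphrased sketch of the stop-path check in clusterfs.sh: rgmanager
# exports the number of *other* services using this resource, and the agent
# skips the unmount whenever that count is non-zero.
if [ "${OCF_RESKEY_RGMANAGER_meta_refcnt:-0}" -gt 0 ]; then
        ocf_log debug "Not unmounting $OCF_RESOURCE_INSTANCE - still in use by $OCF_RESKEY_RGMANAGER_meta_refcnt other service(s)"
        return 0
fi
# ...otherwise fall through to the normal umount of the mountpoint.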

 

The catch is that both versions of rgmanager you mention, 2.0.52-6.el5 and 2.0.52-9.el5, are susceptible to this problem.  The issue was fixed via Bug #692771 in rgmanager-2.0.52-21.el5:

 

  http://rhn.redhat.com/errata/RHSA-2011-1000.html
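
You can confirm which build you are actually running with:

rpm -q rgmanager

Anything older than 2.0.52-21.el5 still carries this bug.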

 

Since both versions you mentioned have this bug, I don't see how downgrading could have resolved it.  That makes me think you are perhaps not hitting this bug, but something else.

 

Do you have any other references to the same db2 resource in any of your other services?  If so, that would explain why you were initially seeing this, and perhaps after downgrading that other service was not running, making it seem as if the problem was gone. 
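
A quick way to check is to search the configuration for every reference to that resource or its device; for example:

# More than one service-level match here would mean a shared reference count.
grep -n 'clusterfs\|lvdb2' /etc/cluster/cluster.conf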

 

Let me know if that's not the case, and we can see whether there are any other potential reasons why you might have hit this.

 

Regards,

John Ruemker, RHCA

Red Hat Software Maintenance Engineer

Online User Groups Moderator

John,

 

I found the Bugzilla entry about the issue you mentioned in clusterfs.sh and verified that the part of the script that fixes this issue was in place for both 5.6 and 5.5. I also completely removed and reinstalled the 5.6 cluster-related RPMs before I did the 5.5 downgrade, so I doubt that freed up the reference to the service. Also, I experienced this problem on two different identical two-node clusters running 5.6, but another identical two-node cluster running 5.5 that I built two months ago never ran into this issue.

 

What's very interesting is that rg_test worked just fine:

 

# rg_test test cluster.conf stop service etvfdpd3svc

 

Sep 22 18:29:06 etvfdpd3 rg_test: [13420]: <info> unmounting /dev/mapper/vgdb2-lvdb2 (/db2)

 

I also ran the cman_tool services command before/during/after the service shutdown and noticed that the reference count to the filesystem never decremented (that output is in the original problem statement). I think this is the real problem, not the clusterfs.sh script that checks the reference count.

And I have no idea what manages the reference count (rgmanager?).
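
For anyone who wants to reproduce the before/during/after check, this is roughly what I did (two terminals; the interesting part is that the count never drops):

# Terminal 1: watch the dlm/gfs groups while the service is stopped.
watch -n 1 cman_tool services

# Terminal 2: stop the service through rgmanager rather than rg_test.
clusvcadm -d etvfdpd3svc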

 

I also have a case open with Red Hat, Case #00533784.

 

I wish I had time to downgrade the cluster-related RPMs one at a time to help isolate the problem, but we ran out of time for these server builds.

 

If you need additional info from me, please let me know.

 

Mark

 

Red Hat has provided a fix for this problem, and I have verified it:

 

rgmanager-2.0.52-9.el5_6.1.x86_64.rpm
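
For anyone else hitting this, applying it should just be a package upgrade plus an rgmanager restart on each node (relocate or stop your services first, since restarting rgmanager stops anything it manages):

rpm -Uvh rgmanager-2.0.52-9.el5_6.1.x86_64.rpm
service rgmanager restart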

Kindly guide on the best practices for implementing clusters for high availability.
