RGManager will not start samba server


Hi,

Basically, I am trying to set up an HA share using DRBD in the background for replication, with the cluster moving the IP address, the mount (for the DRBD device) and the Samba share on that mount. The IP and mount fail over fine, but when the service moves to the other node it reports started and is considered fully running, even though there is no smb process running and no mention in the logs of rgmanager even attempting to start it or having any problems with it.

I am not sure if I am asking in the right place, but I have had a few days of Google searches and cannot find a solution anywhere. I am also quite new to Linux, and I am looking into replacing some of our Windows servers with Red Hat.

 

I am running two nodes (RHEL trials) on an ESXi host as a test environment.

Am I going about this all wrong, or am I missing something?

Here is my cluster.conf:

 

<?xml version="1.0"?>

<cluster config_version="33" name="RHELClus2">

        <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>

        <clusternodes>

                <clusternode name="RHClusPri" nodeid="1" votes="1">

                        <fence/>

                </clusternode>

                <clusternode name="RHClusSec" nodeid="2" votes="1">

                        <fence/>

                </clusternode>

        </clusternodes>

        <cman expected_votes="1" two_node="1"/>

        <fencedevices/>

        <rm log_level="7">

                <failoverdomains/>

                <resources>

                        <ip address="192.168.16.26" sleeptime="10"/>

                        <fs device="/dev/drbd1" fsid="17062" mountpoint="/mnt/drbd1" name="DRBD"/>

                        <script file="/usr/local/etc/drbd.d/makepridrbd.sh" name="MkDRBDDriPri"/>

                        <smb name="Samba" workgroup="DOMAIN.LOCAL"/>

                </resources>

                <service autostart="1" exclusive="0" name="Cluster" recovery="relocate">

                        <ip ref="192.168.16.26">

                                <script ref="MkDRBDDriPri">

                                        <fs ref="DRBD">

                                                <smb ref="Samba"/>

                                        </fs>

                                </script>

                        </ip>

                </service>

        </rm>

</cluster>

 

I have tried it with an smb.conf set in the resource and still get the same fault. I am using Luci for the web management.

 

Here is my /var/log/messages entries for when the service comes online:

Apr 28 19:25:27 RHClusPri modcluster: Starting service: Cluster on node
Apr 28 19:25:27 RHClusPri rgmanager[1651]: Starting disabled service service:Cluster
Apr 28 19:25:27 RHClusPri rgmanager[18169]: Adding IPv4 address 192.168.16.26/27 to eth0
Apr 28 19:25:29 RHClusPri avahi-daemon[1487]: Registering new address record for 192.168.16.26 on eth0.IPv4.
Apr 28 19:25:31 RHClusPri rgmanager[18265]: Executing /usr/local/etc/drbd.d/makepridrbd.sh start
Apr 28 19:25:31 RHClusPri kernel: block drbd1: role( Secondary -> Primary )
Apr 28 19:25:31 RHClusPri rgmanager[18389]: mounting /dev/drbd1 on /mnt/drbd1
Apr 28 19:25:31 RHClusPri rgmanager[18412]: mount /dev/drbd1 /mnt/drbd1
Apr 28 19:25:31 RHClusPri kernel: kjournald starting. Commit interval 5 seconds
Apr 28 19:25:31 RHClusPri kernel: EXT3 FS on drbd1, internal journal
Apr 28 19:25:31 RHClusPri kernel: EXT3-fs: mounted filesystem with ordered data mode.
Apr 28 19:25:31 RHClusPri rgmanager[1651]: Service service:Cluster started

 

Any help would be appreciated, or if you could point me in the right direction for additional help.

Thanks

Responses

Hi Simon,

That is definitely strange.  Looking at your configuration, I don't see anything immediately obvious that would cause this type of issue.  We should certainly see a message from the smb resource when rgmanager attempts to start it, and the fact that we don't indicates it's not getting started for some reason.

 

A couple questions:

 

* Does "cman_tool status" report the same Config Version that is listed in /etc/cluster/cluster.conf as config_version?

 

* Are you seeing any warnings or messages from rgmanager/clurgmgrd in /var/log/messages when the daemon first starts? 

 

* Have you made any modifications to any of the resource agents in /usr/share/cluster?  We don't recommend doing so, but I ask because I've seen instances in the past where syntax errors in these agents have caused behavior similar to this. 

 

If you would like to get some more verbose output while starting the service, you can try starting it with rg_test.  Note: you need to disable your service before using rg_test, since it bypasses status checks and the check to determine whether it's running on other nodes.  If you'd like to give this a try and provide us the output, here is the procedure:

 

a) Stop the service using conga or with

 

  # clusvcadm -d Cluster

 

b) Start the service using rg_test like so:

 

  # rg_test test /etc/cluster/cluster.conf start service Cluster

 

Did that give any more information about why it's not starting?  If not, try:

 

c) Start the individual smb resource with rg_test like so:

 

  # rg_test test /etc/cluster/cluster.conf start smb Samba

 

Feel free to provide the output here.  Once you have completed this test, you'll want to shut everything down again so that it can once again be controlled by the cluster:

 

  # rg_test test /etc/cluster/cluster.conf stop service Cluster

 

You're now free to start it back up with Conga or clusvcadm.

 

Hopefully these steps will shed some light on the problem.  Let us know if you have any questions.

 

Regards,
John Ruemker, RHCA

Red Hat Technical Account Manager

Online User Groups Moderator

 

 

P.S.  I should note that running a cluster without a valid fence device is unsupported.  That said, it sounds like you're just giving the product a trial, so for testing purposes that is fine.  However, if you ever move this cluster into production, you'll want to investigate getting a proper fence device set up (which doesn't exist for VMware guests).

Hi, thanks for your quick response, and sorry for my delay, but I have been away for a week.

 

I will go through all your suggestions as soon as I get settled back in and will let you know the results.

 

I am sure that the answer is somewhere in your in-depth response :)

 

Thanks.

 

Hi John,

 

Here are my findings:

Does "cman_tool status" report the same Config Version that is listed in /etc/cluster/cluster.conf as config_version?

 

No, the config versions do not match on either node.  The 'cman_tool status' version is one less.  What do I need to do to correct this?

 

Here is the section of my /var/log/messages when rgmanager starts:

May  9 11:18:19 RHClusPri rgmanager[24530]: I am node #1

May  9 11:18:19 RHClusPri rgmanager[24530]: Resource Group Manager Starting

May  9 11:18:19 RHClusPri rgmanager[24530]: Loading Service Data

May  9 11:18:21 RHClusPri rgmanager[24530]: Initializing Services

May  9 11:18:21 RHClusPri rgmanager[25111]: /dev/drbd1 is not mounted

May  9 11:18:21 RHClusPri rgmanager[25142]: Executing /usr/local/etc/drbd.d/makepridrbd.sh stop

May  9 11:18:21 RHClusPri rgmanager[24530]: Services Initialized

May  9 11:18:21 RHClusPri rgmanager[24530]: State change: Local UP

May  9 11:18:21 RHClusPri rgmanager[24530]: Starting stopped service service:Cluster

May  9 11:18:22 RHClusPri rgmanager[25266]: Adding IPv4 address 192.168.16.26/27 to eth0

May  9 11:18:25 RHClusPri rgmanager[25360]: Executing /usr/local/etc/drbd.d/makepridrbd.sh start

May  9 11:18:25 RHClusPri kernel: block drbd1: role( Secondary -> Primary )

May  9 11:18:26 RHClusPri rgmanager[25484]: mounting /dev/drbd1 on /mnt/drbd1

May  9 11:18:26 RHClusPri rgmanager[25510]: mount   /dev/drbd1 /mnt/drbd1

May  9 11:18:26 RHClusPri kernel: kjournald starting.  Commit interval 5 seconds

May  9 11:18:26 RHClusPri kernel: EXT3 FS on drbd1, internal journal

May  9 11:18:26 RHClusPri kernel: EXT3-fs: mounted filesystem with ordered data mode.

May  9 11:18:26 RHClusPri rgmanager[24530]: Service service:Cluster started

 

Here is the output for the rg_test start:

 

Running in test mode.

Loading resource rule from /usr/share/cluster/named.sh

Loading resource rule from /usr/share/cluster/ip.sh

Loading resource rule from /usr/share/cluster/lvm_by_vg.sh

Loading resource rule from /usr/share/cluster/service.sh

Loading resource rule from /usr/share/cluster/ASEHAagent.sh

Loading resource rule from /usr/share/cluster/postgres-8.sh

Loading resource rule from /usr/share/cluster/netfs.sh

Loading resource rule from /usr/share/cluster/fs.sh

Loading resource rule from /usr/share/cluster/ocf-shellfuncs

Loading resource rule from /usr/share/cluster/tomcat-6.sh

Loading resource rule from /usr/share/cluster/clusterfs.sh

Loading resource rule from /usr/share/cluster/oracledb.sh

Loading resource rule from /usr/share/cluster/svclib_nfslock

Loading resource rule from /usr/share/cluster/apache.sh

Loading resource rule from /usr/share/cluster/mysql.sh

Loading resource rule from /usr/share/cluster/SAPDatabase

Loading resource rule from /usr/share/cluster/nfsclient.sh

Loading resource rule from /usr/share/cluster/lvm.sh

Loading resource rule from /usr/share/cluster/nfsserver.sh

Loading resource rule from /usr/share/cluster/lvm_by_lv.sh

Loading resource rule from /usr/share/cluster/samba.sh

Loading resource rule from /usr/share/cluster/SAPInstance

Loading resource rule from /usr/share/cluster/nfsexport.sh

Loading resource rule from /usr/share/cluster/vm.sh

Loading resource rule from /usr/share/cluster/openldap.sh

Loading resource rule from /usr/share/cluster/script.sh

Starting Cluster...

<debug>  Link for eth0: Detected

Link for eth0: Detected

<info>   Adding IPv4 address 192.168.16.26/27 to eth0

Adding IPv4 address 192.168.16.26/27 to eth0

<debug>  Pinging addr 192.168.16.26 from dev eth0

Pinging addr 192.168.16.26 from dev eth0

<debug>  Sending gratuitous ARP: 192.168.16.26 00:0c:29:3c:9d:8e brd ff:ff:ff:ff:ff:ff

Sending gratuitous ARP: 192.168.16.26 00:0c:29:3c:9d:8e brd ff:ff:ff:ff:ff:ff

<info>   Executing /usr/local/etc/drbd.d/makepridrbd.sh start

Executing /usr/local/etc/drbd.d/makepridrbd.sh start

Making this machine primary

<debug>  Running fsck on /dev/drbd1

Running fsck on /dev/drbd1

<info>   mounting /dev/drbd1 on /mnt/drbd1

mounting /dev/drbd1 on /mnt/drbd1

<err>    mount   /dev/drbd1 /mnt/drbd1

mount   /dev/drbd1 /mnt/drbd1

Start of Cluster complete

 

Running 'rg_test test /etc/cluster/cluster.conf start smb Samba' produced the following error:

 

Running in test mode.

Loading resource rule from /usr/share/cluster/named.sh

Loading resource rule from /usr/share/cluster/ip.sh

Loading resource rule from /usr/share/cluster/lvm_by_vg.sh

Loading resource rule from /usr/share/cluster/service.sh

Loading resource rule from /usr/share/cluster/ASEHAagent.sh

Loading resource rule from /usr/share/cluster/postgres-8.sh

Loading resource rule from /usr/share/cluster/netfs.sh

Loading resource rule from /usr/share/cluster/fs.sh

Loading resource rule from /usr/share/cluster/ocf-shellfuncs

Loading resource rule from /usr/share/cluster/tomcat-6.sh

Loading resource rule from /usr/share/cluster/clusterfs.sh

Loading resource rule from /usr/share/cluster/oracledb.sh

Loading resource rule from /usr/share/cluster/svclib_nfslock

Loading resource rule from /usr/share/cluster/apache.sh

Loading resource rule from /usr/share/cluster/mysql.sh

Loading resource rule from /usr/share/cluster/SAPDatabase

Loading resource rule from /usr/share/cluster/nfsclient.sh

Loading resource rule from /usr/share/cluster/lvm.sh

Loading resource rule from /usr/share/cluster/nfsserver.sh

Loading resource rule from /usr/share/cluster/lvm_by_lv.sh

Loading resource rule from /usr/share/cluster/samba.sh

Loading resource rule from /usr/share/cluster/SAPInstance

Loading resource rule from /usr/share/cluster/nfsexport.sh

Loading resource rule from /usr/share/cluster/vm.sh

Loading resource rule from /usr/share/cluster/openldap.sh

Loading resource rule from /usr/share/cluster/script.sh

No resource Samba of type smb found

 

It seems at first glance to be the config version fault, but I have had the Samba resource in a fair few versions and it has never started, so it seems a bit coincidental that the 'cman_tool status' config version has never matched any of the cluster.conf versions containing the Samba resource.

 

Thanks for your help.

 

Simon

I have tried running 'cman_tool version -r' to see if I can get the new version up and running (Google IS your friend) and got the following error:

 

 

Relax-NG validity error : Extra element rm in interleave

tempfile:14: element rm: Relax-NG validity error : Element cluster failed to validate content

Configuration fails to validate

cman_tool: Not reloading, configuration is not valid

 

Here is my current cluster.conf ('cman_tool status' shows config version 32):

 

 

<?xml version="1.0"?>

<cluster config_version="33" name="RHELClus2">

        <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>

        <clusternodes>

                <clusternode name="RHClusPri" nodeid="1" votes="1">

                        <fence/>

                </clusternode>

                <clusternode name="RHClusSec" nodeid="2" votes="1">

                        <fence/>

                </clusternode>

        </clusternodes>

        <cman expected_votes="1" two_node="1"/>

        <fencedevices/>

        <rm log_level="7">

                <failoverdomains/>

                <resources>

                        <ip address="192.168.16.26" sleeptime="10"/>

                        <fs device="/dev/drbd1" fsid="17062" mountpoint="/mnt/drbd1" name="DRBD"/>

                        <script file="/usr/local/etc/drbd.d/makepridrbd.sh" name="MkDRBDDriPri"/>

                        <smb name="Samba" workgroup="DOMAIN.LOCAL"/>

                </resources>

                <service autostart="1" exclusive="0" name="Cluster" recovery="relocate">

                        <ip ref="192.168.16.26">

                                <script ref="MkDRBDDriPri">

                                        <fs ref="DRBD">

                                                <smb ref="Samba"/>

                                        </fs>

                                </script>

                        </ip>

                </service>

        </rm>

</cluster>

If you are seeing cman_tool report a Config Version different from what is in cluster.conf, then it's possible the version in use at the moment did not have the reference to the smb resource, and thus rgmanager is not attempting to start it when you start the service.  Usually this version mismatch happens when you manually update cluster.conf but don't apply/propagate those changes to the cluster.
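As a quick sanity check, you can extract the on-disk version and compare it with the "Config Version" line from 'cman_tool status'.  A minimal sketch (the sed pattern is illustrative, and it is run here against a small sample file rather than the live /etc/cluster/cluster.conf):

```shell
# Stand-in for /etc/cluster/cluster.conf on a real node
cat > /tmp/cluster.conf.sample <<'EOF'
<?xml version="1.0"?>
<cluster config_version="33" name="RHELClus2">
</cluster>
EOF

# Pull out the config_version attribute; compare this number against the
# "Config Version" field printed by 'cman_tool status'
sed -n 's/.*config_version="\([0-9]*\)".*/\1/p' /tmp/cluster.conf.sample
```

If the two numbers differ, the cluster is still running an older revision of the configuration.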

 

The following article describes the procedure for applying changes to cluster.conf to the cluster in RHEL 5:

 

  https://access.redhat.com/kb/docs/DOC-5951

 

As you can see, you'll need to use ccs_tool update before cman_tool version.

 

As far as the validation error goes, I can't figure out what the problem is.  Go ahead and try the above procedure to update it using ccs_tool update, and let us know if you still get this error.

 

Thanks,

John Ruemker, RHCA

Red Hat Technical Account Manager

Online User Groups Moderator

Hi John,

 

Here are the results of the commands:

 

# ccs_tool update /etc/cluster/cluster.conf

Unknown command, update.

Try 'ccs_tool help' for help.

 

# cman_tool version -r 41

Warning: specifying a version for the -r flag is deprecated and no longer used

Relax-NG validity error : Extra element rm in interleave

tempfile:14: element rm: Relax-NG validity error : Element cluster failed to validate content

Configuration fails to validate

cman_tool: Not reloading, configuration is not valid

 

 

I have found that this only happens when I add the smb entry, either directly into the service or as a resource.  The resource does not even have to be attached to the service for this validity error to occur.

 

Is this a fault with a missing smb module for the cluster, such that it does not understand the 'smb' entry?

I think I mistakenly assumed you were talking about RHEL 5, whereas now I see you are on RHEL 6 (at least I think so).  In that case, you were right, and you only need to run 'cman_tool version -r'.  The command I gave you is not valid on RHEL 6.

 

With that said, this revelation has led me to what I believe is the source of your issue.  Your cluster.conf has this resource:

 

                        <smb name="Samba" workgroup="DOMAIN.LOCAL"/>

 

smb is not a valid resource type in RHEL 6.  Instead it is now called "samba".  Try replacing the references to smb with samba, and then do 'cman_tool version -r', and then see if the resource starts properly.
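For example, a sed one-liner could do the rename (shown here against a two-line sample rather than the live /etc/cluster/cluster.conf, so work on a copy and bump config_version before propagating):

```shell
# Sample containing the two places the smb type appears in your config
cat > /tmp/cluster.conf.copy <<'EOF'
<smb name="Samba" workgroup="DOMAIN.LOCAL"/>
<smb ref="Samba"/>
EOF

# Rename the resource type from smb to samba (the RHEL 6 agent name)
sed -i 's/<smb /<samba /g' /tmp/cluster.conf.copy

# Both lines should now read <samba .../>
cat /tmp/cluster.conf.copy
```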

 

Regards,
John Ruemker, RHCA

Red Hat Technical Account Manager

Online User Groups Moderator

 

Thanks for that; maybe I should have mentioned earlier that I am using RHEL 6.
 
The update of the conf now works perfectly, but the new resource entry does not show in Luci (Conga).  Looking more into this, I have noticed that Luci/Conga is not supported on RHEL 6.  Can I get a RHEL 6 supported version, or is there another way of doing this?

Conga is supported on RHEL 6.  I'm not sure why the resource wouldn't be showing up.  I'll see if I can take a look at that code today and find out why it's failing.

 

In the meantime you should be able to continue using Conga to manage the service, but you just can't edit the service (since it doesn't recognize samba, it will remove that resource any time you edit it).

 

I'll let you know what I find.  If anyone else here has ideas on why this is failing, feel free to chime in.

 

Regards,

John Ruemker, RHCA

Red Hat Technical Account Manager

Online User Groups Moderator

Just to let you know, it was Conga that created the 'smb' resource that should have been 'samba', not a manual entry in cluster.conf.

 

This is the section of my messages log now that I am using the 'samba' resource:

 

 

May 10 14:26:53 RHClusPri rgmanager[27787]: Starting Service samba:SmbServer

May 10 14:26:53 RHClusPri rgmanager[27841]: Query failed: Invalid argument (/cluster/rm/service[@name="Cluster"]/ip[1]/@address)

May 10 14:26:53 RHClusPri rgmanager[27864]: Looking For IP Addresses [samba:SmbServer] > Failed - No IP Addresses Found

May 10 14:26:53 RHClusPri rgmanager[26487]: start on samba "SmbServer" returned 1 (generic error)

May 10 14:26:53 RHClusPri rgmanager[26487]: #68: Failed to start service:Cluster; return value: 1

May 10 14:26:53 RHClusPri rgmanager[26487]: Stopping service service:Cluster

May 10 14:26:53 RHClusPri rgmanager[27954]: Stopping Service samba:SmbServer

May 10 14:26:54 RHClusPri rgmanager[27976]: Checking Existence Of File /var/run/cluster/samba/samba:SmbServer/smbd-smb.conf.pid [samba:SmbServer] > Failed - File Doesn't

May 10 14:26:54 RHClusPri rgmanager[27998]: Checking Existence Of File /var/run/cluster/samba/samba:SmbServer/nmbd-smb.conf.pid [samba:SmbServer] > Failed - File Doesn't

May 10 14:26:54 RHClusPri rgmanager[28020]: Stopping Service samba:SmbServer > Succeed

 

Here is my current cluster.conf.  The version is the same as in 'cman_tool status':

 

<?xml version="1.0"?>

<cluster config_version="43" name="RHELClus2">

        <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>

        <clusternodes>

                <clusternode name="RHClusPri" nodeid="1" votes="1">

                        <fence/>

                </clusternode>

                <clusternode name="RHClusSec" nodeid="2" votes="1">

                        <fence/>

                </clusternode>

        </clusternodes>

        <cman expected_votes="1" two_node="1"/>

        <fencedevices/>

        <rm log_level="7">

                <failoverdomains/>

                <resources>

                        <ip address="192.168.16.26" sleeptime="10"/>

                        <fs device="/dev/drbd1" fsid="17062" mountpoint="/mnt/drbd1" name="DRBD"/>

                        <script file="/usr/local/etc/drbd.d/makepridrbd.sh" name="MkDRBDDriPri"/>

                </resources>

                <service autostart="1" exclusive="0" name="Cluster" recovery="relocate">

                        <ip ref="192.168.16.26">

                                <script ref="MkDRBDDriPri">

                                        <fs ref="DRBD">

                                                <samba name="SmbServer" shutdown_wait="0"/>

                                        </fs>

                                </script>

                        </ip>

                </service>

        </rm>

</cluster>