Chapter 9. Troubleshooting

The following is a list of problems you may see with the configuration of fence devices, along with suggestions for how to address them.
  • If your system does not fence a node automatically, you can try to fence the node from the command line using the fence_node command, as described at the end of each of the fencing configuration procedures. The fence_node command performs I/O fencing on a single node by reading the fencing settings from the cluster.conf file for the given node and then running the configured fencing agent against the node. For example, the following command fences node clusternode1.example.com:
    # /sbin/fence_node clusternode1.example.com
    If the fence_node command is unsuccessful, you may have made an error in defining the fence device configuration. To determine whether the fencing agent itself is able to talk to the fencing device, you can execute the I/O fencing command for your fence device directly from the command line. As a first step, you can execute the command with the -o status option specified. For example, if you are using an APC switch as a fencing agent, you can execute a command such as the following:
    # /sbin/fence_apc -a (ipaddress) -l (login) ... -o status -v
    You can also use the I/O fencing command for your device to fence the node. For example, for an HP ILO device, you can issue the following command:
    # /sbin/fence_ilo -a myilo -l login -p passwd -o off -v
  • Check the version of firmware you are using in your fence device. You may want to consider upgrading your firmware. You may also want to search Bugzilla to see whether there are any known issues with your firmware level.
  • If a node in your cluster is repeatedly getting fenced, it means that one of the nodes in your cluster is not seeing enough "heartbeat" network messages from the node that is getting fenced. Most of the time, this is a result of flaky or faulty hardware, such as bad cables or bad ports on the network hub or switch. Test your communications paths thoroughly without the cluster software running to make sure your hardware is working correctly; a basic connectivity check is sketched after this list.
  • If a node in your cluster is repeatedly getting fenced right at startup, it may be due to system activities that occur when a node joins a cluster. If your network is busy, your cluster may decide it is not getting enough heartbeat packets. To address this, you may have to increase the post_join_delay setting in your cluster.conf file. This delay is essentially a grace period that gives the node more time to join the cluster.
    In the following example, the fence_daemon entry in the cluster configuration file shows a post_join_delay setting that has been increased to 600. (A sketch of one way to propagate an edited cluster.conf to the rest of the cluster appears after this list.)
    <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="600"/>
    
  • If a node fails while the fenced daemon is not running, it will not be fenced. Problems will occur if the fenced daemon is killed or exits while the node is using GFS; if the fenced daemon exits, it should be restarted. A sketch of how to check that fenced is running appears after this list.
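If you need to verify the communications path as described above, the following is a minimal sketch of one way to do it. It assumes the standard cman and rgmanager init scripts, uses the node names from the examples in this document, and picks an arbitrary packet count and interface name for illustration. First, stop the cluster software on the node you want to test:

    # service rgmanager stop
    # service cman stop

Then, from another node, confirm that the node is reachable and check its network interface for errors or dropped packets:

    # ping -c 20 clusternode1.example.com
    # ssh clusternode1.example.com ifconfig eth0
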
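If you edit cluster.conf by hand to change post_join_delay, the change must also be propagated to the other cluster nodes. The following is a sketch of one common way to do this on a running cluster; it assumes that you have incremented the config_version attribute in cluster.conf (to 3 in this example) and that the ccs_tool and cman_tool utilities are available, so check the ccs_tool(8) and cman_tool(8) man pages for your release:

    # ccs_tool update /etc/cluster/cluster.conf
    # cman_tool version -r 3
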
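To check for the condition described in the last item above, the following is a minimal sketch of how you might confirm that the fenced daemon is running on a node and, if it is not, rejoin the fence domain. The exact behavior of fence_tool varies between releases, so consult the fence_tool(8) man page:

    # ps -C fenced
    # fence_tool join
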
If you see error messages when you try to configure your system, or if your system does not behave as expected after configuration, you can perform the following checks and examine the following areas.
  • Connect to one of the nodes in the cluster and execute the clustat(8) command. This utility displays the status of the cluster. It shows membership information, quorum view, and the state of all configured user services.
    The following example shows the output of the clustat(8) command.
    [root@clusternode4 ~]# clustat
    Cluster Status for nfsclust @ Wed Dec  3 12:37:22 2008
    Member Status: Quorate
    
     Member Name                             ID   Status
     ------ ----                             ---- ------
     clusternode5.example.com                   1 Online, rgmanager
     clusternode4.example.com                   2 Online, Local, rgmanager
     clusternode3.example.com                   3 Online, rgmanager
     clusternode2.example.com                   4 Online, rgmanager
     clusternode1.example.com                   5 Online, rgmanager
    
     Service Name             Owner (Last)                     State
     ------- ---              ----- ------                     -----
     service:nfssvc           clusternode2.example.com         starting
    
    In this example, clusternode4 is the local node since it is the host from which the command was run. If rgmanager did not appear in the Status column, it could indicate that cluster services are not running on the node; a sketch of how to check and start them appears at the end of this section.
  • Connect to one of the nodes in the cluster and execute the group_tool(8) command. This command provides information that you may find helpful in debugging your system. The following example shows the output of the group_tool(8) command.
    [root@clusternode1 ~]# group_tool
    type             level name       id       state
    fence            0     default    00010005 none
    [1 2 3 4 5]
    dlm              1     clvmd      00020005 none
    [1 2 3 4 5]
    dlm              1     rgmanager  00030005 none
    [3 4 5]
    dlm              1     mygfs      007f0005 none
    [5]
    gfs              2     mygfs      007e0005 none
    [5]
    
    The state of the group should be none. The numbers in the brackets are the node ID numbers of the cluster nodes in the group. The clustat command shows which node IDs are associated with which nodes. If you do not see a node number in the group, that node is not a member of that group. For example, if a node ID is not in the dlm/rgmanager group, it is not using the rgmanager dlm lock space (and probably is not running rgmanager).
    The level of a group indicates the recovery ordering. 0 is recovered first, 1 is recovered second, and so forth.
  • Connect to one of the nodes in the cluster and execute the cman_tool nodes -f command. This command provides information about the cluster nodes that you may find useful. The following example shows the output of the cman_tool nodes -f command.
    [root@clusternode1 ~]# cman_tool nodes -f
    Node  Sts   Inc   Joined               Name
       1   M    752   2008-10-27 11:17:15  clusternode5.example.com
       2   M    752   2008-10-27 11:17:15  clusternode4.example.com
       3   M    760   2008-12-03 11:28:44  clusternode3.example.com
       4   M    756   2008-12-03 11:28:26  clusternode2.example.com
       5   M    744   2008-10-27 11:17:15  clusternode1.example.com
    
    The Sts heading indicates the status of a node. A status of M indicates the node is a member of the cluster. A status of X indicates that the node is dead. The Inc heading indicates the incarnation number of the node, which is for debugging purposes only.
  • Check whether the cluster.conf file is identical on each node of the cluster. If you configured your system with Conga, as in the example provided in this document, these files should be identical, but one of the files may have been accidentally deleted or altered. One way to compare the files across nodes is sketched below.
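As noted in the last item above, one way to confirm that cluster.conf is identical on every node is to compare checksums of the file. The following is a minimal sketch using md5sum over ssh, with the node names from the examples in this document:

    # md5sum /etc/cluster/cluster.conf
    # for node in clusternode1 clusternode2 clusternode3 clusternode4 clusternode5; \
      do ssh $node.example.com md5sum /etc/cluster/cluster.conf; done
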
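If the clustat output shows a node without rgmanager in its Status column, as mentioned earlier in this section, one simple follow-up check is sketched below; it assumes the standard rgmanager init script is installed on the node:

    # service rgmanager status
    # service rgmanager start
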