Red Hat Training

A Red Hat training course is available for Red Hat Gluster Storage

13.7. Troubleshooting Nagios

13.7.1. Troubleshooting NSCA and NRPE Configuration Issues

The possible errors while configuring Nagios Service Check Acceptor (NSCA) and Nagios Remote Plug-in Executor (NRPE) and the troubleshooting steps are listed in this section.
Troubleshooting NSCA Configuration Issues

  • Check Firewall and Port Settings on Nagios Server
    If port 5667 is not opened on the server host's firewall, a timeout error is displayed. Ensure that port 5667 is opened.
    1. Run the following command on the Red Hat Storage node as root to get the list of current iptables rules:
      # iptables -L
    2. The output is displayed as shown below:
      ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:5667
    3. If the port is not opened, add an iptables rule by adding the following lines in /etc/sysconfig/iptables file:
      -A INPUT -m state --state NEW -m tcp -p tcp --dport 5667 -j ACCEPT
    4. Restart the iptables service using the following command:
      # service iptables restart
    5. Restart the NSCA service using the following command:
      # service nsca restart
  • Check the Configuration File on Red Hat Storage Node
    Messages cannot be sent to the NSCA server, if Nagios server IP or FQDN, cluster name and hostname (as configured in Nagios server) are not configured correctly.
    Open the Nagios server configuration file /etc/nagios/nagios_server.conf and verify if the correct configurations are set as shown below:
    # NAGIOS SERVER 
    # The nagios server IP address or FQDN to which the NSCA command 
    # needs to be sent 
    [NAGIOS-SERVER] 
    nagios_server=NagiosServerIPAddress 
     
     
    # CLUSTER NAME 
    # The host name of the logical cluster configured in Nagios under which 
    # the gluster volume services reside 
    [NAGIOS-DEFINTIONS] 
    cluster_name=cluster_auto 
     
     
    # LOCAL HOST NAME 
    # Host name given in the nagios server 
    [HOST-NAME] 
    hostname_in_nagios=NagiosServerHostName
    If Host name is updated, restart the NSCA service using the following command:
    # service nsca restart

Troubleshooting NRPE Configuration Issues

  • CHECK_NRPE: Error - Could Not Complete SSL Handshake
    This error occurs if the IP address of the Nagios server is not defined in the nrpe.cfg file of the Red Hat Storage node. To fix this issue, follow the steps given below:
    1. Add the Nagios server IP address in /etc/nagios/nrpe.cfg file in the allowed_hosts line as shown below:
      allowed_hosts=127.0.0.1, NagiosServerIP
      The allowed_hosts is the list of IP addresses which can execute NRPE commands.
    2. Save the nrpe.cfg file and restart NRPE service using the following command:
      # service nrpe restart
  • CHECK_NRPE: Socket Timeout After n Seconds
    To resolve this issue perform the steps given below:
    On Nagios Server:
    The default timeout value for the NRPE calls is 10 seconds and if the server does not respond within 10 seconds, Nagios Server GUI displays an error that the NRPE call has timed out in 10 seconds. To fix this issue, change the timeout value for NRPE calls by modifying the command definition configuration files.
    1. Changing the NRPE timeout for services which directly invoke check_nrpe.
      For the services which directly invoke check_nrpe (check_disk_and_inode, check_cpu_multicore, and check_memory), modify the command definition configuration file /etc/nagios/gluster/gluster-commands.cfg by adding -t Time in Seconds as shown below:
      define command {
             command_name check_disk_and_inode
             command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c check_disk_and_inode -t TimeInSeconds
      }
    2. Changing the NRPE timeout for the services in nagios-server-addons package which invoke NRPE call through code.
      The services which invoke /usr/lib64/nagios/plugins/gluster/check_vol_server.py (check_vol_utilization, check_vol_status, check_vol_quota_status, check_vol_heal_status, and check_vol_georep_status) make NRPE call to the Red Hat Storage nodes for the details through code. To change the timeout for the NRPE calls, modify the command definition configuration file /etc/nagios/gluster/gluster-commands.cfg by adding -t No of seconds as shown below:
      define command {
            command_name check_vol_utilization
            command_line $USER1$/gluster/check_vol_server.py $ARG1$ $ARG2$ -w $ARG3$ -c $ARG4$ -o utilization -t TimeInSeconds
      }
      The auto configuration service gluster_auto_discovery makes NRPE calls for the configuration details from the Red Hat Storage nodes. To change the NRPE timeout value for the auto configuration service, modify the command definition configuration file /etc/nagios/gluster/gluster-commands.cfg by adding -t TimeInSeconds as shown below:
      define command{
              command_name    gluster_auto_discovery
              command_line    sudo $USER1$/gluster/configure-gluster-nagios.py -H $ARG1$ -c $HOSTNAME$ -m auto -n $ARG2$ -t TimeInSeconds
      }
    3. Restart Nagios service using the following command:
      # service nagios restart
    On Red Hat Storage node:
    1. Add the Nagios server IP address as described in CHECK_NRPE: Error - Could Not Complete SSL Handshake section in Troubleshooting NRPE Configuration Issues section.
    2. Edit the nrpe.cfg file using the following command:
      # vi /etc/nagios/nrpe.cfg
    3. Search for the command_timeout and connection_timeout settings and change the value. The command_timeout value must be greater than or equal to the timeout value set in Nagios server.
      The timeout on checks can be set as connection_timeout=300 and the command_timeout=60 seconds.
    4. Restart the NRPE service using the following command:
      # service nrpe restart
  • Check the NRPE Service Status
    This error occurs if the NRPE service is not running. To resolve this issue perform the steps given below:
    1. Log in as root to the Red Hat Storage node and run the following command to verify the status of NRPE service:
      # service nrpe status
    2. If NRPE is not running, start the service using the following command:
      # service nrpe start
  • Check Firewall and Port Settings
    This error is associated with firewalls and ports. The timeout error is displayed if the NRPE traffic is not traversing a firewall, or if port 5666 is not open on the Red Hat Storage node.
    Enure that port 5666 is opened on the Red Hat Storage node.
    1. Run check_nrpe command from the Nagios server to verify if the port is opened and if NRPE is running on the Red Hat Storage Node.
    2. Log into the Nagios server as root and run the following command:
      /usr/lib64/nagios/plugins/check_nrpe -H RedHatStorageNodeIP
    3. The output is displayed as given below:
      NRPE v2.14
    If not, ensure the that port 5666 is opened on the Red Hat Storage node.
    1. Run the following command on the Red Hat Storage node as root to get a list of the current iptables rules:
      # iptables -L
    2. The output is displayed as shown below:
      ACCEPT - tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:5666
  • If the port is not open, add iptables rule for it.
    1. To add iptables rule, edit the iptables file as shown below:
      vi /etc/sysconfig/iptables
    2. Add the following line in the file:
      -A INPUT -m state --state NEW -m tcp -p tcp --dport 5666 -j ACCEPT
    3. Restart the iptables service using the following command:
      # service iptables restart
    4. Save the file and restart NRPE service:
      # service nrpe restart
  • Checking Port 5666 From the Nagios Server with Telnet
    Use telnet to verify the Red Hat Storage node's ports. To verify the ports of the Red Hat Storage node, perform the steps given below:
    1. Log in as root on Nagios server.
    2. Test the connection on port 5666 from the Nagios server to the Red Hat Storage node using the following command:
      # telnet RedHatStorageNodeIP 5666
    3. The output displayed is similar to:
      telnet 10.70.36.49 5666 
      Trying 10.70.36.49... 
      Connected to 10.70.36.49. 
      Escape character is '^]'.
  • Connection Refused By Host
    This error is due to port/firewall issues or incorrectly configured allowed_hosts directives. See the sections CHECK_NRPE: Error - Could Not Complete SSL Handshake and CHECK_NRPE: Socket Timeout After n Seconds for troubleshooting steps.

13.7.2. Troubleshooting General Issues

This section describes the troubleshooting procedures for general issues related to Nagios.
All cluster services are in warning state and status information is displayed as (null).

Set SELinux to permissive and restart the Nagios server.