Chapter 4. Troubleshooting Core and MBaaS using Nagios

If you encounter issues when working with Core or MBaaS, Nagios can assist with troubleshooting the following areas:

  • CPU - CPU check
  • Memory - Memory check
  • Storage - Disk Usage check
  • MongoDB - MongoDB check
  • MySQL - MySQL check
  • Overall health - Health Check
  • Network - Ping Check
Note

CPU and Memory are described in the same section

Each Nagios check reports a status describing the current situation. For example, the status of the CPU check might be OK.

There are four different status levels:

  • OK
  • WARNING
  • CRITICAL
  • UNKNOWN

4.1. Troubleshooting CPU/Memory Issues

4.1.1. Reviewing CPU and Memory Status

This section applies to Core and MBaaS components.

The status levels for CPU and Memory are:

  • OK - The CPU/Memory usage for all containers is within the set limits.
  • WARNING - The CPU/Memory usage of one or more containers is over the WARNING threshold of 80%.
  • CRITICAL - The CPU/Memory usage of one or more containers is over the CRITICAL threshold of 90%.
  • UNKNOWN - An internal error occurred and the check was unable to determine a correct status.

4.1.2. Resolving CRITICAL and WARNING Issues

The following describes how to check that CPU/Memory usage is within the desired threshold for all containers in the current project.

To calculate the containers’ current CPU/Memory usage, the check needs to run twice to gather adequate data. Each run records the current usage and uptime of the process and compares them to the previous run.
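
The exact implementation is internal to the check, but the calculation it performs resembles the following; a minimal sketch run inside a container, assuming cgroup v1 CPU accounting is readable (the path can vary by host) and a CPU limit of 800m (0.8 cores):

$ prev_usage=$(cat /sys/fs/cgroup/cpuacct/cpuacct.usage); prev_time=$(date +%s%N)
$ sleep 60
$ curr_usage=$(cat /sys/fs/cgroup/cpuacct/cpuacct.usage); curr_time=$(date +%s%N)
$ awk -v du="$((curr_usage - prev_usage))" -v dt="$((curr_time - prev_time))" \
    'BEGIN { printf "usage: %.2f%%\n", du / dt / 0.8 * 100 }'   # CPU time used over the interval, as a percentage of the 0.8-core limit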

Note

This check is only applied to containers that have a CPU/Memory limit set and ignores any containers that do not have a limit set. To monitor a component, it must have a recommended starting limit set for CPU/Memory in the corresponding template.

This check must be executed twice before it will provide CPU/Memory usage for containers. A status of UNKNOWN is displayed for the first check.

An occasional spike in CPU/Memory usage for a particular component might not be a cause for concern, especially if the value slightly exceeds the WARNING threshold. For example, it is common for a higher CPU/Memory usage to be reported on component startup or after an upgrade, in which case no action is required.

If the check is reporting issues, consider one of the following actions:

4.1.2.1. Increasing Limits

All RHMAP components are created with a recommended limit for CPU/Memory, but these limits might need to be increased depending on individual usage of the platform.

  1. Check the CPU and Memory limits:

    $ oc describe dc/<component>

    The output is similar to the following:

      Limits:
        cpu:      800m
        memory:   800Mi
  2. Modify the CPU and Memory limits in the component’s deployment config. For example, to change both limits from 800m/800Mi to 900m/900Mi:

    $ oc edit dc/<component>

    Make the following edits:

       spec:
         containers:
         ...
           resources:
             limits:
               cpu: 900m
               memory: 900Mi

    You might want to change only one limit (CPU or Memory), depending on the Nagios status of CPU and Memory. Containers are automatically updated after the deployment configuration is changed. Alternatively, the limits can be changed non-interactively, as shown in the example after these steps.

  3. Verify the new limits:

    $ oc describe dc/<component>

    The output is similar to the following:

      Limits:
        cpu:      900m
        memory:   900Mi
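
As mentioned in step 2, the limits can also be changed non-interactively; a minimal sketch using oc set resources (the component name is a placeholder):

$ oc set resources dc/<component> --limits=cpu=900m,memory=900Mi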

4.1.2.2. Troubleshooting Component Container

If a high CPU/Memory usage is regularly reported for a component’s container, especially after an update, it might indicate a problem with the component. Checking the container’s logs can give some indication of internal errors:

$ oc logs -f <pod id> -c <container id>
Note

The <pod id> and <container id> can be retrieved from the status information of the Nagios check

  CPU example - OK: mysql-1-8puwv:mysql: - usage: 0.17%
  Memory example - OK: mysql-1-8puwv:mysql: - usage: 57.1%

In this example, the <pod id> is mysql-1-8puwv and the <container id> is mysql. To scale the component down and up again, delete the pod and wait for the deployment config to recreate it:

$ oc delete pod <component pod id>
$ watch 'oc get pods | grep <component>'

The output is similar to the following:

<component>-2-9xeva    1/1       Running   0          10s

4.1.3. Resolving UNKNOWN Issues for CPU

The CPU check must be executed at least twice before it can gather adequate data to calculate the usage of a container. If a container has not previously been checked, usage cannot be calculated and Nagios displays a status of UNKNOWN. Forcing service checks on all containers ensures that all service checks are executed at least twice.

To force host service checks:

$ oc exec -ti "$(oc get pods | grep nagios | awk '{print $1}')" -- /opt/rhmap/host-svc-check

4.1.4. Resolving UNKNOWN Issues for Memory

An UNKNOWN status should never occur and usually indicates an internal error in the check. Contact Support and include the status information displayed for the check.

4.1.5. Resolving Reported High Memory Usage

Millicore loads some binary objects into memory and stores them in MySQL when completing app builds and when using the app store. This can cause a spike in the reported memory usage. The issue relates to the reported container memory usage rather than JVM memory usage, and will be addressed in a future release.

If the reported memory usage stays high or reaches the CRITICAL threshold, redeploy the millicore and mysql deployment configs:

$ oc deploy dc/millicore --latest
$ oc deploy dc/mysql --latest

4.2. Troubleshooting Disk Usage Issues

4.2.1. Reviewing Disk Usage Status

This section applies to Core and MBaaS components.

The status levels for Disk Usage are:

  • OK - Disk usage information was retrieved for all containers without error and all volumes are under the WARNING threshold of 80%.
  • WARNING - One or more volumes are over the WARNING threshold of 80%.
  • CRITICAL - One or more volumes are over the CRITICAL threshold of 90%.
  • UNKNOWN - An internal error occurred and the check was unable to determine a correct status. The error code and message are provided, which may help troubleshooting.

4.2.2. Resolving CRITICAL and WARNING Issues

The Disk Usage check verifies that disk usage for all volumes mounted in all containers of the current project is within the desired threshold.

For each volume, the check retrieves the disk usage from the container as a percentage of both blocks and inodes, and compares the highest percentage against the thresholds to determine the status.

Note

Due to the large number of volumes checked, volumes with usage below 10% are not reported in the detailed output.
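
To inspect a volume’s usage manually, you can run df inside the affected container; a minimal sketch, with the pod name and mount path as placeholders:

$ oc rsh <pod id>
$ df -P <mount path>     # block usage for the volume
$ df -iP <mount path>    # inode usage for the volume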

4.2.2.1. Increasing Size of Persistent Volume

Many factors can affect storage in OpenShift. The following suggestions can help resolve common issues:

  1. Moving a container to another node with more resources can resolve issues where a non-persistent volume has insufficient space on the node.
  2. Redeploying a pod and its containers (see the example below) can resolve issues associated with temporary or shared memory volumes.
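
For example, to redeploy a component’s pod as described in the second suggestion, the same redeploy command used elsewhere in this chapter can be applied (the component name is a placeholder):

$ oc deploy dc/<component> --latest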

4.2.2.2. Troubleshooting Component Container

If a particularly high disk usage is continually indicated in a component’s container, especially after an update, it may indicate a problem with the component. Checking the container’s logs can give some indication of internal errors:

$ oc logs -f <pod id> -c <container id>

4.3. Troubleshooting MongoDB Issues

4.3.1. Reviewing MongoDB Status

The following applies to MBaaS only.

The status levels for MongoDB are:

  • OK - The MongoDB replica set is running, can be connected to, and has a single primary member.
  • WARNING - The service is running, but non-critical issues are present.
  • CRITICAL - The service is not functioning correctly.
  • UNKNOWN - No pods running MongoDB were located.

4.3.2. Resolving CRITICAL Issues

The following describes how to check that the MongoDB replica set is correctly configured and healthy. The course of action required is determined by the current status and the information returned by the check. For each individual check, view this information using the Nagios dashboard.
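
To inspect the replica set state directly, you can run rs.status() inside one of the MongoDB pods; a minimal sketch, using the admin credentials described later in this section:

$ oc rsh <mongodb-pod-name>
$ mongo admin -u admin -p "${MONGODB_ADMIN_PASSWORD}" --eval 'rs.status()'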

4.3.2.1. Troubleshooting a Failed State

A member has a state of:

  1. Recovering or Startup2

    This can happen shortly after a restart of the replica set, in which case it usually corrects itself after a short time. If the problem persists, delete the malfunctioning pod and allow time for the deployment config to recreate it:

    $ oc delete pod <mongodb-pod-name>
  2. Fatal, Unknown, Down or Rollback

    This indicates a problem with the node in the replica set. Delete the pod to force the deployment config to recreate it; the new pod should correct the issue:

    $ oc delete pod <mongodb-pod-name>
  3. Removed

    This is possibly due to a pod connecting to the replica set before the pod is able to resolve DNS. Restart the replica set config inside the pod and wait a short time for the member to rejoin the replica set:

    $ oc rsh <mongodb-pod-name>
    $ mongo admin -u admin -p ${MONGODB_ADMIN_PASSWORD}
    > rs.reconfig(rs.config(), {force: true});

To determine the appropriate value for the MONGODB_ADMIN_PASSWORD environment variable, view the details for the primary member container in the OpenShift console:

  1. Log into the OpenShift Console
  2. Navigate to the mongodb-1 pod
  3. Click on Details to view the values of environment variables, including MONGODB_ADMIN_PASSWORD.
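
Alternatively, the value can usually be read from the command line; a minimal sketch, assuming the MongoDB deployment configuration is named mongodb-1:

$ oc set env dc/mongodb-1 --list | grep MONGODB_ADMIN_PASSWORD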

4.3.2.2. Troubleshooting Multiple Failed States

Delete one of the primary pods and allow the deployment config to restart it; the node should then rejoin as a secondary node:

$ oc delete pod <mongodb-pod-name>

4.3.3. Resolving WARNING Issues

4.3.3.1. Troubleshooting a Startup or Arbiter State

Delete the pod whose state is Startup or Arbiter and allow the deployment config to restart it:

$ oc delete pod <mongodb-pod-name>

4.3.3.2. Troubleshooting an Even Number of Voters

The MBaaS is configured to have an odd number of MongoDB services. An even number indicates that a MongoDB service has been removed or is incorrectly connected to the replica set. Check that each MongoDB service has a replica count of 1:

$ for dc in $(oc get dc | grep 'mongodb-[0-9]' | awk '{print $1}'); do echo $dc":"$(oc get dc $dc -o yaml | grep replicas); done

If the replica count of any MongoDB deployment config is not 1, edit the dc and set it to 1:

$ oc edit dc <mongodb-dc-name>
  spec:
    replicas: 1

4.3.3.3. Misconfiguration of Component Resources

The component’s resources, that is, the service, deployment configuration, and so on, do not match those of the original template. For example, ports, environment variables, or image versions are incorrect. The solution is to validate each resource against the original template, ensuring that all relevant attributes match.

4.3.4. Resolving UNKNOWN Issues

An UNKNOWN status should never occur and usually indicates an internal error in the check. Contact Support and include the status information displayed for the check.

4.4. Troubleshooting MySQL Issues

4.4.1. Reviewing MySQL Status

The following applies to Core only.

The status levels for MySQL are:

  • OK - The connection was successful and all checked metrics were gathered successfully.
  • WARNING - The connection most likely failed because MySQL is not yet ready to fully serve requests.
  • CRITICAL - The connection failed due to a major issue, such as being out of memory or unable to open a socket. The additional text may help point to what is wrong.
  • UNKNOWN - The deployment config for MySQL is scaled to 0, or an internal error occurred and the check was unable to determine a correct status. The error code and message are provided, which may help troubleshooting.

4.4.2. Resolving CRITICAL and UNKNOWN Issues

The following describes how to check a connection to MySQL and gather various metric values that are useful for diagnosing issues. Although the check gathers statistics, it does not validate them against a threshold. If the check can connect, authenticate, and run the metric collection commands against MySQL, it returns OK.

Sample output:

  Status Information: Uptime: 1461116 Threads: 8 Questions: 758388 Slow queries: 0 Opens: 588 Flush tables: 1 Open tables: 100 Queries per second avg: 0.519
  Performance Data: Connections=175748c;;; Open_files=20;;;
  Open_tables=100;;; Qcache_free_memory=0;;; Qcache_hits=0c;;;
  Qcache_inserts=0c;;; Qcache_lowmem_prunes=0c;;; Qcache_not_cached=0c;;;
  Qcache_queries_in_cache=0;;; Queries=758389c;;; Questions=753634c;;;
  Table_locks_waited=0c;;; Threads_connected=8;;; Threads_running=1;;;
  Uptime=1461116c;;;
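
Much of this information can also be gathered manually from inside the MySQL pod; a minimal sketch, assuming the standard OpenShift MySQL image, which exposes the MYSQL_USER and MYSQL_PASSWORD environment variables in the container:

$ oc rsh "$(oc get pods | grep mysql | awk '{print $1}')"
$ mysqladmin -u "$MYSQL_USER" -p"$MYSQL_PASSWORD" status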

4.4.2.1. Testing UNKNOWN status

To force an UNKNOWN status, execute:

$ oc scale --replicas=0 dc mysql

4.4.2.2. Testing CRITICAL status

To force a CRITICAL status after the UNKNOWN status, scale MySQL back up and force a Nagios recheck before MySQL has fully started:

$ oc scale --replicas=1 dc mysql
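
To force the Nagios recheck before MySQL has fully started, reuse the host service check command from Section 4.1.3:

$ oc exec -ti "$(oc get pods | grep nagios | awk '{print $1}')" -- /opt/rhmap/host-svc-check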

A status of OK should appear once startup is complete.

4.5. Troubleshooting Health Check Issues

4.5.1. Reviewing Health Check Status

The following applies to Core and MBaaS.

The status levels for Health Check are:

  • OK - The component is responding to the health check and reports that it can successfully communicate with all dependent components.
  • CRITICAL - The component health endpoint has responded but reports an issue with a dependent component.
  • UNKNOWN - The component health endpoint has not responded.

4.5.2. Resolving CRITICAL Issues

The component health endpoint is responding (output can be viewed), but the output reports an issue with a dependent component.

4.5.2.1. Troubleshooting a Critical Test Item Error

The reported error is:

A critical test item encountered an error. Please investigate this. See the "details" object for specifics.

Access the Nagios dashboard and examine the component’s state. View the output of the component health endpoint by selecting the Service named <component>::Health from the Services screen of the dashboard. This output should display which component is experiencing issues.
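
The health endpoint can also be queried directly from inside the Nagios pod; a minimal sketch, assuming curl is available in the container and substituting the health endpoint URL reported by the check:

$ oc exec -ti "$(oc get pods | grep nagios | awk '{print $1}')" -- curl -s <component-health-endpoint>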

In the following example, the fh-supercore health endpoint is reporting an issue connecting to fh-metrics. The full error is:

Check: fh-supercore::health

A critical test item encountered an error. Please investigate this. See the "details" object for specifics.
Test: Check fh-metrics - Status: crit - Details: {u'status': u'crit', u'id': u'fh-fhmetrics', u'error': {u'syscall': u'connect', u'errno': u'EHOSTUNREACH', u'code': u'EHOSTUNREACH', u'port': 8080, u'address': u'172.30.112.65'}}

Check the Nagios dashboard for issues with the component’s Ping check by following the guidance in Section 4.6, Troubleshooting Ping Check Issues.

4.5.3. Resolving UNKNOWN Issues

The component health endpoint has failed to respond.

4.5.3.1. Troubleshooting No Route to Host Error

The reported error is:

RequestError: The health endpoint at "<component-health-endpoint>" is not contactable. Error: <urlopen error [Errno 113] No route to host>

Check the Nagios dashboard for issues with the component’s Ping check by following the guidance in Section 4.6, Troubleshooting Ping Check Issues.

4.6. Troubleshooting Ping Check Issues

4.6.1. Reviewing Ping Check Status

The following applies to Core and MBaaS.

The status levels for Ping Check are:

  • OK - The service is running and an HTTP connection can be established.
  • WARNING - The service is running, but there was an issue with the connection, for example an authentication failure or a bad request.
  • CRITICAL - The service is down and an HTTP connection cannot be established.
  • UNKNOWN - An internal error occurred and the check was unable to determine a correct status.

4.6.2. Resolving CRITICAL Issues

The following describes how to check that an HTTP connection can be established to the RHMAP component service.

Check the current status and the status information returned by the check using the Nagios dashboard.
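
The connection attempt can also be reproduced manually from inside the Nagios pod; a minimal sketch, assuming curl is available in the container and that the component service listens on port 8080:

$ oc exec -ti "$(oc get pods | grep nagios | awk '{print $1}')" -- curl -sv http://<component>:8080/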

4.6.2.1. Troubleshooting a TCP Socket Error

An attempt was made to connect to address <component> on port 8080 and resulted in an error.

The full error is:

No Route To host HTTP CRITICAL - Unable to open TCP socket

The service is not running or not contactable. This could be an indication of:

  1. No component deployment config

    $ oc get dc <component>

    If the command returns no deployment config, recreate the deployment configuration for this component from the installer templates. After recreation, the output is similar to the following:

    $ oc get dc <component>
    NAME       	REVISION   REPLICAS   TRIGGERED BY
    <component>	1      	1      	config
  2. No replicas for component deployment config (scaled to 0).

    $ oc get dc <component>
    NAME       	REVISION   REPLICAS   TRIGGERED BY
    <component>	1      	0      	config

    The solution is to scale the component so it has at least 1 replica:

    $ oc scale --replicas=1 dc/<component>
    $ oc get dc <component>
    NAME       	REVISION   REPLICAS   TRIGGERED BY
    <component>	1      	1      	config
  3. No component service

    $ oc get svc <component>

    If the command returns no service, re-create the service for this component from the installer templates. After re-creation, the output is similar to the following:

    $ oc get svc <component>
    NAME       	CLUSTER-IP   	EXTERNAL-IP   PORT(S)	AGE
    <component>	172.30.204.178   <none>    	8080/TCP   1d

4.6.2.2. Misconfiguration of Component Resources

The component’s resources, that is, the service, deployment configuration, and so on, do not match those of the original template. For example, ports, environment variables, or image versions are incorrect.

The solution is to validate each resource against the original template ensuring that all relevant attributes match.
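
To compare, dump the live resource definitions and check them against the corresponding template; for example:

$ oc get dc <component> -o yaml
$ oc get svc <component> -o yaml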

4.6.3. Resolving WARNING Issues

A WARNING status most commonly occurs when the service is running and can be contacted, but the request fails with a client error (a 4xx HTTP status).

4.6.3.1. Troubleshooting HTTP 4xx status

The request failed with a 4xx HTTP status. This could be an indication of:

  1. Misconfiguration of Component Resources

    The component’s resources, that is, the service, deployment configuration, and so on, do not match those of the original template. For example, ports, environment variables, or image versions are incorrect.

    The solution is to validate each resource against the original template and ensure all relevant attributes match.

  2. Performance Issues

    The Ping check might enter a WARNING state if the component under test has performance-related issues and the request takes too long to respond.

    The solution is to ensure that CPU and Memory usage is within the desired thresholds for the component. Check this by following Section 4.1, Troubleshooting CPU/Memory Issues.

4.6.4. Resolving UNKNOWN Issues

An UNKNOWN status should never occur and usually indicates an internal error in the check. Contact Support and include the status information displayed for the check.