Chapter 4. Troubleshooting Core and MBaaS using Nagios
If you encounter issues when working with Core or MBaaS, Nagios can assist with troubleshooting the following:
- CPU - CPU check
- Memory - Memory check
- Storage - Disk Usage check
- MongoDB - MongoDB check
- MySQL - MySQL check
- Overall health - Health Check
- Network - Ping Check
CPU and Memory are described in the same section.
Each Nagios check provides a status of the current situation. For example, the status for the CPU check might be OK.
There are four different status levels:
- OK
- WARNING
- CRITICAL
- UNKNOWN
4.1. Troubleshooting CPU/Memory Issues
4.1.1. Reviewing CPU and Memory Status
This section applies to Core and MBaaS components.
The status levels for CPU and Memory are:
| Status | Description |
|---|---|
| OK | The CPU/Memory usage for all containers is within the set limits |
| WARNING | The CPU/Memory usage for one or more containers is over the WARNING threshold of 80% |
| CRITICAL | The CPU/Memory usage for one or more containers is over the CRITICAL threshold of 90% |
| UNKNOWN | An internal error has occurred and the check was unable to determine a correct status for this check |
4.1.2. Resolving CRITICAL and WARNING Issues
The following describes how to check that CPU/Memory usage is within the desired threshold for all containers in the current project.
In order to calculate a container’s current CPU/Memory usage, the check needs to run twice to gather adequate data and make the calculation. Each check records the current usage and uptime of the process and compares it to the previous check.
This check is only applied to containers that have a CPU/Memory limit set and ignores any containers that do not have a limit set. To monitor a component, it must have a recommended starting limit set for CPU/Memory in the corresponding template.
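To see whether a limit is set for a component’s containers, you can inspect its deployment configuration. A minimal sketch, using the component name placeholder used elsewhere in this chapter:
$ oc get dc <component> -o yaml | grep -A 4 resources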
This check must be executed twice before it will provide CPU/Memory usage for containers. A status of UNKNOWN is displayed for the first check.
An occasional spike in CPU/Memory usage for a particular component might not be a cause for concern, especially if the value slightly exceeds the WARNING threshold. For example, it is common for a higher CPU/Memory usage to be reported on component startup or after an upgrade, in which case no action is required.
If the check is reporting issues, consider one of the following actions:
4.1.2.1. Increasing Limits
All RHMAP components are created with a recommended limit for CPU/Memory, but these limits might need to be increased depending on individual usage of the platform.
Check the CPU and Memory limits:
$ oc describe dc/<component>
The output is similar to the following:
Limits:
  cpu: 800m
  memory: 800Mi
Modify the CPU and Memory limits in the component's deployment configuration. For example, to change both limits from 800 to 900:
$ oc edit dc/<component>
Make the following edits:
spec:
  containers:
    ...
    resources:
      limits:
        cpu: 900m
        memory: 900Mi
You might want to change only one limit (CPU or Memory), depending on the Nagios status for CPU and Memory. Containers are automatically updated after the deployment configuration is changed.
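Alternatively, if your oc client supports it, the limits can be changed without opening an editor. A minimal sketch (the values shown are examples only):
$ oc set resources dc/<component> --limits=cpu=900m,memory=900Mi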
Verify the new limits:
$ oc describe dc/<component>
The output is similar to the following:
Limits:
  cpu: 900m
  memory: 900Mi
4.1.2.2. Troubleshooting Component Container
If a high CPU/Memory usage is regularly reported for a component’s container, especially after an update, it might indicate a problem with the component. Checking the container’s logs can give some indication of internal errors:
$ oc logs -f <pod id> -c <container id>
The <pod id> and <container id> can be retrieved from the status information of the Nagios check:
CPU example - OK: mysql-1-8puwv:mysql: - usage: 0.17%
Memory example - OK: mysql-1-8puwv:mysql: - usage: 57.1%
In this example, the <pod id> is mysql-1-8puwv and the <container id> is mysql. To scale the component down and up again, delete the pod and wait for the deployment configuration to recreate it:
$ oc delete pod <component pod id>
$ watch 'oc get pods | grep <component>'
The output is similar to the following:
<component>-2-9xeva 1/1 Running 0 10s
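If a container repeatedly exceeds its Memory limit, the platform may terminate and restart it. A minimal sketch for checking this, assuming the pod ID from the Nagios status information (the exact field names depend on your OpenShift version):
$ oc get pods | grep <component>
$ oc describe pod <component pod id> | grep -A 3 'Last State'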
4.1.3. Resolving UNKNOWN Issues for CPU
The CPU check must be executed at least twice before it can gather adequate data to calculate the usage of a container. If a container has not previously been checked, usage cannot be calculated and Nagios displays a status of UNKNOWN. Forcing service checks on all containers ensures that all service checks are executed at least twice.
To force host service checks:
$ oc exec -ti "$(oc get pods | grep nagios | awk '{print $1}')" -- /opt/rhmap/host-svc-check
4.1.4. Resolving UNKNOWN Issues for Memory
An UNKNOWN status should never occur and usually indicates an internal error in the check. Contact Support and include the status information displayed for the check.
4.1.5. Resolving Reported High Memory Usage
Millicore loads some binary objects into memory and stores them in MySQL when completing app builds and when using the app store. This can cause a spike in the reported memory usage. This issue is not related to JVM memory usage, but to the reported container memory usage, and will be addressed in a future release.
If the reported memory usage stays high or reaches the CRITICAL threshold, redeploy the millicore and mysql components:
$ oc deploy dc/millicore --latest
$ oc deploy dc/mysql --latest
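To confirm that the redeployed pods start up cleanly, you can reuse the watch pattern shown earlier. A minimal sketch:
$ watch 'oc get pods | grep -E "millicore|mysql"'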
4.2. Troubleshooting Disk Usage Issues
4.2.1. Reviewing Disk Usage Status
This section applies to Core and MBaaS components.
The status levels for Disk Usage are:
| Status | Description |
|---|---|
| OK | Disk usage information was retrieved for all containers without error and all volumes are under the WARNING threshold of 80% |
| WARNING | One or more volumes are over the WARNING threshold of 80% |
| CRITICAL | One or more volumes are over the CRITICAL threshold of 90% |
| UNKNOWN | An internal error has occurred and the check was unable to determine a correct status for this check. The error code and message are provided, which may help with troubleshooting |
4.2.2. Resolving CRITICAL and WARNING Issues
The following describes how to check that disk usage for all volumes mounted in containers in the current project is within the desired threshold.
For each volume, the check retrieves the disk usage from the container as a percentage, for both blocks and inodes. The highest percentage is compared against the thresholds to determine the status.
Due to the large number of volumes checked, volumes with usage below 10% are not reported in the detailed output.
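To inspect block and inode usage for a container’s volumes manually, a minimal sketch using standard df options (the pod and container IDs can be taken from the Nagios status information):
$ oc exec <pod id> -c <container id> -- df -h
$ oc exec <pod id> -c <container id> -- df -i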
4.2.2.1. Increasing Size of Persistent Volume
Many factors can affect storage in OpenShift. The following suggestions can help to resolve common issues (see the inspection sketch after this list):
- Moving a container to another node with more resources can resolve issues where a non-persistent volume has insufficient space on the node.
- Redeploying a pod and containers can resolve issues associated with temporary/shared memory volumes.
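To review the persistent volume claims backing a component, a minimal sketch (claim names vary by installation):
$ oc get pvc
$ oc describe pvc <claim-name>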
4.2.2.2. Troubleshooting Component Container
If a particularly high disk usage is continually indicated in a component’s container, especially after an update, it may be an indication that there is a problem with the component. Checking the container’s logs can give some indication of internal errors.
$ oc logs -f <pod id> -c <container id>
4.3. Troubleshooting MongoDB Issues
4.3.1. Reviewing MongoDB Status
The following applies to MBaaS only.
The status levels for MongoDB are:
| Status | Description |
|---|---|
| OK | The MongoDB replicaset is running, can be connected to, and has a single primary member |
| WARNING | The service is running, with non-critical issues present |
| CRITICAL | The service is not functioning correctly |
| UNKNOWN | No pods running MongoDB were located |
4.3.2. Resolving CRITICAL Issues
The following describes how to check that the MongoDB replicaset is correctly configured and healthy. The course of action required is determined by the current status and the information returned by the check. For each individual check, view this information using the Nagios dashboard.
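The replicaset state can also be inspected directly from one of the MongoDB pods. A minimal sketch, assuming the admin credentials are available in the MONGODB_ADMIN_PASSWORD environment variable as described later in this section:
$ oc rsh <mongodb-pod-name>
$ mongo admin -u admin -p ${MONGODB_ADMIN_PASSWORD} --eval "printjson(rs.status())"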
4.3.2.1. Troubleshooting a Failed State
A member has a state of:
Recovering or Startup2
This can happen shortly after a restart of the replicaset, in which case it usually corrects itself after a short time. If the problem persists, delete the malfunctioning pod and allow time for the deployment configuration to recreate it:
$ oc delete pod <mongodb-pod-name>
Fatal, Unknown, Down or Rollback
This indicates a problem with the node in the replicaset. Delete the pod to force the deployment configuration to recreate it; the new pod should correct the issue:
$ oc delete pod <mongodb-pod-name>
Removed
This is possibly due to a pod connecting to the replicaset before the pod is able to resolve DNS. Restart the replicaset config inside the pod and wait a short time for the member to join the replicaset.
$ oc rsh <mongodb-pod-name>
$ mongo admin -u admin -p ${MONGODB_ADMIN_PASSWORD}
> rs.reconfig(rs.config(), {force: true});
To determine the appropriate value for the MONGODB_ADMIN_PASSWORD environment variable, view the details for the primary member container in the OpenShift console:
- Log in to the OpenShift Console.
- Navigate to the mongodb-1 pod.
- Click Details to view the values of environment variables, including MONGODB_ADMIN_PASSWORD.
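Alternatively, a minimal sketch of reading the value from the command line, assuming the variable is set directly on the pod specification rather than via a secret:
$ oc get pod <mongodb-pod-name> -o yaml | grep -A 1 MONGODB_ADMIN_PASSWORD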
4.3.2.2. Troubleshooting Multiple Failed States
Remove one of the primary nodes and allow the deployment configuration to restart it, which should result in the node rejoining as a secondary node:
$ oc delete pod <mongodb-pod-name>
4.3.3. Resolving WARNING Issues
4.3.3.1. Troubleshooting a Startup or Arbiter State
Remove the node whose state is Startup or Arbiter and allow the deployment configuration to restart it:
$ oc delete pod <mongodb-pod-name>
4.3.3.2. Troubleshooting an Even Number of Voters
The MBaaS is configured to have an odd number of MongoDB services. An even number indicates that a MongoDB service has been removed or is incorrectly connected to the replicaset. Check that each MongoDB service has a replica count of 1:
$ for dc in $(oc get dc | grep mongodb-[0-9] | awk '{print $1}'); do echo $dc":"$(oc get dc $dc -o yaml | grep replicas); done
If any MongoDB services are not set to 1, modify the deployment configuration and set the replica count to 1:
$ oc edit dc <mongodb-dc-name>
spec:
  replicas: 1
4.3.3.3. Misconfiguration of Component Resources
The component’s resources, that is the service, deployment configuration, and so on, do not match those of the original template. For example: ports, environment variables, or image versions are incorrect. The solution is to validate each resource against the original template, ensuring that all relevant attributes match.
4.3.4. Resolving UNKNOWN Issues
An UNKNOWN status should never occur and usually indicates an internal error in the check. Contact Support and include the status information displayed for the check.
4.4. Troubleshooting MySQL Issues
4.4.1. Reviewing MySQL Status
The following applies to Core only.
The status levels for MySQL are:
| Status | Description |
|---|---|
| OK | Connection was successful and all checked metrics were gathered successfully |
| WARNING | Connection most likely failed because MySQL is not yet ready to fully serve requests |
| CRITICAL | Connection failed due to a major issue, such as being out of memory or unable to open a socket. The additional text may help point to what is wrong |
| UNKNOWN | Deployment config for MySQL is scaled to 0, or an internal error has occurred and the check was unable to determine a correct status for this check. The error code and message are provided, which may help with troubleshooting |
4.4.2. Resolving CRITICAL and UNKNOWN Issues
The following describes how to check a connection to MySQL and gather various metric values that are useful for diagnosing issues. Although the check gathers statistics, it does not validate them against a threshold. If the check can connect, authenticate, and run the metric collection commands against MySQL, it returns OK.
Sample output:
Status Information: Uptime: 1461116 Threads: 8 Questions: 758388 Slow queries: 0 Opens: 588 Flush tables: 1 Open tables: 100 Queries per second avg: 0.519
Performance Data: Connections=175748c;;; Open_files=20;;; Open_tables=100;;; Qcache_free_memory=0;;; Qcache_hits=0c;;; Qcache_inserts=0c;;; Qcache_lowmem_prunes=0c;;; Qcache_not_cached=0c;;; Qcache_queries_in_cache=0;;; Queries=758389c;;; Questions=753634c;;; Table_locks_waited=0c;;; Threads_connected=8;;; Threads_running=1;;; Uptime=1461116c;;;
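Similar metrics can be gathered manually from the MySQL pod. A minimal sketch, assuming the standard OpenShift MySQL image where the root password is exposed as MYSQL_ROOT_PASSWORD (adjust the user and password to your deployment):
$ oc rsh $(oc get pods | grep mysql | awk '{print $1}')
$ mysqladmin -u root -p"${MYSQL_ROOT_PASSWORD}" status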
4.4.2.1. Testing UNKNOWN status
To force an UNKNOWN status, execute:
$ oc scale --replicas=0 dc mysql
4.4.2.2. Testing CRITICAL status
To force a CRITICAL status after achieving an UNKNOWN status, scale MySQL back up and force a Nagios recheck before MySQL has had a chance to fully start up:
$ oc scale --replicas=1 dc mysql
A status of OK should appear once startup is complete.
4.5. Troubleshooting Health Check Issues
4.5.1. Reviewing Health Check Status
The following applies to Core and MBaaS.
The status levels for Health Check are:
| Status | Description |
|---|---|
| OK | The component is responding to the health check and reporting that it can successfully communicate with all dependent components |
| CRITICAL | The component health endpoint has responded but has reported an issue with a dependent component |
| UNKNOWN | The component health endpoint has not responded |
4.5.2. Resolving CRITICAL Issues
The component health endpoint is responding, that is, output can be viewed, but an issue with a dependent component is reported in that output.
4.5.2.1. Troubleshooting a Critical Test Item Error
The reported error is:
A critical test item encountered an error. Please investigate this. See the "details" object for specifics.
Access the Nagios dashboard and examine the component’s state. View the output of the component health endpoint by selecting the Service named <component>::Health from the Services screen of the dashboard. This output should display which component is experiencing issues.
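The health endpoint can also be queried directly from inside the cluster, for example from the Nagios pod. A minimal sketch, assuming curl is available in that image and the component service serves its health data at /sys/info/health on port 8080 (the path can differ per component):
$ oc exec -ti "$(oc get pods | grep nagios | awk '{print $1}')" -- curl -s http://<component>:8080/sys/info/health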
For example, the fh-supercore health endpoint might report an issue connecting to fh-metrics. The full error is:
Check: fh-supercore::health
A critical test item encountered an error. Please investigate this. See the "details" object for specifics.
Test: Check fh-metrics - Status: crit - Details: {u'status': u'crit', u'id': u'fh-fhmetrics', u'error': {u'syscall': u'connect', u'errno': u'EHOSTUNREACH', u'code': u'EHOSTUNREACH', u'port': 8080, u'address': u'172.30.112.65'}}
Check the Nagios dashboard for issues with the component’s Ping check by following the Troubleshooting guide in the Ping Check section.
4.5.3. Resolving UNKNOWN Issues
The component health endpoint has failed to respond.
4.5.3.1. Troubleshooting No Route to Host Error
The reported error is:
RequestError: The health endpoint at "<component-health-endpoint>" is not contactable. Error: <urlopen error [Errno 113] No route to host>
Check the Nagios dashboard for issues with the component’s Ping check by following the Troubleshooting guide in the Ping Check section.
4.6. Troubleshooting Ping Check Issues
4.6.1. Reviewing Ping Check Status
The following applies to Core and MBaaS.
The status levels for Ping Check are:
| Status | Description |
|---|---|
| OK | The service is running and an HTTP connection can be established |
| WARNING | The service is running, but there was some issue with the connection. For example: Authentication failure, Bad Request |
| CRITICAL | The service is down and an HTTP connection cannot be established |
| UNKNOWN | An internal error has occurred and the check was unable to determine a correct status for this check |
4.6.2. Resolving CRITICAL Issues
The following describes how to check that an HTTP connection can be established to the RHMAP component service.
Check the current status and the status information returned by the check using the Nagios dashboard.
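You can reproduce the connection attempt manually from the Nagios pod. A minimal sketch, assuming curl is available in that image and the component service listens on port 8080:
$ oc exec -ti "$(oc get pods | grep nagios | awk '{print $1}')" -- curl -sv http://<component>:8080/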
4.6.2.1. Troubleshooting a TCP Socket Error
An attempt to connect to address <component> on port 8080 resulted in an error.
The full error is:
No Route To host HTTP CRITICAL - Unable to open TCP socket
The service is not running or not contactable. This could be an indication of:
No component deployment config
$ oc get dc <component>
The solution is to recreate the deployment configuration for this component from the installer templates.
$ oc get dc <component>
NAME          REVISION   REPLICAS   TRIGGERED BY
<component>   1          1          config
No replicas for component deployment config (scaled to 0).
$ oc get dc <component>
NAME          REVISION   REPLICAS   TRIGGERED BY
<component>   1          0          config
The solution is to scale the component so it has at least 1 replica.
$ oc scale --replicas=1 dc/<component>
$ oc get dc <component>
NAME          REVISION   REPLICAS   TRIGGERED BY
<component>   1          1          config
No component service
$ oc get svc <component>
The solution is to re-create the service for this component from the installer templates.
$ oc get svc <component>
NAME          CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
<component>   172.30.204.178   <none>        8080/TCP   1d
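If the service exists but connections still fail, it can help to confirm that the service actually has endpoints, that is, that it is backed by at least one running pod. A minimal sketch:
$ oc get endpoints <component>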
4.6.2.2. Misconfiguration of Component Resources
The component’s resources, that is the service, deployment configuration, and so on, do not match those of the original template. For example: ports, environment variables, or image versions are incorrect.
The solution is to validate each resource against the original template, ensuring that all relevant attributes match.
4.6.3. Resolving WARNING Issues
A WARNING status most commonly occurs when the service is running and can be contacted, but the request failed with a 4xx HTTP status.
4.6.3.1. Troubleshooting HTTP 4xx status
The request failed with a 4xx HTTP status. This could be an indication of:
Misconfiguration of Component Resources
The component’s resources, that is the service, deployment configuration, and so on, do not match those of the original template. For example: ports, environment variables, or image versions are incorrect.
The solution is to validate each resource against the original template and ensure all relevant attributes match.
Performance Issues
The Ping check might enter a WARNING state if the component under test has performance related issues and the request is taking too long to respond.
The solution is to ensure that the CPU and Memory usage is within the desired thresholds for the component. Check this by following Troubleshooting CPU/Memory Issues.
4.6.4. Resolving UNKNOWN Issues
An UNKNOWN status should never occur and usually indicates an internal error in the check. Contact Support and include the status information displayed for the check.
