Chapter 13. Troubleshooting the RHMAP MBaaS

13.1. Overview

This document describes how to identify, debug, and resolve issues that you might encounter during installation and use of the RHMAP MBaaS on OpenShift 3.

In Common Problems, you can see a list of resolutions for problems that might occur during installation or usage of the MBaaS.

13.2. Check the Health Endpoint of the MBaaS

The first step to check whether an MBaaS is running correctly is to see the output of its health endpoint. The HTTP health endpoint in the MBaaS reports the health status of the MBaaS and of its individual components.

From the command line, run the following command:

curl https://<MBaaS URL>/sys/info/health

To find the MBaaS URL:

  1. Navigate to the MBaaS that you want to check, by selecting Admin, then MBaaS Targets, and then choosing the MBaaS.
  2. Copy the value of the MBaaS URL field from the MBaaS details screen.

If the MBaaS is running correctly without any errors, the output of the command should be similar to the following, showing a "test_status": "ok" for each component:

{
 "status": "ok",
 "summary": "No issues to report. All tests passed without error",
 "details": [
  {
   "description": "Check fh-statsd running",
   "test_status": "ok",
   "result": {
    "id": "fh-statsd",
    "status": "OK",
    "error": null
   },
   "runtime": 6
  },
...
}

If there are any errors, the output will contain error messages in the result.error object of the details array for individual components. Use this information to identify which component is failing and to get information on further steps to resolve the failure.
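As a rough sketch of that triage, the following shell function lists the checks whose test_status is anything other than "ok". It assumes that the endpoint returns pretty-printed JSON with one key per line, as in the sample above. The function name failing_checks is hypothetical, and a JSON-aware tool would be more robust if the output is compact.

```shell
# Print "<description>: <test_status>" for every health check that is not "ok".
# Assumption: pretty-printed JSON with one key per line, as in the sample output.
failing_checks() {
  awk -F'"' '
    $2 == "description" { desc = $4 }
    $2 == "test_status" && $4 != "ok" { print desc ": " $4 }
  '
}

# Usage (replace <MBaaS URL> with the value from the MBaaS details screen):
#   curl -s https://<MBaaS URL>/sys/info/health | failing_checks
```

An empty result means that every listed check reported "ok".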

The health endpoint can also return an HTTP 503 Service Unavailable error. This can happen in several situations:

Alternatively, you can see the result of this health check in the Studio. Navigate to the Admin > MBaaS Targets section, select your MBaaS target, and click Check the MBaaS Status.

If there is an error, a screen shows the same information as described above. Use the provided links to the OpenShift Web Console and to the associated MBaaS project in OpenShift to help debug the issue on the OpenShift side.

[Figure: OpenShift 3 MBaaS target failure]

13.3. Analyze Logs

To see the logging output of individual MBaaS components, you must configure centralized logging in your OpenShift cluster. See Enabling Centralized Logging for a detailed procedure.

The section Identifying Issues in an MBaaS provides guidance on discovering MBaaS failures by searching and filtering its logging output.

13.4. Common Problems

13.4.1. A replica pod of mongodb-service is replaced with a new one

13.4.1.1. Summary

The replica set is susceptible to downtime if the replica set member configuration is not up to date with the actual set of pods. At least two members must be active at any time for a primary member to be elected. Without a primary member, the MongoDB service does not perform any read or write operations.

A MongoDB replica may get terminated in several situations:

  • A node hosting a MongoDB replica is terminated or evacuated.
  • A re-deploy is triggered on one of the MongoDB Deployment objects in the project – manually or by a configuration change.
  • One of the MongoDB deployments is scaled down to zero pods, then scaled back up to one pod.

To learn more about replication in MongoDB, see Replication in the official MongoDB documentation.
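The requirement of at least two active members follows from MongoDB electing a primary only when a majority of the voting members can reach each other. The following sketch works through that arithmetic for the three-member replica set used by the MBaaS. The function names are hypothetical, and the sketch assumes that every member has a vote.

```shell
# Smallest number of voting members that forms a majority of the replica set.
majority_needed() {
  members=$1
  echo $(( members / 2 + 1 ))
}

# Number of members that can be lost while a primary can still be elected.
tolerated_failures() {
  members=$1
  echo $(( members - (members / 2 + 1) ))
}

majority_needed 3      # prints 2
tolerated_failures 3   # prints 1
```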

13.4.1.2. Fix

The following procedure shows you how to re-configure a MongoDB replica set into a fully operational state. You must synchronize the list of replica set members with the actual set of MongoDB pods in the cluster, and set a primary member of the replica set.

  1. Note the MongoDB endpoints.

    oc get ep | grep mongo

    Make note of the list of endpoints. It is used later to set the replica set members configuration.

    NAME              ENDPOINTS                                      AGE
    mongodb-1         10.1.2.152:27017                               17h
    mongodb-2         10.1.4.136:27017                               17h
    mongodb-3         10.1.5.16:27017                                17h
  2. Log in to the oldest MongoDB replica pod.

    List all the MongoDB replica pods.

    oc get po -l name=mongodb-replica

    In the output, find the pod with the highest value in the AGE field.

    NAME                READY     STATUS    RESTARTS   AGE
    mongodb-1-1-4nsrv   1/1       Running   0          19h
    mongodb-2-1-j4v3x   1/1       Running   0          3h
    mongodb-3-2-7tezv   1/1       Running   0          1h

    In this case, it is mongodb-1-1-4nsrv with an age of 19 hours.

    Log in to the pod using oc rsh.

    oc rsh mongodb-1-1-4nsrv
  3. Open a MongoDB shell on the primary member.

    mongo admin -u admin -p ${MONGODB_ADMIN_PASSWORD} --host ${MONGODB_REPLICA_NAME}/localhost
    MongoDB shell version: 2.4.9
    connecting to: rs0/localhost:27017/admin
    [...]
    Welcome to the MongoDB shell.
    For interactive help, type "help".
    For more comprehensive documentation, see
        http://docs.mongodb.org/
    Questions? Try the support group
        http://groups.google.com/group/mongodb-user
    rs0:PRIMARY>
  4. List the configured members.

    Run rs.conf() in the MongoDB shell.

    rs0:PRIMARY> rs.conf()
    {
        "_id" : "rs0",
        "version" : 56239,
        "members" : [
            {
                "_id" : 3,
                "host" : "10.1.0.2:27017"
            },
            {
                "_id" : 4,
                "host" : "10.1.1.2:27017"
            },
            {
                "_id" : 5,
                "host" : "10.1.6.4:27017"
            }
        ]
    }
  5. Ensure all hosts have either PRIMARY or SECONDARY status.

    Run the following command. It may take several seconds to complete.

    rs0:PRIMARY> rs.status().members.forEach(function(member) {print(member.name + ' :: ' + member.stateStr)})
    mongodb-1:27017  :: PRIMARY
    mongodb-2:27017  :: SECONDARY
    mongodb-3:27017  :: SECONDARY
    rs0:PRIMARY>

    There must be exactly one PRIMARY node. All the other nodes must be SECONDARY. If a member is in a STARTUP, STARTUP2, RECOVERING, or UNKNOWN state, try running the above command again in a few minutes. These states may signify that the replica set is performing a startup, recovery, or other procedure potentially resulting in an operational state.
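The comparison between the endpoints noted in step 1 and the members listed in step 4 can be sketched as follows. This is only an illustration with hypothetical function and file names: it reports configured hosts that no longer match an endpoint, and endpoints that are not yet configured as members. It does not change the replica set configuration.

```shell
# Read `oc get ep` output on stdin; print the address of each MongoDB endpoint.
endpoint_hosts() {
  awk '$1 ~ /^mongodb-/ { print $2 }' | sort
}

# Read rs.conf() output on stdin; print the "host" value of each member.
configured_hosts() {
  grep '"host"' | cut -d'"' -f4 | sort
}

# Compare two sorted files of hosts: $1 = endpoints, $2 = configured members.
report_drift() {
  comm -13 "$1" "$2" | sed 's/^/stale member: /'
  comm -23 "$1" "$2" | sed 's/^/missing member: /'
}

# Usage:
#   oc get ep | endpoint_hosts > endpoints.txt
#   configured_hosts < conf.txt > members.txt   # conf.txt holds the rs.conf() output
#   report_drift endpoints.txt members.txt
```

With the sample outputs shown in steps 1 and 4, none of the configured hosts matches a current endpoint, so all three configured hosts would be reported as stale and all three endpoints as missing.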

13.4.1.3. Result

After applying the fix, all three MongoDB pods will be members of the replica set. If one of the three members terminates unexpectedly, the two remaining members are enough to keep the MongoDB service fully operational.

13.4.2. MongoDB doesn’t respond after repeated installation of the MBaaS

13.4.2.1. Summary

Note

The described situation can result from an attempt to create an MBaaS with the same name as a previously deleted one. We suggest you use unique names for every MBaaS installation.

If the mongodb-service is not responding after the installation of the MBaaS, it is possible that some of the MongoDB replica set members failed to start up. This can happen due to a combination of the following factors:

  • The most likely cause of failure in MongoDB startup is the presence of a mongod.lock lock file and journal files in the MongoDB data folder, left over from an improperly terminated MongoDB instance.
  • If a MongoDB pod is terminated, the associated persistent volumes transition to a Released state. When a new MongoDB pod replaces a terminated one, it may get attached to the same persistent volume which was attached to the terminated MongoDB instance, and thus get exposed to the files created by the terminated instance.
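To check a data directory for the leftovers described above, a small helper along the following lines may be useful. The function name and the directory argument are hypothetical. Note that a healthy, running MongoDB instance also keeps a mongod.lock file and a journal directory, so their presence only shows that an instance has written to the volume, not that the volume is broken.

```shell
# Report whether a MongoDB data directory holds a lock file or journal.
# $1 is the path to the data directory (hypothetical; use your own volume path).
check_leftovers() {
  dir=$1
  found=0
  if [ -e "${dir}/mongod.lock" ]; then
    echo "${dir}: mongod.lock present"
    found=1
  fi
  if [ -d "${dir}/journal" ]; then
    echo "${dir}: journal directory present"
    found=1
  fi
  if [ "${found}" -eq 0 ]; then
    echo "${dir}: no leftover files found"
  fi
}

# Usage:
#   check_leftovers /nfs/exp222
```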

13.4.2.2. Fix

Note

SSH access and administrator rights on the OpenShift master and the NFS server are required for the following procedure.

Note

This procedure describes a fix only for persistent volumes backed by NFS. Refer to Configuring Persistent Storage in the official OpenShift documentation for general information on handling other volume types.

The primary indicator of this situation is the mongodb-initiator pod not reaching the Completed status.

Run the following command to see the status of mongodb-initiator:

oc get pod mongodb-initiator
NAME                READY     STATUS      RESTARTS   AGE
mongodb-initiator   1/1       Running     0          5d

If the status is anything other than Completed, the MongoDB replica set has not been created properly. If mongodb-initiator stays in this state for too long, one of the MongoDB pods may have failed to start. To confirm whether this is the case, check the logs of mongodb-initiator using the following command:

oc logs mongodb-initiator
=> Waiting for 3 MongoDB endpoints ...mongodb-1
mongodb-2
mongodb-3
=> Waiting for all endpoints to accept connections...

If the above message is the last one in the output, it signifies that some of the MongoDB pods are not responding.
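This check can be scripted. The following sketch, with a hypothetical function name, reads the log on standard input and succeeds only when the last non-empty line is still the waiting message shown above.

```shell
# Read mongodb-initiator logs on stdin. Exit status 0 means that the last
# non-empty line is still the "Waiting for all endpoints" message.
initiator_is_waiting() {
  last=$(grep -v '^[[:space:]]*$' | tail -n 1)
  case "${last}" in
    *"Waiting for all endpoints to accept connections"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   if oc logs mongodb-initiator | initiator_is_waiting; then
#     echo "Some MongoDB pods are not responding yet"
#   fi
```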

Check the event log by running the following command:

oc get ev

If the output contains a message similar to the following, continue with the procedure below to clean up the persistent volumes:

  FailedMount   {kubelet ip-10-0-0-100.example.internal}   Unable to mount volumes for pod "mongodb-1-1-example-mbaas": Mount failed: exit status 32

The following procedure will guide you through the process of deleting contents of existing persistent volumes, creating new persistent volumes, and re-creating persistent volume claims.

  1. Find the NFS paths.

    On the OpenShift master node, execute the following command to find the paths of all persistent volumes associated with an MBaaS. Replace <mbaas-project-name> with the name of the MBaaS project in OpenShift.

    list=$(oc get pv | grep <mbaas-project-name> | awk '{ print $1 }')
    for pv in ${list}; do
      path=$(oc describe pv "${pv}" | grep Path: | awk '{ print $2 }' | tr -d '\r')
      echo "${path}"
    done

    Example output:

    /nfs/exp222
    /nfs/exp249
    /nfs/exp255
  2. Delete all contents of the found NFS paths.

    Log in to the NFS server using ssh.

    Execute the following command to list contents of the paths. Replace <NFS paths> with the list of paths from the previous step, separated by spaces.

    for path in <NFS paths> ; do
      echo ${path}
      sudo ls -l ${path}
      echo " "
    done

    Example output:

    /nfs/exp222
    -rw-------. 1001320000 nfsnobody admin.0
    -rw-------. 1001320000 nfsnobody admin.ns
    -rw-------. 1001320000 nfsnobody fh-mbaas.0
    -rw-------. 1001320000 nfsnobody fh-mbaas.ns
    -rw-------. 1001320000 nfsnobody fh-metrics.0
    -rw-------. 1001320000 nfsnobody fh-metrics.ns
    -rw-------. 1001320000 nfsnobody fh-reporting.0
    -rw-------. 1001320000 nfsnobody fh-reporting.ns
    drwxr-xr-x. 1001320000 nfsnobody journal
    -rw-------. 1001320000 nfsnobody local.0
    -rw-------. 1001320000 nfsnobody local.1
    -rw-------. 1001320000 nfsnobody local.ns
    -rwxr-xr-x. 1001320000 nfsnobody mongod.lock
    drwxr-xr-x. 1001320000 nfsnobody _tmp
    
    /nfs/exp249
    drwxr-xr-x. 1001320000 nfsnobody journal
    -rw-------. 1001320000 nfsnobody local.0
    -rw-------. 1001320000 nfsnobody local.ns
    -rwxr-xr-x. 1001320000 nfsnobody mongod.lock
    drwxr-xr-x. 1001320000 nfsnobody _tmp
    
    /nfs/exp255
    drwxr-xr-x. 1001320000 nfsnobody journal
    -rw-------. 1001320000 nfsnobody local.0
    -rw-------. 1001320000 nfsnobody local.ns
    -rwxr-xr-x. 1001320000 nfsnobody mongod.lock
    drwxr-xr-x. 1001320000 nfsnobody _tmp

    Warning

    Make sure to back up all data before proceeding. The following operation may result in irrecoverable loss of data.

    If the listed contents of the paths resemble the output shown above, delete all contents of the found NFS paths. Replace <NFS paths> with the list of paths from step 1, separated by spaces.

    for path in <NFS paths>
    do
      if [ -z "${path+x}" ]
      then
        echo "path is unset"
      else
        echo "path is set to '${path}'"
        # rm runs only if changing into the directory succeeds
        cd "${path}" && rm -rf ./*
      fi
    done
  3. Re-create persistent volumes.

    Log in to the OpenShift master node using ssh.

    Navigate to the directory which contains the YAML files that were used to create the persistent volumes.

    Execute the following command to delete and re-create the persistent volumes. Replace <mbaas-project-name> with the name of the MBaaS project in OpenShift.

    list=$(oc get pv | grep <mbaas-project-name> | awk '{ print $1}');
    for pv in ${list}; do
      oc delete pv ${pv}
      oc create -f ${pv}.yaml
    done

    The persistent volumes are now re-created and in Available state.

    Note

    The re-created persistent volumes will not be used by OpenShift again for the same persistent volume claims. Make sure you have at least three additional persistent volumes in Available state.

  4. Re-create persistent volume claims for MongoDB.

    Create three JSON files, with the following names:

    • mongodb-claim-1.json
    • mongodb-claim-2.json
    • mongodb-claim-3.json

    Copy the following contents into each file. Change the metadata.name value to match the name of the file without the suffix. For example, the contents for the mongodb-claim-1.json file are as follows:

    {
      "kind": "PersistentVolumeClaim",
      "apiVersion": "v1",
      "metadata": {
        "name": "mongodb-claim-1"
      },
      "spec": {
        "accessModes": ["ReadWriteOnce"],
        "resources": {
          "requests": {
            "storage": "50Gi"
          }
        }
      }
    }

    Run the following command to re-create the persistent volume claims.

    for pvc in mongodb-claim-1 mongodb-claim-2 mongodb-claim-3; do
      oc delete pvc ${pvc}
      oc create -f ${pvc}.json
    done
  5. Verify that mongodb-initiator proceeds with initialization.

    Run the following command to see the logs of mongodb-initiator.

    oc logs mongodb-initiator -f

    After mongodb-initiator completes its work, the log output should contain the following message, indicating that the MongoDB replica set was successfully created.

    => Successfully initialized replSet

13.4.2.3. Result

The MongoDB service is fully operational with all three replicas attached to their persistent volumes. The persistent volumes left in Released state from the previous installation are now in the Available state, ready for use by other persistent volume claims.
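To confirm the state of the persistent volumes, you can count how many of them are reported as Available. The following sketch uses a hypothetical function name and assumes that the status appears as a separate whitespace-delimited word in the `oc get pv` output.

```shell
# Read `oc get pv` output on stdin and print the number of lines that
# contain the word "Available" as a separate field.
count_available_pvs() {
  awk '
    { for (i = 1; i <= NF; i++) if ($i == "Available") { n++; break } }
    END { print n + 0 }
  '
}

# Usage:
#   oc get pv | count_available_pvs
```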

13.4.3. MongoDB replica set stops replicating correctly

13.4.3.1. Summary

If some of the MBaaS components start to crash, the cause may be that they cannot connect to a primary member in the MongoDB replica set. This usually indicates that the replica set configuration has become inconsistent, which can happen if a majority of the member pods are replaced and have new IP addresses. In this case, data cannot be written to or read from the MongoDB replica set in the MBaaS project.

To verify the replica set state as seen by each member, run the following command in the shell of a user logged in to OpenShift with access to the MBaaS project:

for i in `oc get po -a | grep -e "mongodb-[0-9]\+"  | awk '{print $1}'`; do
  echo "## ${i} ##"
  echo mongo admin -u admin -p \${MONGODB_ADMIN_PASSWORD} --eval "printjson\(rs.status\(\)\)" |  oc rsh --shell='/bin/bash' $i
done

For a fully consistent replica set, the output for each member would contain a members object listing details about each member. If the output resembles the following, containing the "ok" : 0 value for some members, proceed to the fix in order to make the replica set consistent.

## mongodb-1-1-8syid ##
MongoDB shell version: 2.4.9
connecting to: admin
{
    "startupStatus" : 1,
    "ok" : 0,
    "errmsg" : "loading local.system.replset config (LOADINGCONFIG)"
}
## mongodb-2-1-m6ao1 ##
MongoDB shell version: 2.4.9
connecting to: admin
{
    "startupStatus" : 1,
    "ok" : 0,
    "errmsg" : "loading local.system.replset config (LOADINGCONFIG)"
}
## mongodb-3-2-e0a11 ##
MongoDB shell version: 2.4.9
connecting to: admin
{
    "startupStatus" : 1,
    "ok" : 0,
    "errmsg" : "loading local.system.replset config (LOADINGCONFIG)"
}
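When the output is long, the affected members can be picked out with a small filter. The following sketch, with a hypothetical function name, reads the output of the loop above and prints the name of each pod whose status contains "ok" : 0.

```shell
# Read the output of the verification loop on stdin and print each pod
# whose rs.status() result contains "ok" : 0.
inconsistent_members() {
  awk '
    /^## / { pod = $2 }
    /"ok" *: *0/ { print pod }
  '
}

# Usage: save the loop output to a file, then run
#   inconsistent_members < status.txt
```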

13.4.3.2. Fix

You can make the replica set consistent by forcing a re-deploy.

  1. Note the MongoDB pods that are in an error status.

    oc get po

    Example:

    NAME                   READY     STATUS    RESTARTS   AGE
    mongodb-1-1-pu0fz      1/1       Error     0          1h
  2. Force a new deployment of the affected pod.

    oc deploy mongodb-1 --latest
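If several MongoDB pods are in an error status, the two steps can be combined. The following sketch uses hypothetical function names and assumes that pod names follow the pattern shown in the examples: the deployment configuration name, followed by a deployment number and a random suffix.

```shell
# Map a pod name such as mongodb-1-1-pu0fz to its deployment configuration
# (mongodb-1) by dropping the deployment number and the random suffix.
dc_for_pod() {
  echo "$1" | sed 's/-[^-]*-[^-]*$//'
}

# Read `oc get po` output on stdin; print the deployment configuration of
# every MongoDB replica pod whose STATUS column is Error.
errored_mongodb_dcs() {
  awk '$1 ~ /^mongodb-[0-9]/ && $3 == "Error" { print $1 }' |
    while read -r pod; do dc_for_pod "${pod}"; done | sort -u
}

# Usage:
#   for dc in $(oc get po | errored_mongodb_dcs); do
#     oc deploy "${dc}" --latest
#   done
```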

13.4.3.3. Result

The replica set resumes replicating properly, and the dependent MBaaS components start working again.

13.4.4. An MBaaS component fails to start because no suitable nodes are found

13.4.4.1. Summary

If some of the MBaaS components are not starting up after the installation, it may be the case that the OpenShift scheduler failed to find suitable nodes on which to schedule the pods of those MBaaS components. This means that the OpenShift cluster doesn’t contain all the nodes required by the MBaaS OpenShift template, or that those nodes don’t satisfy the requirements on system resources, node labels, and other parameters.

Read more about the OpenShift Scheduler in the OpenShift documentation.

To verify that this is the problem, run the following command to list the event log:

oc get ev

If the output contains one of the following messages, you are most likely facing this problem – the nodes in your OpenShift cluster don’t fulfill some of the requirements.

  • Failed for reason MatchNodeSelector and possibly others
  • Failed for reason PodExceedsFreeCPU and possibly others
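To pick these messages out of a long event log, you can filter it for the two reasons listed above. This is a minimal sketch with a hypothetical function name.

```shell
# Read `oc get ev` output on stdin and print only the events that carry
# one of the scheduling failure reasons listed above.
scheduling_failures() {
  grep -E 'MatchNodeSelector|PodExceedsFreeCPU'
}

# Usage:
#   oc get ev | scheduling_failures || echo "No scheduling failures found"
```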

13.4.4.2. Fix

To fix this problem, configure nodes in your OpenShift cluster to match the requirements of the MBaaS OpenShift template.

  • Apply correct labels to nodes.

    Refer to Apply Node Labels in the guide Provisioning an MBaaS in Red Hat OpenShift Enterprise 3 for details on what labels must be applied to nodes.

  • Make sure the OpenShift cluster has sufficient resources for the MBaaS components, Cloud Apps, and Cloud Services it runs.

    Configure the machines used as OpenShift nodes to have more CPU power and memory available, or add more nodes to the cluster. Refer to the guide on Overcommitting and Compute Resources in the OpenShift documentation for more information on how containers use system resources.

  • Clean up the OpenShift instance.

    Delete unused projects from the OpenShift instance.

Alternatively, you can correct the problem from the other side: change the deployment configurations in the MBaaS OpenShift template to match the setup of your OpenShift cluster.

Warning

Changing the deployment configurations may negatively impact the performance and reliability of the MBaaS. Therefore, this is not a recommended approach.

To list all deployment configurations, run the following command:

oc get dc
NAME           TRIGGERS       LATEST
fh-mbaas       ConfigChange   1
fh-messaging   ConfigChange   1
fh-metrics     ConfigChange   1
fh-statsd      ConfigChange   1
mongodb-1      ConfigChange   1
mongodb-2      ConfigChange   1
mongodb-3      ConfigChange   1

To edit a deployment configuration, use the oc edit dc <deployment> command. For example, to edit the configuration of the fh-mbaas deployment, run the following command:

oc edit dc fh-mbaas

You can modify system resource requirements in the resources sections.

...
resources:
  limits:
    cpu: 800m
    memory: 800Mi
  requests:
    cpu: 200m
    memory: 200Mi
...

Changing a deployment configuration triggers a deployment operation.

13.4.4.3. Result

If you changed the setup of nodes in the OpenShift cluster to match the requirements of the MBaaS OpenShift template, the MBaaS is now fully operational without any limitation to quality of service.

If you changed the deployment configuration of any MBaaS component, the cluster should now be fully operational, with a potential limitation to quality of service.

You can now deploy your app to the RHMAP MBaaS.