11.6. Troubleshooting the Overcloud after Creation

After creating your Overcloud, you might want to perform certain Overcloud operations in the future. For example, you might aim to scale your available nodes, or replace faulty nodes. Certain issues might arise when performing these operations. This section provides some advice to diagnose and troubleshoot failed post-creation operations.

11.6.1. Overcloud Stack Modifications

Problems can occur when modifying the overcloud stack through the director. Examples of stack modifications include:
  • Scaling Nodes
  • Removing Nodes
  • Replacing Nodes
Modifying the stack is similar to creating it: the director checks the availability of the requested number of nodes, provisions additional nodes or removes existing ones, and then applies the Puppet configuration. Follow these guidelines when a stack modification fails.
As an initial step, follow the advice set in Section 11.3, “Troubleshooting Overcloud Creation”. These same steps can help diagnose problems with updating the Overcloud heat stack. In particular, use the following commands to help identify problematic resources:
heat stack-list --show-nested
List all stacks. The --show-nested option displays all child stacks and their respective parent stacks. This command helps identify the point where a stack failed.
heat resource-list overcloud
List all resources in the overcloud stack and their current states. This helps identify which resource is causing failures in the stack. You can trace this resource failure to its respective parameters and configuration in the heat template collection and the Puppet modules.
heat event-list overcloud
List all events related to the overcloud stack in chronological order. This includes the initiation, completion, and failure of all resources in the stack. This helps identify points of resource failure.
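The listing commands above can produce long tables, so it can help to filter for failed entries. The following is a minimal sketch, not part of the director tooling: `failed_rows` is a hypothetical helper name, and it assumes the default tabular output of the heat client, in which failed stacks and resources carry a status ending in `_FAILED` (for example `UPDATE_FAILED`).

```shell
#!/bin/sh
# Hypothetical helper: keep only the rows whose status ends in _FAILED,
# for example CREATE_FAILED or UPDATE_FAILED. Reads table text from stdin.
# Assumes the default tabular output of the heat client.
failed_rows() {
    # "|| true" keeps the exit status at zero when nothing has failed.
    grep '_FAILED' || true
}

# Run only where the heat client is installed and the undercloud
# credentials are already loaded in the current shell.
if command -v heat >/dev/null 2>&1; then
    heat stack-list --show-nested | failed_rows
    heat resource-list overcloud | failed_rows
fi
```

If both commands print nothing, no stack or resource is currently in a failed state, and the output of heat event-list overcloud is the next place to look.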
The next few sections provide advice to diagnose issues on specific node types.

11.6.2. Controller Service Failures

The Overcloud Controller nodes contain the bulk of Red Hat OpenStack Platform services, and you might use multiple Controller nodes in a high availability cluster. If a service on a node is faulty, the high availability cluster provides a certain level of failover. However, you must still diagnose the faulty service to ensure that your Overcloud operates at full capacity.
The Controller nodes use Pacemaker to manage the resources and services in the high availability cluster. The Pacemaker Configuration System (pcs) command is a tool that manages a Pacemaker cluster. Run this command on a Controller node in the cluster to perform configuration and monitoring functions. Here are a few commands to help troubleshoot Overcloud services on a high availability cluster:
pcs status
Provides a status overview of the entire cluster including enabled resources, failed resources, and online nodes.
pcs resource show
Shows a list of resources, and their respective nodes.
pcs resource disable [resource]
Stop a particular resource.
pcs resource enable [resource]
Start a particular resource.
pcs cluster standby [node]
Place a node in standby mode. The node is no longer available in the cluster. This is useful for performing maintenance on a specific node without affecting the cluster.
pcs cluster unstandby [node]
Remove a node from standby mode. The node becomes available in the cluster again.
Use these Pacemaker commands to identify the faulty component, the faulty node, or both. After identifying the component, view its log file under /var/log/.
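As one way to narrow down the faulty component, the sketch below extracts only the failed-actions section from the pcs status output. The helper name `failed_actions` is hypothetical, and the sketch assumes that the section starts with a line such as "Failed Actions:" and ends at the next blank line; the exact layout can differ between Pacemaker versions.

```shell
#!/bin/sh
# Hypothetical helper: print the failed-actions section of "pcs status"
# output read from stdin.
# Assumption: the section starts with a line such as "Failed Actions:"
# and ends at the next blank line.
failed_actions() {
    awk '/^Failed .*Actions:/ { show = 1 }
         show && /^$/         { show = 0 }
         show'
}

# Run only on a Controller node where the pcs command is available.
if command -v pcs >/dev/null 2>&1; then
    pcs status | failed_actions
fi
```

Each reported action names a resource and a node, which indicates which component log under /var/log/ to inspect first.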

11.6.3. Compute Service Failures

Compute nodes use the Compute service to perform hypervisor-based operations. This means the main diagnosis for Compute nodes revolves around this service. For example:
  • View the status of the service using the following systemctl command:
    $ sudo systemctl status openstack-nova-compute.service
    Likewise, view the systemd journal for the service using the following command:
    $ sudo journalctl -u openstack-nova-compute.service
  • The primary log file for Compute nodes is /var/log/nova/nova-compute.log. If issues occur with Compute node communication, this log file is usually a good place to start a diagnosis.
  • If performing maintenance on the Compute node, migrate the existing instances from the host to an operational Compute node, then disable the node. See Section 8.9, “Migrating VMs from an Overcloud Compute Node” for more information on node migrations.
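When starting a diagnosis from the Compute log file, filtering for the most recent error entries can save time. The following is a minimal sketch: `recent_errors` is a hypothetical helper, and it assumes that each log entry carries its level (ERROR or CRITICAL) as a separate space-delimited field, which may not hold if the logging format has been customized.

```shell
#!/bin/sh
# Hypothetical helper: print the most recent ERROR or CRITICAL lines
# from a log file.
#   $1: path to the log file
#   $2: maximum number of lines to print (default: 20)
# Assumption: the log level appears as its own space-delimited field.
recent_errors() {
    grep -E ' (ERROR|CRITICAL) ' "$1" | tail -n "${2:-20}"
}

# Run only where the Compute log exists and is readable by this user.
log=/var/log/nova/nova-compute.log
if [ -r "$log" ]; then
    recent_errors "$log"
fi
```

For example, recent_errors /var/log/nova/nova-compute.log 5 prints at most the last five matching entries. If the log file is readable only by root, run the same commands from a root shell.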

11.6.4. Ceph Storage Service Failures

For any issues that occur with Red Hat Ceph Storage clusters, see Part X. Logging and Debugging in the Red Hat Ceph Storage Configuration Guide. That part provides information on diagnosing logs for all Ceph storage services.