11.3. Troubleshooting Overcloud Creation

There are three layers where the deployment can fail:
  • Orchestration (heat and nova services)
  • Bare Metal Provisioning (ironic service)
  • Post-Deployment Configuration (Puppet)
If an Overcloud deployment has failed at any of these levels, use the OpenStack clients and service log files to diagnose the failed deployment.

11.3.1. Orchestration

In most cases, Heat shows the failed Overcloud stack after the Overcloud creation fails:
$ heat stack-list

+-----------------------+------------+--------------------+----------------------+
| id                    | stack_name | stack_status       | creation_time        |
+-----------------------+------------+--------------------+----------------------+
| 7e88af95-535c-4a55... | overcloud  | CREATE_FAILED      | 2015-04-06T17:57:16Z |
+-----------------------+------------+--------------------+----------------------+
If the stack list is empty, the initial Heat setup has a problem. Check your Heat templates and configuration options, and check for any error messages that appeared after you ran openstack overcloud deploy.
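
When the list is long, it can help to filter the table for failed stacks. The following is a minimal sketch, not part of the director tooling: failed_stacks is a hypothetical helper name, and it assumes the table layout shown above, with stack_name and stack_status columns.

```shell
# Hypothetical helper: print "name: status" for every stack whose status
# ends in _FAILED. Reads a `heat stack-list`-style ASCII table on stdin.
failed_stacks() {
    awk -F'|' '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        !/^[|]/ { next }                 # skip +----+ border lines
        !have_header {                   # first table row holds the column names
            for (i = 2; i < NF; i++) col[trim($i)] = i
            have_header = 1
            next
        }
        trim($(col["stack_status"])) ~ /_FAILED$/ {
            print trim($(col["stack_name"])) ": " trim($(col["stack_status"]))
        }
    '
}

# Usage (assumes the undercloud credentials are already loaded):
#   heat stack-list | failed_stacks
```

The helper prints nothing when no stack has failed, which is also what you see when the stack list is empty.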

11.3.2. Bare Metal Provisioning

Check ironic to see all registered nodes and their current status:
$ ironic node-list

+----------+------+---------------+-------------+-----------------+-------------+
| UUID     | Name | Instance UUID | Power State | Provision State | Maintenance |
+----------+------+---------------+-------------+-----------------+-------------+
| f1e261...| None | None          | power off   | available       | False       |
| f0b8c1...| None | None          | power off   | available       | False       |
+----------+------+---------------+-------------+-----------------+-------------+
Here are some common issues that arise from the provisioning process.
  • Review the Provision State and Maintenance columns in the resulting table. Check for the following:
    • An empty table, or fewer nodes than you expect
    • Maintenance is set to True
    • Provision State is set to manageable
    This usually indicates an issue with the registration or discovery processes. For example, if Maintenance sets itself to True automatically, the nodes are usually using the wrong power management credentials.
  • If Provision State is available, the problem occurred before bare metal deployment started.
  • If Provision State is active and Power State is power on, the bare metal deployment has finished successfully. This means that the problem occurred during the post-deployment configuration step.
  • If Provision State is wait call-back for a node, the bare metal provisioning process has not yet finished for this node. Wait until this status changes. If it does not change, connect to the virtual console of the node and check the output.
  • If Provision State is error or deploy failed, then bare metal provisioning has failed for this node. Check the bare metal node's details:
    $ ironic node-show [NODE UUID]
    
    Look for the last_error field, which contains the error description. If the error message is vague, use the logs to clarify it:
    $ sudo journalctl -u openstack-ironic-conductor -u openstack-ironic-api
    
  • If you see a wait timeout error and the node's Power State is power on, connect to the virtual console of the failed node and check the output.
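
The checks in the list above can be sketched as a small filter over the node table. This is an illustration under stated assumptions, not a director command: triage_nodes is a hypothetical helper name, the column names are taken from the table shown above, and the hints only paraphrase the list.

```shell
# Hypothetical helper: map each node's state to the hint from the list above.
# Reads an `ironic node-list`-style ASCII table on stdin.
triage_nodes() {
    awk -F'|' '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        !/^[|]/ { next }                 # skip +----+ border lines
        !have_header {                   # first table row holds the column names
            for (i = 2; i < NF; i++) col[trim($i)] = i
            have_header = 1
            next
        }
        {
            uuid  = trim($(col["UUID"]))
            power = trim($(col["Power State"]))
            state = trim($(col["Provision State"]))
            maint = trim($(col["Maintenance"]))
            if (maint == "True")
                hint = "check power management credentials"
            else if (state == "manageable")
                hint = "check registration or discovery"
            else if (state == "available")
                hint = "failure occurred before bare metal deployment started"
            else if (state == "active" && power == "power on")
                hint = "provisioned; check post-deployment configuration"
            else if (state == "wait call-back")
                hint = "still provisioning; wait or check the virtual console"
            else if (state == "error" || state == "deploy failed")
                hint = "run ironic node-show and read last_error"
            else
                hint = "state not covered by this sketch"
            print uuid ": " hint
        }
    '
}

# Usage (assumes the undercloud credentials are already loaded):
#   ironic node-list | triage_nodes
```

The sketch cannot detect an empty table or missing nodes; compare the number of output lines with the number of nodes you registered.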

11.3.3. Post-Deployment Configuration

Many things can go wrong during the configuration stage. For example, a particular Puppet module could fail to complete due to an issue with the setup. This section provides a process to diagnose such issues.

Procedure 11.4. Diagnosing Post-Deployment Configuration Issues

  1. List all the resources from the Overcloud stack to see which one failed:
    $ heat resource-list overcloud
    
    This shows a table of all resources and their states. Look for any resources with a CREATE_FAILED status.
  2. Show the failed resource:
    $ heat resource-show overcloud [FAILED RESOURCE]
    
    Check for any information in the resource_status_reason field that can help your diagnosis.
  3. Use the nova command to see the IP addresses of the Overcloud nodes:
    $ nova list
    
    Log in as the heat-admin user to one of the deployed nodes. For example, if the stack's resource list shows the error occurred on a Controller node, log in to a Controller node. The heat-admin user has sudo access.
    $ ssh heat-admin@192.0.2.14
    
  4. Check the os-collect-config log for a possible reason for the failure.
    $ sudo journalctl -u os-collect-config
    
  5. In some cases, nova fails to deploy the node entirely. A failed OS::Heat::ResourceGroup for one of the Overcloud role types indicates this situation. Use nova to see the failure in this case:
    $ nova list
    $ nova show [SERVER ID]
    
    The most common error references the message No valid host was found. See Section 11.5, “Troubleshooting "No Valid Host Found" Errors” for details on troubleshooting this error. In other cases, look at the following log files for further troubleshooting:
    • /var/log/nova/*
    • /var/log/heat/*
    • /var/log/ironic/*
  6. Use the SOS toolset, which gathers information about system hardware and configuration. Use this information for diagnostic purposes and debugging. SOS is commonly used to help support technicians and developers. SOS is useful on both the Undercloud and Overcloud. Install the sos package:
    $ sudo yum install sos
    
    Generate a report:
    $ sudo sosreport --all-logs
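
Steps 1 and 2 of the procedure above can be combined in a short loop. This is a sketch only: show_failed_resources is a hypothetical helper name, and it assumes that the heat resource-list table has resource_name and resource_status columns.

```shell
# Hypothetical helper: run `heat resource-show` for every resource in a
# stack whose status is CREATE_FAILED. The stack name defaults to overcloud.
show_failed_resources() {
    stack=${1:-overcloud}
    heat resource-list "$stack" |
    awk -F'|' '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        !/^[|]/ { next }                 # skip +----+ border lines
        !have_header {                   # first table row holds the column names
            for (i = 2; i < NF; i++) col[trim($i)] = i
            have_header = 1
            next
        }
        trim($(col["resource_status"])) == "CREATE_FAILED" {
            print trim($(col["resource_name"]))
        }
    ' |
    while IFS= read -r name; do
        echo "== $name =="
        # Redirect stdin so the command cannot consume the list of names.
        heat resource-show "$stack" "$name" </dev/null
    done
}

# Usage (assumes the undercloud credentials are already loaded):
#   show_failed_resources overcloud
```

In the output for each resource, read the resource_status_reason field as described in step 2.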
    
The post-deployment process for Controller nodes consists of six main steps:

Table 11.1. Controller Node Configuration Steps

Step | Description
ControllerLoadBalancerDeployment_Step1 | Initial load balancing software configuration, including Pacemaker, RabbitMQ, Memcached, Redis, and Galera.
ControllerServicesBaseDeployment_Step2 | Initial cluster configuration, including Pacemaker configuration, HAProxy, MongoDB, Galera, Ceph Monitor, and database initialization for OpenStack Platform services.
ControllerRingbuilderDeployment_Step3 | Initial ring build for OpenStack Object Storage (swift).
ControllerOvercloudServicesDeployment_Step4 | Configuration of all OpenStack Platform services (nova, neutron, cinder, sahara, ceilometer, heat, horizon, aodh, gnocchi).
ControllerOvercloudServicesDeployment_Step5 | Configuration of service start-up settings in Pacemaker, including constraints that determine service start-up order and service start-up parameters.
ControllerOvercloudServicesDeployment_Step6 | Final pass of the Overcloud configuration.