12.3. Troubleshooting Overcloud Creation

There are three layers where the deployment can fail:
  • Orchestration (Heat and Nova services)
  • Bare Metal Provisioning (Ironic service)
  • Post-Deployment Configuration (Puppet)
If an Overcloud deployment has failed at any of these levels, use the OpenStack clients and service log files to diagnose the failed deployment.

12.3.1. Orchestration

In most cases, Heat shows the failed overcloud stack after Overcloud creation fails:
$ heat stack-list

+-----------------------+------------+--------------------+----------------------+
| id                    | stack_name | stack_status       | creation_time        |
+-----------------------+------------+--------------------+----------------------+
| 7e88af95-535c-4a55... | overcloud  | CREATE_FAILED      | 2015-04-06T17:57:16Z |
+-----------------------+------------+--------------------+----------------------+
If the stack list is empty, this indicates an issue with the initial orchestration setup. Check your Heat templates and configuration options, and check for any error messages after running openstack overcloud deploy.
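
As a minimal sketch, the check above can be scripted by filtering the table that heat stack-list prints. The function name failed_stacks is hypothetical and not part of any OpenStack client, and the parsing assumes the column order shown in the sample output above (stack_name in the second column, stack_status in the third).

```shell
# Hypothetical helper: print the name and status of every stack whose status
# ends in _FAILED. Reads the table printed by `heat stack-list` on stdin.
# Assumes the column order shown above: id, stack_name, stack_status, creation_time.
failed_stacks() {
    awk -F'|' '
        NF >= 4 {                                  # skip border lines, which contain no "|"
            name = $3; status = $4
            gsub(/^[ \t]+|[ \t]+$/, "", name)      # trim cell padding
            gsub(/^[ \t]+|[ \t]+$/, "", status)
            if (status ~ /_FAILED$/) print name, status
        }
    '
}
```

After defining the function in your shell, pipe the client output into it, for example heat stack-list | failed_stacks. No output means that no stack in the list is in a failed state, or that the list is empty.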

12.3.2. Bare Metal Provisioning

Check ironic to see all registered nodes and their current status:
$ ironic node-list

+----------+------+---------------+-------------+-----------------+-------------+
| UUID     | Name | Instance UUID | Power State | Provision State | Maintenance |
+----------+------+---------------+-------------+-----------------+-------------+
| f1e261...| None | None          | power off   | available       | False       |
| f0b8c1...| None | None          | power off   | available       | False       |
+----------+------+---------------+-------------+-----------------+-------------+
Here are some common issues that arise from the provisioning process.
  • Check the Provision State and Maintenance columns in the resulting table. Check for the following:
    • An empty table or fewer nodes than you expect
    • Maintenance is set to True
    • Provision State is set to manageable
    Any of these usually indicates an issue with the registration or discovery process. For example, if Maintenance is automatically set to True, the nodes are usually using the wrong power management credentials.
  • If Provision State is available, the problem occurred before bare metal deployment started.
  • If Provision State is active and Power State is power on, the bare metal deployment has finished successfully. This means that the problem occurred during the post-deployment configuration step.
  • If Provision State is wait call-back for a node, the bare metal provisioning process has not yet finished for this node. Wait until this status changes. If it does not change, connect to the virtual console of the failed node and check the output.
  • If Provision State is error or deploy failed, then bare metal provisioning has failed for this node. Check the bare metal node's details:
    $ ironic node-show [NODE UUID]
    
    Look for the last_error field, which contains the error description. If the error message is vague, use the logs to clarify it:
    $ sudo journalctl -u openstack-ironic-conductor -u openstack-ironic-api
    
  • If you see a wait timeout error and the node's Power State is power on, connect to the virtual console of the failed node and check the output.
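
The decision steps in the list above can be summarized as a small lookup. This is only an illustrative sketch: the function name provision_hint is hypothetical, the state names are taken from the list above, and any combination that the list does not describe is reported as not covered rather than guessed.

```shell
# Hypothetical helper: map a node's Provision State and Power State (as shown
# by `ironic node-list`) to the troubleshooting hint given in the list above.
# Usage: provision_hint "<provision state>" "<power state>"
provision_hint() {
    state=$1
    power=$2
    case "$state" in
        manageable)
            echo "registration or discovery issue" ;;
        available)
            echo "failure occurred before bare metal deployment started" ;;
        active)
            if [ "$power" = "power on" ]; then
                echo "bare metal deployment succeeded; check post-deployment configuration"
            else
                # This combination is not described in the list above.
                echo "not covered by this section"
            fi ;;
        "wait call-back")
            echo "provisioning still in progress; wait, then check the virtual console" ;;
        error|"deploy failed")
            echo "provisioning failed; check last_error with ironic node-show" ;;
        *)
            echo "not covered by this section" ;;
    esac
}
```

For example, provision_hint "deploy failed" "power off" prints the hint to inspect last_error with ironic node-show.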

12.3.3. Post-Deployment Configuration

Many things can go wrong during the configuration stage. For example, a particular Puppet module could fail to complete due to an issue with the setup. This section provides a process to diagnose such issues.

Procedure 12.4. Diagnosing Post-Deployment Configuration Issues

  1. List all the resources from the Overcloud stack to see which one failed:
    $ heat resource-list overcloud
    
    This shows a table of all resources and their states. Look for any resources with a CREATE_FAILED status.
  2. Show the failed resource:
    $ heat resource-show overcloud [FAILED RESOURCE]
    
    Check for any information in the resource_status_reason field that can help your diagnosis.
  3. Use the nova command to see the IP addresses of the Overcloud nodes:
    $ nova list
    
    Log in as the heat-admin user to one of the deployed nodes. For example, if the stack's resource list shows that the error occurred on a Controller node, log in to a Controller node. The heat-admin user has sudo access.
    $ ssh heat-admin@192.0.2.14
    
  4. Check the os-collect-config log for a possible reason for the failure:
    $ sudo journalctl -u os-collect-config
    
  5. In some cases, Nova fails to deploy the node at all. This situation is indicated by a failed OS::Heat::ResourceGroup for one of the Overcloud role types. Use nova to see the failure in this case:
    $ nova list
    $ nova show [SERVER ID]
    
    The most common error references the message No valid host was found. See Section 12.5, “Troubleshooting "No Valid Host Found" Errors” for details on troubleshooting this error. In other cases, look at the following log files for further troubleshooting:
    • /var/log/nova/*
    • /var/log/heat/*
    • /var/log/ironic/*
  6. Use the SOS toolset, which gathers information about system hardware and configuration. Use this information for diagnostic purposes and debugging. SOS is commonly used to help support technicians and developers. SOS is useful on both the Undercloud and Overcloud. Install the sos package:
    $ sudo yum install sos
    
    Generate a report:
    $ sudo sosreport --all-logs
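
Step 1 of the procedure above can be sketched as a filter over the table that heat resource-list prints. The function name failed_resources is hypothetical. The sketch assumes that the header row contains columns named resource_name and resource_status, which this section does not show, so verify the header on your system before relying on it. Columns are located by header name rather than by position.

```shell
# Hypothetical helper: print the name of every resource whose status is
# CREATE_FAILED. Reads the table printed by `heat resource-list overcloud`
# on stdin. Assumes (not confirmed by this section) that the header row has
# resource_name and resource_status columns.
failed_resources() {
    awk -F'|' '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        NF < 3 { next }                          # skip border lines
        !name_col {                              # look for the header row
            for (i = 1; i <= NF; i++) {
                cell = trim($i)
                if (cell == "resource_name")   name_col = i
                if (cell == "resource_status") status_col = i
            }
            next
        }
        status_col && trim($status_col) == "CREATE_FAILED" { print trim($name_col) }
    '
}
```

For example, heat resource-list overcloud | failed_resources prints one failed resource name per line, which you can then pass to heat resource-show overcloud as in step 2.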