Chapter 17. Troubleshooting Director Issues

Errors can occur at certain stages of the director’s processes. This section contains some information about diagnosing common problems.

Note the common logs for the director’s components:

  • The /var/log directory contains logs for many common OpenStack Platform components as well as logs for standard Red Hat Enterprise Linux applications.
  • ironic-inspector also stores the ramdisk logs in /var/log/ironic-inspector/ramdisk/ as gz-compressed tar files. Filenames contain date, time, and the IPMI address of the node. Use these logs to diagnose introspection issues.
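As a minimal sketch of working with these archives, the helpers below locate the ramdisk logs for one node and unpack one of them into a scratch directory. The function names are hypothetical, and the lookup assumes only what is stated above: the IPMI address appears somewhere in the file name.

```shell
# List ramdisk log archives whose file name contains a given IPMI address.
# Arguments: <ipmi_address> [log_dir]
# The match is a plain substring match, so 192.168.0.1 also matches
# 192.168.0.10; review the results before relying on them.
find_ramdisk_logs() {
    local ipmi="$1"
    local dir="${2:-/var/log/ironic-inspector/ramdisk}"
    find "$dir" -maxdepth 1 -type f -name "*${ipmi}*" | sort
}

# Unpack one gz-compressed tar archive into a new temporary directory
# and print the directory name so that the caller can browse the files.
extract_ramdisk_log() {
    local archive="$1"
    local dest
    dest="$(mktemp -d)" || return 1
    tar -xzf "$archive" -C "$dest" || return 1
    printf '%s\n' "$dest"
}
```

For example, extract_ramdisk_log "$(find_ramdisk_logs 192.168.0.10 | tail -n 1)" unpacks the last archive in sorted order for that address, which you might need to run with sudo depending on the log directory permissions.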

17.1. Troubleshooting Node Registration

Node registration problems usually result from incorrect node details. In this case, use the Bare Metal service (ironic) commands to correct the registered node data. Here are a few examples:

Identify the assigned port UUID:

$ source ~/stackrc
(undercloud) $ openstack baremetal port list --node [NODE UUID]

Update the MAC address:

(undercloud) $ openstack baremetal port set --address=[NEW MAC] [PORT UUID]

Configure a new IPMI address on the node:

(undercloud) $ openstack baremetal node set --driver-info ipmi_address=[NEW IPMI ADDRESS] [NODE UUID]
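Because a mistyped value only replaces one incorrect detail with another, it can help to sanity-check a MAC address before you update the port. The following is a minimal sketch; the is_valid_mac name is hypothetical and the check covers only the common colon-separated notation.

```shell
# Return 0 if the argument looks like a colon-separated 48-bit MAC address,
# for example 52:54:00:aa:bb:cc. Other notations are rejected.
is_valid_mac() {
    printf '%s\n' "$1" | grep -Eqi '^([0-9a-f]{2}:){5}[0-9a-f]{2}$'
}
```

One possible use is is_valid_mac "$NEW_MAC" && openstack baremetal port set --address="$NEW_MAC" "$PORT_UUID", where NEW_MAC and PORT_UUID are shell variables that you set yourself.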

17.2. Troubleshooting Hardware Introspection

The introspection process must run to completion. However, the ironic discovery daemon (ironic-inspector) times out after a default one hour period if the discovery ramdisk does not respond. Sometimes this might indicate a bug in the discovery ramdisk but usually this time-out occurs due to an environment misconfiguration, particularly BIOS boot settings.

This section contains information about common scenarios where environment misconfiguration occurs and advice about how to diagnose and resolve them.

Errors with Starting Node Introspection

Normally the introspection process uses the openstack overcloud node introspect command. However, if you run the introspection directly with ironic-inspector, it might fail to discover nodes in the AVAILABLE state, which is meant for deployment and not for discovery. In this situation, change the node status to the MANAGEABLE state before discovery:

$ source ~/stackrc
(undercloud) $ openstack baremetal node manage [NODE UUID]

When discovery completes, revert the node state to AVAILABLE before provisioning:

(undercloud) $ openstack baremetal node provide [NODE UUID]

Stopping the Discovery Process

To stop the introspection process, run the following command:

$ source ~/stackrc
(undercloud) $ openstack baremetal introspection abort [NODE UUID]

You can also wait until the process times out. If necessary, change the timeout setting in /etc/ironic-inspector/inspector.conf to another duration in seconds.
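Before changing the setting, you might want to see what value is currently configured. The sketch below assumes that the option is named timeout and lives in the [DEFAULT] section of the file; verify both against your own inspector.conf. The function name is hypothetical.

```shell
# Print the value of the "timeout" option from the [DEFAULT] section of an
# INI-style file. Commented-out lines and other sections are ignored.
# Argument: [config_file]
get_inspector_timeout() {
    local conf="${1:-/etc/ironic-inspector/inspector.conf}"
    awk -F= '
        # Track whether the current section header is [DEFAULT].
        /^[ \t]*\[/ { in_default = ($0 ~ /^[ \t]*\[DEFAULT\]/); next }
        in_default && /^[ \t]*timeout[ \t]*=/ {
            gsub(/[ \t]/, "", $2)
            print $2
            exit
        }
    ' "$conf"
}
```

If the function prints nothing, the option is not set explicitly and the service uses its built-in default.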

Accessing the Introspection Ramdisk

The introspection ramdisk uses a dynamic login element. This means you can provide either a temporary password or an SSH key to access the node during introspection debugging. Complete the following procedure to configure ramdisk access:

  1. Run the openssl passwd -1 command with a temporary password to generate an MD5 hash:

    $ openssl passwd -1 mytestpassword
    $1$enjRSyIw$/fYUpJwr6abFy/d.koRgQ/
  2. Edit the /httpboot/inspector.ipxe file, find the line starting with kernel, and append the rootpwd parameter and the MD5 hash. For example:

    kernel http://192.2.0.1:8088/agent.kernel ipa-inspection-callback-url=http://192.168.0.1:5050/v1/continue ipa-inspection-collectors=default,extra-hardware,logs systemd.journald.forward_to_console=yes BOOTIF=${mac} ipa-debug=1 ipa-inspection-benchmarks=cpu,mem,disk rootpwd="$1$enjRSyIw$/fYUpJwr6abFy/d.koRgQ/" selinux=0

    Alternatively, append your public SSH key to the sshkey parameter.

    Note

    Include quotation marks for both the rootpwd and sshkey parameters.

  3. Start the introspection and identify the IP address from either the arp command or the DHCP logs:

    $ arp
    $ sudo journalctl -u openstack-ironic-inspector-dnsmasq
  4. Connect to the node over SSH as the root user with the temporary password or the SSH key:

    $ ssh root@192.168.24.105
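Step 2 of this procedure can be previewed without touching the file. The following sketch prints the contents of an iPXE file with the quoted rootpwd parameter appended to the kernel line, so that you can review the result before you edit /httpboot/inspector.ipxe by hand. The append_rootpwd name and the ROOTPWD_HASH variable are hypothetical.

```shell
# Print an iPXE file with rootpwd="<hash>" appended to the line that starts
# with "kernel". The file itself is not modified.
# Arguments: <ipxe_file> <password_hash>
append_rootpwd() {
    local ipxe_file="$1"
    # Pass the hash through the environment so that awk treats the "$"
    # characters in it as literal text.
    ROOTPWD_HASH="$2" awk '
        /^[ \t]*kernel[ \t]/ {
            print $0 " rootpwd=\"" ENVIRON["ROOTPWD_HASH"] "\""
            next
        }
        { print }
    ' "$ipxe_file"
}
```

Quote the hash with single quotes on the command line, for example append_rootpwd /httpboot/inspector.ipxe '$1$enjRSyIw$/fYUpJwr6abFy/d.koRgQ/', so that the shell does not expand the "$" sequences.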

Checking Introspection Storage

The director uses OpenStack Object Storage (swift) to save the hardware data obtained during the introspection process. If this service is not running, the introspection can fail. Check all services related to OpenStack Object Storage to ensure the service is running:

$ sudo docker ps --filter name=".*swift.*"

17.3. Troubleshooting Workflows and Executions

The OpenStack Workflow (mistral) service groups multiple OpenStack tasks into workflows. Red Hat OpenStack Platform uses a set of these workflows to perform common functions across the director, including bare metal node control, validations, plan management, and overcloud deployment.

For example, when running the openstack overcloud deploy command, the OpenStack Workflow service executes two workflows. The first workflow uploads the deployment plan:

Removing the current plan files
Uploading new plan files
Started Mistral Workflow. Execution ID: aef1e8c6-a862-42de-8bce-073744ed5e6b
Plan updated

The second workflow starts the overcloud deployment:

Deploying templates in the directory /tmp/tripleoclient-LhRlHX/tripleo-heat-templates
Started Mistral Workflow. Execution ID: 97b64abe-d8fc-414a-837a-1380631c764d
2016-11-28 06:29:26Z [overcloud]: CREATE_IN_PROGRESS  Stack CREATE started
2016-11-28 06:29:26Z [overcloud.Networks]: CREATE_IN_PROGRESS  state changed
2016-11-28 06:29:26Z [overcloud.HeatAuthEncryptionKey]: CREATE_IN_PROGRESS  state changed
2016-11-28 06:29:26Z [overcloud.ServiceNetMap]: CREATE_IN_PROGRESS  state changed
...

Workflow Objects

OpenStack Workflow uses the following objects to track the workflow:

Actions
A particular instruction that OpenStack performs once an associated task runs. Examples include running shell scripts or performing HTTP requests. Some OpenStack components have in-built actions that OpenStack Workflow uses.
Tasks
Defines the action to run and the result of running the action. These tasks usually have actions or other workflows associated with them. Once a task completes, the workflow directs to another task, usually depending on whether the task succeeded or failed.
Workflows
A set of tasks grouped together and executed in a specific order.
Executions
Records a particular run of an action, task, or workflow.

Workflow Error Diagnosis

OpenStack Workflow also provides robust logging of executions, which helps identify issues with certain command failures. For example, if a workflow execution fails, you can identify the point of failure. List the workflow executions that have the failed state ERROR:

$ source ~/stackrc
(undercloud) $ openstack workflow execution list | grep "ERROR"
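A plain grep also matches rows where ERROR appears in another column. As a hedged alternative, the sketch below filters on the state column only. It assumes that the client accepts the standard -f value and -c output options and that the columns are named ID and State; check the column names with your client version. The function name is hypothetical.

```shell
# Read "ID STATE" pairs, one per line, on stdin and print the ID of every
# execution whose state is exactly ERROR.
# Possible use (column names are an assumption):
#   openstack workflow execution list -f value -c ID -c State | failed_execution_ids
failed_execution_ids() {
    awk '$NF == "ERROR" { print $1 }'
}
```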

Get the UUID of the failed workflow execution (for example, dffa96b0-f679-4cd2-a490-4769a3825262) and view the execution and its output:

(undercloud) $ openstack workflow execution show dffa96b0-f679-4cd2-a490-4769a3825262
(undercloud) $ openstack workflow execution output show dffa96b0-f679-4cd2-a490-4769a3825262

These commands return information about the failed task in the execution. The openstack workflow execution show command also displays the workflow used for the execution (for example, tripleo.plan_management.v1.publish_ui_logs_to_swift). You can view the full workflow definition using the following command:

(undercloud) $ openstack workflow definition show tripleo.plan_management.v1.publish_ui_logs_to_swift

This is useful for identifying where in the workflow a particular task occurs.

You can also view action executions and their results using a similar command syntax:

(undercloud) $ openstack action execution list
(undercloud) $ openstack action execution show 8a68eba3-0fec-4b2a-adc9-5561b007e886
(undercloud) $ openstack action execution output show 8a68eba3-0fec-4b2a-adc9-5561b007e886

This is useful for identifying a specific action that causes issues.

17.4. Troubleshooting Overcloud Creation

The overcloud deployment can fail at one of three layers:

  • Orchestration (heat and nova services)
  • Bare Metal Provisioning (ironic service)
  • Post-Deployment Configuration (Ansible and Puppet)

If an overcloud deployment has failed at any of these levels, use the OpenStack clients and service log files to diagnose the failed deployment. You can also run the following command to display details of the failure:

$ openstack stack failures list <OVERCLOUD_NAME> --long

Replace <OVERCLOUD_NAME> with the name of your overcloud.

17.4.1. Accessing deployment command history

Understanding historical director deployment commands and arguments can be useful for troubleshooting and support. You can view this information in /home/stack/.tripleo/history.

17.4.2. Orchestration

In most cases, Heat shows the failed overcloud stack after the overcloud creation fails:

$ source ~/stackrc
(undercloud) $ openstack stack list --nested --property status=FAILED
+-----------------------+------------+--------------------+----------------------+
| id                    | stack_name | stack_status       | creation_time        |
+-----------------------+------------+--------------------+----------------------+
| 7e88af95-535c-4a55... | overcloud  | CREATE_FAILED      | 2015-04-06T17:57:16Z |
+-----------------------+------------+--------------------+----------------------+

If the stack list is empty, this indicates an issue with the initial Heat setup. Check your Heat templates and configuration options, and check for any error messages that appeared after running openstack overcloud deploy.

17.4.3. Bare Metal Provisioning

Check the bare metal service to see all registered nodes and their current status:

$ source ~/stackrc
(undercloud) $ openstack baremetal node list

+----------+------+---------------+-------------+-----------------+-------------+
| UUID     | Name | Instance UUID | Power State | Provision State | Maintenance |
+----------+------+---------------+-------------+-----------------+-------------+
| f1e261...| None | None          | power off   | available       | False       |
| f0b8c1...| None | None          | power off   | available       | False       |
+----------+------+---------------+-------------+-----------------+-------------+

Here are some common issues that can occur from the provisioning process:

  • Review the Provision State and Maintenance columns in the resulting table. Check for the following:

    • An empty table, or fewer nodes than you expect
    • Maintenance is set to True
    • Provision State is set to manageable. This usually indicates an issue with the registration or discovery processes. For example, if Maintenance sets itself to True automatically, the nodes are usually using the wrong power management credentials.
  • If Provision State is available, then the problem occurred before bare metal deployment has even started.
  • If Provision State is active and Power State is power on, the bare metal deployment has finished successfully. This means that the problem occurred during the post-deployment configuration step.
  • If Provision State is wait call-back for a node, the bare metal provisioning process has not yet finished for this node. Wait until this status changes. If it does not change, connect to the virtual console of the node and check the output.
  • If Provision State is error or deploy failed, then bare metal provisioning has failed for this node. Check the bare metal node’s details:

    (undercloud) $ openstack baremetal node show [NODE UUID]

    Look for the last_error field, which contains the error description. If the error message is vague, you can use logs to clarify it:

    (undercloud) $ sudo journalctl -u openstack-ironic-conductor -u openstack-ironic-api
  • If you see a wait timeout error and the node Power State is power on, connect to the virtual console of the failed node and check the output.
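The checks in the list above can be collected into a small lookup. This is only a sketch of the decision order described in the text: the triage_node name and the wording of the hints are hypothetical, and the fallback branches for unlisted combinations are additions that do not come from the list.

```shell
# Map the columns of "openstack baremetal node list" to a troubleshooting hint.
# Arguments: <provision_state> <power_state> <maintenance>
triage_node() {
    local provision="$1" power="$2" maintenance="$3"
    if [ "$maintenance" = "True" ]; then
        echo "maintenance: check the power management credentials"
        return
    fi
    case "$provision" in
        manageable)
            echo "registration or introspection issue" ;;
        available)
            echo "problem occurred before bare metal deployment started" ;;
        active)
            if [ "$power" = "power on" ]; then
                echo "provisioning finished: check post-deployment configuration"
            else
                # Not covered by the list above; added as a fallback.
                echo "active but not powered on: inspect the node"
            fi ;;
        "wait call-back")
            echo "provisioning still running: wait or check the virtual console" ;;
        error|"deploy failed")
            echo "provisioning failed: check last_error in node show" ;;
        *)
            # Not covered by the list above; added as a fallback.
            echo "unrecognized state: inspect the node" ;;
    esac
}
```

For example, triage_node "deploy failed" "power on" False points you to the last_error field of the node.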

17.4.4. Checking overcloud configuration failures

If an overcloud deployment operation fails at the Ansible configuration stage, use the openstack overcloud failures command to show failed configuration steps.

Procedure

  1. Source the stackrc file:

    $ source ~/stackrc
  2. Run the deployment failures command:

    $ openstack overcloud failures

17.5. Troubleshooting IP Address Conflicts on the Provisioning Network

Discovery and deployment tasks will fail if the destination hosts are allocated an IP address which is already in use. To prevent these failures, you can perform a port scan of the Provisioning network to determine whether the discovery IP range and host IP range are free.

Perform the following steps from the undercloud host:

Install nmap:

$ sudo yum install nmap

Use nmap to scan the IP address range for active addresses. This example scans the 192.168.24.0/24 range. Replace this with the IP subnet of the Provisioning network (using CIDR bitmask notation):

$ sudo nmap -sn 192.168.24.0/24

Review the output of the nmap scan:

For example, you should see the IP address(es) of the undercloud, and any other hosts that are present on the subnet. If any of the active IP addresses conflict with the IP ranges in undercloud.conf, you will need to either change the IP address ranges or free up the IP addresses before introspecting or deploying the overcloud nodes.

$ sudo nmap -sn 192.168.24.0/24

Starting Nmap 6.40 ( http://nmap.org ) at 2015-10-02 15:14 EDT
Nmap scan report for 192.168.24.1
Host is up (0.00057s latency).
Nmap scan report for 192.168.24.2
Host is up (0.00048s latency).
Nmap scan report for 192.168.24.3
Host is up (0.00045s latency).
Nmap scan report for 192.168.24.5
Host is up (0.00040s latency).
Nmap scan report for 192.168.24.9
Host is up (0.00019s latency).
Nmap done: 256 IP addresses (5 hosts up) scanned in 2.45 seconds
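To compare the scan with a configured range without reading every line, you can filter the output. The sketch below reads nmap -sn output and prints the responding addresses that fall inside an inclusive start and end address. The conflicting_hosts name is hypothetical, and the range in the usage example is illustrative rather than taken from any real undercloud.conf.

```shell
# Read "nmap -sn" output on stdin and print every responding IPv4 address
# that falls inside the inclusive range <first_ip>..<last_ip>.
# Arguments: <first_ip> <last_ip>
conflicting_hosts() {
    awk -v lo="$1" -v hi="$2" '
        # Convert a dotted-quad address to a number for range comparison.
        function ip2int(ip,    p) {
            split(ip, p, ".")
            return ((p[1] * 256 + p[2]) * 256 + p[3]) * 256 + p[4]
        }
        /^Nmap scan report for / {
            ip = $NF
            gsub(/[()]/, "", ip)   # handles the "hostname (address)" form
            n = ip2int(ip)
            if (n >= ip2int(lo) && n <= ip2int(hi)) print ip
        }
    '
}
```

For example, sudo nmap -sn 192.168.24.0/24 | conflicting_hosts 192.168.24.5 192.168.24.24 prints 192.168.24.5 and 192.168.24.9 for the sample output above. Addresses that belong to the undercloud itself are expected to respond, so exclude them when you interpret the result.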

17.6. Troubleshooting "No Valid Host Found" Errors

Sometimes the /var/log/nova/nova-conductor.log contains the following error:

NoValidHost: No valid host was found. There are not enough hosts available.

This error occurs when the Compute Scheduler cannot find a bare metal node suitable for booting the new instance. This usually means there is a mismatch between resources that the Compute service expects to find and resources that the Bare Metal service advertised to Compute. Check the following in this case:

  1. Ensure the introspection succeeds. If the introspection fails, check that each node contains the required ironic node properties:

    $ source ~/stackrc
    (undercloud) $ openstack baremetal node show [NODE UUID]

    Check that the properties JSON field has valid values for the keys cpus, cpu_arch, memory_mb, and local_gb.

  2. Check that the Compute flavor used does not exceed the node properties above for a required number of nodes:

    (undercloud) $ openstack flavor show [FLAVOR NAME]
  3. Run the openstack baremetal node list command to ensure that enough nodes are in the available state. Nodes in the manageable state usually signify a failed introspection.
  4. Run the openstack baremetal node list command to check that the nodes are not in maintenance mode. If a node changes to maintenance mode automatically, the likely cause is incorrect power management credentials. Check the power management credentials and then remove maintenance mode:

    (undercloud) $ openstack baremetal node maintenance unset [NODE UUID]
  5. If you are using the Automated Health Check (AHC) tools to perform automatic node tagging, check that you have enough nodes corresponding to each flavor/profile. Run the openstack baremetal node show command on a node and check the capabilities key in the properties field. For example, a node tagged for the Compute role should contain profile:compute.
  6. It takes some time for node information to propagate from Bare Metal to Compute after introspection. However, if you performed some steps manually, there might be a short period of time when nodes are not available to nova. Use the following command to check the total resources in your system:

    (undercloud) $ openstack hypervisor stats show
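The comparison in step 2 is a simple per-resource check. As a sketch, the hypothetical helper below takes the vcpus, ram, and disk values of a flavor followed by the cpus, memory_mb, and local_gb properties of a node, and succeeds only when the flavor does not exceed the node in any of them. You still read the values from the flavor show and node show commands yourself.

```shell
# Return 0 if a flavor fits on a node.
# Arguments: <flavor_vcpus> <flavor_ram_mb> <flavor_disk_gb>
#            <node_cpus> <node_memory_mb> <node_local_gb>
flavor_fits_node() {
    [ "$1" -le "$4" ] && [ "$2" -le "$5" ] && [ "$3" -le "$6" ]
}
```

For example, flavor_fits_node 4 8192 40 8 16384 100 succeeds, while a flavor that requests 200 GB of disk on the same node fails the check.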

17.7. Troubleshooting the Overcloud after Creation

After creating your overcloud, you might want to perform certain overcloud operations in the future. For example, you might want to scale your available nodes, or replace faulty nodes. Certain issues might arise when performing these operations. This section contains information to consider when diagnosing and troubleshooting failed post-creation operations.

17.7.1. Overcloud Stack Modifications

Problems can occur when you modify the overcloud stack through the director. Examples of stack modifications include the following operations:

  • Scaling Nodes
  • Removing Nodes
  • Replacing Nodes

Modifying the stack is similar to the process of creating the stack, in that the director checks the availability of the requested number of nodes, provisions additional or removes existing nodes, and then applies the Puppet configuration. Use the guidelines in the following sections when you modify the overcloud stack. These sections contain information to consider when diagnosing issues on specific node types.

17.7.2. Controller Service Failures

The overcloud Controller nodes contain the bulk of Red Hat OpenStack Platform services. Likewise, you might use multiple Controller nodes in a high availability cluster. If a certain service on a node is faulty, the high availability cluster provides a certain level of failover. However, to ensure your overcloud operates at full capacity you must diagnose the faulty service.

The Controller nodes use Pacemaker to manage the resources and services in the high availability cluster. The Pacemaker Configuration System (pcs) command is a tool that manages a Pacemaker cluster. Run the pcs command on a Controller node in the cluster to perform configuration and monitoring functions. Use the following commands to troubleshoot overcloud services on a high availability cluster:

pcs status
Provides a status overview of the entire cluster including enabled resources, failed resources, and online nodes.
pcs resource show
Shows a list of resources and the respective nodes for each resource.
pcs resource disable [resource]
Stop a particular resource.
pcs resource enable [resource]
Start a particular resource.
pcs cluster standby [node]
Place a node in standby mode. The node is no longer available in the cluster. This is useful for performing maintenance on a specific node without affecting the cluster.
pcs cluster unstandby [node]
Remove a node from standby mode. The node becomes available in the cluster again.

Use these Pacemaker commands to identify the faulty component and/or node. After identifying the component, view the respective component log file in /var/log/.

17.7.3. Containerized Service Failures

If a containerized service fails during or after overcloud deployment, use the following commands to determine the root cause for the failure:

Checking the container logs

Each container retains standard output from its main process. Use this output as a log to help determine what actually occurs during a container run. For example, to view the log for the keystone container, use the following command:

$ sudo docker logs keystone

In most cases, this log contains information about the cause of a container’s failure.

Inspecting the container

In some situations, you might need to verify information about a container. For example, use the following command to view keystone container data:

$ sudo docker inspect keystone

This command returns a JSON object containing low-level configuration data. You can pipe the output to the jq command to parse specific data. For example, to view the container mounts for the keystone container, run the following command:

$ sudo docker inspect keystone | jq .[0].Mounts

You can also use the --format option to parse data to a single line, which is useful for running commands against sets of container data. For example, to recreate the options used to run the keystone container, use the following inspect command with the --format option:

$ sudo docker inspect --format='{{range .Config.Env}} -e "{{.}}" {{end}} {{range .Mounts}} -v {{.Source}}:{{.Destination}}{{if .Mode}}:{{.Mode}}{{end}}{{end}} -ti {{.Config.Image}}' keystone
Note

The --format option uses Go syntax to create queries.

Use these options in conjunction with the docker run command to recreate the container for troubleshooting purposes:

$ OPTIONS=$( sudo docker inspect --format='{{range .Config.Env}} -e "{{.}}" {{end}} {{range .Mounts}} -v {{.Source}}:{{.Destination}}{{if .Mode}}:{{.Mode}}{{end}}{{end}} -ti {{.Config.Image}}' keystone )
$ sudo docker run --rm $OPTIONS /bin/bash

Running commands in the container

In some cases, you might need to obtain information from within a container through a specific Bash command. In this situation, use the following docker command to execute commands within a running container. For example, run the docker exec command to run a command inside the keystone container:

$ sudo docker exec -ti keystone <COMMAND>
Note

The -ti options run the command through an interactive pseudoterminal.

Replace <COMMAND> with the command you want to run. For example, each container has a health check script to verify the service connection. You can run the health check script for keystone with the following command:

$ sudo docker exec -ti keystone /openstack/healthcheck

To access the container’s shell, run docker exec using /bin/bash as the command you want to run inside the container:

$ sudo docker exec -ti keystone /bin/bash

Exporting a container

When a container fails, you might need to investigate the full contents of its file system. In this case, you can export the full file system of a container as a tar archive. For example, to export the keystone container’s file system, run the following command:

$ sudo docker export keystone -o keystone.tar

This command creates the keystone.tar archive, which you can extract and explore.

17.7.4. Compute Service Failures

Compute nodes use the Compute service to perform hypervisor-based operations. This means the main diagnosis for Compute nodes revolves around this service. For example:

  • View the status of the container:

    $ sudo docker ps -f name=nova_compute
  • The primary log file for Compute nodes is /var/log/containers/nova/nova-compute.log. If issues occur with Compute node communication, this log file is usually a good place to start a diagnosis.
  • If performing maintenance on the Compute node, migrate the existing instances from the host to an operational Compute node, then disable the node. See Section 8.12, “Migrating instances from a Compute node” for more information on node migrations.

17.7.5. Ceph Storage Service Failures

For any issues that occur with Red Hat Ceph Storage clusters, see "Logging Configuration Reference" in the Red Hat Ceph Storage Configuration Guide. This section contains information about diagnosing logs for all Ceph storage services.

17.8. Creating an sosreport

If you need to contact Red Hat for support on OpenStack Platform, you might need to generate an sosreport. See the following knowledgebase article for more information about creating an sosreport:

17.9. Important Logs for Undercloud and Overcloud

Use the following logs to find out information about the undercloud and overcloud when troubleshooting.

Table 17.1. Important Logs for the Undercloud

Information                                  Log Location
OpenStack Compute log                        /var/log/containers/nova/nova-compute.log
OpenStack Compute API interactions           /var/log/nova/nova-api.log
OpenStack Compute Conductor log              /var/log/nova/nova-conductor.log
OpenStack Orchestration log                  heat-engine.log
OpenStack Orchestration API interactions     heat-api.log
OpenStack Orchestration CloudFormations log  /var/log/heat/heat-api-cfn.log
OpenStack Bare Metal Conductor log           ironic-conductor.log
OpenStack Bare Metal API interactions        ironic-api.log
Introspection                                /var/log/ironic-inspector/ironic-inspector.log
OpenStack Workflow Engine log                /var/log/mistral/engine.log
OpenStack Workflow Executor log              /var/log/mistral/executor.log
OpenStack Workflow API interactions          /var/log/mistral/api.log
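When you open one of the undercloud logs listed in Table 17.1, the most recent error entries are usually the quickest starting point. The sketch below assumes that the log level appears as a separate space-delimited word on each line, which is typical for OpenStack service logs but is not guaranteed for every file in the table. The recent_errors name is hypothetical.

```shell
# Print the last N lines of a log file that carry the ERROR or CRITICAL level.
# Arguments: <logfile> [count]   (count defaults to 20)
recent_errors() {
    local logfile="$1" count="${2:-20}"
    grep -E ' (ERROR|CRITICAL) ' "$logfile" | tail -n "$count"
}
```

For example, sudo is usually required to read the service logs, so you might run the grep portion of this function with sudo on a file such as /var/log/mistral/engine.log.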

Table 17.2. Important Logs for the Overcloud

Information                                            Log Location
Cloud-Init Log                                         /var/log/cloud-init.log
Overcloud Configuration (Summary of Last Puppet Run)   /var/lib/puppet/state/last_run_summary.yaml
Overcloud Configuration (Report from Last Puppet Run)  /var/lib/puppet/state/last_run_report.yaml
Overcloud Configuration (All Puppet Reports)           /var/lib/puppet/reports/overcloud-*/*
Overcloud Configuration (stdout from each Puppet Run)  /var/run/heat-config/deployed/*-stdout.log
Overcloud Configuration (stderr from each Puppet Run)  /var/run/heat-config/deployed/*-stderr.log
High availability log                                  /var/log/pacemaker.log