Chapter 23. Troubleshooting director errors

Errors can occur at certain stages of the director’s processes. This section contains some information about diagnosing common problems.

23.1. Troubleshooting node registration

Problems with node registration usually result from incorrect node details. In these situations, validate the template file that contains your node details and correct the imported node details.

Procedure

  1. Source the stackrc file:

    $ source ~/stackrc
  2. Run the node import command with the --validate-only option. This option validates your node template without performing an import:

    (undercloud) $ openstack overcloud node import --validate-only ~/nodes.json
    Waiting for messages on queue 'tripleo' with no timeout.
    
    Successfully validated environment file
  3. To fix incorrect details with imported nodes, run the openstack baremetal commands to update node details. The following example shows how to change networking details:

    1. Identify the assigned port UUID for the imported node:

      $ source ~/stackrc
      (undercloud) $ openstack baremetal port list --node [NODE UUID]
    2. Update the MAC address:

      (undercloud) $ openstack baremetal port set --address=[NEW MAC] [PORT UUID]
    3. Configure a new IPMI address on the node:

      (undercloud) $ openstack baremetal node set --driver-info ipmi_address=[NEW IPMI ADDRESS] [NODE UUID]
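
Before you update a port, it can help to confirm that the replacement MAC address is well formed. The following is a minimal sketch, not part of the openstack client: the is_valid_mac function name and the sample address are assumptions made for illustration.

```shell
# is_valid_mac is a hypothetical helper, not part of the openstack client.
# It succeeds only for six colon-separated pairs of hexadecimal digits.
is_valid_mac() {
  printf '%s\n' "$1" | grep -Eq '^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$'
}

# Example value only; substitute the address you intend to set.
new_mac="52:54:00:aa:bb:cc"
if is_valid_mac "$new_mac"; then
  echo "MAC address looks well formed: $new_mac"
else
  echo "Malformed MAC address: $new_mac" >&2
fi
```

If the check passes, pass the same value to the openstack baremetal port set command shown in the procedure.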

23.2. Troubleshooting hardware introspection

You must run the introspection process to completion. However, ironic-inspector times out after a default period of one hour if the inspection ramdisk does not respond. Sometimes this indicates a bug in the inspection ramdisk, but usually the timeout occurs because of an environment misconfiguration, particularly in the BIOS boot settings.

This procedure contains information about common scenarios where environment misconfiguration occurs and advice about how to diagnose and resolve them.

Procedure

  1. Source the stackrc file:

    $ source ~/stackrc
  2. The director uses OpenStack Object Storage (swift) to save the hardware data obtained during the introspection process. If this service is not running, the introspection can fail. Check all services related to OpenStack Object Storage to ensure the service is running:

    (undercloud) $ sudo systemctl list-units 'tripleo_swift*'
  3. Check that your nodes are in a manageable state. Introspection does not inspect nodes in an available state, which is meant for deployment. If a node is in an available state, change it to a manageable state before introspection:

    (undercloud) $ openstack baremetal node manage [NODE UUID]
  4. Configure temporary access to the introspection ramdisk. You can provide either a temporary password or an SSH key to access the node during introspection debugging. Complete the following procedure to configure ramdisk access:

    1. Run the openssl passwd -1 command with a temporary password to generate an MD5 hash:

      (undercloud) $ openssl passwd -1 mytestpassword
      $1$enjRSyIw$/fYUpJwr6abFy/d.koRgQ/
    2. Edit the /var/lib/ironic/httpboot/inspector.ipxe file, find the line starting with kernel, and append the rootpwd parameter and the MD5 hash. For example:

      kernel http://192.2.0.1:8088/agent.kernel ipa-inspection-callback-url=http://192.168.0.1:5050/v1/continue ipa-inspection-collectors=default,extra-hardware,logs systemd.journald.forward_to_console=yes BOOTIF=${mac} ipa-debug=1 ipa-inspection-benchmarks=cpu,mem,disk rootpwd="$1$enjRSyIw$/fYUpJwr6abFy/d.koRgQ/" selinux=0

      Alternatively, append your public SSH key to the sshkey parameter.

      Note

      Include quotation marks for both the rootpwd and sshkey parameters.

  5. Run the introspection on the node:

    (undercloud) $ openstack overcloud node introspect [NODE UUID] --provide

    The --provide option causes the node state to change to available when the introspection completes.

  6. Identify the IP address of the node from the dnsmasq logs:

    (undercloud) $ sudo tail -f /var/log/containers/ironic-inspector/dnsmasq.log
  7. If an error occurs, access the node using the root user and temporary access details:

    $ ssh root@192.168.24.105

    Accessing the node during introspection means you can run diagnostic commands to troubleshoot the introspection failure.

  8. To stop the introspection process, run the following command:

    (undercloud) $ openstack baremetal introspection abort [NODE UUID]

    You can also wait until the process times out.

    Note

    OpenStack Platform director retries introspection three times after the initial abort. Run the openstack baremetal introspection abort command at each attempt to abort the introspection completely.
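
When the dnsmasq log is busy, picking out the address for one node by eye can be slow. The following sketch filters the log for the address that dnsmasq acknowledged for a given MAC address. The ip_for_mac function name is hypothetical, and the sketch assumes the usual dnsmasq log layout, in which a DHCPACK entry is followed by the IP address and then the MAC address.

```shell
# ip_for_mac is a hypothetical helper. It reads dnsmasq log lines on stdin
# and prints the most recent address acknowledged (DHCPACK) for one MAC.
ip_for_mac() {
  awk -v mac="$1" '
    {
      # Assumed layout: ... DHCPACK(<interface>) <ip> <mac> [hostname]
      for (i = 1; i < NF - 1; i++) {
        if ($i ~ /^DHCPACK\(/ && tolower($(i + 2)) == tolower(mac)) {
          ip = $(i + 1)
        }
      }
    }
    END { if (ip != "") print ip }
  '
}
```

For example, you could run sudo cat /var/log/containers/ironic-inspector/dnsmasq.log | ip_for_mac [NODE MAC] and use the result as the target of the ssh command in the procedure.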

23.3. Troubleshooting workflows and executions

The OpenStack Workflow (mistral) service groups multiple OpenStack tasks into workflows. Red Hat OpenStack Platform uses a set of these workflows to perform common functions across the director, including bare metal node control, validations, plan management, and overcloud deployment.

For example, when running the openstack overcloud deploy command, the OpenStack Workflow service executes two workflows. The first workflow uploads the deployment plan:

Removing the current plan files
Uploading new plan files
Started Mistral Workflow. Execution ID: aef1e8c6-a862-42de-8bce-073744ed5e6b
Plan updated

The second workflow starts the overcloud deployment:

Deploying templates in the directory /tmp/tripleoclient-LhRlHX/tripleo-heat-templates
Started Mistral Workflow. Execution ID: 97b64abe-d8fc-414a-837a-1380631c764d
2016-11-28 06:29:26Z [overcloud]: CREATE_IN_PROGRESS  Stack CREATE started
2016-11-28 06:29:26Z [overcloud.Networks]: CREATE_IN_PROGRESS  state changed
2016-11-28 06:29:26Z [overcloud.HeatAuthEncryptionKey]: CREATE_IN_PROGRESS  state changed
2016-11-28 06:29:26Z [overcloud.ServiceNetMap]: CREATE_IN_PROGRESS  state changed
...

The OpenStack Workflow service uses the following objects to track the workflow:

Actions
A particular instruction that OpenStack performs once an associated task runs. Examples include running shell scripts or performing HTTP requests. Some OpenStack components have in-built actions that OpenStack Workflow uses.
Tasks
Defines the action to run and the result of running the action. These tasks usually have actions or other workflows associated with them. Once a task completes, the workflow directs to another task, usually depending on whether the task succeeded or failed.
Workflows
A set of tasks grouped together and executed in a specific order.
Executions
Defines a particular action, task, or workflow running.

OpenStack Workflow also provides robust logging of executions, which helps identify issues with certain command failures. For example, if a workflow execution fails, you can identify the point of failure.

Procedure

  1. Source the stackrc file:

    $ source ~/stackrc
  2. List the workflow executions that have the failed state ERROR:

    (undercloud) $ openstack workflow execution list | grep "ERROR"
  3. Get the UUID of the failed workflow execution (for example, dffa96b0-f679-4cd2-a490-4769a3825262) and view the execution and its output:

    (undercloud) $ openstack workflow execution show dffa96b0-f679-4cd2-a490-4769a3825262
    (undercloud) $ openstack workflow execution output show dffa96b0-f679-4cd2-a490-4769a3825262
  4. These commands return information about the failed task in the execution. The openstack workflow execution show command also displays the workflow used for the execution (for example, tripleo.plan_management.v1.publish_ui_logs_to_swift). You can view the full workflow definition using the following command:

    (undercloud) $ openstack workflow definition show tripleo.plan_management.v1.publish_ui_logs_to_swift

    This is useful for identifying where in the workflow a particular task occurs.

  5. View action executions and their results using a similar command syntax:

    (undercloud) $ openstack action execution list
    (undercloud) $ openstack action execution show 8a68eba3-0fec-4b2a-adc9-5561b007e886
    (undercloud) $ openstack action execution output show 8a68eba3-0fec-4b2a-adc9-5561b007e886

    This is useful for identifying a specific action that causes issues.
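
If several executions fail, you might want only their UUIDs so that you can inspect each one in turn. The following sketch extracts them from the table that openstack workflow execution list prints. The failed_execution_ids function name is hypothetical, and the sketch assumes that the execution ID is the first column of the table.

```shell
# failed_execution_ids is a hypothetical helper. It reads a pipe-delimited
# table on stdin and prints the first column (assumed to be the execution
# ID) of every row that has a cell equal to ERROR.
failed_execution_ids() {
  awk -F'|' '
    {
      for (i = 2; i <= NF; i++) {
        cell = $i
        gsub(/^ +| +$/, "", cell)   # trim padding around the cell
        if (cell == "ERROR") {
          id = $2
          gsub(/^ +| +$/, "", id)
          print id
          break
        }
      }
    }
  '
}
```

For example, pipe the output of openstack workflow execution list into failed_execution_ids, and then run openstack workflow execution output show against each UUID that it prints.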

23.4. Troubleshooting overcloud creation and deployment

The initial creation of the overcloud occurs with the OpenStack Orchestration (heat) service. If an overcloud deployment has failed, use the OpenStack clients and service log files to diagnose the failed deployment.

Procedure

  1. Source the stackrc file:

    $ source ~/stackrc
  2. Run the deployment failures command:

    (undercloud) $ openstack overcloud failures
  3. Run the following command to display details of the failure:

    (undercloud) $ openstack stack failures list <OVERCLOUD_NAME> --long

    Replace <OVERCLOUD_NAME> with the name of your overcloud.

  4. Run the following command to identify the stacks that failed:

    (undercloud) $ openstack stack list --nested --property status=FAILED

23.5. Troubleshooting node provisioning

The OpenStack Orchestration (heat) service controls the provisioning process. If node provisioning fails, use the OpenStack clients and service log files to diagnose the issues.

Procedure

  1. Source the stackrc file:

    $ source ~/stackrc
  2. Check the bare metal service to see all registered nodes and their current status:

    (undercloud) $ openstack baremetal node list
    
    +----------+------+---------------+-------------+-----------------+-------------+
    | UUID     | Name | Instance UUID | Power State | Provision State | Maintenance |
    +----------+------+---------------+-------------+-----------------+-------------+
    | f1e261...| None | None          | power off   | available       | False       |
    | f0b8c1...| None | None          | power off   | available       | False       |
    +----------+------+---------------+-------------+-----------------+-------------+

    All nodes available for provisioning should have the following states set:

    • Maintenance set to False.
    • Provision State set to available before provisioning.

    The following table outlines some common provisioning failure scenarios.

| Problem | Cause | Solution |
|---|---|---|
| Maintenance sets itself to True automatically. | The director cannot access the power management for the nodes. | Check the credentials for node power management. |
| Provision State is set to available but nodes do not provision. | The problem occurred before bare metal deployment started. | Check the node details including the profile and flavor mapping. Check that the node hardware details are within the requirements for the flavor. |
| Provision State is set to wait call-back for a node. | The node provisioning process has not yet finished for this node. | Wait until this status changes. Otherwise, connect to the virtual console of the node and check the output. |
| Provision State is active and Power State is power on but the nodes do not respond. | The node provisioning has finished successfully and there is a problem during the post-deployment configuration step. | Diagnose the node configuration process. Connect to the virtual console of the node and check the output. |
| Provision State is error or deploy failed. | Node provisioning has failed. | View the bare metal node details with the openstack baremetal node show command and check the last_error field, which contains the error description. |
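
The scenarios above can be restated as a small lookup, which might be convenient when you check many nodes in a script. This is only a sketch: the provisioning_hint function name is hypothetical, and the hints are shortened paraphrases of the solutions listed above.

```shell
# provisioning_hint is a hypothetical helper that restates the scenarios
# above. Arguments: <provision state> <maintenance flag from node list>
provisioning_hint() {
  local state=$1
  local maintenance=$2
  if [ "$maintenance" = "True" ]; then
    echo "Check the credentials for node power management."
    return
  fi
  case "$state" in
    "wait call-back")
      echo "Wait for the state to change, or check the virtual console." ;;
    error|"deploy failed")
      echo "Check last_error in: openstack baremetal node show" ;;
    active)
      echo "Diagnose post-deployment configuration from the virtual console." ;;
    available)
      echo "Check the profile and flavor mapping and the hardware details." ;;
    *)
      echo "No hint for provision state: $state" ;;
  esac
}

# Example call with sample values.
provisioning_hint "deploy failed" False
```

The hints for the active and available states apply only when the node also shows the symptom described in the corresponding scenario.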

23.6. Troubleshooting IP address conflicts during provisioning

Introspection and deployment tasks will fail if the destination hosts are allocated an IP address that is already in use. To prevent these failures, you can perform a port scan of the Provisioning network to determine whether the discovery IP range and host IP range are free.

Procedure

  1. Install nmap:

    $ sudo dnf install nmap
  2. Use nmap to scan the IP address range for active addresses. This example scans the 192.168.24.0/24 range. Replace this value with the IP subnet of the Provisioning network in CIDR notation:

    $ sudo nmap -sn 192.168.24.0/24
  3. Review the output of the nmap scan. For example, you should see the IP address of the undercloud, and any other hosts that are present on the subnet:

    $ sudo nmap -sn 192.168.24.0/24
    
    Starting Nmap 6.40 ( http://nmap.org ) at 2015-10-02 15:14 EDT
    Nmap scan report for 192.168.24.1
    Host is up (0.00057s latency).
    Nmap scan report for 192.168.24.2
    Host is up (0.00048s latency).
    Nmap scan report for 192.168.24.3
    Host is up (0.00045s latency).
    Nmap scan report for 192.168.24.5
    Host is up (0.00040s latency).
    Nmap scan report for 192.168.24.9
    Host is up (0.00019s latency).
    Nmap done: 256 IP addresses (5 hosts up) scanned in 2.45 seconds

    If any of the active IP addresses conflict with the IP ranges in undercloud.conf, you will need to either change the IP address ranges or free up the IP addresses before introspecting or deploying the overcloud nodes.
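
Comparing the scan results with the ranges in undercloud.conf by hand is error prone when many hosts respond. The following sketch reads nmap -sn output and prints only the active addresses that fall inside a range that you supply. The ip_to_int and find_conflicts function names are hypothetical, and the sketch handles IPv4 addresses only.

```shell
# ip_to_int and find_conflicts are hypothetical helpers for illustration.

# Convert a dotted-quad IPv4 address to an integer so that addresses can
# be compared numerically.
ip_to_int() {
  local IFS=.
  set -- $1
  printf '%s\n' "$(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))"
}

# Read "nmap -sn" output on stdin and print every responding address that
# lies inside the inclusive range <start> <end>.
find_conflicts() {
  local start end addr n
  start=$(ip_to_int "$1")
  end=$(ip_to_int "$2")
  awk '/Nmap scan report for/ { print $NF }' | tr -d '()' |
    while read -r addr; do
      n=$(ip_to_int "$addr")
      if [ "$n" -ge "$start" ] && [ "$n" -le "$end" ]; then
        echo "$addr"
      fi
    done
}
```

For example, sudo nmap -sn 192.168.24.0/24 | find_conflicts [RANGE START] [RANGE END] prints nothing when no active host lies inside the range.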

23.7. Troubleshooting "No Valid Host Found" errors

Sometimes the /var/log/nova/nova-conductor.log file contains the following error:

NoValidHost: No valid host was found. There are not enough hosts available.

This error occurs when the Compute Scheduler cannot find a bare metal node suitable for booting the new instance. This usually means there is a mismatch between resources that the Compute service expects to find and resources that the Bare Metal service advertised to Compute. This procedure shows how to check if this is the case.

Procedure

  1. Source the stackrc file:

    $ source ~/stackrc
  2. Check that the introspection succeeded on the node. If the introspection fails, check that each node contains the required ironic node properties:

    (undercloud) $ openstack baremetal node show [NODE UUID]

    Check that the properties JSON field has valid values for the keys cpus, cpu_arch, memory_mb, and local_gb.

  3. Check the Compute flavor mapped to the node:

    (undercloud) $ openstack flavor show [FLAVOR NAME]

    Make sure it does not exceed the node properties for the required number of nodes.

  4. Run the openstack baremetal node list command to ensure that sufficient nodes are in the available state. Nodes in the manageable state usually signify a failed introspection.
  5. Run the openstack baremetal node list command to check that the nodes are not in maintenance mode. If a node changes to maintenance mode automatically, the likely cause is incorrect power management credentials. Check the power management credentials and then remove maintenance mode:

    (undercloud) $ openstack baremetal node maintenance unset [NODE UUID]
  6. If you are using automatic profile tagging, check that you have enough nodes corresponding to each flavor and profile. Run the openstack baremetal node show command on a node and check the capabilities key in the properties field. For example, a node tagged for the Compute role should contain profile:compute.
  7. Node information takes some time to propagate from the Bare Metal service to the Compute service after introspection. If you performed some steps manually, there might be a short period of time when nodes are not available to the Compute service (nova). Use the following command to check the total resources in your system:

    (undercloud) $ openstack hypervisor stats show
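
The checks in steps 4 and 5 can be combined into a single count of nodes that are ready for deployment. The following sketch counts rows of the openstack baremetal node list table in which Provision State is available and Maintenance is False. The count_deployable_nodes function name is hypothetical, and the sketch assumes the column order shown in the node provisioning section of this chapter.

```shell
# count_deployable_nodes is a hypothetical helper. It reads the table from
# "openstack baremetal node list" on stdin and counts the rows where
# Provision State is "available" and Maintenance is "False".
count_deployable_nodes() {
  awk -F'|' '
    function trim(s) { gsub(/^ +| +$/, "", s); return s }
    # Assumed column order:
    # UUID, Name, Instance UUID, Power State, Provision State, Maintenance
    NF >= 7 && trim($6) == "available" && trim($7) == "False" { n++ }
    END { print n + 0 }
  '
}
```

For example, openstack baremetal node list | count_deployable_nodes prints the number of nodes that are ready for deployment, which you can compare with the number of nodes that your deployment requests.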

23.8. Troubleshooting overcloud configuration

OpenStack Platform director uses Ansible to configure the overcloud. This procedure shows how to diagnose the overcloud’s Ansible playbooks (config-download) when errors occur.

Procedure

  1. Make sure the stack user has access to the files in the /var/lib/mistral directory on the undercloud:

    $ sudo setfacl -R -m u:stack:rwx /var/lib/mistral

    This command retains mistral user access to the directory.

  2. Change to the working directory for the config-download files. This is usually /var/lib/mistral/overcloud/.

    $ cd /var/lib/mistral/overcloud/
  3. Search the ansible.log file for the point of failure.

    $ less ansible.log

    Make a note of the step that failed.

  4. Find the step that failed in the config-download playbooks within the working directory to identify the action that took place.
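
The ansible.log file can be long, so paging through it to find the failure takes time. The following sketch prints the first line that contains fatal: together with the most recent TASK header that precedes it. The first_failed_task function name is hypothetical, and the sketch assumes that the log uses the standard Ansible TASK [...] and fatal: markers.

```shell
# first_failed_task is a hypothetical helper. It reads an Ansible log on
# stdin and prints the most recent "TASK [...]" header seen before the
# first line that contains "fatal:", followed by that line.
first_failed_task() {
  awk '
    /TASK \[/ { task = $0 }
    /fatal: / { if (task != "") print task; print; exit }
  '
}
```

For example, run first_failed_task < ansible.log from the config-download working directory, and then search the playbooks for the task name that it reports.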

23.9. Troubleshooting container configuration

OpenStack Platform director uses paunch to launch containers, podman to manage containers, and puppet to create container configuration. This procedure shows how to diagnose a container when errors occur.

Accessing the host

  1. Source the stackrc file:

    $ source ~/stackrc
  2. Get the IP address of the node with the container failure.

    (undercloud) $ openstack server list
  3. Log into the node:

    (undercloud) $ ssh heat-admin@192.168.24.60
  4. Change to the root user:

    $ sudo -i

Identifying failed containers

  1. View all containers:

    $ podman ps --all

    Identify the failed container. The failed container usually exits with a non-zero status.
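
On a node that runs many containers, scanning the full list for non-zero exit codes is tedious. The following sketch filters a list of container names and status strings for containers that exited with a non-zero code. The failed_containers function name is hypothetical, and the sketch assumes that the status text has the form Exited (N) followed by an age.

```shell
# failed_containers is a hypothetical helper. It reads lines of the form
# "<name> <status text>" on stdin and prints the names of containers whose
# status is "Exited (N)" with a non-zero N.
failed_containers() {
  awk '
    $2 == "Exited" {
      code = $3
      gsub(/[()]/, "", code)   # "(1)" becomes "1"
      if (code + 0 != 0) print $1
    }
  '
}
```

For example, podman ps --all --format '{{.Names}} {{.Status}}' | failed_containers lists only the containers to investigate further.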

Checking container logs

  1. Each container retains standard output from its main process. Use this output as a log to help determine what actually occurs during a container run. For example, to view the log for the keystone container, use the following command:

    $ sudo podman logs keystone

    In most cases, this log contains information about the cause of a container’s failure.

  2. The host also retains the stdout log for the failed service. You can find the stdout logs in /var/log/containers/stdouts/. For example, to view the log for a failed keystone container, run the following command:

    $ cat /var/log/containers/stdouts/keystone.log

Inspecting containers

In some situations, you might need to verify information about a container. For example, use the following command to view keystone container data:

$ sudo podman inspect keystone

This command returns a JSON object containing low-level configuration data. You can pipe the output to the jq command to parse specific data. For example, to view the container mounts for the keystone container, run the following command:

$ sudo podman inspect keystone | jq .[0].Mounts

You can also use the --format option to parse data to a single line, which is useful for running commands against sets of container data. For example, to recreate the options used to run the keystone container, use the following inspect command with the --format option:

$ sudo podman inspect --format='{{range .Config.Env}} -e "{{.}}" {{end}} {{range .Mounts}} -v {{.Source}}:{{.Destination}}:{{ join .Options "," }}{{end}} -ti {{.Config.Image}}' keystone

Note

The --format option uses Go syntax to create queries.

Use these options in conjunction with the podman run command to recreate the container for troubleshooting purposes:

$ OPTIONS=$( sudo podman inspect --format='{{range .Config.Env}} -e "{{.}}" {{end}} {{range .Mounts}} -v {{.Source}}:{{.Destination}}{{if .Mode}}:{{.Mode}}{{end}}{{end}} -ti {{.Config.Image}}' keystone )
$ sudo podman run --rm $OPTIONS /bin/bash

Running commands in a container

In some cases, you might need to obtain information from within a container through a specific Bash command. In this situation, use the following podman command to execute commands within a running container. For example, run the podman exec command to run a command inside the keystone container:

$ sudo podman exec -ti keystone <COMMAND>

Note

The -ti options run the command through an interactive pseudoterminal.

Replace <COMMAND> with the command you want to run. For example, each container has a health check script to verify the service connection. You can run the health check script for keystone with the following command:

$ sudo podman exec -ti keystone /openstack/healthcheck

To access the container’s shell, run podman exec using /bin/bash as the command you want to run inside the container:

$ sudo podman exec -ti keystone /bin/bash

Viewing a container filesystem

  1. To view the file system for the failed container, run the podman mount command. For example, to view the file system for a failed keystone container, run the following command:

    $ podman mount keystone

    This provides a mounted location to view the filesystem contents:

    /var/lib/containers/storage/overlay/78946a109085aeb8b3a350fc20bd8049a08918d74f573396d7358270e711c610/merged

    This is useful for viewing the Puppet reports within the container. You can find these reports in the var/lib/puppet/ directory within the container mount.

Exporting a container

When a container fails, you might need to investigate the full contents of its file system. In this case, you can export the full file system of a container as a tar archive. For example, to export the keystone container’s file system, run the following command:

$ sudo podman export keystone -o keystone.tar

This command creates the keystone.tar archive, which you can extract and explore.

23.10. Troubleshooting Compute node failures

Compute nodes use the Compute service to perform hypervisor-based operations. This means the main diagnosis for Compute nodes revolves around this service.

Procedure

  1. Source the stackrc file:

    $ source ~/stackrc
  2. Get the IP address of the Compute node containing the failure:

    (undercloud) $ openstack server list
  3. Log into the node:

    (undercloud) $ ssh heat-admin@192.168.24.60
  4. Change to the root user:

    $ sudo -i
  5. View the status of the container:

    $ sudo podman ps -f name=nova_compute
  6. The primary log file for Compute nodes is /var/log/containers/nova/nova-compute.log. If issues occur with Compute node communication, this log file is usually a good place to start a diagnosis.
  7. If performing maintenance on the Compute node, migrate the existing instances from the host to an operational Compute node, then disable the node.
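
When the Compute log is large, a summary of which components report errors can show where to start reading. The following sketch counts ERROR lines for each logger name. The error_sources function name is hypothetical, and the sketch assumes the default OpenStack log layout of date, time, process ID, level, and logger name at the start of each line.

```shell
# error_sources is a hypothetical helper. It reads log lines of the form
# "<date> <time> <pid> <LEVEL> <logger> ..." on stdin and prints how many
# ERROR lines each logger produced, most frequent first.
error_sources() {
  awk '$4 == "ERROR" { count[$5]++ }
       END { for (name in count) print count[name], name }' | sort -rn
}
```

For example, run error_sources < /var/log/containers/nova/nova-compute.log as the root user on the Compute node.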

23.11. Creating an sosreport

If you need to contact Red Hat for support on OpenStack Platform, you might need to generate an sosreport. See the following knowledgebase article for more information about creating an sosreport:

23.12. Log locations

Use the following logs to find out information about the undercloud and overcloud when troubleshooting.

Table 23.1. Logs on both the undercloud and overcloud nodes

| Information | Log Location |
|---|---|
| Containerized service logs | /var/log/containers/ |
| Standard output from containerized services | /var/log/containers/stdouts |
| Ansible configuration logs | /var/lib/mistral/overcloud/ansible.log |

Table 23.2. Additional logs on the undercloud node

| Information | Log Location |
|---|---|
| Command history for openstack overcloud deploy | /home/stack/.tripleo/history |
| Undercloud installation log | /home/stack/install-undercloud.log |

Table 23.3. Additional logs on the overcloud nodes

| Information | Log Location |
|---|---|
| Cloud-Init Log | /var/log/cloud-init.log |
| High availability log | /var/log/pacemaker.log |