
Chapter 14. Troubleshooting director errors

Errors can occur at certain stages of the director processes. This chapter describes how to diagnose common problems.

14.1. Troubleshooting node registration

Node registration issues usually occur because of incorrect node details. In these situations, validate the template file that contains your node details and correct the imported node details.

Procedure

Source the stackrc file:
```
$ source ~/stackrc
```

Run the node import command with the --validate-only option. This option validates your node template without performing an import:

```
(undercloud) $ openstack overcloud node import --validate-only ~/nodes.json
Waiting for messages on queue 'tripleo' with no timeout.

Successfully validated environment file
```

To fix incorrect details with imported nodes, run the openstack baremetal commands to update node details. The following example shows how to change networking details:
1. Identify the assigned port UUID for the imported node:
```
$ source ~/stackrc
(undercloud) $ openstack baremetal port list --node [NODE UUID]
```
2. Update the MAC address:
```
(undercloud) $ openstack baremetal port set --address=[NEW MAC] [PORT UUID]
```
3. Configure a new IPMI address on the node:
```
(undercloud) $ openstack baremetal node set --driver-info ipmi_address=[NEW IPMI ADDRESS] [NODE UUID]
```
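
Before you update a port, it can help to check that the new MAC address is well formed. The following is a minimal sketch, not part of director: the `is_valid_mac` function name is hypothetical, and it assumes the colon-separated, six-octet notation.

```shell
# Hypothetical helper: succeeds only for a colon-separated, six-octet MAC address.
is_valid_mac() {
  printf '%s\n' "$1" | grep -Eq '^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$'
}

# Example use with the command from step 2 (placeholders as in the procedure):
#   is_valid_mac "$NEW_MAC" && openstack baremetal port set --address="$NEW_MAC" "$PORT_UUID"
```

The check only catches formatting mistakes. It cannot tell you whether the address belongs to the intended network interface.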

14.2. Troubleshooting hardware introspection

The Bare Metal Provisioning inspector service, ironic-inspector, times out after a default one-hour period if the inspection RAM disk does not respond. The timeout might indicate a bug in the inspection RAM disk, but usually the timeout occurs due to an environment misconfiguration.

You can diagnose and resolve common environment misconfiguration issues to ensure the introspection process runs to completion.

Procedure

Source the stackrc undercloud credentials file:
```
$ source ~/stackrc
```
Ensure that your nodes are in a manageable state. The introspection does not inspect nodes in an available state, which is meant for deployment. If you want to inspect nodes that are in an available state, change the node status to manageable state before introspection:
```
(undercloud)$ openstack baremetal node manage <node_uuid>
```

To configure temporary access to the introspection RAM disk during introspection debugging, use the sshkey parameter to append your public SSH key to the kernel configuration in the /httpboot/inspector.ipxe file:

```
kernel http://192.2.0.1:8088/agent.kernel ipa-inspection-callback-url=http://192.168.0.1:5050/v1/continue ipa-inspection-collectors=default,extra-hardware,logs systemd.journald.forward_to_console=yes BOOTIF=${mac} ipa-debug=1 ipa-inspection-benchmarks=cpu,mem,disk selinux=0 sshkey="<public_ssh_key>"
```

Run the introspection on the node:
```
(undercloud)$ openstack overcloud node introspect <node_uuid> --provide
```
Use the --provide option to change the node state to available after the introspection completes.

Identify the IP address of the node from the dnsmasq logs:

```
(undercloud)$ sudo tail -f /var/log/containers/ironic-inspector/dnsmasq.log
```

If an error occurs, access the node using the root user and temporary access details:
```
$ ssh root@192.168.24.105
```
Access the node during introspection to run diagnostic commands and troubleshoot the introspection failure.
To stop the introspection process, run the following command:
```
(undercloud)$ openstack baremetal introspection abort <node_uuid>
```
You can also wait until the process times out.
Note
Red Hat OpenStack Platform director retries introspection three times after the initial abort. Run the openstack baremetal introspection abort command at each attempt to abort the introspection completely.
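
To find the node IP address in the dnsmasq log more quickly, you can filter the log for DHCP acknowledgements. This is a sketch under assumptions: the `dhcp_leases_from_log` function name is hypothetical, and it assumes the usual dnsmasq log line layout, in which a `DHCPACK(<interface>)` token is followed by the leased IP address and the client MAC address.

```shell
# Hypothetical helper: reads dnsmasq log lines on stdin and prints
# "<ip> <mac>" for every DHCPACK entry it finds.
dhcp_leases_from_log() {
  awk '{
    for (i = 1; i < NF; i++)
      if ($i ~ /^DHCPACK\(/) { print $(i + 1), $(i + 2); break }
  }'
}

# Example use on the undercloud:
#   sudo cat /var/log/containers/ironic-inspector/dnsmasq.log | dhcp_leases_from_log
```

If your log lines use a different layout, adjust the pattern accordingly.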

14.3. Troubleshooting overcloud creation and deployment

The initial creation of the overcloud occurs with the OpenStack Orchestration (heat) service. If an overcloud deployment fails, use the OpenStack clients and service log files to diagnose the failed deployment.

Procedure

Source the stackrc file:
```
$ source ~/stackrc
```

Launch the ephemeral Heat process:

```
(undercloud)$ openstack tripleo launch heat --heat-dir /home/stack/overcloud-deploy/overcloud/heat-launcher --restore-db
(undercloud)$ export OS_CLOUD=heat
```

View the details of the failure:
```
(undercloud)$ openstack stack failures list <overcloud> --long
```
- Replace <overcloud> with the name of your overcloud.

Identify the stacks that failed:

```
(undercloud)$ openstack stack list --nested --property status=FAILED
```

Remove the ephemeral Heat process from the undercloud:
```
(undercloud)$ openstack tripleo launch heat --kill
```
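
The launch, inspect, and cleanup steps in this procedure can be combined so that the ephemeral Heat process is always removed afterwards. The following is a sketch only: the `show_overcloud_failures` function and its two parameters are hypothetical, and it uses only the commands shown in this procedure.

```shell
# Hypothetical helper: runs the failure listing against an ephemeral Heat
# process and removes that process afterwards, even if the listing fails.
show_overcloud_failures() {
  stack_name=$1   # name of your overcloud
  heat_dir=$2     # heat-launcher directory, as passed to --heat-dir

  openstack tripleo launch heat --heat-dir "$heat_dir" --restore-db || return 1

  (
    # Point the client at the ephemeral Heat process in this subshell only.
    export OS_CLOUD=heat
    status=0
    openstack stack failures list "$stack_name" --long || status=$?
    openstack tripleo launch heat --kill
    exit "$status"
  )
}

# Example use, after sourcing ~/stackrc:
#   show_overcloud_failures overcloud /home/stack/overcloud-deploy/overcloud/heat-launcher
```

The subshell keeps `OS_CLOUD=heat` from leaking into your interactive session, and the function returns the exit status of the failure listing.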

14.4. Troubleshooting node provisioning

The OpenStack Orchestration (heat) service controls the provisioning process. If node provisioning fails, use the OpenStack clients and service log files to diagnose the issues.

Procedure

Source the stackrc file:
```
$ source ~/stackrc
```

Check the bare metal service to see all registered nodes and their current status:

```
(undercloud) $ openstack baremetal node list

+----------+------+---------------+-------------+-----------------+-------------+
| UUID     | Name | Instance UUID | Power State | Provision State | Maintenance |
+----------+------+---------------+-------------+-----------------+-------------+
| f1e261...| None | None          | power off   | available       | False       |
| f0b8c1...| None | None          | power off   | available       | False       |
+----------+------+---------------+-------------+-----------------+-------------+
```

All nodes available for provisioning should have the following states set:

Maintenance set to False.
Provision State set to available before provisioning.

If a node does not have Maintenance set to False or Provision State set to available, then use the following table to identify the problem and the solution:

| Problem | Cause | Solution |
|---|---|---|
| Maintenance sets itself to `True` automatically. | Director cannot access the power management for the nodes. | Check the credentials for node power management. |
| Provision State is set to `available` but nodes do not provision. | The problem occurred before bare-metal deployment started. | Check that the node hardware details are within the requirements. |
| Provision State is set to `wait call-back` for a node. | The node provisioning process has not yet finished for this node. | Wait until this status changes. Otherwise, connect to the virtual console of the node and check the output. |
| Provision State is `active` and Power State is `power on` but the nodes do not respond. | The node provisioning has finished successfully and there is a problem during the post-deployment configuration step. | Diagnose the node configuration process. Connect to the virtual console of the node and check the output. |
| Provision State is `error` or `deploy failed`. | Node provisioning has failed. | View the bare metal node details with the `openstack baremetal node show` command and check the `last_error` field, which contains the error description. |
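
With many nodes, it can be easier to filter the node list for nodes that do not meet both conditions. The following sketch assumes the default table output and the column order shown in the example above; the `find_unready_nodes` function name is hypothetical.

```shell
# Hypothetical helper: reads the table printed by "openstack baremetal node list"
# on stdin and prints UUID, Provision State, and Maintenance (tab-separated)
# for every node that is not "available" with Maintenance "False".
find_unready_nodes() {
  awk -F'|' '
    /^\|/ {
      uuid = $2; state = $6; maint = $7
      gsub(/^ +| +$/, "", uuid)
      gsub(/^ +| +$/, "", state)
      gsub(/^ +| +$/, "", maint)
      if (uuid == "UUID") next   # skip the header row
      if (state != "available" || maint != "False")
        print uuid "\t" state "\t" maint
    }'
}

# Example use, after sourcing ~/stackrc:
#   openstack baremetal node list | find_unready_nodes
```

An empty result means that every listed node satisfies both conditions.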

Additional resources

Bare-metal node provisioning states

14.5. Troubleshooting IP address conflicts during provisioning

Introspection and deployment tasks fail if the destination hosts are allocated an IP address that is already in use. To prevent these failures, you can perform a port scan of the Provisioning network to determine whether the discovery IP range and host IP range are free.

Procedure

Install nmap:
```
$ sudo dnf install nmap
```
Use nmap to scan the IP address range for active addresses. This example scans the 192.168.24.0/24 range. Replace this range with the IP subnet of the Provisioning network, in CIDR notation:
```
$ sudo nmap -sn 192.168.24.0/24
```

Review the output of the nmap scan. For example, you should see the IP address of the undercloud, and any other hosts that are present on the subnet:

```
$ sudo nmap -sn 192.168.24.0/24

Starting Nmap 6.40 ( http://nmap.org ) at 2015-10-02 15:14 EDT
Nmap scan report for 192.168.24.1
Host is up (0.00057s latency).
Nmap scan report for 192.168.24.2
Host is up (0.00048s latency).
Nmap scan report for 192.168.24.3
Host is up (0.00045s latency).
Nmap scan report for 192.168.24.5
Host is up (0.00040s latency).
Nmap scan report for 192.168.24.9
Host is up (0.00019s latency).
Nmap done: 256 IP addresses (5 hosts up) scanned in 2.45 seconds
```

If any of the active IP addresses conflict with the IP ranges in undercloud.conf, you must either change the IP address ranges or release the IP addresses before you introspect or deploy the overcloud nodes.
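
To compare the scan results with an IP range, you can filter the nmap output for active addresses that fall inside that range. This is a minimal sketch: the `active_ips_in_range` function name is hypothetical, and the range boundaries in the usage example are illustrative values, not values read from your undercloud.conf file.

```shell
# Hypothetical helper: reads "nmap -sn" output on stdin and prints every
# active IPv4 address between the first and last address given as arguments.
active_ips_in_range() {
  awk -v lo="$1" -v hi="$2" '
    # Convert a dotted-quad address to a single number for comparison.
    function ip2int(ip,   p) {
      split(ip, p, ".")
      return ((p[1] * 256 + p[2]) * 256 + p[3]) * 256 + p[4]
    }
    /^Nmap scan report for / {
      ip = $NF
      gsub(/[()]/, "", ip)   # nmap wraps the address in () when it prints a host name
      n = ip2int(ip)
      if (n >= ip2int(lo) && n <= ip2int(hi)) print ip
    }'
}

# Example use with an illustrative range:
#   sudo nmap -sn 192.168.24.0/24 | active_ips_in_range 192.168.24.5 192.168.24.24
```

Any address that this filter prints is already in use inside the range that you checked.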

14.6. Troubleshooting overcloud configuration

Red Hat OpenStack Platform director uses Ansible to configure the overcloud. Complete the following steps to diagnose Ansible playbook errors (config-download) on the overcloud.

Procedure

Ensure that the stack user has access to the files in the ~/config-download/overcloud directory on the undercloud:
```
$ sudo setfacl -R -m u:stack:rwx ~/config-download/overcloud
```
Change to the working directory for the config-download files:
```
$ cd ~/config-download/overcloud
```
Search the ansible.log file for the point of failure:
```
$ less ansible.log
```
Make a note of the step that failed.
Find the step that failed in the config-download playbooks within the working directory to identify the action that occurred.
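
Instead of paging through the whole log, you can print the first failure together with the task that produced it. The following sketch assumes the standard Ansible output markers `TASK [` and `fatal: [`; the `first_ansible_failure` function name is hypothetical.

```shell
# Hypothetical helper: prints the most recent "TASK [...]" line before the
# first "fatal: [...]" line, then the fatal line itself.
# Returns non-zero if the log contains no fatal line.
first_ansible_failure() {
  awk '
    /TASK \[/  { task = $0 }
    /fatal: \[/ {
      if (task != "") print task
      print
      found = 1
      exit
    }
    END { exit(found ? 0 : 1) }
  ' "$1"
}

# Example use from the working directory:
#   first_ansible_failure ansible.log
```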

14.7. Troubleshooting container configuration

Red Hat OpenStack Platform director uses podman to manage containers and puppet to create container configuration. This procedure shows how to diagnose a container when errors occur.

Accessing the host

Source the stackrc file:
```
$ source ~/stackrc
```
Get the IP address of the node with the container failure:
```
(undercloud) $ metalsmith list
```
Log in to the node:
```
(undercloud) $ ssh tripleo-admin@192.168.24.60
```

Identifying failed containers

View all containers:
```
$ sudo podman ps --all
```
Identify the failed container. The failed container usually exits with a non-zero status.
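
On a node with many containers, you can reduce the listing to containers that exited with a non-zero status. This is a sketch under assumptions: the `failed_containers` function name is hypothetical, and it expects one container per line in the form `<name> <status text>`, where the status text of a stopped container starts with `Exited (<code>)`.

```shell
# Hypothetical helper: reads "<name> <status text>" lines on stdin and prints
# the name and exit code of every container that exited with a non-zero status.
failed_containers() {
  awk '$2 == "Exited" && $3 != "(0)" { print $1, $3 }'
}

# Example use (assumes the Names and Status fields of the podman ps Go template):
#   sudo podman ps --all --format '{{.Names}} {{.Status}}' | failed_containers
```

Containers that run once and exit with status 0 are expected in a healthy deployment, so the filter ignores them.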

Checking container logs

Each container retains standard output from its main process. Use this output as a log to help determine what actually occurs during a container run. For example, to view the log for the keystone container, run the following command:
```
$ sudo podman logs keystone
```
In most cases, this log contains information about the cause of a container failure.
The host also retains the stdout log for the failed service. You can find the stdout logs in /var/log/containers/stdouts/. For example, to view the log for a failed keystone container, run the following command:
```
$ cat /var/log/containers/stdouts/keystone.log
```

Inspecting containers

In some situations, you might need to verify information about a container. For example, use the following command to view keystone container data:

```
$ sudo podman inspect keystone
```

This command returns a JSON object containing low-level configuration data. You can pipe the output to the jq command to parse specific data. For example, to view the container mounts for the keystone container, run the following command:

```
$ sudo podman inspect keystone | jq '.[0].Mounts'
```

You can also use the --format option to parse data to a single line, which is useful for running commands against sets of container data. For example, to recreate the options used to run the keystone container, use the following inspect command with the --format option:

```
$ sudo podman inspect --format='{{range .Config.Env}} -e "{{.}}" {{end}} {{range .Mounts}} -v {{.Source}}:{{.Destination}}:{{ join .Options "," }}{{end}} -ti {{.Config.Image}}' keystone
```

Note

The --format option uses Go syntax to create queries.

Use these options in conjunction with the podman run command to recreate the container for troubleshooting purposes:

```
$ OPTIONS=$( sudo podman inspect --format='{{range .Config.Env}} -e "{{.}}" {{end}} {{range .Mounts}} -v {{.Source}}:{{.Destination}}{{if .Mode}}:{{.Mode}}{{end}}{{end}} -ti {{.Config.Image}}' keystone )
$ sudo podman run --rm $OPTIONS /bin/bash
```

Running commands in a container

In some cases, you might need to obtain information from within a container through a specific Bash command. In this situation, use the following podman command to execute commands within a running container. For example, run the podman exec command to run a command inside the keystone container:

```
$ sudo podman exec -ti keystone <COMMAND>
```

Note

The -ti options run the command through an interactive pseudoterminal.

Replace <COMMAND> with the command you want to run. For example, each container has a health check script to verify the service connection. You can run the health check script for keystone with the following command:

```
$ sudo podman exec -ti keystone /openstack/healthcheck
```

To access the container shell, run podman exec using /bin/bash as the command you want to run inside the container:

```
$ sudo podman exec -ti keystone /bin/bash
```

Viewing a container filesystem

To view the file system for the failed container, run the podman mount command. For example, to view the file system for a failed keystone container, run the following command:
```
$ sudo podman mount keystone
```
This provides a mounted location to view the filesystem contents:
```
/var/lib/containers/storage/overlay/78946a109085aeb8b3a350fc20bd8049a08918d74f573396d7358270e711c610/merged
```
This is useful for viewing the Puppet reports within the container. You can find these reports in the var/lib/puppet/ directory within the container mount.

Exporting a container

When a container fails, you might need to investigate the full contents of its file system. In this case, you can export the full file system of a container as a tar archive. For example, to export the keystone container file system, run the following command:

```
$ sudo podman export keystone -o keystone.tar
```

This command creates the keystone.tar archive, which you can extract and explore.
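
One way to explore the archive is to unpack it into a dedicated directory so that the container files stay separate from the host file system. This is a minimal sketch: the `extract_container_export` function name and the `keystone-fs` directory in the usage example are hypothetical.

```shell
# Hypothetical helper: unpacks an exported container archive into its own directory.
extract_container_export() {
  archive=$1   # tar archive created by "podman export"
  dest=$2      # directory to unpack into; created if it does not exist
  mkdir -p "$dest" && tar -xf "$archive" -C "$dest"
}

# Example use:
#   extract_container_export keystone.tar keystone-fs
#   ls keystone-fs/var/lib/puppet/
```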

14.8. Troubleshooting Compute node failures

Compute nodes use the Compute service to perform hypervisor-based operations. This means the main diagnosis for Compute nodes revolves around this service.

Procedure

Source the stackrc file:
```
$ source ~/stackrc
```
Get the IP address of the Compute node that contains the failure:
```
(undercloud) $ openstack server list
```
Log in to the node:
```
(undercloud) $ ssh tripleo-admin@192.168.24.60
```

Change to the root user:
```
$ sudo -i
```
View the status of the container:
```
$ sudo podman ps -f name=nova_compute
```
The primary log file for Compute nodes is /var/log/containers/nova/nova-compute.log. If issues occur with Compute node communication, use this file to begin the diagnosis.
If you perform maintenance on the Compute node, migrate the existing instances from the host to an operational Compute node, then disable the node.
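
To begin the diagnosis, you can pull the most recent error entries out of the Compute log. The following sketch assumes that each log line contains the level name, such as `ERROR` or `CRITICAL`, as a separate space-delimited word; the `recent_log_errors` function name is hypothetical.

```shell
# Hypothetical helper: prints the last N lines (default 20) of a log file
# that contain the level name ERROR or CRITICAL.
recent_log_errors() {
  log_file=$1
  count=${2:-20}
  grep -E ' (ERROR|CRITICAL) ' "$log_file" | tail -n "$count"
}

# Example use as the root user on the Compute node:
#   recent_log_errors /var/log/containers/nova/nova-compute.log 50
```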

14.9. Creating an sosreport

If you need to contact Red Hat for support with Red Hat OpenStack Platform, you might need to generate an sosreport. For more information about creating an sosreport, see:

"How to collect all required logs for Red Hat Support to investigate an OpenStack issue"

14.10. Log locations

Use the following logs to gather information about the undercloud and overcloud when you troubleshoot issues.

Table 14.1. Logs on both the undercloud and overcloud nodes

| Information | Log location |
|---|---|
| Containerized service logs | `/var/log/containers/` |
| Standard output from containerized services | `/var/log/containers/stdouts` |
| Ansible configuration logs | `~/ansible.log` |

Table 14.2. Additional logs on the undercloud node

| Information | Log location |
|---|---|
| Command history for `openstack overcloud deploy` | `/home/stack/.tripleo/history` |
| Undercloud installation log | `/home/stack/install-undercloud.log` |

Table 14.3. Additional logs on the overcloud nodes

| Information | Log location |
|---|---|
| Cloud-Init Log | `/var/log/cloud-init.log` |
| High availability log | `/var/log/pacemaker.log` |
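
When you collect logs from a node, a quick check shows which of these locations exist on that node. This is a minimal sketch; the `report_log_paths` function name is hypothetical, and you pass in whichever paths from the tables above apply to the node.

```shell
# Hypothetical helper: prints "present <path>" or "missing <path>" for each argument.
report_log_paths() {
  for path in "$@"; do
    if [ -e "$path" ]; then
      echo "present $path"
    else
      echo "missing $path"
    fi
  done
}

# Example use on an overcloud node:
#   report_log_paths /var/log/containers/ /var/log/cloud-init.log /var/log/pacemaker.log
```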
