Chapter 7. Investigating and Fixing HA Controller Resources
The pcs constraint show command displays the constraints that govern how services are launched. Its output shows how each resource is constrained by location (it can run only on certain hosts), ordering (it depends on other resources being started first), and colocation (it must run on the same host as another resource). If the output points to a problem, fix that problem, then clean up the affected resources.
Here is truncated output from pcs constraint show on a controller node:
$ sudo pcs constraint show
Location Constraints:
  Resource: my-ipmilan-for-controller-0
    Disabled on: overcloud-controller-0 (score:-INFINITY)
  Resource: my-ipmilan-for-controller-1
    Disabled on: overcloud-controller-1 (score:-INFINITY)
  Resource: my-ipmilan-for-controller-2
    Disabled on: overcloud-controller-2 (score:-INFINITY)
Ordering Constraints:
  start ip-172.16.0.10 then start haproxy-clone (kind:Optional)
  start ip-10.200.0.6 then start haproxy-clone (kind:Optional)
  start ip-172.19.0.10 then start haproxy-clone (kind:Optional)
  start ip-192.168.1.150 then start haproxy-clone (kind:Optional)
  start ip-172.16.0.11 then start haproxy-clone (kind:Optional)
  start ip-172.18.0.10 then start haproxy-clone (kind:Optional)
  start mongod-clone then start openstack-ceilometer-central-clone (kind:Mandatory)
  start openstack-glance-registry-clone then start openstack-glance-api-clone (kind:Mandatory)
  start openstack-heat-api-clone then start openstack-heat-api-cfn-clone (kind:Mandatory)
  start delay-clone then start openstack-ceilometer-alarm-evaluator-clone (kind:Mandatory)
  ...
Colocation Constraints:
  ip-172.16.0.10 with haproxy-clone (score:INFINITY)
  ip-172.18.0.10 with haproxy-clone (score:INFINITY)
  ip-10.200.0.6 with haproxy-clone (score:INFINITY)
  ip-172.19.0.10 with haproxy-clone (score:INFINITY)
  ip-172.16.0.11 with haproxy-clone (score:INFINITY)
  ip-192.168.1.150 with haproxy-clone (score:INFINITY)
  openstack-glance-api-clone with openstack-glance-registry-clone (score:INFINITY)
  openstack-cinder-volume with openstack-cinder-scheduler-clone (score:INFINITY)
  neutron-dhcp-agent-clone with neutron-openvswitch-agent-clone (score:INFINITY)
  ...
This output displays three major sections:
- Location Constraints
- This section shows that there are no particular constraints on where most resources are assigned. However, each my-ipmilan-for-controller-N resource is disabled (score -INFINITY) on the correspondingly numbered controller. That requires further investigation.
- Ordering Constraints
- Here, notice that the virtual IP address resources (IPaddr2) are set to start before HAProxy (kind:Optional). There are also many mandatory ordering constraints, such as starting mongod-clone before openstack-ceilometer-central-clone and starting openstack-glance-registry-clone before openstack-glance-api-clone. Knowing these constraints helps you understand the dependencies between services: to fix a broken service or other resource, you need to know which dependencies must be in place first.
- Colocation Constraints
- This section shows which resources must run on the same host. For example, certain virtual IP addresses are tied to the haproxy-clone resource. In addition, the openstack-glance-api-clone resource must be on the same host as the openstack-glance-registry-clone resource.
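To follow up on the Location Constraints section, it can help to pull the disabled resource and host pairs out of the output. The following is a minimal sketch, not part of pcs: the function name disabled_locations is hypothetical, and the parsing assumes the line layout shown in the sample output above ("Resource:" followed by "Disabled on:"), which may differ between pcs versions.

```shell
# disabled_locations: read `pcs constraint show` output on stdin and print
# one "<resource> <host> <score>" line per "Disabled on:" entry.
# Hypothetical helper; assumes the output layout shown in this chapter.
disabled_locations() {
    awk '
        /^Location Constraints:/  { in_loc = 1; next }   # enter the section
        /^[A-Za-z]+ Constraints:/ { in_loc = 0 }         # any other section ends it
        in_loc && $1 == "Resource:"               { res = $2 }
        in_loc && $1 == "Disabled" && $2 == "on:" { print res, $3, $4 }
    '
}

# Usage on a controller node:
#   sudo pcs constraint show | disabled_locations
```

Reading from standard input keeps the helper independent of how pcs is invoked, so the same function also works on saved output.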
7.1. Correcting Resource Problems on Controllers
Failed actions are listed by the pcs status command. Many different kinds of problems can occur, but in general you can approach them in the following ways:
- Controller problem
- If health checks to a controller are failing, log in to the controller and check whether services can start without problems. Service startup problems could indicate a communication problem between controllers. Other indications of communication problems between controllers include:
- A controller gets fenced disproportionately more often than other controllers, and/or
- A suspiciously large number of services are failing on a specific controller.
- Individual resource problem
- If services on a controller are generally working, but an individual resource is failing, see if you can figure out the problem from the pcs status messages. If you need more information, log in to the controller where the resource is failing and try some of the steps below.
To determine the problem with an individual failed resource, look at the Ordering Constraints shown earlier in this chapter. Make sure all the resources that the failed resource depends on are up and running. Then work your way up from the bottom of the dependency chain, correcting each failed resource in turn.
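The dependency check described above can be sketched as a small filter over the pcs constraint show output. This is an illustrative helper, not a pcs feature: the function name prerequisites_of is hypothetical, it assumes ordering lines of the form "start A then start B (kind:...)" as shown in this chapter, and it treats Optional and Mandatory constraints alike.

```shell
# prerequisites_of RESOURCE: read `pcs constraint show` output on stdin and
# print every resource ordered to start before RESOURCE, directly or
# through a chain of ordering constraints (nearest prerequisites first).
# Hypothetical helper; assumes the output layout shown in this chapter.
prerequisites_of() {
    awk -v target="$1" '
        /^Ordering Constraints:/  { in_order = 1; next }
        /^[A-Za-z]+ Constraints:/ { in_order = 0 }
        in_order && $1 == "start" && $3 == "then" {
            # $2 must start before $5
            before[$5] = before[$5] " " $2
        }
        END {
            # Breadth-first walk from the target back through its prerequisites.
            queue[1] = target; head = 1; tail = 1
            seen[target] = 1
            while (head <= tail) {
                cur = queue[head++]
                n = split(before[cur], deps, " ")
                for (i = 1; i <= n; i++) {
                    if (!(deps[i] in seen)) {
                        seen[deps[i]] = 1
                        print deps[i]
                        queue[++tail] = deps[i]
                    }
                }
            }
        }
    '
}

# Usage on a controller node:
#   sudo pcs constraint show | prerequisites_of openstack-glance-api-clone
```

Each name it prints is a resource to confirm as running in pcs status before debugging the failed resource itself.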
Given the name of the failed resource and the controller it’s running on, you can log in to that controller to debug the problem. If the failed resource is a systemd service (such as openstack-ceilometer-api), you can use systemctl to check its status and journalctl to search through its journal messages. For example:
$ sudo systemctl status openstack-ceilometer-api
openstack-ceilometer-api.service - Cluster Controlled openstack-ceilometer-api
   Loaded: loaded (/usr/lib/systemd/system/openstack-ceilometer-api.service; disabled)
  Drop-In: /run/systemd/system/openstack-ceilometer-api.service.d
           └─50-pacemaker.conf
   Active: active (running) since Thu 2015-10-08 13:30:44 EDT; 1h 4min ago
 Main PID: 17865 (ceilometer-api)
   CGroup: /system.slice/openstack-ceilometer-api.service
           └─17865 /usr/bin/python /usr/bin/ceilometer-api --logfile /var/log/ceilometer/api.log
Oct 08 13:30:44 overcloud-controller-2.localdomain systemd: Starting Cluster Controlled openstack-ceilo.....
Oct 08 13:30:44 overcloud-controller-2.localdomain systemd: Started Cluster Controlled openstack-ceilom...i.
Oct 08 13:30:49 overcloud-controller-2.localdomain ceilometer-api: /usr/lib64/python2.7/site-package....

$ sudo journalctl -u openstack-ceilometer-api
-- Logs begin at Thu 2015-10-01 08:57:25 EDT, end at Thu 2015-10-08 14:40:18 EDT. --
Oct 01 11:22:41 overcloud-controller-2.localdomain systemd: Starting Cluster Controlled openstack...
Oct 01 11:22:41 overcloud-controller-2.localdomain systemd: Started Cluster Controlled openstack-ceilometer-api...
Oct 01 11:22:52 overcloud-controller-2.localdomain ceilometer-api: /usr/lib64/python2.7/...
After you have corrected the failed resource, you can run the pcs resource cleanup command to reset the status of the resource and its fail count. For example, after finding and fixing a problem with the httpd-clone resource, run:
$ sudo pcs resource cleanup httpd-clone
Resource: httpd-clone successfully cleaned up
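After a cleanup, it is worth checking pcs status again to confirm that the failure does not come back. A minimal sketch of that sequence follows. The function name cleanup_and_recheck and the PCS_CMD variable are hypothetical conveniences, and the filter assumes that pcs status mentions the resource by name.

```shell
# Command used to reach pcs; the examples in this chapter run it through sudo.
# PCS_CMD is a hypothetical override, not a pcs setting.
PCS_CMD=${PCS_CMD:-sudo pcs}

# cleanup_and_recheck RESOURCE
# Reset the status and fail count of RESOURCE, then print the pcs status
# lines that mention it so that a recurring failure is easy to spot.
cleanup_and_recheck() {
    resource=$1
    # Stop here if the cleanup itself fails.
    $PCS_CMD resource cleanup "$resource" || return 1
    # -F matches the resource name literally rather than as a pattern.
    $PCS_CMD status | grep -F -- "$resource"
}

# Usage on a controller node:
#   cleanup_and_recheck httpd-clone
```

If the resource still appears under the failed actions in the filtered output, the underlying problem has not been fixed and the cleanup only reset the counters.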