Deploying overcloud fails with "No valid host was found. There are not enough hosts available., Code:500"

Solution Verified

Environment

  • Red Hat Enterprise Linux 7.2
  • Red Hat Enterprise Linux OpenStack Platform 7

Issue

Deploying the overcloud fails because the control or compute flavors are larger than the overcloud node specs (for example, the control flavor is too big for the overcloud nodes), producing:

    No valid host was found. There are not enough hosts available., Code: 500

In another scenario:
  • The openstack overcloud node delete command ran for a long time and then errored out.
  • One of the existing nodes had been brought down with the poweroff command as root.
  • The UCS profile was updated and RAID1 was configured. Re-applying this profile wiped the OS and powered down the blade. After this, the openstack overcloud node delete command was run again, and it timed out with errors.
  • The compute nodes were deleted and could not be re-added because of the same error:
    Unable to re-add the nodes to the OpenStack compute cluster: No valid host was found. There are not enough hosts available., Code: 500

Resolution

  • Check the exact hardware specs of the overcloud nodes and create the flavors accordingly.
  • This issue can occur when the values in the flavors are LARGER than the actual specs of the overcloud nodes.
  • If the above two points do not apply, check the ironic-conductor log (/var/log/ironic/ironic-conductor.log) for a message like the one below. If it is present, and /var/lib/ironic/ is not a separate filesystem, you may need to expand the /var filesystem to get rid of the "No valid host found" message.
2016-06-01 15:39:01.214 1392 WARNING ironic.conductor.manager [-] Error in deploy of node 48c71e36-3dec-4820-9404-284c6f9a600b: Disk volume where '/var/lib/ironic/master_images/tmpgojxFy' is located doesn't have enough disk space. Required 4207 MiB, only 1179 MiB available space present.
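A quick way to test for the disk-space condition above is to check the free space on the filesystem that holds Ironic's image cache. This is a minimal sketch: the default path /var/lib/ironic and the 5120 MiB threshold (the ~5 GB minimum noted in the Root Cause section) are assumptions to adjust for your layout.

```shell
# Warn when the filesystem holding Ironic's image cache is below a minimum.
check_free_space() {
    dir=${1:-/var/lib/ironic}
    min_mb=${2:-5120}
    # df -Pm prints sizes in MiB; column 4 of the data row is "Available"
    free_mb=$(df -Pm "$dir" | awk 'NR==2 {print $4}')
    echo "free space under ${dir}: ${free_mb} MiB"
    if [ "${free_mb}" -lt "${min_mb}" ]; then
        echo "WARNING: below ${min_mb} MiB; deploys may fail with 'No valid host found'"
        return 1
    fi
}

# Example (run on the undercloud):
# check_free_space /var/lib/ironic
```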

Root Cause

  • In general, this indicates a typo in one or more templates.
  • In this specific setup, one of the nodes did not match the flavor created.
  • This was because the node did not have the same RAID configuration applied.
  • As a result, its primary disk was detected as 1 GB instead of 277 GB.
  • This issue is also reported when there is not enough space in /var/lib/ironic/ (which is usually under the /var filesystem) to copy the overcloud image. Ensure that the /var filesystem has at least 5 GB of free space.
In another scenario:
  • The affected node still had LUNs assigned from Cinder volumes that had not been removed. Before a node is deleted, the instances on it must be migrated (a prerequisite). Changing the UCS profile generated a new WWN ID for the compute node; after that, the affected node could be deployed and all compute nodes were up and running.
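The scheduler rejects a node as soon as any single flavor dimension exceeds what the node reports, so the 1 GB vs. 277 GB disk mismatch above is enough on its own. A minimal sketch of that comparison (the numeric values are illustrative; in practice they come from ironic node-show and nova flavor-show):

```shell
# flavor_fits NODE_CPUS NODE_RAM_MB NODE_DISK_GB FLAVOR_VCPUS FLAVOR_RAM_MB FLAVOR_DISK_GB
# Succeeds only when the node meets or exceeds every flavor dimension.
flavor_fits() {
    [ "$1" -ge "$4" ] && [ "$2" -ge "$5" ] && [ "$3" -ge "$6" ]
}

# Illustrative values: the node's disk was misdetected as 1 GB while the
# flavor (built for the intended 277 GB RAID1 volume) asks for 277 GB.
if flavor_fits 24 131072 1 24 131072 277; then
    echo "node satisfies flavor"
else
    echo "flavor exceeds node specs: scheduler returns 'No valid host was found'"
fi
```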

Diagnostic Steps

  • Check the failed resources:

$ heat resource-list -n 5 overcloud | grep -v COMPLETE
+---------------+--------------------------------------+-------------------------+-----------------+----------------------+-----------------+
| resource_name | physical_resource_id                 | resource_type           | resource_status | updated_time         | parent_resource |
+---------------+--------------------------------------+-------------------------+-----------------+----------------------+-----------------+
| Compute       | 885e9fb7-2f17-4094-be78-041d87a0a443 | OS::Heat::ResourceGroup | CREATE_FAILED   | 2015-12-24T07:11:12Z |                 |
| Controller    | 601add83-39a4-4674-8245-648e48ef6a96 | OS::Heat::ResourceGroup | CREATE_FAILED   | 2015-12-24T07:11:12Z |                 |
| 0             | 2b380d37-77a0-4f82-adf5-2e5978eadc26 | OS::TripleO::Controller | CREATE_FAILED   | 2015-12-24T07:11:23Z | Controller      |
| 0             | 8386e2bf-6d00-4884-91b4-6dbc59e9db87 | OS::TripleO::Compute    | CREATE_FAILED   | 2015-12-24T07:11:23Z | Compute         |
| Controller    | 828bd6f2-a275-4f45-a9db-fff48c779df5 | OS::Nova::Server        | CREATE_FAILED   | 2015-12-24T07:11:25Z | 0               |
| NovaCompute   | 40507abf-1751-4bfa-8de1-85e81fe008f1 | OS::Nova::Server        | CREATE_FAILED   | 2015-12-24T07:11:26Z | 0               |
+---------------+--------------------------------------+-------------------------+-----------------+----------------------+-----------------+
  • Check the failed resource's details to see if they provide any further hints:
$ heat resource-show overcloud <resource>
...(output-truncated)...
resource_status_reason | ResourceInError: resources.Controller.resources[1].resources.Controller: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500" 
...
  • Check what flavors are present and their details:
$ nova flavor-list

$ nova flavor-show <flavor-id>
  • Match, from nova-compute.log, the failed nova overcloud instance to its equivalent ironic node ID:

[In nova-compute.log-20160109.gz]
--------------------------------
2016-01-08 21:31:10.204 20474 DEBUG nova.virt.ironic.driver [-] [instance: a300cc91-99c1-4981-b0e9-408636aa1609] Still waiting for ironic node 6b665b11-16ae-4c09-960b-191e06c0e801 to become ACTIVE: power_state="power off", target_power_state="power on", provision_state="deploying", target_provision_state="active" _log_ironic_polling /usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py:161
...
2016-01-08 21:31:16.199 20474 DEBUG nova.virt.ironic.driver [-] [instance: a300cc91-99c1-4981-b0e9-408636aa1609] Still waiting for ironic node 6b665b11-16ae-4c09-960b-191e06c0e801 to become ACTIVE: power_state="power on", target_power_state=None, provision_state="wait call-back", target_provision_state="active" _log_ironic_polling /usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py:161
...
2016-01-08 22:04:58.653 20474 ERROR nova.compute.manager [req-fb391c00-4f60-4ce2-872c-520e075fd9cb 2fe0c1376ce14bae97aaf496d63c1534 9e24833435f04f338e7b14df1e051e91 - - -] [instance: a300cc91-99c1-4981-b0e9-408636aa1609] Instance failed to spawn
2016-01-08 22:04:58.653 20474 TRACE nova.compute.manager [instance: a300cc91-99c1-4981-b0e9-408636aa1609] Traceback (most recent call last):
----------------->8[trace-snipped]8<-------------------
2016-01-08 22:04:58.653 20474 TRACE nova.compute.manager [instance: a300cc91-99c1-4981-b0e9-408636aa1609] InstanceDeployFailure: Failed to provision instance a300cc91-99c1-4981-b0e9-408636aa1609: Timeout reached while waiting for callback for node 6b665b11-16ae-4c09-960b-191e06c0e801
2016-01-08 22:04:58.653 20474 TRACE nova.compute.manager [instance: a300cc91-99c1-4981-b0e9-408636aa1609]
2016-01-08 22:04:58.654 20474 INFO nova.compute.manager [req-fb391c00-4f60-4ce2-872c-520e075fd9cb 2fe0c1376ce14bae97aaf496d63c1534 9e24833435f04f338e7b14df1e051e91 - - -] [instance: a300cc91-99c1-4981-b0e9-408636aa1609] Terminating instance
2016-01-08 22:04:58.663 20474 WARNING nova.virt.ironic.driver [req-fb391c00-4f60-4ce2-872c-520e075fd9cb 2fe0c1376ce14bae97aaf496d63c1534 9e24833435f04f338e7b14df1e051e91 - - -] Destroy called on non-existing instance a300cc91-99c1-4981-b0e9-408636aa1609.
2016-01-08 22:04:58.664 20474 DEBUG nova.compute.claims [req-fb391c00-4f60-4ce2-872c-520e075fd9cb 2fe0c1376ce14bae97aaf496d63c1534 9e24833435f04f338e7b14df1e051e91 - - -] [instance: a300cc91-99c1-4981-b0e9-408636aa1609] Aborting claim: [Claim: 131072 MB memory, 277 GB disk] abort /usr/lib/python2.7/site-packages/nova/compute/claims.py:130
2016-01-08 22:04:58.664 20474 DEBUG oslo_concurrency.lockutils [req-fb391c00-4f60-4ce2-872c-520e075fd9cb 2fe0c1376ce14bae97aaf496d63c1534 9e24833435f04f338e7b14df1e051e91 - - -] Lock "compute_resources" acquired by "abort_instance_claim" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:444
2016-01-08 22:04:58.678 20474 INFO nova.scheduler.client.report [req-fb391c00-4f60-4ce2-872c-520e075fd9cb 2fe0c1376ce14bae97aaf496d63c1534 9e24833435f04f338e7b14df1e051e91 - - -] Compute_service record updated for ('osp7.sdiad.com', u'6b665b11-16ae-4c09-960b-191e06c0e801')
2016-01-08 22:04:58.678 20474 DEBUG oslo_concurrency.lockutils [req-fb391c00-4f60-4ce2-872c-520e075fd9cb 2fe0c1376ce14bae97aaf496d63c1534 9e24833435f04f338e7b14df1e051e91 - - -] Lock "compute_resources" released by "abort_instance_claim" :: held 0.014s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:456
2016-01-08 22:04:58.678 20474 DEBUG nova.compute.utils [req-fb391c00-4f60-4ce2-872c-520e075fd9cb 2fe0c1376ce14bae97aaf496d63c1534 9e24833435f04f338e7b14df1e051e91 - - -] [instance: a300cc91-99c1-4981-b0e9-408636aa1609] Failed to provision instance a300cc91-99c1-4981-b0e9-408636aa1609: Timeout reached while waiting for callback for node 6b665b11-16ae-4c09-960b-191e06c0e801 notify_about_instance_usage /usr/lib/python2.7/site-packages/nova/compute/utils.py:310
2016-01-08 22:04:58.679 20474 DEBUG nova.compute.manager [req-fb391c00-4f60-4ce2-872c-520e075fd9cb 2fe0c1376ce14bae97aaf496d63c1534 9e24833435f04f338e7b14df1e051e91 - - -] [instance: a300cc91-99c1-4981-b0e9-408636aa1609] Build of instance a300cc91-99c1-4981-b0e9-408636aa1609 was re-scheduled: Failed to provision instance a300cc91-99c1-4981-b0e9-408636aa1609: Timeout reached while waiting for callback for node 6b665b11-16ae-4c09-960b-191e06c0e801 _do_build_and_run_instance /usr/lib/python2.7/site-packages/nova/compute/manager.py:2275
2016-01-08 22:04:58.679 20474 DEBUG nova.compute.manager [req-fb391c00-4f60-4ce2-872c-520e075fd9cb 2fe0c1376ce14bae97aaf496d63c1534 9e24833435f04f338e7b14df1e051e91 - - -] [instance: a300cc91-99c1-4981-b0e9-408636aa1609] Deallocating network for instance _deallocate_network /usr/lib/python2.7/site-packages/nova/compute/manager.py:2124
  • From this we can determine:
nova list ID of the node in ERROR state       : a300cc91-99c1-4981-b0e9-408636aa1609
corresponding ironic node-list ID of the node : 6b665b11-16ae-4c09-960b-191e06c0e801
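The instance-to-node mapping above can be pulled out of nova-compute.log with a small helper. This is a sketch: it relies on the "Still waiting for ironic node <uuid>" message format shown in the excerpt above, and the log path in the example is the default location.

```shell
# Print the ironic node UUID(s) that nova-compute polled for one instance.
map_instance_to_node() {
    instance=$1
    logfile=$2
    grep "\[instance: ${instance}\]" "$logfile" \
        | grep -o 'ironic node [0-9a-f-]*' \
        | awk '{print $3}' | sort -u
}

# Example:
# map_instance_to_node a300cc91-99c1-4981-b0e9-408636aa1609 /var/log/nova/nova-compute.log
```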

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
