OpenStack baremetal introspection bulk start fails - Red Hat OpenStack Director

I am trying to install director and overcloud.
Followed steps from : https://keithtenzer.com/2015/10/14/howto-openstack-deployment-using-tripleo-and-the-red-hat-openstack-director/comment-page-1/#comment-1335
In this environment we have used KVM hypervisor host , the undercloud (single VM) and overcloud (1 compute VM , 1 controller VM). . The KVM hypervisor host is on the 192.168.122.0/24 network and has IP of 192.168.122.136. The undercloud runs on a single VM on the 192.168.122.0/24 management network and 192.168.126.0/24 (provisioning) netowrk. The undercloud has an IP address of 192.168.122.90 (eth0). The overcloud is on the 192.168.126.0/24 (provisioning) and 192.168.125.0/24 (external) network.

I am facing an issue during introspection (command: openstack baremetal introspection bulk start). The overcloud VMs are getting started but are unable to configure a network interface. Eventually introspection fails. Can anyone tell me what might have gone wrong?

Responses

What kind of output are you getting from the VMs when they boot?

configuring (net0 52:54:00:30:d2:a6)........... error 0x040ee119
no more network devices
no bootable devices

Thanks, Dinesh! So it looks like it's not resolving DHCP/booting from PXE. You mentioned that the Undercloud has an IP on the management network, but does it have an IP on the Provisioning network?

Also can you post the IP config settings from the undercloud.conf? Specifically the following:

  • local_ip
  • network_gateway
  • network_cidr
  • masquerade_network
  • dhcp_start
  • dhcp_end
  • inspection_iprange

Hi Daniel, yes the undercloud VM has an IP (192.0.2.1) on the provisioning network. I have changed the settings to the default values now. Below are the settings in undercloud.conf:

  • local_ip = 192.0.2.1/24
  • network_gateway = 192.0.2.5
  • undercloud_public_vip = 192.0.2.2
  • undercloud_admin_vip = 192.0.2.3
  • local_interface = ens9
  • network_cidr = 192.0.2.0/24
  • masquerade_network = 192.0.2.0/24
  • dhcp_start = 192.0.2.5
  • dhcp_end = 192.0.2.24
  • inspection_interface = br-ctlplane
  • inspection_iprange = 192.0.2.100,192.0.2.120

It looks like you need to change the default values if you're using 192.168.126.0/24 as your provisioning network.
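
For example (just a sketch; adjust the addresses to whatever is actually free in your setup), the relevant undercloud.conf values for a 192.168.126.0/24 provisioning network would look something like:

local_ip = 192.168.126.1/24
network_gateway = 192.168.126.1
network_cidr = 192.168.126.0/24
masquerade_network = 192.168.126.0/24
dhcp_start = 192.168.126.100
dhcp_end = 192.168.126.150
inspection_iprange = 192.168.126.200,192.168.126.220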

Hi Daniel, now I am using 192.0.2.0/24 as the provisioning network. New environment: the KVM hypervisor host is on the 172.16.73.0/24 network and has an IP of 172.16.73.136. The undercloud runs on a single VM on the 172.16.73.0/24 management network and the 192.0.2.0/24 provisioning network. The undercloud has an IP address of 172.16.73.146 (eth0). The overcloud is on the 192.0.2.0/24 (provisioning) and 192.0.3.0/24 (external) networks.

Thanks!

Cool. Let me know how the introspection works out with the new environment and if I can help further.

The issue is still the same.

Can you run the following command on your KVM host and post the results?

# virsh net-dumpxml [name-of-provisioning-network]

(Replace [name-of-provisioning-network] with the provisioning network name obviously)
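
If you're not sure of the network name, something like this should list the networks defined on the KVM host so you can pick out the provisioning one:

# virsh net-list --all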

(network connections='1')
  (name)provisioning(/name)
  (uuid)b2f42abd-86cb-4088-b3b0-f283c3b03769(/uuid)
  (bridge name='virbr2' stp='on' delay='0'/)
  (mac address='52:54:00:cb:5a:f5'/)
  (ip address='192.0.2.254' netmask='255.255.255.0')
  (/ip)
(/network)

Unable to post in xml format so replaced '<>' with '()'

Anything else? There should be an xml description of the network.

EDIT: Ah, never mind. I see what happened.

Hi Daniel, the overcloud VMs are getting an IP now. I configured the second NIC on the host to the provisioning network. Below is the current output:

net0: 192.0.2.101/255.255.255.0 gw 192.0.2.1
net0: fe80::5054:ff:fe3a:e797/64
net1: fe80::5054:ff:fef1:8d87/64 (inaccessible)
next server: 192.0.2.1
Filename: http://192.0.2.1:8088/inspector.ipxe
http://192.0.2.1:8088/inspector.ipxe.... ok
could not boot image: Exec format error (http://ipxe.org/2e008001)
No more network devices
No bootable devices

Okay, that's a step in the right direction. How does your /httpboot/inspector.ipxe look? Can you post that here too?

/httpboot/inspector.ipxe is empty and /httpboot/boot.ipxe as well.

Okay, the inspector.ipxe file should have an ipxe configuration. This file gets created during the "openstack undercloud install" phase.

My advice is to rerun "openstack undercloud install" on the director VM to repopulate the content of this file.
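
Roughly, as the stack user on the director VM (the paths are the defaults, adjust if yours differ):

$ openstack undercloud install
$ cat /httpboot/inspector.ipxe
$ cat /httpboot/boot.ipxe

Both files should contain an iPXE script afterwards.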

I have rerun openstack undercloud install. Content in inspector.ipxe:

#!ipxe

:retry_dhcp
dhcp || goto retry_dhcp

:retry_boot
imgfree
kernel --timeout 60000 http://192.0.2.1:8088/agent.kernel ipa-inspection-callback-url=http://192.0.2.1:5050/v1/continue ipa-inspection-collectors=default,extra-hardware,logs systemd.journald.forward_to_console=yes BOOTIF=${net0/mac} ipa-debug=1 initrd=agent.ramdisk || goto retry_boot
initrd --timeout 60000 http://192.0.2.1:8088/agent.ramdisk || goto retry_boot
boot

Introspection is still failing. I checked the VM console during bootup. It gets inspector.ipxe, the kernel, and the ramdisk, but it fails to execute /init and execution hangs after that.

So on the VM console, it doesn't even get to the login prompt?

No it doesn't get to login prompt.

Sounds like it's having trouble loading agent.ramdisk. What version of the introspection images are you using?

$ yum list rhosp-director-images-ipa

And did you load the images into glance without error as per: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html/director_installation_and_usage/chap-installing_the_undercloud#sect-Obtaining_Images_for_Overcloud_Nodes ?
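
For reference, the upload step in that doc is roughly the following, assuming the images were extracted into /home/stack/images:

$ source ~/stackrc
$ openstack overcloud image upload --image-path /home/stack/images/
$ glance image-list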

# yum list rhosp-director-images-ipa
Loaded plugins: langpacks, product-id, search-disabled-repos, subscription-manager
Repo rhel-7-server-extras-rpms forced skip_if_unavailable=True due to: /etc/pki/entitlement/3007115437606610065-key.pem
Repo rhel-7-server-openstack-8-director-rpms forced skip_if_unavailable=True due to: /etc/pki/entitlement/3007115437606610065-key.pem
Repo rhel-7-server-rh-common-rpms forced skip_if_unavailable=True due to: /etc/pki/entitlement/3007115437606610065-key.pem
Repo rhel-7-server-rpms forced skip_if_unavailable=True due to: /etc/pki/entitlement/3007115437606610065-key.pem
Repo rhel-7-server-openstack-8-rpms forced skip_if_unavailable=True due to: /etc/pki/entitlement/3007115437606610065-key.pem
Installed Packages
rhosp-director-images-ipa.noarch 8.0-20161202.1.el7ost @rhel-7-server-openstack-8-director-rpms

I loaded the images into glance successfully.

# glance image-list
+--------------------------------------+------------------------+-------------+------------------+------------+--------+
| ID                                   | Name                   | Disk Format | Container Format | Size       | Status |
+--------------------------------------+------------------------+-------------+------------------+------------+--------+
| 6ed0d52e-1793-46bf-ae83-be1ee8f8f2db | bm-deploy-kernel       | aki         | aki              | 5390944    | active |
| 4f3108ba-c065-49c2-abe7-77527a21be19 | bm-deploy-ramdisk      | ari         | ari              | 449025575  | active |
| 131ee5f5-8e11-429b-a62c-8d35154bcb59 | cirros                 | qcow2       | bare             | 13287936   | active |
| 2a3db6e5-da35-4388-897c-75444a636173 | overcloud-full         | qcow2       | bare             | 1122824192 | active |
| e8ff0ff9-c29f-4e27-b39e-40b2f0b7f68a | overcloud-full-initrd  | ari         | ari              | 44509689   | active |
| 899a5a42-dbc1-446a-89d3-35dca1d0708e | overcloud-full-vmlinuz | aki         | aki              | 5390944    | active |
+--------------------------------------+------------------------+-------------+------------------+------------+--------+

And both /httpboot/agent.kernel and /httpboot/agent.ramdisk exist? And they have normal sizes?

$ ls /httpboot -l

$ ls /httpboot -l
total 443784
-rwxrwxrwx. 1 ironic ironic 5390944 Feb 9 06:03 agent.kernel
-rwxrwxrwx. 1 ironic ironic 449025575 Feb 9 06:04 agent.ramdisk
-rw-r--r-- 1 ironic ironic 770 Feb 21 02:12 boot.ipxe
-rw-r--r-- 1 ironic ironic 433 Feb 21 02:12 inspector.ipxe
drwxr-xr-x 2 ironic ironic 6 Feb 17 07:12 pxelinux.cfg

That all looks normal.

I'm a little stumped as to why it's not loading the inspection agent properly.

How do we find out whether agent.ramdisk is loaded correctly? I see it getting loaded to 100% and then it starts unpacking the initramfs.

Sorry, what I meant was it sounds like it's loading the agent image fine, but it's not kicking off the actual inspection process, which is the weird part.

Yes... Any advice on how to proceed further?

$ ironic node-validate 87a0d844-69dd-47a5-9783-cf4e5bbb06ca
+------------+--------+--------------------------------------------------------------------------------------------------------------------------------------------+
| Interface  | Result | Reason                                                                                                                                     |
+------------+--------+--------------------------------------------------------------------------------------------------------------------------------------------+
| boot       | False  | Cannot validate PXE bootloader. Some parameters were missing in node's instance_info.. Missing are: ['ramdisk', 'kernel', 'image_source'] |
| console    | None   | not supported                                                                                                                              |
| deploy     | False  | Cannot validate PXE bootloader. Some parameters were missing in node's instance_info.. Missing are: ['ramdisk', 'kernel', 'image_source'] |
| inspect    | True   |                                                                                                                                            |
| management | True   |                                                                                                                                            |
| power      | True   |                                                                                                                                            |
| raid       | None   | not supported                                                                                                                              |
+------------+--------+--------------------------------------------------------------------------------------------------------------------------------------------+

Does this have something to do with the issue?

The image sources are available in driver_info, though.

$ ironic node-show 87a0d844-69dd-47a5-9783-cf4e5bbb06ca
+------------------------+-----------------------------------------------------------------------+
| Property | Value | +------------------------+-----------------------------------------------------------------------+
| target_power_state | None |
| extra | {} |
| last_error | None |
| updated_at | 2017-02-21T08:49:15+00:00 |
| maintenance_reason | None |
| provision_state | available |
| clean_step | {} |
| uuid | 87a0d844-69dd-47a5-9783-cf4e5bbb06ca |
| console_enabled | False |
| target_provision_state | None |
| provision_updated_at | None |
| maintenance | False |
| inspection_started_at | None |
| inspection_finished_at | None |
| power_state | power off |
| driver | pxe_ssh |
| reservation | None |
| properties | {u'memory_mb': u'2048', u'cpu_arch': u'x86_64', u'local_gb': u'60', |
| | u'cpus': u'2', u'capabilities': u'boot_option:local'} |
| instance_uuid | None |
| name | None |
| driver_info | {u'ssh_username': u'ubuntu', u'deploy_kernel': u'6ed0d52e-1793-46bf- |
| | ae83-be1ee8f8f2db', u'deploy_ramdisk': |
| | u'4f3108ba-c065-49c2-abe7-77527a21be19', u'ssh_key_contents': u'----- |
| | BEGIN RSA PRIVATE KEY-----
| | -----END RSA PRIVATE KEY-----', u'ssh_virt_type': |
| | u'virsh', u'ssh_address': u'172.16.73.136'} |
| created_at | 2017-02-21T08:48:57+00:00 |
| driver_internal_info | {} |
| chassis_uuid | |
| instance_info | {} |
+------------------------+-----------------------------------------------------------------------+

Hi Daniel, the introspection issue is solved. The RAM was not sufficient; I had to increase it for the overcloud VMs. But while deploying the overcloud I again encountered the DHCP issue I was getting before. My question is: if introspection was successful and the nodes were getting an IP from DHCP, shouldn't it also work during the overcloud deployment? Is there anything else I am missing?

Glad to hear the introspection worked!

As for the overcloud deployment, it should still use the provisioning network for DHCP, but the mechanism changes slightly between introspection and provisioning. Here's why:

The introspection process uses ironic-inspector, which sets a dynamic DHCP range using dnsmasq.

The provisioning process uses neutron, which also sets a DHCP range using dnsmasq. However, neutron/dnsmasq maps each node's MAC address to an IP address. This way each node has a permanent IP assignment.

This is also the reason why you have to specify two different DHCP ranges:

  • dhcp_start, dhcp_end - sets the provisioning network DHCP range in the director's neutron
  • inspection_iprange - sets the DHCP range in ironic-inspector's dnsmasq.conf file

So the first thing to do is check that you have a valid DHCP range for dhcp_start and dhcp_end in your undercloud.conf file. Also check that these settings carried over to neutron during the "openstack undercloud install" phase. The following command should show the allocation pool for all your existing subnets:

$ neutron subnet-list

Also, when running a deployment, check if dnsmasq is running:

$ ps -aux | grep dnsmasq

There might be two dnsmasq commands: one that uses /etc/ironic-inspector/dnsmasq.conf for its config (ignore this one, it's the ironic-inspector one), and a larger command that uses configs from /var/lib/neutron/dhcp/. Make sure that second one is running during your provisioning process. If not, you might have to check whether there is a problem with neutron on the undercloud.
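
As a rough illustration (the exact arguments will differ on your system), the two processes look something like this:

# ps -ef | grep dnsmasq
dnsmasq --conf-file=/etc/ironic-inspector/dnsmasq.conf ...                    (ironic-inspector, introspection only)
dnsmasq ... --dhcp-hostsfile=/var/lib/neutron/dhcp/[network-uuid]/host ...    (neutron, provisioning)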

Also check that /httpboot/boot.ipxe has a PXE config in it. I remember you had problems with inspector.ipxe being empty before, so it might be a good idea to check this one as well.

Thanks Daniel !
The issue was because boot.ipxe was empty.
Running undercloud installation again fixed it.
Now the overcloud nodes are booting up and I am getting the login prompt too, but after some time it fails with the error "[Controller]: CREATE_FAILED ResourceInError: resources.Controller: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500""

You might need to check how your nodes are tagged:

$ openstack overcloud profiles list

Is one tagged as control and one tagged as compute?
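
If you need to (re)tag them, it's done through each node's capabilities, something along these lines (the UUIDs here are just placeholders):

$ ironic node-update [control-node-UUID] replace properties/capabilities=profile:control,boot_option:local
$ ironic node-update [compute-node-UUID] replace properties/capabilities=profile:compute,boot_option:local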

I tagged one node as control and the other as compute as you advised and tried again.
The controller deployed successfully after retrying 2 times, but the compute node failed with the same error "No valid host was found. There are not enough hosts available., Code: 500".

What are the current specs for the nodes? RAM, CPU, and disk?

I ask because the only other thing I think it could be is that the nodes need specs that match or exceed the specs defined for the flavor corresponding to each tag. Each flavor uses a default of 4096MB RAM, 40GB disk, and 1 CPU. If a VM's specs are less than that, the director ignores the node even if it is tagged.

So if the specs are lower than the flavor specs, you might have to:

  • Bump up the specs for each node
  • Reduce the specs for the compute and control flavors (not recommended as it can lead to performance issues)
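
To compare against your nodes, you can pull the flavor specs on the director, assuming the default control and compute flavor names:

$ openstack flavor show control -c ram -c vcpus -c disk
$ openstack flavor show compute -c ram -c vcpus -c disk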

Both VMs have 4096MB RAM, a 60GB disk, and 2 CPUs. Do I have to increase the RAM?

That should be enough. What does ironic list as the specs for each of the nodes:

$ ironic node-list
$ ironic node-show [UUID of node]

Copy and paste the value in the properties field for both nodes.

$ ironic node-show eb162824-9d40-41e7-a26f-42bfb22be132
+------------------------+--------------------------------------------------------------------------+
| Property               | Value                                                                    |
+------------------------+--------------------------------------------------------------------------+
| target_power_state     | None                                                                     |
| extra                  | {u'hardware_swift_object': u'extra_hardware-eb162824-9d40-41e7-a26f-42bfb22be132'} |
| last_error             | None                                                                     |
| updated_at             | 2017-02-24T05:28:15+00:00                                                |
| maintenance_reason     | None                                                                     |
| provision_state        | wait call-back                                                           |
| clean_step             | {}                                                                       |
| uuid                   | eb162824-9d40-41e7-a26f-42bfb22be132                                     |
| console_enabled        | False                                                                    |
| target_provision_state | active                                                                   |
| provision_updated_at   | 2017-02-24T05:28:14+00:00                                                |
| maintenance            | False                                                                    |
| inspection_started_at  | None                                                                     |
| inspection_finished_at | None                                                                     |
| power_state            | power on                                                                 |
| driver                 | pxe_ssh                                                                  |
| reservation            | None                                                                     |
| properties             | {u'memory_mb': u'4096', u'cpu_arch': u'x86_64', u'local_gb': u'59', u'cpus': u'2', u'capabilities': u'profile:control,boot_option:local'} |
| instance_uuid          | 879216da-b585-4393-b4ea-a498764eb015                                     |
| name                   | None                                                                     |
| driver_info            | {u'ssh_username': u'ubuntu', u'deploy_kernel': u'ee6b4534-19e3-497a-80af-36d490974edc', u'deploy_ramdisk': u'5a2f3f48-cc8e-46a6-b7b9-554a2893e435', u'ssh_key_contents': u'-----BEGIN RSA PRIVATE K END RSA PRIVATE KEY-----', u'ssh_virt_type': u'virsh', u'ssh_address': u'172.16.73.136'} |
| created_at             | 2017-02-23T12:19:36+00:00                                                |
| driver_internal_info   | {u'agent_url': u'http://192.0.2.12:9999', u'root_uuid_or_disk_id': u'41655cab-3d53-44fc-962c-3a887594fff5', u'is_whole_disk_image': False, u'agent_last_heartbeat': 1487913810} |
| chassis_uuid           |                                                                          |
| instance_info          | {u'ramdisk': u'3cbd4870-9300-4eb2-8ebd-0f2edfc70fbd', u'kernel': u'1d35562d-23eb-4214-9679-26f18f512960', u'root_gb': u'58', u'display_name': u'overcloud-compute-0', u'image_source': u'a6c3af7f-e27e-4125-ac89-aa064d4b6c88', u'local_gb': u'59', u'capabilities': u'{"boot_option": "local"}', u'memory_mb': u'4096', u'vcpus': u'2', u'deploy_key': u'U41V6ZIR77CVQBL6CH6Q620GTFVDNRZL', u'configdrive': u'H4sICFXErAAAAD4Bvj/pKk0OwBABwA=', u'swap_mb': u'0'} |
+------------------------+--------------------------------------------------------------------------+

$ ironic node-show 37b2ce1f-c467-4033-8b9a-91beac7462ca
+------------------------+--------------------------------------------------------------------------+
| Property               | Value                                                                    |
+------------------------+--------------------------------------------------------------------------+
| target_power_state     | None                                                                     |
| extra                  | {u'hardware_swift_object': u'extra_hardware-37b2ce1f-c467-4033-8b9a-91beac7462ca'} |
| last_error             | None                                                                     |
| updated_at             | 2017-02-24T05:28:15+00:00                                                |
| maintenance_reason     | None                                                                     |
| provision_state        | wait call-back                                                           |
| clean_step             | {}                                                                       |
| uuid                   | 37b2ce1f-c467-4033-8b9a-91beac7462ca                                     |
| console_enabled        | False                                                                    |
| target_provision_state | active                                                                   |
| provision_updated_at   | 2017-02-24T05:28:14+00:00                                                |
| maintenance            | False                                                                    |
| inspection_started_at  | None                                                                     |
| inspection_finished_at | None                                                                     |
| power_state            | power on                                                                 |
| driver                 | pxe_ssh                                                                  |
| reservation            | None                                                                     |
| properties             | {u'memory_mb': u'4096', u'cpu_arch': u'x86_64', u'local_gb': u'59', u'cpus': u'2', u'capabilities': u'profile:compute,boot_option:local'} |
| instance_uuid          | f4864c71-baa6-4744-b4bb-8430fcb5f0fd                                     |
| name                   | None                                                                     |
| driver_info            | {u'ssh_username': u'ubuntu', u'deploy_kernel': u'ee6b4534-19e3-497a-80af-36d490974edc', u'deploy_ramdisk': u'5a2f3f48-cc8e-46a6-b7b9-554a2893e435', u'ssh_key_contents': u'-----BEGIN RSA PRIVATE KEY----- END RSA PRIVATE KEY-----', u'ssh_virt_type': u'virsh', u'ssh_address': u'172.16.73.136'} |
| created_at             | 2017-02-23T12:19:37+00:00                                                |
| driver_internal_info   | {u'agent_url': u'http://192.0.2.11:9999', u'is_whole_disk_image': False, u'agent_last_heartbeat': 1487913816} |
| chassis_uuid           |                                                                          |
| instance_info          | {u'ramdisk': u'3cbd4870-9300-4eb2-8ebd-0f2edfc70fbd', u'kernel': u'1d35562d-23eb-4214-9679-26f18f512960', u'root_gb': u'58', u'display_name': u'overcloud-controller-0', u'image_source': u'a6c3af7f-e27e-4125-ac89-aa064d4b6c88', u'local_gb': u'59', u'capabilities': u'{"boot_option": "local"}', u'memory_mb': u'4096', u'vcpus': u'2', u'deploy_key': u'6WZWRXUOMXWN6VZJS5Z8775YPNYD12CJ', u'configdrive': u'H4sICFXEr1AAAAAAAMA3wP8H2E49mABABwA=', u'swap_mb': u'0'} |
+------------------------+--------------------------------------------------------------------------+

Okay, they both look fine to me.

When doing a deploy, do both the nodes at least boot up and start the pxe boot?

Sometimes on lower specs, there can be a race condition where a node starts before the pxe content is ready on the director.

Yes, during the deploy the nodes boot up. I also suspect that a node starts before the PXE content is ready. Sometimes they start the PXE boot and sometimes they fail to start it. Sometimes one node starts the PXE boot and one node fails. It is so random.

Yep, sounds like the race condition.

So if a node fails to pxe boot, try manually rebooting it and see if it picks up the pxe boot after the reboot.
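
Since these are VMs on the pxe_ssh driver, the reboot can go through ironic so the power state stays in sync (the UUID here is a placeholder), or you can simply virsh reboot the domain on the KVM host:

$ ironic node-set-power-state [node-UUID] reboot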

Okay Daniel, I will try to reboot manually in such conditions. One thing I missed in the environment description: the director is running on a VM with 8GB RAM. Could this be the reason for these conditions?

Yep, memory and CPU usage are the main factors here. The docs list 8 cores and 16GB RAM minimum for production environments, and for good reason.

I used to test on a very low spec machine (8GB, low spec CPU) and would get race conditions for this particular issue. I'm now testing POCs with 16GB and a decent 4-core CPU and haven't experienced any race conditions. Plus I think they refined the code since OSPd 8 to avoid these types of race conditions.

How did the reboot go? Are both nodes now PXE booting and provisioning?

Thanks Daniel !
Yes, the reboot worked. The PXE boot started and the provisioning state of the nodes changed to deploying. I will try to get 16GB RAM for the director; right now I have limited resources.

Awesome! Glad to hear it's working for you!

Okay, let's have a look at the DHCP config. Can you run the following on the undercloud host:

# grep dhcp-range /etc/ironic-inspector/dnsmasq.conf

Post back the introspection range. Hopefully, it should be the default (192.0.2.100,192.0.2.120) as per the undercloud.conf.
