Deploy to add a new Compute node fails with a qemu-img error

Solution In Progress

Issue

  • We're running the openstack overcloud deploy command to add a new DPDK Compute node on an OSP 16.1 environment that was recently updated from OSP 13 to 16.1.

  • In recent weeks we have added other DPDK Compute nodes and this problem did not occur.

  • The openstack overcloud deploy fails with the following error:

2021-06-29 15:43:09Z [NodeDPDKv2]: UPDATE_FAILED  resources.NodeDPDKv2: Resource CREATE failed: ResourceInError: resources[3].resources.NodeDPDKv2: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 543a571e-989
2021-06-29 15:43:09Z [overcloud]: UPDATE_FAILED  Resource UPDATE failed: resources.NodeDPDKv2: Resource CREATE failed: ResourceInError: resources[3].resources.NodeDPDKv2: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures f

 Stack overcloud/b285c925-2835-4900-a707-bfed0f75a6c7 UPDATE_FAILED 

overcloud.NodeDPDKv2.3.NodeDPDKv2:
  resource_type: OS::TripleO::NodeDPDKv2Server
  physical_resource_id: 248e9cef-e3d7-4f1c-be31-ed7d80ee5932
  status: CREATE_FAILED
  status_reason: |
    ResourceInError: resources.NodeDPDKv2: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 248e9cef-e3d7-4f1c-be31-ed7d80ee5932., Code: 500"
  • At the same time, we observe that qemu-img exits with an error:
 [root@undercloud ~]# grep -i qemu-img /var/log/containers/nova/nova-conductor.log|tail -n1
2021-06-29 17:43:07.853 25 ERROR nova.scheduler.utils [req-29a93de0-f3e1-445b-8933-a87915242871 31183e37bac14be6b2d3016eebde8d33 53c994dbce6f44e9a6b142ce4e871f62 - default default] [instance: 248e9cef-e3d7-4f1c-be31-ed7d80ee5932] Error from last host: undercloud (node a95149fd-9d4b-4713-8904-c50329dc46df): ['Traceback (most recent call last):\n', '  File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 2435, in _build_and_run_instance\n    block_device_info=block_device_info)\n', '  File "/usr/lib/python3.6/site-packages/nova/virt/ironic/driver.py", line 1289, in spawn\n    \'node\': node_uuid})\n', '  File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__\n    self.force_reraise()\n', '  File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n    six.reraise(self.type_, self.value, self.tb)\n', '  File "/usr/lib/python3.6/site-packages/six.py", line 675, in reraise\n    raise value\n', '  File "/usr/lib/python3.6/site-packages/nova/virt/ironic/driver.py", line 1281, in spawn\n    timer.start(interval=CONF.ironic.api_retry_interval).wait()\n', '  File "/usr/lib/python3.6/site-packages/eventlet/event.py", line 125, in wait\n    result = hub.switch()\n', '  File "/usr/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 298, in switch\n    return self.greenlet.switch()\n', '  File "/usr/lib/python3.6/site-packages/oslo_service/loopingcall.py", line 150, in _run_loop\n    result = func(*self.args, **self.kw)\n', '  File "/usr/lib/python3.6/site-packages/nova/virt/ironic/driver.py", line 558, in _wait_for_active\n    raise exception.InstanceDeployFailure(msg)\n', "nova.exception.InstanceDeployFailure: Failed to provision instance 248e9cef-e3d7-4f1c-be31-ed7d80ee5932: Failed to deploy. 
Exception: Unexpected error while running command.\nCommand: /usr/bin/python3 -m oslo_concurrency.prlimit --as=1073741824 -- qemu-img convert -O raw /var/lib/ironic/master_images/tmp5s2e5r6l/c01f7103-c6b0-4f04-969d-2dbc1b6c3295.part /var/lib/ironic/master_images/tmp5s2e5r6l/c01f7103-c6b0-4f04-969d-2dbc1b6c3295.converted\nExit code: -6\nStdout: ''\nStderr: 'qemu: qemu_thread_create: Resource temporarily unavailable\\n'\n", '\nDuring handling of the above exception, another exception occurred:\n\n', 'Traceback (most recent call last):\n', '  File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 2161, in _do_build_and_run_instance\n    filter_properties, request_spec)\n', '  File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 2535, in _build_and_run_instance\n    instance_uuid=instance.uuid, reason=six.text_type(e))\n', "nova.exception.RescheduledException: Build of instance 248e9cef-e3d7-4f1c-be31-ed7d80ee5932 was re-scheduled: Failed to provision instance 248e9cef-e3d7-4f1c-be31-ed7d80ee5932: Failed to deploy. Exception: Unexpected error while running command.\nCommand: /usr/bin/python3 -m oslo_concurrency.prlimit --as=1073741824 -- qemu-img convert -O raw /var/lib/ironic/master_images/tmp5s2e5r6l/c01f7103-c6b0-4f04-969d-2dbc1b6c3295.part /var/lib/ironic/master_images/tmp5s2e5r6l/c01f7103-c6b0-4f04-969d-2dbc1b6c3295.converted\nExit code: -6\nStdout: ''\nStderr: 'qemu: qemu_thread_create: Resource temporarily unavailable\\n'\n"]
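The `Exit code: -6` reported above does not mean qemu-img returned 6; a negative return code is the Python subprocess convention (which oslo.concurrency's processutils inherits) for a child killed by a signal, so -6 is SIGABRT, consistent with the systemd-coredump entry below. A minimal sketch of that convention:

```python
import signal
import subprocess

# A child killed by a signal is reported with a negative returncode equal
# to minus the signal number. Here we abort a child on purpose and check
# that -6 really corresponds to SIGABRT.
proc = subprocess.run(
    ["python3", "-c", "import os, signal; os.kill(os.getpid(), signal.SIGABRT)"]
)
print(proc.returncode)      # -6 on Linux
print(int(signal.SIGABRT))  # 6
```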
  • We also observe in /var/log/messages:
 [root@undercloud ~]# grep qemu-img /var/log/messages|tail -n1
Jun 29 17:42:54 undercloud systemd-coredump[58577]: Process 58554 (qemu-img) of user 42422 dumped core.#012#012Stack trace of thread .. #(output truncated because too long)
  • The director is not under memory pressure during the openstack overcloud deploy (the attached file free.txt logs the output of 'free -h' during the deploy).
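Note that host-level memory pressure is not the only way qemu_thread_create can fail with EAGAIN ("Resource temporarily unavailable"): the traceback shows oslo_concurrency.prlimit re-executing qemu-img with a hard 1 GiB address-space cap (--as=1073741824), so the child can be unable to map a new thread stack inside that cap no matter what `free -h` reports on the host. A quick sketch of the cap as the child sees it, assuming the util-linux prlimit tool is available (`ulimit -v` reports the address-space limit in KiB, so 1 GiB shows up as 1048576):

```shell
# Re-create the exact cap oslo_concurrency.prlimit applies to qemu-img
# (1073741824 bytes = 1 GiB) and print it from inside the child process.
prlimit --as=1073741824 -- sh -c 'ulimit -v'
```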

  • The problem seems similar to RHBZ #1892773, but we cannot find the string "Stderr: 'Failed to allocate memory: Cannot allocate memory\n'" in our logs.

Environment

  • Red Hat OpenStack Platform 16.1 (RHOSP)
