Deploy to add a new Compute node fails with a qemu-img error
Issue
- We're running the openstack overcloud deploy command to add a new DPDK Compute node on an OSP 16.1 environment that was recently upgraded from OSP 13.
- In the past weeks we added other DPDK Compute nodes and the following problem did not occur.
- The openstack overcloud deploy fails with the following error:
2021-06-29 15:43:09Z [NodeDPDKv2]: UPDATE_FAILED resources.NodeDPDKv2: Resource CREATE failed: ResourceInError: resources[3].resources.NodeDPDKv2: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 543a571e-989
2021-06-29 15:43:09Z [overcloud]: UPDATE_FAILED Resource UPDATE failed: resources.NodeDPDKv2: Resource CREATE failed: ResourceInError: resources[3].resources.NodeDPDKv2: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures f
Stack overcloud/b285c925-2835-4900-a707-bfed0f75a6c7 UPDATE_FAILED
overcloud.NodeDPDKv2.3.NodeDPDKv2:
resource_type: OS::TripleO::NodeDPDKv2Server
physical_resource_id: 248e9cef-e3d7-4f1c-be31-ed7d80ee5932
status: CREATE_FAILED
status_reason: |
ResourceInError: resources.NodeDPDKv2: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 248e9cef-e3d7-4f1c-be31-ed7d80ee5932., Code: 500"
- At the same time I observe that qemu-img exits with errors:
[root@undercloud ~]# grep -i qemu-img /var/log/containers/nova/nova-conductor.log|tail -n1
2021-06-29 17:43:07.853 25 ERROR nova.scheduler.utils [req-29a93de0-f3e1-445b-8933-a87915242871 31183e37bac14be6b2d3016eebde8d33 53c994dbce6f44e9a6b142ce4e871f62 - default default] [instance: 248e9cef-e3d7-4f1c-be31-ed7d80ee5932] Error from last host: undercloud (node a95149fd-9d4b-4713-8904-c50329dc46df): ['Traceback (most recent call last):\n', ' File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 2435, in _build_and_run_instance\n block_device_info=block_device_info)\n', ' File "/usr/lib/python3.6/site-packages/nova/virt/ironic/driver.py", line 1289, in spawn\n \'node\': node_uuid})\n', ' File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__\n self.force_reraise()\n', ' File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n six.reraise(self.type_, self.value, self.tb)\n', ' File "/usr/lib/python3.6/site-packages/six.py", line 675, in reraise\n raise value\n', ' File "/usr/lib/python3.6/site-packages/nova/virt/ironic/driver.py", line 1281, in spawn\n timer.start(interval=CONF.ironic.api_retry_interval).wait()\n', ' File "/usr/lib/python3.6/site-packages/eventlet/event.py", line 125, in wait\n result = hub.switch()\n', ' File "/usr/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 298, in switch\n return self.greenlet.switch()\n', ' File "/usr/lib/python3.6/site-packages/oslo_service/loopingcall.py", line 150, in _run_loop\n result = func(*self.args, **self.kw)\n', ' File "/usr/lib/python3.6/site-packages/nova/virt/ironic/driver.py", line 558, in _wait_for_active\n raise exception.InstanceDeployFailure(msg)\n', "nova.exception.InstanceDeployFailure: Failed to provision instance 248e9cef-e3d7-4f1c-be31-ed7d80ee5932: Failed to deploy. 
Exception: Unexpected error while running command.\nCommand: /usr/bin/python3 -m oslo_concurrency.prlimit --as=1073741824 -- qemu-img convert -O raw /var/lib/ironic/master_images/tmp5s2e5r6l/c01f7103-c6b0-4f04-969d-2dbc1b6c3295.part /var/lib/ironic/master_images/tmp5s2e5r6l/c01f7103-c6b0-4f04-969d-2dbc1b6c3295.converted\nExit code: -6\nStdout: ''\nStderr: 'qemu: qemu_thread_create: Resource temporarily unavailable\\n'\n", '\nDuring handling of the above exception, another exception occurred:\n\n', 'Traceback (most recent call last):\n', ' File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 2161, in _do_build_and_run_instance\n filter_properties, request_spec)\n', ' File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 2535, in _build_and_run_instance\n instance_uuid=instance.uuid, reason=six.text_type(e))\n', "nova.exception.RescheduledException: Build of instance 248e9cef-e3d7-4f1c-be31-ed7d80ee5932 was re-scheduled: Failed to provision instance 248e9cef-e3d7-4f1c-be31-ed7d80ee5932: Failed to deploy. Exception: Unexpected error while running command.\nCommand: /usr/bin/python3 -m oslo_concurrency.prlimit --as=1073741824 -- qemu-img convert -O raw /var/lib/ironic/master_images/tmp5s2e5r6l/c01f7103-c6b0-4f04-969d-2dbc1b6c3295.part /var/lib/ironic/master_images/tmp5s2e5r6l/c01f7103-c6b0-4f04-969d-2dbc1b6c3295.converted\nExit code: -6\nStdout: ''\nStderr: 'qemu: qemu_thread_create: Resource temporarily unavailable\\n'\n"]
- We also observe the following in /var/log/messages:
[root@undercloud ~]# grep qemu-img /var/log/messages|tail -n1
Jun 29 17:42:54 undercloud systemd-coredump[58577]: Process 58554 (qemu-img) of user 42422 dumped core.#012#012Stack trace of thread .. #(output truncated because too long)
- The director is not under memory pressure during the openstack overcloud deploy (the attached file free.txt contains the output of 'free -h' logged during the deploy).
- The problem seems similar to RHBZ #1892773, but I can't find the string "Stderr: 'Failed to allocate memory: Cannot allocate memory\n'" in the logs. The logs show "Stderr: 'qemu: qemu_thread_create: Resource temporarily unavailable\n'" instead.
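As a reading aid for the qemu-img command logged above, the following minimal sketch decodes two of its values: "Exit code: -6" and "--as=1073741824". It assumes the exit code is a Python subprocess return code, where a negative value means the process was terminated by that signal number, and that the --as value is an address-space limit in bytes. The helper names describe_exit_code and bytes_to_gib are hypothetical and used only for illustration. This does not establish the root cause; it only shows that the exit code is consistent with the core dump reported in /var/log/messages.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Describe a subprocess-style return code.

    Assumption: a negative code means "terminated by signal -code",
    which is how Python's subprocess module reports it on POSIX.
    """
    if code < 0:
        signum = -code
        return f"terminated by signal {signum} ({signal.Signals(signum).name})"
    return f"exited with status {code}"


def bytes_to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB (2**30 bytes)."""
    return num_bytes / 2**30


if __name__ == "__main__":
    # Values copied from the logged qemu-img command.
    print(describe_exit_code(-6))
    print(f"--as limit: {bytes_to_gib(1073741824)} GiB")
```

On Linux, signal 6 is SIGABRT, whose default action is to terminate the process and dump core. The address-space limit passed to qemu-img convert works out to 1 GiB.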
Environment
- Red Hat OpenStack Platform 16.1 (RHOSP)