The Nova libvirt container freezes/gets stuck - host reboot required to recover.

Solution In Progress

Issue

  • We experienced two failures on two different hosts. In both cases, nova_libvirt was unresponsive and any "virsh" command timed out, as did any operation through the CLI.

1) The nova_libvirt container was still reported as active.
2) We could not execute commands in the container: "sudo docker exec nova_libvirt virsh list --all" (this usually works).
3) We could not get a shell in the container: "sudo docker exec -it nova_libvirt bash".
4) We could not restart the container: "sudo docker restart nova_libvirt" - it hung until cancelled.
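When every docker/virsh probe hangs like this, it helps to bound each command with a time cap so the diagnosing shell does not get stuck as well. A minimal sketch using GNU `timeout` (the 2-second cap and the example commands in the comments are illustrative, not from the incident):

```shell
#!/bin/sh
# Wrap each diagnostic command in a time cap. GNU timeout exits
# with status 124 when it has to kill a command for running too long.
probe() {
    timeout 2 "$@"
    rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "TIMED OUT after 2s: $*"
    fi
    return "$rc"
}

# Demo with a command guaranteed to hang past the cap:
probe sleep 10

# In an incident like the one above, the same wrapper would bound
# the probes instead of letting them hang indefinitely, e.g.:
#   probe sudo docker exec nova_libvirt virsh list --all
#   probe sudo docker restart nova_libvirt
```

Bounding the probes also gives a crude signal of which layer is wedged: if even `docker inspect` times out, the docker daemon itself is blocked, not just libvirtd inside the container.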

  • The following errors are seen in the docker logs:
[stack@overcloud-compute-035 ~]$ sudo docker logs nova_libvirt
2020-06-02 18:27:51.205+0000: 100337: error : virFileReadAll:1460 : Failed to open file '/sys/class/net/vethefdaeea/operstate': No such file or directory
2020-06-02 18:27:51.205+0000: 100337: error : virNetDevGetLinkInfo:2552 : unable to read: /sys/class/net/vethefdaeea/operstate: No such file or directory
2020-06-16 15:55:40.570+0000: 100216: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
2020-06-16 15:55:56.630+0000: 100216: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
2020-06-16 16:01:20.679+0000: 100216: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
2020-06-16 16:13:05.858+0000: 100216: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
2020-06-16 16:27:50.449+0000: 100216: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
2020-06-16 16:30:06.327+0000: 100216: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
2020-06-16 16:30:37.142+0000: 100216: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
2020-06-16 16:40:01.636+0000: 100216: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
2020-06-16 16:40:08.673+0000: 100216: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
  • The following error is seen in /var/log/containers/nova/nova-compute.log:
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager [req-a649dd94-ad44-4ad4-ba86-063641a62066 - - - - -] Error updating resources for node overcloud-compute-040.localhost.: libvirtError: Cannot recv data: Connection reset by peer
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager Traceback (most recent call last):
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 7386, in update_available_resource_for_node
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager     rt.update_available_resource(context, nodename)
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 673, in update_available_resource
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager     resources = self.driver.get_available_resource(nodename)
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 6412, in get_available_resource
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager     data["vcpus_used"] = self._get_vcpu_used()
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 5709, in _get_vcpu_used
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager     for guest in self._host.list_guests():
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/host.py", line 566, in list_guests
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager     only_running=only_running, only_guests=only_guests)]
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/host.py", line 586, in list_instance_domains
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager     alldoms = self.get_connection().listAllDomains(flags)
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 186, in doit
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager     result = proxy_call(self._autowrap, f, *args, **kwargs)
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 144, in proxy_call
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager     rv = execute(f, *args, **kwargs)
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 125, in execute
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager     six.reraise(c, e, tb)
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 83, in tworker
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager     rv = meth(*args, **kwargs)
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager   File "/usr/lib64/python2.7/site-packages/libvirt.py", line 5258, in listAllDomains
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager     raise libvirtError("virConnectListAllDomains() failed", conn=self)
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager libvirtError: Cannot recv data: Connection reset by peer
  • The following kernel trace is seen:
[Tue Jun 16 11:44:12 2020] INFO: task libvirtd:100289 blocked for more than 600 seconds.
[Tue Jun 16 11:44:12 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Jun 16 11:44:12 2020] libvirtd        D ffff8a87695cd140     0 100289 100202 0x00000080
[Tue Jun 16 11:44:12 2020] Call Trace:
[Tue Jun 16 11:44:12 2020]  [<ffffffffb3d68dc9>] schedule+0x29/0x70
[Tue Jun 16 11:44:12 2020]  [<ffffffffb3d668d1>] schedule_timeout+0x221/0x2d0
[Tue Jun 16 11:44:13 2020]  [<ffffffffb3661d13>] ? x2apic_send_IPI_mask+0x13/0x20
[Tue Jun 16 11:44:13 2020]  [<ffffffffb36d68a0>] ? try_to_wake_up+0x190/0x390
[Tue Jun 16 11:44:13 2020]  [<ffffffffb3d6917d>] wait_for_completion+0xfd/0x140
[Tue Jun 16 11:44:13 2020]  [<ffffffffb36d6b60>] ? wake_up_state+0x20/0x20
[Tue Jun 16 11:44:13 2020]  [<ffffffffb36ba63d>] flush_work+0xfd/0x190
[Tue Jun 16 11:44:13 2020]  [<ffffffffb36b7430>] ? move_linked_works+0x90/0x90
[Tue Jun 16 11:44:13 2020]  [<ffffffffb36ba759>] __cancel_work_timer+0x89/0x120
[Tue Jun 16 11:44:13 2020]  [<ffffffffb36ba800>] cancel_work_sync+0x10/0x20
[Tue Jun 16 11:44:13 2020]  [<ffffffffc087b10a>] i40evf_remove+0x5a/0x360 [i40evf]
[Tue Jun 16 11:44:13 2020]  [<ffffffffb39c699e>] pci_device_remove+0x3e/0xc0
[Tue Jun 16 11:44:13 2020]  [<ffffffffb3aa8a42>] __device_release_driver+0x82/0xf0
[Tue Jun 16 11:44:13 2020]  [<ffffffffb3aa8ad3>] device_release_driver+0x23/0x30
[Tue Jun 16 11:44:13 2020]  [<ffffffffb3aa735d>] driver_unbind+0xbd/0xe0
[Tue Jun 16 11:44:13 2020]  [<ffffffffb3aa6887>] drv_attr_store+0x27/0x40
[Tue Jun 16 11:44:13 2020]  [<ffffffffb38cbf72>] sysfs_kf_write+0x42/0x50
[Tue Jun 16 11:44:13 2020]  [<ffffffffb38cb54b>] kernfs_fop_write+0xeb/0x160
[Tue Jun 16 11:44:13 2020]  [<ffffffffb3841890>] vfs_write+0xc0/0x1f0
[Tue Jun 16 11:44:13 2020]  [<ffffffffb38426af>] SyS_write+0x7f/0xf0
[Tue Jun 16 11:44:13 2020]  [<ffffffffb3d7606b>] tracesys+0xa3/0xc9
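The hung-task report above shows libvirtd (pid 100289) stuck in uninterruptible sleep (state D) inside an i40evf driver unbind; a task blocked in the kernel like this cannot be killed or restarted from userspace, which is why only a host reboot recovers. Before rebooting, a quick triage sketch like the following can confirm which tasks are wedged (the paths are standard procfs; the PID in the comment is the one from this trace):

```shell
#!/bin/sh
# List tasks in uninterruptible sleep (state D); these are the tasks
# the kernel hung-task watchdog complains about. The wchan column
# shows the kernel function each task is waiting in.
ps -eo pid,stat,wchan,comm | awk 'NR==1 || $2 ~ /^D/'

# For a specific wedged task, the kernel stack shows where it is
# blocked (for the trace above: flush_work via i40evf_remove):
#   cat /proc/100289/stack
```

If libvirtd shows up in state D with a stack ending in the i40evf driver, the freeze matches this issue rather than a problem in the container itself.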

Environment

  • Red Hat OpenStack Platform 13.0 (RHOSP)
