Nova libvirt container freezes/becomes stuck - host reboot required to recover.

Solution In Progress - Updated

Issue

  • We experienced two failures on two different hosts. In both cases, nova_libvirt was unresponsive: any "virsh" command timed out, and any operation through the CLI timed out as well.

1) The nova_libvirt container was still reported as active.
2) We could not execute commands in the container: "sudo docker exec nova_libvirt virsh list --all" (this usually works).
3) We could not get a shell in the container: "sudo docker exec -it nova_libvirt bash".
4) We could not restart the container: "sudo docker restart nova_libvirt" - it hung until cancelled. (See the checks sketched after this list.)
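
As a quick way to confirm this state without leaving commands hanging, the checks below can be run from the compute host. These are generic diagnostic commands, not taken from the original report; the container name nova_libvirt and the 30-second timeout are assumptions:

# Bound the wait on the normally-working command; exit code 124 means it timed out
timeout 30 sudo docker exec nova_libvirt virsh list --all; echo "exit code: $?"

# Confirm docker still reports the container as running, and note its main process PID on the host
sudo docker inspect --format '{{.State.Status}} (pid {{.State.Pid}})' nova_libvirt

# Look for processes stuck in uninterruptible sleep (D state), e.g. libvirtd
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /D/ {print}'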

  • The following errors are seen in the docker logs for nova_libvirt:
[stack@overcloud-compute-035 ~]$ sudo docker logs nova_libvirt
2020-06-02 18:27:51.205+0000: 100337: error : virFileReadAll:1460 : Failed to open file '/sys/class/net/vethefdaeea/operstate': No such file or directory
2020-06-02 18:27:51.205+0000: 100337: error : virNetDevGetLinkInfo:2552 : unable to read: /sys/class/net/vethefdaeea/operstate: No such file or directory
2020-06-16 15:55:40.570+0000: 100216: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
2020-06-16 15:55:56.630+0000: 100216: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
2020-06-16 16:01:20.679+0000: 100216: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
2020-06-16 16:13:05.858+0000: 100216: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
2020-06-16 16:27:50.449+0000: 100216: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
2020-06-16 16:30:06.327+0000: 100216: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
2020-06-16 16:30:37.142+0000: 100216: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
2020-06-16 16:40:01.636+0000: 100216: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
2020-06-16 16:40:08.673+0000: 100216: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
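
To pull just these recurring errors out of a long container log, a simple filter can be used (a generic example, not part of the original report):

# docker logs may write to stderr, so redirect before filtering
sudo docker logs --tail 500 nova_libvirt 2>&1 | grep -E 'virNetSocketReadWire|virFileReadAll'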
  • The following error is seen in /var/log/containers/nova/nova-compute.log:
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager [req-a649dd94-ad44-4ad4-ba86-063641a62066 - - - - -] Error updating resources for node overcloud-compute-040.localhost.: libvirtError: Cannot recv data: Connection reset by peer
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager Traceback (most recent call last):
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 7386, in update_available_resource_for_node
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager     rt.update_available_resource(context, nodename)
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 673, in update_available_resource
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager     resources = self.driver.get_available_resource(nodename)
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 6412, in get_available_resource
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager     data["vcpus_used"] = self._get_vcpu_used()
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 5709, in _get_vcpu_used
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager     for guest in self._host.list_guests():
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/host.py", line 566, in list_guests
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager     only_running=only_running, only_guests=only_guests)]
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/host.py", line 586, in list_instance_domains
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager     alldoms = self.get_connection().listAllDomains(flags)
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 186, in doit
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager     result = proxy_call(self._autowrap, f, *args, **kwargs)
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 144, in proxy_call
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager     rv = execute(f, *args, **kwargs)
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 125, in execute
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager     six.reraise(c, e, tb)
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 83, in tworker
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager     rv = meth(*args, **kwargs)
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager   File "/usr/lib64/python2.7/site-packages/libvirt.py", line 5258, in listAllDomains
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager     raise libvirtError("virConnectListAllDomains() failed", conn=self)
2020-06-15 19:14:06.056 1 ERROR nova.compute.manager libvirtError: Cannot recv data: Connection reset by peer
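
The same failure can be spotted in the compute log without reading the full traceback; this is a generic filter, assuming the default log path shown above:

# Count how often the libvirt connection reset has been logged
sudo grep -c 'libvirtError: Cannot recv data: Connection reset by peer' /var/log/containers/nova/nova-compute.log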
  • The following kernel trace is seen:
[Tue Jun 16 11:44:12 2020] INFO: task libvirtd:100289 blocked for more than 600 seconds.
[Tue Jun 16 11:44:12 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Jun 16 11:44:12 2020] libvirtd        D ffff8a87695cd140     0 100289 100202 0x00000080
[Tue Jun 16 11:44:12 2020] Call Trace:
[Tue Jun 16 11:44:12 2020]  [<ffffffffb3d68dc9>] schedule+0x29/0x70
[Tue Jun 16 11:44:12 2020]  [<ffffffffb3d668d1>] schedule_timeout+0x221/0x2d0
[Tue Jun 16 11:44:13 2020]  [<ffffffffb3661d13>] ? x2apic_send_IPI_mask+0x13/0x20
[Tue Jun 16 11:44:13 2020]  [<ffffffffb36d68a0>] ? try_to_wake_up+0x190/0x390
[Tue Jun 16 11:44:13 2020]  [<ffffffffb3d6917d>] wait_for_completion+0xfd/0x140
[Tue Jun 16 11:44:13 2020]  [<ffffffffb36d6b60>] ? wake_up_state+0x20/0x20
[Tue Jun 16 11:44:13 2020]  [<ffffffffb36ba63d>] flush_work+0xfd/0x190
[Tue Jun 16 11:44:13 2020]  [<ffffffffb36b7430>] ? move_linked_works+0x90/0x90
[Tue Jun 16 11:44:13 2020]  [<ffffffffb36ba759>] __cancel_work_timer+0x89/0x120
[Tue Jun 16 11:44:13 2020]  [<ffffffffb36ba800>] cancel_work_sync+0x10/0x20
[Tue Jun 16 11:44:13 2020]  [<ffffffffc087b10a>] i40evf_remove+0x5a/0x360 [i40evf]
[Tue Jun 16 11:44:13 2020]  [<ffffffffb39c699e>] pci_device_remove+0x3e/0xc0
[Tue Jun 16 11:44:13 2020]  [<ffffffffb3aa8a42>] __device_release_driver+0x82/0xf0
[Tue Jun 16 11:44:13 2020]  [<ffffffffb3aa8ad3>] device_release_driver+0x23/0x30
[Tue Jun 16 11:44:13 2020]  [<ffffffffb3aa735d>] driver_unbind+0xbd/0xe0
[Tue Jun 16 11:44:13 2020]  [<ffffffffb3aa6887>] drv_attr_store+0x27/0x40
[Tue Jun 16 11:44:13 2020]  [<ffffffffb38cbf72>] sysfs_kf_write+0x42/0x50
[Tue Jun 16 11:44:13 2020]  [<ffffffffb38cb54b>] kernfs_fop_write+0xeb/0x160
[Tue Jun 16 11:44:13 2020]  [<ffffffffb3841890>] vfs_write+0xc0/0x1f0
[Tue Jun 16 11:44:13 2020]  [<ffffffffb38426af>] SyS_write+0x7f/0xf0
[Tue Jun 16 11:44:13 2020]  [<ffffffffb3d7606b>] tracesys+0xa3/0xc9
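
To confirm the hung task on an affected host, the kernel ring buffer and the hung-task watchdog threshold can be inspected directly. These are standard commands and are not taken from the original report:

# Show hung-task reports with human-readable timestamps (libvirtd is the blocked task above)
sudo dmesg -T | grep -A 25 'blocked for more than'

# The hung_task watchdog threshold currently in effect (600 seconds in the trace above)
cat /proc/sys/kernel/hung_task_timeout_secs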

Environment

  • Red Hat OpenStack Platform 13.0 (RHOSP)
