VMs down on multiple compute nodes

Solution In Progress - Updated -

Issue

  • We are observing multiple VMs down on different compute nodes. Rebooting the affected compute nodes did not resolve the issue.

  • We have also tried powering the servers off and on, but the VMs still do not recover:

(overcloud) [stack@invikl08klm1dirx01mv ~]$ openstack server show 23d3d0e4-db1a-41ff-9a72-e6f2a8bddc75 --fit
+-------------------------------------+--------------------------------------------------------------------------------------------------------------------------------+
| Field                               | Value                                                                                                                          |
+-------------------------------------+--------------------------------------------------------------------------------------------------------------------------------+
| OS-DCF:diskConfig                   | MANUAL                                                                                                                         |
| OS-EXT-AZ:availability_zone         | comp6                                                                                                                          |
| OS-EXT-SRV-ATTR:host                | overcloud-compute-0.localdomain                                                                                          |
| OS-EXT-SRV-ATTR:hypervisor_hostname | overcloud-compute-0.localdomain                                                                                          |
| OS-EXT-SRV-ATTR:instance_name       | instance-00000026                                                                                                              |
| OS-EXT-STS:power_state              | NOSTATE                                                                                                                        |
| OS-EXT-STS:task_state               | None                                                                                                                           |
| OS-EXT-STS:vm_state                 | error                                                                                                                          |
| OS-SRV-USG:launched_at              | 2020-09-30T19:15:40.000000                                                                                                     |
| OS-SRV-USG:terminated_at            | None                                                                                                                           |
| accessIPv4                          |                                                                                                                                |
| accessIPv6                          |                                                                                                                                |
| addresses                           | sriov_eno2_net_Central_613=172.150.128.3; sriov_eno1_net_Central_613=172.150.128.9;                                            |
|                                     | sriov_eno1_net_STL_IPC_internal_209=172.150.146.3; sriov_eno1_net_Mgmt_Test_606=10.103.254.5;                                  |
|                                     | sriov_eno2_net_STL_IPC_internal_209=172.150.146.12; sriov_eno2_net_Mgmt_Test_606=10.103.254.12;                                |
|                                     | sriov_eno1_net_Mediation_RA_612=172.150.127.19; sriov_eno2_net_Mediation_RA_612=172.150.127.11                                 |
| config_drive                        |                                                                                                                                |
| created                             | 2020-09-30T19:15:10Z                                                                                                           |
| fault                               | {u'message': u'libvirtError', u'code': 500, u'details': u'Traceback (most recent call last):\n  File "/usr/lib/python2.7/site- |
|                                     | packages/nova/compute/manager.py", line 202, in decorated_function\n    return function(self, context, *args, **kwargs)\n      |
|                                     | File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 3282, in reboot_instance\n                               |
|                                     | self._set_instance_obj_error_state(context, instance)\n  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line  |
|                                     | 220, in __exit__\n    self.force_reraise()\n  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in     |
|                                     | force_reraise\n    six.reraise(self.type_, self.value, self.tb)\n  File "/usr/lib/python2.7/site-                              |
|                                     | packages/nova/compute/manager.py", line 3257, in reboot_instance\n    bad_volumes_callback=bad_volumes_callback)\n  File       |
|                                     | "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 2725, in reboot\n    block_device_info)\n  File           |
|                                     | "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 2841, in _hard_reboot\n    vifs_already_plugged=True)\n   |
|                                     | File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 5663, in _create_domain_and_network\n                |
|                                     | destroy_disks_on_failure)\n  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__\n           |
|                                     | self.force_reraise()\n  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n           |
|                                     | six.reraise(self.type_, self.value, self.tb)\n  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line      |
|                                     | 5632, in _create_domain_and_network\n    post_xml_callback=post_xml_callback)\n  File "/usr/lib/python2.7/site-                |
|                                     | packages/nova/virt/libvirt/driver.py", line 5567, in _create_domain\n    guest.launch(pause=pause)\n  File "/usr/lib/python2.7 |
|                                     | /site-packages/nova/virt/libvirt/guest.py", line 144, in launch\n    self._encoded_xml, errors=\'ignore\')\n  File             |
|                                     | "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__\n    self.force_reraise()\n  File             |
|                                     | "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n    six.reraise(self.type_, self.value, |
|                                     | self.tb)\n  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py", line 139, in launch\n    return                |
|                                     | self._domain.createWithFlags(flags)\n  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 186, in doit\n          |
|                                     | result = proxy_call(self._autowrap, f, *args, **kwargs)\n  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line     |
|                                     | 144, in proxy_call\n    rv = execute(f, *args, **kwargs)\n  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line    |
|                                     | 125, in execute\n    six.reraise(c, e, tb)\n  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 83, in tworker\n |
|                                     | rv = meth(*args, **kwargs)\n  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1110, in createWithFlags\n    if ret  |
|                                     | == -1: raise libvirtError (\'virDomainCreateWithFlags() failed\', dom=self)\nlibvirtError: internal error: couldn\'t find      |
|                                     | IFLA_VF_INFO for VF 18 in netlink response\n', u'created': u'2021-08-30T15:13:17Z'}                                            |

  • The fault ends with the libvirt error: internal error: couldn't find IFLA_VF_INFO for VF 18 in netlink response
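
Because the fault field is long and wrapped, it can help to pull out only the final line of the traceback, which carries the libvirt error. The following is a minimal sketch, assuming the `details` string has already been copied out of the table and unwrapped into one string; the function name `last_error_line` is hypothetical and not part of any OpenStack tool.

```python
def last_error_line(details: str) -> str:
    """Return the final non-empty line of a traceback string."""
    # The CLI table prints newlines as the two characters backslash + n,
    # so turn those into real newlines before splitting.
    text = details.replace("\\n", "\n")
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return lines[-1] if lines else ""


if __name__ == "__main__":
    # Shortened sample built from the fault shown above (assumption: the
    # table borders and padding were removed by hand).
    sample = (
        "Traceback (most recent call last):\\n"
        "  File \"/usr/lib64/python2.7/site-packages/libvirt.py\", "
        "line 1110, in createWithFlags\\n"
        "libvirtError: internal error: couldn't find IFLA_VF_INFO "
        "for VF 18 in netlink response\\n"
    )
    print(last_error_line(sample))
```

The helper also accepts text that already contains real newlines, since the replacement step is then a no-op.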

  • journalctl -x returns errors similar to these:

Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:86:0d.5: enabling device (0000 -> 0002)
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:d8:05.4: enabling device (0000 -> 0002)
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:86:0b.2: enabling device (0000 -> 0002)
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:d8:04.5: enabling device (0000 -> 0002)
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:86:0a.7: enabling device (0000 -> 0002)
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:d8:02.5: enabling device (0000 -> 0002)
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:86:0a.2: enabling device (0000 -> 0002)
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:d8:02.0: enabling device (0000 -> 0002)
Aug 30 12:33:22 overcloud-compute-0 kernel: DMAR: 64bit 0000:d8:02.0 uses identity mapping
Aug 30 12:33:22 overcloud-compute-0 dbus[8391]: [system] Activating via systemd: service name='org.freedesktop.machine1' unit='dbus-org.freedesktop.machine1.service'
Aug 30 12:33:22 overcloud-compute-0 dbus[8391]: [system] Activation via systemd failed for unit 'dbus-org.freedesktop.machine1.service': Refusing activation, D-Bus is shutting down.
Aug 30 12:33:22 overcloud-compute-0 dockerd-current[17200]: 2021-08-30 07:03:22.101+0000: 836241: error : virSystemdTerminateMachine:444 : Refusing activation, D-Bus is shutting down.
Aug 30 12:33:22 overcloud-compute-0 kernel: DMAR: 64bit 0000:86:0b.2 uses identity mapping
Aug 30 12:33:22 overcloud-compute-0 kernel: DMAR: 64bit 0000:86:0a.2 uses identity mapping
Aug 30 12:33:22 overcloud-compute-0 kernel: DMAR: 64bit 0000:d8:05.4 uses identity mapping
Aug 30 12:33:22 overcloud-compute-0 kernel: DMAR: 64bit 0000:86:0d.5 uses identity mapping
Aug 30 12:33:22 overcloud-compute-0 kernel: DMAR: 64bit 0000:d8:04.5 uses identity mapping
Aug 30 12:33:22 overcloud-compute-0 kernel: DMAR: 64bit 0000:d8:02.5 uses identity mapping
Aug 30 12:33:22 overcloud-compute-0 kernel: DMAR: 64bit 0000:86:0a.7 uses identity mapping
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:d8:02.0: irq 1091 for MSI/MSI-X
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:d8:02.0: irq 1092 for MSI/MSI-X
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:d8:02.0: irq 1093 for MSI/MSI-X
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:d8:02.0: irq 1094 for MSI/MSI-X
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:d8:02.0: irq 1095 for MSI/MSI-X
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:d8:02.0: Multiqueue Enabled: Queue pair count = 4
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:d8:02.0: MAC address: ee:95:ac:56:af:66
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:d8:02.0: GRO is enabled
Aug 30 12:33:22 overcloud-compute-0 NetworkManager[14475]: <info>  [1630307002.1646] manager: (eth0): new Ethernet device (/org/freedesktop/NetworkManager/Devices/341)
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:86:0b.2: irq 1096 for MSI/MSI-X
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:86:0b.2: irq 1097 for MSI/MSI-X
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:86:0b.2: irq 1098 for MSI/MSI-X
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:86:0b.2: irq 1099 for MSI/MSI-X
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:86:0b.2: irq 1100 for MSI/MSI-X
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:86:0b.2: Multiqueue Enabled: Queue pair count = 4
Aug 30 12:33:22 overcloud-compute-0 NetworkManager[14475]: <info>  [1630307002.4337] device (eth0): interface index 341 renamed iface from 'eth0' to 'enp216s2'
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:86:0b.2: MAC address: 26:cc:90:57:ca:d8
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:86:0b.2: GRO is enabled
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:86:0a.2: irq 1111 for MSI/MSI-X
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:86:0a.2: irq 1112 for MSI/MSI-X
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:86:0a.2: irq 1113 for MSI/MSI-X
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:86:0a.2: irq 1114 for MSI/MSI-X
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:86:0a.2: irq 1115 for MSI/MSI-X
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:86:0a.2: Multiqueue Enabled: Queue pair count = 4
...skipping...
Aug 30 12:41:26 overcloud-compute-0 kernel: iavf 0000:1a:09.0: Admin queue command never completed
Aug 30 12:41:25 overcloud-compute-0 kernel: i40e 0000:1a:00.1: add vsi failed for VF 2, aq_err 0
Aug 30 12:41:25 overcloud-compute-0 kernel: i40e 0000:1a:00.0: add vsi failed, err I40E_ERR_ADMIN_QUEUE_TIMEOUT aq_err OK
Aug 30 12:41:25 overcloud-compute-0 kernel: i40e 0000:1a:00.0: add vsi failed for VF 1, aq_err 0
Aug 30 12:41:25 overcloud-compute-0 kernel: i40e 0000:1a:00.1: Set default VSI failed, err I40E_ERR_ADMIN_QUEUE_TIMEOUT, aq_err OK
Aug 30 12:41:25 overcloud-compute-0 kernel: i40e 0000:1a:00.1: Setting promiscuous on failed on PF, err I40E_ERR_ADMIN_QUEUE_TIMEOUT aq_err OK
Aug 30 12:41:25 overcloud-compute-0 kernel: iavf 0000:1a:06.4: Device is still in reset (-16), retrying
Aug 30 12:41:25 overcloud-compute-0 kernel: iavf 0000:1a:06.2: Admin queue command never completed
Aug 30 12:41:26 overcloud-compute-0 kernel: i40e 0000:1a:00.0: VF reset check timeout on VF 1
Aug 30 12:41:26 overcloud-compute-0 kernel: iavf 0000:1a:09.0: Admin queue command never completed
Aug 30 12:41:26 overcloud-compute-0 kernel: i40e 0000:1a:00.0: add vsi failed, err I40E_ERR_ADMIN_QUEUE_TIMEOUT aq_err OK
Aug 30 12:41:26 overcloud-compute-0 kernel: i40e 0000:1a:00.0: add vsi failed for VF 1, aq_err 0
Aug 30 12:41:26 overcloud-compute-0 kernel: iavf 0000:1a:05.2: Admin queue command never completed
Aug 30 12:41:26 overcloud-compute-0 kernel: i40e 0000:1a:00.0: VF 6 failed to set multicast promiscuous mode err I40E_ERR_ADMIN_QUEUE_TIMEOUT aq_err OK
Aug 30 12:41:26 overcloud-compute-0 kernel: iavf 0000:1a:05.3: Admin queue command never completed
Aug 30 12:41:26 overcloud-compute-0 kernel: i40e 0000:1a:00.0: Error OK, forcing overflow promiscuous on VF 6
Aug 30 12:41:27 overcloud-compute-0 kernel: iavf 0000:1a:05.1: Admin queue command never completed
Aug 30 12:41:27 overcloud-compute-0 kernel: i40e 0000:1a:00.0: ignoring delete macvlan error on VF 6, err I40E_ERR_ADMIN_QUEUE_TIMEOUT, aq_err OK
Aug 30 12:41:27 overcloud-compute-0 kernel: iavf 0000:1a:07.4: Admin queue command never completed
Aug 30 12:41:27 overcloud-compute-0 kernel: i40e 0000:1a:00.1: VSI seid 440 Rx ring 145 disable timeout
Aug 30 12:41:27 overcloud-compute-0 kernel: i40e 0000:1a:00.0: add vsi failed, err I40E_ERR_ADMIN_QUEUE_TIMEOUT aq_err OK
Aug 30 12:41:27 overcloud-compute-0 kernel: i40e 0000:1a:00.0: add vsi failed for VF 6, aq_err 0
Aug 30 12:41:27 overcloud-compute-0 kernel: i40e 0000:1a:00.0: VSI seid 425 Rx ring 213 disable timeout
Aug 30 12:41:27 overcloud-compute-0 kernel: i40e 0000:1a:00.1: VF 4 failed to set multicast promiscuous mode err I40E_ERR_ADMIN_QUEUE_TIMEOUT aq_err OK
Aug 30 12:41:28 overcloud-compute-0 kernel: i40e 0000:1a:00.1: Error OK, forcing overflow promiscuous on VF 4
Aug 30 12:41:28 overcloud-compute-0 kernel: i40e 0000:1a:00.0: VF 21 failed to set multicast promiscuous mode err I40E_ERR_ADMIN_QUEUE_TIMEOUT aq_err OK
Aug 30 12:41:28 overcloud-compute-0 kernel: i40e 0000:1a:00.1: ignoring delete macvlan error on VF 4, err I40E_ERR_ADMIN_QUEUE_TIMEOUT, aq_err OK
Aug 30 12:41:28 overcloud-compute-0 kernel: i40e 0000:1a:00.0: Error OK, forcing overflow promiscuous on VF 21
Aug 30 12:41:28 overcloud-compute-0 kernel: i40e 0000:1a:00.0: ignoring delete macvlan error on VF 21, err I40E_ERR_ADMIN_QUEUE_TIMEOUT, aq_err OK
Aug 30 12:41:28 overcloud-compute-0 kernel: i40e 0000:1a:00.1: add vsi failed, err I40E_ERR_ADMIN_QUEUE_TIMEOUT aq_err OK
Aug 30 12:41:28 overcloud-compute-0 kernel: i40e 0000:1a:00.1: add vsi failed for VF 4, aq_err 0
Aug 30 12:41:28 overcloud-compute-0 kernel: iavf 0000:1a:06.2: Admin queue command never completed
Aug 30 12:41:29 overcloud-compute-0 kernel: i40e 0000:1a:00.0: add vsi failed, err I40E_ERR_ADMIN_QUEUE_TIMEOUT aq_err OK
Aug 30 12:41:29 overcloud-compute-0 kernel: i40e 0000:1a:00.0: add vsi failed for VF 21, aq_err 0
Aug 30 12:41:29 overcloud-compute-0 kernel: iavf 0000:1a:09.0: Admin queue command never completed
Aug 30 12:41:29 overcloud-compute-0 kernel: i40e 0000:1a:00.0: VSI seid 435 Rx ring 253 disable timeout
Aug 30 12:41:29 overcloud-compute-0 kernel: iavf 0000:1a:06.4: Admin queue command never completed
Aug 30 12:41:29 overcloud-compute-0 kernel: i40e 0000:1a:00.0: VF 31 failed to set multicast promiscuous mode err I40E_ERR_ADMIN_QUEUE_TIMEOUT aq_err OK
Aug 30 12:41:29 overcloud-compute-0 kernel: iavf 0000:1a:07.4: Admin queue command never completed
Aug 30 12:41:29 overcloud-compute-0 kernel: i40e 0000:1a:00.0: Error OK, forcing overflow promiscuous on VF 31
Aug 30 12:41:29 overcloud-compute-0 kernel: iavf 0000:1a:05.1: Admin queue command never completed
Aug 30 12:41:30 overcloud-compute-0 kernel: i40e 0000:1a:00.0: ignoring delete macvlan error on VF 31, err I40E_ERR_ADMIN_QUEUE_TIMEOUT, aq_err OK
Aug 30 12:41:30 overcloud-compute-0 kernel: iavf 0000:1a:05.3: Admin queue command never completed
Aug 30 12:41:30 overcloud-compute-0 kernel: iavf 0000:1a:05.2: Admin queue command never completed
Aug 30 12:41:30 overcloud-compute-0 kernel: i40e 0000:1a:00.0: add vsi failed, err I40E_ERR_ADMIN_QUEUE_TIMEOUT aq_err OK
Aug 30 12:41:30 overcloud-compute-0 kernel: i40e 0000:1a:00.0: add vsi failed for VF 31, aq_err 0
Aug 30 12:41:30 overcloud-compute-0 kernel: i40e 0000:1a:00.1: VSI seid 441 Rx ring 149 disable timeout
Aug 30 12:41:30 overcloud-compute-0 kernel: i40e 0000:1a:00.0: VF reset check timeout on VF 1
Aug 30 12:41:31 overcloud-compute-0 kernel: i40e 0000:1a:00.1: VF 5 failed to set multicast promiscuous mode err I40E_ERR_ADMIN_QUEUE_TIMEOUT aq_err OK
Aug 30 12:41:31 overcloud-compute-0 kernel: i40e 0000:1a:00.0: add vsi failed, err I40E_ERR_ADMIN_QUEUE_TIMEOUT aq_err OK
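
To see which devices are affected, the admin queue errors in a log like the one above can be tallied per driver and PCI address. This is a rough sketch, assuming the journal has been saved as plain text lines in the format shown; the names `KERNEL_LINE`, `ERROR_MARKERS` and `count_admin_queue_errors` are hypothetical, and the two marker strings are simply the ones that appear in this log, not an exhaustive list.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   "... kernel: i40e 0000:1a:00.0: add vsi failed, err I40E_ERR_ADMIN_QUEUE_TIMEOUT aq_err OK"
KERNEL_LINE = re.compile(
    r"kernel: (?P<driver>i40e|iavf) "
    r"(?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"(?P<msg>.*)"
)

# Substrings taken from the log above that indicate an admin queue problem.
ERROR_MARKERS = (
    "I40E_ERR_ADMIN_QUEUE_TIMEOUT",
    "Admin queue command never completed",
)


def count_admin_queue_errors(lines):
    """Count admin queue error lines per (driver, PCI address)."""
    counts = Counter()
    for line in lines:
        match = KERNEL_LINE.search(line)
        if match and any(marker in match.group("msg") for marker in ERROR_MARKERS):
            counts[(match.group("driver"), match.group("pci"))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Example: journalctl -k | python3 this_script.py
    for (driver, pci), n in count_admin_queue_errors(sys.stdin).most_common():
        print(f"{driver} {pci}: {n}")
```

Grouping by PCI address makes it easier to tell whether the timeouts are confined to one physical function (for example 0000:1a:00.0 and 0000:1a:00.1 in this log) or spread across several adapters.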
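  • ip link show on the physical function (eno2 in this example) is missing some VFs; VFs 1, 2, 4, 6, 12 and 21 are absent from the listing below: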
[overcloud-compute-0 ~]# ip link show eno2
3: eno2: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 84:13:9f:31:ac:54 brd ff:ff:ff:ff:ff:ff
    vf 0 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 3 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 5 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 7 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 8 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 9 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 10 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 11 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 13 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 14 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 15 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 16 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 17 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 18 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 19 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 20 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 22 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 23 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 24 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 25 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 26 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 27 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 28 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 29 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
[root@overcloud-compute-0 ~]# exit
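
Spotting gaps in a long VF listing by eye is error-prone, so the check can be scripted. The sketch below assumes the `ip link show` output has been captured as text and that the number of VFs configured on the physical function is known (for instance from `/sys/class/net/<pf>/device/sriov_numvfs`); the configured count for this environment is not stated in this article, and the names `VF_LINE` and `missing_vfs` are hypothetical.

```python
import re

# Matches the indented per-VF lines, e.g. "    vf 18 MAC 00:00:00:00:00:00, ..."
VF_LINE = re.compile(r"^\s*vf (\d+)\b")


def missing_vfs(ip_link_output: str, expected_count: int):
    """Return VF indices in range(expected_count) absent from the output."""
    present = set()
    for line in ip_link_output.splitlines():
        match = VF_LINE.match(line)
        if match:
            present.add(int(match.group(1)))
    return [vf for vf in range(expected_count) if vf not in present]


if __name__ == "__main__":
    import sys

    # Example: ip link show eno2 | python3 this_script.py 32
    # The expected VF count (32 here) is an assumption supplied by the caller.
    expected = int(sys.argv[1])
    print(missing_vfs(sys.stdin.read(), expected))
```

Passing the expected count explicitly keeps the function independent of the host it runs on, so the same check can be applied to output collected from several compute nodes.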

Environment

  • Red Hat OpenStack Platform 13.0 (RHOSP)
