VMs down on multiple compute nodes

Solution In Progress

Issue

  • Multiple VMs are down on different compute nodes. We have rebooted the affected compute nodes, but the issue persists.

  • We have also tried powering the instances off and on, but the VMs still do not recover:

(overcloud) [stack@invikl08klm1dirx01mv ~]$ openstack server show 23d3d0e4-db1a-41ff-9a72-e6f2a8bddc75 --fit
+-------------------------------------+--------------------------------------------------------------------------------------------------------------------------------+
| Field                               | Value                                                                                                                          |
+-------------------------------------+--------------------------------------------------------------------------------------------------------------------------------+
| OS-DCF:diskConfig                   | MANUAL                                                                                                                         |
| OS-EXT-AZ:availability_zone         | comp6                                                                                                                          |
| OS-EXT-SRV-ATTR:host                | overcloud-compute-0.localdomain                                                                                          |
| OS-EXT-SRV-ATTR:hypervisor_hostname | overcloud-compute-0.localdomain                                                                                          |
| OS-EXT-SRV-ATTR:instance_name       | instance-00000026                                                                                                              |
| OS-EXT-STS:power_state              | NOSTATE                                                                                                                        |
| OS-EXT-STS:task_state               | None                                                                                                                           |
| OS-EXT-STS:vm_state                 | error                                                                                                                          |
| OS-SRV-USG:launched_at              | 2020-09-30T19:15:40.000000                                                                                                     |
| OS-SRV-USG:terminated_at            | None                                                                                                                           |
| accessIPv4                          |                                                                                                                                |
| accessIPv6                          |                                                                                                                                |
| addresses                           | sriov_eno2_net_Central_613=172.150.128.3; sriov_eno1_net_Central_613=172.150.128.9;                                            |
|                                     | sriov_eno1_net_STL_IPC_internal_209=172.150.146.3; sriov_eno1_net_Mgmt_Test_606=10.103.254.5;                                  |
|                                     | sriov_eno2_net_STL_IPC_internal_209=172.150.146.12; sriov_eno2_net_Mgmt_Test_606=10.103.254.12;                                |
|                                     | sriov_eno1_net_Mediation_RA_612=172.150.127.19; sriov_eno2_net_Mediation_RA_612=172.150.127.11                                 |
| config_drive                        |                                                                                                                                |
| created                             | 2020-09-30T19:15:10Z                                                                                                           |
| fault                               | {u'message': u'libvirtError', u'code': 500, u'details': u'Traceback (most recent call last):\n  File "/usr/lib/python2.7/site- |
|                                     | packages/nova/compute/manager.py", line 202, in decorated_function\n    return function(self, context, *args, **kwargs)\n      |
|                                     | File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 3282, in reboot_instance\n                               |
|                                     | self._set_instance_obj_error_state(context, instance)\n  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line  |
|                                     | 220, in __exit__\n    self.force_reraise()\n  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in     |
|                                     | force_reraise\n    six.reraise(self.type_, self.value, self.tb)\n  File "/usr/lib/python2.7/site-                              |
|                                     | packages/nova/compute/manager.py", line 3257, in reboot_instance\n    bad_volumes_callback=bad_volumes_callback)\n  File       |
|                                     | "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 2725, in reboot\n    block_device_info)\n  File           |
|                                     | "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 2841, in _hard_reboot\n    vifs_already_plugged=True)\n   |
|                                     | File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 5663, in _create_domain_and_network\n                |
|                                     | destroy_disks_on_failure)\n  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__\n           |
|                                     | self.force_reraise()\n  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n           |
|                                     | six.reraise(self.type_, self.value, self.tb)\n  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line      |
|                                     | 5632, in _create_domain_and_network\n    post_xml_callback=post_xml_callback)\n  File "/usr/lib/python2.7/site-                |
|                                     | packages/nova/virt/libvirt/driver.py", line 5567, in _create_domain\n    guest.launch(pause=pause)\n  File "/usr/lib/python2.7 |
|                                     | /site-packages/nova/virt/libvirt/guest.py", line 144, in launch\n    self._encoded_xml, errors=\'ignore\')\n  File             |
|                                     | "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__\n    self.force_reraise()\n  File             |
|                                     | "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n    six.reraise(self.type_, self.value, |
|                                     | self.tb)\n  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py", line 139, in launch\n    return                |
|                                     | self._domain.createWithFlags(flags)\n  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 186, in doit\n          |
|                                     | result = proxy_call(self._autowrap, f, *args, **kwargs)\n  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line     |
|                                     | 144, in proxy_call\n    rv = execute(f, *args, **kwargs)\n  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line    |
|                                     | 125, in execute\n    six.reraise(c, e, tb)\n  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 83, in tworker\n |
|                                     | rv = meth(*args, **kwargs)\n  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1110, in createWithFlags\n    if ret  |
|                                     | == -1: raise libvirtError (\'virDomainCreateWithFlags() failed\', dom=self)\nlibvirtError: internal error: couldn\'t find    |
|                                     | IFLA_VF_INFO for VF 18 in netlink response\n', u'created': u'2021-08-30T15:13:17Z'}                                            |
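Because several instances fail with the same libvirt error, it may help during triage to collect the VF index named in each fault. The Python sketch below assumes the fault text has already been gathered as plain strings (for example with openstack server show <server> -f value -c fault); failed_vf_indices is a hypothetical helper, and the pattern also accepts the escaped couldn\'t form seen in the table output above.

```python
import re

# Matches the libvirt message seen in the Nova fault, e.g.
# "couldn't find IFLA_VF_INFO for VF 18 in netlink response".
# The optional backslash also accepts the escaped form "couldn\'t"
# that appears when the fault dict is printed as a Python repr.
VF_INFO_RE = re.compile(r"couldn\\?'t find IFLA_VF_INFO for VF (\d+)")


def failed_vf_indices(fault_messages):
    """Return the sorted, de-duplicated VF indices named in libvirt faults."""
    found = set()
    for message in fault_messages:
        for match in VF_INFO_RE.finditer(message):
            found.add(int(match.group(1)))
    return sorted(found)


if __name__ == "__main__":
    faults = [
        "libvirtError: internal error: couldn't find IFLA_VF_INFO for VF 18 in netlink response",
        r"libvirtError: internal error: couldn\'t find IFLA_VF_INFO for VF 18 in netlink response",
    ]
    print(failed_vf_indices(faults))  # [18]
```

The VF number alone does not identify the PF, so it is only a starting point for matching faults against the kernel log and the ip link output.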

  • journalctl -x on the affected compute node returns errors similar to the following:

Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:86:0d.5: enabling device (0000 -> 0002)
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:d8:05.4: enabling device (0000 -> 0002)
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:86:0b.2: enabling device (0000 -> 0002)
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:d8:04.5: enabling device (0000 -> 0002)
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:86:0a.7: enabling device (0000 -> 0002)
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:d8:02.5: enabling device (0000 -> 0002)
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:86:0a.2: enabling device (0000 -> 0002)
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:d8:02.0: enabling device (0000 -> 0002)
Aug 30 12:33:22 overcloud-compute-0 kernel: DMAR: 64bit 0000:d8:02.0 uses identity mapping
Aug 30 12:33:22 overcloud-compute-0 dbus[8391]: [system] Activating via systemd: service name='org.freedesktop.machine1' unit='dbus-org.freedesktop.machine1.service'
Aug 30 12:33:22 overcloud-compute-0 dbus[8391]: [system] Activation via systemd failed for unit 'dbus-org.freedesktop.machine1.service': Refusing activation, D-Bus is shutting down.
Aug 30 12:33:22 overcloud-compute-0 dockerd-current[17200]: 2021-08-30 07:03:22.101+0000: 836241: error : virSystemdTerminateMachine:444 : Refusing activation, D-Bus is shutting down.
Aug 30 12:33:22 overcloud-compute-0 kernel: DMAR: 64bit 0000:86:0b.2 uses identity mapping
Aug 30 12:33:22 overcloud-compute-0 kernel: DMAR: 64bit 0000:86:0a.2 uses identity mapping
Aug 30 12:33:22 overcloud-compute-0 kernel: DMAR: 64bit 0000:d8:05.4 uses identity mapping
Aug 30 12:33:22 overcloud-compute-0 kernel: DMAR: 64bit 0000:86:0d.5 uses identity mapping
Aug 30 12:33:22 overcloud-compute-0 kernel: DMAR: 64bit 0000:d8:04.5 uses identity mapping
Aug 30 12:33:22 overcloud-compute-0 kernel: DMAR: 64bit 0000:d8:02.5 uses identity mapping
Aug 30 12:33:22 overcloud-compute-0 kernel: DMAR: 64bit 0000:86:0a.7 uses identity mapping
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:d8:02.0: irq 1091 for MSI/MSI-X
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:d8:02.0: irq 1092 for MSI/MSI-X
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:d8:02.0: irq 1093 for MSI/MSI-X
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:d8:02.0: irq 1094 for MSI/MSI-X
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:d8:02.0: irq 1095 for MSI/MSI-X
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:d8:02.0: Multiqueue Enabled: Queue pair count = 4
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:d8:02.0: MAC address: ee:95:ac:56:af:66
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:d8:02.0: GRO is enabled
Aug 30 12:33:22 overcloud-compute-0 NetworkManager[14475]: <info>  [1630307002.1646] manager: (eth0): new Ethernet device (/org/freedesktop/NetworkManager/Devices/341)
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:86:0b.2: irq 1096 for MSI/MSI-X
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:86:0b.2: irq 1097 for MSI/MSI-X
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:86:0b.2: irq 1098 for MSI/MSI-X
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:86:0b.2: irq 1099 for MSI/MSI-X
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:86:0b.2: irq 1100 for MSI/MSI-X
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:86:0b.2: Multiqueue Enabled: Queue pair count = 4
Aug 30 12:33:22 overcloud-compute-0 NetworkManager[14475]: <info>  [1630307002.4337] device (eth0): interface index 341 renamed iface from 'eth0' to 'enp216s2'
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:86:0b.2: MAC address: 26:cc:90:57:ca:d8
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:86:0b.2: GRO is enabled
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:86:0a.2: irq 1111 for MSI/MSI-X
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:86:0a.2: irq 1112 for MSI/MSI-X
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:86:0a.2: irq 1113 for MSI/MSI-X
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:86:0a.2: irq 1114 for MSI/MSI-X
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:86:0a.2: irq 1115 for MSI/MSI-X
Aug 30 12:33:22 overcloud-compute-0 kernel: iavf 0000:86:0a.2: Multiqueue Enabled: Queue pair count = 4
...skipping...
Aug 30 12:41:26 overcloud-compute-0 kernel: iavf 0000:1a:09.0: Admin queue command never completed
Aug 30 12:41:25 overcloud-compute-0 kernel: i40e 0000:1a:00.1: add vsi failed for VF 2, aq_err 0
Aug 30 12:41:25 overcloud-compute-0 kernel: i40e 0000:1a:00.0: add vsi failed, err I40E_ERR_ADMIN_QUEUE_TIMEOUT aq_err OK
Aug 30 12:41:25 overcloud-compute-0 kernel: i40e 0000:1a:00.0: add vsi failed for VF 1, aq_err 0
Aug 30 12:41:25 overcloud-compute-0 kernel: i40e 0000:1a:00.1: Set default VSI failed, err I40E_ERR_ADMIN_QUEUE_TIMEOUT, aq_err OK
Aug 30 12:41:25 overcloud-compute-0 kernel: i40e 0000:1a:00.1: Setting promiscuous on failed on PF, err I40E_ERR_ADMIN_QUEUE_TIMEOUT aq_err OK
Aug 30 12:41:25 overcloud-compute-0 kernel: iavf 0000:1a:06.4: Device is still in reset (-16), retrying
Aug 30 12:41:25 overcloud-compute-0 kernel: iavf 0000:1a:06.2: Admin queue command never completed
Aug 30 12:41:26 overcloud-compute-0 kernel: i40e 0000:1a:00.0: VF reset check timeout on VF 1
Aug 30 12:41:26 overcloud-compute-0 kernel: iavf 0000:1a:09.0: Admin queue command never completed
Aug 30 12:41:26 overcloud-compute-0 kernel: i40e 0000:1a:00.0: add vsi failed, err I40E_ERR_ADMIN_QUEUE_TIMEOUT aq_err OK
Aug 30 12:41:26 overcloud-compute-0 kernel: i40e 0000:1a:00.0: add vsi failed for VF 1, aq_err 0
Aug 30 12:41:26 overcloud-compute-0 kernel: iavf 0000:1a:05.2: Admin queue command never completed
Aug 30 12:41:26 overcloud-compute-0 kernel: i40e 0000:1a:00.0: VF 6 failed to set multicast promiscuous mode err I40E_ERR_ADMIN_QUEUE_TIMEOUT aq_err OK
Aug 30 12:41:26 overcloud-compute-0 kernel: iavf 0000:1a:05.3: Admin queue command never completed
Aug 30 12:41:26 overcloud-compute-0 kernel: i40e 0000:1a:00.0: Error OK, forcing overflow promiscuous on VF 6
Aug 30 12:41:27 overcloud-compute-0 kernel: iavf 0000:1a:05.1: Admin queue command never completed
Aug 30 12:41:27 overcloud-compute-0 kernel: i40e 0000:1a:00.0: ignoring delete macvlan error on VF 6, err I40E_ERR_ADMIN_QUEUE_TIMEOUT, aq_err OK
Aug 30 12:41:27 overcloud-compute-0 kernel: iavf 0000:1a:07.4: Admin queue command never completed
Aug 30 12:41:27 overcloud-compute-0 kernel: i40e 0000:1a:00.1: VSI seid 440 Rx ring 145 disable timeout
Aug 30 12:41:27 overcloud-compute-0 kernel: i40e 0000:1a:00.0: add vsi failed, err I40E_ERR_ADMIN_QUEUE_TIMEOUT aq_err OK
Aug 30 12:41:27 overcloud-compute-0 kernel: i40e 0000:1a:00.0: add vsi failed for VF 6, aq_err 0
Aug 30 12:41:27 overcloud-compute-0 kernel: i40e 0000:1a:00.0: VSI seid 425 Rx ring 213 disable timeout
Aug 30 12:41:27 overcloud-compute-0 kernel: i40e 0000:1a:00.1: VF 4 failed to set multicast promiscuous mode err I40E_ERR_ADMIN_QUEUE_TIMEOUT aq_err OK
Aug 30 12:41:28 overcloud-compute-0 kernel: i40e 0000:1a:00.1: Error OK, forcing overflow promiscuous on VF 4
Aug 30 12:41:28 overcloud-compute-0 kernel: i40e 0000:1a:00.0: VF 21 failed to set multicast promiscuous mode err I40E_ERR_ADMIN_QUEUE_TIMEOUT aq_err OK
Aug 30 12:41:28 overcloud-compute-0 kernel: i40e 0000:1a:00.1: ignoring delete macvlan error on VF 4, err I40E_ERR_ADMIN_QUEUE_TIMEOUT, aq_err OK
Aug 30 12:41:28 overcloud-compute-0 kernel: i40e 0000:1a:00.0: Error OK, forcing overflow promiscuous on VF 21
Aug 30 12:41:28 overcloud-compute-0 kernel: i40e 0000:1a:00.0: ignoring delete macvlan error on VF 21, err I40E_ERR_ADMIN_QUEUE_TIMEOUT, aq_err OK
Aug 30 12:41:28 overcloud-compute-0 kernel: i40e 0000:1a:00.1: add vsi failed, err I40E_ERR_ADMIN_QUEUE_TIMEOUT aq_err OK
Aug 30 12:41:28 overcloud-compute-0 kernel: i40e 0000:1a:00.1: add vsi failed for VF 4, aq_err 0
Aug 30 12:41:28 overcloud-compute-0 kernel: iavf 0000:1a:06.2: Admin queue command never completed
Aug 30 12:41:29 overcloud-compute-0 kernel: i40e 0000:1a:00.0: add vsi failed, err I40E_ERR_ADMIN_QUEUE_TIMEOUT aq_err OK
Aug 30 12:41:29 overcloud-compute-0 kernel: i40e 0000:1a:00.0: add vsi failed for VF 21, aq_err 0
Aug 30 12:41:29 overcloud-compute-0 kernel: iavf 0000:1a:09.0: Admin queue command never completed
Aug 30 12:41:29 overcloud-compute-0 kernel: i40e 0000:1a:00.0: VSI seid 435 Rx ring 253 disable timeout
Aug 30 12:41:29 overcloud-compute-0 kernel: iavf 0000:1a:06.4: Admin queue command never completed
Aug 30 12:41:29 overcloud-compute-0 kernel: i40e 0000:1a:00.0: VF 31 failed to set multicast promiscuous mode err I40E_ERR_ADMIN_QUEUE_TIMEOUT aq_err OK
Aug 30 12:41:29 overcloud-compute-0 kernel: iavf 0000:1a:07.4: Admin queue command never completed
Aug 30 12:41:29 overcloud-compute-0 kernel: i40e 0000:1a:00.0: Error OK, forcing overflow promiscuous on VF 31
Aug 30 12:41:29 overcloud-compute-0 kernel: iavf 0000:1a:05.1: Admin queue command never completed
Aug 30 12:41:30 overcloud-compute-0 kernel: i40e 0000:1a:00.0: ignoring delete macvlan error on VF 31, err I40E_ERR_ADMIN_QUEUE_TIMEOUT, aq_err OK
Aug 30 12:41:30 overcloud-compute-0 kernel: iavf 0000:1a:05.3: Admin queue command never completed
Aug 30 12:41:30 overcloud-compute-0 kernel: iavf 0000:1a:05.2: Admin queue command never completed
Aug 30 12:41:30 overcloud-compute-0 kernel: i40e 0000:1a:00.0: add vsi failed, err I40E_ERR_ADMIN_QUEUE_TIMEOUT aq_err OK
Aug 30 12:41:30 overcloud-compute-0 kernel: i40e 0000:1a:00.0: add vsi failed for VF 31, aq_err 0
Aug 30 12:41:30 overcloud-compute-0 kernel: i40e 0000:1a:00.1: VSI seid 441 Rx ring 149 disable timeout
Aug 30 12:41:30 overcloud-compute-0 kernel: i40e 0000:1a:00.0: VF reset check timeout on VF 1
Aug 30 12:41:31 overcloud-compute-0 kernel: i40e 0000:1a:00.1: VF 5 failed to set multicast promiscuous mode err I40E_ERR_ADMIN_QUEUE_TIMEOUT aq_err OK
Aug 30 12:41:31 overcloud-compute-0 kernel: i40e 0000:1a:00.0: add vsi failed, err I40E_ERR_ADMIN_QUEUE_TIMEOUT aq_err OK
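To see which PF/VF pairs the i40e driver reports problems for, one option is to count the kernel lines that name a VF. This Python sketch assumes the kernel log has been saved as text (for example from journalctl -k); vf_error_counts is a hypothetical helper, and it only recognises lines in the i40e message formats shown above, so it is a triage aid rather than a diagnosis.

```python
import re
from collections import Counter

# i40e (PF driver) lines that name a VF, e.g.
#   "i40e 0000:1a:00.0: add vsi failed for VF 1, aq_err 0"
#   "i40e 0000:1a:00.0: VF reset check timeout on VF 1"
I40E_VF_RE = re.compile(
    r"i40e (?P<pf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): .*?\bVF (?P<vf>\d+)\b"
)


def vf_error_counts(journal_text):
    """Count i40e log lines per (PF PCI address, VF index)."""
    counts = Counter()
    for line in journal_text.splitlines():
        match = I40E_VF_RE.search(line)
        if match:
            counts[(match.group("pf"), int(match.group("vf")))] += 1
    return counts


if __name__ == "__main__":
    # A few lines copied from the log above, used as sample input.
    sample = """\
Aug 30 12:41:26 overcloud-compute-0 kernel: i40e 0000:1a:00.0: VF reset check timeout on VF 1
Aug 30 12:41:26 overcloud-compute-0 kernel: i40e 0000:1a:00.0: add vsi failed for VF 1, aq_err 0
Aug 30 12:41:28 overcloud-compute-0 kernel: i40e 0000:1a:00.1: add vsi failed for VF 4, aq_err 0
"""
    for (pf, vf), n in vf_error_counts(sample).most_common():
        print(f"{pf} VF {vf}: {n} line(s)")
```

Lines that do not name a VF (such as "add vsi failed, err I40E_ERR_ADMIN_QUEUE_TIMEOUT") are intentionally skipped.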

  • ip link show for the PF is missing some VFs. In the eno2 output below, VFs 1, 2, 4, 6, 12 and 21 are not listed:
[overcloud-compute-0 ~]# ip link show eno2
3: eno2: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 84:13:9f:31:ac:54 brd ff:ff:ff:ff:ff:ff
    vf 0 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 3 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 5 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 7 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 8 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 9 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 10 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 11 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 13 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 14 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 15 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 16 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 17 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 18 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 19 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 20 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 22 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 23 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 24 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 25 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 26 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 27 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 28 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
    vf 29 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
[root@overcloud-compute-0 ~]# exit
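To list exactly which VF indices are absent, the output can be compared against the number of VFs the PF is expected to expose. A minimal Python sketch, assuming the expected count is known (on the compute node it can be read from /sys/class/net/eno2/device/sriov_numvfs); missing_vfs is a hypothetical helper and only looks at lines beginning with "vf <n>".

```python
import re

# Each VF appears on its own line of "ip link show <pf>" output,
# e.g. "    vf 3 MAC 00:00:00:00:00:00, spoof checking on, ...".
VF_LINE_RE = re.compile(r"^\s*vf (\d+)\b", re.MULTILINE)


def missing_vfs(ip_link_output, num_vfs):
    """Return indices in range(num_vfs) that are not listed in the output."""
    present = {int(n) for n in VF_LINE_RE.findall(ip_link_output)}
    return [i for i in range(num_vfs) if i not in present]


if __name__ == "__main__":
    # VF indices copied from the eno2 output above; 30 is an assumed
    # total used only for illustration (check sriov_numvfs on the host).
    listed = [0, 3, 5, 7, 8, 9, 10, 11, 13, 14, 15, 16, 17, 18, 19, 20,
              22, 23, 24, 25, 26, 27, 28, 29]
    output = "\n".join(
        f"    vf {i} MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off"
        for i in listed
    )
    print(missing_vfs(output, 30))  # [1, 2, 4, 6, 12, 21]
```

If sriov_numvfs reports more than 30 VFs, indices above 29 would also be reported as missing, since the output above stops at VF 29.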

Environment

  • Red Hat OpenStack Platform 13.0 (RHOSP)
