Unresponsive system and 100% hypervisor CPU usage due to misaligned PMD CPU pinning with OVS DPDK in Red Hat OpenStack Platform 10

Solution In Progress

Issue

The host OS is unresponsive, and the CPUs that should be reserved for the hypervisor show 100% CPU usage in user space.

Inspecting the CPU affinity of the Open vSwitch PMD threads shows that it overlaps with the cores that should be reserved for the host OS:

[root@overcloud-compute-2 ~]# grep Aff /etc/systemd/system.conf
#CPUAffinity=1 2
CPUAffinity=0 2 4 6 56 58 60 62
[root@overcloud-compute-2 ~]# ps -Tp `pidof ovs-vswitchd` | grep pmd | awk '{print $2}' | xargs -I {} taskset -c -p {}
pid 7421's current affinity list: 0,2,4,6,56,58,60,62
pid 7422's current affinity list: 0,2,4,6,56,58,60,62
pid 7423's current affinity list: 0,2,4,6,56,58,60,62
pid 7424's current affinity list: 0,2,4,6,56,58,60,62
pid 7425's current affinity list: 0,2,4,6,56,58,60,62
pid 7426's current affinity list: 0,2,4,6,56,58,60,62
pid 7427's current affinity list: 0,2,4,6,56,58,60,62
pid 7428's current affinity list: 0,2,4,6,56,58,60,62
pid 7429's current affinity list: 0,2,4,6,56,58,60,62
pid 7430's current affinity list: 0,2,4,6,56,58,60,62
pid 7431's current affinity list: 0,2,4,6,56,58,60,62
pid 7432's current affinity list: 0,2,4,6,56,58,60,62
pid 7433's current affinity list: 0,2,4,6,56,58,60,62
pid 7434's current affinity list: 0,2,4,6,56,58,60,62
pid 7435's current affinity list: 0,2,4,6,56,58,60,62
pid 7436's current affinity list: 0,2,4,6,56,58,60,62
pid 7437's current affinity list: 0,2,4,6,56,58,60,62
pid 7438's current affinity list: 0,2,4,6,56,58,60,62
pid 7439's current affinity list: 0,2,4,6,56,58,60,62
pid 7440's current affinity list: 0,2,4,6,56,58,60,62
pid 7441's current affinity list: 0,2,4,6,56,58,60,62
pid 7442's current affinity list: 0,2,4,6,56,58,60,62
pid 7443's current affinity list: 0,2,4,6,56,58,60,62
pid 7444's current affinity list: 0,2,4,6,56,58,60,62
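
The comparison above can also be scripted. The following is a minimal sketch, not part of the original solution: the helper names expand_cpus, cpu_overlap and check_pmd_overlap are hypothetical, and it assumes the host reservation is taken from the CPUAffinity= line in /etc/systemd/system.conf and that seq, sort, xargs, ps, pidof and taskset are available, as on a typical RHEL compute node. It only reports overlapping cores and changes nothing.

```shell
# Hypothetical helpers (names are illustrative, not from the article).

# Expand a CPU list such as "0,2,4-6" or "0 2 4 6" into one CPU per line.
expand_cpus() {
    for item in $(echo "$1" | tr ',' ' '); do
        case $item in
            *-*) seq "${item%-*}" "${item#*-}" ;;   # range, e.g. 4-6
            *)   echo "$item" ;;                    # single CPU
        esac
    done
}

# Print the CPUs that appear in both lists, sorted and space separated.
cpu_overlap() {
    for a in $(expand_cpus "$1"); do
        for b in $(expand_cpus "$2"); do
            [ "$a" -eq "$b" ] && echo "$a"
        done
    done | sort -nu | xargs
}

# Report every PMD thread whose affinity includes a host-reserved CPU.
# Assumption: the host reservation is the active CPUAffinity= setting.
check_pmd_overlap() {
    host_cpus=$(sed -n 's/^CPUAffinity=//p' /etc/systemd/system.conf)
    for tid in $(ps -Tp "$(pidof ovs-vswitchd)" | awk '/pmd/ {print $2}'); do
        pmd_cpus=$(taskset -c -p "$tid" | sed 's/.*: //')
        common=$(cpu_overlap "$host_cpus" "$pmd_cpus")
        [ -n "$common" ] && echo "PMD thread $tid shares host CPUs: $common"
    done
}
```

Running check_pmd_overlap as root on the compute node would print one line per PMD thread that can be scheduled on a host-reserved core. On a correctly partitioned host it should print nothing.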

Symptoms can include commands such as lsof or ps becoming unresponsive. In one customer environment, the neutron-openvswitch-agent logged the following error:

2018-08-03 22:49:22.073 11877 DEBUG neutron.agent.linux.utils [req-7c49d003-275b-4361-97be-e4a2d7200c29 - - - - -] Running command: ['ps', '--ppid', '11919', '-o', 'pid='] create_process /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:89
2018-08-03 22:49:23.728 11877 DEBUG neutron.agent.linux.utils [req-7c49d003-275b-4361-97be-e4a2d7200c29 - - - - -] Exit code: 0 execute /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:150
2018-08-03 22:49:23.729 11877 DEBUG neutron.agent.linux.utils [req-7c49d003-275b-4361-97be-e4a2d7200c29 - - - - -] Running command: ['ps', '--ppid', '11921', '-o', 'pid='] create_process /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:89
2018-08-03 22:49:25.239 11877 DEBUG neutron.agent.linux.utils [req-7c49d003-275b-4361-97be-e4a2d7200c29 - - - - -] Running command: ['ps', '--ppid', '11919', '-o', 'pid='] create_process /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:89
2018-08-03 22:49:25.385 11877 CRITICAL neutron [-] Timeout: 5 seconds
2018-08-03 22:49:25.385 11877 ERROR neutron Traceback (most recent call last):
2018-08-03 22:49:25.385 11877 ERROR neutron   File "/usr/bin/neutron-openvswitch-agent", line 10, in <module>
2018-08-03 22:49:25.385 11877 ERROR neutron     sys.exit(main())
2018-08-03 22:49:25.385 11877 ERROR neutron   File "/usr/lib/python2.7/site-packages/neutron/cmd/eventlet/plugins/ovs_neutron_agent.py", line 20, in main
2018-08-03 22:49:25.385 11877 ERROR neutron     agent_main.main()
2018-08-03 22:49:25.385 11877 ERROR neutron   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/main.py", line 51, in main
2018-08-03 22:49:25.385 11877 ERROR neutron     mod.main()
2018-08-03 22:49:25.385 11877 ERROR neutron   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/main.py", line 35, in main
2018-08-03 22:49:25.385 11877 ERROR neutron     'neutron.plugins.ml2.drivers.openvswitch.agent.'
2018-08-03 22:49:25.385 11877 ERROR neutron   File "/usr/lib/python2.7/site-packages/ryu/base/app_manager.py", line 375, in run_apps
2018-08-03 22:49:25.385 11877 ERROR neutron     hub.joinall(services)
2018-08-03 22:49:25.385 11877 ERROR neutron   File "/usr/lib/python2.7/site-packages/ryu/lib/hub.py", line 97, in joinall
2018-08-03 22:49:25.385 11877 ERROR neutron     t.wait()
2018-08-03 22:49:25.385 11877 ERROR neutron   File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 175, in wait
2018-08-03 22:49:25.385 11877 ERROR neutron     return self._exit_event.wait()
2018-08-03 22:49:25.385 11877 ERROR neutron   File "/usr/lib/python2.7/site-packages/eventlet/event.py", line 125, in wait
2018-08-03 22:49:25.385 11877 ERROR neutron     current.throw(*self._exc)
2018-08-03 22:49:25.385 11877 ERROR neutron   File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 214, in main
2018-08-03 22:49:25.385 11877 ERROR neutron     result = function(*args, **kwargs)
2018-08-03 22:49:25.385 11877 ERROR neutron   File "/usr/lib/python2.7/site-packages/ryu/lib/hub.py", line 59, in _launch
2018-08-03 22:49:25.385 11877 ERROR neutron     raise e
2018-08-03 22:49:25.385 11877 ERROR neutron Timeout: 5 seconds
2018-08-03 22:49:25.385 11877 ERROR neutron
2018-08-03 22:49:25.502 11877 INFO oslo_rootwrap.client [-] Stopping rootwrap daemon process with pid=11906

The Open vSwitch agent also flapped in the output of neutron agent-list, alternating between xxx (down) and :-) (alive):

| <UUID> | Open vSwitch agent | overcloud-compute-0.localdomain |                   | xxx   | True           | neutron-openvswitch-agent |
| <UUID> | Open vSwitch agent | overcloud-compute-0.localdomain |                   | :-)   | True           | neutron-openvswitch-agent |

Environment

Red Hat OpenStack Platform 10
