Dataplane outage during an upgrade with ML2/OVN mechanism driver in Red Hat OpenStack Platform 16.1

Solution Verified - Updated -

Issue

Disclaimer: Links contained herein to external website(s) are provided for convenience only. Red Hat has not reviewed the links and is not responsible for the content or its availability. The inclusion of any link to an external website does not imply endorsement by Red Hat of the website or their entities, products or services. You agree that Red Hat is not responsible or liable for any loss or expenses that may result due to your use of (or reliance on) the external site or content.

A dataplane outage occurs during a cloud upgrade. Neutron uses the ML2/OVN mechanism driver.

Affected traffic seems to be what would be RELATED/ESTABLISHED traffic in conntrack. Connection tracking does not seem to work any more.

OVN controller logs on the compute node show the following:

$ egrep 'WARN|ERROR' var/log/containers/openvswitch/ovn-controller.log | wc -l
1856
$ egrep 'WARN|ERROR' var/log/containers/openvswitch/ovn-controller.log | tail -n 20
2020-11-06T00:56:00.578Z|07668|lflow|WARN|Dropped 31 log messages in last 82 seconds (most recently, 45 seconds ago) due to excessive rate
2020-11-06T00:56:00.578Z|07669|lflow|WARN|error parsing match "((ct.new && !ct.est) || (!ct.new && ct.est && !ct.rpl && ct_label.blocked == 1)) && (outport == @pg_557094d0_6698_42c5_abda_e9e52cd0b562 && ip4 && ip4.src == $pg_557094d0_6698_42c5_abda_e9e52cd0b562_ip4)": Syntax error at `@pg_557094d0_6698_42c5_abda_e9e52cd0b562' expecting port group name.
2020-11-06T00:56:00.582Z|07670|lflow|WARN|Dropped 7 log messages in last 82 seconds (most recently, 45 seconds ago) due to excessive rate
2020-11-06T00:56:00.582Z|07671|lflow|WARN|error parsing actions "ct_commit { ct_label.blocked = 0; }; next;": Syntax error at `{' expecting `;'.
2020-11-06T00:57:30.176Z|07672|lflow|WARN|Dropped 95 log messages in last 90 seconds (most recently, 37 seconds ago) due to excessive rate
2020-11-06T00:57:30.176Z|07673|lflow|WARN|error parsing match "((ct.new && !ct.est) || (!ct.new && ct.est && !ct.rpl && ct_label.blocked == 1)) && (outport == @pg_557094d0_6698_42c5_abda_e9e52cd0b562 && ip4 && ip4.src == $pg_557094d0_6698_42c5_abda_e9e52cd0b562_ip4)": Syntax error at `@pg_557094d0_6698_42c5_abda_e9e52cd0b562' expecting port group name.
2020-11-06T00:57:30.180Z|07674|lflow|WARN|Dropped 23 log messages in last 90 seconds (most recently, 37 seconds ago) due to excessive rate
2020-11-06T00:57:30.181Z|07675|lflow|WARN|error parsing actions "ct_commit { ct_label.blocked = 0; }; next;": Syntax error at `{' expecting `;'.
2020-11-06T00:58:44.198Z|07676|lflow|WARN|Dropped 31 log messages in last 74 seconds (most recently, 37 seconds ago) due to excessive rate
2020-11-06T00:58:44.198Z|07677|lflow|WARN|error parsing match "((ct.new && !ct.est) || (!ct.new && ct.est && !ct.rpl && ct_label.blocked == 1)) && (outport == @pg_557094d0_6698_42c5_abda_e9e52cd0b562 && ip4 && ip4.src == $pg_557094d0_6698_42c5_abda_e9e52cd0b562_ip4)": Syntax error at `@pg_557094d0_6698_42c5_abda_e9e52cd0b562' expecting port group name.
2020-11-06T00:58:44.202Z|07678|lflow|WARN|Dropped 7 log messages in last 74 seconds (most recently, 37 seconds ago) due to excessive rate
2020-11-06T00:58:44.202Z|07679|lflow|WARN|error parsing actions "ct_commit { ct_label.blocked = 0; }; next;": Syntax error at `{' expecting `;'.
2020-11-06T00:59:58.219Z|07680|lflow|WARN|Dropped 31 log messages in last 74 seconds (most recently, 37 seconds ago) due to excessive rate
2020-11-06T00:59:58.219Z|07681|lflow|WARN|error parsing match "((ct.new && !ct.est) || (!ct.new && ct.est && !ct.rpl && ct_label.blocked == 1)) && (outport == @pg_557094d0_6698_42c5_abda_e9e52cd0b562 && ip4 && ip4.src == $pg_557094d0_6698_42c5_abda_e9e52cd0b562_ip4)": Syntax error at `@pg_557094d0_6698_42c5_abda_e9e52cd0b562' expecting port group name.
2020-11-06T00:59:58.221Z|07682|lflow|WARN|Dropped 7 log messages in last 74 seconds (most recently, 37 seconds ago) due to excessive rate
2020-11-06T00:59:58.221Z|07683|lflow|WARN|error parsing actions "ct_commit { ct_label.blocked = 0; }; next;": Syntax error at `{' expecting `;'.
2020-11-06T01:01:12.229Z|07684|lflow|WARN|Dropped 47 log messages in last 74 seconds (most recently, 37 seconds ago) due to excessive rate
2020-11-06T01:01:12.229Z|07685|lflow|WARN|error parsing match "((ct.new && !ct.est) || (!ct.new && ct.est && !ct.rpl && ct_label.blocked == 1)) && (outport == @pg_557094d0_6698_42c5_abda_e9e52cd0b562 && ip4 && ip4.src == $pg_557094d0_6698_42c5_abda_e9e52cd0b562_ip4)": Syntax error at `@pg_557094d0_6698_42c5_abda_e9e52cd0b562' expecting port group name.
2020-11-06T01:01:12.231Z|07686|lflow|WARN|Dropped 11 log messages in last 74 seconds (most recently, 37 seconds ago) due to excessive rate
2020-11-06T01:01:12.231Z|07687|lflow|WARN|error parsing actions "ct_commit { ct_label.blocked = 0; }; next;": Syntax error at `{' expecting `;'.

As an example, the impact on ICMP can be observed by spawning 2 instances on the same compute node. Then, one can run tcpdump on each instance's tap interface, and then one can run a ping from one side to another. For example, the source instance is the following. Pinging the target instance does not work:

[root@instance1 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 00:00:00:00:00:10 brd ff:ff:ff:ff:ff:ff
    inet 192.168.123.10/24 brd 192.168.123.255 scope global noprefixroute dynamic eth0
       valid_lft 41203sec preferred_lft 41203sec
(...)
[root@instance1 ~]# arp -d 192.168.123.20
[root@instance1 ~]# arp -d 192.168.123.10
No ARP entry for 192.168.123.10
[root@instance1 ~]# ip neigh
192.168.123.1 dev eth0 lladdr 00:00:00:00:00:01 STALE
192.168.123.2 dev eth0 lladdr 00:00:00:00:00:02 STALE
[root@instance1 ~]# ping -c 5 -W1 192.168.123.20
PING 192.168.123.20 (192.168.123.20) 56(84) bytes of data.

--- 192.168.123.20 ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 3999ms
[root@instance1 ~]# ip neigh
192.168.123.1 dev eth0 lladdr 00:00:00:00:00:01 STALE
192.168.123.20 dev eth0 lladdr 00:00:00:00:00:20 REACHABLE
192.168.123.2 dev eth0 lladdr 00:00:00:00:00:02 STALE

Packet captures on the tap interface attached to instance 1 show that ARP resolves and that ICMP echo requests are sent, but replies are not received:

$ tshark -t ad -r instance-1-port-tap59863c3c-dd.pcap 'not udp and ((frame.number >= 664 and frame.number <= 714))'
  664 2020-11-06 01:55:01.687891 00:00:00:00:00:10 ff:ff:ff:ff:ff:ff 00:00:00:00:00:10 → Broadcast      ARP 42 Who has 192.168.123.20? Tell 192.168.123.10    
  665 2020-11-06 01:55:01.688951 00:00:00:00:00:20 00:00:00:00:00:10 00:00:00:00:00:20 → 00:00:00:00:00:10   ARP 42 192.168.123.20 is at 00:00:00:00:00:20    
  666 2020-11-06 01:55:01.689209 00:00:00:00:00:10 00:00:00:00:00:20  192.168.123.10 → 192.168.123.20   ICMP 98 Echo (ping) request  id=0x4527, seq=1/256, ttl=64       
  679 2020-11-06 01:55:02.687386 00:00:00:00:00:10 00:00:00:00:00:20  192.168.123.10 → 192.168.123.20   ICMP 98 Echo (ping) request  id=0x4527, seq=2/512, ttl=64      
  687 2020-11-06 01:55:03.687363 00:00:00:00:00:10 00:00:00:00:00:20  192.168.123.10 → 192.168.123.20   ICMP 98 Echo (ping) request  id=0x4527, seq=3/768, ttl=64    
  699 2020-11-06 01:55:04.687411 00:00:00:00:00:10 00:00:00:00:00:20  192.168.123.10 → 192.168.123.20   ICMP 98 Echo (ping) request  id=0x4527, seq=4/1024, ttl=64      
  707 2020-11-06 01:55:05.687339 00:00:00:00:00:10 00:00:00:00:00:20  192.168.123.10 → 192.168.123.20   ICMP 98 Echo (ping) request  id=0x4527, seq=5/1280, ttl=64    

Packet captures on the tap interface attached to instance 2 show that ARP resolves and that ICMP echo replies are sent back into OVS:

$ tshark -t ad -r instance-2-port-tap9719fffa-5f.pcap 'not udp and ((frame.number >= 770 and frame.number <= 849))'  
  779 2020-11-06 01:55:01.688312 00:00:00:00:00:10 ff:ff:ff:ff:ff:ff 00:00:00:00:00:10 → Broadcast      ARP 42 Who has 192.168.123.20? Tell 192.168.123.10    
  780 2020-11-06 01:55:01.688716 00:00:00:00:00:20 00:00:00:00:00:10 00:00:00:00:00:20 → 00:00:00:00:00:10   ARP 42 192.168.123.20 is at 00:00:00:00:00:20    
  781 2020-11-06 01:55:01.689855 00:00:00:00:00:10 00:00:00:00:00:20  192.168.123.10 → 192.168.123.20   ICMP 98 Echo (ping) request  id=0x4527, seq=1/256, ttl=64    
  782 2020-11-06 01:55:01.690219 00:00:00:00:00:20 00:00:00:00:00:10 192.168.123.20 → 192.168.123.10    ICMP 98 Echo (ping) reply    id=0x4527, seq=1/256, ttl=64 (request in 781)    0.364    
  798 2020-11-06 01:55:02.687448 00:00:00:00:00:10 00:00:00:00:00:20  192.168.123.10 → 192.168.123.20   ICMP 98 Echo (ping) request  id=0x4527, seq=2/512, ttl=64    
  799 2020-11-06 01:55:02.687805 00:00:00:00:00:20 00:00:00:00:00:10 192.168.123.20 → 192.168.123.10    ICMP 98 Echo (ping) reply    id=0x4527, seq=2/512, ttl=64 (request in 798)    0.357   
  807 2020-11-06 01:55:03.687391 00:00:00:00:00:10 00:00:00:00:00:20  192.168.123.10 → 192.168.123.20   ICMP 98 Echo (ping) request  id=0x4527, seq=3/768, ttl=64    
  808 2020-11-06 01:55:03.687719 00:00:00:00:00:20 00:00:00:00:00:10 192.168.123.20 → 192.168.123.10    ICMP 98 Echo (ping) reply    id=0x4527, seq=3/768, ttl=64 (request in 807)    0.328   
  820 2020-11-06 01:55:04.687439 00:00:00:00:00:10 00:00:00:00:00:20  192.168.123.10 → 192.168.123.20   ICMP 98 Echo (ping) request  id=0x4527, seq=4/1024, ttl=64    
  821 2020-11-06 01:55:04.687762 00:00:00:00:00:20 00:00:00:00:00:10 192.168.123.20 → 192.168.123.10    ICMP 98 Echo (ping) reply    id=0x4527, seq=4/1024, ttl=64 (request in 820)    0.323   
  829 2020-11-06 01:55:05.687367 00:00:00:00:00:10 00:00:00:00:00:20  192.168.123.10 → 192.168.123.20   ICMP 98 Echo (ping) request  id=0x4527, seq=5/1280, ttl=64    
  830 2020-11-06 01:55:05.687687 00:00:00:00:00:20 00:00:00:00:00:10 192.168.123.20 → 192.168.123.10    ICMP 98 Echo (ping) reply    id=0x4527, seq=5/1280, ttl=64 (request in 829)    0.32

Environment

Red Hat OpenStack Platform 16.1

Subscriber exclusive content

A Red Hat subscription provides unlimited access to our knowledgebase, tools, and much more.

Current Customers and Partners

Log in for full access

Log In

New to Red Hat?

Learn more about Red Hat subscriptions

Using a Red Hat product through a public cloud?

How to access this content