Random OVN NorthBound database connectivity issues (and possibly related symptoms)

Solution Verified - Updated -

Issue

  • While a particular symptom we're seeing is most likely related to a network L2 event happening at the UCS Fabric Interconnects layer, we'd like to report and request feedback on a set of items:

1) A network L2 event affecting one of the virtual IPs that pcs manages. We had an outage a couple of days ago with a very similar symptom: a set of virtual IPs moved from one chassis to another because a controller blade was experiencing network issues, and we lost all cluster endpoints for roughly 15 minutes. We found that the CAM and ARP tables of UCS and EVPN did not match. The refresh interval for these tables differs (4 hours on UCS, 300 seconds on the rest of the network), which can be problematic.

2021-02-05T11:20:25.957Z|00010|reconnect|INFO|ssl:10.10.10.10:6641: connecting...
2021-02-05T11:20:25.958Z|00011|reconnect|INFO|ssl:10.10.10.10:6641: connection attempt failed (No route to host)
2021-02-05T11:20:26.959Z|00012|reconnect|INFO|ssl:10.10.10.10:6641: connecting...
2021-02-05T11:20:26.960Z|00013|reconnect|INFO|ssl:10.10.10.10:6641: connection attempt failed (No route to host)
2021-02-05T11:20:26.960Z|00014|reconnect|INFO|ssl:10.10.10.10:6641: waiting 2 seconds before reconnect
2021-02-05T11:20:28.961Z|00015|reconnect|INFO|ssl:10.10.10.10:6641: connecting...
2021-02-05T11:20:28.964Z|00016|reconnect|INFO|ssl:10.10.10.10:6641: connected
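
To compare a connectivity gap like the one above against other events (for example a virtual IP move), it can help to extract the time between the first failed attempt and the next successful connection. The following is a minimal Python sketch, not part of the original report. It assumes the `timestamp|sequence|module|level|message` layout seen in the excerpt; the function names `parse_reconnect` and `outage_windows` are hypothetical.

```python
from datetime import datetime

# Sample copied from the log excerpt above.
LOG = """\
2021-02-05T11:20:25.957Z|00010|reconnect|INFO|ssl:10.10.10.10:6641: connecting...
2021-02-05T11:20:25.958Z|00011|reconnect|INFO|ssl:10.10.10.10:6641: connection attempt failed (No route to host)
2021-02-05T11:20:26.959Z|00012|reconnect|INFO|ssl:10.10.10.10:6641: connecting...
2021-02-05T11:20:26.960Z|00013|reconnect|INFO|ssl:10.10.10.10:6641: connection attempt failed (No route to host)
2021-02-05T11:20:26.960Z|00014|reconnect|INFO|ssl:10.10.10.10:6641: waiting 2 seconds before reconnect
2021-02-05T11:20:28.961Z|00015|reconnect|INFO|ssl:10.10.10.10:6641: connecting...
2021-02-05T11:20:28.964Z|00016|reconnect|INFO|ssl:10.10.10.10:6641: connected
"""


def parse_reconnect(lines):
    """Yield (timestamp, target, message) for each 'reconnect' log line."""
    for line in lines:
        parts = line.strip().split("|", 4)
        if len(parts) != 5 or parts[2] != "reconnect":
            continue  # not a reconnect-module line in the assumed layout
        ts = datetime.strptime(parts[0], "%Y-%m-%dT%H:%M:%S.%fZ")
        target, _, message = parts[4].partition(": ")
        yield ts, target, message


def outage_windows(events):
    """Return (first_failure, reconnected_at, failed_attempts) tuples."""
    windows = []
    start = None
    failures = 0
    for ts, _target, message in events:
        if message.startswith("connection attempt failed"):
            if start is None:
                start = ts
            failures += 1
        elif message == "connected" and start is not None:
            windows.append((start, ts, failures))
            start, failures = None, 0
    return windows


if __name__ == "__main__":
    for start, end, failures in outage_windows(parse_reconnect(LOG.splitlines())):
        print(start.isoformat(), (end - start).total_seconds(), failures)
```

The sketch treats all targets as one stream; with several remotes in the same log, the windows would need to be tracked per target.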

2) The OVN SouthBound database appears to be stuck in read-only mode. This issue was discovered when a set of Neutron ports stopped routing; investigation showed that the chassis where the affected VMs were scheduled was missing ovn-cms-options="enable-chassis-as-gw".

Attempting to apply a fix with:

 ovn-sbctl set Chassis e812b334-0950-4ab2-9770-450acffdec4b external_ids:ovn-cms-options="enable-chassis-as-gw"

Resulted in:

ovn-sbctl: transaction error: {"details":"update operation not allowed when database server is in read only mode","error":"not allowed"}
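
When the fix is applied from a script, it can be useful to tell this read-only rejection apart from other transaction errors. The following is a minimal Python sketch, not part of the original report. It assumes the error text is captured as a string (for example from the command's standard error) and that the payload after `transaction error: ` is valid JSON, as in the message above; the function names are hypothetical.

```python
import json

# Error text copied from the message above.
STDERR = (
    'ovn-sbctl: transaction error: {"details":"update operation not allowed '
    'when database server is in read only mode","error":"not allowed"}'
)


def parse_transaction_error(text):
    """Return the JSON payload of the first 'transaction error:' line, or None."""
    marker = "transaction error: "
    for line in text.splitlines():
        idx = line.find(marker)
        if idx == -1:
            continue
        try:
            return json.loads(line[idx + len(marker):])
        except json.JSONDecodeError:
            return None  # payload truncated or not JSON
    return None


def is_read_only_error(payload):
    """True if the payload matches the read-only rejection shown above."""
    return (
        payload is not None
        and payload.get("error") == "not allowed"
        and "read only mode" in payload.get("details", "")
    )


if __name__ == "__main__":
    print(is_read_only_error(parse_transaction_error(STDERR)))
```

The match on the `details` text is an assumption based on this single message; the wording may differ between versions.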

3) We're seeing some constantly repeated database inconsistencies for the SouthBound database:

2021-02-05T11:05:15.566Z|00183|ovsdb_idl|WARN|transaction error: {"details":"Transaction causes multiple rows in \"MAC_Binding\" table to have identical values (lrp-707d1929-fc3b-4988-90a8-ec8b0f0b23b9 and \"0000::0000:0000:00:0004\") for index on columns \"logical_port\" and \"ip\".  First row, with UUID 8906615d-de36-4c4e-a8da-b581f0f71631, was inserted by this transaction.  Second row, with UUID 3f627f9b-2d2b-40a2-abeb-e47e6fa8054f, existed in the database before this transaction and was not modified by the transaction.","error":"constraint violation"}
2021-02-05T11:06:23.043Z|00184|ovsdb_idl|WARN|transaction error: {"details":"Transaction causes multiple rows in \"MAC_Binding\" table to have identical values (lrp-4570b5e9-cac1-4b88-8b2f-95365045cd84 and \"0000::0000:0000:00:0001\") for index on columns \"logical_port\" and \"ip\".  First row, with UUID 11d554bf-314c-4ac5-8949-1b056ac7b593, was inserted by this transaction.  Second row, with UUID f5b5e046-42cf-4100-8f8b-48159ab957ca, existed in the database before this transaction and was not modified by the transaction.","error":"constraint violation"}
2021-02-05T11:06:32.144Z|00185|ovsdb_idl|WARN|transaction error: {"details":"Transaction causes multiple rows in \"MAC_Binding\" table to have identical values (lrp-05c9a8cc-6432-4a51-8910-0831dbfeba5f and \"0000::0000:0000:00:0003\") for index on columns \"logical_port\" and \"ip\".  First row, with UUID 20083c19-c1e7-4e33-b0a1-5850537fdbcd, existed in the database before this transaction and was not modified by the transaction.  Second row, with UUID 847a63cd-4d33-4a4c-990a-79ac39cb482e, was inserted by this transaction.","error":"constraint violation"}
2021-02-05T11:07:29.805Z|00186|ovsdb_idl|WARN|transaction error: {"details":"Transaction causes multiple rows in \"MAC_Binding\" table to have identical values (lrp-19bfae1e-d982-4bff-9af7-eaa1733ef19b and \"0000::0000:0000:00:0002\") for index on columns \"logical_port\" and \"ip\".  First row, with UUID 8d982a4d-e416-48f8-932f-188c5848e286, existed in the database before this transaction and was not modified by the transaction.  Second row, with UUID 7a6c5ce5-dec1-420c-be0e-b54c51c70973, was inserted by this transaction.","error":"constraint violation"}
2021-02-05T11:09:26.181Z|00187|ovsdb_idl|WARN|transaction error: {"details":"Transaction causes multiple rows in \"MAC_Binding\" table to have identical value$
 (lrp-19bfae1e-d982-4bff-9af7-eaa1733ef19b and \"0000::0000:0000:00:0003\") for index on columns \"logical_port\" and \"ip\".  First row, with UUID 614169a2-7$
77-4172-ab2c-5d9a593753d2, was inserted by this transaction.  Second row, with UUID 4570a884-9e5a-4eed-a8a3-afad20d4ee57, existed in the database before this $
ransaction and was not modified by the transaction.","error":"constraint violation"}    
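
To see which logical ports and addresses keep colliding, the warnings above can be reduced to a few fields each. The following is a minimal Python sketch, not part of the original report. It assumes each warning is available as one unwrapped line whose payload after `transaction error: ` is valid JSON; the sample is the first entry above with its wrapped lines rejoined, and the name `parse_violation` is hypothetical. Entries truncated in the capture (marked with `$`) do not parse and are skipped.

```python
import json
import re

# First warning above, with its wrapped lines rejoined into a single line.
SAMPLE = (
    r'2021-02-05T11:05:15.566Z|00183|ovsdb_idl|WARN|transaction error: '
    r'{"details":"Transaction causes multiple rows in \"MAC_Binding\" table '
    r'to have identical values (lrp-707d1929-fc3b-4988-90a8-ec8b0f0b23b9 and '
    r'\"0000::0000:0000:00:0004\") for index on columns \"logical_port\" and '
    r'\"ip\".  First row, with UUID 8906615d-de36-4c4e-a8da-b581f0f71631, was '
    r'inserted by this transaction.  Second row, with UUID '
    r'3f627f9b-2d2b-40a2-abeb-e47e6fa8054f, existed in the database before this '
    r'transaction and was not modified by the transaction.",'
    r'"error":"constraint violation"}'
)

DETAILS_RE = re.compile(
    r'multiple rows in "(?P<table>[^"]+)" table to have identical values '
    r'\((?P<port>\S+) and "(?P<ip>[^"]+)"\)'
)
ROW_RE = re.compile(
    r'with UUID (?P<uuid>[0-9a-f-]{36}), '
    r'(?P<state>was inserted by this transaction'
    r'|existed in the database before this transaction)'
)


def parse_violation(log_line):
    """Return a dict describing a constraint violation, or None if unparsable."""
    _, sep, payload = log_line.partition("transaction error: ")
    if not sep:
        return None
    try:
        error = json.loads(payload)
    except json.JSONDecodeError:
        return None  # for example, a line truncated in the capture
    if error.get("error") != "constraint violation":
        return None
    details = error.get("details", "")
    match = DETAILS_RE.search(details)
    if match is None:
        return None
    result = {
        "table": match["table"],
        "logical_port": match["port"],
        "ip": match["ip"],
        "inserted": None,
        "existing": None,
    }
    # The "First row"/"Second row" order varies between entries, so each
    # UUID is keyed by its described state rather than by its position.
    for row in ROW_RE.finditer(details):
        key = "inserted" if row["state"].startswith("was inserted") else "existing"
        result[key] = row["uuid"]
    return result


if __name__ == "__main__":
    print(parse_violation(SAMPLE))
```

Grouping the results by `logical_port` and `ip` would show whether the same pairs recur across warnings.
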

  • We'd love to receive your feedback here while we keep investigating ourselves. I'm also attaching the sosreports for the 3 controllers and 3 networker nodes.

  • Some Neutron ports appear to be unroutable because ovn-cms-options="enable-chassis-as-gw" is missing on the chassis of the target compute nodes. On top of this, the SouthBound database being in read-only mode prevents us from applying a fix. It's also worth mentioning that we'd like to rule out any NorthBound/SouthBound database inconsistency at this point.

Environment

  • Red Hat OpenStack Platform 16.1 (RHOSP)
