ovn_controller down on 2 compute nodes and cannot be started again

Solution In Progress

Issue

  • On our platform, 2 OVN Controller agents are in state UP, but their 'Alive' status is XXX:
| 06eca484-ee8b-4c63-b939-f9300394b87f | OVN Controller agent         | overcloud-controller-2 | n/a               | XXX   | UP    | ovn-controller                |
| 4451bce4-76a5-4331-b4af-b0dcd282b828 | OVN Controller agent         | overcloud-controller-2 | n/a               | XXX   | UP    | ovn-controller                |
  • Running sudo podman ps -a | grep ovn_controller shows that the containers are in the "Exited" state on those 2 compute hosts. Forcing a start with podman start ovn_controller does not help: the container exits again within a second and returns to the same state:
Exited (139) About a minute ago         ovn_controller
  • No errors are logged in ovn-controller.log.

  • A look at /var/log/messages shows a segfault:

May 14 11:20:10 cpt-hci-01 podman[342547]: 2020-05-14 11:20:10.727507433 +0200 CEST m=+0.159828810 container init 4a92a9553824fafe44f994cf19b00e5d4932d45903254ab7fb296856fd3b3ccf (image=undercloud:8787/osp16_containers-ovn-controller:16.0, name=ovn_controller)
May 14 11:20:10 cpt-hci-01 podman[342547]: 2020-05-14 11:20:10.741916628 +0200 CEST m=+0.174238019 container start 4a92a9553824fafe44f994cf19b00e5d4932d45903254ab7fb296856fd3b3ccf (image=undercloud:8787/osp16_containers-ovn-controller:16.0, name=ovn_controller)
May 14 11:20:11 cpt-hci-01 kernel: ovn-controller[342642]: segfault at 0 ip 00007f3506a741e2 sp 00007ffd056188f8 error 4 in libc-2.28.so[7f350691b000+1b9000]
May 14 11:20:11 cpt-hci-01 kernel: Code: 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 89 f8 31 d2 c5 c5 ef ff 09 f0 25 ff 0f 00 00 3d 80 0f 00 00 0f 8f 52 03 00 00 c5 fe 6f 0f <c5> f5 74 06 c5 fd da c1 c5 fd 74 c7 c5 fd d7 c8 85 c9 74 7a f3 0f
May 14 11:20:11 cpt-hci-01 systemd[1]: Started Process Core Dump (PID 342655/UID 0).
May 14 11:20:11 cpt-hci-01 systemd-coredump[342656]: Process 342642 (ovn-controller) of user 0 dumped core.#012#012Stack trace of thread 7:#012#0  0x00007f3506a741e2 __strcmp_avx2 (libc.so.6)#012#1  0x0000564c53a3b655 n/a (/usr/bin/ovn-controller)
  • We migrated some instances off the hosts to rule out a memory problem, but the containers still refused to start.
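
The exit status and the kernel segfault line above can be read together. The following is a minimal Python sketch, not part of the original report, that decodes both. It assumes the usual convention that an exit status above 128 means "terminated by signal (status - 128)", and that the kernel line follows the "segfault at ADDR ip IP sp SP error CODE in OBJECT[BASE+SIZE]" layout seen in the log. The function names describe_exit_status and parse_segfault are hypothetical helpers introduced only for this illustration.

```python
import re
import signal

def describe_exit_status(status: int) -> str:
    """Interpret an exit status using the 128 + signal-number convention (assumed)."""
    if status > 128:
        signum = status - 128
        try:
            return f"killed by {signal.Signals(signum).name}"
        except ValueError:
            return f"killed by unknown signal {signum}"
    return f"exited with code {status}"

# Layout of the kernel message as it appears in /var/log/messages above.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+) in (?P<obj>[^\[]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)

def parse_segfault(line: str):
    """Extract the faulting address, the mapped object, and the offset inside it."""
    match = SEGFAULT_RE.search(line)
    if match is None:
        return None
    ip = int(match["ip"], 16)
    base = int(match["base"], 16)
    return {
        "fault_address": int(match["addr"], 16),  # 0 means a NULL pointer was accessed
        "error_code": int(match["err"], 16),
        "object": match["obj"],
        "offset": ip - base,  # instruction offset relative to the object's load address
    }

if __name__ == "__main__":
    line = (
        "ovn-controller[342642]: segfault at 0 ip 00007f3506a741e2 "
        "sp 00007ffd056188f8 error 4 in libc-2.28.so[7f350691b000+1b9000]"
    )
    print(describe_exit_status(139))
    info = parse_segfault(line)
    print(info["object"], hex(info["offset"]), hex(info["fault_address"]))
```

Read this way, status 139 corresponds to signal 11 (SIGSEGV), and the fault address of 0 inside libc, combined with the __strcmp_avx2 frame in the core dump, is consistent with ovn-controller passing a NULL pointer to strcmp. This is an interpretation of the logs shown here, not a confirmed root cause.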

Environment

  • Red Hat OpenStack Platform 16.0 (RHOSP)
