ovn_controller down on 2 compute nodes and cannot be started again

Solution In Progress - Updated -

Issue

  • On our platform, 2 OVN Controller agents are in state UP, but 'Alive' is XXX:
| 06eca484-ee8b-4c63-b939-f9300394b87f | OVN Controller agent         | overcloud-controller-2 | n/a               | XXX   | UP    | ovn-controller                |
| 4451bce4-76a5-4331-b4af-b0dcd282b828 | OVN Controller agent         | overcloud-controller-2 | n/a               | XXX   | UP    | ovn-controller                |
  • Running sudo podman ps -a | grep ovn_controller shows that the containers are "Exited" on those 2 compute hosts. Forcing a start with podman start ovn_controller ends within a second or less, with the same state:
Exited (139) About a minute ago         ovn_controller
  • No error found in ovn-controller.log

  • A look at /var/log/messages shows a segfault:

May 14 11:20:10 cpt-hci-01 podman[342547]: 2020-05-14 11:20:10.727507433 +0200 CEST m=+0.159828810 container init 4a92a9553824fafe44f994cf19b00e5d4932d45903254ab7fb296856fd3b3ccf (image=undercloud:8787/osp16_containers-ovn-controller:16.0, name=ovn_controller)
May 14 11:20:10 cpt-hci-01 podman[342547]: 2020-05-14 11:20:10.741916628 +0200 CEST m=+0.174238019 container start 4a92a9553824fafe44f994cf19b00e5d4932d45903254ab7fb296856fd3b3ccf (image=undercloud:8787/osp16_containers-ovn-controller:16.0, name=ovn_controller)
May 14 11:20:11 cpt-hci-01 kernel: ovn-controller[342642]: segfault at 0 ip 00007f3506a741e2 sp 00007ffd056188f8 error 4 in libc-2.28.so[7f350691b000+1b9000]
May 14 11:20:11 cpt-hci-01 kernel: Code: 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 89 f8 31 d2 c5 c5 ef ff 09 f0 25 ff 0f 00 00 3d 80 0f 00 00 0f 8f 52 03 00 00 c5 fe 6f 0f <c5> f5 74 06 c5 fd da c1 c5 fd 74 c7 c5 fd d7 c8 85 c9 74 7a f3 0f
May 14 11:20:11 cpt-hci-01 systemd[1]: Started Process Core Dump (PID 342655/UID 0).
May 14 11:20:11 cpt-hci-01 systemd-coredump[342656]: Process 342642 (ovn-controller) of user 0 dumped core.#012#012Stack trace of thread 7:#012#0  0x00007f3506a741e2 __strcmp_avx2 (libc.so.6)#012#1  0x0000564c53a3b655 n/a (/usr/bin/ovn-controller)
  • We migrated some instances to check whether this was a memory problem: the containers still refused to start.
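
The symptoms above can be cross-checked with a small helper script. This is a minimal sketch, not part of the original report: the function names exit_signal and unescape_syslog are hypothetical. It relies on two conventions: an exit status above 128 usually means the process was terminated by signal (status - 128), so "Exited (139)" corresponds to signal 11 (SIGSEGV), which matches the kernel segfault line; and rsyslog escapes embedded newlines as #012 (octal for line feed), which is why the systemd-coredump stack trace appears on a single line.

```shell
#!/bin/sh
# Hypothetical helpers for reading the symptoms shown in this article.

# Print the signal number implied by an exit status.
# By shell convention, status > 128 means "killed by signal (status - 128)".
# Prints 0 when the status does not indicate a signal.
exit_signal() {
    status=$1
    if [ "$status" -gt 128 ]; then
        echo "$((status - 128))"
    else
        echo 0
    fi
}

# Read syslog text on stdin and turn rsyslog's #012 escapes
# (octal 012 = line feed) back into real line breaks.
unescape_syslog() {
    awk '{ gsub(/#012/, "\n"); print }'
}

# Example usage (assumes the container name and log path from this article):
#   exit_signal 139
#   grep systemd-coredump /var/log/messages | unescape_syslog
```

Running exit_signal 139 prints 11, the number of SIGSEGV. Piping the systemd-coredump line through unescape_syslog puts each stack frame on its own line, which makes the __strcmp_avx2 frame at the top of the trace easier to spot.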

Environment

  • Red Hat OpenStack Platform 16.0 (RHOSP)
