systemd layered slices left in a dead state

Solution Verified - Updated -

Environment

  • OpenShift Container Platform 3.9
  • Red Hat Enterprise Linux (RHEL) 7
  • systemd-219-62.el7_6.7
  • Keepalived (all versions)

Issue

  • Around 8 GB of logs accumulated over the last 50 hours on three of our OpenShift nodes. The error is:
Jun 18 12:30:48 example.com systemd[1]: Failed to set up mount unit: Invalid argument

  • On nodes with high pod counts, systemctl lists a large number of inactive (dead) units:

# systemctl list-units --type slice --all first*
    UNIT                     LOAD   ACTIVE   SUB    DESCRIPTION
    first-second-third.slice loaded active   active first-second-third.slice
    first-second.slice       loaded inactive dead   first-second.slice
    first.slice              loaded active   active first.slice
# systemctl list-units --all |grep "inact.*dead" | wc -l
93000
  • Keepalived processes are not killed after systemctl stop keepalived.service, so the VIP does not fail over from the master node to the backup node.
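
The count shown above can be reproduced with a slightly stricter filter than the grep pattern. The following is only a sketch: count_dead_units is a hypothetical helper name, not part of systemd, and it assumes the column layout shown in the listing above (ACTIVE followed directly by SUB).

```shell
#!/bin/sh
# count_dead_units: hypothetical helper, not part of systemd.
# Reads `systemctl list-units --all` output on stdin and prints how many
# lines contain the field "inactive" immediately followed by "dead".
count_dead_units() {
    awk '{
        for (i = 1; i < NF; i++) {
            if ($i == "inactive" && $(i + 1) == "dead") { n++; break }
        }
    } END { print n + 0 }'
}

# Example invocations on an affected node:
#   systemctl list-units --all --no-legend | count_dead_units
#   systemctl list-units --type slice --all --no-legend | count_dead_units
```

Matching on adjacent fields rather than on a regular expression across the whole line avoids counting units whose name or description merely contains both words.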

Resolution

Update to systemd-219-67.el7_7.1, shipped with advisory RHBA-2019:2356, or to a newer version.
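
To check whether a node already runs a build at or above the fixed version, the installed version-release string can be compared with the one named above. This is a rough sketch: version_at_least and FIXED_SYSTEMD are hypothetical names, and it relies on sort -V, which only approximates RPM's own version ordering.

```shell
#!/bin/sh
# Fixed version-release taken from the Resolution text.
FIXED_SYSTEMD="219-67.el7_7.1"

# version_at_least INSTALLED FIXED
# Hypothetical helper: succeeds (exit 0) when INSTALLED sorts at or after
# FIXED under `sort -V`. This approximates, but is not, RPM's comparison.
version_at_least() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Example usage on a node:
#   installed=$(rpm -q --queryformat '%{VERSION}-%{RELEASE}\n' systemd)
#   if version_at_least "$installed" "$FIXED_SYSTEMD"; then
#       echo "systemd $installed includes the fix"
#   else
#       echo "systemd $installed is older than $FIXED_SYSTEMD, update needed"
#   fi
```

For an authoritative answer, prefer checking the installed errata or comparing with RPM's own tooling; the sketch is meant for a quick sweep across many nodes.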

Root Cause

The bug was fixed by a backport of upstream pull request 8175.

Diagnostic Steps

  • With systemd-219-67.el7.x86_64, systemd does not kill the Keepalived processes (the parent and its "vrrp" and "checkers" children) when the service is stopped with either systemctl stop keepalived.service or service keepalived stop.
[root@r77 ~]# rpm -q systemd
systemd-219-67.el7.x86_64         <<-- issue is seen with this version

[root@r77 ~]# systemctl stop keepalived

[root@r77 ~]# systemctl status keepalived.service 
keepalived.service - LVS and VRRP High Availability Monitor
   Loaded: loaded (/usr/lib/systemd/system/keepalived.service; disabled; vendor preset: disabled)
   Active: inactive (dead) since Thu 2020-02-06 10:20:26 CET; 4min 26s ago  <<-- Inactive dead
  Process: 2359 ExecStart=/usr/sbin/keepalived $KEEPALIVED_OPTIONS (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/keepalived.service
           ├─2360 /usr/sbin/keepalived -D     <<-- all processes are still listed
           ├─2361 /usr/sbin/keepalived -D
           └─2362 /usr/sbin/keepalived -D

Feb 06 10:20:15 node1.example.com Keepalived_vrrp[2362]: Sending gratuitous ARP on eth0 for 10.0.0.1
Feb 06 10:20:15 node1.example.com Keepalived_vrrp[2362]: Sending gratuitous ARP on eth0 for 10.0.0.1
Feb 06 10:20:15 node1.example.com Keepalived_vrrp[2362]: Sending gratuitous ARP on eth0 for 10.0.0.1
Feb 06 10:20:20 node1.example.com Keepalived_vrrp[2362]: Sending gratuitous ARP on eth0 for 10.0.0.1
Feb 06 10:20:20 node1.example.com Keepalived_vrrp[2362]: VRRP_Instance(VI_1) Sending/queueing gratuitous ARPs on eth0 for 10.0.0.1
Feb 06 10:20:20 node1.example.com Keepalived_vrrp[2362]: Sending gratuitous ARP on eth0 for 10.0.0.1
Feb 06 10:20:20 node1.example.com Keepalived_vrrp[2362]: Sending gratuitous ARP on eth0 for 10.0.0.1
Feb 06 10:20:20 node1.example.com Keepalived_vrrp[2362]: Sending gratuitous ARP on eth0 for 10.0.0.1
Feb 06 10:20:20 node1.example.com Keepalived_vrrp[2362]: Sending gratuitous ARP on eth0 for 10.0.0.1
Feb 06 10:20:26 node1.example.com systemd[1]: Stopped LVS and VRRP High Availability Monitor.
  • With systemd-219-67.el7_7.1 or later, the service stops cleanly and no processes remain in its control group:
# rpm -qa systemd
systemd-219-67.el7_7.2.x86_64

# systemctl stop keepalived.service 
# systemctl status keepalived -l
 keepalived.service - LVS and VRRP High Availability Monitor
   Loaded: loaded (/usr/lib/systemd/system/keepalived.service; disabled; vendor preset: disabled)
   Active: inactive (dead)

Feb 03 23:03:41 node1.example.com Keepalived_vrrp[28237]: Sending gratuitous ARP on eth0 for 10.0.0.1
Feb 03 23:03:41 node1.example.com Keepalived_vrrp[28237]: Sending gratuitous ARP on eth0 for 10.0.0.1
Feb 07 13:00:01 node1.example.com systemd[1]: Stopping LVS and VRRP High Availability Monitor...
Feb 07 13:00:01 node1.example.com Keepalived[28234]: Stopping
Feb 07 13:00:01 node1.example.com Keepalived_vrrp[28237]: VRRP_Instance(VI_1) sent 0 priority
Feb 07 13:00:01 node1.example.com Keepalived_vrrp[28237]: VRRP_Instance(VI_1) removing protocol VIPs.
Feb 07 13:00:01 node1.example.com Keepalived_healthcheckers[28236]: Stopped
Feb 07 13:00:02 node1.example.com Keepalived_vrrp[28237]: Stopped
Feb 07 13:00:02 node1.example.com Keepalived[28234]: Stopped Keepalived v1.3.5 (03/19,2017), git commit v1.3.5-6-g6fa32f2
Feb 07 13:00:02 node1.example.com systemd[1]: Stopped LVS and VRRP High Availability Monitor.
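
The difference between the two transcripts above is whether process lines remain under "CGroup:" after the stop. A minimal sketch for checking this automatically follows; count_cgroup_pids is a hypothetical helper that parses `systemctl status` text, and it assumes the layout shown above (a "CGroup:" line followed by one line per process, ended by a blank line).

```shell
#!/bin/sh
# count_cgroup_pids: hypothetical helper, not part of systemd.
# Reads `systemctl status <unit>` output on stdin and prints the number of
# process lines listed under "CGroup:". A process line is recognized by its
# first field ending in a digit (tree glyph followed by the PID).
count_cgroup_pids() {
    awk '
        /^[ \t]*CGroup:/           { in_cgroup = 1; next }
        in_cgroup && NF == 0       { in_cgroup = 0 }
        in_cgroup && $1 ~ /[0-9]$/ { n++ }
        END                        { print n + 0 }
    '
}

# Example usage after stopping the service:
#   systemctl stop keepalived.service
#   left=$(systemctl status keepalived.service | count_cgroup_pids)
#   [ "$left" -eq 0 ] || echo "keepalived left $left process(es) behind"
```

A non-zero count on a unit reported as inactive (dead) matches the affected behaviour shown in the first transcript.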

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
