systemd layered slices left in a dead state
Environment
- OpenShift Container Platform 3.9
- Red Hat Enterprise Linux (RHEL) 7
- systemd-219-62.el7_6.7
- Keepalived (all versions)
Issue
- Around 8 GB of logs accumulated over the last 50 hours on three of our OpenShift nodes. The error is:
Jun 18 12:30:48 example.com systemd[1]: Failed to set up mount unit: Invalid argument
On nodes with high pod counts, many units are listed as inactive (dead):
# systemctl list-units --type slice --all first*
UNIT LOAD ACTIVE SUB DESCRIPTION
first-second-third.slice loaded active active first-second-third.slice
first-second.slice loaded inactive dead first-second.slice
first.slice loaded active active first.slice
# systemctl list-units --all |grep "inact.*dead" | wc -l
93000
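The grep pattern above matches "inact.*dead" anywhere in a line, including the description. A stricter count can compare the ACTIVE and SUB columns directly. The following is a minimal sketch: count_dead_units is a hypothetical helper name, and the handling of a leading "●" marker is an assumption about how some systemd builds flag units in this listing.

```shell
# Count units whose ACTIVE column is "inactive" and whose SUB column is "dead".
# Reads `systemctl list-units --all` style output on stdin.
# count_dead_units is a hypothetical helper, not part of systemd.
count_dead_units() {
    awk '
        # Assumption: some builds prefix flagged units with a "●" marker,
        # which shifts every column one field to the right.
        { off = ($1 == "●") ? 1 : 0 }
        $(3 + off) == "inactive" && $(4 + off) == "dead" { n++ }
        END { print n + 0 }
    '
}

# Example invocation on a live node:
#   systemctl list-units --all --no-legend | count_dead_units
```

Restricting the input with `--type slice` gives the number of dead slices only.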
Keepalived processes are not killed after systemctl stop keepalived.service, so the VIP does not fail over from the master to the backup node.
Resolution
Update to systemd-219-67.el7_7.1, shipped with advisory RHBA-2019:2356, or newer.
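To check whether a node already runs a fixed build, the installed version-release string can be compared against 219-67.el7_7.1. The sketch below is an approximation only: it uses GNU `sort -V`, which is not RPM's own comparison algorithm, and version_at_least and FIXED_BUILD are hypothetical names introduced for illustration.

```shell
# First fixed build named in this article (version-release, without the arch).
FIXED_BUILD="219-67.el7_7.1"

# Succeeds (exit 0) when $1 sorts at or above $2 under GNU `sort -V`.
# Assumption: `sort -V` ordering is close enough to RPM's comparison
# for these systemd build strings; it is not guaranteed to match in general.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example invocation on a live node:
#   installed=$(rpm -q --queryformat '%{VERSION}-%{RELEASE}' systemd)
#   if version_at_least "$installed" "$FIXED_BUILD"; then
#       echo "systemd $installed includes the fix"
#   else
#       echo "systemd $installed is older than $FIXED_BUILD"
#   fi
```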
Root Cause
The bug was fixed with a backport of upstream pull request 8175.
Diagnostic Steps
- With systemd-219-67.el7.x86_64, systemd does not kill the Keepalived processes (the parent and its children, "vrrp" and "checkers") when the service is stopped with either systemctl stop keepalived.service or service keepalived stop.
[root@r77 ~]# rpm -q systemd
systemd-219-67.el7.x86_64 <<-- issue is seen with this version
[root@r77 ~]# systemctl stop keepalived
[root@r77 ~]# systemctl status keepalived.service
keepalived.service - LVS and VRRP High Availability Monitor
Loaded: loaded (/usr/lib/systemd/system/keepalived.service; disabled; vendor preset: disabled)
Active: inactive (dead) since Thu 2020-02-06 10:20:26 CET; 4min 26s ago <<-- Inactive dead
Process: 2359 ExecStart=/usr/sbin/keepalived $KEEPALIVED_OPTIONS (code=exited, status=0/SUCCESS)
CGroup: /system.slice/keepalived.service
├─2360 /usr/sbin/keepalived -D <<-- all processes are still listed
├─2361 /usr/sbin/keepalived -D
└─2362 /usr/sbin/keepalived -D
Feb 06 10:20:15 node1.example.com Keepalived_vrrp[2362]: Sending gratuitous ARP on eth0 for 10.0.0.1
Feb 06 10:20:15 node1.example.com Keepalived_vrrp[2362]: Sending gratuitous ARP on eth0 for 10.0.0.1
Feb 06 10:20:15 node1.example.com Keepalived_vrrp[2362]: Sending gratuitous ARP on eth0 for 10.0.0.1
Feb 06 10:20:20 node1.example.com Keepalived_vrrp[2362]: Sending gratuitous ARP on eth0 for 10.0.0.1
Feb 06 10:20:20 node1.example.com Keepalived_vrrp[2362]: VRRP_Instance(VI_1) Sending/queueing gratuitous ARPs on eth0 for 10.0.0.1
Feb 06 10:20:20 node1.example.com Keepalived_vrrp[2362]: Sending gratuitous ARP on eth0 for 10.0.0.1
Feb 06 10:20:20 node1.example.com Keepalived_vrrp[2362]: Sending gratuitous ARP on eth0 for 10.0.0.1
Feb 06 10:20:20 node1.example.com Keepalived_vrrp[2362]: Sending gratuitous ARP on eth0 for 10.0.0.1
Feb 06 10:20:20 node1.example.com Keepalived_vrrp[2362]: Sending gratuitous ARP on eth0 for 10.0.0.1
Feb 06 10:20:26 node1.example.com systemd[1]: Stopped LVS and VRRP High Availability Monitor.
- With systemd-219-67.el7_7.1 or later:
# rpm -qa systemd
systemd-219-67.el7_7.2.x86_64
# systemctl stop keepalived.service
# systemctl status keepalived -l
keepalived.service - LVS and VRRP High Availability Monitor
Loaded: loaded (/usr/lib/systemd/system/keepalived.service; disabled; vendor preset: disabled)
Active: inactive (dead)
Feb 03 23:03:41 node1.example.com Keepalived_vrrp[28237]: Sending gratuitous ARP on eth0 for 10.0.0.1
Feb 03 23:03:41 node1.example.com Keepalived_vrrp[28237]: Sending gratuitous ARP on eth0 for 10.0.0.1
Feb 07 13:00:01 node1.example.com systemd[1]: Stopping LVS and VRRP High Availability Monitor...
Feb 07 13:00:01 node1.example.com Keepalived[28234]: Stopping
Feb 07 13:00:01 node1.example.com Keepalived_vrrp[28237]: VRRP_Instance(VI_1) sent 0 priority
Feb 07 13:00:01 node1.example.com Keepalived_vrrp[28237]: VRRP_Instance(VI_1) removing protocol VIPs.
Feb 07 13:00:01 node1.example.com Keepalived_healthcheckers[28236]: Stopped
Feb 07 13:00:02 node1.example.com Keepalived_vrrp[28237]: Stopped
Feb 07 13:00:02 node1.example.com Keepalived[28234]: Stopped Keepalived v1.3.5 (03/19,2017), git commit v1.3.5-6-g6fa32f2
Feb 07 13:00:02 node1.example.com systemd[1]: Stopped LVS and VRRP High Availability Monitor.
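The difference between the two outputs above is whether PIDs remain in the service's control group after the stop. A scripted version of that check could read the cgroup's process list directly. This is a minimal sketch: count_cgroup_pids is a hypothetical helper, and the path in the example assumes the cgroup v1 layout used on RHEL 7 together with the CGroup value shown in the status output.

```shell
# Print the number of PIDs listed in a cgroup.procs file.
# Prints 0 when the file is missing or unreadable (for example because
# systemd already removed the cgroup after a clean stop).
# count_cgroup_pids is a hypothetical helper, not part of systemd.
count_cgroup_pids() {
    procs_file=$1
    [ -r "$procs_file" ] || { echo 0; return 0; }
    # grep -c exits non-zero when it counts 0 lines; keep the exit status clean.
    grep -c '^[0-9]' "$procs_file" || true
}

# Example invocation after `systemctl stop keepalived.service`
# (path is an assumption based on the cgroup v1 layout of RHEL 7):
#   count_cgroup_pids /sys/fs/cgroup/systemd/system.slice/keepalived.service/cgroup.procs
```

A result greater than 0 after the service reports inactive (dead) matches the behavior shown in the first output.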
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.