[SOLVED] - unkown cause for systemd failure state - systemd-logind cannot start org.freedesktop.systemd1 after server reboot

Latest response

Basic information

Red Hat Enterprise Linux Server release 7.4 (Maipo)
component: systemd
Hardware: x86_64 Linux
[root@scvberpat01 log]# uname -r
3.10.0-693.21.1.el7.x86_64

General description

Hey everyone, I had quite a major issue this weekend resulting in a necessary server restart of our company's production Server.

It all corresponded around the systemd Service going into FAILURE Status and not being able to start the loginmanager org.freedesktop.login1 and org.freedesktop.systemd1 via dbus/systemd-logind again.
It seems that something that I am not able to recognise has managed to kill the systemd-logind Service. We are running a heavy production System with around 1700 Tasks/processes on an average production day simultaniously. Crons and incrons are in use.
In Addition, since the restart the modul org.freedesktop.systemd1 cannot be started by systemd causing a significant delay in logintime and user change time via su.

systemd version

[root@scvberpat01 tmp]# systemctl --version
systemd 219
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ -LZ4 -SECCOMP +BLKID +ELFUTILS +KMOD +IDN

Journal entries of the incident starting off the FAILURE state of systemd

ul 14 05:09:40 scvberpat01 sshd[31966]: Accepted password for erp_monitoring from 172.20.0.63 port 58768 ssh2
Jul 14 05:09:40 scvberpat01 systemd-logind: New session 1135476 of user erp_monitoring.
Jul 14 05:09:40 scvberpat01 systemd: Started Session 1135476 of user erp_monitoring.
Jul 14 05:09:40 scvberpat01 systemd: Starting Session 1135476 of user erp_monitoring.
Jul 14 05:09:45 scvberpat01 sshd[32537]: Accepted password for erp_monitoring from 172.20.0.63 port 58801 ssh2
Jul 14 05:09:45 scvberpat01 systemd-logind: New session 1135477 of user erp_monitoring.
Jul 14 05:09:45 scvberpat01 systemd: Started Session 1135477 of user erp_monitoring.
Jul 14 05:09:45 scvberpat01 systemd: Starting Session 1135477 of user erp_monitoring.
Jul 14 05:10:01 scvberpat01 systemd: Started Session 1135478 of user root.
Jul 14 05:10:01 scvberpat01 systemd: Starting Session 1135478 of user root.
Jul 14 05:10:02 scvberpat01 systemd: Started Session 1135479 of user paseb.
Jul 14 05:10:02 scvberpat01 systemd: Starting Session 1135479 of user paseb.
Jul 14 05:10:02 scvberpat01 systemd: Started Session 1135480 of user root.
Jul 14 05:10:02 scvberpat01 systemd: Starting Session 1135480 of user root.
Jul 14 05:10:03 scvberpat01 systemd: Started Session 1135482 of user paseb.
Jul 14 05:10:03 scvberpat01 systemd: Starting Session 1135482 of user paseb.
Jul 14 05:10:03 scvberpat01 systemd: Started Session 1135481 of user paseb.
Jul 14 05:10:03 scvberpat01 systemd: Starting Session 1135481 of user paseb.
Jul 14 05:10:03 scvberpat01 systemd: Started Session 1135484 of user batchparg.
Jul 14 05:10:03 scvberpat01 systemd: Starting Session 1135484 of user batchparg.
Jul 14 05:10:04 scvberpat01 systemd: Started Session 1135485 of user batchparg.
Jul 14 05:10:04 scvberpat01 systemd: Starting Session 1135485 of user batchparg.
Jul 14 05:10:05 scvberpat01 systemd: Started Session 1135483 of user batchparg.
Jul 14 05:10:05 scvberpat01 systemd: Starting Session 1135483 of user batchparg.
Jul 14 05:10:06 scvberpat01 systemd: Started Session 1135486 of user batchparg.
Jul 14 05:10:06 scvberpat01 systemd: Starting Session 1135486 of user batchparg.
Jul 14 05:10:07 scvberpat01 systemd: Started Session 1135487 of user batchparg.
Jul 14 05:10:07 scvberpat01 systemd: Starting Session 1135487 of user batchparg.
Jul 14 05:10:08 scvberpat01 systemd: Started Session 1135488 of user batchparg.
Jul 14 05:10:08 scvberpat01 systemd: Starting Session 1135488 of user batchparg.
Jul 14 05:10:10 scvberpat01 systemd: Started Session 1135489 of user batchparg.
Jul 14 05:10:10 scvberpat01 systemd: Starting Session 1135489 of user batchparg.
Jul 14 05:10:11 scvberpat01 systemd: Started Session 1135490 of user batchparg.
Jul 14 05:10:11 scvberpat01 systemd: Starting Session 1135490 of user batchparg.
Jul 14 05:10:12 scvberpat01 systemd: Started Session 1135491 of user batchparg.
Jul 14 05:10:12 scvberpat01 systemd: Starting Session 1135491 of user batchparg.
Jul 14 05:10:13 scvberpat01 systemd: systemd-logind.service has no holdoff time, scheduling restart.
Jul 14 05:10:13 scvberpat01 systemd: Starting Login Service...
Jul 14 05:10:13 scvberpat01 systemd: Started Login Service.
Jul 14 05:10:13 scvberpat01 systemd-logind: New seat seat0.
Jul 14 05:10:13 scvberpat01 systemd-logind: Failed to read /run/systemd/users/11469: Argument list too long
Jul 14 05:10:13 scvberpat01 systemd-logind: Failed to read /run/systemd/users/0: Argument list too long
Jul 14 05:10:13 scvberpat01 systemd-logind: User enumeration failed: Argument list too long
Jul 14 05:11:58 scvberpat01 systemd-logind: Failed to stop user slice: No buffer space available
Jul 14 05:11:58 scvberpat01 systemd-logind: Failed to stop user slice: No buffer space available
Jul 14 05:11:58 scvberpat01 systemd-logind: Failed to stop user slice: No buffer space available
Jul 14 05:11:58 scvberpat01 systemd-logind: Failed to stop user slice: No buffer space available
Jul 14 05:11:58 scvberpat01 systemd-logind: Failed to start user slice user-11469.slice, ignoring: No buffer space available ((null))
Jul 14 05:11:59 scvberpat01 systemd-logind: Failed to start user slice user-0.slice, ignoring: No buffer space available ((null))
Jul 14 05:13:13 scvberpat01 systemd: systemd-logind.service watchdog timeout (limit 3min)!
Jul 14 05:13:13 scvberpat01 abrt-hook-ccpp: Process 2577 (systemd-logind) of user 0 killed by SIGABRT - dumping core
Jul 14 05:13:16 scvberpat01 ModemManager[1041]: [sleep-monitor] inhibit failed: GDBus.Error:org.freedesktop.DBus.Error.NoReply: Message did not receive a reply (timeout by message b
us)
Jul 14 05:13:16 scvberpat01 systemd: systemd-logind.service: main process exited, code=dumped, status=6/ABRT
Jul 14 05:13:16 scvberpat01 systemd: Unit systemd-logind.service entered failed state.
Jul 14 05:13:16 scvberpat01 systemd: systemd-logind.service failed.
Jul 14 05:13:21 scvberpat01 kernel: nr_pdflush_threads exported in /proc is scheduled for removal
Jul 14 05:13:40 scvberpat01 dbus[1028]: [system] Activating systemd to hand-off: service name='org.freedesktop.login1' unit='dbus-org.freedesktop.login1.service'
Jul 14 05:13:41 scvberpat01 systemd-logind: Failed to enable subscription: Connection timed out
Jul 14 05:13:41 scvberpat01 systemd-logind: Failed to fully start up daemon: Connection timed out
Jul 14 05:13:41 scvberpat01 dbus[1028]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jul 14 05:13:41 scvberpat01 systemd: systemd-logind.service: main process exited, code=exited, status=1/FAILURE
Jul 14 05:13:41 scvberpat01 systemd: Failed to start Login Service.
Jul 14 05:13:41 scvberpat01 systemd: Unit systemd-logind.service entered failed state.
Jul 14 05:13:41 scvberpat01 systemd: systemd-logind.service failed.
Jul 14 05:14:05 scvberpat01 dbus[1028]: [system] Failed to activate service 'org.freedesktop.login1': timed out
Jul 14 05:14:06 scvberpat01 systemd-logind: Failed to enable subscription: Connection timed out
Jul 14 05:14:06 scvberpat01 systemd-logind: Failed to fully start up daemon: Connection timed out
Jul 14 05:14:06 scvberpat01 dbus[1028]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jul 14 05:14:06 scvberpat01 systemd: systemd-logind.service: main process exited, code=exited, status=1/FAILURE
Jul 14 05:14:06 scvberpat01 systemd: Failed to start Login Service.
Jul 14 05:14:06 scvberpat01 systemd: Unit systemd-logind.service entered failed state.
Jul 14 05:14:06 scvberpat01 systemd: systemd-logind.service failed.
Jul 14 05:14:31 scvberpat01 systemd-logind: Failed to enable subscription: Connection timed out
Jul 14 05:14:31 scvberpat01 dbus[1028]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jul 14 05:14:31 scvberpat01 systemd-logind: Failed to fully start up daemon: Connection timed out
Jul 14 05:14:31 scvberpat01 systemd: systemd-logind.service: main process exited, code=exited, status=1/FAILURE
...
...

This went on for 1 and a half days until it impacted the Server heavily. In the end ssh, samba as well as the terminal did not respond anymore. Unfortunately it wasn't recognised beforehand so a hard reset was performed.

After the restart it seems that systemd is not able to start org.freedesktop.systemd1:

Jul 18 13:21:05 scvberpat01.pankl.local systemd-logind[1023]: Failed to start user slice user-11511.slice, ignoring: Activation of org.freedesktop.systemd1 timed out (org.freedesktop.DBus.Error.TimedOut)
Jul 18 13:21:30 scvberpat01.pankl.local systemd-logind[1023]: Failed to start session scope session-33081.scope: Connection timed out
Jul 18 13:22:05 scvberpat01.pankl.local systemd-logind[1023]: Failed to start user slice user-11511.slice, ignoring: Activation of org.freedesktop.systemd1 timed out (org.freedesktop.DBus.Error.TimedOut)
Jul 18 13:22:30 scvberpat01.pankl.local systemd-logind[1023]: Failed to start session scope session-33086.scope: Activation of org.freedesktop.systemd1 timed out
Jul 18 13:23:05 scvberpat01.pankl.local systemd-logind[1023]: Failed to start user slice user-11511.slice, ignoring: Activation of org.freedesktop.systemd1 timed out (org.freedesktop.DBus.Error.TimedOut)
Jul 18 13:23:30 scvberpat01.pankl.local systemd-logind[1023]: Failed to start session scope session-33092.scope: Activation of org.freedesktop.systemd1 timed out
Jul 18 13:24:05 scvberpat01.pankl.local systemd-logind[1023]: Failed to start user slice user-11511.slice, ignoring: Connection timed out ((null))
Jul 18 13:24:30 scvberpat01.pankl.local systemd-logind[1023]: Failed to start session scope session-33097.scope: Activation of org.freedesktop.systemd1 timed out
Jul 18 13:25:05 scvberpat01.pankl.local systemd-logind[1023]: Failed to start user slice user-11511.slice, ignoring: Connection timed out ((null))
Jul 18 13:25:30 scvberpat01.pankl.local systemd-logind[1023]: Failed to start session scope session-33102.scope: Connection timed out

"busctl --list" is showing that org.freedesktop.systemd1 is not activ

[root@scvberpat01 log]# busctl --list
NAME PID PROCESS USER CONNECTION UNIT SESSION DESCRIPTION
:1.0 1023 systemd-logind root :1.0 systemd-logind.service - -
:1.10 1076 NetworkManager root :1.10 NetworkManager.service - -
:1.2 1019 avahi-daemon avahi :1.2 avahi-daemon.service - -
:1.24 1348 cupsd root :1.24 cups.service - -
:1.25 1342 tuned root :1.25 tuned.service - -
:1.26 1583 colord colord :1.26 colord.service - -
:1.27 1348 cupsd root :1.27 cups.service - -
:1.28 2268 libvirtd root :1.28 libvirtd.service - -
:1.3 1030 rtkit-daemon root :1.3 rtkit-daemon.service - -
:1.34123 15157 abrt-dbus root :1.34123 dbus.service - -
:1.34132 9585 busctl root :1.34132 sshd.service - -
:1.384 4756 packagekitd root :1.384 packagekit.service - -
:1.4 1029 ModemManager root :1.4 ModemManager.service - -
:1.5 1020 polkitd polkitd :1.5 polkit.service - -
:1.6 1088 accounts-daemon root :1.6 accounts-daemon.service - -
:1.7 1076 NetworkManager root :1.7 NetworkManager.service - -
com.redhat.RHSM1 - - - (activatable) - -
com.redhat.RHSM1.Facts - - - (activatable) - -
com.redhat.SubscriptionManager - - - (activatable) - -
com.redhat.ifcfgrh1 1076 NetworkManager root :1.10 NetworkManager.service - -
com.redhat.problems.configuration - - - (activatable) - -
com.redhat.tuned 1342 tuned root :1.25 tuned.service - -
fi.epitest.hostap.WPASupplicant - - - (activatable) - -
fi.w1.wpa_supplicant1 - - - (activatable) - -
net.reactivated.Fprint - - - (activatable) - -
org.bluez - - - (activatable) - -
org.fedoraproject.SetroubleshootFixit - - - (activatable) - -
org.fedoraproject.Setroubleshootd - - - (activatable) - -
org.freedesktop.Accounts 1088 accounts-daemon root :1.6 accounts-daemon.service - -
org.freedesktop.Avahi 1019 avahi-daemon avahi :1.2 avahi-daemon.service - -
org.freedesktop.ColorManager 1583 colord colord :1.26 colord.service - -
org.freedesktop.DBus - - - - - - -
org.freedesktop.Flatpak.SystemHelper - - - (activatable) - -
org.freedesktop.GeoClue2 - - - (activatable) - -
org.freedesktop.ModemManager1 1029 ModemManager root :1.4 ModemManager.service - -
org.freedesktop.NetworkManager 1076 NetworkManager root :1.7 NetworkManager.service - -
org.freedesktop.PackageKit 4756 packagekitd root :1.384 packagekit.service - -
org.freedesktop.PolicyKit1 1020 polkitd polkitd :1.5 polkit.service - -
org.freedesktop.RealtimeKit1 1030 rtkit-daemon root :1.3 rtkit-daemon.service - -
org.freedesktop.UDisks2 - - - (activatable) - -
org.freedesktop.UPower - - - (activatable) - -
org.freedesktop.hostname1 - - - (activatable) - -
org.freedesktop.import1 - - - (activatable) - -
org.freedesktop.locale1 - - - (activatable) - -
org.freedesktop.login1 1023 systemd-logind root :1.0 systemd-logind.service - -
org.freedesktop.machine1 - - - (activatable) - -
org.freedesktop.nm_dispatcher - - - (activatable) - -
org.freedesktop.problems 15157 abrt-dbus root :1.34123 dbus.service - -
org.freedesktop.realmd - - - (activatable) - -
org.freedesktop.systemd1 - - - (activatable) - -
org.freedesktop.timedate1 - - - (activatable) - -
org.gnome.GConf.Defaults - - - (activatable) - -
org.opensuse.CupsPkHelper.Mechanism - - - (activatable) - -

straceing a user Change Shows exactly the timout of 30 seconds line 18:34:56. This doesnt really tell me much but maybe someone can help me here. Comparing this with our testsystem which didnt face this Major incident is showing no timeout here.

18:34:56 close(3) = 0
18:34:56 munmap(0x7fee1868b000, 4096) = 0
18:34:56 socket(AF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
18:34:56 setsockopt(3, SOL_SOCKET, SO_PASSCRED, [0], 4) = 0
18:34:56 setsockopt(3, SOL_SOCKET, SO_PASSSEC, [0], 4) = 0
18:34:56 getsockopt(3, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
18:34:56 setsockopt(3, SOL_SOCKET, SO_RCVBUFFORCE, [8388608], 4) = 0
18:34:56 getsockopt(3, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
18:34:56 setsockopt(3, SOL_SOCKET, SO_SNDBUFFORCE, [8388608], 4) = 0
18:34:56 connect(3, {sa_family=AF_LOCAL, sun_path="/var/run/dbus/system_bus_socket"}, 33) = 0
18:34:56 getsockopt(3, SOL_SOCKET, SO_PEERCRED, {pid=1, uid=0, gid=0}, [12]) = 0
18:34:56 getsockopt(3, SOL_SOCKET, SO_PEERSEC, 0x5645d834d4f0, 0x7ffdc6169a50) = -1 ENOPROTOOPT (Protocol not available)
18:34:56 fstat(3, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
18:34:56 getsockopt(3, SOL_SOCKET, SO_ACCEPTCONN, [0], [4]) = 0
18:34:56 getsockname(3, {sa_family=AF_LOCAL, NULL}, [2]) = 0
18:34:56 geteuid() = 0
18:34:56 sendmsg(3, {msg_name(0)=NULL, msg_iov(3)=[{"\0AUTH EXTERNAL ", 15}, {"30", 2}, {"\r\nNEGOTIATE_UNIX_FD\r\nBEGIN\r\n", 28}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 45
18:34:56 gettid() = 9342
18:34:56 getrandom("\375\326f\327\327UZ\222\347T\242\240\304\227\5}", 16, GRND_NONBLOCK) = 16
18:34:56 recvmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{"OK 0345052a3855cdc590c849e45b4b4"..., 256}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 52
18:34:56 sendmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{"l\1\0\1\0\0\0\0\1\0\0\0m\0\0\0\1\1o\0\25\0\0\0/org/fre"..., 128}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 128
18:34:56 recvmsg(3, 0x7ffdc6168960, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
18:34:56 ppoll([{fd=3, events=POLLIN}], 1, {24, 999810000}, NULL, 8) = 1 ([{fd=3, revents=POLLIN}], left {24, 999633311})
18:34:56 recvmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{"l\2\1\1\r\0\0\0\1\0\0\0E\0\0\0\6\1s\0\10\0\0\0", 24}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 24
18:34:56 recvmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{":1.25140\0\0\0\0\0\0\0\0\5\1u\0\1\0\0\0\10\1g\0\1s\0\0"..., 77}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 77
18:34:56 sendmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"l\1\0\1p\0\0\0\2\0\0\0\230\0\0\0\1\1o\0\27\0\0\0/org/fre"..., 168}, {"\342,\0\0~$\0\0\4\0\0\0su-l\0\0\0\0\3\0\0\0tty\0\4\0\0\0"..., 112}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 280
18:34:56 recvmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{"l\4\1\1\r\0\0\0\2\0\0\0\225\0\0\0\1\1o\0\25\0\0\0", 24}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 24
18:34:56 recvmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{"/org/freedesktop/DBus\0\0\0\2\1s\0\24\0\0\0"..., 157}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 157
18:34:56 recvmsg(3, 0x7ffdc6168b10, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
18:34:56 ppoll([{fd=3, events=POLLIN}], 1, {24, 999829000}, NULL, 8) = 0 (Timeout)
18:35:21 open("/usr/share/locale/locale.alias", O_RDONLY|O_CLOEXEC) = 4
18:35:21 fstat(4, {st_mode=S_IFREG|0644, st_size=2502, ...}) = 0
18:35:21 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fee1868b000
18:35:21 read(4, "# Locale name alias data base.\n#"..., 4096) = 2502
18:35:21 read(4, "", 4096) = 0
18:35:21 close(4) = 0
18:35:21 munmap(0x7fee1868b000, 4096) = 0

#

I would Need help in both the possible root cause which caused systemd to get into FAILURE state as well as a solution to org.freedesktop.systemd1 failing to start. Which in my opinion is causing a Login/user change delay of 30 seconds.
Please consider the fact that we are talking about a 24/7 production System.

Thanks a lot in advance
Mario

@EDIT: Attached Core dump

Attachments

Responses