systemd hangs and is unresponsive with “Assertion 'pid >= 1' failed at src/core/unit.c:1997, function unit_watch_pid(). Aborting.” visible in the system logs


Environment

  • Red Hat Enterprise Linux (RHEL) 7
  • systemd-219-30.el7_3.7

Issue

  • systemd stops responding and misbehaves on multiple hosts that are under memory pressure. It refuses connections on its usual sockets, so commands such as systemctl and journalctl hang or fail. It also fails to reap the processes it inherits (an important responsibility of PID 1), so tens of thousands of zombie processes pile up. The following messages are visible in /var/log/messages:
May 20 06:47:56 host.example.com systemd[1]: Failed to fork: Cannot allocate memory
May 20 06:47:56 host.example.com systemd[1]: Assertion 'pid >= 1' failed at src/core/unit.c:1997, function unit_watch_pid(). Aborting.
May 20 06:47:56 host.example.com systemd[1]: Caught <ABRT>, cannot fork for core dump: Cannot allocate memory
May 20 06:47:56 host.example.com systemd[1]: Freezing execution.

Resolution

This issue has been fixed in the following packages:

  • systemd-219-57.el7, released with RHEL 7.5
  • systemd-219-42.el7_4.10 for the RHEL 7.4.Z Stream
  • systemd-219-19.el7_2.21 for the RHEL 7.2.EUS Stream

To recover from the systemd hung state, a reboot of the system is recommended. This may require passing the "-f" or possibly the "-ff" flag to systemctl reboot. The relevant description from the systemctl man page is:

       -f, --force
           When used with enable, overwrite any existing conflicting symlinks.

           When used with halt, poweroff, reboot or kexec, execute the selected operation without shutting down all
           units. However, all processes will be killed forcibly and all file systems are unmounted or remounted
           read-only. This is hence a drastic but relatively safe option to request an immediate reboot. If --force
           is specified twice for these operations, they will be executed immediately without terminating any
           processes or unmounting any file systems. Warning: specifying --force twice with any of these operations
           might result in data loss.

Root Cause

When the system is under memory pressure, systemd is susceptible to the issue fixed by the following upstream commit:

commit 74129a127676e4f0edac0db4296c103e76ec6694
Author: lc85446 <lc85446@alibaba-inc.com>
Date:   Thu Nov 26 11:46:40 2015 +0800

    core:execute: fix fork() fail handling in exec_spawn()

        If pid < 0 after fork(), 0 is always returned because r =
        exec_context_load_environment() has exited successfully.

        This will make the caller of exec_spawn() not able to handle
        the fork() error case and make systemd abort assert() possibly.

diff --git a/src/core/execute.c b/src/core/execute.c
index 677480c..4f67a9d 100644
--- a/src/core/execute.c
+++ b/src/core/execute.c
@@ -2056,7 +2056,7 @@ int exec_spawn(Unit *unit,
                    NULL);
         pid = fork();
         if (pid < 0)
-                return log_unit_error_errno(unit, r, "Failed to fork: %m");
+                return log_unit_error_errno(unit, errno, "Failed to fork: %m");

         if (pid == 0) {
                 int exit_status;

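Because of the incorrect error handling above, exec_spawn() returns 0 (success) even though fork() failed. The caller therefore continues with a pid that is not greater than or equal to 1, and passing that pid to unit_watch_pid() causes the following assertion to fail.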

int unit_watch_pid(Unit *u, pid_t pid) {
        int q, r;

        assert(u);
        assert(pid >= 1);

        /* Watch a specific PID. We only support one or two units
         * watching each PID for now, not more. */

        r = set_ensure_allocated(&u->pids, NULL);
        if (r < 0)
                return r;

        r = hashmap_ensure_allocated(&u->manager->watch_pids1, NULL);
        if (r < 0)
                return r;

        r = hashmap_put(u->manager->watch_pids1, LONG_TO_PTR(pid), u);
        if (r == -EEXIST) {
                r = hashmap_ensure_allocated(&u->manager->watch_pids2, NULL);
                if (r < 0)
                        return r;

                r = hashmap_put(u->manager->watch_pids2, LONG_TO_PTR(pid), u);
        }

        q = set_put(u->pids, LONG_TO_PTR(pid));
        if (q < 0)
                return q;

        return r;
}

Diagnostic Steps

  • Check /var/log/messages or a similar general system log file for the following message:
Assertion 'pid >= 1' failed at src/core/unit.c:1997, function unit_watch_pid(). Aborting.

