How to track who/what is sending SIGKILL to a process?


Environment

o Red Hat Enterprise Linux 4
o Red Hat Enterprise Linux 5
o Red Hat Enterprise Linux 6
o Red Hat Enterprise Linux 7
o Red Hat Enterprise Linux 8
o Red Hat Enterprise Linux 9

Issue

We are investigating an issue with processes that are suddenly dying, and we have determined that the process receives a SIGKILL signal.
However, no log message explains why the process was killed. How can we find out who or what sends the kill signal to the process?

# less /var/log/messages
Jul  5 11:44:45 RHEL9 systemd[1]: httpd.service: Main process exited, code=killed, status=9/KILL
Jul  5 11:44:45 RHEL9 systemd[1]: httpd.service: Killing process 15201 (httpd) with signal SIGKILL.
Jul  5 11:44:45 RHEL9 systemd[1]: httpd.service: Killing process 15202 (httpd) with signal SIGKILL.
Jul  5 11:44:45 RHEL9 systemd[1]: httpd.service: Killing process 15203 (httpd) with signal SIGKILL.
[..]
Jul  5 11:44:45 RHEL9 systemd[1]: httpd.service: Killing process 15413 (n/a) with signal SIGKILL.
Jul  5 11:44:45 RHEL9 systemd[1]: httpd.service: Killing process 15414 (n/a) with signal SIGKILL.
Jul  5 11:44:45 RHEL9 systemd[1]: httpd.service: Killing process 15415 (n/a) with signal SIGKILL.
Jul  5 11:44:45 RHEL9 systemd[1]: httpd.service: Failed with result 'signal'.

Resolution

Using audit

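An audit rule can capture the system calls that deliver signal 9 (SIGKILL), which identifies the user and the process that sent the signal. The kill and tkill system calls carry the signal number in their second argument (a1), while tgkill carries it in its third argument (a2).

Rules such as the following cover both 64-bit and 32-bit callers: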

# auditctl  -a exit,always -F arch=b64 -F a1=9 -S kill
# auditctl  -a exit,always -F arch=b64 -F a1=9 -S tkill
# auditctl  -a exit,always -F arch=b64 -F a2=9 -S tgkill
# auditctl  -a exit,always -F arch=b32 -F a1=9 -S kill
# auditctl  -a exit,always -F arch=b32 -F a1=9 -S tkill
# auditctl  -a exit,always -F arch=b32 -F a2=9 -S tgkill
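The six rules above differ only in architecture, syscall name, and argument index, so they can also be generated by a short script. The following is a minimal sketch rather than a supported procedure: the function name gen_sigkill_rules and the key sigkill_trace are hypothetical names chosen for this example, and the -k option assumes an audit version that supports rule keys.

```shell
#!/bin/sh
# Print audit rules that log signal 9 (SIGKILL) sent via kill, tkill or tgkill.
# KEY is an arbitrary label (an assumption of this sketch) used to tag the events.
KEY=sigkill_trace

gen_sigkill_rules() {
    for arch in b64 b32; do
        # kill(pid, sig) and tkill(tid, sig) carry the signal in a1;
        # tgkill(tgid, tid, sig) carries it in a2.
        for pair in kill:a1 tkill:a1 tgkill:a2; do
            syscall=${pair%%:*}
            arg=${pair##*:}
            printf '%s\n' "-a exit,always -F arch=$arch -S $syscall -F $arg=9 -k $KEY"
        done
    done
}

gen_sigkill_rules
```

Each printed line can be passed to auditctl as its arguments, or placed in an audit rules file to make the rules persistent. With a key attached, the matching events can then be retrieved with ausearch -k sigkill_trace -i instead of searching the whole log.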

How to use audit to monitor a specific SYSCALL?

Using SystemTap

What is systemtap?
SystemTap is a tracing and probing tool that allows users to study and monitor the activities of the operating system (particularly, the kernel) in fine detail. It provides information similar to the output of tools like netstat, ps, top, and iostat; however, SystemTap is designed to provide more filtering and analysis options for collected information.

How to monitor kill signal 2, 9 and 15 using systemtap script?

Using bcc-tools

bcc-tools is available in RHEL 7.6 and later, RHEL 8, and RHEL 9. The package contains many pre-built tracing scripts, including killsnoop, which traces which processes call the kill() syscall.

How to trace signals issued by the kill() syscall?

Diagnostic Steps

The httpd service is running with the main PID 15200.

# systemctl status httpd.service
● httpd.service - The Apache HTTP Server
     Loaded: loaded (/usr/lib/systemd/system/httpd.service; enabled; vendor preset: disabled)
     Active: active (running) since Tue 2022-07-05 11:32:43 IST; 10min ago
       Docs: man:httpd.service(8)
   Main PID: 15200 (httpd)
     Status: "Total requests: 0; Idle/Busy workers 100/0;Requests/sec: 0; Bytes served/sec:   0 B/sec"
      Tasks: 213 (limit: 11106)
     Memory: 28.9M
        CPU: 596ms
     CGroup: /system.slice/httpd.service
             ├─15200 /usr/sbin/httpd -DFOREGROUND
             ├─15201 /usr/sbin/httpd -DFOREGROUND
             ├─15202 /usr/sbin/httpd -DFOREGROUND
             ├─15203 /usr/sbin/httpd -DFOREGROUND
             └─15204 /usr/sbin/httpd -DFOREGROUND

Jul 05 11:32:42 RHEL9.example.com systemd[1]: Starting The Apache HTTP Server...
Jul 05 11:32:43 RHEL9.example.com httpd[15200]: Server configured, listening on: port 80
Jul 05 11:32:43 RHEL9.example.com systemd[1]: Started The Apache HTTP Server.
[root@RHEL9 ~]# 
[root@RHEL9 ~]# 

Now try to kill a process with the killall command.

# killall -9 <procname>

[root@RHEL9 ~]# killall -9 httpd
[root@RHEL9 ~]# 
[root@RHEL9 ~]# systemctl status httpd.service
× httpd.service - The Apache HTTP Server
     Loaded: loaded (/usr/lib/systemd/system/httpd.service; enabled; vendor preset: disabled)
     Active: failed (Result: signal) since Tue 2022-07-05 11:44:45 IST; 2s ago
       Docs: man:httpd.service(8)
    Process: 15200 ExecStart=/usr/sbin/httpd $OPTIONS -DFOREGROUND (code=killed, signal=KILL)
   Main PID: 15200 (code=killed, signal=KILL)
     Status: "Total requests: 0; Idle/Busy workers 100/0;Requests/sec: 0; Bytes served/sec:   0 B/sec"
        CPU: 655ms

Jul 05 11:44:45 RHEL9.example.com systemd[1]: httpd.service: Killing process 15407 (n/a) with signal SIGKILL.
Jul 05 11:44:45 RHEL9.example.com systemd[1]: httpd.service: Killing process 15408 (n/a) with signal SIGKILL.
Jul 05 11:44:45 RHEL9.example.com systemd[1]: httpd.service: Killing process 15409 (n/a) with signal SIGKILL.
Jul 05 11:44:45 RHEL9.example.com systemd[1]: httpd.service: Killing process 15410 (n/a) with signal SIGKILL.
Jul 05 11:44:45 RHEL9.example.com systemd[1]: httpd.service: Killing process 15411 (n/a) with signal SIGKILL.
Jul 05 11:44:45 RHEL9.example.com systemd[1]: httpd.service: Killing process 15412 (n/a) with signal SIGKILL.
Jul 05 11:44:45 RHEL9.example.com systemd[1]: httpd.service: Killing process 15413 (n/a) with signal SIGKILL.
Jul 05 11:44:45 RHEL9.example.com systemd[1]: httpd.service: Killing process 15414 (n/a) with signal SIGKILL.
Jul 05 11:44:45 RHEL9.example.com systemd[1]: httpd.service: Killing process 15415 (n/a) with signal SIGKILL.
Jul 05 11:44:45 RHEL9.example.com systemd[1]: httpd.service: Failed with result 'signal'.
[root@RHEL9 ~]# 

With the audit rules from the Resolution section loaded, this records an event in /var/log/audit/audit.log like the following:

time->Tue Jul  5 11:44:45 2022
type=PROCTITLE msg=audit(1657001685.727:558): proctitle=6B696C6C616C6C002D39006874747064
type=OBJ_PID msg=audit(1657001685.727:558): opid=15200 oauid=-1 ouid=0 oses=-1 obj=system_u:system_r:httpd_t:s0 ocomm="httpd"
type=SYSCALL msg=audit(1657001685.727:558): arch=c000003e syscall=62 success=yes exit=0 a0=3b60 a1=9 a2=0 a3=a items=0 ppid=1264 pid=15588 auid=1000 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=3 comm="killall" exe="/usr/bin/killall" subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key=(null)

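Where:

syscall=62 is the number of the intercepted syscall, which is kill (sys_kill) on x86_64 (arch=c000003e).
a0=3b60 is the first syscall argument in hexadecimal: the target PID, 15200 in decimal. It matches opid=15200 in the OBJ_PID record.
a1=9 is the second syscall argument: the signal number, 9 (SIGKILL).
pid=15588 is the PID of the process that called sys_kill to kill the httpd service; comm="killall" and exe="/usr/bin/killall" identify the program.
uid=0 shows that the call ran as root, and auid=1000 is the login UID of the user who started the session.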
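The a0 and proctitle fields in the sample record are hex-encoded, so they are not readable at a glance. The following sketch decodes them with plain shell built-ins, using the values copied from the record above. The helper names hex_to_dec and decode_proctitle are hypothetical, and the decoder assumes the proctitle is a hex string with NUL bytes separating the arguments, as it is in this sample.

```shell
#!/bin/sh
# Convert a hexadecimal audit argument (a0, a1, ...) to decimal.
hex_to_dec() {
    printf '%d\n' "0x$1"
}

# Decode a hex-encoded proctitle; NUL separators become spaces.
# The function body runs in a subshell so its variables do not leak.
decode_proctitle() (
    hex=$1
    out=""
    while [ -n "$hex" ]; do
        rest=${hex#??}                  # everything after the first two digits
        [ "$rest" = "$hex" ] && break   # odd trailing digit: stop decoding
        byte=${hex%"$rest"}             # the first two hex digits
        hex=$rest
        if [ "$byte" = "00" ]; then
            out="$out "
        else
            # Turn the hex byte into an octal escape that printf understands.
            out="$out$(printf "\\$(printf '%03o' "0x$byte")")"
        fi
    done
    printf '%s\n' "$out"
)

hex_to_dec 3b60                                      # prints 15200 (target PID)
hex_to_dec 9                                         # prints 9 (SIGKILL)
decode_proctitle 6B696C6C616C6C002D39006874747064    # prints killall -9 httpd
```

The decoded values confirm that the command killall -9 httpd sent signal 9 to PID 15200. Running ausearch with the -i option should perform a similar interpretation without a helper script.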

Source code

Let's try to understand from the source how things are running in the background.

o The kill system call, defined with SYSCALL_DEFINE2, takes the PID of the target process and the signal number. It first calls prepare_kill_siginfo, which records the signal number together with the PID and UID of the sending process, and then calls kill_something_info. For a positive PID, kill_something_info calls kill_proc_info, which looks up the target process and delivers the signal to it.

static inline void prepare_kill_siginfo(int sig, struct kernel_siginfo *info)
{
    clear_siginfo(info);
    info->si_signo = sig;                       /* signal number being sent */
    info->si_errno = 0;
    info->si_code = SI_USER;
    info->si_pid = task_tgid_vnr(current);      /* PID of the sending process */
    info->si_uid = from_kuid_munged(current_user_ns(), current_uid());
}

/**
 *  sys_kill - send a signal to a process
 *  @pid: the PID of the process
 *  @sig: signal to be sent
 */
SYSCALL_DEFINE2(kill, pid_t, pid, int, sig)
{
    struct kernel_siginfo info;

    prepare_kill_siginfo(sig, &info);           /* fill in signal and sender details */

    return kill_something_info(sig, &info, pid);    /* select the target */
}

static int kill_something_info(int sig, struct kernel_siginfo *info, pid_t pid)
{
    int ret;

    if (pid > 0)                                /* a single, specific process */
        return kill_proc_info(sig, info, pid);

    /* -INT_MIN is undefined.  Exclude this case to avoid a UBSAN warning */
    if (pid == INT_MIN)
        return -ESRCH;

    read_lock(&tasklist_lock);
    [..]

static int kill_proc_info(int sig, struct kernel_siginfo *info, pid_t pid)
{
    int error;
    rcu_read_lock();
    error = kill_pid_info(sig, info, find_vpid(pid));   /* look up the PID and deliver */
    rcu_read_unlock();
    return error;
}

On RHEL 4, the available syscalls and their numbers are listed in the following files:

/usr/src/kernels/<version>/include/asm-i386/unistd.h for i386 systems
/usr/src/kernels/<version>/include/asm-x86_64/unistd.h for x86_64 systems

The above files are included in the kernel-devel rpm package.
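For the three signal-sending syscalls covered by the audit rules, the arch= and syscall= fields of a SYSCALL record can be mapped back to a name with a small lookup. This is a sketch limited to x86_64 and i386. The function name signal_syscall_name is hypothetical, and the numbers are the standard Linux syscall numbers for those two architectures.

```shell
#!/bin/sh
# Map the arch= and syscall= fields of an audit SYSCALL record to a name.
# Only kill, tkill and tgkill are listed; anything else prints "unknown".
signal_syscall_name() {
    case "$1:$2" in
        c000003e:62)  echo kill ;;      # x86_64
        c000003e:200) echo tkill ;;
        c000003e:234) echo tgkill ;;
        40000003:37)  echo kill ;;      # i386
        40000003:238) echo tkill ;;
        40000003:270) echo tgkill ;;
        *) echo unknown; return 1 ;;
    esac
}

signal_syscall_name c000003e 62     # prints kill
```

Where the audit package provides the ausyscall utility, it should offer a complete lookup for all syscalls and architectures.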
