in.telnetd hangs every 24 hours with "in.telnetd:xxxx blocked for more than 120 seconds" traces.
Environment
- Red Hat Enterprise Linux 6.2
- kernel-2.6.32-220.17.1.el6.x86_64
- telnet-server-0.17-47.el6.x86_64
- HP DL980 G7 (Intel(R) Xeon(R) CPU X7560 @ 2.27GHz)
Issue
- Messages file contains multiple entries for in.telnetd hangs every 24 hours
- If the system is rebooted every 24hrs then the messages are not seen.
Jun 27 15:35:23 server kernel: INFO: task in.telnetd:149860 blocked for more than 120 seconds.
Jun 27 15:35:23 server kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 27 15:35:23 server kernel: in.telnetd D 000000000000001e 0 149860 18044 0x00000000
Jun 27 15:35:23 server kernel: ffff898cc0891c08 0000000000000046 ffffffff814efc7b 0000000000000010
Jun 27 15:35:23 server kernel: 0000000000000246 ffff898cc0891b88 0000000000000018 ffff898fd7331240
Jun 27 15:35:23 server kernel: ffff898b878445f8 ffff898cc0891fd8 000000000000f4e8 ffff898b878445f8
Call Trace:
[<ffffffff814efc7b>] ? __reacquire_kernel_lock+0x5b/0x87
[<ffffffff81051ba3>] ? __wake_up+0x53/0x70
[<ffffffff814ee6fe>] __mutex_lock_slowpath+0x13e/0x180
[<ffffffff814ee59b>] mutex_lock+0x2b/0x50
[<ffffffff8130e3f3>] tty_release_dev+0x1e3/0x5e0
[<ffffffff8118d55f>] ? __d_free+0x3f/0x60
[<ffffffff8130e80e>] tty_release+0x1e/0x30
[<ffffffff811781c5>] __fput+0xf5/0x210
[<ffffffff81178305>] fput+0x25/0x30
[<ffffffff81173d4d>] filp_close+0x5d/0x90
[<ffffffff8106c7bf>] put_files_struct+0x7f/0xf0
[<ffffffff8106c883>] exit_files+0x53/0x70
[<ffffffff8106e8f5>] do_exit+0x185/0x870
[<ffffffff814efb6b>] ? _spin_unlock_bh+0x1b/0x20
[<ffffffff8106f038>] do_group_exit+0x58/0xd0
[<ffffffff8106f0c7>] sys_exit_group+0x17/0x20
[<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
- Multiple in.telnetd monopolize 100% cpu (top says 40%).
- The problem is not limited to the telnetd task alone.
- There is seemingly some degree of contention in relation to the TTY subsystem, so anything opening ttys etc.
- For example, telnetd, login, the shell.
- This connection is based upon the sysrq-t info.
Resolution
- A work-around is to periodically clean
utmpofDEAD_PROCESSentries, to avoid the file growing without bound. - If system is rebooted every 24 hours, this has the effect of truncating the
utmpfile as part of the normal rc.sysinit boot processing. - The final solution is to upgrade the
telnet-serverpackage with the errata RHBA-2012-1312 released.
Root Cause
A problem with telnetd's utmp usage/scalability. /bin/login searches utmp for an existing entry with a matching pid by looping over utmp using getutent(). Unfortunately getutent takes fcntl(F_SETLK, {F_RDLCK, ...}) on every call, which has an unfortunate side-effect of taking the BKL (on RHEL 6). Even more unfortunately, in.telnetd and /bin/login leak DEAD_PROCESS entries in utmp. This compounds the problem as the utmp file size grows, causing /bin/login to scan more and more utmp entries. This can eventually lead to a BKL "live-lock", where many tasks are contending for the BKL but no task can hold it for long enough to make any progress, the system becomes unresponsive.
Diagnostic Steps
- Check how big the
/var/run/utmpfile is, how many entries areDEADentries, and if its a large percentage. You may need to installpsacct(fordump-utmp) andvim-common(forxxd) packages:
# yum install psacct vim-common
- Let's see how big one entry is:
$ dump-utmp --raw -n 1 /var/run/utmp | xxd | wc
24 240 1608
- Looks like 24 lines. Now from the man page of
utmp, the format of each record shows the 'type' field is the first byte, followed by the pid.
struct utmp {
short ut_type; /* Type of record */
pid_t ut_pid; /* PID of login process */
- So a
DEADprocess should have this value in the type:
#define DEAD_PROCESS 8 /* Terminated process */
- We can then dump out just the first line of each record, say the first 50 records:
$ dump-utmp --raw -n 50 /var/run/utmp | xxd | awk 'BEGIN { c = 0; } { if (c % 24 == 0) { print $0 }; c++; }'
...
0003180: 0800 0000 507b 0000 7074 732f 3235 3300 ....P{..pts/253.
0003300: 0800 0000 547b 0000 7074 732f 3300 0000 ....T{..pts/3...
0003480: 0800 0000 577b 0000 7074 732f 3334 0000 ....W{..pts/34..
0003600: 0800 0000 467e 0000 7074 732f 3939 0000 ....F~..pts/99..
- And it certainly looks like there's a high percentage of dead entries, but let's count them:
$ dump-utmp --raw /var/run/utmp | xxd | awk 'BEGIN { dead = c = 0; } { if ((c % 24 == 0)&&($2 == 0800)) { dead++; } c++; } END { print "Found "dead" dead records out of "c/24" records "100*dead/(c/24)"%\n" }'
Found 55513 dead records out of 55746 records 99.582%
- Clearly there are a lot of dead records in the
/var/run/utmpfile, which is making the file huge!
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
Comments