in.telnetd hangs every 24 hours with "in.telnetd:xxxx blocked for more than 120 seconds" traces.

Solution Unverified - Updated

Environment

  • Red Hat Enterprise Linux 6.2
  • kernel-2.6.32-220.17.1.el6.x86_64
  • telnet-server-0.17-47.el6.x86_64
  • HP DL980 G7 (Intel(R) Xeon(R) CPU X7560 @ 2.27GHz)

Issue

  • The messages file contains multiple hung-task entries for in.telnetd, recurring every 24 hours.
  • If the system is rebooted every 24 hours, the messages are not seen.
Jun 27 15:35:23 server kernel: INFO: task in.telnetd:149860 blocked for more than 120 seconds.
Jun 27 15:35:23 server kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 27 15:35:23 server kernel: in.telnetd    D 000000000000001e     0 149860  18044 0x00000000
Jun 27 15:35:23 server kernel: ffff898cc0891c08 0000000000000046 ffffffff814efc7b 0000000000000010
Jun 27 15:35:23 server kernel: 0000000000000246 ffff898cc0891b88 0000000000000018 ffff898fd7331240
Jun 27 15:35:23 server kernel: ffff898b878445f8 ffff898cc0891fd8 000000000000f4e8 ffff898b878445f8

Call Trace:
[<ffffffff814efc7b>] ? __reacquire_kernel_lock+0x5b/0x87
[<ffffffff81051ba3>] ? __wake_up+0x53/0x70
[<ffffffff814ee6fe>] __mutex_lock_slowpath+0x13e/0x180
[<ffffffff814ee59b>] mutex_lock+0x2b/0x50
[<ffffffff8130e3f3>] tty_release_dev+0x1e3/0x5e0
[<ffffffff8118d55f>] ? __d_free+0x3f/0x60
[<ffffffff8130e80e>] tty_release+0x1e/0x30
[<ffffffff811781c5>] __fput+0xf5/0x210
[<ffffffff81178305>] fput+0x25/0x30
[<ffffffff81173d4d>] filp_close+0x5d/0x90
[<ffffffff8106c7bf>] put_files_struct+0x7f/0xf0
[<ffffffff8106c883>] exit_files+0x53/0x70
[<ffffffff8106e8f5>] do_exit+0x185/0x870
[<ffffffff814efb6b>] ? _spin_unlock_bh+0x1b/0x20
[<ffffffff8106f038>] do_group_exit+0x58/0xd0
[<ffffffff8106f0c7>] sys_exit_group+0x17/0x20
[<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
  • Multiple in.telnetd processes monopolize the CPU (reported as 100%, though top shows 40%).
    • The problem is not limited to the in.telnetd task alone.
    • There is apparently contention around the TTY subsystem, so any task that opens a tty can be affected (for example in.telnetd, login, or the shell).
    • This connection is based on the sysrq-t output.
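A quick way to confirm the symptom is to grep the system log for the hung-task warnings. The helper below is a sketch: the function name and the default /var/log/messages path are illustrative assumptions, not part of the original report.

```shell
# Sketch: list hung-task warnings that name in.telnetd.
# The function name and the default log path are assumptions.
telnetd_hangs() {
    grep 'INFO: task in.telnetd' "${1:-/var/log/messages}"
}
```

Piping the result through `wc -l` gives a rough count of how often the hang has recurred.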

Resolution

  • A workaround is to periodically clean /var/run/utmp of DEAD_PROCESS entries, so the file does not grow without bound.
  • If the system is rebooted every 24 hours, the utmp file is truncated as part of normal rc.sysinit boot processing, which has the same effect.
  • The permanent fix is to upgrade the telnet-server package to the version shipped in errata RHBA-2012-1312.
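The reboot-time truncation can be emulated by a small periodic job. The sketch below is a hedged example: the function name is made up, and the path parameter exists only so the function can be exercised on a copy of the file. Note that truncation discards live session records as well as dead ones, exactly as the rc.sysinit boot-time reset does.

```shell
# Workaround sketch: truncate the utmp file, as rc.sysinit does at boot.
# NOTE (assumption): this drops live session records too, so `who`
# output is empty until users log in again.
clean_utmp() {
    : > "${1:-/var/run/utmp}"
}
```

Run from cron (for example daily) to keep /bin/login's utmp scan short between reboots.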

Root Cause

A problem with telnetd's utmp usage and scalability. /bin/login searches utmp for an existing entry with a matching PID by looping over utmp with getutent(). Unfortunately, getutent() takes fcntl(F_SETLK, {F_RDLCK, ...}) on every call, which has the side effect of taking the Big Kernel Lock (BKL) on RHEL 6. Even more unfortunately, in.telnetd and /bin/login leak DEAD_PROCESS entries in utmp. This compounds the problem: as the utmp file grows, /bin/login scans more and more entries on every login. This can eventually lead to a BKL "live-lock", where many tasks contend for the BKL but no task can hold it long enough to make progress, and the system becomes unresponsive.
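To make the scaling concrete: each login scans every utmp record and takes one fcntl() lock (and, on RHEL 6, the BKL) per record, so the lock traffic is proportional to the file size. A hedged sketch, with a made-up helper name and assuming the 384-byte x86_64 glibc record size measured in the Diagnostic Steps:

```shell
# Sketch (hypothetical helper): estimate how many utmp records, and
# therefore how many per-record fcntl() locks, a single /bin/login
# scan performs. 384 = utmp record size on x86_64 glibc (assumption).
utmp_scan_cost() {
    size=$(stat -c %s "$1")     # file size in bytes
    echo $((size / 384))        # records scanned = locks taken per login
}
```

On a file with ~55,000 records this is ~55,000 lock/unlock cycles per login attempt, which is what feeds the BKL contention.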

Diagnostic Steps

  • Check how big the /var/run/utmp file is, how many of its entries are DEAD entries, and whether they are a large percentage of the total. You may need to install the psacct (for dump-utmp) and vim-common (for xxd) packages:
# yum install psacct vim-common
  • Let's see how big one entry is:
$ dump-utmp --raw -n 1 /var/run/utmp | xxd | wc
     24     240    1608
  • Looks like 24 hexdump lines of 16 bytes each, so each record is 384 bytes. The utmp(5) man page shows that each record starts with the ut_type field (a short), followed by the PID:
          struct utmp {
               short   ut_type;              /* Type of record */
               pid_t   ut_pid;               /* PID of login process */
  • So a dead process should have this value in the ut_type field:
           #define DEAD_PROCESS  8 /* Terminated process */
  • We can then dump out just the first line of each record, say the first 50 records:
$ dump-utmp --raw -n 50 /var/run/utmp | xxd | awk 'BEGIN { c = 0; } { if (c % 24 == 0) { print $0 }; c++; }'
...
0003180: 0800 0000 507b 0000 7074 732f 3235 3300  ....P{..pts/253.
0003300: 0800 0000 547b 0000 7074 732f 3300 0000  ....T{..pts/3...
0003480: 0800 0000 577b 0000 7074 732f 3334 0000  ....W{..pts/34..
0003600: 0800 0000 467e 0000 7074 732f 3939 0000  ....F~..pts/99..
  • And it certainly looks like there's a high percentage of dead entries, but let's count them:
$ dump-utmp --raw /var/run/utmp | xxd | awk 'BEGIN { dead = c = 0; } { if ((c % 24 == 0) && ($2 == "0800")) { dead++; } c++; } END { print "Found "dead" dead records out of "c/24" records "100*dead/(c/24)"%" }'
Found 55513 dead records out of 55746 records 99.582%
  • Clearly there are a lot of dead records in the /var/run/utmp file, which is making the file huge!
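The counting one-liner above can be wrapped into a reusable function for repeated checks. This is a sketch with a made-up name; it applies xxd to the file directly (assuming dump-utmp --raw emits the records unchanged) and relies on the same layout: 24 xxd lines per 384-byte record, with ut_type in the first hex word.

```shell
# Sketch: count DEAD_PROCESS (ut_type == 8, little-endian "0800")
# records in a utmp-format file via xxd. Function name is illustrative.
count_dead() {
    xxd "$1" | awk '
        NR % 24 == 1 && $2 == "0800" { dead++ }
        END { printf "dead=%d total=%d\n", dead, NR/24 }'
}
```

Running it periodically shows how fast the dead-record leak grows between cleanups.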

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
