Message "audit: backlog limit exceeded" reported and possibly hung system due to a frozen filesystem

Solution Unverified - Updated -

Environment

  • Red Hat Enterprise Linux 6
  • auditd

Issue

  • This message is being displayed continuously on console. I will have to power cycle to reboot.
audit: backlog limit exceeded
  • Following messages seen in system log:
audit: audit_backlog=321 > audit_backlog_limit=320
audit: audit_lost=44393 audit_rate_limit=0 audit_backlog_limit=320
audit: backlog limit exceeded
audit: audit_backlog=321 > audit_backlog_limit=320
audit: audit_lost=44394 audit_rate_limit=0 audit_backlog_limit=320
audit: backlog limit exceeded
audit: audit_backlog=321 > audit_backlog_limit=320
audit: audit_lost=44395 audit_rate_limit=0 audit_backlog_limit=320
audit: backlog limit exceeded
audit: audit_backlog=321 > audit_backlog_limit=320

Resolution

Note: To confirm this issue, you must gather a vmcore and submit it for analysis - see Diagnostic Steps for further information.

1) Unfreeze the frozen filesystem to allow the audit daemon to write out the backlog of audit data.

# fsfreeze -u <mount point>

2) Address the underlying problem which caused the filesystems to get stuck in a frozen state. Please see How to determine what process caused an ext4 filesystem to freeze? for additional information.

3) Adjust the audit subsystem settings to prevent the error messages. For more information, see https://access.redhat.com/solutions/19327

Note: if you are experiencing an issue with frozen filesystems that is not addressed in this solution, please visit How to determine what process caused an ext4 filesystem to freeze? for additional information.

Root Cause

  • The audit subsystem in kernel was reporting backlog exceeded errors because the auditd daemon was unable to write the audit data to a frozen file system and as a result the incoming stream of audit data overflowed.

  • Possibly https://access.redhat.com/site/solutions/473213.

Diagnostic Steps

General steps

Summary vmcore analysis

1) The vmcore will have one or more frozen filesystems. To check for frozen filesystems, see https://access.redhat.com/articles/1225503.

2) The vmcore will contain one or more auditd processes stuck in a backtrace indicating its waiting on a frozen filesystem to thaw. Use the 'foreach' command in crash to dump out the auditd processes

crash> foreach auditd bt
  • Note: Below are example backtraces of the two processes and have been found to a "quick check" of the signature for this issue. To fully confirm this issue, the detailed vmcore steps must be followed, which includes deeper analysis of the process stacks and data structures involved.
  • Example1:
PID: 1694   TASK: ffff880331c81540  CPU: 6   COMMAND: "auditd"
 #0 [ffff8803320a1b98] schedule at ffffffff814ec9f2
 #1 [ffff8803320a1c60] __generic_file_aio_write at ffffffff81112bd5
 #2 [ffff8803320a1d20] generic_file_aio_write at ffffffff81112fff
 #3 [ffff8803320a1d70] ext4_file_write at ffffffffa009ade1 [ext4]
 #4 [ffff8803320a1dc0] do_sync_write at ffffffff811762da
  • Example2:
 #0 [ffff88013b75dbd8] schedule at ffffffff815278d2
 #1 [ffff88013b75dc00] futex_wait_queue_me at ffffffff810ae8e9
 #2 [ffff88013b75dc60] prepare_to_wait at ffffffff8109b5de
 #3 [ffff88013b75dca0] __sb_start_write at ffffffff8118af3c
 #4 [ffff88013b75dd30] generic_file_aio_write at ffffffff81121fd9
 #5 [ffff88013b75dd80] ext4_file_write at ffffffffa00a8fd8 [ext4]
 #6 [ffff88013b75ddc0] do_sync_write at ffffffff81188c8a
 #7 [ffff88013b75def0] vfs_write at ffffffff81188f88

Detailed vmcore analysis

Looking at the vmcore:

      KERNEL: /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/2.6.32-220.el6.x86_64/vmlinux
    DUMPFILE: /cores/retrace/tasks/841303283/crash/vmcore  [PARTIAL DUMP]
        CPUS: 8
        DATE: Thu Aug 29 04:34:19 2013
      UPTIME: 191 days, 02:42:42
LOAD AVERAGE: 938.98, 938.19, 936.32
       TASKS: 2874
    NODENAME: <nodename>
     RELEASE: 2.6.32-220.el6.x86_64
     VERSION: #1 SMP Wed Nov 9 08:03:13 EST 2011
     MACHINE: x86_64  (2533 Mhz)
      MEMORY: 12 GB
       PANIC: "Oops: 0002 [#1] SMP " (check log for details)

Looking at the auditd processes:

crash> ps | grep audit
   1118      2   5  ffff8803329e8ac0  IN   0.0       0      0  [kauditd]
   1693      1   7  ffff88033395b540  IN   0.0   93248    764  auditd
   1694      1   6  ffff880331c81540  UN   0.0   93248    764  auditd

crash> bt 1693
PID: 1693   TASK: ffff88033395b540  CPU: 7   COMMAND: "auditd"
 #0 [ffff88032d845b38] schedule at ffffffff814ec9f2
 #1 [ffff88032d845c00] futex_wait_queue_me at ffffffff810a2419
 #2 [ffff88032d845c40] futex_wait at ffffffff810a34e8
 #3 [ffff88032d845dc0] do_futex at ffffffff810a4c31
 #4 [ffff88032d845ef0] sys_futex at ffffffff810a568b
 #5 [ffff88032d845f80] system_call_fastpath at ffffffff8100b0f2

That one is not stuck, try the next one:

crash> bt 1694
PID: 1694   TASK: ffff880331c81540  CPU: 6   COMMAND: "auditd"
 #0 [ffff8803320a1b98] schedule at ffffffff814ec9f2
 #1 [ffff8803320a1c60] __generic_file_aio_write at ffffffff81112bd5
 #2 [ffff8803320a1d20] generic_file_aio_write at ffffffff81112fff
 #3 [ffff8803320a1d70] ext4_file_write at ffffffffa009ade1 [ext4]
 #4 [ffff8803320a1dc0] do_sync_write at ffffffff811762da
 #5 [ffff8803320a1ef0] vfs_write at ffffffff811765d8
 #6 [ffff8803320a1f30] sys_write at ffffffff81176fe1
 #7 [ffff8803320a1f80] system_call_fastpath at ffffffff8100b0f2

Using the function just before the call to schedule, map the address to code and see where we are. That looks to be stuck on a frozen filesystem, check

 #1 [ffff8803320a1c60] __generic_file_aio_write at ffffffff81112bd5

$ eu-addr2line -e /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/2.6.32-220.el6.x86_64/vmlinux 0xffffffff81112bd5
mm/filemap.c:2437
m/filemap.c
2428                                                                                                                                                                                                                                        
2429    ocount = 0;                                                                                                                                                                                                                         
2430    err = generic_segment_checks(iov, &nr_segs, &ocount, VERIFY_READ);                                                                                                                                                                  
2431    if (err)                                                                                                                                                                                                                            
2432            return err;                                                                                                                                                                                                                 
2433                                                                                                                                                                                                                                        
2434    count = ocount;                                                                                                                                                                                                                     
2435    pos = *ppos;                                                                                                                                                                                                                        
2436                                                                                                                                                                                                                                        
2437--> vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);                                                                                                                                                                                     
2438                                                                                                                                                                                                                                        

Pull the first argument to __generic_file_aio_write() off the stack so we can get to the super_block:

crash> dis -rl ffffffff81112fff
...
0xffffffff81112ff4 <generic_file_aio_write+100>:    mov    %r14,%rdi
0xffffffff81112ff7 <generic_file_aio_write+103>:    mov    %r15,%rsi
0xffffffff81112ffa <generic_file_aio_write+106>:    callq  0xffffffff81112b10 <__generic_file_aio_write>
/usr/src/debug/kernel-2.6.32-220.el6/linux-2.6.32-220.el6.x86_64/mm/filemap.c: 2539
0xffffffff81112fff <generic_file_aio_write+111>:    mov    %rbx,%rdi

crash> dis -rl ffffffff81112bd5
/usr/src/debug/kernel-2.6.32-220.el6/linux-2.6.32-220.el6.x86_64/mm/filemap.c: 2419
0xffffffff81112b10 <__generic_file_aio_write>:  push   %rbp
0xffffffff81112b11 <__generic_file_aio_write+1>:    mov    %rsp,%rbp
0xffffffff81112b14 <__generic_file_aio_write+4>:    push   %r15
0xffffffff81112b16 <__generic_file_aio_write+6>:    push   %r14
...

crash> bt -f
PID: 1694   TASK: ffff880331c81540  CPU: 6   COMMAND: "auditd"
...
 #1 [ffff8803320a1c60] __generic_file_aio_write at ffffffff81112bd5
    ffff8803320a1c68: ffffffff00000282 ffff8803320a1ca0 
    ffff8803320a1c78: ffff8803320a1d10 000000000831d4b7 
    ffff8803320a1c88: 00007f5bdf512274 ffff8803320a1e48 
    ffff8803320a1c98: 0000000000000129 ffff8803320a1eb8 
    ffff8803320a1ca8: ffff8803320a1dc8 0000000000000001 
    ffff8803320a1cb8: 0000000000000000 ffff880331c81540 
    ffff8803320a1cc8: ffffffff81090bf0 ffff88010ad47cd0 
    ffff8803320a1cd8: ffff880122fa9cd0 0000000000000129 
    ffff8803320a1ce8: 0000000000000000 ffff88032c3c1978 
    ffff8803320a1cf8: ffff880046425b40 000000000036c3a4 
    ffff8803320a1d08: ffff8803320a1dc8 ffff8803320a1eb8 
    ffff8803320a1d18: ffff8803320a1d68 ffffffff81112fff 
 #2 [ffff8803320a1d20] generic_file_aio_write at ffffffff81112fff
...

crash> kiocb.ki_filp ffff8803320a1dc8
  ki_filp = 0xffff880046425b40

crash> struct file.f_mapping 0xffff880046425b40
  f_mapping = 0xffff88032c3c19e0

crash> address_space.host 0xffff88032c3c19e0
  host = 0xffff88032c3c18c0

crash> inode.i_sb 0xffff88032c3c18c0
  i_sb = 0xffff880332b1b000

crash> super_block.s_frozen 0xffff880332b1b000
  s_frozen = 2

That value (2) is SB_FREEZE_TRANS and means that fsfreeze has been used to freeze the filesystem. It will require a thaw operation to bring it back to life and allow the audit daemon to continue writing it's audit stream. The audit daemon, amongst other processes, has been stuck like this for ~11 hours.

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.

Comments