Mounting an ext4 file system causes a kernel panic

Solution In Progress - Updated -

Environment

  • Red Hat Enterprise Linux 7
    • 3.10.0-1160.11.1.el7.x86_64
  • RHEL 7 and earlier RHEL versions
  • ext4 file system

Issue

  • Mounting an ext4 file system causes a kernel crash.
  • The ext4 file system is corrupted, and the mount command triggers a panic with the following backtrace:
crash> bt
PID: 31814  TASK: ffff9d986dfcb180  CPU: 5   COMMAND: "mount"
 #0 [ffff9d9a727eb890] machine_kexec at ffffffff9a6662c4
 #1 [ffff9d9a727eb8f0] __crash_kexec at ffffffff9a722802
 #2 [ffff9d9a727eb9c0] crash_kexec at ffffffff9a7228f0
 #3 [ffff9d9a727eb9d8] oops_end at ffffffff9ad8b798
 #4 [ffff9d9a727eba00] die at ffffffff9a630a7b
 #5 [ffff9d9a727eba30] do_trap at ffffffff9ad8aee0
 #6 [ffff9d9a727eba80] do_invalid_op at ffffffff9a62d2a4
 #7 [ffff9d9a727ebb30] invalid_op at ffffffff9ad972ee
    [exception RIP: ext4_clear_journal_err+230]
    RIP: ffffffffc0b3eb66  RSP: ffff9d9a727ebbe0  RFLAGS: 00010246
    RAX: ffff9d9f15234000  RBX: ffff9d9f15230000  RCX: 00000000026448fe
    RDX: ffff9d9f0c034400  RSI: ffff9d9f0c03443a  RDI: ffff9d9f15230000
    RBP: ffff9d9a727ebc10   R8: 000000000001f0e0   R9: ffffffffc0b69f65
    R10: ffff9d9f2f09f0e0  R11: fffffd55205a9640  R12: ffff9d9f15230000
    R13: ffff9d9f16a59e80  R14: ffff9d9f15236800  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #8 [ffff9d9a727ebc18] ext4_load_journal at ffffffffc0b69faa [ext4]
 #9 [ffff9d9a727ebca0] ext4_fill_super at ffffffffc0b4200e [ext4]
#10 [ffff9d9a727ebd90] mount_bdev at ffffffff9a851e53
#11 [ffff9d9a727ebe00] ext4_mount at ffffffffc0b3a595 [ext4]
#12 [ffff9d9a727ebe10] mount_fs at ffffffff9a8527be
#13 [ffff9d9a727ebe58] vfs_kern_mount at ffffffff9a871467
#14 [ffff9d9a727ebe90] do_mount at ffffffff9a873b9f
#15 [ffff9d9a727ebf18] sys_mount at ffffffff9a8749f3
#16 [ffff9d9a727ebf50] system_call_fastpath at ffffffff9ad93f92

Resolution

A Bugzilla has been opened to address the crash during ext4 mount:
https://bugzilla.redhat.com/show_bug.cgi?id=1933975

The issue has been reported upstream, and a fix has been identified:
https://lore.kernel.org/linux-ext4/20200710140759.18031-1-jack@suse.cz/
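
One general way to mitigate this class of crash is to return an error from the mount path when the journal feature flag is unexpectedly missing, instead of halting the machine with BUG_ON(). The user-space sketch below only illustrates that idea. It is not taken from the linked patch, and demo_sb, DEMO_COMPAT_HAS_JOURNAL and demo_clear_journal_err() are hypothetical names invented for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption: same bit value as EXT4_FEATURE_COMPAT_HAS_JOURNAL (0x0004). */
#define DEMO_COMPAT_HAS_JOURNAL 0x0004u

/* Hypothetical stand-in for the in-memory superblock state. */
struct demo_sb {
        uint32_t feature_compat;   /* feature bits, already in CPU byte order */
};

/*
 * Sketch of a journal-error handler that fails gracefully.
 * Returns 0 on success, or -EINVAL when the journal feature flag is
 * missing, so that the caller can fail the mount instead of crashing.
 */
static int demo_clear_journal_err(const struct demo_sb *sb)
{
        if (!(sb->feature_compat & DEMO_COMPAT_HAS_JOURNAL)) {
                fprintf(stderr,
                        "demo: journal feature flag missing, refusing mount\n");
                return -EINVAL;
        }

        /* Journal error propagation would happen here. */
        return 0;
}
```

With this shape, a caller playing the role of the mount path can check the return value and abort the mount cleanly, leaving the rest of the system running.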

Root Cause

The ext4 file system being mounted appears to have been corrupted by an underlying storage issue (this was observed in one case). The vmcore analysis indicates an invalid struct ext4_super_block pointer, which caused the BUG_ON() check in ext4_clear_journal_err() to fire and crash the system:

/*
 * If we are mounting (or read-write remounting) a filesystem whose journal
 * has recorded an error from a previous lifetime, move that error to the
 * main filesystem now.
 */
static void ext4_clear_journal_err(struct super_block *sb,
                                   struct ext4_super_block *es)
{
        journal_t *journal;
        int j_errno;
        const char *errstr;

        BUG_ON(!EXT4_HAS_COMPAT_FEATURE(sb, EXT4_FEATURE_COMPAT_HAS_JOURNAL));           <<  R[1]  

        journal = EXT4_SB(sb)->s_journal;

        /*
         * Now check for any error status which may have been recorded in the
         * journal by a prior ext4_error() or ext4_abort()
         */

        j_errno = jbd2_journal_errno(journal);
        if (j_errno) {
                char nbuf[16];

                errstr = ext4_decode_error(sb, j_errno, nbuf);
                ext4_warning(sb, "Filesystem error recorded "
                             "from previous mount: %s", errstr);
                ext4_warning(sb, "Marking fs in need of filesystem check.");

                EXT4_SB(sb)->s_mount_state |= EXT4_ERROR_FS;
                es->s_state |= cpu_to_le16(EXT4_ERROR_FS);
                ext4_commit_super(sb, 1);

                jbd2_journal_clear_err(journal);
                jbd2_journal_update_sb_errno(journal);
        }
}


#define EXT4_HAS_COMPAT_FEATURE(sb,mask)                        \
        ((EXT4_SB(sb)->s_es->s_feature_compat & cpu_to_le32(mask)) != 0)
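
To make the macro's check concrete, here is a minimal user-space sketch of the same bit test. It assumes the on-disk s_feature_compat field is a little-endian 32-bit value and that the has_journal bit is 0x0004. The names demo_super_block, demo_le32_to_cpu() and demo_has_compat_feature() are hypothetical and exist only for this illustration. The kernel macro converts the mask with cpu_to_le32() rather than converting the field, which gives the same result.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: same bit value as EXT4_FEATURE_COMPAT_HAS_JOURNAL (0x0004). */
#define DEMO_HAS_JOURNAL 0x0004u

/* Hypothetical stand-in for struct ext4_super_block: only the tested field. */
struct demo_super_block {
        uint8_t s_feature_compat[4];   /* raw little-endian bytes as on disk */
};

/* Decode a little-endian 32-bit field regardless of host byte order. */
static uint32_t demo_le32_to_cpu(const uint8_t b[4])
{
        return (uint32_t)b[0] |
               ((uint32_t)b[1] << 8) |
               ((uint32_t)b[2] << 16) |
               ((uint32_t)b[3] << 24);
}

/* True when any bit of 'mask' is set in the compat feature field. */
static bool demo_has_compat_feature(const struct demo_super_block *es,
                                    uint32_t mask)
{
        return (demo_le32_to_cpu(es->s_feature_compat) & mask) != 0;
}
```

If the superblock contents are garbage or unreadable, the bit can read as clear even on a file system that was created with a journal, and that is the condition the BUG_ON() rejects.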

The vmcore shows an invalid struct ext4_super_block pointer in ext4_sb_info.s_es, which led to the BUG_ON() condition:

crash> ext4_sb_info.s_es,s_es_shrinker,s_sb,journal_bdev 0xffff9d9f15234000
  s_es = 0xffff9d9f0c034400                                                 <<  X[1]
  s_es_shrinker = {
    shrink = 0xffffffffc0b60aa0 <ext4_es_shrink>, 
    seeks = 0x2, 
    batch = 0x0, 
    list = {
      next = 0xffffffff9b295940, 
      prev = 0xffff9d9f152303c8
    }, 
    nr_in_batch = {
      counter = 0x0
    }
  }
  s_sb = 0xffff9d9f15230000
  journal_bdev = 0x0

As shown below, the s_es value marked X[1] is not a valid kernel virtual address for a struct ext4_super_block; crash cannot read it from the vmcore:

crash> struct ext4_super_block.s_feature_compat 0xffff9d9f0c034400
struct: page excluded: kernel virtual address: ffff9d9f0c034400  type: "gdb_readmem_callback"           <<

