Kernel Panic with Megasas driver
I am getting a kernel panic with RHEL 6 on an HP ProLiant DL350 Gen8. I can boot 2.6.32-573.el6.x86_64 with no issues, but I get a kernel panic with 2.6.32-642.15.1.el6.x86_64.
The DL350 has 384G of RAM and four disks in a 1+0 hardware RAID. The hardware RAID controller is a Smart Array P420i v3.54. The HP BIOS is P71, 03/01/2013.
When I boot the -642 kernel, I get about a 30-second delay with the cursor blinking in the upper-left corner of the screen. (This delay is less than 15 seconds on the -573 kernel.) After the horizontal progress bars show up at the bottom of the screen, the boot ultimately ends in a kernel panic.
Call Trace:
<IRQ> [ffffffff8107c6f1>] ? warn_slowpath_common+0x91/0xe0
[ffffffff8017c75a>] ? warn_slowpath_null+0x1a/0x20
[ffffffff8103739c>] ? native_smp_send_reschedule+0x5c/0x60
[ffffffff810685a8>] ? scheduler_tick+0x208/0x260
[ffffffff810b89e0>] ? tick_sched_timer+0x0/0xc0
[ffffffff8108f15e>] ? update_process_times+0x6e/0x90
[ffffffff810b8a46>] ? tick_sched_timer+0x66/0xc0
[ffffffff810ab12e>] ? __run_hrtimer+0x8e/0x1d0
[ffffffff810ab4ce>] ? hrtimer_interrupt+0xee/0x270
[ffffffff8103879d>] ? local_apic_timer_interrupt+0x13/0x70
[ffffffff8100bc13>] ? smp_apic_timer_interrupt+0x45/0x60
<EOI> [ffffffff81548590>] ? panic+0x156/0x179
[ffffffff8154851d>] ? panic+0xe3/0x179
[ffffffff8112b080>] ? perf_event_exit_task+0xc0/0x340
[ffffffff81081f97>] ? do_exit+0x867/0x870
[ffffffff8119b915>] ? fput+0x25/0x30
[ffffffff81081ff8>] ? do_group_exit+0x58/0xd0
[ffffffff81082087>] ? sys_exit_group+0x17/0x20
[ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x16
---[ end trace 823bbcc7f0cc64c9 ]---
While attempting to resolve the issue, I also installed the -642 firmware and headers, as well as the perf and python-perf packages mentioned in RHSA-2017:0307. (My previous kernel has only the kernel rpm installed, not the firmware or the headers.) The errata suggests to me that the issue may be caused by changes in the MegaSAS driver. The updated initscripts (RHBA-2017:0313) include the workaround for /etc/init.d/halt.
Update 1
Booting with rdinitdebug, the -573 kernel boots without any issues. The -642 kernel keeps failing at the line
udevadm settle --exit-if-exists=/dev/sdb1
Did the later kernel change how disk partitions are numbered or something? Why would it work on the -573 kernel and not the -642?
End Update 1
Thank you,
David
Responses
Are you talking about the "megaraid_sas" module? ...
If I grep the changelog for "megasas", I see this:
[root@ansible-host packages]# rpm -q --changelog kernel-2.6.32-642.15.1.el6.x86_64|grep -i megasas
- [scsi] megaraid: fix null pointer check in megasas_detach_one() (Tomas Henzl) [1294983]
- [scsi] megaraid_sas: Remove debug print from function megasas_update_span_set (Tomas Henzl) [1248207]
- [scsi] megaraid_sas: Code cleanup-use local variable drv_ops inside megasas_ioc_init_fusion (Tomas Henzl) [1248207]
- [scsi] megaraid_sas: swap whole register in megasas_register_aen (Tomas Henzl) [1248207]
- [scsi] megaraid_sas: fix megasas_fire_cmd_fusion calling convention (Tomas Henzl) [1248207]
- [scsi] megaraid_sas: move endianness conversion into caller of megasas_get_seq_num (Tomas Henzl) [1248207]
- [scsi] megaraid_sas: megasas_complete_outstanding_ioctls() can be static (Tomas Henzl) [1248207]
- [scsi] megaraid_sas : Modify return value of megasas_issue_blocked_cmd() and wait_and_poll() to consider command status returned by firmware (Tomas Henzl) [1219105]
- [scsi] megaraid_sas: Remove unused variables in megasas_instance (Tomas Henzl) [1172980]
- [scsi] megaraid_sas: Add missing initial call to megasas_get_ld_vf_affiliation() (Tomas Henzl) [1172980]
- [scsi] megaraid_sas: Fix megasas_ioc_init_fusion (Tomas Henzl) [1059073]
- [scsi] megaraid_sas: check return value for megasas_get_pd_list() (Tomas Henzl) [1059073]
- [scsi] megaraid_sas_fusion: Return correct error value in megasas_get_ld_map_info() (Tomas Henzl) [1059073]
- [scsi] megaraid_sas: Fix instance access in megasas_reset_timer (Tomas Henzl) [759318]
- [scsi] megaraid_sas: Disable interrupts/free_irq() in megasas_shutdown() (Tomas Henzl) [613564]
- [scsi] megaraid_sas: Fix megasas_build_dcdb_fusion to use correct LUN field (Shyam Iyer) [692673]
- [scsi] megaraid_sas: Fix megasas_build_dcdb_fusion to not filter by TYPE_DISK (Shyam Iyer) [692673]
- [scsi] megaraid_sas: Enable MSI-X before calling megasas_init_fw (Shyam Iyer) [692673]
And if I compare the changelogs of "2.6.32-573.el6.x86_64" and "2.6.32-642.15.1.el6.x86_64" for entries pertaining to the megasas module, I can see that these changes are new in the newer kernel:
- [scsi] megaraid: fix null pointer check in megasas_detach_one() (Tomas Henzl) [1294983]
- [scsi] megaraid_sas: Remove debug print from function megasas_update_span_set (Tomas Henzl) [1248207]
- [scsi] megaraid_sas: Code cleanup-use local variable drv_ops inside megasas_ioc_init_fusion (Tomas Henzl) [1248207]
- [scsi] megaraid_sas: swap whole register in megasas_register_aen (Tomas Henzl) [1248207]
- [scsi] megaraid_sas: fix megasas_fire_cmd_fusion calling convention (Tomas Henzl) [1248207]
- [scsi] megaraid_sas: move endianness conversion into caller of megasas_get_seq_num (Tomas Henzl) [1248207]
- [scsi] megaraid_sas: megasas_complete_outstanding_ioctls() can be static (Tomas Henzl) [1248207]
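The comparison above can be reproduced with a small helper. This is only a sketch: the function name new_changelog_entries and the file names old.txt and new.txt are made up for illustration, and it assumes both changelogs were first saved to text files with the same rpm -q --changelog command shown earlier.

```shell
# Save the two changelogs first (file names are placeholders):
#   rpm -q --changelog kernel-2.6.32-573.el6.x86_64      > old.txt
#   rpm -q --changelog kernel-2.6.32-642.15.1.el6.x86_64 > new.txt

# new_changelog_entries OLD_FILE NEW_FILE PATTERN
# Print lines matching PATTERN (case-insensitive) that are in NEW_FILE
# but not in OLD_FILE.
new_changelog_entries() {
    # 1st grep: matching entries from the old changelog, used as fixed-string patterns
    # 2nd grep: drop every line of the new changelog that equals one of them (-F -x -v)
    # 3rd grep: keep only the remaining lines that match the pattern
    grep -i -- "$3" "$1" | grep -F -v -x -f - "$2" | grep -i -- "$3"
}
```

Calling new_changelog_entries old.txt new.txt megasas should list only the entries that are new in the second file.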
There do appear to be some changes, but I am not sure whether they caused the panic. I would also check that the new kernel is not missing any modules compared with the older one.
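One way to check for missing modules is to compare the module files packed into the two initramfs images. The sketch below assumes the image contents were listed with lsinitrd (part of dracut) and saved to text files. The function names and the old.lst and new.lst file names are hypothetical.

```shell
# List each image first (file names are placeholders):
#   lsinitrd /boot/initramfs-2.6.32-573.el6.x86_64.img      > old.lst
#   lsinitrd /boot/initramfs-2.6.32-642.15.1.el6.x86_64.img > new.lst

# module_names LISTING
# Print the base names of all *.ko files in a listing, sorted and de-duplicated.
# Base names are used because the directory contains the kernel version.
module_names() {
    grep -o '[^/ ]*\.ko$' "$1" | sort -u
}

# missing_modules OLD_LISTING NEW_LISTING
# Print modules present in the old listing but absent from the new one.
missing_modules() {
    old_list=$(mktemp)
    new_list=$(mktemp)
    module_names "$1" > "$old_list"
    module_names "$2" > "$new_list"
    comm -23 "$old_list" "$new_list"   # lines unique to the first file
    rm -f "$old_list" "$new_list"
}
```

If missing_modules old.lst new.lst prints a storage driver, the newer initramfs may need to be rebuilt.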
Have you installed any kmod-hpsa packages for your P420i controller? It provides kernel modules for controlling HP Smart Array Controllers and contains fixes and enhancements that are not present in the kernel driver.
It might not be the solution but worth a try.
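Before installing anything, it may help to confirm which kernel driver is actually bound to the controller on the working -573 kernel. Smart Array controllers such as the P420i are typically handled by hpsa rather than megaraid_sas. The sketch below parses the output of lspci -k. The function name raid_drivers is made up, and it assumes the controller is reported with the "RAID bus controller" class string.

```shell
# raid_drivers
# Read `lspci -k` output on stdin and print "<PCI address> <driver>"
# for every device whose class line contains "RAID bus controller".
raid_drivers() {
    awk '
        # Device header lines start with a hex PCI address; detail lines are indented.
        /^[0-9a-f]/ { is_raid = ($0 ~ /RAID bus controller/); addr = $1 }
        # The driver name is the last field of the "Kernel driver in use:" line.
        is_raid && /Kernel driver in use:/ { print addr, $NF }
    '
}

# Usage on a running system:
#   lspci -k | raid_drivers
```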
Is this running in FIPS mode? If so, you could run mkinitrd or dracut to create the missing image with FIPS mode enabled. Please check out this page: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Security_Guide/sect-Security_Guide-Federal_Standards_And_Regulations-Federal_Information_Processing_Standard.html
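A quick way to answer the FIPS question from the working kernel is to read the kernel's FIPS flag and the boot command line. This is a minimal sketch: the function name fips_status and its three result strings are invented for illustration, and the two paths are parameters only so the function can be tried against sample files.

```shell
# fips_status [FLAG_FILE] [CMDLINE_FILE]
# Print "enabled", "requested-but-not-active", or "disabled".
# Defaults read the live kernel state.
fips_status() {
    flag_file="${1:-/proc/sys/crypto/fips_enabled}"
    cmdline_file="${2:-/proc/cmdline}"
    if [ -r "$flag_file" ] && [ "$(cat "$flag_file")" = "1" ]; then
        # The kernel reports FIPS mode as active
        echo "enabled"
    elif grep -qw 'fips=1' "$cmdline_file" 2>/dev/null; then
        # fips=1 was passed at boot, but the kernel does not report it as active
        echo "requested-but-not-active"
    else
        echo "disabled"
    fi
}
```

Running fips_status with no arguments checks the running system.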
