"megaraid_sas: FW detected to be in fault state, restarting it..." in logs following a read-only filesystem event or server crash
Issue
- Server crashed with the following message seen in the vmcore logs:
crash> log
[...]
NMI received for unknown reason 3c
CPU 0
Modules linked in: vxodm(PFU) nfsd auth_rpcgss autofs4 smbus(U) ipmi_devintf ipmi_si ipmi_msghandler nfs nfs_acl dmpjbod(PU) dmpap(PU) dmpaa(PU) dmpalua(PU) vxspec(PFU) vxio(PFU) vxdmp(PU) lockd sunrpc be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i libcxgbi cxgb3 libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi vxportal(PFU) fdd(PFU) vxfs(PU) exportfs dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport tpm_infineon sr_mod cdrom igb(U) sg pcspkr i2c_i801 i2c_core 8021q dca tpm_tis tpm tpm_bios dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod lpfc scsi_transport_fc ata_piix libata shpchp megaraid_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 0, comm: swapper Tainted: PF ---- 2.6.18-274.12.1.el5 #1
RIP: 0010:[<ffffffff8006b9bf>] [<ffffffff8006b9bf>] mwait_idle_with_hints+0x66/0x67
RSP: 0018:ffffffff8045df88 EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffffffff80056c7e RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000090000 R08: ffffffff8045c000 R09: 0000000000000028
R10: ffff81407ff90368 R11: 0000000000000206 R12: 000000007901394c
R13: 000000000000001f R14: 0000000000075000 R15: fffffffff00000c6
FS: 0000000000000000(0000) GS:ffffffff8042c000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000009ef168 CR3: 0000003e61c9e000 CR4: 00000000000006a0
Process swapper (pid: 0, threadinfo ffffffff8045c000, task ffffffff80315b60)
Stack: ffffffff80056c8a ffffffff80048fe2 0000000000200800 ffffffff80467809
0000000000090000 000000007901394c ffffffff804b6740 ffffffff8046722f
80008e000010019c 00000000ffffffff 0000000000000000 0000000000000000
Call Trace:
[<ffffffff80056c8a>] mwait_idle+0xc/0x20
[<ffffffff80048fe2>] cpu_idle+0x95/0xb8
[<ffffffff80467809>] start_kernel+0x220/0x225
[<ffffffff8046722f>] _sinittext+0x22f/0x236
Code: c3 41 57 41 56 49 89 f6 41 55 49 89 fd 41 54 4c 8d a7 e0 02
- The filesystem was remounted read-only, with the following messages in the logs:
megasas: moving cmd[95]:ffff81407fd29240:0:ffff813d19c82b40 on the defer queue as internal reset in progress.
megaraid_sas: FW detected to be in fault state, restarting it...
megaraid_sas: FW was restarted successfully, initiating next stage...
megaraid_sas: HBA recovery state machine, state 2 starting...
megasas: Waiting for FW to come to ready state
megasas: FW in FAULT state!!
FW state [-268435456] hasn't changed in 180 secs
megaraid_sas: out: controller is not in ready state
megasas: waiting_for_outstanding: after issue OCR.
megasas: waiting_for_outstanding: before issue OCR. FW state = f0000000
megasas: moving cmd[0]:ffff8130800f3340:0:ffff810ca2edd500 on the defer queue as internal reset in progress.
megaraid_sas: ERROR while moving this cmd:ffff8130800f3340, 0 ffff810ca2edd500, it was discovered on some list?
sd 0:2:0:0: timing out command, waited 360s
sd 0:2:0:0: Unhandled error code
sd 0:2:0:0: SCSI error: return code = 0x06000000
Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
Buffer I/O error on device sda3, logical block 585840
lost page write due to I/O error on sda3
sd 0:2:0:0: timing out command, waited 360s
sd 0:2:0:0: rejecting I/O to offline device
Buffer I/O error on device sda3, logical block 585840
lost page write due to I/O error on sda3
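The two numeric values in the log above can be decoded by hand, which helps confirm that both messages describe the same condition. The following is a minimal sketch, not part of the driver: the helper names `fw_state_name` and `decode_scsi_result` are hypothetical, and the constants are assumptions based on the Linux `megaraid_sas` and SCSI headers as commonly documented (state in the top four bits of the firmware status register, driver byte in bits 24-31 of the SCSI result). It shows that `-268435456` is simply `0xf0000000` printed as a signed 32-bit integer, and that `0x06000000` carries a driver byte of `0x06`, matching the `DRIVER_TIMEOUT` text on the following log line.

```python
# Assumed constants, taken from the megaraid_sas header as commonly documented.
MFI_STATE_MASK = 0xF0000000
MFI_STATES = {
    0xB0000000: "READY",
    0xC0000000: "OPERATIONAL",
    0xF0000000: "FAULT",
}


def fw_state_name(raw):
    """Map a firmware state value (signed or unsigned) to a state name."""
    value = raw & 0xFFFFFFFF          # reinterpret a signed 32-bit value as unsigned
    state = value & MFI_STATE_MASK    # the state lives in the top four bits
    return MFI_STATES.get(state, "state 0x%08x" % state)


def decode_scsi_result(result):
    """Split a 32-bit SCSI result into its four bytes (assumed layout)."""
    driver = (result >> 24) & 0xFF
    return {
        "status": result & 0xFF,
        "msg": (result >> 8) & 0xFF,
        "host": (result >> 16) & 0xFF,   # 0x00 is DID_OK
        "driver": driver & 0x0F,         # 0x06 is DRIVER_TIMEOUT
        "suggest": driver & 0xF0,        # 0x00 is SUGGEST_OK
    }


if __name__ == "__main__":
    print(hex(-268435456 & 0xFFFFFFFF))     # value from "FW state [-268435456]"
    print(fw_state_name(-268435456))
    print(decode_scsi_result(0x06000000))   # value from "return code = 0x06000000"
```

Under these assumptions, both the "FW state [-268435456]" line and the "FW state = f0000000" line report the controller firmware in the FAULT state, and the SCSI error is a driver-level timeout rather than a status returned by the device.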
Environment
- Red Hat Enterprise Linux 5
- MegaRAID storage controller (megaraid_sas driver)