Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu x
Hello
I have five HP-DL360e-Gen8 servers running RHEL 6.5 (command line only), and two of them crashed with this panic today.
The servers restarted automatically. Both had an uptime of 6 days and a few hours.
I am a newbie with RHEL. I searched for this error, but without success.
In one article I read that this kernel panic should already be fixed in the current kernel. Strange.
The kdump.log is attached; the other logs are not really clear to me.
I hope somebody can help me solve this. I can provide more logs.
Thanks in advance.
Responses
Opening a case is the best recourse with this.
Typically Red Hat Support would ask for a sosreport, but that will be impossible until the system stops halting from kernel panics.
Just for your future reference, check out this Red Hat article/guide on unexpected halts/restarts.
Kind Regards,
Remmele
Great link Remmele!
Felix - was the issue due to a bad motherboard? (I just responded to your other post about replacing a system board and having new MAC addresses ;-)
Also - I am curious why you changed the crashkernel value. I have not found a very good explanation of how to choose the value (and therefore I just set it to auto and hope for the best).
Hi Felix,
I believe the fact that the system is staying up longer is a coincidence. The crashkernel value should not influence that (the only reason I could see a different crashkernel size preventing a system crash would be if you had bad memory and the bigger crashkernel was preventing you from actually using that memory).
If this system is not yet in "production", I would do the following:
* create a sosreport (and copy it to another system)
* make sure kdump is configured and enabled
* test kdump
* run a memory test from the RHEL installation media
* confirm ALL your firmware is at a level that does not have known issues
https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Deployment_Guide/s2-kdump-configuration-testing.html
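For what it's worth, the scriptable parts of that list could be sketched roughly like this on RHEL 6. This is only a sketch: the `run` helper, the `kdump_checklist` function, and the `DRY_RUN` variable are names I made up for illustration, and it prints the commands by default instead of running them. The memory test and firmware checks are left out because they are done outside the OS.

```shell
#!/bin/sh
# Hypothetical helper: print the command by default; set DRY_RUN=0 to
# really execute it (needs root).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper around the steps listed above.
kdump_checklist() {
    # Collect a sosreport without interactive prompts; copy the result elsewhere.
    run sosreport --batch
    # Make sure the kdump service is enabled at boot and check its state.
    run chkconfig kdump on
    run service kdump status
    # WARNING: with DRY_RUN=0 this crashes the machine immediately to test kdump.
    run sh -c 'echo 1 > /proc/sys/kernel/sysrq && echo c > /proc/sysrq-trigger'
}
```

Calling `kdump_checklist` only prints the steps. Run it with `DRY_RUN=0` on a box you can afford to crash, since the last step deliberately panics the kernel.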
On one of your other systems with crashkernel=auto - you can see what the algorithm has selected for a value:
[jradtke@devstack ~]$ cat /proc/cmdline
ro root=/dev/mapper/vg_devstack-lv_root rd_NO_LUKS KEYBOARDTYPE=pc KEYTABLE=us rd_LVM_LV=vg_devstack/lv_root LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 crashkernel=130M@0M rd_MD_UUID=d80fa059:c023d6cf:643f782f:ce20941f rd_LVM_LV=vg_devstack/lv_swap rd_NO_DM rhgb quiet
[jradtke@devstack ~]$ sudo grep crash /boot/grub/grub.conf
kernel /vmlinuz-2.6.32-431.17.1.el6.x86_64 ro root=/dev/mapper/vg_devstack-lv_root rd_NO_LUKS KEYBOARDTYPE=pc KEYTABLE=us rd_LVM_LV=vg_devstack/lv_root LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 crashkernel=auto rd_MD_UUID=d80fa059:c023d6cf:643f782f:ce20941f rd_LVM_LV=vg_devstack/lv_swap rd_NO_DM rhgb quiet
[jradtke@devstack ~]$ free -m
             total       used       free     shared    buffers     cached
Mem:         15655       7517       8138          0        222       3656
-/+ buffers/cache:       3637      12018
Swap:         7871          0       7871
In that example, I have a host with 16GB and crashkernel=auto. It appears it set the value to 130M with 0M offset.
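If you want to pull just that value out on each of your hosts, a small shell function along these lines should do it. The function name `crashkernel_of` is my own invention, and it simply takes the command line as a string so you can feed it either /proc/cmdline or a kernel line from grub.conf.

```shell
#!/bin/sh
# Hypothetical helper: print the crashkernel= value found in a kernel
# command line string. Prints nothing and returns 1 if it is not set.
crashkernel_of() {
    # Deliberately unquoted so the string is split into words on whitespace.
    for word in $1; do
        case "$word" in
            crashkernel=*)
                printf '%s\n' "${word#crashkernel=}"
                return 0
                ;;
        esac
    done
    return 1
}
```

Usage would be `crashkernel_of "$(cat /proc/cmdline)"` for the value the running kernel actually received; on a host with `crashkernel=auto` in grub.conf this shows what auto resolved to.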
Yep, sorry - sos is the package. ;-) sosreport is the command.
kdump can be tricky. You need to have the appropriate packages installed, your crashkernel must be valid, and the kdump service must be enabled.
What happens: you "crash" the box, which starts the process; it then reboots (to the alternate kernel) and starts writing to the crash dump location.
I have had problems with the volume not being available (because of an issue in my /etc/lvm/lvm.conf), a bad crashkernel value (so kdump could not start), the NAS share I was attempting to use not being available, and not enough space in /var/crash.
I believe the sosreport output would allow Red Hat to identify what might be wrong.
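A couple of those failure modes can be checked before you trigger a test crash. Here is a rough sketch, assuming a local dump directory: `kdump_path` and `has_free_mb` are hypothetical helper names, and the only kdump.conf detail relied on is the `path` directive with its /var/crash default. It does not cover NFS or other remote targets, where the path is relative to the mounted target.

```shell
#!/bin/sh
# Hypothetical helper: print the dump directory from the "path" directive of
# a kdump.conf-style file, falling back to /var/crash when it is not set.
kdump_path() {
    conf=$1
    # Commented lines such as "#path ..." do not match because $1 is "#path".
    path=$(awk '$1 == "path" { p = $2 } END { if (p != "") print p }' "$conf" 2>/dev/null)
    printf '%s\n' "${path:-/var/crash}"
}

# Hypothetical helper: succeed when directory $1 exists and its filesystem
# has at least $2 MB available.
has_free_mb() {
    dir=$1
    need_mb=$2
    [ -d "$dir" ] || return 1
    # df -Pk: POSIX output format, 1024-byte blocks; column 4 is "Available".
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((need_mb * 1024)) ]
}
```

For example, `has_free_mb "$(kdump_path /etc/kdump.conf)" 16000 || echo "dump directory missing or too small"`. The 16000 is only a guess based on a 16GB host; how much room a dump really needs depends on how it is filtered and compressed.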
You should not have to manually restart (in fact, I believe that invalidates the process).
I'll research further regarding the reboot and get back to you (I'm responding in hopes that someone already knows this and can help). I have seen other folks mention that kdump doesn't work depending on what they set their crashkernel value to.
Also, I believe there is a BIOS setting (or possibly an OS setting as well) that configures how the host responds to certain triggers (much like ctrl-alt-delete). If I recall correctly, there is actually a SysRq BIOS setting.
Kdump is another one of those items that I become quite familiar with, then do not deal with for a long time, and forget about. In my situation, I dump the crash files to a NAS share, so I can see whether it is still working (by watching the dump size).
I wish I had a better answer at this point.
Great Felix - thanks for the update.
I believe that the majority of the times I have seen "lockups" or "hung task timeout" type errors, they were due to some sort of storage issue I was having.
Hopefully you can get your kdump configuration worked out. In general, it is pretty straightforward. However, some items have plagued me while getting it figured out, such as having the updated RAID card drivers included in the initrd that kdump creates, which has me curious whether that might be the issue here (again, I am not very strong with kdump). I know that if you update your /etc/kdump.conf OR your kernel, the next time kdump runs it will rebuild a new initrd. I do not know whether adding a new device driver would do the same thing.
