Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu x

Hello

I have 5 HP DL360e Gen8 servers running RHEL 6.5 (command line only), and two of them crashed today with this panic.
The servers restarted automatically. Both had an uptime of 6 days and a few hours.
I am a newbie with RHEL... I searched for that error, but without success.
In one article I read that this kernel panic should already be fixed in the current kernel. Strange.
Attached you can find the kdump.log; the other logs are not really clear to me.
I hope somebody can help me solve it. I can provide logs.

Thanks in advance.

Responses

It seems you're not using the latest kernel: kernel-2.6.32-431.17.1.el6.x86_64
Have you tried that one?

A new kernel after such a short time... I will need to update the system every week... sounds like Windows!
But thanks for the information!
I have updated it already... now I just need to wait for a crash :-(

For a kdump analysis I would recommend opening a Red Hat support case.

You mean a support ticket?

That's correct, Felix. You can open one here: https://access.redhat.com/support/cases/new

thx!

Opening a case is the best recourse with this.

Typically Red Hat Support would ask for a sosreport, but that will be impossible until the system stops halting from kernel panics.

Just for your future reference, check out this Red Hat article/guide on unexpected halts/restarts

Kind Regards,
Remmele

Thank you too. I already know that reference... and I changed the crashkernel size from auto to 768M because of the 16 GB of RAM. I have updated the system already; now I wait to see if it works.

Great link Remmele!

Felix - was the issue due to a bad motherboard? (I just responded to your other post about replacing a system board and having new MAC addresses ;-)

Also - I am curious why you changed the crashkernel value. I have not found a very good explanation on how to decide the value (and therefore I just set it to auto and hope for the best).

On the other issue, with the network ports, the motherboard was changed because of a sensor failure. This issue is on another server, but the same model.

I changed that value because I found a table of crashkernel sizes somewhere on the internet, and it said that with more than 8 GB of RAM the value should not be lower than 768 MB.

With the setting auto, the server hits that kernel panic within a short time (a few days). After changing it to 768 MB it ran for a longer time... but as I see it, the kernel panic is still there.

HP told me to update the firmware and the drivers in Linux.
Now I am waiting for the crashes... but if somebody knows some "tuning" settings to stop the kernel panics, they are welcome!

Hi Felix,
I believe the fact that the system is staying up longer is a coincidence. The crashkernel value should not influence that (the only reason I could see a different crashkernel size preventing a system crash would be if you had bad memory and the bigger crashkernel was preventing you from actually using that memory).
If this system is not yet in "production", I would do the following:
* create a sosreport (and copy it to another system)
* make sure kdump is configured and enabled
* test kdump
* run a memory test from the RHEL installation media
* confirm ALL your firmware is at a level that does not have known issues

https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Deployment_Guide/s2-kdump-configuration-testing.html

On one of your other systems with crashkernel=auto - you can see what the algorithm has selected for a value:

[jradtke@devstack ~]$ cat /proc/cmdline 
ro root=/dev/mapper/vg_devstack-lv_root rd_NO_LUKS  KEYBOARDTYPE=pc KEYTABLE=us rd_LVM_LV=vg_devstack/lv_root LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 crashkernel=130M@0M rd_MD_UUID=d80fa059:c023d6cf:643f782f:ce20941f rd_LVM_LV=vg_devstack/lv_swap rd_NO_DM rhgb quiet
[jradtke@devstack ~]$ sudo grep crash /boot/grub/grub.conf
    kernel /vmlinuz-2.6.32-431.17.1.el6.x86_64 ro root=/dev/mapper/vg_devstack-lv_root rd_NO_LUKS  KEYBOARDTYPE=pc KEYTABLE=us rd_LVM_LV=vg_devstack/lv_root LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 crashkernel=auto rd_MD_UUID=d80fa059:c023d6cf:643f782f:ce20941f rd_LVM_LV=vg_devstack/lv_swap rd_NO_DM rhgb quiet
[jradtke@devstack ~]$ free -m
             total       used       free     shared    buffers     cached
Mem:         15655       7517       8138          0        222       3656
-/+ buffers/cache:       3637      12018
Swap:         7871          0       7871

In that example, I have a host with 16GB and crashkernel=auto. It appears it set the value to 130M with 0M offset.
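
To compare the configured and the effective reservation without reading the whole command line, a small helper along these lines may help. It is only a sketch: `crashkernel_value` is a name made up for this example, and it simply splits a command line on spaces and prints whatever follows `crashkernel=`.

```shell
# Hypothetical helper: print the crashkernel= value from a kernel command line.
# With no argument it reads the running kernel's /proc/cmdline.
crashkernel_value() {
    cmdline="${1:-$(cat /proc/cmdline)}"
    # One parameter per line, then keep only the crashkernel= entry.
    printf '%s\n' "$cmdline" | tr ' ' '\n' | sed -n 's/^crashkernel=//p'
}

# Example with an explicit (shortened) command line string.
crashkernel_value "ro root=/dev/sda1 crashkernel=130M@0M rhgb quiet"
```

Run without an argument on the host above, it would read /proc/cmdline and print the 130M@0M reservation rather than the "auto" written in grub.conf.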

thx!
I did a few of those points already, but I can do them again to post the information here... later today.

  • system is updated

  • sos report created; what do you mean by copy it to another system?

  • kdump.conf :

path /var/crash
core_collector makedumpfile -c --message-level 1 -d 31
debug_mem_level 3

  • on my test server it is crashsize=1M

  • free -m :
                 total       used       free     shared    buffers     cached
    Mem:         15910       2073      13836          0         34        954
    -/+ buffers/cache:       1084      14826
    Swap:        16399          0      16399

  • kdump test is running now.......

Great. I had suggested copying the sosreport to another system (in case you had to open a case and needed that sosreport and the system might have been unavailable - purely a precaution).

Looks good though - Let's hope your systems have stabilized now!

Ah, you meant the program sos... I have it installed on the other servers too.

But yesterday I tried to crash the system with:

echo 1 > /proc/sys/kernel/sysrq
echo c > /proc/sysrq-trigger

and it crashed... I had to reboot twice... but it wrote no crash log to the /var/crash folder. Any idea why?
Do I need to change some other settings for kdump?

Yep, sorry - sos is the package. ;-) sosreport is the command.

kdump can be tricky. You need to have the appropriate packages installed, your crashkernel must be valid, and the kdump service must be enabled.

What happens: you "crash" the box which will start the process, it will then reboot (to the alternate kernel) and start writing to the crashdump location.
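
Before triggering a test crash, it may be worth confirming that a capture kernel is actually loaded, since without one there is nothing to reboot into. A minimal sketch, assuming the kernel exposes /sys/kernel/kexec_crash_loaded (1 means loaded); `crash_kernel_loaded` is a made-up helper name:

```shell
# Hypothetical helper: succeed only if the kexec crash (capture) kernel is loaded.
# The flag file path can be overridden, which makes the check easy to try out.
crash_kernel_loaded() {
    flag="${1:-/sys/kernel/kexec_crash_loaded}"
    [ -r "$flag" ] && [ "$(cat "$flag")" = "1" ]
}

if crash_kernel_loaded; then
    echo "capture kernel loaded: a test crash should produce a dump"
else
    echo "capture kernel NOT loaded: fix kdump before testing a crash"
fi
```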

I have had problems with several things: the volume not being available (because of an issue I had in my /etc/lvm/lvm.conf), a bad crashkernel value (so kdump would not be able to start), the NAS share I was attempting to use not being available, and not enough space in /var/crash.
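
Two of those failure modes (wrong dump location, not enough space) can be checked up front. The following is a rough sketch, not a complete validation: `kdump_path` and `free_kib` are names invented here, the parser only looks at the `path` directive, and it assumes /var/crash is the default when no `path` line is present.

```shell
# Hypothetical helper: print the dump directory from a kdump.conf-style file.
# Falls back to /var/crash (assumed default) if there is no "path" directive.
kdump_path() {
    conf="${1:-/etc/kdump.conf}"
    # Commented lines start with "#path", so they never match $1 == "path".
    path=$(awk '$1 == "path" { p = $2 } END { print p }' "$conf" 2>/dev/null) || path=""
    printf '%s\n' "${path:-/var/crash}"
}

# Hypothetical helper: free space in KiB on the filesystem holding a directory.
# Prints nothing if the directory does not exist.
free_kib() {
    df -Pk "$1" 2>/dev/null | awk 'NR == 2 { print $4 }'
}

dir=$(kdump_path)
echo "dump path: $dir, free KiB: $(free_kib "$dir")"
```

How much space is enough depends on the core_collector settings; a compressed and filtered dump is normally much smaller than total RAM, so comparing against RAM size is only a conservative upper bound.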

I believe a sosreport output would be able to allow Red Hat to identify what might be wrong.

The sos package was/is up to date and the kdump service was/is active. Maybe I will try to crash it again.

just to understand:

  1. ran the two commands
  2. server crashed - but I needed to restart it manually
  3. server rebooted and then crashed again by itself (was that the attempt to boot the alternate kernel?)
  4. needed to restart it manually again (via iLO)
  5. now it was possible to log in to the OS - no crash log was written

Do I have to wait after the crash for it to restart automatically, or not?

You should not have to manually restart (in fact, I believe that invalidates the process).

I'll research further regarding the reboot and get back to you (I'm responding in hopes that someone already knows this and can help). I have seen other folks mention that kdump doesn't work depending on what they set their crashkernel value to.

Also, I believe there is a BIOS setting (or possibly an OS setting also) that configures how the host responds to certain triggers (much like ctrl-alt-delete). If I recall correctly, there is actually a SysRq BIOS setting.

Kdump is another item that I become quite familiar with, then do not deal with for a long time, and forget about. In my situation, I dump the crash files to a NAS share, so I can see that it is still working (by watching the dump size).

I wish I had a better answer at this point.

Thank you for your answer.
I will try that test again without rebooting manually.

I was thinking about another thing: would it be possible to set up the OS to kill the process that is crashing the system? Do you know what I mean?

I will let you know in the next days about the crash test.

10.06.2014

Since I started this discussion here, there has been no crash. I think that updating the kernel and the Linux driver for the hardware RAID card helped in this case, but I am not 100% sure... that's why I will keep an eye on it.

Great Felix - thanks for the update.

I believe a majority of the time that I see "lockups" or "hung task timeout" type errors, it was due to some sort of storage issue I was having.

Hopefully you can get your kdump configuration worked out. In general, it is pretty straightforward. However, some items have plagued me while getting it figured out, such as having the updated RAID card drivers included in the initrd that kdump creates, which makes me wonder if that might be the issue here (again, I am not very strong with kdump). I know that if you update your /etc/kdump.conf OR your kernel, the next time kdump runs it will rebuild a new initrd. I do not know whether adding a new device driver would do the same thing.
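
If the rebuild really is driven by the configuration file being newer than the kdump initrd, the check could be sketched like this. This is an assumption about the mechanism, not the actual init script logic; `needs_rebuild` is a made-up helper, and the initrd path follows the naming I would expect on RHEL 6, so adjust it if your system differs.

```shell
# Hypothetical helper: succeed if the kdump initrd is missing or older than
# the given configuration file (assumed rebuild trigger, see note above).
needs_rebuild() {
    conf="$1"
    initrd="$2"
    [ ! -e "$initrd" ] || [ "$conf" -nt "$initrd" ]
}

# Assumed RHEL 6 naming for the kdump initrd of the running kernel.
initrd="/boot/initrd-$(uname -r)kdump.img"
if needs_rebuild /etc/kdump.conf "$initrd"; then
    echo "kdump initrd missing or older than /etc/kdump.conf: expect a rebuild"
else
    echo "kdump initrd is newer than /etc/kdump.conf: no rebuild expected"
fi
```

If the initrd looks current but a driver was updated afterwards, touching /etc/kdump.conf and restarting the kdump service should, going by the behaviour described above, force a fresh initrd.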