How to mitigate the error "BUG: soft lockup" in KVM guests?
Environment
- Red Hat Enterprise Linux (RHEL) 5.5
- KVM using KSM
- Guests with 8 virtual CPUs
Issue
- KVM guests are often showing the error "BUG: soft lockup" in /var/log/messages.
Resolution
There is no single fix for this issue. Several recommendations and configuration changes can reduce or eliminate the problem, depending on the scenario.
Recommendations:
1. Update the kernel to at least release 2.6.18-194.11.3
kernel-2.6.18-194.11.3.el5.x86_64.rpm can be downloaded from:
https://rhn.redhat.com/rhn/software/packages/details/Overview.do?pid=587488
2. Adjust the soft lock-up timeout
Another way to work around the problem is to use the /proc/sys/kernel/softlockup_thresh parameter.
As root, execute:
# echo <time> > /proc/sys/kernel/softlockup_thresh
Replace <time> with the number of seconds that must elapse before a soft lock-up warning is triggered. By default, this value is 10 seconds. The maximum soft lock-up timeout has been increased from 60 seconds to 300 seconds for systems that have a large number of CPUs. A soft lock-up can occur when a CPU reports memory starvation because it cannot access a memory node that other CPUs are accessing.
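The adjustment above can be wrapped in a small helper that validates the value before writing it. This is only a minimal sketch, not an official tool: the function name set_softlockup_thresh and its optional second argument (a target file, useful for a dry run) are hypothetical, and the lower bound of 1 is an assumption. The upper bound of 300 comes from the text above.

```shell
#!/bin/sh
# Hypothetical helper: validate and apply a soft lock-up threshold.
# $1 = threshold in seconds
# $2 = file to write (optional; defaults to the real tunable)
set_softlockup_thresh() {
    value="$1"
    target="${2:-/proc/sys/kernel/softlockup_thresh}"

    # Accept only a non-empty string of digits.
    case "$value" in
        ''|*[!0-9]*)
            echo "threshold must be a whole number of seconds" >&2
            return 1
            ;;
    esac

    # The upper bound (300) is taken from the article; the lower bound (1) is assumed.
    if [ "$value" -lt 1 ] || [ "$value" -gt 300 ]; then
        echo "threshold must be between 1 and 300 seconds" >&2
        return 1
    fi

    echo "$value" > "$target"
}
```

Run as root, "set_softlockup_thresh 30" would raise the threshold to 30 seconds until the next reboot. To keep the setting across reboots, the corresponding line "kernel.softlockup_thresh = 30" can be added to /etc/sysctl.conf.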
3. Change the I/O scheduler
Virtual machines need an appropriate I/O scheduler, as described in the following documents:
https://access.redhat.com/kb/docs/DOC-5428
http://www.redhat.com/magazine/008jun05/features/schedulers/
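Before changing anything, it helps to see which scheduler a block device is currently using. The kernel lists the available schedulers in /sys/block/<device>/queue/scheduler and marks the active one with square brackets. The sketch below extracts that entry; the function name active_scheduler is hypothetical, and the device name sda and the scheduler deadline in the comments are examples only, not recommendations from this article.

```shell
#!/bin/sh
# Hypothetical helper: print the active scheduler from a sysfs scheduler file.
# The kernel marks the active entry with square brackets,
# for example: "noop anticipatory deadline [cfq]".
# $1 = path to the scheduler file
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "$1"
}

# Example usage (as root; sda and deadline are illustrative assumptions):
#   active_scheduler /sys/block/sda/queue/scheduler
#   echo deadline > /sys/block/sda/queue/scheduler
```

A change made through sysfs lasts only until the next reboot. Which scheduler suits a given host or guest is covered by the documents linked above.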
4. Use KVM processor affinities
Configure the guest to use only 1 virtual CPU and apply KVM processor affinities, as described in the document below, to reduce CPU overcommit:
http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Virtualization/ch31s08.html
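Processor affinities can be applied with "virsh vcpupin <guest> <vcpu> <host-cpu>". As a rough sketch, the helper below only prints the pinning commands so they can be reviewed first. The function name print_vcpu_pins, the guest name guest1 and the one-to-one mapping of virtual CPUs to consecutive host CPUs are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) one "virsh vcpupin" command per
# virtual CPU, pinning vCPU i to host CPU (first_cpu + i).
# $1 = guest name, $2 = number of vCPUs, $3 = first host CPU to use
print_vcpu_pins() {
    guest="$1"
    vcpus="$2"
    first_cpu="$3"

    i=0
    while [ "$i" -lt "$vcpus" ]; do
        echo "virsh vcpupin $guest $i $((first_cpu + i))"
        i=$((i + 1))
    done
}

# Example usage for a guest with a single vCPU pinned to host CPU 2:
#   print_vcpu_pins guest1 1 2
```

After reviewing the output, the printed commands can be run as root. "virsh vcpuinfo guest1" then shows the resulting CPU affinity for each virtual CPU.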
5. Timing management parameters
If your guest does not have a constant Time Stamp Counter (TSC), it is recommended to use some parameters related to timing management (clock/interrupts).
As described in the document below, in this case it is necessary to add options such as "divider=10 notsc lpj=n" to the kernel line:
http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Virtualization/chap-Virtualization-KVM_guest_timing_management.html
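One way to decide whether these options apply is to look for the constant_tsc flag in /proc/cpuinfo. The sketch below assumes that this flag is the relevant indicator, which should be confirmed against the guide linked above. The function name has_constant_tsc and its optional file argument are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the cpuinfo file lists the constant_tsc flag.
# Assumption: constant_tsc is the flag that matters for guest timing.
# $1 = cpuinfo file to inspect (optional; defaults to /proc/cpuinfo)
has_constant_tsc() {
    # -w matches constant_tsc only as a whole word; -q suppresses output.
    grep -qw constant_tsc "${1:-/proc/cpuinfo}"
}

# Example usage:
#   if has_constant_tsc; then
#       echo "constant TSC present"
#   else
#       echo "no constant TSC: review the kernel options described above"
#   fi
```

The value n in lpj=n is a placeholder and must be replaced with a value appropriate for the system, as explained in the guide.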
6. KSM usage
Disable KSM usage and monitor the system to see if the error stops and load average decreases.
An incorrect KSM/overcommit ratio configuration can cause problems:
http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Virtualization/sect-Virtualization-Virtualization_limitations-KVM_limitations.html
7. Application errors
Check whether there are any zombie processes. They must be eliminated with a reboot.
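Zombie processes appear in "ps" output with a state that starts with the letter Z. As a small sketch, the filter below reduces "ps" output to those entries. The function name filter_zombies and the chosen column layout (PID, PPID, STAT, COMMAND) are assumptions of this example.

```shell
#!/bin/sh
# Hypothetical helper: read "ps" output with the columns PID PPID STAT COMMAND
# on standard input and print PID, PPID and command name for each zombie
# (state field starting with "Z"). The header line is skipped automatically
# because "STAT" does not start with "Z".
filter_zombies() {
    awk '$3 ~ /^Z/ { print $1, $2, $4 }'
}

# Example usage:
#   ps -eo pid,ppid,stat,comm | filter_zombies
```

An empty result means that no zombie processes were found. The PPID column identifies the parent process of each zombie.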
8. Monitor the system
Run the commands below while the error is occurring to check resource usage:
# iostat -t -x 2 &> /tmp/iostat.out
# top -b -d 2 &> /tmp/top.out
# vmstat 2 &> /tmp/vmstat.out
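Each of the three commands above runs until it is interrupted, so they need separate terminals or background execution. One possible way to capture all three at once for a fixed period is sketched below. The function name collect_stats, the output directory argument and the fixed duration are assumptions of this example.

```shell
#!/bin/sh
# Hypothetical helper: run the three collectors in the background for a
# fixed time, then stop them.
# $1 = output directory, $2 = duration in seconds
collect_stats() {
    outdir="$1"
    duration="$2"

    iostat -t -x 2 > "$outdir/iostat.out" 2>&1 &
    iostat_pid=$!
    top -b -d 2 > "$outdir/top.out" 2>&1 &
    top_pid=$!
    vmstat 2 > "$outdir/vmstat.out" 2>&1 &
    vmstat_pid=$!

    sleep "$duration"

    # Stop the collectors; ignore errors if one has already exited.
    kill "$iostat_pid" "$top_pid" "$vmstat_pid" 2>/dev/null || true
    wait "$iostat_pid" "$top_pid" "$vmstat_pid" 2>/dev/null || true
    return 0
}

# Example usage: collect two minutes of data into /tmp
#   collect_stats /tmp 120
```

Starting the collection as soon as the soft lockup messages appear gives the best chance of capturing the resource usage that accompanies them.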
9. In addition, you can use commands that prioritize processes, as described in the documents below:
Why does my process stop periodically? Is there any way to mitigate this?
https://access.redhat.com/kb/docs/DOC-9466
How do I increase the I/O priority of some processes on Red Hat Enterprise Linux 5?
https://access.redhat.com/kb/docs/DOC-16649
How do I set the real time scheduling priority of a process?
https://access.redhat.com/kb/docs/DOC-5468
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.