KVM causes process lockups

I have a SuperMicro X9DRI-LN4F+_R1.2A server with an Adaptec 71605 RAID controller (SoftLayer). I am using KVM virtualisation, running 8 VMs. For storage, the VMs use file-based images (raw format). The machine is running RHEL 6.5 (up to date).

The problem is that copying large files (for instance, the 32GB images) sometimes causes a process to freeze at 100% CPU. Killing the process with kill -9 17492 reports no error, but the process is not killed. It only recovers when all VMs are shut down.

Usually it is the process I was running that locks up (100% CPU), but occasionally some other process freezes instead. I currently have a locked-up sshd process.

top - 14:27:35 up 2 days,  7:15,  1 user,  load average: 4.40, 3.36, 3.35
Tasks: 478 total,   2 running, 476 sleeping,   0 stopped,   0 zombie
Cpu(s): 17.6%us,  4.7%sy,  0.0%ni, 77.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  32844184k total, 32048980k used,   795204k free,    77428k buffers
Swap:  4194296k total,        0k used,  4194296k free,  9479800k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 8356 qemu      20   0 5437m 4.0g 5116 S 396.6 12.7  62:44.72 qemu-kvm
17492 root      20   0 70024 3408 2612 R 100.0  0.0 802:44.09 sshd
 3154 qemu      20   0 6730m 5.9g 5096 S 10.3 18.9 198:04.79 qemu-kvm
 3055 qemu      20   0 2648m 2.0g 5524 S  5.6  6.4 463:21.61 qemu-kvm
 2497 root      20   0 1003m  15m 5348 S  5.3  0.0  13:31.13 libvirtd
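
For anyone trying to narrow down a similar hang, it can help to confirm the kernel's own view of the stuck process. Below is a minimal Python sketch that reads the state field from /proc/<pid>/stat. The function names and the STATES table are illustrative, not part of any tool mentioned in this thread. A task that stays in state R at 100% CPU and ignores kill -9 is usually spinning inside the kernel, because signals are only acted on when the task returns to user space; state D would instead point at a task blocked in uninterruptible I/O wait.

```python
# Sketch only: helper names are hypothetical, and only the first three
# fields of /proc/<pid>/stat (pid, comm, state) are interpreted.

STATES = {
    "R": "running or runnable",
    "S": "interruptible sleep",
    "D": "uninterruptible sleep (usually waiting on I/O)",
    "T": "stopped",
    "Z": "zombie",
}


def parse_stat(stat_line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the LAST ')' instead of splitting on whitespace.
    head, _, tail = stat_line.rpartition(")")
    pid_str, _, comm = head.partition("(")
    state = tail.split()[0]
    return int(pid_str), comm, state


def read_state(pid):
    """Read and parse /proc/<pid>/stat for a live process (Linux only)."""
    with open("/proc/%d/stat" % pid) as f:
        return parse_stat(f.read())


def describe(state):
    """Map a one-letter state code to a short description."""
    return STATES.get(state, "other/unknown state")
```

Calling read_state(17492) on the affected host would return the PID, command name and state letter for the stuck sshd, and describe() turns the letter into a short label.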

I came across this post from 2004 with a similar issue:

http://www.webhostingtalk.com/showthread.php?t=1273964

Can this be resolved in some way other than replacing the RAID controller? The controller's firmware is already up to date.

Responses

Sounds like a tough issue to nail down.

Are you using LVM for the images (I assume /var/lib/libvirt/images)?
Do you use any special mount options for your images (or is the directory just on /)?
Was that directory set up at server build, or afterwards (on separate LUNs)?

I have run into issues with partition alignment on SAN LUNs (not sure if that would translate to a local RAID controller in a similar way).
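
A rough way to check alignment is to compare the partition's starting offset against the array's stripe size. The Python sketch below does that. The 256 KiB stripe size in the usage note is only an assumed example (use the value configured on the controller), and the helper names are made up for illustration. It relies on /sys/block/<disk>/<partition>/start, which reports the start in 512-byte sectors.

```python
# Sketch only: function names are hypothetical; the stripe size must come
# from the RAID controller's configuration, it is not detected here.


def is_aligned(start_sector, stripe_bytes, sector_bytes=512):
    """True if the partition's byte offset is a multiple of the stripe size."""
    return (start_sector * sector_bytes) % stripe_bytes == 0


def partition_start(disk, part):
    """Read a partition's start sector from sysfs, e.g. ("sda", "sda1")."""
    with open("/sys/block/%s/%s/start" % (disk, part)) as f:
        return int(f.read())
```

For example, is_aligned(partition_start("sda", "sda1"), 256 * 1024) checks the first partition of sda against an assumed 256 KiB stripe. A partition starting at the old default of sector 63 fails this check, while one starting at sector 2048 (1 MiB) passes.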

Also - did you review this:
http://download.adaptec.com/pdfs/readme/series-7-8-controller_readme_12_2013.pdf

This guy has some good info (you'll have to sort through it)
http://wiki.mikejung.biz/index.php?title=Hardware

Thanks for your response. I got a new server from SoftLayer with a different RAID controller. Same problem. I found a suggestion on the web for a similar problem on Ubuntu, where they had to disable swap.

When I hit the same problem on the new server (while migrating), I turned off the swap (swapoff -a) and all the locked processes got "unlocked".

I think the server has no memory left, so when a new process asks for memory it locks up until physical memory becomes available. I am not sure why it never recovers, though.
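
To test the memory theory the next time a process locks up, the counters in /proc/meminfo can be captured before and after running swapoff -a. The Python sketch below parses that file and adds up free memory, buffers and page cache as a rough estimate of what the kernel could hand out. This is an approximation (not all cache is reclaimable) and the function names are illustrative only.

```python
# Sketch only: helper names are hypothetical. Values in /proc/meminfo are
# in kB; "reclaimable" here is a rough estimate, not an exact figure.


def parse_meminfo(text):
    """Parse the contents of /proc/meminfo into a dict of integer values."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not a "Key: value" line
        fields = rest.split()
        if fields:
            info[key.strip()] = int(fields[0])
    return info


def reclaimable_kb(info):
    """Free memory plus buffers and page cache, in kB (rough estimate)."""
    return info.get("MemFree", 0) + info.get("Buffers", 0) + info.get("Cached", 0)


def swap_used_kb(info):
    """Swap currently in use, in kB."""
    return info.get("SwapTotal", 0) - info.get("SwapFree", 0)


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live counters (Linux only)."""
    with open(path) as f:
        return parse_meminfo(f.read())
```

Logging reclaimable_kb(read_meminfo()) and swap_used_kb(read_meminfo()) at the moment of a hang would show whether the host really is out of memory when the lockup happens.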
