RHEL6.2 KVM host becomes unresponsive or incurs "Watchdog detected hard LOCKUP" panic

Environment

  • Red Hat Enterprise Linux 6.2
  • KVM host with Intel processor supporting "PAUSE-loop exiting"
  • KVM host with a large number of real CPUs
  • KVM virtual machines with a large number of restrictively pinned virtual CPUs

Issue

A Red Hat Enterprise Linux 6.2 KVM host can become unresponsive or panic with the following message if it is running virtual machines that have a large number of virtual CPUs pinned restrictively to real CPUs [1]:

    Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu ...

This issue was originally observed on a host with 80 real CPUs and two virtual machines. Each virtual machine used 40 virtual CPUs, and each virtual CPU was pinned to its own real CPU, for example:

    virtual CPU  0 of virtual machine A pinned to real CPU  0
    virtual CPU  1 of virtual machine A pinned to real CPU  1
    ...
    virtual CPU 38 of virtual machine B pinned to real CPU 78
    virtual CPU 39 of virtual machine B pinned to real CPU 79

[1] In this context, "real CPU" means either a processor core with a single thread or a hyperthread of a processor core.

This issue can only occur on KVM hosts with Intel processors that support the "PAUSE-loop exiting" VM execution control.
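
A one-to-one pinning layout like the one described above can be spotted by inspecting the virtual machine definition. The following is a minimal sketch, assuming the virtual machines are managed through libvirt and that pinning is expressed with `<vcpupin>` elements inside `<cputune>` in the domain XML. The function names and the sample XML are illustrative and do not come from the affected system.

```python
import xml.etree.ElementTree as ET


def parse_cpuset(spec):
    """Expand a cpuset string such as '0-3,^2,8' into a set of CPU numbers."""
    included, excluded = set(), set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        target = included
        if part.startswith("^"):  # '^N' excludes a CPU from the set
            target = excluded
            part = part[1:]
        if "-" in part:
            low, high = part.split("-", 1)
            target.update(range(int(low), int(high) + 1))
        else:
            target.add(int(part))
    return included - excluded


def single_cpu_pins(domain_xml):
    """Return {vcpu: cpu} for every virtual CPU pinned to exactly one real CPU."""
    root = ET.fromstring(domain_xml)
    pins = {}
    for pin in root.findall("./cputune/vcpupin"):
        cpus = parse_cpuset(pin.get("cpuset", ""))
        if len(cpus) == 1:
            pins[int(pin.get("vcpu"))] = next(iter(cpus))
    return pins


# Hypothetical domain definition, shortened for illustration.
EXAMPLE_XML = """
<domain type='kvm'>
  <name>vm-a</name>
  <vcpu>4</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='0'/>
    <vcpupin vcpu='1' cpuset='1'/>
    <vcpupin vcpu='2' cpuset='2-3'/>
    <vcpupin vcpu='3' cpuset='0-3,^1'/>
  </cputune>
</domain>
"""

if __name__ == "__main__":
    print(single_cpu_pins(EXAMPLE_XML))
```

A virtual machine in which most or all virtual CPUs appear in the result is pinned in the restrictive manner described in this article.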

Resolution

This issue is tracked by RHBZ#827031 (the bug is not publicly accessible; please contact your Red Hat Support representative if more information is required).

  • Using a less restrictive pinning of virtual CPUs can work around the issue.

  • Loading the kvm-intel kernel module with the parameter ple_gap=0 disables the "PAUSE-loop exiting" feature. This may serve as an alternative workaround.
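
To confirm whether the ple_gap workaround is in effect, the current value of the module parameter can be read back from the running kernel. The following is a minimal sketch, assuming the parameter is exposed at `/sys/module/kvm_intel/parameters/ple_gap`. The helper names are illustrative.

```python
from pathlib import Path

# Assumed sysfs location of the kvm_intel module parameter.
PLE_GAP_PATH = Path("/sys/module/kvm_intel/parameters/ple_gap")


def read_ple_gap(path=PLE_GAP_PATH):
    """Return the current ple_gap value, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def describe(value):
    """Summarize what a ple_gap value means for PAUSE-loop exiting."""
    if value is None:
        return "kvm_intel not loaded or ple_gap not exposed"
    if value == 0:
        return "PAUSE-loop exiting disabled (ple_gap=0)"
    return "PAUSE-loop exiting enabled (ple_gap=%d)" % value


if __name__ == "__main__":
    print(describe(read_ple_gap()))
```

One common way to make the setting persistent is a line such as `options kvm-intel ple_gap=0` in a file under `/etc/modprobe.d/` (the file name is up to the administrator). The parameter takes effect when the module is loaded, so the module has to be reloaded with no virtual machines running, or the host rebooted.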

Root Cause

Red Hat Enterprise Linux 6.2 introduces a change in the KVM kernel modules that can improve the performance of certain workloads if the guest kernel uses the "pause" processor instruction inside spinlock loops. A flaw in this change can cause excessive contention on run queue locks in the KVM host kernel. The affected locks belong to the run queues of the real CPUs that are allocated to virtual CPU 0 of a guest. Excessive contention can either render the host unresponsive or cause the kernel panic described above.

Diagnostic Steps

  • Review the CPU configuration of the KVM virtual machines to determine whether virtual CPUs are pinned to real CPUs very restrictively.

  • If a vmcore is available, check whether the stack traces of many active threads contain the functions _spin_lock(), double_rq_lock(), yield_to(), and kvm_vcpu_on_spin(), similar to the following example:

    PID: 39297  TASK: ffff881ff02134c0  CPU: 2   COMMAND: "qemu-kvm"
    ...
    --- <NMI exception stack> ---
     #6 [ffff881feeee3b68] _spin_lock at ffffffff814ef341
     #7 [ffff881feeee3b70] double_rq_lock at ffffffff810519fc
     #8 [ffff881feeee3ba0] yield_to at ffffffff814ed2a1
     #9 [ffff881feeee3bf0] kvm_vcpu_on_spin at ffffffffa0328494 [kvm]
    #10 [ffff881feeee3c50] handle_pause at ffffffffa02832ce [kvm_intel]
    #11 [ffff881feeee3c70] vmx_handle_exit at ffffffffa0283ae1 [kvm_intel]
    #12 [ffff881feeee3cb0] kvm_arch_vcpu_ioctl_run at ffffffffa033d97d [kvm]
    ...
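
When a vmcore contains many active threads, checking each stack trace by hand is tedious. The following is a minimal sketch that counts how many tasks show the call sequence named above. It assumes the backtraces of the active tasks were saved as text in the same layout as the example (for instance from `bt -a` in the crash utility). The function names and the regular expression are illustrative.

```python
import re

# Functions named in the diagnostic step, innermost frame first.
SIGNATURE = ["_spin_lock", "double_rq_lock", "yield_to", "kvm_vcpu_on_spin"]

# Matches frame lines such as:
#   " #6 [ffff881feeee3b68] _spin_lock at ffffffff814ef341"
FRAME_RE = re.compile(r"^\s*#\d+\s+\[[0-9a-f]+\]\s+(\S+)\s+at\s+[0-9a-f]+")


def split_tasks(bt_text):
    """Split backtrace text into per-task chunks, each starting at a 'PID:' line."""
    chunks, current = [], None
    for line in bt_text.splitlines():
        if line.lstrip().startswith("PID:"):
            current = [line]
            chunks.append(current)
        elif current is not None:
            current.append(line)
    return chunks


def matches_signature(chunk):
    """True if the chunk's frames contain the SIGNATURE functions in order."""
    funcs = [m.group(1) for m in map(FRAME_RE.match, chunk) if m]
    remaining = iter(funcs)
    # 'name in remaining' consumes the iterator, so this is an ordered
    # subsequence check rather than a plain membership test.
    return all(name in remaining for name in SIGNATURE)


def count_matching(bt_text):
    """Return (tasks matching the signature, total tasks found)."""
    chunks = split_tasks(bt_text)
    return sum(1 for chunk in chunks if matches_signature(chunk)), len(chunks)


# The example trace from this article, used as sample input.
SAMPLE = """
PID: 39297  TASK: ffff881ff02134c0  CPU: 2   COMMAND: "qemu-kvm"
...
--- <NMI exception stack> ---
 #6 [ffff881feeee3b68] _spin_lock at ffffffff814ef341
 #7 [ffff881feeee3b70] double_rq_lock at ffffffff810519fc
 #8 [ffff881feeee3ba0] yield_to at ffffffff814ed2a1
 #9 [ffff881feeee3bf0] kvm_vcpu_on_spin at ffffffffa0328494 [kvm]
#10 [ffff881feeee3c50] handle_pause at ffffffffa02832ce [kvm_intel]
...
"""

if __name__ == "__main__":
    matching, total = count_matching(SAMPLE)
    print("%d of %d tasks match" % (matching, total))
```

A high proportion of matching qemu-kvm tasks is consistent with the run queue lock contention described under Root Cause, but it is not proof on its own.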
