RHEV hosts exhibit bad performance and high CPU usage


Environment

  • Red Hat Enterprise Virtualization (RHEV) 3.1
  • Hosts: Red Hat Enterprise Linux (RHEL) 6
  • Guests: Red Hat Enterprise Linux 6 virtual machines

Issue

  • Why are "Clocksource tsc unstable" warnings displayed in RHEV hosts?
Apr 12 13:26:02 rhevh1 kernel: Clocksource tsc unstable (delta = -8589933168 ns).  Enable clocksource failover by adding clocksource_failover kernel parameter.
Apr 12 14:45:14 rhevh1 kernel: Clocksource tsc unstable (delta = -8589932928 ns).  Enable clocksource failover by adding clocksource_failover kernel parameter.
Apr 12 14:51:14 rhevh1 kernel: Clocksource tsc unstable (delta = -8589932505 ns).  Enable clocksource failover by adding clocksource_failover kernel parameter.
Apr 12 14:58:50 rhevh1 kernel: Clocksource tsc unstable (delta = -8589932915 ns).  Enable clocksource failover by adding clocksource_failover kernel parameter.
  • Why are "Clocksource tsc unstable" warnings displayed in RHEL6 virtual machines running on RHEV?
Apr 12 16:02:47 vm001 kern.warning<4>: kernel:Clocksource tsc unstable (delta = 6051148438 ns).  Enable clocksource failover by adding clocksource_failover kernel parameter.
Apr 12 16:02:48 vm001 kern.warning<4>: kernel:Clocksource tsc unstable (delta = 2538778893 ns).  Enable clocksource failover by adding clocksource_failover kernel parameter.
Apr 12 16:02:49 vm001 kern.warning<4>: kernel:Clocksource tsc unstable (delta = 6051141984 ns).  Enable clocksource failover by adding clocksource_failover kernel parameter.
Apr 12 16:02:50 vm001 kern.warning<4>: kernel:Clocksource tsc unstable (delta = 2538779349 ns).  Enable clocksource failover by adding clocksource_failover kernel parameter.
Apr 12 16:02:51 vm001 kern.warning<4>: kernel:Clocksource tsc unstable (delta = 6051142597 ns).  Enable clocksource failover by adding clocksource_failover kernel parameter.
Apr 12 16:02:52 vm001 kern.warning<4>: kernel:Clocksource tsc unstable (delta = 2538778484 ns).  Enable clocksource failover by adding clocksource_failover kernel parameter.
  • Why are VMs sluggish?
  • Why is the CPU usage in the host so high?
  • Why do I see storage latency warnings?

Resolution

  • For RHEL 6 hosts, update to tuned-0.2.19-13.el6 from erratum RHBA-2013:1623, which fixes bug 969491.

  • The overall frequency of load balancing is controlled by the sysctl tunable kernel.sched_migration_cost. Increasing it makes the host load balance less often, which reduces the spinlock contention.

    1. Increase the kernel.sched_migration_cost tunable from its default value of 500,000 to 5,000,000 (the value is in nanoseconds):

      sysctl -w kernel.sched_migration_cost=5000000
      
    2. To make this change persistent, add it to a file in /etc/sysctl.d:

      echo "kernel.sched_migration_cost = 5000000" >> /etc/sysctl.d/sched
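
To confirm that the new value is in effect, the running value can be compared against the target. The following is a minimal sketch and not part of the original resolution; the helper name migration_cost_status is hypothetical, and the tunable may be named differently or be absent on other kernels.

```shell
# Hypothetical helper: compare a sched_migration_cost value against a target.
# $1 = current value, $2 = target (defaults to the 5,000,000 suggested above).
migration_cost_status() {
    current=$1
    target=${2:-5000000}
    if [ "$current" -ge "$target" ]; then
        echo "ok: $current >= $target"
    else
        echo "low: $current < $target"
    fi
}

# Read the live value only if the tunable exists on this kernel.
if current=$(sysctl -n kernel.sched_migration_cost 2>/dev/null); then
    migration_cost_status "$current"
else
    echo "kernel.sched_migration_cost is not available on this kernel"
fi
```

A "low" result after a reboot suggests that the persistent setting was not applied.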
      

Note: RHEV 3.2 installs tuned automatically on all RHEV hosts and configures virtual-host as the active tuned profile.
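
One way to check which tuned profile is active is to parse the output of tuned-adm active. This is a sketch, assuming the output contains a line of the form "Current active profile: <name>"; the helper name active_tuned_profile is hypothetical.

```shell
# Hypothetical helper: extract the profile name from "tuned-adm active"
# output supplied on stdin.
active_tuned_profile() {
    sed -n 's/^Current active profile: *//p'
}

if command -v tuned-adm >/dev/null 2>&1; then
    profile=$(tuned-adm active 2>/dev/null | active_tuned_profile)
    echo "active tuned profile: ${profile:-none}"
else
    echo "tuned-adm not found; the tuned package may not be installed"
fi
```

On a RHEV host configured as described in the note, the reported profile would be expected to be virtual-host.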

  • Increasing sched_migration_cost changes the behavior of the kernel scheduler: it waits longer before migrating processes from overloaded CPUs to idle CPUs, which results in fewer migrations and fewer of the context switches they cause. This is especially important if the process running on the overloaded CPU is about to yield anyway.

  • On the other hand, increasing sched_migration_cost too far can make the scheduler wait too long before migrating processes from overloaded CPUs to idle CPUs. In that case only a fraction of the CPUs run user processes, and those CPUs have load averages greater than 1 while the other CPUs sit idle.

  • Additionally, private Red Hat Bugzilla 825222 tracks a request to increase the default value of the sched_migration_cost tunable. Discussion of this request is still ongoing; for additional information, please contact Red Hat Support.
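
The imbalance described above (some CPUs overloaded while others sit idle) can be looked for in per-CPU statistics. The sketch below computes each CPU's idle percentage from input in /proc/stat format; the helper name cpu_idle_pct is hypothetical. Because the counters are cumulative since boot, it gives only a rough picture, and a tool such as mpstat -P ALL reports comparable figures over an interval.

```shell
# Hypothetical helper: print "cpuN idle%" for each per-CPU line of
# /proc/stat-formatted input on stdin.
cpu_idle_pct() {
    awk '/^cpu[0-9]+ / {
        total = 0
        # Sum user..steal (fields 2-9); guest time is already counted in user.
        for (i = 2; i <= 9 && i <= NF; i++) total += $i
        # Field 5 is the idle counter.
        if (total > 0) printf "%s %.1f\n", $1, 100 * $5 / total
    }'
}

if [ -r /proc/stat ]; then
    cpu_idle_pct < /proc/stat
fi
```

A few CPUs with a very low idle percentage alongside many that are almost entirely idle would be consistent with sched_migration_cost being set too high.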

Root Cause

  • The following backtrace for the qemu-kvm process can be found in the vmcore:
 #6 [ffff8895221cdb68] _spin_lock at ffffffff814ef33e
 #7 [ffff8895221cdb70] double_lock_balance at ffffffff81053b9a
 #8 [ffff8895221cdb90] thread_return at ffffffff814ecdff
 #9 [ffff8895221cdc50] kvm_vcpu_block at ffffffffa0337dd5 [kvm]
#10 [ffff8895221cdcb0] kvm_arch_vcpu_ioctl_run at ffffffffa034ba0c [kvm]
#11 [ffff8895221cddb0] kvm_vcpu_ioctl at ffffffffa0335322 [kvm]
#12 [ffff8895221cde60] vfs_ioctl at ffffffff81189342
#13 [ffff8895221cdea0] do_vfs_ioctl at ffffffff8118980a
#14 [ffff8895221cdf30] sys_ioctl at ffffffff81189a61
#15 [ffff8895221cdf80] system_call_fastpath at ffffffff8100b0f2
    RIP: 00007fbe2b1fda47  RSP: 00007fbe24ac9a88  RFLAGS: 00000246
    RAX: 0000000000000010  RBX: ffffffff8100b0f2  RCX: ffffffffffffffff
    RDX: 0000000000000000  RSI: 000000000000ae80  RDI: 0000000000000012
    RBP: 00007fbe2d75f000   R8: 0000000000000000   R9: 0000000000000001
    R10: 0000000000000002  R11: 0000000000000246  R12: 00007fbe2e317420
    R13: 00007fbe2e316fe0  R14: 0000000000000000  R15: 00007fbe2e3835c0
    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b
  • When the PLE (Pause Loop Exit) code in KVM detects that a guest is encountering spinlock contention, it relinquishes control to the host, which suspends that guest and then tries to run the guest that owns the contended spinlock. As the number of guests and the number of processes running in each guest increase, this causes a scheduling storm on the host. Whenever the scheduler is called and a CPU is about to go idle, it calls the load balancer, which finds the busiest CPU and tries to pull work from that CPU onto the idle CPU. This is where double_lock_balance() comes into play: the idle CPU holds its own runq spinlock but not the runq spinlock of the busiest CPU, so it must drop its lock and acquire both in the correct order to prevent a deadlock, and this is what causes the contention. The problem gets exponentially worse as the number of NUMA nodes increases.
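
Whether PLE is active on an Intel host can usually be seen from the kvm_intel module parameters. This is a sketch, assuming the host uses kvm_intel and exposes ple_gap and ple_window under /sys/module/kvm_intel/parameters (a ple_gap of 0 is understood to mean that PLE is disabled); the helper name show_ple_params is hypothetical.

```shell
# Hypothetical helper: print the PLE-related kvm_intel parameters.
# $1 = parameter directory (defaults to the assumed kvm_intel sysfs path).
show_ple_params() {
    dir=${1:-/sys/module/kvm_intel/parameters}
    for p in ple_gap ple_window; do
        if [ -r "$dir/$p" ]; then
            echo "$p=$(cat "$dir/$p")"
        else
            echo "$p=unavailable"
        fi
    done
}

show_ple_params
```

If both parameters are reported as unavailable, the host may not be using kvm_intel or the module may not be loaded.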

Diagnostic Steps

  • The load average on the host is high.
  • perf top shows that much of the time is spent in _spin_lock.

Note that you may need to install perf to run perf top.
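
To put the load average in context, it can be divided by the number of online CPUs. The following is a minimal sketch; the helper name load_per_cpu is hypothetical, and what counts as "high" is left to the reader's judgment.

```shell
# Hypothetical helper: divide a load average by a CPU count.
# $1 = 1-minute load average, $2 = number of online CPUs.
load_per_cpu() {
    awk -v l="$1" -v n="$2" 'BEGIN { printf "%.2f\n", l / n }'
}

if [ -r /proc/loadavg ]; then
    # The first field of /proc/loadavg is the 1-minute load average.
    read -r load1 rest < /proc/loadavg
    load_per_cpu "$load1" "$(getconf _NPROCESSORS_ONLN)"
fi

# perf top is interactive, so it is left as a manual step:
#   perf top
# In the situation described here, _spin_lock would appear near the top.
```

A result well above 1.00 means that more tasks are runnable than there are CPUs to run them.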


1 Comment

In my case, the incorrect kernel.sched_migration_cost value was the result of the tuned service failing on reboot. Restarting tuned fixed the parameter and immediately resolved the VM performance issues:

systemctl restart tuned

This is on RHEV 3.6 with RHEL7.2 hosts.