System encountered a hung status with TSC ADJUST compensate messages after live migration and hot-added CPUs

Solution Verified - Updated -

Environment

  • Huawei OpenStack Foundation OpenStack Nova
  • RHEL 7
  • Live migration and hot add CPUs

Issue

  • After live migration and hot adding CPUs , system encountered a hung status with TSC ADJUST compensate messages:
[5180237.377019] CPU8 has been hot-added
[5180237.391523] smpboot: Booting Node 0 Processor 8 APIC 0x8
[5180103.249992] kvm-clock: cpu 8, msr 4:3ff88201, secondary cpu clock
[5180103.272764] TSC ADJUST compensate: CPU8 observed 13266090348023396 warp. Adjust: 13266090348023396
[11224312.559644] INFO: rcu_sched detected stalls on CPUs/tasks: { 8} (detected by 1, t=1002915 jiffies, g=44501597, c=44501596, q=1093)
[11224312.560381] Task dump for CPU 8:
[11224312.560382] swapper/8       R  running task        0     0      1 0x00000000
[11224312.560386] Call Trace:
[11224312.583003] INFO: rcu_sched detected stalls on CPUs/tasks: { 8} (detected by 1, t=6043918845 jiffies, g=44501597, c=44501596, q=1520)
...
[11224312.583003] KVM setup async PF for cpu 8
[11224312.583003] kvm-stealtime: cpu 8, msr 42b81c040
[11224312.583003] Will online and init hotplugged CPU: 8
[11224312.583003] CPU9 has been hot-added
[11224312.583003] CPU10 has been hot-added
[11224312.583003] CPU11 has been hot-added

Resolution

  • Contact the vendor of the hypervisor to investigate this issue, take their opinions.
  • workaround
    Disable "TSC ADJUST" mechanism, the method is here.The clearcpuid is 932+ 1=289.
arch/x86/include/asm/cpufeatures.h
234:#define X86_FEATURE_TSC_ADJUST      (9*32+ 1) /* TSC adjustment MSR 0x3b 
*/

Alternatively, this issue only occurred when hot added CPUs after live migration, so avoid it could be a workaround.

Root Cause

  • The root cause is that the hypervisor doesn't do the right thing with maintaining coherent TSC values.
  • The "TSC ADJUST" mechanism has been added since kernel-3.10.0-693.21.1.el7(RHEL 7.4 latest Zstream).This mechanism recognizes that the TSC time of the newly inserted CPU is inconsistent with the time of the original CPU (In the live migration scenario,it is recognized that the TSC of the new CPU of the virtual machine is inconsistent with the original CPU),a time synchronization will be performed.During the synchronization,the virtual machine is in a stuck state.After the synchronization is completed,the virtual machine returns to normal.
  • The diagram of live migration and hot add CPUs is uploaded as an attachment,named tsc.png.The tsc of cpuN is tsc1 on VM1. VM1 was live migrated from server A to server B,then hot added cpuM to VM1.However,the tsc of cpuM is tsc2,so tsc is different between cpuN and cpuM.
  • when hot-added cpuM,if it is recognized that the tsc clock frequency of cpuM is inconsistent with the tsc clock frequency of cpuN,the tsc time of cpuM will be initialized (initialized to 0) . If not add "TSC ADJUST" mechanism,tsc of the cpuM starts counting from 0 (cpuN starts counting from the original value,and does not affect each other) .
1473         if (usdiff < USEC_PER_SEC &&
1474             vcpu->arch.virtual_tsc_khz == kvm->arch.last_tsc_khz) {
1475                 if (!check_tsc_unstable()) {
1476                         offset = kvm->arch.cur_tsc_offset;
1477                         pr_debug("kvm: matched tsc offset for %llu\n", data);
1478                 } else {
1479                         u64 delta = nsec_to_cycles(vcpu, elapsed);
1480                         data += delta;
1481                         offset = kvm_compute_tsc_offset(vcpu, data);
1482                         pr_debug("kvm: adjusted tsc offset by %llu\n", delta);
1483                 }
1484                 matched = true;
1485                 already_matched = (vcpu->arch.this_tsc_generation == kvm->arch.cur_tsc_generation);
1486         } else {
1487                 /*
1488                  * We split periods of matched TSC writes into generations.
1489                  * For each generation, we track the original measured
1490                  * nanosecond time, offset, and write, so if TSCs are in
1491                  * sync, we can match exact offset, and if not, we can match
1492                  * exact software computation in compute_guest_tsc()
1493                  *
1494                  * These values are tracked in kvm->arch.cur_xxx variables.
1495                  */
1496                 kvm->arch.cur_tsc_generation++;
1497                 kvm->arch.cur_tsc_nsec = ns;
1498                 kvm->arch.cur_tsc_write = data;
1499                 kvm->arch.cur_tsc_offset = offset;
1500                 matched = false;
1501                 pr_debug("kvm: new tsc generation %llu, clock %llu\n",
1502                          kvm->arch.cur_tsc_generation, data);
1503         }
  • Since RHEL 7.4 latest zstream,"TSC ADJUST" mechanism added. After hot added cpuM,VM1 perceived the difference in clock frequency between cpuN and cpuM,and adjusted this difference by executing the wrmsrl() function:
458         /*
459          * Check the number of remaining test runs. If not zero, the test
460          * failed and a retry with adjusted TSC is possible. If zero the
461          * test was either successful or failed terminally.
462          */
463         if (!atomic_read(&test_runs))
464                 return;
465 
466         /*
467          * If the warp value of this CPU is 0, then the other CPU
468          * observed time going backwards so this TSC was ahead and
469          * needs to move backwards.
470          */
471         if (!cur_max_warp)
472                 cur_max_warp = -gbl_max_warp;
473 
474         /*
475          * Add the result to the previous adjustment value.
476          *
477          * The adjustement value is slightly off by the overhead of the
478          * sync mechanism (observed values are ~200 TSC cycles), but this
479          * really depends on CPU, node distance and frequency. So
480          * compensating for this is hard to get right. Experiments show
481          * that the warp is not longer detectable when the observed warp
482          * value is used. In the worst case the adjustment needs to go
483          * through a 3rd run for fine tuning.
484          */
485         cur->adjusted += cur_max_warp;
486 
487         pr_warn("TSC ADJUST compensate: CPU%u observed %lld warp. Adjust: %lld\n",
488                 cpu, cur_max_warp, cur->adjusted);
489 
490         wrmsrl(MSR_IA32_TSC_ADJUST, cur->adjusted);
491         goto retry;
492 
493 }
  • The tsc of cpuM will be adjusted,and it will be adjusted to be consistent with cpuN. During this process,the virtual machine is in a stuck state.

Diagnostic Steps

  • Checked kernel ring buffer log:
[5180237.377019] CPU8 has been hot-added
[5180237.391523] smpboot: Booting Node 0 Processor 8 APIC 0x8
[5180103.249992] kvm-clock: cpu 8, msr 4:3ff88201, secondary cpu clock
[5180103.272764] TSC ADJUST compensate: CPU8 observed 13266090348023396 warp. Adjust: 13266090348023396
[11224312.559644] INFO: rcu_sched detected stalls on CPUs/tasks: { 8} (detected by 1, t=1002915 jiffies, g=44501597, c=44501596, q=1093)
[11224312.560381] Task dump for CPU 8:
[11224312.560382] swapper/8       R  running task        0     0      1 0x00000000
[11224312.560386] Call Trace:
[11224312.583003] INFO: rcu_sched detected stalls on CPUs/tasks: { 8} (detected by 1, t=6043918845 jiffies, g=44501597, c=44501596, q=1520)
...
[11224312.583003] KVM setup async PF for cpu 8
[11224312.583003] kvm-stealtime: cpu 8, msr 42b81c040
[11224312.583003] Will online and init hotplugged CPU: 8
[11224312.583003] CPU9 has been hot-added
[11224312.583003] CPU10 has been hot-added
[11224312.583003] CPU11 has been hot-added

Attachments

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.