System encountered a hung status with TSC ADJUST compensate messages after live migration and hot-added CPUs
Environment
- Huawei OpenStack Foundation OpenStack Nova
- RHEL 7
- Live migration and hot add CPUs
Issue
- After live migration and hot adding CPUs , system encountered a hung status with TSC ADJUST compensate messages:
[5180237.377019] CPU8 has been hot-added
[5180237.391523] smpboot: Booting Node 0 Processor 8 APIC 0x8
[5180103.249992] kvm-clock: cpu 8, msr 4:3ff88201, secondary cpu clock
[5180103.272764] TSC ADJUST compensate: CPU8 observed 13266090348023396 warp. Adjust: 13266090348023396
[11224312.559644] INFO: rcu_sched detected stalls on CPUs/tasks: { 8} (detected by 1, t=1002915 jiffies, g=44501597, c=44501596, q=1093)
[11224312.560381] Task dump for CPU 8:
[11224312.560382] swapper/8 R running task 0 0 1 0x00000000
[11224312.560386] Call Trace:
[11224312.583003] INFO: rcu_sched detected stalls on CPUs/tasks: { 8} (detected by 1, t=6043918845 jiffies, g=44501597, c=44501596, q=1520)
...
[11224312.583003] KVM setup async PF for cpu 8
[11224312.583003] kvm-stealtime: cpu 8, msr 42b81c040
[11224312.583003] Will online and init hotplugged CPU: 8
[11224312.583003] CPU9 has been hot-added
[11224312.583003] CPU10 has been hot-added
[11224312.583003] CPU11 has been hot-added
Resolution
- Contact the vendor of the hypervisor to investigate this issue, take their opinions.
- workaround
Disable "TSC ADJUST" mechanism, the method is here.The clearcpuid is 932+ 1=289.
arch/x86/include/asm/cpufeatures.h
234:#define X86_FEATURE_TSC_ADJUST (9*32+ 1) /* TSC adjustment MSR 0x3b
*/
Alternatively, this issue only occurred when hot added CPUs after live migration, so avoid it could be a workaround.
Root Cause
- The root cause is that the hypervisor doesn't do the right thing with maintaining coherent TSC values.
- The "TSC ADJUST" mechanism has been added since kernel-3.10.0-693.21.1.el7(RHEL 7.4 latest Zstream).This mechanism recognizes that the TSC time of the newly inserted CPU is inconsistent with the time of the original CPU (In the live migration scenario,it is recognized that the TSC of the new CPU of the virtual machine is inconsistent with the original CPU),a time synchronization will be performed.During the synchronization,the virtual machine is in a stuck state.After the synchronization is completed,the virtual machine returns to normal.
- The diagram of live migration and hot add CPUs is uploaded as an attachment,named tsc.png.The tsc of cpuN is tsc1 on VM1. VM1 was live migrated from server A to server B,then hot added cpuM to VM1.However,the tsc of cpuM is tsc2,so tsc is different between cpuN and cpuM.
- when hot-added cpuM,if it is recognized that the tsc clock frequency of cpuM is inconsistent with the tsc clock frequency of cpuN,the tsc time of cpuM will be initialized (initialized to 0) . If not add "TSC ADJUST" mechanism,tsc of the cpuM starts counting from 0 (cpuN starts counting from the original value,and does not affect each other) .
1473 if (usdiff < USEC_PER_SEC &&
1474 vcpu->arch.virtual_tsc_khz == kvm->arch.last_tsc_khz) {
1475 if (!check_tsc_unstable()) {
1476 offset = kvm->arch.cur_tsc_offset;
1477 pr_debug("kvm: matched tsc offset for %llu\n", data);
1478 } else {
1479 u64 delta = nsec_to_cycles(vcpu, elapsed);
1480 data += delta;
1481 offset = kvm_compute_tsc_offset(vcpu, data);
1482 pr_debug("kvm: adjusted tsc offset by %llu\n", delta);
1483 }
1484 matched = true;
1485 already_matched = (vcpu->arch.this_tsc_generation == kvm->arch.cur_tsc_generation);
1486 } else {
1487 /*
1488 * We split periods of matched TSC writes into generations.
1489 * For each generation, we track the original measured
1490 * nanosecond time, offset, and write, so if TSCs are in
1491 * sync, we can match exact offset, and if not, we can match
1492 * exact software computation in compute_guest_tsc()
1493 *
1494 * These values are tracked in kvm->arch.cur_xxx variables.
1495 */
1496 kvm->arch.cur_tsc_generation++;
1497 kvm->arch.cur_tsc_nsec = ns;
1498 kvm->arch.cur_tsc_write = data;
1499 kvm->arch.cur_tsc_offset = offset;
1500 matched = false;
1501 pr_debug("kvm: new tsc generation %llu, clock %llu\n",
1502 kvm->arch.cur_tsc_generation, data);
1503 }
- Since RHEL 7.4 latest zstream,"TSC ADJUST" mechanism added. After hot added cpuM,VM1 perceived the difference in clock frequency between cpuN and cpuM,and adjusted this difference by executing the wrmsrl() function:
458 /*
459 * Check the number of remaining test runs. If not zero, the test
460 * failed and a retry with adjusted TSC is possible. If zero the
461 * test was either successful or failed terminally.
462 */
463 if (!atomic_read(&test_runs))
464 return;
465
466 /*
467 * If the warp value of this CPU is 0, then the other CPU
468 * observed time going backwards so this TSC was ahead and
469 * needs to move backwards.
470 */
471 if (!cur_max_warp)
472 cur_max_warp = -gbl_max_warp;
473
474 /*
475 * Add the result to the previous adjustment value.
476 *
477 * The adjustement value is slightly off by the overhead of the
478 * sync mechanism (observed values are ~200 TSC cycles), but this
479 * really depends on CPU, node distance and frequency. So
480 * compensating for this is hard to get right. Experiments show
481 * that the warp is not longer detectable when the observed warp
482 * value is used. In the worst case the adjustment needs to go
483 * through a 3rd run for fine tuning.
484 */
485 cur->adjusted += cur_max_warp;
486
487 pr_warn("TSC ADJUST compensate: CPU%u observed %lld warp. Adjust: %lld\n",
488 cpu, cur_max_warp, cur->adjusted);
489
490 wrmsrl(MSR_IA32_TSC_ADJUST, cur->adjusted);
491 goto retry;
492
493 }
- The tsc of cpuM will be adjusted,and it will be adjusted to be consistent with cpuN. During this process,the virtual machine is in a stuck state.
Diagnostic Steps
- Checked kernel ring buffer log:
[5180237.377019] CPU8 has been hot-added
[5180237.391523] smpboot: Booting Node 0 Processor 8 APIC 0x8
[5180103.249992] kvm-clock: cpu 8, msr 4:3ff88201, secondary cpu clock
[5180103.272764] TSC ADJUST compensate: CPU8 observed 13266090348023396 warp. Adjust: 13266090348023396
[11224312.559644] INFO: rcu_sched detected stalls on CPUs/tasks: { 8} (detected by 1, t=1002915 jiffies, g=44501597, c=44501596, q=1093)
[11224312.560381] Task dump for CPU 8:
[11224312.560382] swapper/8 R running task 0 0 1 0x00000000
[11224312.560386] Call Trace:
[11224312.583003] INFO: rcu_sched detected stalls on CPUs/tasks: { 8} (detected by 1, t=6043918845 jiffies, g=44501597, c=44501596, q=1520)
...
[11224312.583003] KVM setup async PF for cpu 8
[11224312.583003] kvm-stealtime: cpu 8, msr 42b81c040
[11224312.583003] Will online and init hotplugged CPU: 8
[11224312.583003] CPU9 has been hot-added
[11224312.583003] CPU10 has been hot-added
[11224312.583003] CPU11 has been hot-added
Attachments
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
Comments