Why do multiple VM migrations sometimes terminate prematurely? Are there any tunables to avoid this?

Environment

  • Red Hat Enterprise Virtualization 3.2
    • It may also happen with Red Hat Enterprise Virtualization 3.0 or later.
      Please also see the "Root Cause" section.

Issue

  • Why are multiple VM migrations sometimes terminated prematurely?
  • Is there any tunable to avoid this?

We're using 16 RHEV hypervisor nodes. One day, something happened on one of the hypervisors and live migration started automatically.
However, the migration process stopped before all of the guests had been migrated.

Q1: What is the root cause of this?
Q2: Is there any tunable parameter to avoid this?

Resolution

  • What is the root cause of this?
    See the "Root Cause" section.

  • Is there any tunable parameter to avoid this?
    The following two parameters are available to control migration traffic. They can be configured in /etc/vdsm/vdsm.conf; a configuration sketch follows the parameter descriptions below. Please also refer to the "Root Cause" section.

1) migration_max_bandwidth

Default value: 32 (32 MiB/s)

migration_max_bandwidth = <X>
  Maximum bandwidth for migration, in MiB/s; 0 means libvirt's default (*1)

2) max_outgoing_migrations

The default value depends on the version of vdsm:
vdsm-4.10.2-1.13 or later: 3
prior to vdsm-4.10.2-1.13: 5

max_outgoing_migrations = <Z>
  Maximum concurrent outgoing migrations
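
For reference, here is a minimal sketch of how these settings might look in /etc/vdsm/vdsm.conf. The [vars] section name and the restart command are assumptions based on a typical vdsm deployment; verify them against your installed version before changing anything.

# /etc/vdsm/vdsm.conf
# Both options are assumed to live in the [vars] section.
[vars]
# Cap each outgoing migration at 32 MiB/s (the default shown above).
migration_max_bandwidth = 32
# Allow at most 3 concurrent outgoing migrations (default for vdsm-4.10.2-1.13 or later).
max_outgoing_migrations = 3

# Restart vdsm so the new values take effect (RHEL 6 style service command, assumed):
# service vdsmd restart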

Note:
Under normal conditions, Red Hat does not recommend modifying this file to tune migration performance.
If you have special circumstances (such as a small number of powerful hypervisors in a large-scale deployment), or if you need further guidance on performance tuning after reading this, please contact Red Hat (*).

(*) Red Hat support services can only provide the method and procedure for changing these values. If you need performance tuning and system design guidance for your environment, our add-on professional services may assist you; ask your Red Hat sales representative for more information.

Root Cause

Migration traffic saturated the capacity of the management link, so the host became unreachable from RHEV-Manager. As a result, the migration process was terminated.

Again, Red Hat does not recommend tuning the related parameters without a specific reason. However, if you have already reconfigured these parameters and are experiencing problems, setting lower values may help resolve the issue.
In any case, tuning these parameters without careful consideration could have negative consequences for your RHEV implementation.

Please see the following example for information about basic tuning.

The general rule of thumb for how to configure "migration_max_bandwidth" is as follows:

(Available network speed in Mb/s) > "migration_max_bandwidth" * 8 * "max_outgoing_migrations"

Note: You should be aware of the units for "migration_max_bandwidth".

The units in vdsm.conf are MiB/s (mebibytes per second) per VM, but network bandwidth is normally quoted in Mb/s (megabits per second), so you need to convert between the two when comparing against link speed.

For example, consider a gigabit Ethernet network with vdsm-4.10.2-1.13 or later, using the default values:

migration_max_bandwidth=32
max_outgoing_migrations=3

In this case, a 1 Gb link can handle 3 concurrent migrations, because:

1000 > 768 (768 is 32 * 8 * 3)

In this case, as long as the management interface can achieve 768 Mb/s the outgoing VMs should be transferred successfully.

Looking at another example, if we change "migration_max_bandwidth" to 100 we have:

1000 > 2400 (2400 is 100 * 8 * 3)

That won't be successful and is likely to lead to RHEV marking the hypervisor as failed (requiring a reboot of the affected hypervisor), because the management network will be saturated and management communication between the hypervisor and the rest of RHEV will be almost impossible to maintain. To keep that setting you would need a faster network, for example a 10 Gb Ethernet interface.
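
As a quick sanity check of the rule of thumb, the following shell sketch (illustrative only; the values are the example defaults above, not recommendations) compares the aggregate migration bandwidth with the available link speed:

#!/bin/bash
# Rule of thumb: link speed (Mb/s) must exceed migration_max_bandwidth * 8 * max_outgoing_migrations
link_mbps=1000                  # available network speed in Mb/s (1 GbE)
migration_max_bandwidth=32      # MiB/s per VM, as set in vdsm.conf
max_outgoing_migrations=3       # concurrent outgoing migrations, as set in vdsm.conf

required=$((migration_max_bandwidth * 8 * max_outgoing_migrations))
if [ "$link_mbps" -gt "$required" ]; then
    echo "OK: ~${required} Mb/s of migration traffic fits within a ${link_mbps} Mb/s link"
else
    echo "WARNING: ~${required} Mb/s would saturate a ${link_mbps} Mb/s link"
fi

Running it with the values above prints the "OK" line (1000 > 768); changing migration_max_bandwidth to 100 prints the warning, matching the second example (2400 Mb/s required against a 1000 Mb/s link).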

Note: The above is only an example to illustrate the basic principle.

  • It is also important to understand the throughput actually achievable through a fast network interface. Depending on the network topology you may not be able to reach the required throughput, and the network interface and networking stack may require tuning to achieve the desired result.

  • You should also bear in mind that the usable network capacity depends on other factors, such as other traffic on the link and the performance of the network equipment.
    Before implementing changes to "migration_max_bandwidth" (remember to convert from Mb/s back to MiB/s if you change it) or "max_outgoing_migrations" in vdsm.conf, you should test multiple live migrations to ensure they succeed with the bandwidth available to you, for example by watching the management link as sketched below. You should also retest after changing either value to confirm that the transfers still complete successfully.
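
One simple way to watch the management link while running such a test is sketched below. The interface name rhevm is an assumption (it is commonly the management bridge on RHEV hosts); substitute your actual management bridge or NIC, and note that sar is provided by the sysstat package.

# Report per-interface throughput once per second while the test migrations run;
# watch the rxkB/s and txkB/s columns for the management bridge (assumed here to be 'rhevm').
sar -n DEV 1 | grep --line-buffered rhevm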

