Virtualization: RHEL/KVM Fault Tolerance

Why do we need fault tolerance in KVM?

Just as with traditional servers, RHEL/KVM virtualization needs fault tolerance so that virtual machines are not dropped. Virtualization in a data-center environment demands it. Ideally, an application should be fault tolerant and protect its own data in the event of a failure (rather than, for example, relying on a redundant server replicating a terabyte of data). In reality, however, there are many legacy applications we must protect that are not fault tolerant. Our recommendation for new application development is to author fault-tolerant applications.


What are we considering?

We are evaluating Kemari Fault Tolerance in RHEL/KVM to create fault-tolerant virtualized environments.


How does it work?
The goal of Kemari is to provide a fault-tolerant platform for virtualization environments: in the event of a hardware failure, a virtual machine fails over from the compromised physical machine to properly operating hardware in a way that is completely transparent to the guest operating system. In contrast to hardware-based fault-tolerant servers and HA servers, by abstracting the hardware through virtualization, Kemari can run on off-the-shelf hardware with no application modifications.


Kemari runs paired virtual machines in an active-passive configuration and achieves whole-system replication by continuously copying the state of the system (dirty pages and the state of the virtual devices) from the active node to the passive node. One interesting result of this is that during normal operation, only the active node is actually executing code.
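The replication scheme described above can be sketched in a few lines of Python. This is an illustration of the general active-passive checkpointing idea only, not Kemari's actual implementation or API; all class and method names here are invented for the example.

```python
# Sketch of active-passive whole-system replication in the spirit of
# Kemari: the active node tracks dirty pages and periodically
# checkpoints them (plus virtual device state) to a passive node,
# which only takes over execution after a failover.
# Names are illustrative, not Kemari's real interfaces.

class Node:
    def __init__(self):
        self.memory = {}        # page_id -> page contents
        self.device_state = {}  # virtual device state (registers, queues, ...)
        self.executing = False  # only the active node runs guest code

class ReplicatedVM:
    def __init__(self):
        self.active = Node()
        self.passive = Node()
        self.active.executing = True
        self.dirty_pages = set()

    def write_page(self, page_id, data):
        # Guest writes happen only on the active node and dirty the page.
        self.active.memory[page_id] = data
        self.dirty_pages.add(page_id)

    def checkpoint(self):
        # Copy only the dirty pages, plus device state, to the passive node.
        for page_id in self.dirty_pages:
            self.passive.memory[page_id] = self.active.memory[page_id]
        self.passive.device_state = dict(self.active.device_state)
        self.dirty_pages.clear()

    def failover(self):
        # On hardware failure, the passive node resumes from the last
        # checkpoint; the guest OS is unaware of the switch.
        self.active.executing = False
        self.passive.executing = True
        self.active, self.passive = self.passive, self.active
```

The key property the sketch shows is that the passive node is a warm copy kept current by checkpoints; it executes nothing until `failover()`, which matches the observation above that only the active node runs code during normal operation.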


Feedback requested:

Do you consider fault tolerance at the virtualization layer to be necessary, or do you believe applications themselves need to be fault tolerant?


Responses

This is another one of those features that would be extremely useful to have, but not an absolute requirement. We achieve fault tolerance in our environment through horizontal scalability (application and database replication, load balancing, etc.).

I've used VMware FT and think it is useful when your application does not natively support fault tolerance or horizontal scalability. VMware's implementation was fraught with limitations (IIRC, ESX/ESXi 4 FT VMs can only have a single vCPU), and hopefully an alternative such as Kemari will have fewer of them.

The above posts are on the money.

I think Kemari is a good addition to RH virtualization features in order to close the feature gaps with VMware. Its availability with RHEL is an even bigger plus, helping commoditize this feature.

Looking forward to seeing this technology as a Tech Preview in RHEL 6.x, and hopefully fully supported in RHEL 7.

I think RHEV DR is more important. In case a data center goes offline, RHEV should be able to move VMs to, and activate them in, a second physical data center.

Request for clarification: I'm used to fault-tolerant computing in the context of systems like Stratus or Tandem, where the system continues correct operation in the presence of hardware failures. In that sense, fault tolerance means transaction processing continues with no transactions lost or aborted, and with no gaps in system availability.


This capability looks more like failover: the workload is moved to a new system or instance, but in-flight transactions may be lost or aborted, and there is a loss of system availability while the workload moves to the new server or instance.

If we're to compete with VMware, I believe we need this too.