Data Availability and Integrity considerations for replica 2 pools in Red Hat OpenShift Data Foundation


Purpose of this document

In the following document we discuss the differences between writing data twice and writing it three or more times in a modern storage system. We focus on the OpenShift Data Foundation (ODF) product, which is based on the Ceph technology, but most of the content below applies to any modern storage system.

Executive summary

Red Hat’s main objective with Data Foundation is to never lose data. Comparing the reliability of today’s disk models, SSDs are much more reliable than HDDs; we therefore generally restrict support to SSD-based storage deployments.
Furthermore, we see customers asking for more capacity-optimized deployments with reduced data replication, and we are aware that other vendors support such deployments. Nevertheless, based on our data, we know that these deployments carry a high risk to both data availability (even if only intermittently) and data integrity.
Thus we don’t recommend running production workloads with reduced replication modes unless the risks are well understood and mitigated.

Importance of Data Availability and Integrity

Only when we understand what is at stake can we properly assess the risks and the mitigation strategies. We therefore want to briefly illustrate why data availability and integrity are so important to us at Red Hat.

Data Availability

We define data availability as the state in which we can freely read and write data. Modern storage systems that are configured correctly can tolerate failures of individual components without impacting data availability. It is usually up to the human operator to define how many failures can be sustained before the data becomes unavailable. The data is accessible only if the stored data is reachable and enough of the storage system’s services are online.
In a Ceph cluster we control data availability during failures through the replication level of any given storage pool. Different pools can have different replication levels, depending on the importance of the data stored within them.
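As an illustration, the snippet below is a minimal sketch of how the replication level of an individual pool could be inspected and changed with the standard Ceph CLI (wrapped in Python here for consistency with the other examples). The pool name and the chosen values are placeholders; in an ODF cluster, pools are normally managed through the operator rather than by calling the CLI directly.

```python
import subprocess

def ceph(*args: str) -> str:
    """Run a Ceph CLI command and return its output."""
    return subprocess.run(["ceph", *args], check=True,
                          capture_output=True, text=True).stdout.strip()

pool = "my-block-pool"  # placeholder pool name

# Inspect the current replication level (number of copies kept).
print(ceph("osd", "pool", "get", pool, "size"))

# Keep three copies, and require at least two of them to be online
# before the pool accepts I/O.
ceph("osd", "pool", "set", pool, "size", "3")
ceph("osd", "pool", "set", pool, "min_size", "2")
```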
Even if enough disks are online to keep the data available, we also need to ensure that the services that deliver it as object, block, or file are running.
In a Ceph cluster each of these components can be horizontally scaled independently of each other with no single point of failure. In a default ODF installation each of the critical components is deployed at least twice to ensure the data remains accessible during upgrades or in the presence of failures.
Ceph provides built-in self-healing mechanisms to ensure the data is always available. Data is distributed across all the disks in the cluster to spread the recovery traffic, increase recovery throughput, and reduce the time the data is at risk. With replica 2, the data can only be recovered from a single source, increasing the recovery load on the remaining disks in the cluster. And because the risk of data loss is higher (only a single copy of the data remains), the recovery has to be given a higher priority.
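A rough back-of-envelope sketch of this effect, using entirely assumed numbers for disk capacity, per-disk recovery bandwidth, and cluster size, shows why spreading replicas across many disks shortens the window in which data is at risk:

```python
# Back-of-envelope estimate of the recovery window after a disk failure.
# All numbers below are assumptions for illustration only.
disk_capacity_tb = 4.0           # data to re-replicate from the failed disk
recovery_mb_s_per_disk = 50.0    # recovery bandwidth budget per participating disk

for participating_disks in (1, 8, 24):
    hours = (disk_capacity_tb * 1e6) / (recovery_mb_s_per_disk * participating_disks) / 3600
    print(f"{participating_disks:2d} disks participating -> ~{hours:5.1f} h at risk")

# During this window a replica 2 pool holds only a single copy of the
# affected data, which is why its recovery is prioritized.
```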
In addition, ODF uses OpenShift’s high-availability features to provide resilience to failures for Ceph services and daemons. For example, the object gateway is rescheduled to a different node in case of a node failure.

Data Integrity

Data integrity is achieved if all data that is stored within the cluster can be read back without changes. This sounds like an easy task, but it is made harder by the fact that hardware continuously introduces errors that need to be mitigated.
One of the main issues is the so-called bit flip, where a stored bit changes over time. This is measured as the UBER (Unrecoverable Bit Error Rate): the fraction of bits that are in error relative to the total number of bits read, after error correction has been applied.
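As a worked example with an assumed, vendor-style UBER rating (not a measured value), a short calculation shows how such a rate translates into expected read errors over the amount of data a disk reads in its lifetime:

```python
# Expected unrecoverable read errors for an assumed UBER rating.
uber = 1e-15              # assumed rating: one error per 10^15 bits read
data_read_tb = 500        # assumed total data read over the disk's lifetime

bits_read = data_read_tb * 1e12 * 8
expected_errors = bits_read * uber
print(f"~{expected_errors:.1f} unrecoverable bit errors expected")  # ~4.0
```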
Modern file systems and applications can tolerate bit flips up to a certain degree before they fail to function. Once they fail, the data can usually no longer be repaired.
Another hardware fault is a disk failure, where a whole disk goes offline. The likelihood of a disk failure is predicted with the MTBF (Mean Time Between Failures), which estimates how long a disk will last until it fails. The MTBF can be calculated for individual disk models or for disk classes as a whole.
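As a worked example with an assumed, vendor-style MTBF figure, the rating can be translated into an annualized failure rate (AFR) and into the number of failures to expect per year across a fleet of disks:

```python
# Translate an assumed MTBF rating into an annualized failure rate (AFR).
mtbf_hours = 1_200_000    # assumed vendor rating
hours_per_year = 24 * 365

afr = hours_per_year / mtbf_hours     # fraction of disks expected to fail per year
fleet_size = 100                      # assumed number of disks in the cluster
expected_failures = afr * fleet_size

print(f"AFR = {afr:.2%} per disk, ~{expected_failures:.1f} failures/year across {fleet_size} disks")
```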
In ODF we protect data against disk faults by replicating all incoming data several times. We then compare the replicas with each other over time to protect against bit flips.
As a comparison, while regular RAID cards also protect against disk failures via replication, they don’t compare the written data and thus do not protect against bit flips.
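In Ceph, this comparison of replicas is performed by the scrubbing mechanism, which periodically reads the stored copies and checks them against each other. The sketch below shows, as a hedged example, how a deep scrub could be triggered or its interval tuned with standard Ceph commands; the OSD id and interval value are arbitrary placeholders, and in ODF these settings are normally left at the operator's defaults.

```python
import subprocess

def ceph(*args: str) -> str:
    """Run a Ceph CLI command and return its output."""
    return subprocess.run(["ceph", *args], check=True,
                          capture_output=True, text=True).stdout.strip()

# Trigger a deep scrub (full data comparison of all replicas) on one OSD.
ceph("osd", "deep-scrub", "0")  # OSD id chosen arbitrarily

# Tune how often deep scrubs run (value in seconds; 604800 s = one week,
# which is also the upstream default).
ceph("config", "set", "osd", "osd_deep_scrub_interval", "604800")
```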

ODF support decision basis

When deciding if we want to support a given architecture with OpenShift Data Foundation (ODF) we look at the MTBF and UBER numbers and calculate a risk for this architecture. You can see a summary of such a calculation in the table below.

Risk   | Replication Rule | Disk Media used    | Supported for Internal ODF | Supported for External ODF
High   | 2x               | Enterprise HDD+SSD | No                         | No
High   | 2x               | Enterprise SSD     | Yes***                     | Yes****
Medium | 4+2              | Enterprise HDD+SSD | No                         | Yes**
Medium | 3x               | Enterprise HDD+SSD | No                         | Yes*
Medium | 8+3              | Enterprise HDD+SSD | No                         | Yes**
Low    | 4+2              | Enterprise SSD     | No                         | Yes**
Low    | 3x               | Enterprise SSD     | Yes (Default)              | Yes (Default)
Low    | 8+3              | Enterprise SSD     | No                         | Yes**

*    Only recommended for object storage pools
**   Only supported for object storage pools
***  Only supported for non-default RBD pools. See Managing and allocating Storage Resources, section 2.1
**** Currently no support for CephFS or RGW pools

We can see that changing the disk type to include HDDs alongside SSDs already increases the risk significantly. This is because HDDs are more prone to failure than flash-based disks.
If we then also reduce the replica count, we are immediately at high risk; this is the case even with SSD-only deployments. The reason is that with 2x replication, where data is only ever written twice, we can no longer ensure data integrity: if the two replicas differ, the storage system cannot know which copy is correct. In addition, any failure in the storage system is then very likely to also cause data availability issues through cascading events.
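A simplified model, using assumed failure-rate and recovery-window numbers, illustrates how much more exposed a replica 2 pool is to a further failure while recovery is still running:

```python
# Illustrative model of losing the last remaining copy during recovery.
# All inputs are assumptions, not measured values.
afr = 0.007              # assumed annualized failure rate per disk
recovery_hours = 8.0     # assumed time until full redundancy is restored
other_disks = 23         # other disks that hold the surviving copies

hours_per_year = 24 * 365
p_one_disk_fails_in_window = afr * recovery_hours / hours_per_year

# Replica 2: after the first failure, one more failure among the disks
# holding the remaining copies is enough to lose data.
p_replica2 = 1 - (1 - p_one_disk_fails_in_window) ** other_disks

# Replica 3: two further, overlapping failures are needed (rough estimate).
p_replica3 = p_replica2 * p_one_disk_fails_in_window

print(f"replica 2: {p_replica2:.1e}  vs  replica 3: {p_replica3:.1e} per failure event")
```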
