Chapter 18. Backing up and restoring Data Grid clusters

Data Grid Operator lets you back up and restore Data Grid cluster state for disaster recovery and to migrate Data Grid resources between clusters.

18.1. Backup and Restore CRs

Backup and Restore CRs save in-memory data at runtime so you can easily recreate Data Grid clusters.

Applying a Backup or Restore CR creates a new pod that joins the Data Grid cluster as a zero-capacity member, which means it does not require cluster rebalancing or state transfer to join.

For backup operations, the pod iterates over cache entries and other resources and creates an archive, a .zip file, in the /opt/infinispan/backups directory on the persistent volume (PV).

Note

Performing backups does not significantly impact performance because the other pods in the Data Grid cluster only need to respond to the backup pod as it iterates over cache entries.

For restore operations, the pod retrieves Data Grid resources from the archive on the PV and applies them to the Data Grid cluster.

When either the backup or restore operation completes, the pod leaves the cluster and is terminated.

Reconciliation

Data Grid Operator does not reconcile Backup and Restore CRs, which means that backup and restore operations are "one-time" events.

Modifying an existing Backup or Restore CR instance does not perform an operation or have any effect. If you want to update .spec fields, you must create a new instance of the Backup or Restore CR.

18.2. Backing up Data Grid clusters

Create a backup file that stores Data Grid cluster state to a persistent volume.

Prerequisites

  • Create an Infinispan CR with spec.service.type: DataGrid.
  • Ensure there are no active client connections to the Data Grid cluster.

    Data Grid backups do not provide snapshot isolation, and data modifications are not written to the archive after the cache is backed up.
    To archive the exact state of the cluster, you should always disconnect any clients before you back it up.

Procedure

  1. Name the Backup CR with the metadata.name field.
  2. Specify the Data Grid cluster to back up with the spec.cluster field.
  3. Configure the persistent volume claim (PVC) that adds the backup archive to the persistent volume (PV) with the spec.volume.storage and spec.volume.storageClassName fields.

    apiVersion: infinispan.org/v2alpha1
    kind: Backup
    metadata:
      name: my-backup
    spec:
      cluster: source-cluster
      volume:
        storage: 1Gi
        storageClassName: my-storage-class
  4. Optionally include spec.resources fields to specify which Data Grid resources you want to back up.

    If you do not include any spec.resources fields, the Backup CR creates an archive that contains all Data Grid resources. If you do specify spec.resources fields, the Backup CR creates an archive that contains those resources only.

    spec:
      ...
      resources:
        templates:
          - distributed-sync-prod
          - distributed-sync-dev
        caches:
          - cache-one
          - cache-two
        counters:
          - counter-name
        protoSchemas:
          - authors.proto
          - books.proto
        tasks:
          - wordStream.js

    You can also use the * wildcard character as in the following example:

    spec:
      ...
      resources:
        caches:
          - "*"
        protoSchemas:
          - "*"
  5. Apply your Backup CR.

    oc apply -f my-backup.yaml
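
The manifest from steps 1 to 3 can also be generated by a script, which can help when each backup needs its own metadata.name. The following is a minimal sketch and not part of Data Grid Operator: the function name render_backup_cr and its argument order are hypothetical, and the fields mirror the example above.

```shell
# Hypothetical helper: print a Backup CR manifest to stdout.
# Arguments: <backup-name> <cluster-name> <storage-size> <storage-class>
render_backup_cr() {
  cat <<EOF
apiVersion: infinispan.org/v2alpha1
kind: Backup
metadata:
  name: $1
spec:
  cluster: $2
  volume:
    storage: $3
    storageClassName: $4
EOF
}
```

To apply the generated manifest without writing a file, pipe it to oc, for example: render_backup_cr my-backup source-cluster 1Gi my-storage-class | oc apply -f -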

Verification

  1. Check that the status.phase field has a status of Succeeded in the Backup CR and that Data Grid logs have the following message:

    ISPN005044: Backup file created 'my-backup.zip'
  2. Run the following command to check that the backup was created successfully:

    oc describe Backup my-backup
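
Instead of checking the status.phase field by hand, you can poll it from a script. The following sketch assumes that oc is logged in to the project that contains the CR. The function name wait_for_phase and the default five-second interval are hypothetical choices, not Data Grid Operator features.

```shell
# Hypothetical helper: wait until a Backup or Restore CR reaches a terminal phase.
# Arguments: <kind> <name> [poll-interval-seconds]
# Prints the terminal phase; returns 0 for Succeeded, 1 for Failed,
# and 2 if the CR cannot be read.
wait_for_phase() {
  kind=$1
  name=$2
  interval=${3:-5}
  while :; do
    # Read status.phase from the CR; stop if oc reports an error.
    phase=$(oc get "$kind" "$name" -o jsonpath='{.status.phase}') || return 2
    case "$phase" in
      Succeeded) echo "$phase"; return 0 ;;
      Failed)    echo "$phase"; return 1 ;;
    esac
    # Any other phase is treated as still in progress.
    sleep "$interval"
  done
}
```

For example, wait_for_phase Backup my-backup blocks until the backup in this procedure succeeds or fails. The loop has no overall timeout, so add one if you call it from automation.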

18.3. Restoring Data Grid clusters

Restore Data Grid cluster state from a backup archive.

Prerequisites

  • Create a Backup CR on a source cluster.
  • Create a target Data Grid cluster of Data Grid service pods.

    Note

    If you restore an existing cache, the operation overwrites the data in the cache but not the cache configuration.

    For example, you back up a distributed cache named mycache on the source cluster. You then restore mycache on a target cluster where it already exists as a replicated cache. In this case, the data from the source cluster is restored and mycache continues to have a replicated configuration on the target cluster.

  • Ensure there are no active client connections to the target Data Grid cluster you want to restore.

    Cache entries that you restore from a backup can overwrite more recent cache entries.
    For example, a client performs a cache.put(k=2) operation and you then restore a backup that contains k=1. After the restore, the cache holds k=1 and the more recent write is lost.

Procedure

  1. Name the Restore CR with the metadata.name field.
  2. Specify a Backup CR to use with the spec.backup field.
  3. Specify the Data Grid cluster to restore with the spec.cluster field.

    apiVersion: infinispan.org/v2alpha1
    kind: Restore
    metadata:
      name: my-restore
    spec:
      backup: my-backup
      cluster: target-cluster
  4. Optionally add the spec.resources field to restore specific resources only.

    spec:
      ...
      resources:
        templates:
          - distributed-sync-prod
          - distributed-sync-dev
        caches:
          - cache-one
          - cache-two
        counters:
          - counter-name
        protoSchemas:
          - authors.proto
          - books.proto
        tasks:
          - wordStream.js
  5. Apply your Restore CR.

    oc apply -f my-restore.yaml

Verification

  • Check that the status.phase field has a status of Succeeded in the Restore CR and that Data Grid logs have the following message:

    ISPN005045: Restore 'my-backup' complete

You should then open the Data Grid Console or establish a CLI connection to verify data and Data Grid resources are restored as expected.
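
The log check in the verification step can be scripted with a small filter. This is a sketch under assumptions: the function name restore_logged is hypothetical, and it matches only the exact message text shown above.

```shell
# Hypothetical helper: succeed if the log text on stdin contains the
# restore-complete message for the given Backup CR name.
# Argument: <backup-name>
restore_logged() {
  # -F matches the message as a fixed string, -q suppresses output.
  grep -qF "ISPN005045: Restore '$1' complete"
}
```

For example, pipe the output of oc logs for a Data Grid pod into restore_logged my-backup and test the exit status.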

18.4. Backup and restore status

Backup and Restore CRs include a status.phase field that reports the current phase of the operation.

Status        Description

Initializing  The system has accepted the request and the controller is preparing the underlying resources to create the pod.
Initialized   The controller has prepared all underlying resources successfully.
Running       The pod is created and the operation is in progress on the Data Grid cluster.
Succeeded     The operation has completed successfully on the Data Grid cluster and the pod is terminated.
Failed        The operation did not successfully complete and the pod is terminated.
Unknown       The controller cannot obtain the status of the pod or determine the state of the operation. This condition typically indicates a temporary communication error with the pod.
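
When you automate backups or restores, it can help to map each status.phase value to a next step. The following sketch is one possible mapping based on the descriptions above. The function name phase_action and the action labels it prints are hypothetical.

```shell
# Hypothetical helper: print a suggested next step for a status.phase value.
# Argument: <phase>
phase_action() {
  case "$1" in
    # The operation has not finished yet.
    Initializing|Initialized|Running) echo "wait" ;;
    # The operation completed and the pod is terminated.
    Succeeded) echo "done" ;;
    # Examine the pod logs before you try again.
    Failed) echo "inspect-logs" ;;
    # Typically a temporary communication error, so check again later.
    Unknown) echo "recheck" ;;
    *) echo "unrecognized"; return 1 ;;
  esac
}
```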

18.4.1. Handling failed backup and restore operations

If the status.phase field of the Backup or Restore CR is Failed, you should examine pod logs to determine the root cause before you attempt the operation again.

Procedure

  1. Examine the logs for the pod that performed the failed operation.

    Pods are terminated but remain available until you delete the Backup or Restore CR.

    oc logs <backup|restore_pod_name>
  2. Resolve any error conditions or other causes of failure as indicated by the pod logs.
  3. Create a new instance of the Backup or Restore CR and attempt the operation again.
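
Because modifying an existing Backup or Restore CR has no effect, retrying means applying the manifest again under a different metadata.name. The following sketch rewrites the name in an existing manifest. The function name with_new_name is hypothetical, and the sed expression assumes that metadata.name is the only line of the form "  name: ..." with a two-space indent, as in the examples in this chapter.

```shell
# Hypothetical helper: read a Backup or Restore manifest on stdin and print it
# with metadata.name replaced by the given value.
# Argument: <new-name>
with_new_name() {
  # Replace only lines that start with exactly two spaces followed by "name:".
  sed "s/^  name: .*/  name: $1/"
}
```

For example, with_new_name my-backup-2 < my-backup.yaml | oc apply -f - creates a new Backup CR from the original manifest, where my-backup-2 is an arbitrary new name.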