Testing disaster recovery with odf dr command - OpenShift Data Foundation 4.19 Developer Preview

Important: A developer preview feature is subject to Developer preview support limitations. Developer preview features are not intended to be run in production environments. Clusters deployed with developer preview features are considered development clusters and are not supported through the Red Hat Customer Portal case management system. Developer preview features are meant for customers who are willing to evaluate new products or releases of products at an early stage of product development. If you need assistance with developer preview features, reach out to the ocs-devpreview@redhat.com mailing list and a member of the Red Hat Development Team will assist you as quickly as possible based on availability and work schedules. To learn more about the support scope, refer to the KCS article.

Overview

How can you test whether disaster recovery works in your clusters? Deploying and configuring clusters for disaster recovery is complicated: the system has many moving parts, and many things can go wrong.

The best way to verify that the system is configured correctly is to deploy a simple application and test the real disaster recovery flow. The odf dr command makes this task easy.

Environment

For disaster recovery you must have a hub cluster and two managed clusters, configured for Regional-DR using OpenShift Data Foundation.

NOTE: The odf dr tool is not yet compatible with Metro-DR.

Preparing a configuration file

odf dr uses a configuration file to access the clusters and the related resources needed for testing. To create the configuration file, run the following command:

$ odf dr init

✅ Created config file "config.yaml" - please modify for your clusters

The command creates the file config.yaml in the current directory. Edit the file to adapt it to your clusters.

Configuring clusters

Edit the clusters section and update the kubeconfig to point to the kubeconfig files for your clusters:

clusters:
  hub:
    kubeconfig: mykubeconfigs/hub
  c1:
    kubeconfig: mykubeconfigs/primary-cluster
  c2:
    kubeconfig: mykubeconfigs/secondary-cluster
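
Before moving on, you can optionally verify that each kubeconfig works and points to the intended cluster. A minimal check, using the example paths above:

$ oc --kubeconfig mykubeconfigs/hub cluster-info
$ oc --kubeconfig mykubeconfigs/primary-cluster get nodes
$ oc --kubeconfig mykubeconfigs/secondary-cluster get nodes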

Configuring drPolicy

Edit the drPolicy section to match your DR configuration:

 drPolicy: drpolicy-1m

TIP: For a quicker test, use a policy with a 1-minute interval.
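
If you are not sure which DR policies exist in your environment, you can list them on the hub cluster, for example:

$ oc --kubeconfig mykubeconfigs/hub get drpolicy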

Configuring clusterSet

Edit clusterSet to match your ACM configuration:

clusterSet: submariner 
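
To find the right name, you can list the ManagedClusterSets defined in ACM on the hub cluster, for example:

$ oc --kubeconfig mykubeconfigs/hub get managedclustersets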

Configuring pvcSpecs

Edit the pvcSpecs section to use the right storage class names for your clusters:

pvcSpecs:
- name: rbd
  storageClassName: ocs-storagecluster-ceph-rbd
  accessModes: ReadWriteOnce
- name: cephfs
  storageClassName: ocs-storagecluster-cephfs
  accessModes: ReadWriteMany

TIP: You can add more pvcSpecs for testing other storage classes as needed. Modify the tests to refer to your own pvcSpec names.
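
To find the storage class names available on a managed cluster, you can list them with, for example:

$ oc --kubeconfig mykubeconfigs/primary-cluster get storageclass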

Configuring tests

The default tests use a busybox deployment with one PVC using the rbd storage class, deployed using ApplicationSet. You can modify the tests to use your preferred deployment and storage, and add more tests as needed (see the example below).

The available options are:

  • workloads
    • deploy: busybox deployment with one PVC
  • deployers
    • appset: ACM managed application deployed using ApplicationSet.
    • subscr: ACM managed application deployed using Subscription.
    • disapp: ACM discovered application deployed by the test command.
  • pvcSpecs:
    • rbd: Ceph RBD storage
    • cephfs: CephFS storage
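
For example, a tests section that exercises both of the pvcSpecs above with the ApplicationSet deployer might look like this (a sketch; adjust the deployer, workload, and pvcSpec values as needed):

tests:
- deployer: appset
  workload: deploy
  pvcSpec: rbd
- deployer: appset
  workload: deploy
  pvcSpec: cephfs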

Running a test

This section shows how to run a test and inspect the test report.

Run a test and store the test report in the directory test:

$ odf dr test run -o test
  Using report "test"
  Using config "config.yaml"

   Validate config ...
     ✅ Config validated

   Setup environment ...
     ✅ Environment setup

   Run tests ...
     ✅ Application "appset-deploy-rbd" deployed
     ✅ Application "appset-deploy-rbd" protected
     ✅ Application "appset-deploy-rbd" failed over
     ✅ Application "appset-deploy-rbd" relocated
     ✅ Application "appset-deploy-rbd" unprotected
     ✅ Application "appset-deploy-rbd" undeployed

   ✅ passed (1 passed, 0 failed, 0 skipped)

To clean up after the test, use the clean command.

The test flow

When running the run command, odf dr prepares the clusters for the tests and runs all the tests specified in the configuration file.

Preparing the clusters

odf dr creates the namespace test-gitops on the hub cluster and adds a channel for https://github.com/RamenDR/ocm-ramen-samples.

Procedure

For every test specified in the configuration file, odf dr performs the following steps:

  1. deploy: Deploy the application in namespace
    test-{deployer}-{workload}-{pvcSpec} in the primary cluster.
  2. protect: Create a drpc resource for the application and wait until the application is protected.
  3. failover: Fail over the application to the secondary cluster and wait until the application is protected.
  4. relocate: Relocate the application back to the primary cluster and wait until the application is protected.
  5. unprotect: Delete the drpc resource for the application and wait until the drpc is deleted.
  6. undeploy: Undeploy the application from the managed clusters and wait until the application is deleted.
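
While the run command progresses through these steps, you can optionally watch the DRPC resources on the hub cluster to follow the DR actions, for example:

$ oc --kubeconfig mykubeconfigs/hub get drpc -A -w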

The test report

The odf dr command stores test-run.yaml and test-run.log in the specified output directory:

$ tree test
test
├── test-run.log
└── test-run.yaml

IMPORTANT: When reporting DR-related issues, create an archive with the output directory and upload it to the issue tracker.
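
For example, assuming the output directory is named test:

$ tar czf test.tar.gz test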

The test-run.yaml

The test-run.yaml file is a machine- and human-readable description of the test run:

config:
  channel:
    name: https-github-com-ramendr-ocm-ramen-samples-git
    namespace: test-gitops
  clusterSet: submariner
  clusters:
    c1:
      kubeconfig: mykubeconfigs/primary-cluster
    c2:
      kubeconfig: mykubeconfigs/secondary-cluster
    hub:
      kubeconfig: mykubeconfigs/hub
  distro: ocp
  drPolicy: drpolicy-1m
  namespaces:
    argocdNamespace: openshift-gitops
    ramenDRClusterNamespace: openshift-dr-system
    ramenHubNamespace: openshift-operators
    ramenOpsNamespace: openshift-dr-ops
  pvcSpecs:
  - accessModes: ReadWriteOnce
    name: rbd
    storageClassName: ocs-storagecluster-ceph-rbd
  - accessModes: ReadWriteMany
    name: cephfs
    storageClassName: ocs-storagecluster-cephfs
  repo:
    branch: main
    url: https://github.com/RamenDR/ocm-ramen-samples.git
  tests:
  - deployer: appset
    pvcSpec: rbd
    workload: deploy
created: "2025-04-24T16:33:28.800757+05:30"
duration: 695.608462543
host:
  arch: arm64
  cpus: 12
  os: darwin
name: test-run
status: passed
steps:
- duration: 0.022823334
  name: validate
  status: passed
- duration: 0.009449584
  name: setup
  status: passed
- duration: 695.576189625
  items:
    duration: 695.576118209
    name: appset-deploy-rbd
    status: passed
  name: tests
  status: passed
summary:
  canceled: 0
  failed: 0
  passed: 1
  skipped: 0

You can query it with tools like yq:

$ yq .status < test/test-run.yaml
passed
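
Other fields can be queried the same way, for example the summary and the total duration:

$ yq .summary < test/test-run.yaml
canceled: 0
failed: 0
passed: 1
skipped: 0

$ yq .duration < test/test-run.yaml
695.608462543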

The test-run.log

The test-run.log contains detailed logs of the test progress.

To extract the major events of a single test, use:

grep -E '(INFO|ERROR).+appset-deploy-rbd' test/test-run.log

Example output:

2025-03-29T23:56:24.547+0300    INFO    appset-deploy-rbd   deployers/appset.go:23  Deploying applicationset app "test-appset-deploy-rbd/busybox" in cluster "primary-cluster"
2025-03-29T23:56:25.060+0300    INFO    appset-deploy-rbd   deployers/appset.go:41  Workload deployed
2025-03-29T23:56:25.383+0300    INFO    appset-deploy-rbd   dractions/actions.go:51 Protecting workload "test-appset-deploy-rbd/busybox" in cluster "primary-cluster"
2025-03-29T23:59:16.414+0300    INFO    appset-deploy-rbd   dractions/actions.go:93 Workload protected
2025-03-29T23:59:16.892+0300    INFO    appset-deploy-rbd   dractions/actions.go:157    Failing over workload "test-appset-deploy-rbd/busybox" from cluster "primary-cluster" to cluster "secondary-cluster"
2025-03-30T00:05:03.748+0300    INFO    appset-deploy-rbd   dractions/actions.go:165    Workload failed over
2025-03-30T00:05:04.226+0300    INFO    appset-deploy-rbd   dractions/actions.go:190    Relocating workload "test-appset-deploy-rbd/busybox" from cluster "secondary-cluster" to cluster "primary-cluster"
2025-03-30T00:10:50.940+0300    INFO    appset-deploy-rbd   dractions/actions.go:198    Workload relocated
2025-03-30T00:10:51.260+0300    INFO    appset-deploy-rbd   dractions/actions.go:121    Unprotecting workload "test-appset-deploy-rbd/busybox" in cluster "primary-cluster"
2025-03-30T00:11:17.561+0300    INFO    appset-deploy-rbd   dractions/actions.go:136    Workload unprotected
2025-03-30T00:11:17.882+0300    INFO    appset-deploy-rbd   deployers/appset.go:61  Undeploying applicationset app "test-appset-deploy-rbd/busybox" in cluster "primary-cluster"
2025-03-30T00:11:18.379+0300    INFO    appset-deploy-rbd   deployers/appset.go:80  Workload undeployed
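
To quickly check whether any errors were logged during the run, you can also search for ERROR lines only:

grep ERROR test/test-run.log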

Cleaning up

To clean up after a test and remove the resources created by it, run:

$ odf dr test clean -o test
   Using report "test"
   Using config "config.yaml"

 Validate config ...
   ✅ Config validated

 Clean tests ...
   ✅ Application "appset-deploy-rbd" unprotected
   ✅ Application "appset-deploy-rbd" undeployed

 Clean environment ...
   ✅ Environment cleaned

✅ passed (1 passed, 0 failed, 0 skipped)

The clean command adds test-clean.log and test-clean.yaml to the output directory:

$ tree test
test
├── test-clean.log
├── test-clean.yaml
├── test-run.log
└── test-run.yaml

The clean flow

When running the clean command, odf dr deletes all the test applications specified in the configuration file and cleans up the clusters.

For every test specified in the configuration file, odf dr performs the following steps:

  1. unprotect: Delete the drpc resource for the application and wait until the drpc is deleted.

  2. undeploy: Undeploy the application from the managed clusters and wait until the application is deleted.

To clean up the clusters, odf dr deletes the channel and the namespace test-gitops on the hub cluster.

Failed tests

When a test fails, the test command gathers data related to the failed tests in the output directory. The gathered data can help you or the developers diagnose the issue.

The following example shows a test run with a failed test, and how to inspect the failure.

TIP: In this example, to fail the test, the rbd-mirror deployment is scaled down on the primary cluster after the application reached the protected state.
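
For reference, scaling down the rbd-mirror deployment can be done with a command like the following; this is a sketch that assumes the deployment is named rook-ceph-rbd-mirror-a in the openshift-storage namespace, which may differ in your cluster:

$ oc --kubeconfig mykubeconfigs/primary-cluster -n openshift-storage scale deployment rook-ceph-rbd-mirror-a --replicas=0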

  • Run the test:
$ odf dr test run -o example-failure
  Using report "example-failure"
  Using config "config.yaml"

 Validate config ...
   ✅ Config validated

 Setup environment ...
   ✅ Environment setup

 Run tests ...
   ✅ Application "appset-deploy-rbd" deployed
   ✅ Application "appset-deploy-rbd" protected
   ❌ failed to failover application "appset-deploy-rbd"

 Gather data ...
   ✅ Gathered data from cluster "hub"
   ✅ Gathered data from cluster "secondary-cluster"
   ✅ Gathered data from cluster "primary-cluster"

❌ failed (0 passed, 1 failed, 0 skipped)

The command stores gathered data in the test-run.gather directory:

$ tree -L 2 example-failure
example-failure
├── test-run.gather
│   ├── hub
│   ├── primary-cluster
│   └── secondary-cluster
├── test-run.log
└── test-run.yaml

IMPORTANT: When reporting DR-related issues, create an archive with the output directory and upload it to the issue tracker.

Inspecting gathered data

The command gathers all the namespaces related to the failed test, and related cluster scope resources such as storage classes and persistent volumes.

$ tree -L 3 example-failure/test-run.gather
example-failure/test-run.gather
├── hub
│   ├── cluster
│   │   └── namespaces
│   └── namespaces
│       ├── openshift-gitops
│       └── openshift-operators
├── primary-cluster
│   ├── cluster
│   │   ├── namespaces
│   │   ├── persistentvolumes
│   │   └── storage.k8s.io
│   └── namespaces
│       ├── openshift-dr-system
│       ├── openshift-gitops
│       ├── openshift-operators
│       └── test-appset-deploy-rbd
└── secondary-cluster
    ├── cluster
    │   ├── namespaces
    │   ├── persistentvolumes
    │   └── storage.k8s.io
    └── namespaces
        ├── openshift-dr-system
        ├── openshift-gitops
        ├── openshift-operators
        └── test-appset-deploy-rbd

Change directory into the gather directory to simplify the next steps:

$ cd example-failure/test-run.gather

You can start by looking at the DRPC:

$ cat hub/namespaces/openshift-gitops/ramendr.openshift.io/drplacementcontrols/appset-deploy-rbd.yaml
...
status:
  actionStartTime: "2025-04-01T18:02:06Z"
  conditions:
  - lastTransitionTime: "2025-04-01T18:02:41Z"
    message: Completed
    observedGeneration: 2
    reason: FailedOver
    status: "True"
    type: Available
  - lastTransitionTime: "2025-04-01T18:02:06Z"
    message: Started failover to cluster "secondary-cluster"
    observedGeneration: 2
    reason: NotStarted
    status: "False"
    type: PeerReady
  - lastTransitionTime: "2025-04-01T18:03:11Z"
    message: VolumeReplicationGroup (test-appset-deploy-rbd/appset-deploy-rbd) on
      cluster secondary-cluster is not reporting any lastGroupSyncTime as primary, retrying
      till status is met
    observedGeneration: 2
    reason: Progressing
    status: "False"
    type: Protected
  lastUpdateTime: "2025-04-01T18:08:11Z"
  observedGeneration: 2
  phase: FailedOver
  preferredDecision:
    clusterName: primary-cluster
    clusterNamespace: primary-cluster
  progression: Cleaning Up

You can see that the application is stuck in the Cleaning Up progression and that the VRG in cluster secondary-cluster is not reporting a lastGroupSyncTime value.
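
If you prefer to extract just the relevant condition instead of reading the whole resource, you can use yq, for example:

$ yq '.status.conditions[] | select(.type == "Protected")' hub/namespaces/openshift-gitops/ramendr.openshift.io/drplacementcontrols/appset-deploy-rbd.yaml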

Next, inspect the VRG in cluster secondary-cluster:

$ cat secondary-cluster/namespaces/test-appset-deploy-rbd/ramendr.openshift.io/volumereplicationgroups/appset-deploy-rbd.yaml
...
status:
  conditions:
  - lastTransitionTime: "2025-04-01T18:02:48Z"
    message: PVCs in the VolumeReplicationGroup are ready for use
    observedGeneration: 2
    reason: Ready
    status: "True"
    type: DataReady
  - lastTransitionTime: "2025-04-01T18:02:37Z"
    message: VolumeReplicationGroup is replicating
    observedGeneration: 2
    reason: Replicating
    status: "False"
    type: DataProtected
  - lastTransitionTime: "2025-04-01T18:02:37Z"
    message: Restored 0 volsync PVs/PVCs and 2 volrep PVs/PVCs
    observedGeneration: 2
    reason: Restored
    status: "True"
    type: ClusterDataReady
  - lastTransitionTime: "2025-04-01T18:02:47Z"
    message: Cluster data of all PVs are protected
    observedGeneration: 2
    reason: Uploaded
    status: "True"
    type: ClusterDataProtected
  - lastTransitionTime: "2025-04-01T18:02:37Z"
    message: Kube objects restored
    observedGeneration: 2
    reason: KubeObjectsRestored
    status: "True"
    type: KubeObjectsReady
  kubeObjectProtection: {}
  lastUpdateTime: "2025-04-01T18:12:48Z"
  observedGeneration: 2
  ...

You can see that DataProtected is False.

Next, inspect the VR resource in the same namespace:

$ cat secondary-cluster/namespaces/test-appset-deploy-rbd/replication.storage.openshift.io/volumereplications/busybox-pvc.yaml
...
status:
  conditions:
  - lastTransitionTime: "2025-04-01T18:02:48Z"
    message: volume is promoted to primary and replicating to secondary
    observedGeneration: 1
    reason: Promoted
    status: "True"
    type: Completed
  - lastTransitionTime: "2025-04-01T18:02:48Z"
    message: volume is healthy
    observedGeneration: 1
    reason: Healthy
    status: "False"
    type: Degraded
  - lastTransitionTime: "2025-04-01T18:02:47Z"
    message: volume is not resyncing
    observedGeneration: 1
    reason: NotResyncing
    status: "False"
    type: Resyncing
  - lastTransitionTime: "2025-04-01T18:02:47Z"
    message: volume is validated and met all prerequisites
    observedGeneration: 1
    reason: PrerequisiteMet
    status: "True"
    type: Validated
  lastCompletionTime: "2025-04-01T18:12:48Z"
  message: volume is marked primary
  observedGeneration: 1
  state: Primary

You can see that the VR is primary and replicating to the other cluster.

You can also inspect the ramen-dr-cluster-operator logs:

$ tree secondary-cluster/namespaces/openshift-dr-system/pods/ramen-dr-cluster-operator-5dd448864d-78x8l/manager/
secondary-cluster/namespaces/openshift-dr-system/pods/ramen-dr-cluster-operator-5dd448864d-78x8l/manager/
├── current.log
└── previous.log
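
For example, to scan the current log for errors:

$ grep ERROR secondary-cluster/namespaces/openshift-dr-system/pods/ramen-dr-cluster-operator-5dd448864d-78x8l/manager/current.log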

In this case, the gathered data shows that Ramen is not the root cause, and you need to inspect the storage.

If more information is needed, you can use the standard must-gather with the OpenShift Data Foundation or ACM images to perform a full gather.

When you finish debugging the failed test, clean up:

$ odf dr test clean -o example-failure
   Using report "example-failure"
   Using config "config.yaml"

 Validate config ...
   ✅ Config validated

 Clean tests ...
   ✅ Application "appset-deploy-rbd" unprotected
   ✅ Application "appset-deploy-rbd" undeployed

 Clean environment ...
   ✅ Environment cleaned

✅ passed (1 passed, 0 failed, 0 skipped)

Canceling tests

The run or clean command may take up to 10 minutes to complete the current test step. To get all the information about failed tests, you should wait until the command completes and gathers data for failed tests.

You can cancel the command by pressing Ctrl+C. This saves the progress of the current tests but does not gather data for incomplete tests.
