Chapter 5. Troubleshooting disaster recovery

5.1. Troubleshooting Metro-DR

5.1.1. A statefulset application stuck after failover

Problem

While relocating to a preferred cluster, DRPlacementControl is stuck reporting PROGRESSION as "MovingToSecondary".

Before Kubernetes v1.23, the Kubernetes control plane never cleaned up the PVCs created for StatefulSets; this task was left to the cluster administrator or a software operator managing the StatefulSets. As a result, the PVCs of a StatefulSet were left untouched when its Pods were deleted. This prevents Ramen from relocating an application to its preferred cluster.

Resolution
  1. If the workload uses StatefulSets, and relocation is stuck with PROGRESSION as "MovingToSecondary", then run:

    $ oc get pvc -n <namespace>
  2. For each bound PVC in that namespace that belongs to the StatefulSet, run the following command (see the sketch after this procedure for deleting all of them in one pass):

    $ oc delete pvc <pvcname> -n <namespace>

    Once all PVCs are deleted, the Volume Replication Group (VRG) transitions to secondary, and is then deleted.

  3. Run the following command:

    $ oc get drpc -n <namespace> -o wide

    After a few seconds to a few minutes, the PROGRESSION reports "Completed" and relocation is complete.
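
If the StatefulSet owns many PVCs, a loop such as the following can delete them in one pass. This is a minimal sketch that assumes every PVC in the namespace belongs to the StatefulSet being relocated; do not use it if other workloads share the namespace.

$ for pvc in $(oc get pvc -n <namespace> -o jsonpath='{.items[*].metadata.name}'); do oc delete pvc "$pvc" -n <namespace>; done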

Result
The workload is relocated to the preferred cluster.

BZ reference: [2118270]

5.1.2. DR policies protect all applications in the same namespace

Problem
While only a single application is selected to be used by a DR policy, all applications in the same namespace are protected. This happens because replication management matches PVCs against the DRPlacementControl spec.pvcSelector; when the selector matches PVCs across multiple workloads, or when the selector is missing and therefore matches all workloads, each PVC can be managed multiple times, which can cause data corruption or invalid operations based on individual DRPlacementControl actions.
Resolution
Label PVCs that belong to a workload uniquely, and use the selected label as the DRPlacementControl spec.pvcSelector to disambiguate which DRPlacementControl protects and manages which subset of PVCs within a namespace. It is not possible to specify the spec.pvcSelector field for the DRPlacementControl using the user interface, so the DRPlacementControl for such applications must be deleted and re-created using the command line.
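
The following is a minimal sketch of this approach, assuming the ramendr.openshift.io/v1alpha1 API for DRPlacementControl and a placeholder label key appname; only the relevant pvcSelector portion of the DRPlacementControl spec is shown. First label the PVCs of the workload, then reference that label in spec.pvcSelector.

$ oc label pvc <pvcname> -n <namespace> appname=<workload-name>

apiVersion: ramendr.openshift.io/v1alpha1
kind: DRPlacementControl
metadata:
  name: <workload-name>-drpc
  namespace: <namespace>
spec:
  pvcSelector:
    matchLabels:
      appname: <workload-name>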

BZ reference: [2111163]

5.1.3. Application stuck in the Relocating state during failback

Problem
This issue might occur after performing failover and failback of an application (when all nodes or clusters are up). During failback, the application is stuck in the Relocating state with the message Waiting for PV restore to complete. This typically indicates that duplicate PersistentVolume (PV) objects exist in the S3 store.
Resolution
Use an S3 client or equivalent to clean up the duplicate PV objects from the S3 store. Keep only the PV object whose timestamp is closest to the failover or relocate time.
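
For example, a hypothetical sketch using the AWS CLI, where the bucket name, object prefix, and endpoint are placeholders that depend on your Ramen S3 store profile: list the PV objects for the workload, identify the stale duplicate by its timestamp, and remove it.

$ aws s3 ls s3://<bucket-name>/<vrg-object-prefix>/ --recursive --endpoint-url <s3-endpoint>
$ aws s3 rm s3://<bucket-name>/<path-to-duplicate-pv-object> --endpoint-url <s3-endpoint>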

BZ reference: [2120201]

5.1.4. Unable to apply DRPolicy to the subscription workload with RHACM 2.8

Problem
The Red Hat Advanced Cluster Management (RHACM) 2.8 console has deprecated the PlacementRule type and moved to the Placement type for Subscription applications. As a result, when a user creates a Subscription application using the RHACM 2.8 console, the application is created with a Placement only. Because the OpenShift Data Foundation 4.12 disaster recovery user interface and the Ramen operator do not support Placement for Subscription applications, the disaster recovery user interface is unable to detect these applications and display the details for assigning a policy.
Resolution

Since the RHACM 2.8 console can still detect a PlacementRule created using the command-line interface (CLI), perform the following steps to create the Subscription application in RHACM 2.8 with a PlacementRule:

  1. Create a new project for the application namespace (for example, busybox-application).
  2. Find the label of the managed cluster where you want to deploy the application (for example, drcluster1-jul-6). See the example command after this procedure.
  3. Create a PlacementRule CR in the application namespace with the managed cluster label found in the previous step:

    apiVersion: apps.open-cluster-management.io/v1
    kind: PlacementRule
    metadata:
      labels:
        app: busybox-application
      name: busybox-application-placementrule-1
      namespace: busybox-application
    spec:
      clusterSelector:
        matchLabels:
          name: drcluster1-jul-6
  4. While creating the application using the RHACM console on the Subscription application page, choose this new PlacementRule.
  5. Delete the PlacementRule that is generated in the YAML editor, so that the chosen one is reused.
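
To find the managed cluster label used in step 2, you can run a command such as the following from the hub cluster. This sketch assumes the default name=<cluster-name> label that RHACM applies to managed clusters; your clusters might carry additional labels.

$ oc get managedclusters --show-labels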

BZ reference: [2216190]

5.2. Troubleshooting Regional-DR

5.2.1. RBD mirroring scheduling is getting stopped for some images

Problem

There are a few common causes for RBD mirroring scheduling getting stopped for some images.

After marking the applications for mirroring, if an image is not replicated for some reason, use the toolbox pod and run the following command to see whether scheduling has stopped for a given image:

$ rbd snap ls <poolname/imagename> --all
Resolution
  • Restart the manager daemon on the primary cluster.
  • Disable and immediately re-enable mirroring on the affected images on the primary cluster.
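
The following is a brief sketch of these two actions, assuming a Rook-managed deployment in the openshift-storage namespace and placeholder pool and image names. Restarting the manager here means deleting its pod so that it is rescheduled, and the rbd commands are run from the toolbox pod.

$ oc delete pod -l app=rook-ceph-mgr -n openshift-storage

$ rbd mirror image disable <poolname>/<imagename>
$ rbd mirror image enable <poolname>/<imagename> snapshot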

BZ reference: [2067095 and 2121514]

5.2.2. rbd-mirror daemon health is in warning state

Problem

There can be numerous cases where a WARNING is reported when the mirror service ::get_mirror_service_status calls the Ceph monitor to get the service status for rbd-mirror.

Following a network disconnection, the rbd-mirror daemon health is in the warning state, even though connectivity between the two managed clusters is fine.

Resolution

Run the following command in the toolbox and look for leader: false in the output:

$ rbd mirror pool status --verbose ocs-storagecluster-cephblockpool | grep 'leader:'

If the output shows leader: false, it indicates a daemon startup issue, and the most likely root cause is a problem reliably connecting to the secondary cluster.

Workaround: Move the rbd-mirror pod to a different node by deleting the pod, and verify that it has been rescheduled on another node.
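
The following is a brief sketch of the workaround, assuming that the rbd-mirror pod carries the Rook label app=rook-ceph-rbd-mirror in the openshift-storage namespace: delete the pod, then confirm that the replacement pod has been scheduled on a different node.

$ oc delete pod -l app=rook-ceph-rbd-mirror -n openshift-storage
$ oc get pods -l app=rook-ceph-rbd-mirror -n openshift-storage -o wide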

If the output shows leader: true, or there is no output, contact Red Hat Support.

BZ reference: [2118627]

5.2.3. A statefulset application stuck after failover

Problem

While relocating to a preferred cluster, DRPlacementControl is stuck reporting PROGRESSION as "MovingToSecondary".

Before Kubernetes v1.23, the Kubernetes control plane never cleaned up the PVCs created for StatefulSets; this task was left to the cluster administrator or a software operator managing the StatefulSets. As a result, the PVCs of a StatefulSet were left untouched when its Pods were deleted. This prevents Ramen from relocating an application to its preferred cluster.

Resolution
  1. If the workload uses StatefulSets, and relocation is stuck with PROGRESSION as "MovingToSecondary", then run:

    $ oc get pvc -n <namespace>
  2. For each bound PVC in that namespace that belongs to the StatefulSet, run the following command (see the sketch after this procedure for deleting all of them in one pass):

    $ oc delete pvc <pvcname> -n <namespace>

    Once all PVCs are deleted, the Volume Replication Group (VRG) transitions to secondary, and is then deleted.

  3. Run the following command:

    $ oc get drpc -n <namespace> -o wide

    After a few seconds to a few minutes, the PROGRESSION reports "Completed" and relocation is complete.
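
If the StatefulSet owns many PVCs, a loop such as the following can delete them in one pass. This is a minimal sketch that assumes every PVC in the namespace belongs to the StatefulSet being relocated; do not use it if other workloads share the namespace.

$ for pvc in $(oc get pvc -n <namespace> -o jsonpath='{.items[*].metadata.name}'); do oc delete pvc "$pvc" -n <namespace>; done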

Result
The workload is relocated to the preferred cluster.

BZ reference: [2118270]

5.2.4. Application is not running after failover

Problem
After failing over an application, the workload pods do not reach the Running state, and report errors such as MountVolume.MountDevice failed for volume <PV name> : rpc error: code = Internal desc = fail to check rbd image status: (cannot map image <image description> it is not primary)
Note

Execute these steps on the cluster to which the workload is being failed over.

Resolution
  1. Scale down the RBD mirror daemon deployment to 0 until the application pods can recover from the above error.

    $ oc scale deployment rook-ceph-rbd-mirror-a -n openshift-storage --replicas=0
  2. After recovery, scale the RBD mirror daemon deployment back to 1.

    $ oc scale deployment rook-ceph-rbd-mirror-a -n openshift-storage --replicas=1
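
After scaling the deployment back up, you can verify that the mirror daemon and the workload pods are running again, for example:

$ oc get deployment rook-ceph-rbd-mirror-a -n openshift-storage
$ oc get pods -n <application-namespace>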

BZ reference: [2134936]

5.2.5. volsync-rsync-src pods are in error state

Problem

volsync-rsync-src pods are in an error state because they are unable to connect to volsync-rsync-dst. The VolSync source pod logs might show persistent error messages over an extended duration, similar to the following log snippet.

Run the following command to check the logs.

$ oc logs volsync-rsync-src-<app pvc name>-<suffix>

Example output

VolSync rsync container version: ACM-0.6.0-ce9a280
Syncing data to volsync-rsync-dst-busybox-pvc-9.busybox-workloads-1.svc.clusterset.local:22
Syncronization failed. Retrying in 2 seconds. Retry 1/5.
rsync: connection unexpectedly closed (7 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(226) [sender=3.1.3]
Resolution

You can reconfigure the Maximum Transmission Unit (MTU) size to fix this issue using the following steps:

  1. Annotate the nodes that have the Submariner gateway label.

    $ oc annotate node -l submariner.io/gateway submariner.io/tcp-clamp-mss=1340 --overwrite

    Example output

    node/compute-0 annotated
    node/compute-2 annotated
  2. Delete submariner route agent pods.

    $ oc delete pods -n submariner-operator -l app=submariner-routeagent

    Example output

    pod "submariner-routeagent-4r66z" deleted
    pod "submariner-routeagent-4tn6d" deleted
    pod "submariner-routeagent-9r42l" deleted
    pod "submariner-routeagent-bg5wq" deleted
    pod "submariner-routeagent-gzqdj" deleted
    pod "submariner-routeagent-j77jq" deleted
  3. Check for any errors in the volsync-rsync-src pod.

    $ oc logs volsync-rsync-src-dd-io-pvc-3-nwn8h

    Example output

    VolSync rsync container version: ACM-0.6.0-ce9a280
    Syncing data to volsync-rsync-dst-dd-io-pvc-3.busybox-workloads-8.svc.clusterset.local:22 …
    .d..tp..... ./
    <f+++++++++ 07-12-2022_13-03-04-dd-io-3-5d6b4b84df-v9bhc
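
To confirm that the MTU clamp annotation from step 1 was applied to the gateway nodes, you can run a check such as the following sketch:

$ oc describe nodes -l submariner.io/gateway | grep tcp-clamp-mss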

BZ reference: [2136864]

5.2.6. volsync-rsync-src pod is in error state as it is unable to resolve the destination hostname

Problem

The VolSync source pod is unable to resolve the hostname of the VolSync destination pod. The log of the VolSync pod consistently shows an error message over an extended period of time, similar to the following log snippet.

$ oc logs -n busybox-workloads-3-2 volsync-rsync-src-dd-io-pvc-1-p25rz

Example output

VolSync rsync container version: ACM-0.6.0-ce9a280
Syncing data to volsync-rsync-dst-dd-io-pvc-1.busybox-workloads-3-2.svc.clusterset.local:22 ...
ssh: Could not resolve hostname volsync-rsync-dst-dd-io-pvc-1.busybox-workloads-3-2.svc.clusterset.local: Name or service not known
Resolution

Restart the submariner-lighthouse-agent on both clusters.

$ oc delete pod -l app=submariner-lighthouse-agent -n submariner-operator
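
After the agents restart, verify that the source pod can resolve the destination hostname again by re-checking its logs, for example:

$ oc logs -n <namespace> volsync-rsync-src-<app pvc name>-<suffix>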

5.2.7. Unable to apply DRPolicy to the subscription workload with RHACM 2.8

Problem
The Red Hat Advanced Cluster Management (RHACM) 2.8 console has deprecated the PlacementRule type and moved to the Placement type for Subscription applications. As a result, when a user creates a Subscription application using the RHACM 2.8 console, the application is created with a Placement only. Because the OpenShift Data Foundation 4.12 disaster recovery user interface and the Ramen operator do not support Placement for Subscription applications, the disaster recovery user interface is unable to detect these applications and display the details for assigning a policy.
Resolution

Since the RHACM 2.8 console can still detect a PlacementRule created using the command-line interface (CLI), perform the following steps to create the Subscription application in RHACM 2.8 with a PlacementRule:

  1. Create a new project for the application namespace (for example, busybox-application).
  2. Find the label of the managed cluster where you want to deploy the application (for example, drcluster1-jul-6). See the example command after this procedure.
  3. Create a PlacementRule CR in the application namespace with the managed cluster label found in the previous step:

    apiVersion: apps.open-cluster-management.io/v1
    kind: PlacementRule
    metadata:
      labels:
        app: busybox-application
      name: busybox-application-placementrule-1
      namespace: busybox-application
    spec:
      clusterSelector:
        matchLabels:
          name: drcluster1-jul-6
  4. While creating the application using the RHACM console on the Subscription application page, choose this new PlacementRule.
  5. Delete the PlacementRule that is generated in the YAML editor, so that the chosen one is reused.
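
To find the managed cluster label used in step 2, you can run a command such as the following from the hub cluster. This sketch assumes the default name=<cluster-name> label that RHACM applies to managed clusters; your clusters might carry additional labels.

$ oc get managedclusters --show-labels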

BZ reference: [2216190]