Chapter 5. Troubleshooting disaster recovery
5.1. Troubleshooting Metro-DR
5.1.1. A statefulset application stuck after failover
- Problem
While relocating to a preferred cluster, DRPlacementControl is stuck reporting PROGRESSION as "MovingToSecondary".
Before Kubernetes v1.23, the Kubernetes control plane never cleaned up the PVCs created for StatefulSets; that task was left to the cluster administrator or to a software operator managing the StatefulSets. As a result, the PVCs of a StatefulSet were left untouched when its Pods were deleted, which prevents Ramen from relocating the application to its preferred cluster.
- Resolution
If the workload uses StatefulSets, and relocation is stuck with PROGRESSION as "MovingToSecondary", then run:
$ oc get pvc -n <namespace>
For each bound PVC in that namespace that belongs to the StatefulSet, run:
$ oc delete pvc <pvcname> -n <namespace>
Once all PVCs are deleted, the Volume Replication Group (VRG) transitions to secondary and is then deleted.
Run the following command:
$ oc get drpc -n <namespace> -o wide
After a few seconds to a few minutes, the PROGRESSION reports "Completed" and relocation is complete.
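If the StatefulSet owns many PVCs, the deletion can be scripted. The following is a minimal sketch, assuming the PVCs follow the default <volumeClaimTemplate>-<statefulset>-<ordinal> naming; the namespace and StatefulSet names here are hypothetical:

```shell
# Hypothetical workload details; substitute your own.
NS=busybox-workloads
STS=busybox

# Print the names of Bound PVCs that belong to the StatefulSet.
# StatefulSet PVCs are named <volumeClaimTemplate>-<statefulset>-<ordinal>.
sts_pvcs() {
  awk -v sts="$STS" '$2 == "Bound" && $1 ~ ("-" sts "-[0-9]+$") { print $1 }'
}

oc get pvc -n "$NS" --no-headers | sts_pvcs |
  xargs -r -n 1 oc delete pvc -n "$NS"

# Then watch the DRPlacementControl until PROGRESSION reports Completed.
oc get drpc -n "$NS" -o wide
```

The awk filter only selects PVCs whose names end in the StatefulSet name plus an ordinal, so unrelated PVCs in the namespace are left untouched.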
- Result
- The workload is relocated to the preferred cluster
BZ reference: [2118270]
5.1.2. DR policies protect all applications in the same namespace
- Problem
-
While only a single application is selected for use by a DR policy, all applications in the same namespace are protected. PVCs that match the DRPlacementControl spec.pvcSelector across multiple workloads, or all PVCs when the selector is missing, can each be managed by replication management multiple times, leading to data corruption or invalid operations based on the individual DRPlacementControl actions.
- Resolution
-
Label the PVCs that belong to a workload uniquely, and use the chosen label as the DRPlacementControl spec.pvcSelector to disambiguate which DRPlacementControl protects and manages which subset of PVCs within a namespace. It is not possible to specify the spec.pvcSelector field for the DRPlacementControl using the user interface, so the DRPlacementControl for such applications must be deleted and re-created using the command-line interface.
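As an illustration, a uniquely labeled workload might be protected with a DRPlacementControl similar to the following fragment. All names and the label appname=busybox-app-1 are hypothetical, the Ramen API group ramendr.openshift.io/v1alpha1 is assumed, and only the selector wiring is shown:

```yaml
# Hypothetical DRPlacementControl fragment showing only the pvcSelector wiring.
# The PVCs of this workload are assumed to carry the label appname=busybox-app-1.
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRPlacementControl
metadata:
  name: busybox-app-1-drpc
  namespace: busybox-application
spec:
  pvcSelector:
    matchLabels:
      appname: busybox-app-1
```

With each workload's PVCs carrying a distinct label value, no PVC can match more than one DRPlacementControl selector in the namespace.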
BZ reference: [2111163]
5.1.3. Application stuck in Relocating state during failback
- Problem
-
This issue might occur after performing failover and failback of an application (all nodes or clusters are up). When performing failback, the application is stuck in the Relocating state with a message of Waiting for PV restore to complete.
- Resolution
- Use an S3 client or equivalent to clean up the duplicate PV objects from the S3 store. Keep only the one that has a timestamp closer to the failover or relocate time.
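With the AWS CLI as the S3 client, the cleanup might look like the following sketch. The bucket name, prefix, and object key are hypothetical; Ramen stores the backed-up PV objects under a path derived from the workload, so inspect the actual layout of your S3 store first:

```shell
# Hypothetical bucket and prefix; adjust to your Ramen S3 store layout.
BUCKET="s3://odrbucket-client"
PREFIX="busybox-workloads/busybox-drpc/v1.PersistentVolume/"

# List the PV objects with their timestamps.
aws s3 ls "${BUCKET}/${PREFIX}"

# Keep the object whose timestamp is closest to the failover or relocate
# time, and remove the stale duplicate (object key is hypothetical).
aws s3 rm "${BUCKET}/${PREFIX}pv-duplicate-object"
```

Any S3-compatible client works the same way; only the duplicate PV objects should be removed.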
BZ reference: [2120201]
5.1.4. Unable to apply DRPolicy to the subscription workload with RHACM 2.8
- Problem
-
The Red Hat Advanced Cluster Management (RHACM) 2.8 console has deprecated the PlacementRule type and moved to the Placement type for Subscription applications. So, when a user creates a Subscription application using the RHACM 2.8 console, the application is created with Placement only. Since the OpenShift Data Foundation 4.12 disaster recovery user interface and the Ramen operator do not support Placement for Subscription applications, the disaster recovery user interface is unable to detect such applications and display the details for assigning a policy.
- Resolution
Since the RHACM 2.8 console can still detect a PlacementRule created using the command-line interface (CLI), use the following steps to create the Subscription application in RHACM 2.8 with a PlacementRule:
-
Create a new project with the application namespace (for example, busybox-application).
-
Find the label of the managed cluster where you want to deploy the application (for example, drcluster1-jul-6).
-
Create a PlacementRule CR in the application namespace with the managed cluster label found in the previous step:
apiVersion: apps.open-cluster-management.io/v1
kind: PlacementRule
metadata:
  labels:
    app: busybox-application
  name: busybox-application-placementrule-1
  namespace: busybox-application
spec:
  clusterSelector:
    matchLabels:
      name: drcluster1-jul-6
-
While creating the application using the RHACM console, on the Subscription application page, choose this new PlacementRule.
-
Delete the PlacementRule from the YAML editor, so that the chosen one is re-used.
BZ reference: [2216190]
5.2. Troubleshooting Regional-DR
5.2.1. RBD mirroring scheduling is getting stopped for some images
- Problem
There are a few common causes for RBD mirroring scheduling getting stopped for some images.
After marking the applications for mirroring, if for some reason an image is not replicated, use the toolbox pod and run the following command to see which image's scheduling has stopped:
$ rbd snap ls <poolname/imagename> --all
- Resolution
- Restart the manager daemon on the primary cluster
- Disable and immediately re-enable mirroring on the affected images on the primary cluster
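From the toolbox pod on the primary cluster, the two resolutions might be carried out as in the following sketch. The image name is hypothetical, and snapshot-based mirroring is assumed:

```shell
# Run inside the rook-ceph toolbox pod on the primary cluster.
POOL=ocs-storagecluster-cephblockpool
IMAGE=csi-vol-0e7d9c5f-example   # hypothetical image with stopped scheduling

# Restart the active manager daemon (a standby takes over).
ceph mgr fail

# Disable, then immediately re-enable, mirroring on the affected image.
rbd mirror image disable "${POOL}/${IMAGE}"
rbd mirror image enable "${POOL}/${IMAGE}" snapshot
```

Repeat the disable/enable pair for each image whose scheduling has stopped.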
5.2.2. rbd-mirror daemon health is in warning state
- Problem
There appear to be numerous cases where WARNING gets reported when mirror service ::get_mirror_service_status calls the Ceph monitor to get the service status for rbd-mirror.
Following a network disconnection, the rbd-mirror daemon health is in the warning state while the connectivity between both the managed clusters is fine.
- Resolution
Run the following command in the toolbox and look for leader: false in the output:
$ rbd mirror pool status --verbose ocs-storagecluster-cephblockpool | grep 'leader:'
If you see the following in the output:
leader: false
it indicates that there is a daemon startup issue, and the most likely root cause is a problem reliably connecting to the secondary cluster.
Workaround: Move the rbd-mirror pod to a different node by deleting the pod, and verify that it has been rescheduled on another node.
If the output shows leader: true, or there is no output, contact Red Hat Support.
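Taken together, the check and the workaround might look like the following sketch. The label app=rook-ceph-rbd-mirror is the one Rook applies to the mirror daemon pod by default; verify it in your deployment:

```shell
# From the toolbox pod: check which daemon instance is the leader.
leader_state() {
  grep 'leader:'
}
rbd mirror pool status --verbose ocs-storagecluster-cephblockpool | leader_state

# If the state is "leader: false", delete the rbd-mirror pod so the
# scheduler places it on another node, then confirm the new node.
oc delete pod -n openshift-storage -l app=rook-ceph-rbd-mirror
oc get pod -n openshift-storage -l app=rook-ceph-rbd-mirror -o wide
```

The final command shows the node column, which should now name a different node.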
BZ reference: [2118627]
5.2.3. A statefulset application stuck after failover
- Problem
While relocating to a preferred cluster, DRPlacementControl is stuck reporting PROGRESSION as "MovingToSecondary".
Before Kubernetes v1.23, the Kubernetes control plane never cleaned up the PVCs created for StatefulSets; that task was left to the cluster administrator or to a software operator managing the StatefulSets. As a result, the PVCs of a StatefulSet were left untouched when its Pods were deleted, which prevents Ramen from relocating the application to its preferred cluster.
- Resolution
If the workload uses StatefulSets, and relocation is stuck with PROGRESSION as "MovingToSecondary", then run:
$ oc get pvc -n <namespace>
For each bound PVC in that namespace that belongs to the StatefulSet, run:
$ oc delete pvc <pvcname> -n <namespace>
Once all PVCs are deleted, the Volume Replication Group (VRG) transitions to secondary and is then deleted.
Run the following command:
$ oc get drpc -n <namespace> -o wide
After a few seconds to a few minutes, the PROGRESSION reports "Completed" and relocation is complete.
- Result
- The workload is relocated to the preferred cluster
BZ reference: [2118270]
5.2.4. Application is not running after failover
- Problem
-
After failing over an application, the workload pods do not reach the Running state, with errors such as:
MountVolume.MountDevice failed for volume <PV name> : rpc error: code = Internal desc = fail to check rbd image status: (cannot map image <image description> it is not primary)
Execute the following resolution steps on the cluster that the workload is being failed over to.
- Resolution
Scale down the RBD mirror daemon deployment to 0 until the application pods can recover from the above error:
$ oc scale deployment rook-ceph-rbd-mirror-a -n openshift-storage --replicas=0
After recovery, scale the RBD mirror daemon deployment back to 1:
$ oc scale deployment rook-ceph-rbd-mirror-a -n openshift-storage --replicas=1
BZ reference: [2134936]
5.2.5. volsync-rsync-src pods are in error state
- Problem
volsync-rsync-src pods are in an error state because they are unable to connect to volsync-rsync-dst. The VolSync source pod logs might exhibit persistent error messages over an extended duration, similar to the following log snippet.
Run the following command to check the logs.
$ oc logs volsync-rsync-src-<app pvc name>-<suffix>
Example output
VolSync rsync container version: ACM-0.6.0-ce9a280
Syncing data to volsync-rsync-dst-busybox-pvc-9.busybox-workloads-1.svc.clusterset.local:22
Syncronization failed. Retrying in 2 seconds. Retry 1/5.
rsync: connection unexpectedly closed (7 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(226) [sender=3.1.3]
- Resolution
You can reconfigure the Maximum Transmission Unit (MTU) size to fix this issue using the following steps:
Annotate the nodes that have the submariner gateway label:
$ oc annotate node -l submariner.io/gateway submariner.io/tcp-clamp-mss=1340 --overwrite
Example output
node/compute-0 annotated
node/compute-2 annotated
Delete the Submariner route agent pods:
$ oc delete pods -n submariner-operator -l app=submariner-routeagent
Example output
pod "submariner-routeagent-4r66z" deleted
pod "submariner-routeagent-4tn6d" deleted
pod "submariner-routeagent-9r42l" deleted
pod "submariner-routeagent-bg5wq" deleted
pod "submariner-routeagent-gzqdj" deleted
pod "submariner-routeagent-j77jq" deleted
Check for any errors in the volsync-rsync-src pod:
$ oc logs volsync-rsync-src-dd-io-pvc-3-nwn8h
Example output
VolSync rsync container version: ACM-0.6.0-ce9a280
Syncing data to volsync-rsync-dst-dd-io-pvc-3.busybox-workloads-8.svc.clusterset.local:22
…
.d..tp..... ./
<f+++++++++ 07-12-2022_13-03-04-dd-io-3-5d6b4b84df-v9bhc
BZ reference: [2136864]
5.2.6. volsync-rsync-src pod is in error state as it is unable to resolve the destination hostname
- Problem
The VolSync source pod is unable to resolve the hostname of the VolSync destination pod. The log of the VolSync pod consistently shows an error message over an extended period of time, similar to the following log snippet.
$ oc logs -n busybox-workloads-3-2 volsync-rsync-src-dd-io-pvc-1-p25rz
Example output
VolSync rsync container version: ACM-0.6.0-ce9a280
Syncing data to volsync-rsync-dst-dd-io-pvc-1.busybox-workloads-3-2.svc.clusterset.local:22
...
ssh: Could not resolve hostname volsync-rsync-dst-dd-io-pvc-1.busybox-workloads-3-2.svc.clusterset.local: Name or service not known
- Resolution
Restart the submariner-lighthouse-agent on both clusters:
$ oc delete pod -l app=submariner-lighthouse-agent -n submariner-operator
5.2.7. Unable to apply DRPolicy to the subscription workload with RHACM 2.8
- Problem
-
The Red Hat Advanced Cluster Management (RHACM) 2.8 console has deprecated the PlacementRule type and moved to the Placement type for Subscription applications. So, when a user creates a Subscription application using the RHACM 2.8 console, the application is created with Placement only. Since the OpenShift Data Foundation 4.12 disaster recovery user interface and the Ramen operator do not support Placement for Subscription applications, the disaster recovery user interface is unable to detect such applications and display the details for assigning a policy.
- Resolution
Since the RHACM 2.8 console can still detect a PlacementRule created using the command-line interface (CLI), use the following steps to create the Subscription application in RHACM 2.8 with a PlacementRule:
-
Create a new project with the application namespace (for example, busybox-application).
-
Find the label of the managed cluster where you want to deploy the application (for example, drcluster1-jul-6).
-
Create a PlacementRule CR in the application namespace with the managed cluster label found in the previous step:
apiVersion: apps.open-cluster-management.io/v1
kind: PlacementRule
metadata:
  labels:
    app: busybox-application
  name: busybox-application-placementrule-1
  namespace: busybox-application
spec:
  clusterSelector:
    matchLabels:
      name: drcluster1-jul-6
-
While creating the application using the RHACM console, on the Subscription application page, choose this new PlacementRule.
-
Delete the PlacementRule from the YAML editor, so that the chosen one is re-used.
BZ reference: [2216190]