Migration Toolkit for Containers (MTC) restore of a large volume fails with a "signal: killed" error

Solution Verified

Environment

  • Azure Red Hat OpenShift (ARO)
    • 4

Issue

  • When migrating a large PV (> 4 TB) using MTC, the restore phase fails.
  • If the PV size is reduced to 1 TB and the migration is tested again, the restore phase succeeds.

Resolution

  • Use the command below to increase the memory requests and limits of the restic daemonset, then retry the restore. The restic pods restart after the update.

$ oc edit daemonsets restic -n openshift-migration

...
        resources:
          limits:
            cpu: "1"
            memory: 2Gi
          requests:
            cpu: 100m
            memory: 2Gi
...
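
As a non-interactive alternative to `oc edit`, the same change could be sketched with `oc set resources`. This is a minimal sketch, not a verified procedure: the namespace and daemonset name come from this article, while the `4Gi` values and the `build_cmd` helper are hypothetical placeholders that should be sized for the volume being restored.

```shell
#!/bin/sh
# Sketch: raise the restic memory request and limit without an interactive editor.
NAMESPACE="openshift-migration"     # namespace used in this article
DAEMONSET="restic"                  # daemonset used in this article
MEM_REQUEST="${MEM_REQUEST:-4Gi}"   # hypothetical value, adjust as needed
MEM_LIMIT="${MEM_LIMIT:-4Gi}"       # hypothetical value, adjust as needed

# Hypothetical helper: print the command so it can be reviewed before it is run.
build_cmd() {
  echo "oc set resources daemonset/${DAEMONSET} -n ${NAMESPACE} --requests=memory=${MEM_REQUEST} --limits=memory=${MEM_LIMIT}"
}

build_cmd

# To apply the change after reviewing the printed command:
#   eval "$(build_cmd)"
#   oc rollout status daemonset/restic -n openshift-migration
```

Printing the command first keeps the sketch safe to run on a workstation that is not logged in to the cluster.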

Root Cause

  • The error log shows that the restic process in the daemonset pod was terminated with signal: killed.
  • The restic pod likely ran out of memory (OOM) while restoring the large volume.

error='pod volume restore failed: error restoring volume: error running restic restore,'
stderr=: signal: killed' logSource='/remote-source/velero/app/pkg/restore/restore.go:1464'
restore=openshift-migration/migration-xxxxx-stage-xxxxx
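
One way to look for supporting evidence of the out-of-memory hypothesis is to list pods whose containers were last terminated with the reason `OOMKilled`. The sketch below is an assumption-laden illustration: `filter_oomkilled` is a hypothetical helper, and an empty result does not rule out OOM, because the restic restore process may be killed without the container itself being restarted.

```shell
#!/bin/sh
# Hypothetical helper: reads "pod<TAB>reason" lines and prints the pods
# whose last termination reason includes OOMKilled.
filter_oomkilled() {
  awk -F '\t' '$2 ~ /OOMKilled/ { print $1 }'
}

# Query the cluster only when the oc client is available.
if command -v oc >/dev/null 2>&1; then
  oc get pods -n openshift-migration \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' \
    | filter_oomkilled
fi
```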

Diagnostic Steps

  • Check the velero pod logs:
$ oc -n openshift-migration logs -f velero-pod

level=error msg='unable to successfully complete restic restores of pod's volumes'
error='pod volume restore failed: error restoring volume: error running restic restore,'
cmd=restic restore --repo=azure:velero:/velero/restic/xxx-xxxxx-xxxx
--password-file=/tmp/credentials/openshift-migration/velero-restic-credentials-repository-password
--cache-dir=/scratch/.cache/restic xxxx --target=. --delete --skip-unchanged,.
stdout=Skip Unchanged True\nrestoring (Snapshot xxxxx of [/host_pods/xxxxx-xxxx-xxxx-xxxx-
xxxxx/volumes/kubernetes.io~cinder/pvc-xxxxx-xxxx-xxxx-xxxx-xxxxx]
at 2023-03-27 07:11:20.949008456 +0000 UTC by root@velero) to . \{n,.
stderr=: signal: killed' logSource='/remote-source/velero/app/pkg/restore/restore.go:1464'
restore=openshift-migration/migration-xxxxx-stage-xxxxx
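
Because the velero log can be long, it may help to filter it down to the lines that report a killed restic process. A minimal sketch, assuming the same `velero-pod` placeholder name used above; `killed_lines` is a hypothetical helper, not part of MTC or Velero.

```shell
#!/bin/sh
# Hypothetical helper: keep only log lines that report a killed process.
killed_lines() {
  grep 'signal: killed'
}

# Query the cluster only when the oc client is available.
if command -v oc >/dev/null 2>&1; then
  # "velero-pod" is the placeholder pod name used in this article.
  oc -n openshift-migration logs velero-pod | killed_lines
fi
```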

