Backing up RHEV VMs on NetApp SAN


Hi there,

 

We are about to deploy a RHEV environment with three hypervisors on our NetApp SAN via Fibre Channel.

 

One thing I have not been able to find any information about in the official docs or otherwise, is how to use LUN snapshots for backing up and restoring the virtual machines.

 

As I see it, you would need to put similar VMs in a single-LUN storage domain, make sure they are all in a consistent state at the same time, and then create a snapshot of the underlying LUN?

 

How would you recover just one of these VMs later?

 

Is there a more clever way of doing this?

 

 

/Martin

Responses

You're not clearly saying what your virtualization solution is (VMware, HyperV, KVM, Xen, etc.) or what your available backup technologies are (NetBackup, Veeam, Legato, SnapManager, etc.). If you're using NetBackup for your backup system, it supports several hypervisors and methods for providing good, consistent-state save-sets and granular restore capabilities. If you're using VMware for your hypervisor, you've got the whole vSphere API available to you to create consistent-state save-sets.

In general, unless your backup system is designed to support your particular virtualization environment, most of your backups are going to be "crash-consistent". This tends to be sub-ideal, but can be adequate.
 

 

Hi Thomas,

 

The virtualization solution is RHEV, i.e. the KVM-based solution that Red Hat is selling.

 

We were hoping to use NetApp SnapDrive to take consistent snapshots of applications and operating systems, but this is made somewhat difficult because RHEV shares an underlying LUN among many VMs (by creating an LVM volume group on the LUN and splitting it into logical volumes, which are then assigned to VMs as disks).
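For illustration, this is roughly how the layout looks from a hypervisor host (the names are examples; the storage domain VG is named after the storage domain's UUID):

pvs                          # the LUN's multipath device is the single PV behind the domain VG
vgs                          # the storage domain VG, named after the storage domain UUID
lvs <storage-domain-uuid>    # one LV per virtual disk image, plus RHEV's internal LVs (metadata, ids, inbox, outbox, leases, master)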

 

If we were using "vanilla" KVM, we could assign a LUN directly to a VM, but apparently this is not an option with RHEV.

Unfortunately, with applications that have dynamic data (e.g., a database), your backup solution needs to consist of more than just snapshotting the filesystem. You need to have a backup agent (such as SnapManager for Oracle) or a scripted process that quiesces/pauses the application, takes a snapshot, resumes the application, and then backs up the snapshot. Otherwise, all you have is a snapshot of active data. This typically results in a "crash-consistent" copy of the data. In the case of a database, when you go to recover from such a backup, you'll have to do a data restore and then recover to the database's last consistency point before you'll be able to resume operations.
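As a rough sketch of that pattern (not a supported tool; 'quiesce_app'/'resume_app' stand in for whatever mechanism your application provides, and the filer name 'netapp1' and volume 'rhev_vol1' are hypothetical):

#!/bin/sh
# Sketch: quiesce -> snapshot -> resume -> back up the snapshot contents.
set -e
quiesce_app                                                 # put the application into a consistent state
sync                                                        # flush dirty filesystem buffers
ssh netapp1 "snap create rhev_vol1 backup_$(date +%Y%m%d)"  # ONTAP 7-mode style snapshot of the volume holding the LUN
resume_app                                                  # let the application continue
# ...then back up the snapshot contents, not the live data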

 

I mentioned SnapManager since you made no mention of other backup technologies available to you. SnapManager's generally the way NetApp tries to get their storage customers to back things up. Dunno whether/how well it works in a KVM context, however (only ever used it on physical servers that were presented storage directly from the NetApp).

Hi Martin,

 

NetApp, in cooperation with Red Hat, has actually released a whitepaper titled "Designing a Shared Storage Infrastructure with Red Hat Enterprise Virtualization and NetApp Storage" that outlines a few suggested and supported snapshot use-cases for backups.  It can be found here:  http://www.netapp.com/us/library/technical-reports/tr-3914.html 

 

The information described in the whitepaper was also presented by NetApp at Red Hat's 2011 Summit, and a copy of the slide deck used in that presentation can be found here:  http://www.redhat.com/summit/2011/presentations/summit/whats_next/friday/benedict_f_945_enabling_rhev.pdf

 

As you'll find, there's no supported way to back up a single VM at the storage layer in RHEV.  In the future you should be able to assign specific storage to a VM, similar to how you can with KVM/libvirt now. 

 

Hi guys,

 

Thanks a lot for your comments and links.

 

So, best practice is to snapshot the entire RHEV Storage Domain at once, after quiescing applications and flushing file system buffers.

 

What is not entirely clear to me is how to recover a single VM from this snapshot. Do you import the snapshot as a new Storage Domain in RHEV and then move the VMs to the production Storage Domain?

Hi Martin, we're not intentionally leaving you hanging on your question--I'm just trying to locate the right resource to give you an answer...

 

Henry

 

Community Manager

Online User Groups, Red Hat Customer Portal

I've been looking more into this, and right now my best guess would be something like this:

 

1. Quiesce apps/VMs

2. Take snapshot of storage domain (which is a LVM volume group)

3. Map snapshot to hypervisor host

4. Generate new UUID for the snapshot-VG and mount it on the hypervisor host so it will not interfere with the production-VG

5a. To restore a full VM 1-to-1: use dd or similar to clone the VM's logical volume from the snapshot back onto the production LV (see the sketch after this list)

5b. To restore individual files within a VM: mount the snapshot LV on the hypervisor host and copy the files you need out of it
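For 5a I'm imagining something along these lines, with the VM shut down first (the VG and LV names are placeholders; in RHEV the image LV should have the same UUID name in both the snapshot and the production VG):

lvchange -ay <snapshot-vg>/<image-uuid>       # activate the LV copy in the re-UUID'ed snapshot VG
lvchange -ay <production-vg>/<image-uuid>     # activate the production LV while the VM is down
dd if=/dev/<snapshot-vg>/<image-uuid> of=/dev/<production-vg>/<image-uuid> bs=1M
lvchange -an <snapshot-vg>/<image-uuid>       # deactivate the snapshot copy again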

 

How does that sound? I'm hoping RHEV will not mind that we mess with the LVs behind its back, as long as the VM we are restoring is shut down?

 

Best regards,

Martin

Hi Martin,

I *think* your plan sounds sane, but I'm not a RHEV expert, so I'm trying to get someone from that group to give it a read over and make sure there are no issues there.

 

From a storage perspective what you are trying to do is fine.  Also, in step #4 you can use 'vgimportclone'.  This will take a volume group whose name duplicates that of an active VG and change the metadata so it no longer conflicts.  This was only added in later updates of RHEL 5 (starting with update 5, if I remember correctly).

 

Hopefully we should have an answer for you soon on whether RHEV would have any issues with what you're doing.

 

Regards,

John Ruemker, RHCA

Red Hat Technical Account Manager

Online User Groups Moderator

A little follow-up to this story:

 

We implemented the SAN-based snapshots of the storage domain volume groups as described earlier, and we can now restore both whole disks/LVs and single files. Of course this is still "only" crash-consistent, as discussed. Each storage domain consists of just one LUN to simplify things.

 

It gets a little hairy when you are restoring single files from a Linux guest with its own LVM, because the hypervisor's LVM gets in the way if you have clashing VG names (e.g. both my hypervisors and guests have a base VG called 'vg_local').

The best solution is to give the hypervisors unique VG names, like <hostname>_vg_local. Alternatively you can rename the VG on the snapshot.

 

Mounting Windows NTFS filesystems also works fine, but you need the NTFS-3G driver from EPEL.
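For reference, that part is basically the following (assuming the EPEL package is installed and kpartx has already created the partition device; the device name here is just an example):

yum install ntfs-3g                           # from EPEL
mkdir -p /mnt/recover-win
mount -t ntfs-3g -o ro /dev/mapper/<windows-image-uuid>p1 /mnt/recover-win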

 

Hope this helps someone. I can write a more thorough guide if there's any interest in it.

 

/Martin

Thanks for sharing, Martin - information on use cases and best practices is always appreciated. 

 

I'd like to hear more about the restore process; things could definitely get hairy there, because RHEV 2.x relies on raw data, and after a restore the extent offsets might move about.

 

The current supported solution is to use Export Domains, but this is an offline process. There's also the option of just backing up the VMs using guest-based backup agents. 

1. Map the snapshot LUN to one of your hypervisor hosts
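On the host side, after mapping the LUN on the filer, something like this should make it visible (the exact host numbers depend on your HBAs, so treat this as a sketch):

for h in /sys/class/scsi_host/host*/scan; do echo "- - -" > "$h"; done   # rescan the FC HBAs
multipath -r                                                             # rebuild the multipath maps
multipath -ll                                                            # check that the new LUN/WWID shows up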

 

2. Rename the VG on the snapshot and give it a new UUID, so it doesn't clash with the original VG:

 

vgimportclone -n snap_test /dev/<snap LUN>

3. Find the LV you want to restore from the snapshot VG:

 

[root@jytuhyp2 lvm]# lvs /dev/snap_test
  LV                                   VG        Attr   LSize   Origin Snap%  Move Log Copy%  Convert
  42109cad-06e0-49cd-9ce7-8361c636e4ef snap_test -wi-a- 400.00G
  5df6418a-035c-49ee-863e-eafea8667f34 snap_test -wi-ao  20.00G
  6053fcf5-b4fc-44cd-b4b5-8e798116e105 snap_test -wi-a- 700.00G
  79219f65-9833-4dc4-841a-1994330fcd6a snap_test -wi-a-  20.00G
  ids                                  snap_test -wi-a- 128.00M
  inbox                                snap_test -wi-a- 128.00M
  leases                               snap_test -wi-a-   2.00G
  master                               snap_test -wi-a-   1.00G
  metadata                             snap_test -wi-a- 512.00M
  outbox                               snap_test -wi-a- 128.00M

[root@jytuhyp2 lvm-tmp]# fdisk -l /dev/snap_test/5df6418a-035c-49ee-863e-eafea8667f34

Disk /dev/snap_test/5df6418a-035c-49ee-863e-eafea8667f34: 21.4 GB, 21474836480 bytes
255 heads, 63 sectors/track, 2610 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

                                               Device Boot      Start         End      Blocks   Id  System
/dev/snap_test/5df6418a-035c-49ee-863e-eafea8667f34p1   *           1          26      204800   83  Linux
Partition 1 does not end on cylinder boundary.
/dev/snap_test/5df6418a-035c-49ee-863e-eafea8667f34p2              26        2611    20766156   8e  Linux LVM

 

(The "Partition 1 does not ..." message occurs because the partitions are aligned to the NetApp 4k block layout)

 

4. Create device nodes for the partitions contained within the LV:

 

[root@jytuhyp2 lvm]# kpartx -a /dev/snap_test/5df6418a-035c-49ee-863e-eafea8667f34
5df6418a-035c-49ee-863e-eafea8667f34p1 : 0 409600 /dev/snap_test/5df6418a-035c-49ee-863e-eafea8667f34 128
5df6418a-035c-49ee-863e-eafea8667f34p2 : 0 41532312 /dev/snap_test/5df6418a-035c-49ee-863e-eafea8667f34 410728

 

5. Create a temporary LVM configuration containing only the VM PV(s):

 

[root@jytuhyp2 lvm]# mkdir /tmp/lvm-tmp-vm
[root@jytuhyp2 lvm]# cp -rp /tmp/lvm-tmp/* /tmp/lvm-tmp-vm/
[root@jytuhyp2 lvm]# export LVM_SYSTEM_DIR=/tmp/lvm-tmp-vm/

 

Partition p1 contains the /boot filesystem, p2 contains the VM PV - add it to the filter in /tmp/lvm-tmp-vm/lvm.conf:

 

filter = [ "a|^/dev/mapper/5df6418a-035c-49ee-863e-eafea8667f34p2$|", "r/.*/" ]
[root@jytuhyp2 lvm-tmp-vm]# lvs
  LV      VG       Attr   LSize Origin Snap%  Move Log Copy%  Convert
  lv_home vg_local -wi--- 6.03G
  lv_root vg_local -wi--- 6.78G
  lv_swap vg_local -wi--- 1.97G
  lv_var  vg_local -wi--- 5.00G

 

6. Rename the VM "vg_local" VG to "vm_vg_local", so it doesn't conflict with the local hypervisor's "vg_local" VG:

 

[root@jytuhyp2 lvm-tmp-vm]# vgdisplay vg_local
  --- Volume group ---
  VG Name               vg_local
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  5
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                4
  Open LV               0
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               19.78 GB
  PE Size               32.00 MB
  Total PE              633
  Alloc PE / Size       633 / 19.78 GB
  Free  PE / Size       0 / 0
  VG UUID               KeqG6w-870Q-B3yg-i19R-jID3-eExv-VxSVhZ

[root@jytuhyp2 lvm-tmp-vm]# vgrename KeqG6w-870Q-B3yg-i19R-jID3-eExv-VxSVhZ vm_vg_local
  Volume group "vg_local" successfully renamed to "vm_vg_local"

7. Activate the LV to restore files from:

[root@jytuhyp2 lvm-tmp-vm]# lvchange -ay vm_vg_local/lv_var
[root@jytuhyp2 lvm-tmp-vm]# ls -l /dev/vm_vg_local/
total 0
lrwxrwxrwx 1 root root 30 Aug  7 16:18 lv_var -> /dev/mapper/vm_vg_local-lv_var

 

8. Mount the LV and get the files you need:

 

[root@jytuhyp2 lvm-tmp-vm]# mount -o ro /dev/mapper/vm_vg_local-lv_var /mnt/recover-test/var/
[root@jytuhyp2 lvm-tmp-vm]# ls -l /mnt/recover-test/var/
total 156
drwxr-xr-x  2 root root  4096 Jul  8 17:06 account
drwxr-xr-x  7 root root  4096 Jul  8 17:06 cache
drwxr-xr-x  3 root root  4096 Jul  8 17:06 db
drwxr-xr-x  3 root root  4096 Jul  8 17:06 empty
drwxr-xr-x  2 root root  4096 Oct  1  2009 games
drwxr-xr-x 19 root root  4096 Jul  8 17:08 lib
drwxr-xr-x  2 root root  4096 Oct  1  2009 local
drwxrwxr-x  6 root lock  4096 Aug  4 04:02 lock
drwxr-xr-x  9 root root  4096 Aug  4 04:02 log
drwx------  2 root root 16384 Jul  8 15:05 lost+found
lrwxrwxrwx  1 root root    10 Jul  8 17:05 mail -> spool/mail
drwxr-xr-x  2 root root  4096 Oct  1  2009 nis
drwxr-xr-x  2 root root  4096 Oct  1  2009 opt
drwxr-xr-x  2 root root  4096 Oct  1  2009 preserve
drwxr-xr-x  2 root root  4096 Aug 20  2010 racoon
drwxr-xr-x 16 root root  4096 Aug  5 01:28 run
drwxr-xr-x 11 root root  4096 Jul  8 17:06 spool
drwxrwxrwt  2 root root  4096 Jul  8 17:06 tmp
drwxr-xr-x  3 root root  4096 Jul  8 17:06 yp

[root@jytuhyp2 lvm-tmp-vm]# tail /mnt/recover-test/var/log/messages
Jul 31 04:02:02 jytlpmp1 syslogd 1.4.1: restart.
Aug  4 17:43:38 jytlpmp1 kernel: usb 1-2: reset full speed USB device using uhci_hcd and address 2

 

9. Clean up :)
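Cleanup is roughly the reverse of the steps above, something like:

umount /mnt/recover-test/var                                    # unmount the recovered filesystem
lvchange -an vm_vg_local/lv_var                                 # deactivate the guest LV (still under the temporary LVM config)
unset LVM_SYSTEM_DIR                                            # back to the normal LVM config
kpartx -d /dev/snap_test/5df6418a-035c-49ee-863e-eafea8667f34   # remove the partition mappings
vgchange -an snap_test                                          # deactivate the snapshot VG (under whichever LVM config was used to activate it, if the default filters it out)
rm -rf /tmp/lvm-tmp-vm                                          # remove the temporary LVM config
# ...and finally unmap/destroy the snapshot LUN on the filer and rescan the host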

We still haven't got an answer on how to do a complete restore of a virtual machine, including the configuration that e.g. defines the paths to the virtual disk devices, etc.

 

We are currently in a position where we really need to do a full VM restore.

 

I can have a NetApp snapshot of the Storage Domain presented, which should give access to the files and configurations - but can we just replace the files on the production Storage Domain with the ones from the snapshot?

 

/Yngve
