Cannot remove Snapshot. Low disk space on target Storage Domain

I am in shock.

I have a storage domain with 500 GB total and 121 GB available. And I have a thin-provisioned virtual disk with 200 GB used and allocated, plus a roughly 40 GB snapshot I took last week before some upgrades. The upgrade is done, and now it's time to remove the snapshot. This fails with:

Cannot remove Snapshot. Low disk space on target Storage Domain

Looking at this article:
https://access.redhat.com/solutions/897013

I apparently need (original disk size + snapshot size) free space in the LUN to get rid of a snapshot????

The article says, "So in actual no data is removed..." This cannot be true. The snapshot contains old copies of blocks replaced by new copies of blocks. Removing a snapshot gets rid of old copies of blocks. So by definition, lots of data is removed.

If I'm reading the article correctly, the real issue is that snapshot removal is wildly, insanely inefficient, because it makes a whole new copy of all the blocks. To delete this snapshot, I will need to provision a whole new LUN with 500+ GB of storage, copy my virtual disks to it, then remove the snapshot. And I may need to do all this with my VM - my company email server - down. And this only works if I have enough SAN space to handle something like 2.5x what I'm actually using. If I'm low on SAN space, I'm stuck with this snapshot forever. I can never get rid of it because I don't have anywhere to put the new merged virtual disk.
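Just to put numbers on it, here's the arithmetic as I read the article (my math, not an official formula):

    free space needed ≈ original disk size + snapshot size
                      = 200 GB + 40 GB = 240 GB
    free space I have = 121 GB  ->  fails with the error above

So I'm roughly 120 GB short just to delete a 40 GB snapshot.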

And it gets worse. I can find no documentation about this behavior in any of the standard docs - only in articles I stumble across after I run into the problem.

I'm not happy. C'mon guys, you gotta do better than that.

  • Greg Scott

Responses

Hi Greg. Sorry for the frustration on this, both with the product and the knowledge/documentation. I'll pass this feedback along and see what we can do to address this. We'll follow up here soon.

Thanks David -

I'm calmer now. And after thinking about it, I even get the idea. With VMware snapshots, when you delete the snapshot, you're still left with a bunch of physical .vmdk files that VMware presents as one virtual disk. You have to go to storage and copy the whole thing to a different directory to merge it all back together. RHEV does the merging automatically when deleting a snapshot, so you're left with one clean disk image. The tradeoff is that you need enough free space in the storage domain to make this happen. And it takes a long time while the VM is down. Fair enough - but this behavior should be officially documented.

And ideally, users should have some choices.

When I create a snapshot, changed blocks go into the snapshot file, right? And there must be some kind of index matching up blocks in the snapshot file with blocks in the original disk image? So when I get rid of a snapshot, why not give me a choice to just copy the updated snapshot blocks over the top of the corresponding original blocks? If there's a safety or other tradeoff, tell me about it and I can make a decision as an admin.
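From a little poking around, something like this seems to exist one layer down for qcow2 images. This is only my illustration - the file name is made up, and I have no idea whether RHEV can safely do it on its own storage chains:

    # qemu-img can merge an overlay's changed blocks down into its
    # backing file in place, instead of copying everything to a new image:
    qemu-img commit /rhev/data-center/.../my-disk_snapshot.qcow2

That's essentially the "copy the updated blocks over the originals" behavior I'm asking for, exposed as an admin choice.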

Dear Greg,

I totally agree with you on the need to reflect the storage needs of the merge operation.
Rather than just documenting this, as you suggested, I'd go with implementing a warning dialog
that would display pre-calculated storage requirements to users, so that they could review them and
act proactively before the actual run of the merge operation.

I filed the following bug report to track this: https://bugzilla.redhat.com/show_bug.cgi?id=1117231

Thank you very much for the valuable feedback. Do not hesitate to contact Red Hat support
if you wish to keep track of the bug's progress via a support case.

Kind regards

Tomas Dosek
Senior Software Maintenance Engineer - Virtualization

Whoops, I was logged in as my customer in my comment above Tom's.

For Tom, great thoughts. But by the time I'm ready to delete a snapshot, a warning dialog telling me that deleting it will require a bunch of space is kind of like closing the barn door after the horses are already gone. That warning should come before I ever create the snapshot, so I go into it with my eyes open. And the docs should have a writeup on how all this works so everyone knows what they're getting into before starting down the snapshot road.

  • Greg (And the earlier post from Mark Mata was also really me.)

No problem Greg, I've modified the post so it's assigned to you.

Hi,

Can you share the vdsm.log for analysis on this?

Hi Dhinakaran - by now it's more than a month later. The snapshot behavior I complained about is apparently well known - what good would having a vdsm.log do?

Here's a summary of the rest of the story. It's embarrassing but maybe there's a lesson in here for anyone else doing similar projects. I almost created a disaster. Almost. The story has a good ending but there were some gut-check moments.

In addition to the LUNs for my application virtual machines inside the RHEV environment, I also set up a LUN for my RHEV-M virtual machine. RHEV-M at this site is a RHEL KVM virtual machine on a bare metal host, but its virtual disk is in the SAN. HA (sort of) on a budget.

When I set up this SAN 3 years ago, I should have restricted which hosts could see which LUNs. But I didn't. I opened everything up to the whole subnet. After all, nobody but me would ever go inside my iSCSI subnet and it was just easier this way. Bad decision.

Fast forward to last month, when I needed free space in my SAN LUN to get rid of a snapshot. I noticed that the bare metal host running my RHEV-M VM also had iSCSI connections, and LVM PVs and VGs, to the iSCSI LUN holding my customer database virtual machine. The VM that ran the whole company. The one with the snapshot I wanted to get rid of.

On that bare metal host, I did vgremove and pvremove and used iscsiadm to log it out of those iSCSI LUNs. A few seconds after that, my database VM paused because it couldn't find its storage any more.

Yup, vgremove and pvremove don't just remove a connection to a volume group and PV. They overwrite the LVM metadata on the LUN, destroying them. The commands really should be named vgdestroy and pvdestroy, because that's what they do.
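For anyone who finds this thread later, the non-destructive way to detach that host would have been something like this (the VG name, target, and portal here are made up):

    # Deactivate the volume group on this host only; the LVM metadata
    # on the LUN is left untouched:
    vgchange -an database_vg

    # Then log this host out of the iSCSI session:
    iscsiadm -m node -T iqn.2014-06.com.example:db-lun -p 192.168.100.10 -u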

And I should have known better. With a couple commands from one bare metal host, I destroyed the storage my RHEV-H hosts used for the key VM that ran the whole company.

When you destroy a critical piece of your customer's IT infrastructure, you will experience symptoms including, but not limited to, dizziness, nausea, excessive sweating, nightmares, headaches, and an overwhelming urge to scream and hide under the bed.

The fact that I can joke about it now suggests everything turned out well. And it did. Although the stress symptoms are real.

I had a good backup and it was a weekend. So I provisioned a whole new iSCSI LUN, created a new VM, and restored it all from my backup. Nobody lost any data or any work time. I dodged a bullet.

And I can promise you, all my iSCSI LUNs are now protected so that only the proper hosts can touch them.
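In case it helps anyone else: besides initiator ACLs on the SAN itself, I'd also suggest a host-side belt-and-suspenders layer. Something like this on the RHEV-M bare metal host (the WWID pattern is made up) keeps LVM from ever seeing the RHEV data LUNs, so a stray vgremove can't touch them:

    # /etc/lvm/lvm.conf - reject the RHEV data LUNs by device path,
    # accept everything else:
    devices {
        global_filter = [ "r|/dev/disk/by-id/scsi-36001405abcdef.*|", "a|.*|" ]
    }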

- Greg

I am now running into this issue.

I have a storage domain of 1 TB, with a VM disk of 500 GB with one snapshot. I guess this means that even if I manage to free the domain of everything else, I still won't be able to remove the snapshot without adding extra storage.
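If the (original disk size + snapshot size) rule from the article above holds, my numbers look something like this (rough, obviously):

    free space needed ≈ 500 GB + snapshot size
    best case free    = 1 TB - 500 GB - snapshot size, which is under 500 GB

So even a completely emptied domain comes up short, as far as I can tell.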

The strange thing is that this VM disk used to have two snapshots, one of which cannot be removed due to insufficient space, but the other could in fact be removed without error, even though it took 13.5 hours. How is that even possible, considering that there was only about 100 GB available?

Cheers,
Martijn.