RHEV 3.0 to 3.1 upgrade experience


Just want to share my experience upgrading from v3.0 to v3.1.

Lab upgrade

Our RHEV lab consists of 3 hypervisors, Fibre Channel storage, no VLANs, no bonds, and 70 active VMs. The rhevm machine runs as a KVM virtual machine under RHEL6.

  1. First we upgraded all hypervisors to v6.3-20121212, as we were uncertain about the hypervisor requirement for the upgrade.
  2. To have a good rollback option, we shut down the rhevm virtual machine and copied its disk image to another machine.
  3. Then did a full yum update of the rhevm server, rebooted, added the jbappplatform-6-x86_64-server-6 and rhel-x86_64-server-6-rhevm-3.1 repos, ran "yum update rhevm-setup", and then ran "rhevm-upgrade" to do the upgrade.
  4. This failed with "Error: The current system contains a block (iSCSI/Fibre Channel) Export Storage Domain which is no longer supported." It was an inactive export domain that we didn't need anymore, so I deleted it and started over.
  5. Tried rhevm-upgrade again, and it failed again, this time much worse. I had not removed the old rhel-x86_64-server-6-rhevm-3 and jbappplatform-5-x86_64-server-6-rpm repositories, so the wrong packages had been installed. I wasn't able to re-run the upgrade at this point, and was told to roll back to the old package versions. This was non-trivial, so I ended up going back to my backup rhevm disk image and starting from scratch.
  6. Repeated steps 1, 2 and 3, this time removing the rhevm 3.0 and jboss5 repos, removed the no longer needed storage domain again, and started a new rhevm-upgrade.
  7. Success! 
  8. Then ran "yum install rhevm-dwh ; rhevm-dwh-setup" to upgrade the history service. This failed several times because it kept running out of space while writing a huge database dump to /var/lib/ovirt-engine/backups/. The failures were harmless, though, and the setup could be restarted once I had added more disk space to /var/.
  9. Then upgraded the reports package with "yum install rhevm-reports ; rhevm-reports-setup".
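For anyone repeating these steps, the repository mix-up from step 5 is easy to check for before running rhevm-upgrade. Below is a minimal Python sketch of such a check. The two stale channel names are the ones mentioned above; the function name and the sample list of enabled repos are made up for illustration, and how you collect the enabled repo ids from your own yum setup is left open.

```python
# Old channels named in the post; they must be gone before rhevm-upgrade.
STALE_REPOS = frozenset([
    "rhel-x86_64-server-6-rhevm-3",
    "jbappplatform-5-x86_64-server-6-rpm",
])


def find_stale_repos(enabled_repo_ids):
    """Return the enabled repo ids that belong to the old 3.0 setup, sorted."""
    # Exact match on purpose: "rhel-x86_64-server-6-rhevm-3" is a prefix of
    # the wanted "rhel-x86_64-server-6-rhevm-3.1" channel.
    return sorted(set(enabled_repo_ids) & STALE_REPOS)


if __name__ == "__main__":
    # Hypothetical list of enabled repo ids, for illustration only.
    enabled = [
        "rhel-x86_64-server-6",
        "rhel-x86_64-server-6-rhevm-3",
        "rhel-x86_64-server-6-rhevm-3.1",
        "jbappplatform-6-x86_64-server-6",
    ]
    stale = find_stale_repos(enabled)
    if stale:
        print("Disable before running rhevm-upgrade: " + ", ".join(stale))
```

The comparison is an exact match rather than a prefix or substring test, because the old rhevm-3 channel name is a prefix of the new rhevm-3.1 one.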

Finally, we upgraded the cluster and datacenter compatibility to v3.1 using the rhevm web UI. No problems.

The fact that I could easily roll back to the previous disk image made me feel confident that the upgrade would also be safe to do in our production environment.

 

Production upgrade

The lab was running on v3.1 for about 2 weeks before we did the production upgrade. Our production environment is a bit more complex than the lab: 22 hypervisors, iSCSI storage from NetApp and Storwize, active/passive bonds for the rhevm and production networks, lots of VLANs, and 2 hypervisors using local storage.

Before doing the upgrade we made sure all hypervisors were running 6.3-something (a mix of versions from 201206xx to 20121212). The backup strategy was the same as in the lab: copy the rhevm disk image to a safe place.

Then we ran the same steps as for the lab upgrade, avoiding the rhevm v3.0 and jboss5 repository problems, so the main upgrade went without any issues. Unfortunately we were unable to avoid running out of disk space during the reports package upgrade. It needed a huge amount of disk (18GB for the dump) and ran out of space several times.
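A free-space check before starting the history and reports setup would have saved us several restarts. Here is a minimal Python sketch of one, using only os.statvfs. The 18GB figure is simply what the dump took in our environment, not an official requirement, and the function names are made up for illustration.

```python
import os

BACKUP_DIR = "/var/lib/ovirt-engine/backups/"  # dump location from the post
REQUIRED_GB = 18  # dump size seen in this upgrade; yours will differ


def free_gb(path):
    """Free space (GiB) available to unprivileged users on path's filesystem."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize / float(1024 ** 3)


def has_room(path, required_gb):
    """True if the filesystem holding path has at least required_gb GiB free."""
    return free_gb(path) >= required_gb


if __name__ == "__main__":
    # Fall back to /var if the backup directory does not exist yet.
    target = BACKUP_DIR if os.path.isdir(BACKUP_DIR) else "/var"
    status = "ok" if has_room(target, REQUIRED_GB) else "NOT enough"
    print("{0}: {1:.1f} GiB free, {2}".format(target, free_gb(target), status))
```

The sketch uses f_bavail rather than f_bfree, so it reports the more conservative figure that excludes blocks reserved for root.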

Other than that, the upgrade was completely unproblematic. We noticed that we couldn't upgrade the cluster/datacenter compatibility level before all hypervisors were upgraded to 6.3-201212, so we did that upgrade first, and we're now happily running v3.1 everywhere!

Responses

Awesome! Thanks for sharing. It's great to see you had a reasonably smooth experience, and hopefully your account of the issues you did run into will assist others following the same process.

Great post indeed. Do you have any suggestions for improvements from your experience? I can obviously see two chances for improvement: calculating the exact disk space requirement upfront, and asking the user to remove the rhev3 and jboss5 channels before starting the upgrade.

Anything other than that?

Thanks for the comments. Did you refer to this document when you did the upgrade?: https://access.redhat.com/knowledge/node/269333

About a week ago I added the commands to the upgrade procedure to disable the older repos. Excellent idea about being more specific about space requirements.

The history database dump should be written compressed; then it would have been much less of a problem. It would also be good if the setup cleaned up afterwards. I still have the 18GB /var/lib/ovirt-engine/backups/ovirt-dwh_db_backup_2013_02_13_11_33_17.sql lying there uncompressed, as I have no idea if it's safe to delete it or not. It probably is, but I'd very much like to see that stated somewhere.

Yes, I was following that document when upgrading, and I believe I did check it when I was trying to roll back the failed upgrade. Unfortunately my rhevm server isn't connected to RHN, but to a local mrepo mirror of the needed channels. This mirror only holds the latest versions of packages, so I couldn't do a simple yum install/downgrade of the specific packages needed for the rollback, since those package versions were not available there.

BTW: it would be great if that document also mentioned whether it is OK to delete the /var/lib/ovirt-engine/backups/ovirt-dwh_db_backup*sql files after the upgrade. This directory is part of the recommended rhevm backup procedure, so it's quite wasteful to keep copying that huge dump out in the daily backups.
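Until that is documented, one cautious option is to keep the original dump where it is and write a compressed copy elsewhere, for example for archiving. The following is a minimal Python sketch of that, assuming a Python 3 interpreter is available. The function name is made up for illustration, and nothing here establishes that the original .sql file is safe to remove.

```python
import gzip
import shutil


def gzip_copy(src_path, dst_path):
    """Write a gzip-compressed copy of src_path to dst_path.

    The source file is only read, never modified or deleted.
    """
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        # Stream in 1 MiB chunks so a multi-GB dump never sits in memory.
        shutil.copyfileobj(src, dst, 1024 * 1024)
    return dst_path
```

Usage would look like gzip_copy(dump, dump + ".gz"), where dump is the path to one of the ovirt-dwh_db_backup*sql files. Plain-text SQL dumps usually compress well, but the actual saving depends on the data.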