Appendix B. Restoring the overcloud

B.1. Restoring the overcloud control plane services

The following procedure restores backups of the overcloud databases and configuration. Open three terminal windows so that you can perform certain operations simultaneously on all three Controller nodes. Also select one Controller node on which to perform high availability operations. This procedure refers to that node as the bootstrap Controller node.

Important

This procedure restores only control plane services. It does not restore Compute node workloads or data on Ceph Storage nodes.

Procedure

  1. Stop Pacemaker and remove all containerized services:

    1. Log in to the bootstrap Controller node and stop the Pacemaker cluster:

      # sudo pcs cluster stop --all
    2. Wait until the cluster shuts down completely:

      # sudo pcs status
    3. On all Controller nodes, remove all containers for OpenStack services:

      # docker stop $(docker ps -a -q)
      # docker rm $(docker ps -a -q)
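The wait for the cluster shutdown can be scripted instead of re-running the status command by hand. The following is a minimal sketch, not part of the original procedure. wait_until_fails is a hypothetical helper, and the sketch assumes that pcs status exits with a non-zero status once the cluster is no longer running on the node. Confirm that behavior in your environment before relying on it.

```shell
# Poll a command until it starts failing, or give up after max_attempts tries.
# Usage: wait_until_fails <max_attempts> <interval_seconds> <command> [args...]
wait_until_fails() {
    max_attempts=$1
    interval=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if ! "$@" >/dev/null 2>&1; then
            return 0    # command failed: the cluster is taken to be stopped
        fi
        sleep "$interval"
        attempt=$((attempt + 1))
    done
    return 1            # command still succeeds after every attempt
}

# Example, run on the bootstrap Controller node (assumed exit-status behavior):
# wait_until_fails 60 5 sudo pcs status && echo "Cluster is stopped"
```

The helper returns 1 when the command still succeeds after the last attempt, so a calling script can stop before it removes the containers.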
  2. If you are restoring from a failed major version upgrade, you might need to reverse the yum transactions that occurred during the upgrade. Perform the following steps on each node:

    1. Enable the repositories for previous versions. For example:

      # sudo subscription-manager repos --enable=rhel-7-server-openstack-10-rpms
      # sudo subscription-manager repos --enable=rhel-7-server-openstack-11-rpms
      # sudo subscription-manager repos --enable=rhel-7-server-openstack-12-rpms
    2. Enable the following Ceph repositories:

      # sudo subscription-manager repos --enable=rhel-7-server-rhceph-2-tools-rpms
      # sudo subscription-manager repos --enable=rhel-7-server-rhceph-2-mon-rpms
    3. Check the yum history:

      # sudo yum history list all

      Identify transactions that occurred during the upgrade process. Most of these operations will have occurred on one of the Controller nodes (the Controller node selected as the bootstrap node during the upgrade). If you need to view a particular transaction, view it with the history info subcommand:

      # sudo yum history info 25
      Note

      To force yum history list all to display the command that ran in each transaction, set history_list_view=commands in your yum.conf file.

    4. Revert any yum transactions that occurred since the upgrade. For example:

      # sudo yum history undo 25
      # sudo yum history undo 24
      # sudo yum history undo 23
      ...

      Make sure to start from the last transaction and continue in descending order. You can also revert multiple transactions in one execution using the rollback subcommand. For example, the following command undoes every transaction performed after transaction 23, which returns the system to its state immediately after that transaction:

      # sudo yum history rollback 23
      Important

      It is recommended to use undo for each transaction instead of rollback so that you can verify the reversal of each transaction.

    5. Once the relevant yum transactions have been reversed, enable only the original OpenStack Platform repository on all nodes. For example:

      # sudo subscription-manager repos --disable=rhel-7-server-openstack-*-rpms
      # sudo subscription-manager repos --enable=rhel-7-server-openstack-10-rpms
    6. Disable the following Ceph repositories:

      # sudo subscription-manager repos --disable=rhel-7-server-rhceph-3-tools-rpms
      # sudo subscription-manager repos --disable=rhel-7-server-rhceph-3-mon-rpms
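The descending order required for the undo operations can be generated rather than typed by hand. The following is a minimal sketch, not part of the original procedure. print_undo_commands is a hypothetical helper, and 25 and 23 are the example transaction IDs used in this step. It only prints the commands, so you can still run and verify each reversal individually.

```shell
# Print "yum history undo" commands from the newest transaction ID down to
# the oldest, so each reversal can be reviewed and run one at a time.
# Usage: print_undo_commands <newest_id> <oldest_id>
print_undo_commands() {
    id=$1
    oldest=$2
    while [ "$id" -ge "$oldest" ]; do
        echo "sudo yum history undo $id"
        id=$((id - 1))
    done
}

# Example: print the undo commands for transactions 25 down to 23.
# print_undo_commands 25 23
```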
  3. Restore the database:

    1. Copy the database backups to the bootstrap Controller node.
    2. Stop external connections to the database port on all Controller nodes:

      # MYSQLIP=$(hiera -c /etc/puppet/hiera.yaml mysql_bind_host)
      # sudo /sbin/iptables -I INPUT -d $MYSQLIP -p tcp --dport 3306 -j DROP

      This blocks incoming connections to the database port on each node, which isolates the database from other services during the restore.

    3. Temporarily disable database replication. Edit the /etc/my.cnf.d/galera.cnf file on all Controller nodes.

      # vi /etc/my.cnf.d/galera.cnf

      Make the following changes:

      • Comment out the wsrep_cluster_address parameter.
      • Set wsrep_provider to none.
    4. Save the /etc/my.cnf.d/galera.cnf file.
    5. Make sure the MariaDB database is disabled on all Controller nodes. During the upgrade to OpenStack Platform 13, the MariaDB service moves to a containerized service, which you disabled earlier. Make sure the service isn’t running as a process on the host as well:

      # mysqladmin -u root shutdown
      Note

      You might get a warning from HAProxy that the database is disabled.

    6. Move existing MariaDB data directories and prepare new data directories on all Controller nodes:

      # mv /var/lib/mysql/ /var/lib/mysql.old
      # mkdir /var/lib/mysql
      # chown mysql:mysql /var/lib/mysql
      # chmod 0755 /var/lib/mysql
      # mysql_install_db --datadir=/var/lib/mysql --user=mysql
      # chown -R mysql:mysql /var/lib/mysql/
      # restorecon -R /var/lib/mysql
    7. Start the database manually on all Controller nodes:

      # mysqld_safe --skip-grant-tables --skip-networking --wsrep-on=OFF &
    8. Get the old password and reset the database password on all Controller nodes:

      # OLDPASSWORD=$(sudo cat .my.cnf | grep -m1 password | cut -d'=' -f2 | tr -d "'")
      # mysql -uroot -e"use mysql;update user set password=PASSWORD('$OLDPASSWORD')"
    9. Stop the database on all Controller nodes:

      # /usr/bin/mysqladmin -u root shutdown
    10. Start the database manually on the bootstrap Controller node without the --skip-grant-tables option:

      # mysqld_safe --skip-networking --wsrep-on=OFF &
    11. On the bootstrap Controller node, restore the OpenStack database. This will be replicated to the other Controller nodes later:

      # mysql -u root < openstack_database.sql
    12. On the bootstrap Controller node, restore the users and permissions:

      # mysql -u root < grants.sql
    13. Shut down the database on the bootstrap Controller node with the following command:

      # mysqladmin shutdown
    14. Enable database replication. Edit the /etc/my.cnf.d/galera.cnf file on all Controller nodes.

      # vi /etc/my.cnf.d/galera.cnf

      Make the following changes:

      • Uncomment the wsrep_cluster_address parameter.
      • Set wsrep_provider to /usr/lib64/galera/libgalera_smm.so.
    15. Save the /etc/my.cnf.d/galera.cnf file.
    16. Run the database on the bootstrap node:

      # /usr/bin/mysqld_safe --pid-file=/var/run/mysql/mysqld.pid --socket=/var/lib/mysql/mysql.sock --datadir=/var/lib/mysql --log-error=/var/log/mysql_cluster.log --user=mysql --open-files-limit=16384 --wsrep-cluster-address=gcomm:// &

      The lack of nodes in the --wsrep-cluster-address option will force Galera to create a new cluster and make the bootstrap node the master node.

    17. Check the status of the node:

      # clustercheck

      This command should report that the Galera cluster node is synced. Also check the /var/log/mysql_cluster.log file for errors.

    18. On the remaining Controller nodes, start the database:

      # /usr/bin/mysqld_safe --pid-file=/var/run/mysql/mysqld.pid --socket=/var/lib/mysql/mysql.sock --datadir=/var/lib/mysql --log-error=/var/log/mysql_cluster.log --user=mysql --open-files-limit=16384 --wsrep-cluster-address=gcomm://overcloud-controller-0,overcloud-controller-1,overcloud-controller-2 &

      The inclusion of the nodes in the --wsrep-cluster-address option adds nodes to the new cluster and synchronizes content from the master.

    19. Periodically check the status of each node:

      # clustercheck

      When all nodes have completed their synchronization operations, this command should report that the Galera cluster node is synced on each node.

    20. Stop the database on all nodes:

      # mysqladmin shutdown
    21. Remove the firewall rule from each node so that services regain access to the database:

      # sudo /sbin/iptables -D INPUT -d $MYSQLIP -p tcp --dport 3306 -j DROP
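The two galera.cnf edits in the database restore step (disabling replication, then enabling it again) can be applied with sed instead of an interactive editor, which helps when you repeat them on all three Controller nodes. The following is a minimal sketch, not part of the original procedure. Both function names are hypothetical, and the sketch assumes that each parameter appears on its own line in the form name = value. Each call leaves a copy of the previous file with a .bak suffix.

```shell
# Disable replication: comment out wsrep_cluster_address and set
# wsrep_provider to none. Usage: disable_galera_replication <galera.cnf>
disable_galera_replication() {
    sed -i.bak \
        -e 's/^\([[:space:]]*\)wsrep_cluster_address/\1#wsrep_cluster_address/' \
        -e 's|^\([[:space:]]*wsrep_provider[[:space:]]*=\).*|\1 none|' \
        "$1"
}

# Enable replication: uncomment wsrep_cluster_address and point
# wsrep_provider back at the Galera library named in this procedure.
enable_galera_replication() {
    sed -i.bak \
        -e 's/^\([[:space:]]*\)#[[:space:]]*wsrep_cluster_address/\1wsrep_cluster_address/' \
        -e 's|^\([[:space:]]*wsrep_provider[[:space:]]*=\).*|\1 /usr/lib64/galera/libgalera_smm.so|' \
        "$1"
}

# Example, run as root on each Controller node:
# disable_galera_replication /etc/my.cnf.d/galera.cnf
```

The wsrep_provider pattern requires an equals sign directly after the parameter name (apart from whitespace), so a wsrep_provider_options line is left unchanged. Review the resulting file before you start the database.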
  4. Restore the Pacemaker configuration:

    1. Copy the Pacemaker archive to the bootstrap node.
    2. Log in to the bootstrap node.
    3. Run the configuration restoration command:

      # pcs config restore pacemaker_controller_backup.tar.bz2
  5. Restore the Redis resource:

    1. Copy the Redis dump to each Controller node.
    2. Move the Redis dump to the original location on each Controller:

      # mv dump.rdb /var/lib/redis/dump.rdb
    3. Restore permissions to the Redis directory:

      # chown -R redis: /var/lib/redis
  6. Restore the filesystem:

    1. Copy the backup tar file for each Controller node to a temporary directory and uncompress all the data:

      # mkdir /var/tmp/filesystem_backup/
      # cd /var/tmp/filesystem_backup/
      # mv <backup_file>.tar.gz .
      # tar --xattrs -xvzf <backup_file>.tar.gz
      Note

      Do not extract directly to the / directory. This overwrites your current filesystem. It is recommended to extract the file in a temporary directory.

    2. Restore the os-*-config files and restart os-collect-config:

      # cp -rf /var/tmp/filesystem_backup/var/lib/os-collect-config/* /var/lib/os-collect-config/.
      # cp -rf /var/tmp/filesystem_backup/usr/libexec/os-apply-config/* /usr/libexec/os-apply-config/.
      # cp -rf /var/tmp/filesystem_backup/usr/libexec/os-refresh-config/* /usr/libexec/os-refresh-config/.
      # systemctl restart os-collect-config
    3. Restore the Puppet hieradata files:

      # cp -r /var/tmp/filesystem_backup/etc/puppet/hieradata /etc/puppet/
      # cp /var/tmp/filesystem_backup/etc/puppet/hiera.yaml /etc/puppet/hiera.yaml
    4. Retain this directory in case you need any configuration files.
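The extraction in the first sub-step of the filesystem restore can be wrapped in a small function that refuses to unpack into the / directory. The following is a minimal sketch, not part of the original procedure. extract_backup is a hypothetical helper, and it uses the tar -C option to select the target directory instead of changing into it first.

```shell
# Extract a backup archive, with extended attributes, into a scratch directory.
# Usage: extract_backup <archive.tar.gz> <destination_directory>
extract_backup() {
    archive=$1
    dest=$2
    case "$dest" in
        /)
            # Guard against overwriting the running filesystem.
            echo "Refusing to extract into /" >&2
            return 1
            ;;
    esac
    mkdir -p "$dest" || return 1
    # The archive name must directly follow -f, so --xattrs comes first.
    tar --xattrs -xzf "$archive" -C "$dest"
}

# Example, run as root on each Controller node:
# extract_backup <backup_file>.tar.gz /var/tmp/filesystem_backup
```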
  7. Remove the contents of the following directory, and remove the following file:

    # rm -rf /var/lib/config-data/puppet-generated/*
    # rm /root/.ffu_workaround
  8. Restore the permissions for the OpenStack Object Storage (swift) service:

    # chown -R swift: /srv/node
    # chown -R swift: /var/lib/swift
    # chown -R swift: /var/cache/swift
  9. Log in to the undercloud and run the original openstack overcloud deploy command from your OpenStack Platform 10 deployment. Make sure to include all environment files relevant to your original deployment.
  10. Wait until the deployment completes.
  11. After restoring the overcloud control plane data, check that each relevant service is enabled and running correctly:

    1. For high availability services on Controller nodes:

      # pcs resource enable [SERVICE]
      # pcs resource cleanup [SERVICE]
    2. For systemd services on Controller and Compute nodes:

      # systemctl start [SERVICE]
      # systemctl enable [SERVICE]

The next few sections provide a reference of services that should be enabled.

B.2. Restored High Availability Services

The following is a list of high availability services that should be active on OpenStack Platform 10 Controller nodes after a restore. If any of these services are disabled, use the following commands to enable them:

# pcs resource enable [SERVICE]
# pcs resource cleanup [SERVICE]
Controller Services

galera

haproxy

openstack-cinder-volume

rabbitmq

redis
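
The two pcs commands can be applied to every resource in this list with a loop. The following is a minimal sketch, not part of the original procedure. The HA_RESOURCES variable and the enable_ha_resources function are hypothetical names, and the resource names are copied from the list above. Adjust them if your deployment names its Pacemaker resources differently.

```shell
# High availability resources listed in this section.
HA_RESOURCES="galera haproxy openstack-cinder-volume rabbitmq redis"

# Enable each resource, then clear its failure history.
# The cleanup only runs when the enable command succeeds.
enable_ha_resources() {
    for resource in $HA_RESOURCES; do
        pcs resource enable "$resource" && pcs resource cleanup "$resource"
    done
}

# Example, run as root on one Controller node:
# enable_ha_resources
```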

B.3. Restored Controller Services

The following is a list of core systemd services that should be active on OpenStack Platform 10 Controller nodes after a restore. If any of these services are disabled, use the following commands to enable them:

# systemctl start [SERVICE]
# systemctl enable [SERVICE]
Controller Services

httpd

memcached

neutron-dhcp-agent

neutron-l3-agent

neutron-metadata-agent

neutron-openvswitch-agent

neutron-ovs-cleanup

neutron-server

ntpd

openstack-aodh-evaluator

openstack-aodh-listener

openstack-aodh-notifier

openstack-ceilometer-central

openstack-ceilometer-collector

openstack-ceilometer-notification

openstack-cinder-api

openstack-cinder-scheduler

openstack-glance-api

openstack-glance-registry

openstack-gnocchi-metricd

openstack-gnocchi-statsd

openstack-heat-api-cfn

openstack-heat-api-cloudwatch

openstack-heat-api

openstack-heat-engine

openstack-nova-api

openstack-nova-conductor

openstack-nova-consoleauth

openstack-nova-novncproxy

openstack-nova-scheduler

openstack-swift-account-auditor

openstack-swift-account-reaper

openstack-swift-account-replicator

openstack-swift-account

openstack-swift-container-auditor

openstack-swift-container-replicator

openstack-swift-container-updater

openstack-swift-container

openstack-swift-object-auditor

openstack-swift-object-expirer

openstack-swift-object-replicator

openstack-swift-object-updater

openstack-swift-object

openstack-swift-proxy

openvswitch

os-collect-config

ovs-delete-transient-ports

ovs-vswitchd

ovsdb-server

pacemaker
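
With a list this long, it can help to report only the services that are not running. The following is a minimal sketch, not part of the original procedure. list_inactive_services is a hypothetical helper built on systemctl is-active, and it only reports the services. It does not start or enable anything.

```shell
# Print the name of every given service that systemd does not report as active.
# Usage: list_inactive_services <service> [service...]
list_inactive_services() {
    for service in "$@"; do
        if ! systemctl is-active --quiet "$service"; then
            echo "$service"
        fi
    done
}

# Example with a few services from the list above:
# list_inactive_services httpd memcached neutron-server openstack-nova-api
```

You can then pass each reported name to the systemctl start and systemctl enable commands shown at the start of this section.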

B.4. Restored Overcloud Compute Services

The following is a list of core systemd services that should be active on OpenStack Platform 10 Compute nodes after a restore. If any of these services are disabled, use the following commands to enable them:

# systemctl start [SERVICE]
# systemctl enable [SERVICE]
Compute Services

neutron-openvswitch-agent

neutron-ovs-cleanup

ntpd

openstack-ceilometer-compute

openstack-nova-compute

openvswitch

os-collect-config