Chapter 3. Configuring Database Failover

The failover monitor daemon must run on all of the non-database CloudForms appliances to check for database failures. When the primary database fails, the monitor updates the appliance's database configuration to point to the new primary.
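
To see which database host an appliance currently points to, you can inspect its database configuration. The following is a minimal check, assuming the default appliance layout where the configuration file is /var/www/miq/vmdb/config/database.yml (adjust the path if your appliance differs):

    # grep host: /var/www/miq/vmdb/config/database.yml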

Important

This configuration is crucial for high availability to work in your environment. If the database failover monitor is not configured, the standby database-only appliance will not react and take over operations in case of a primary database failure.

3.1. Configuring the Failover Monitor

Configure the failover monitor only on the non-database CloudForms appliances with the following steps:

  1. In the appliance console menu, select Configure Application Database Failover Monitor.
  2. Select Start Database Failover Monitor.
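
    To confirm that the monitor is running, you can watch the ha_admin.log on the same appliance; this is the same log used to troubleshoot failover later in this chapter:

    # tail -f /var/www/miq/vmdb/log/ha_admin.log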

3.2. Testing Database Failover

Test that failover is working correctly between your databases with the following steps:

  1. Simulate a failure by stopping the database on the primary server:

    # systemctl stop rh-postgresql95-postgresql
  2. To check the status of the database, run:

    # systemctl status rh-postgresql95-postgresql
    Note

    You can check the status of the simulated failure by viewing the most recent ha_admin.log on the engine appliances:

    # tail -f /var/www/miq/vmdb/log/ha_admin.log
  3. Check the appliance console summary screen. If failover is configured correctly, the CFME Database value on all CloudForms appliances should have switched from the hostname of the old primary database to the hostname of the new primary.
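
    You can also confirm the promotion from PostgreSQL itself. As an optional check on the new primary database-only appliance, query the standard pg_is_in_recovery() function, which returns f on a primary and t on a standby:

    # su - postgres
    $ psql vmdb_production -c "select pg_is_in_recovery();"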
Important

Upon database server failover, the standby server becomes the primary. However, the failed node cannot switch to standby automatically and must be manually configured. Data replication from the new primary to the failed and recovered node does not happen by default, so the failed node must be reintroduced into the configuration.

3.3. Reintroducing the Failed Node

Manual steps are required to reintroduce the failed primary database node back into the cluster as a standby. This allows for greater control over the configuration and provides an opportunity to diagnose the cause of the failure.

To reintroduce the failed node:

  1. Stop the local pgsql service:

    # systemctl stop $APPLIANCE_PG_SERVICE
  2. Run pg_rewind on the failed primary server as the postgres user:

    # su - postgres
    $ pg_rewind -D $APPLIANCE_PG_DATA --source-server="host=<new_master_ip> user=root password=<db_password> dbname=vmdb_production"
  3. Add the following lines for standby configuration to the /etc/repmgr.conf file:

    failover=automatic
    promote_command='repmgr standby promote'
    follow_command='repmgr standby follow'
    logfile=/var/log/repmgr/repmgrd.log
  4. Copy the /var/lib/pgsql/.pgpass file from the new primary server to the failed primary server. Change the file’s ownership to the postgres user and group, and the permissions to 600:

    # chown postgres:postgres /var/lib/pgsql/.pgpass
    # chmod 600 /var/lib/pgsql/.pgpass
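
    Optionally, verify that the copied .pgpass file provides password-less access to the new primary for the postgres system user. This is a hedged check that reuses the connection parameters from the pg_rewind command above:

    # su - postgres
    $ psql -c "select 1;" "host=<new_master_ip> user=root dbname=vmdb_production"
    $ exit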
  5. Run repmgr standby follow as the postgres user on the failed primary server to add it as a standby server:

    # su - postgres
    $ repmgr -f /etc/repmgr.conf -D $APPLIANCE_PG_DATA -h <new_master_ip> -U root -d vmdb_production standby follow

    If the repmgr standby follow command times out and postgresql.log reports a message similar to requested WAL segment 000000020000000000000004 has already been removed, the write-ahead log (WAL) segment required to catch up the standby server is no longer available on the primary. In this case, remove the contents of the data directory and reinitialize the standby to re-add the node.

    To do so, complete the following steps:

    1. Delete all database data and the replication manager configuration file from the failed node:

      # rm -rf /var/opt/rh/rh-postgresql95/lib/pgsql/data/*
      # rm /etc/repmgr.conf
    2. Delete the failed database node entry from the new primary database-only appliance:

      1. Query the repl_nodes table (see the commands below) and identify the ID of the failed node in the output; the ID is the same as the cluster ID provided during installation.
      2. Delete the node's entry. For example, if the cluster ID is 1, run vmdb_production=#delete from repl_nodes where id = 1; to delete the failed node:

        # psql vmdb_production
        vmdb_production=#select * from repl_nodes;
        vmdb_production=#delete from repl_nodes where id = $cluster_node_id_of_failed_node;
    3. Delete the replication slot information for the failed database node.

      1. Check the replication_slot information for the failed node in the pg_replication_slots table:

        • The failed node is marked with f in the active column of the pg_replication_slots table. If the value is t, a standby database-only appliance with that node ID is still connected; shut down this appliance before deleting the replication slot.
        • When the standby node is disconnected, the active column changes to f.
      2. Delete the slot:

        • The slot number is the same as the cluster ID. For example, if the cluster ID is 1, run vmdb_production=#select pg_drop_replication_slot('repmgr_slot_1'); to delete the slot:

          # psql vmdb_production
          vmdb_production=#select * from pg_replication_slots;
          vmdb_production=#select pg_drop_replication_slot('repmgr_slot_$failed_node_cluster_id');
    4. Reinitialize the standby database as described in Section 2.5, “Configuring the Standby Database-Only Appliance” to re-add the node.

      For more information on the write-ahead log (WAL), see the PostgreSQL documentation.

  6. Start and enable repmgrd for automatic failover:

    # systemctl start rh-postgresql95-repmgr
    # systemctl enable rh-postgresql95-repmgr
  7. To apply the changes, stop the database as the postgres user, verify that it has stopped, and then restart the service:

    # su - postgres
    $ pg_ctl -D $APPLIANCE_PG_DATA stop
    $ pg_ctl -D $APPLIANCE_PG_DATA status
    $ exit
    
    # systemctl start $APPLIANCE_PG_SERVICE
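
Optionally, confirm that the reintroduced node is replicating from the new primary. One way to check, run on the new primary database-only appliance as the postgres user, is to query the standard pg_stat_replication view, which lists connected standby servers:

    # su - postgres
    $ psql vmdb_production -c "select client_addr, state from pg_stat_replication;"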

Your CloudForms environment is now re-configured for high availability.