Chapter 3. Configuring Database Failover
The failover monitor daemon must run on all of the non-database CloudForms appliances to check for failures. In case of a database failure, it modifies the database configuration accordingly.
This configuration is crucial for high availability to work in your environment. If the database failover monitor is not configured, the standby database-only appliance will not react and take over operations in case of a primary database failure.
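When a failure is detected, the monitor updates the appliance's database configuration to point at the newly promoted primary. As a quick, hedged way to see which database host an appliance is currently using, you can inspect the Rails database configuration; this sketch assumes the standard /var/www/miq/vmdb/config/database.yml path on a CloudForms appliance:
# grep 'host:' /var/www/miq/vmdb/config/database.yml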
3.1. Configuring the Failover Monitor
Configure the failover monitor only on the non-database CloudForms appliances with the following steps:
- In the appliance console menu, select Configure Application Database Failover Monitor.
- Select Start Database Failover Monitor.
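To confirm that the monitor started, one option is to watch its log on the same appliance; this assumes the ha_admin.log location referenced later in this chapter:
# tail -f /var/www/miq/vmdb/log/ha_admin.log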
3.2. Testing Database Failover
Test that failover is working correctly between your databases with the following steps:
Simulate a failure by stopping the database on the primary server:
# systemctl stop rh-postgresql95-postgresql
To check the status of the database, run:
# systemctl status rh-postgresql95-postgresql
Note: You can check the status of the simulated failure by viewing the most recent ha_admin.log on the engine appliances:
# tail -f /var/www/miq/vmdb/log/ha_admin.log
- Check the appliance console summary screen for the primary database. If configured correctly, the CFME Database value in the appliance console summary should have switched from the hostname of the old primary database to the hostname of the new primary on all CloudForms appliances.
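For an additional check on the database servers themselves, the following sketch, run on the newly promoted primary, assumes psql access to the vmdb_production database; pg_is_in_recovery() should return f once the standby has been promoted to primary:
# su - postgres
$ psql vmdb_production -c "select pg_is_in_recovery();"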
Upon database server failover, the standby server becomes the primary. However, the failed node cannot switch to standby automatically and must be manually configured. Data replication from the new primary to the failed and recovered node does not happen by default, so the failed node must be reintroduced into the configuration.
3.3. Reintroducing the Failed Node
Manual steps are required to reintroduce the failed primary database node into the cluster as a standby. This allows for greater control over the configuration and provides an opportunity to diagnose the cause of the failure.
To reintroduce the failed node:
Stop the local pgsql service:
# systemctl stop $APPLIANCE_PG_SERVICE
Run pg_rewind on the failed primary server as the postgres user:
# su - postgres
$ pg_rewind -D $APPLIANCE_PG_DATA --source-server="host=<new_master_ip> user=root password=<db_password> dbname=vmdb_production"
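Note that pg_rewind can only rewind a cluster that was initialized with data checksums or that has wal_log_hints enabled. As a hedged pre-check on the failed server, you can inspect the setting in postgresql.conf, assuming the file lives under $APPLIANCE_PG_DATA:
$ grep wal_log_hints $APPLIANCE_PG_DATA/postgresql.conf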
Add the following lines for standby configuration to the /etc/repmgr.conf file:
failover=automatic
promote_command='repmgr standby promote'
follow_command='repmgr standby follow'
logfile=/var/log/repmgr/repmgrd.log
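For context, these lines extend an existing repmgr.conf that already identifies the node. A complete standby configuration typically looks like the hedged example below; the cluster name, node number, node name, and conninfo values are illustrative placeholders, not values from your environment:
cluster=miq_cluster
node=2
node_name=standby.example.com
conninfo='host=standby.example.com user=root dbname=vmdb_production'
use_replication_slots=1
failover=automatic
promote_command='repmgr standby promote'
follow_command='repmgr standby follow'
logfile=/var/log/repmgr/repmgrd.log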
Copy the /var/lib/pgsql/.pgpass file from the new primary server to the failed primary server. Change the file's ownership to the postgres user and group, and the permissions to 600:
# chown postgres:postgres /var/lib/pgsql/.pgpass
# chmod 600 /var/lib/pgsql/.pgpass
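If you need to recreate the file rather than copy it, a .pgpass entry uses the standard PostgreSQL hostname:port:database:username:password format; the line below is an illustration with placeholder values:
<new_master_ip>:5432:vmdb_production:root:<db_password>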
Run repmgr standby follow as the postgres user on the failed primary server to add it as a standby server:
# su - postgres
$ repmgr -f /etc/repmgr.conf -D $APPLIANCE_PG_DATA -h <new_master_ip> -U root -d vmdb_production standby follow
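To confirm that the reintroduced node is streaming from the new primary, one hedged check is to query pg_stat_replication on the new primary; a row for the standby should appear once replication is established:
# psql vmdb_production
vmdb_production=# select client_addr, state from pg_stat_replication;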
If the repmgr standby follow command times out and postgresql.log reports a message similar to requested WAL segment 000000020000000000000004 has already been removed, you can correct this by removing the contents of the data directory and reinitializing the standby to re-add the node. This occurs when the write-ahead log (WAL) required to catch up the standby server is no longer available on the primary. Correct this with the following steps:
Delete all database data and the replication manager configuration file from the failed node:
# rm -rf /var/opt/rh/rh-postgresql95/lib/pgsql/data/*
# rm /etc/repmgr.conf
Delete the failed database node entry from the new primary database-only appliance:
- Check the failed node ID in the repl_nodes table output; the ID is the same as the cluster node ID provided during installation.
- Delete the node. For example, if the cluster ID is 1, run delete from repl_nodes where id = 1; to delete the failed node:
# psql vmdb_production
vmdb_production=# select * from repl_nodes;
vmdb_production=# delete from repl_nodes where id = <cluster_node_id_of_failed_node>;
Delete the replication slot information for the failed database node.
Check the replication_slot information for the failed node in the pg_replication_slots table:
- The failed node is marked with f in the active column of the pg_replication_slots table. If the value is t, there is a standby database-only appliance connected with ID 1; this appliance must be shut down before deleting the replication slot.
- When the standby node is disconnected, the active column changes to f.
Delete the slot. The slot number is the same as the cluster_id. For example, if the cluster ID is 1, run select pg_drop_replication_slot('repmgr_slot_1'); to delete the slot:
# psql vmdb_production
vmdb_production=# select * from pg_replication_slots;
vmdb_production=# select pg_drop_replication_slot('repmgr_slot_<failed_node_cluster_id>');
Reinitialize the standby database as described in Section 2.5, “Configuring the Standby Database-Only Appliance” to re-add the node.
For more information on the WAL log, see the PostgreSQL documentation.
Start and enable repmgrd for automatic failover:
# systemctl start rh-postgresql95-repmgr
# systemctl enable rh-postgresql95-repmgr
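To verify that the daemon started cleanly, check the service status and the repmgrd log configured earlier:
# systemctl status rh-postgresql95-repmgr
# tail -f /var/log/repmgr/repmgrd.log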
To apply the changes, stop the cluster as the postgres user, then restart the service:
# su - postgres
$ pg_ctl -D $APPLIANCE_PG_DATA stop
$ pg_ctl -D $APPLIANCE_PG_DATA status
$ exit
# systemctl start $APPLIANCE_PG_SERVICE
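As a final check, you can review the cluster from the reintroduced node; repmgr cluster show reports each node's role, and the recovered server should now appear as a standby:
# su - postgres
$ repmgr -f /etc/repmgr.conf cluster show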
Your CloudForms environment is now re-configured for high availability.
