Appendix C. Troubleshooting

This appendix covers database and cluster troubleshooting steps.

C.1. Cluster Configuration

To check the attributes of a cluster resource, execute:

#pcs resource show RESOURCE

Attribute values can be changed using:

#pcs resource update RESOURCE_NAME ATTR_NAME=ATTR_VALUE
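
For example, assuming the PostgreSQL resource is named postgresql, as elsewhere in this appendix, its attributes can be listed and one of them changed. The rep_mode parameter of the pgsql resource agent is used here only as an illustration; use the attribute relevant to your configuration:

#pcs resource show postgresql

#pcs resource update postgresql rep_mode=async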

If a resource fails to start and pcs status shows an error for the resource, run the following command to start it on the local node and get more details about the error:

#pcs resource debug-start RESOURCE
Note

Use caution when executing debug-start. It is advised to disable the resource first to prevent conflicts, possible corruption, or resource failures. Consult Red Hat Support as needed.

To stop and start a cluster resource, execute:

#pcs resource disable RESOURCE

#pcs resource enable RESOURCE

While working on the configuration, the resource may fail to start so often that the cluster no longer attempts to start it. This can be checked with:

#pcs resource failcount show postgresql

If the failcounts are shown as INFINITY, you can reset them with:

#pcs resource cleanup postgresql
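
After resetting the failcounts, verify that the cluster attempts to start the resource again:

#pcs status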

C.2. Replication in a Cluster Environment

The cluster resource agent script automatically determines which of the two nodes should be the primary and which should be the standby node. The current status can be viewed with:

#crm_mon -Afr -1

If the primary and standby are both active, the output should appear as:

Node Attributes:
* Node cf-db1.example.com:
    + master-postgresql                 : 1000
    + postgresql-data-status            : LATEST
    + postgresql-master-baseline        : 0000000010000080
    + postgresql-status                 : PRI
    + postgresql-xlog-loc               : 0000000010000080
* Node cf-db2.example.com:
    + master-postgresql                 : 100
    + postgresql-data-status            : STREAMING|ASYNC
    + postgresql-status                 : HS:async

In this case, cf-db1 is the primary, and cf-db2 is the standby server, with streaming asynchronous replication.
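
Replication can also be verified at the database level. As a sketch, assuming the rh-postgresql94 Software Collection used elsewhere in this appendix, the following query on the primary should return one row per connected standby:

#su - postgres -c 'scl enable rh-postgresql94 -- psql -x -c "SELECT client_addr, state, sync_state FROM pg_stat_replication;"'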

If the standby has lost its connection to the primary for too long and its database needs to be restored from a backup taken on the primary, the output will appear as:

Node Attributes:
* Node cf-db1.example.com:
    + master-postgresql                 : -INFINITY
    + postgresql-data-status            : DISCONNECT
    + postgresql-status                 : HS:alone
* Node cf-db2.example.com:
    + master-postgresql                 : 1000
    + postgresql-data-status            : LATEST
    + postgresql-master-baseline        : 0000000011000080
    + postgresql-status                 : PRI
    + postgresql-xlog-loc               : 0000000011000080

Here, cf-db2 is the primary, and cf-db1 is unable to start because its database is out-of-date.

This can be caused by connection problems. Check the firewalls for both database systems, and check that pg_hba.conf has the same content on both systems.
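
For example, assuming firewalld is in use and key-based SSH access to the peer node, list the open ports on each node and compare the pg_hba.conf files (the path is the data directory used in the restore procedure below; cf-db2.example.com stands for the peer node):

#firewall-cmd --list-all

#diff /var/opt/rh/rh-postgresql94/lib/pgsql/data/pg_hba.conf <(ssh cf-db2.example.com cat /var/opt/rh/rh-postgresql94/lib/pgsql/data/pg_hba.conf)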

If a problem is found and fixed, disable and then enable the postgresql resource and follow /var/log/messages, as shown below. Some time after the resource is enabled, one database system becomes the primary and the other the standby.
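
For example (run the tail command in a second terminal to follow the cluster activity while the resource starts):

#pcs resource disable postgresql

#pcs resource enable postgresql

#tail -f /var/log/messages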

C.3. Restoring the Standby Database from a Backup

If the standby is still unable to start after checking the firewall, the PostgreSQL access permissions, and the NFS mount for archived Write Ahead Logs, take a backup of the primary and restore it on the standby database.

To do this, run the following commands on the standby cluster node:

#pcs cluster standby $HOSTNAME

#su - postgres

$rm -rf /tmp/pgbackup

$mkdir /tmp/pgbackup

$scl enable rh-postgresql94 -- pg_basebackup -h REPLICATION_VIP -U replicator -D /tmp/pgbackup -x

$rm -rf /var/opt/rh/rh-postgresql94/lib/pgsql/data/*

$mv /tmp/pgbackup/* /var/opt/rh/rh-postgresql94/lib/pgsql/data

$chown -R postgres:postgres /var/opt/rh/rh-postgresql94/lib/pgsql/data

#pcs cluster unstandby $HOSTNAME
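
Once the node leaves standby mode, the resource agent should start PostgreSQL as a streaming standby again. Confirm this with the crm_mon command shown above; the restored node should again report a postgresql-status of HS:async:

#crm_mon -Afr -1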

C.4. Simulating a Node Failure

To test fencing and automatic failover, trigger a kernel panic by running the command below. Before doing this, ensure access to the system console and power control.

#echo c >/proc/sysrq-trigger

Watch /var/log/messages on the surviving node: the crashed node is fenced, and the surviving node becomes the primary database (if it was not already).
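
The failover can also be watched interactively; run without -1, crm_mon keeps refreshing its output until it is interrupted:

#crm_mon -Afr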

The crashed node should boot after the power off/power on cycle, automatically join the cluster, and start the database as the standby. If it was the primary before, PGSQL.lock needs to be removed as described above.

C.5. Red Hat CloudForms UI Failover

To simulate a UI failure, stop the Web server on one of the UI appliances by running the following command:

#service httpd stop

When done testing, start the Web server again with:

#service httpd start

To verify which CFME appliance serves requests, check /var/www/miq/vmdb/log/apache/ssl_access.log.
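
For example, to follow the incoming requests on an appliance while testing the failover:

#tail -f /var/www/miq/vmdb/log/apache/ssl_access.log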