JON 3.2 upgrade issue - The Cluster status of the 2nd node shows "DOWN"

We recently upgraded JON from 3.1.2 to 3.2 in an HA environment. The cluster status of the 2nd node has shown "DOWN" since the upgrade.

We tried a few things but no luck so far.

  1. Updated cassandra.yaml and rhq-storage-auth.conf and ensured the IPs of both nodes are present in both config files (that didn't work).

  2. Ran the CLI method StorageNodeManager.runClusterMaintenance() (didn't work; a rough sketch of how we invoked it follows the repair commands below).

  3. Ran the following nodetool repair commands, which didn't work either:

nodetool -p 7299 repair system_auth
nodetool -p 7299 repair rhq
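
For reference, this is roughly how the runClusterMaintenance() call was issued from the interactive RHQ CLI; the CLI location, server host, port, and credentials below are placeholders rather than our actual values, and the exact options may differ by installation:

# connect with the RHQ CLI (placeholder host/port/credentials)
<RHQ_CLI_HOME>/bin/rhq-cli.sh -u rhqadmin -p <password> -s jon-server-1 -t 7080

# then, at the CLI prompt:
StorageNodeManager.runClusterMaintenance()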

It seems that no matter what we try, the cluster status of the 2nd node always shows "DOWN". Could the data on the 2nd node be corrupted? Is it possible to scrub the 2nd storage node and rebuild it from scratch?

Sherry

Responses

Sherry,

I have heard one other report of a similar situation. I suspect the issue is related to the upgrade process itself. When performing the upgrade of the second JBoss ON server in the HA environment, a second storage node is installed. However, I believe that in the situation you describe, the second storage node did not get properly added to the storage cluster.

The problem is that both of your JBoss ON servers may actually have been reporting data to the two separate storage nodes. Therefore, it isn't advisable to simply remove the second node and re-install from scratch. Instead, we must assess the specific state of each node and take appropriate actions based on that state.

This would probably be best handled via a support case, but to make sure that we have the basics covered (a few example commands for these checks are sketched after the list):

  1. Make sure that the Endpoint Address value for each node, as listed in the storage node topology administration page, is resolvable from each of the JBoss ON servers and storage node host machines.
  2. Make sure that the Endpoint Addresses of both storage nodes are listed in <RHQ_SERVER_HOME>/rhq-storage/conf/rhq-storage-auth.conf. This file is found in both storage node installations.
  3. Make sure that the seeds: property in <RHQ_SERVER_HOME>/rhq-storage/conf/cassandra.yaml contains the Endpoint Addresses of both storage nodes. Again, this file is found in both storage node installations.
  4. Verify that when you start up a storage node, you see the other storage node join its cluster. This can be seen by reviewing <RHQ_SERVER_HOME>/logs/rhq-storage.log for the storage node that is running and looking for the following message while the other storage node is starting up:

    Handshaking version with /192.168.1.2
    

    where 192.168.1.2 in the above example would be the IP or host address of the other storage node.

  5. With storage node 1 running and storage node 2 shut down, restart JBoss ON server 1. Review its log during startup to see if it produces any errors or warnings related to communicating with the storage cluster.

  6. With storage node 2 running and storage node 1 shut down, restart JBoss ON server 1. Again, review its log during startup to see if it produces any errors or warnings related to communicating with the storage cluster.
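
To illustrate the checks above, here are a few example shell commands; the host names and addresses are placeholders, and the exact file and log locations may differ slightly in your installation:

# 1. confirm each node's Endpoint Address resolves from every JBoss ON server and storage host
getent hosts storage-node-1.example.com
getent hosts storage-node-2.example.com

# 2. and 3. confirm both Endpoint Addresses appear in the auth file and in the seeds: property
cat <RHQ_SERVER_HOME>/rhq-storage/conf/rhq-storage-auth.conf
grep "seeds" <RHQ_SERVER_HOME>/rhq-storage/conf/cassandra.yaml

# 4. watch for the handshake message while the other storage node starts up
tail -f <RHQ_SERVER_HOME>/logs/rhq-storage.log | grep "Handshaking version"

# 5. and 6. after restarting the JBoss ON server, scan its log for storage-related warnings or errors
grep -iE "warn|error" <RHQ_SERVER_HOME>/logs/server.log | grep -i storage

# optional: check cluster membership as seen by a running storage node (same JMX port as used earlier)
nodetool -p 7299 status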

You can reference this response in your ticket with Red Hat Global Support Services.

Kind regards,
Larry O'Leary
Red Hat Global Support Services
https://access.redhat.com/site/support/contact/technicalSupport/

Thanks a lot, Larry.

I'm working with a Red Hat Global Support Engineer on this issue. Yesterday, we updated the replication_factor from 1 to 2, which resolved the cluster data replication issue.

ALTER KEYSPACE rhq WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};
ALTER KEYSPACE system_auth WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};
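
If I understand the Cassandra behavior correctly, raising the replication factor only applies to new writes until a repair runs, so we still need to repair each node for the existing data to be copied to the 2nd node, along the lines of the commands we ran earlier (same JMX port as above):

nodetool -p 7299 repair rhq
nodetool -p 7299 repair system_auth
nodetool -p 7299 status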

We're still working on bringing the 2nd node back to "NORMAL" state.

Sherry