How to recover a galera cluster?


Issue

  • How to recover galera
  • How to restart galera
[root@os-controller1 ~]# pcs status | grep -i galera -A2
 ip-galera-pub-10.144.70.106    (ocf::heartbeat:IPaddr2):       Started pcmk-os-controller1
 Master/Slave Set: galera-master [galera]
     Masters: [ pcmk-os-controller1 pcmk-os-controller3 ]
     Slaves: [ pcmk-os-controller2 ]
--
 galera_monitor_10000 on pcmk-os-controller1 'unknown error' (1): call=343, status=complete, exitreason='local node <pcmk-os-controller1> is started, but not in primary mode. Unknown state.',
    last-rc-change='Mon Dec  7 13:23:47 2015', queued=0ms, exec=0ms
  • The galera cluster suffers from network communication issues, which either:

    1. cause loss of galera quorum on a node, so that pacemaker stops the node, or
    2. prevent a node from starting because it cannot connect to the galera cluster.
  • For instance, the cluster was last bootstrapped on controller-3 (the empty peer list in the log line indicates a bootstrap):

  151203 13:46:16 [Note] WSREP: gcomm: connecting to group 'galera_cluster', peer ''
  • Examples of network issues in the logs on controller-3 follow:
151203 13:46:23 [Note] WSREP: (47ccc208-9a07-11e5-8fac-337607ce7f52, 'ssl://0.0.0.0:4567') turning message relay requesting on, nonlive peers: ssl://10.144.70.131:4567 
151203 13:46:38 [Warning] WSREP: Send action {(nil), 409, TORDERED} returned -107 (Transport endpoint is not connected)
151203 13:46:40 [Warning] WSREP: Quorum: No node with complete state:
  • When the node can no longer contact the other nodes, it considers itself partitioned, without galera quorum. In galera, this is called being in a "Non-Primary partition":
151203 14:01:23 [Note] WSREP: New cluster view: global state: e7c57526-3194-11e5-b8a8-9237c5f89c96:24216772, view# -1: non-Primary, number of nodes: 2, my index: 0, protocol version 3
  • This is a fatal condition for the galera resource agent, so pacemaker stops the node. From the crm_mon error log history:
* galera_monitor_10000 on pcmk-os-controller3 'unknown error' (1): call=616, status=complete, exitreason='local node <pcmk-os-controller3> is started, but not in primary mode. Unknown state.',
    last-rc-change='Thu Dec  3 14:01:24 2015', queued=0ms, exec=0ms
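A node that has dropped into a Non-Primary partition can be spotted by scanning the MySQL log for "New cluster view" lines. The following is a minimal Python sketch, not part of the original solution: the regular expression is derived only from the single log line quoted above, and the function name `parse_cluster_view` and the field names it returns are hypothetical. Other galera versions may format this line differently.

```python
import re

# Pattern derived from the "New cluster view" line quoted in this article.
# Assumption: Primary views use the same layout with "Primary" in place of
# "non-Primary".
VIEW_RE = re.compile(
    r"WSREP: New cluster view: global state: "
    r"(?P<uuid>[0-9a-f-]+):(?P<seqno>-?\d+), "
    r"view# (?P<view>-?\d+): (?P<status>Primary|non-Primary), "
    r"number of nodes: (?P<nodes>\d+), my index: (?P<index>-?\d+)"
)

# The log line from controller-3 shown above.
SAMPLE = (
    "151203 14:01:23 [Note] WSREP: New cluster view: global state: "
    "e7c57526-3194-11e5-b8a8-9237c5f89c96:24216772, view# -1: non-Primary, "
    "number of nodes: 2, my index: 0, protocol version 3"
)


def parse_cluster_view(line):
    """Return the fields of a 'New cluster view' log line, or None."""
    m = VIEW_RE.search(line)
    if m is None:
        return None
    return {
        "state_uuid": m.group("uuid"),
        "seqno": int(m.group("seqno")),
        "view_id": int(m.group("view")),
        "primary": m.group("status") == "Primary",
        "nodes": int(m.group("nodes")),
        "my_index": int(m.group("index")),
    }


if __name__ == "__main__":
    view = parse_cluster_view(SAMPLE)
    if view is not None and not view["primary"]:
        print("node is in a Non-Primary partition; last seqno:", view["seqno"])
```

Running the same function over every line of the log and keeping the last match gives the most recent view the node reported.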
  • controller-2 failed to start due to a network port already in use:
151203 13:57:31 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():202: Failed to open backend connection: -98 (Address already in use)
151203 13:57:31 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1291: Failed to open channel 'galera_cluster' at 'gcomm://pcmk-os-controller1,pcmk-os-controller2,pcmk-os-controller3': -98 (Address already in use)
151203 13:57:31 [ERROR] WSREP: gcs connect failed: Address already in use
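The negative numbers in the WSREP messages above (-98, -107) appear to be negated Linux errno values, as the text in parentheses suggests. The sketch below, in Python, extracts the code from such a line and looks up its symbolic name. It is an illustration under that assumption: the helper name `decode_wsrep_errno` is hypothetical, and the symbolic names are only meaningful when the script runs on Linux, since errno numbering differs between platforms.

```python
import errno
import os
import re

# Matches "-<number> (<message>)" as seen in the WSREP lines quoted above.
# The lookbehind avoids matching dashes inside UUIDs or host names.
ERR_RE = re.compile(r"(?<![\w.])-(?P<code>\d+) \((?P<text>[^)]+)\)")

# Log lines quoted in this article.
SAMPLE_BIND = (
    "151203 13:57:31 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():202: "
    "Failed to open backend connection: -98 (Address already in use)"
)
SAMPLE_SEND = (
    "151203 13:46:38 [Warning] WSREP: Send action {(nil), 409, TORDERED} "
    "returned -107 (Transport endpoint is not connected)"
)


def decode_wsrep_errno(line):
    """Return the errno found in a WSREP log line, or None if there is none."""
    m = ERR_RE.search(line)
    if m is None:
        return None
    code = int(m.group("code"))
    return {
        "code": code,
        # Symbolic name on the local platform (assumed to be Linux).
        "name": errno.errorcode.get(code, "UNKNOWN"),
        "logged_text": m.group("text"),
        "os_text": os.strerror(code),
    }


if __name__ == "__main__":
    for sample in (SAMPLE_BIND, SAMPLE_SEND):
        print(decode_wsrep_errno(sample))
```

On Linux, code 98 corresponds to EADDRINUSE, which matches the "Address already in use" failure on controller-2, and code 107 corresponds to ENOTCONN.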

Environment

  • Red Hat OpenStack 6.0
