How to recover a Galera cluster?
Issue
- How to recover Galera
- How to restart Galera
[root@os-controller1 ~]# pcs status | grep -i galera -A2
ip-galera-pub-10.144.70.106 (ocf::heartbeat:IPaddr2): Started pcmk-os-controller1
Master/Slave Set: galera-master [galera]
Masters: [ pcmk-os-controller1 pcmk-os-controller3 ]
Slaves: [ pcmk-os-controller2 ]
--
galera_monitor_10000 on pcmk-os-controller1 'unknown error' (1): call=343, status=complete, exitreason='local node <pcmk-os-controller1> is started, but not in primary mode. Unknown state.',
last-rc-change='Mon Dec 7 13:23:47 2015', queued=0ms, exec=0ms
The Galera cluster suffers from network communication issues, which either:
- cause a node to lose Galera quorum, so that Pacemaker stops the node, or
- prevent a node from starting because it cannot connect to the Galera cluster.
For instance, the cluster was last bootstrapped on controller-3:
151203 13:46:16 [Note] WSREP: gcomm: connecting to group 'galera_cluster', peer ''
- Examples of network issues in the logs on controller-3 follow:
151203 13:46:23 [Note] WSREP: (47ccc208-9a07-11e5-8fac-337607ce7f52, 'ssl://0.0.0.0:4567') turning message relay requesting on, nonlive peers: ssl://10.144.70.131:4567
151203 13:46:38 [Warning] WSREP: Send action {(nil), 409, TORDERED} returned -107 (Transport endpoint is not connected)
151203 13:46:40 [Warning] WSREP: Quorum: No node with complete state:
- When the node can no longer contact the other nodes, it considers itself partitioned, without Galera quorum. In Galera, this is called being in a "Non-Primary partition":
151203 14:01:23 [Note] WSREP: New cluster view: global state: e7c57526-3194-11e5-b8a8-9237c5f89c96:24216772, view# -1: non-Primary, number of nodes: 2, my index: 0, protocol version 3
- This is a fatal condition for the galera resource agent, so Pacemaker stops the node. From the crm_mon error log history:
* galera_monitor_10000 on pcmk-os-controller3 'unknown error' (1): call=616, status=complete, exitreason='local node <pcmk-os-controller3> is started, but not in primary mode. Unknown state.',
last-rc-change='Thu Dec 3 14:01:24 2015', queued=0ms, exec=0ms
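To confirm from the logs whether a node has dropped to a Non-Primary partition, the most recent "New cluster view" line can be extracted and its state read. The following is only a minimal sketch: the helper name last_view_state and the log path in the comment are assumptions, not part of the original procedure.

```shell
#!/bin/bash
# Sketch: report the state of the most recent Galera cluster view in a log.
# last_view_state is a hypothetical helper name; it reads log text on stdin.
last_view_state() {
    # Keep only view-change lines, take the newest one, and extract the word
    # that follows "view# N:" (for example "Primary" or "non-Primary").
    grep 'WSREP: New cluster view' \
        | tail -n 1 \
        | sed -E 's/.*view# -?[0-9]+: ([A-Za-z-]+),.*/\1/'
}

# Example (the log path is an assumption; adjust it to your deployment):
#   last_view_state < /var/log/mariadb/mariadb.log
```

On a node where the database is still running and a local client login works, the same state can usually be read live from the wsrep_cluster_status status variable, for example with mysql -e "SHOW STATUS LIKE 'wsrep_cluster_status'".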
- controller-2 failed to start due to a network port already in use:
151203 13:57:31 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():202: Failed to open backend connection: -98 (Address already in use)
151203 13:57:31 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1291: Failed to open channel 'galera_cluster' at 'gcomm://pcmk-os-controller1,pcmk-os-controller2,pcmk-os-controller3': -98 (Address already in use)
151203 13:57:31 [ERROR] WSREP: gcs connect failed: Address already in use
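The WSREP warnings and errors above carry a numeric error code followed by a message in parentheses, such as -98 (Address already in use) or -107 (Transport endpoint is not connected). As a rough triage aid, the distinct codes in a log can be listed with a small filter. This is a sketch only; wsrep_error_summary is a hypothetical helper name and the log path in the comment is an assumption.

```shell
#!/bin/bash
# Sketch: list the distinct numeric error codes in WSREP warnings and errors.
# wsrep_error_summary is a hypothetical helper name; it reads log text on stdin.
wsrep_error_summary() {
    # Select WSREP warning/error lines, then print "<code> <message>" for each
    # line containing a pattern such as "-98 (Address already in use)".
    # Lines without a numeric code are skipped; duplicates are collapsed.
    grep -E '\[(ERROR|Warning)\] WSREP:' \
        | sed -nE 's/.* (-[0-9]+) \(([^)]*)\).*/\1 \2/p' \
        | sort -u
}

# Example (the log path is an assumption; adjust it to your deployment):
#   wsrep_error_summary < /var/log/mariadb/mariadb.log
```

For the -98 case, a reasonable next check is which process is already bound to the Galera replication port (4567 in the logs above), for example with ss -tlnp filtered on that port.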
Environment
- Red Hat OpenStack 6.0