How to recover a Galera cluster?
Issue
- How to recover Galera
- How to restart Galera
[root@os-controller1 ~]# pcs status | grep -i galera -A2
ip-galera-pub-10.144.70.106 (ocf::heartbeat:IPaddr2): Started pcmk-os-controller1
Master/Slave Set: galera-master [galera]
Masters: [ pcmk-os-controller1 pcmk-os-controller3 ]
Slaves: [ pcmk-os-controller2 ]
--
galera_monitor_10000 on pcmk-os-controller1 'unknown error' (1): call=343, status=complete, exitreason='local node <pcmk-os-controller1> is started, but not in primary mode. Unknown state.',
last-rc-change='Mon Dec 7 13:23:47 2015', queued=0ms, exec=0ms
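The failed action shown above can be pulled out of `pcs status` output with a small filter. This is only a minimal sketch: the function name `nodes_not_primary` is hypothetical, and it assumes the exit reason keeps the exact wording shown in the output above.

```shell
# Hypothetical helper: read `pcs status` output on stdin and print each node
# whose galera monitor reported "is started, but not in primary mode".
nodes_not_primary() {
    sed -n "s/.*exitreason='local node <\([^>]*\)> is started, but not in primary mode.*/\1/p" | sort -u
}

# Usage on a controller:
#   pcs status | nodes_not_primary
```

Against the output above, this would print the node named in the exit reason, once per affected node.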
The Galera cluster suffers from network communication issues, which either:
- cause a node to lose Galera quorum, so that Pacemaker stops the node, or
- prevent a node from starting because it cannot connect to the Galera cluster.
For instance, the cluster was last bootstrapped on controller-3:
151203 13:46:16 [Note] WSREP: gcomm: connecting to group 'galera_cluster', peer ''
- Examples of network issues in the logs on controller-3 follow:
151203 13:46:23 [Note] WSREP: (47ccc208-9a07-11e5-8fac-337607ce7f52, 'ssl://0.0.0.0:4567') turning message relay requesting on, nonlive peers: ssl://10.144.70.131:4567
151203 13:46:38 [Warning] WSREP: Send action {(nil), 409, TORDERED} returned -107 (Transport endpoint is not connected)
151203 13:46:40 [Warning] WSREP: Quorum: No node with complete state:
- When the node can no longer contact the other nodes, it considers itself partitioned, without Galera quorum. Galera calls this being in a "Non-Primary partition":
151203 14:01:23 [Note] WSREP: New cluster view: global state: e7c57526-3194-11e5-b8a8-9237c5f89c96:24216772, view# -1: non-Primary, number of nodes: 2, my index: 0, protocol version 3
- This is a fatal condition for the Galera resource agent, so Pacemaker stops the node. From the crm_mon error log history:
* galera_monitor_10000 on pcmk-os-controller3 'unknown error' (1): call=616, status=complete, exitreason='local node <pcmk-os-controller3> is started, but not in primary mode. Unknown state.',
last-rc-change='Thu Dec 3 14:01:24 2015', queued=0ms, exec=0ms
- controller-2 failed to start due to a network port already in use:
151203 13:57:31 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():202: Failed to open backend connection: -98 (Address already in use)
151203 13:57:31 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1291: Failed to open channel 'galera_cluster' at 'gcomm://pcmk-os-controller1,pcmk-os-controller2,pcmk-os-controller3': -98 (Address already in use)
151203 13:57:31 [ERROR] WSREP: gcs connect failed: Address already in use
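The two log symptoms above (a non-Primary cluster view and a bind failure) can be checked from the shell. The following is a minimal sketch under stated assumptions: both function names are hypothetical, the log format is assumed to match the excerpts above, `ss` is assumed to be available from iproute2, and port 4567 is taken from the peer addresses in the logs, since the bind error itself does not name the port.

```shell
# Hypothetical helper: read MariaDB/Galera log lines on stdin and print the
# partition state ("Primary" or "non-Primary") of every "New cluster view".
cluster_view_states() {
    sed -n 's/.*WSREP: New cluster view: .*view# *-\{0,1\}[0-9]*: \([A-Za-z-]*\),.*/\1/p'
}

# Hypothetical helper: read `ss -tlnp` output on stdin and keep only the
# listeners whose local address ends in the given port (default 4567,
# an assumption based on the peer addresses in the logs above).
galera_port_listeners() {
    awk -v port="${1:-4567}" '$4 ~ (":" port "$")'
}

# Usage on a controller (the log path depends on the deployment):
#   cluster_view_states < /path/to/mariadb.log | tail -n 1
#   ss -tlnp | galera_port_listeners
```

On a running node, the live equivalent of the first check is the `wsrep_cluster_status` status variable, for example `mysql -e "SHOW STATUS LIKE 'wsrep_cluster_status'"`, assuming the local client is able to authenticate.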
Environment
- Red Hat OpenStack 6.0
