How to recover a galera cluster?

Solution Verified - Updated -

Issue

  • How to recover galera
  • How to restart galera
[root@os-controller1 ~]# pcs status | grep -i galera -A2
 ip-galera-pub-10.144.70.106    (ocf::heartbeat:IPaddr2):       Started pcmk-os-controller1
 Master/Slave Set: galera-master [galera]
     Masters: [ pcmk-os-controller1 pcmk-os-controller3 ]
     Slaves: [ pcmk-os-controller2 ]
--
 galera_monitor_10000 on pcmk-os-controller1 'unknown error' (1): call=343, status=complete, exitreason='local node <pcmk-os-controller1> is started, but not in primary mode. Unknown state.',
    last-rc-change='Mon Dec  7 13:23:47 2015', queued=0ms, exec=0ms
  • The galera cluster suffers from network communication issues, which either:

    1. cause a node to lose galera quorum, so that pacemaker stops the node, or
    2. prevent a node from starting because it cannot connect to the galera cluster.
  • For instance, the cluster was last bootstrapped on controller-3 (note the empty peer list):

  151203 13:46:16 [Note] WSREP: gcomm: connecting to group 'galera_cluster', peer ''
  • Examples of network issues in the logs on controller-3 follow:
151203 13:46:23 [Note] WSREP: (47ccc208-9a07-11e5-8fac-337607ce7f52, 'ssl://0.0.0.0:4567') turning message relay requesting on, nonlive peers: ssl://10.144.70.131:4567 
151203 13:46:38 [Warning] WSREP: Send action {(nil), 409, TORDERED} returned -107 (Transport endpoint is not connected)
151203 13:46:40 [Warning] WSREP: Quorum: No node with complete state:
  • When the node can no longer contact the other nodes, it considers itself "partitioned", without "galera quorum". In galera, this is called being in a "Non-Primary partition":
151203 14:01:23 [Note] WSREP: New cluster view: global state: e7c57526-3194-11e5-b8a8-9237c5f89c96:24216772, view# -1: non-Primary, number of nodes: 2, my index: 0, protocol version 3
  • This is a fatal condition for the galera resource agent, so pacemaker stops the node. From crm_mon error log history:
* galera_monitor_10000 on pcmk-os-controller3 'unknown error' (1): call=616, status=complete, exitreason='local node <pcmk-os-controller3> is started, but not in primary mode. Unknown state.',
    last-rc-change='Thu Dec  3 14:01:24 2015', queued=0ms, exec=0ms
  • controller-2 failed to start due to a network port already in use:
151203 13:57:31 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():202: Failed to open backend connection: -98 (Address already in use)
151203 13:57:31 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1291: Failed to open channel 'galera_cluster' at 'gcomm://pcmk-os-controller1,pcmk-os-controller2,pcmk-os-controller3': -98 (Address already in use)
151203 13:57:31 [ERROR] WSREP: gcs connect failed: Address already in use
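
The two failure signatures shown above (a non-Primary cluster view, and error -98 when opening the galera channel) can be searched for in a node's log. The following is a minimal sketch, not a supported tool: the function name galera_log_triage and its output labels are hypothetical, and the log path in the usage comment is an assumption that varies by deployment.

```shell
# Sketch: read a mysqld/galera log on stdin and report which of the two
# failure signatures described in this article appear in it.
# The function name and the output labels are hypothetical.
galera_log_triage() {
    awk '
        # Signature 1: the node reported a non-Primary cluster view,
        # i.e. it lost galera quorum.
        /WSREP: New cluster view:.*non-Primary/ { np = 1 }

        # Signature 2: the galera channel could not be opened because
        # the address/port was already in use (errno 98).
        /WSREP: .*-98 \(Address already in use\)/ { inuse = 1 }

        END {
            if (np)    print "non-primary-view"
            if (inuse) print "address-in-use"
        }
    '
}

# Example usage (the log path is an assumption):
#   galera_log_triage < /var/log/mariadb/mariadb.log
```

On a live node, the same two conditions can usually be cross-checked directly: `SHOW STATUS LIKE 'wsrep_cluster_status'` in the mysql client reports whether the node is in a Primary component, and `ss -tlnp` lists which process is currently listening on port 4567.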

Environment

  • Red Hat OpenStack 6.0
