How to recover a galera cluster?

Issue

How to recover galera
How to restart galera

[root@os-controller1 ~]# pcs status | grep -i galera -A2
 ip-galera-pub-10.144.70.106    (ocf::heartbeat:IPaddr2):       Started pcmk-os-controller1
 Master/Slave Set: galera-master [galera]
     Masters: [ pcmk-os-controller1 pcmk-os-controller3 ]
     Slaves: [ pcmk-os-controller2 ]
--
 galera_monitor_10000 on pcmk-os-controller1 'unknown error' (1): call=343, status=complete, exitreason='local node <pcmk-os-controller1> is started, but not in primary mode. Unknown state.',
    last-rc-change='Mon Dec  7 13:23:47 2015', queued=0ms, exec=0ms

The galera cluster suffers from network communication issues, which either:
1. cause the loss of galera quorum on a node, making pacemaker stop the node, or
2. prevent a node from starting because it cannot connect to the galera cluster.

For instance, the cluster was last bootstrapped on controller-3:

  151203 13:46:16 [Note] WSREP: gcomm: connecting to group 'galera_cluster', peer ''

Examples of network issues in the logs on controller-3 follow:

151203 13:46:23 [Note] WSREP: (47ccc208-9a07-11e5-8fac-337607ce7f52, 'ssl://0.0.0.0:4567') turning message relay requesting on, nonlive peers: ssl://10.144.70.131:4567 
151203 13:46:38 [Warning] WSREP: Send action {(nil), 409, TORDERED} returned -107 (Transport endpoint is not connected)
151203 13:46:40 [Warning] WSREP: Quorum: No node with complete state:

When the node can no longer contact the other nodes, it considers itself "partitioned", without "galera quorum". In galera, this is called being in a "Non-Primary partition":

151203 14:01:23 [Note] WSREP: New cluster view: global state: e7c57526-3194-11e5-b8a8-9237c5f89c96:24216772, view# -1: non-Primary, number of nodes: 2, my index: 0, protocol version 3
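
The state reported in a "New cluster view" line can also be extracted from a log programmatically. The following is a minimal Python sketch, not part of the original diagnosis: the function name parse_cluster_view and the regular expression are illustrative assumptions modeled only on the single log line shown above, and other galera versions may format this line differently.

```python
import re

# Pattern modeled on the "New cluster view" log line quoted in this article.
# Assumption: the field order and separators match that single example.
VIEW_RE = re.compile(
    r"New cluster view: global state: (?P<uuid>[0-9a-f-]+):(?P<seqno>-?\d+), "
    r"view# (?P<view>-?\d+): (?P<status>[\w-]+), "
    r"number of nodes: (?P<nodes>\d+), my index: (?P<index>-?\d+)"
)


def parse_cluster_view(line):
    """Return a dict describing the cluster view, or None if the line does not match."""
    m = VIEW_RE.search(line)
    if m is None:
        return None
    return {
        "state_uuid": m.group("uuid"),
        "seqno": int(m.group("seqno")),
        "view": int(m.group("view")),
        # True only when the status field reads "Primary" (case-insensitive).
        "primary": m.group("status").lower() == "primary",
        "nodes": int(m.group("nodes")),
        "index": int(m.group("index")),
    }


# The log line from the article, split across string literals for readability.
line = (
    "151203 14:01:23 [Note] WSREP: New cluster view: global state: "
    "e7c57526-3194-11e5-b8a8-9237c5f89c96:24216772, view# -1: non-Primary, "
    "number of nodes: 2, my index: 0, protocol version 3"
)
view = parse_cluster_view(line)
print(view["primary"], view["nodes"])
```

A node whose most recent view has "primary" set to False is in the non-Primary state described above.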

This is a fatal condition for the galera resource agent, so pacemaker stops the node. From the crm_mon error log history:

* galera_monitor_10000 on pcmk-os-controller3 'unknown error' (1): call=616, status=complete, exitreason='local node <pcmk-os-controller3> is started, but not in primary mode. Unknown state.',
    last-rc-change='Thu Dec  3 14:01:24 2015', queued=0ms, exec=0ms

controller-2 failed to start due to a network port already in use:

151203 13:57:31 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():202: Failed to open backend connection: -98 (Address already in use)
151203 13:57:31 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1291: Failed to open channel 'galera_cluster' at 'gcomm://pcmk-os-controller1,pcmk-os-controller2,pcmk-os-controller3': -98 (Address already in use)
151203 13:57:31 [ERROR] WSREP: gcs connect failed: Address already in use
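
To survey which transport errors a node has logged, the negative return codes in WSREP messages can be collected and counted. The sketch below is illustrative only: the helper name wsrep_error_codes and its regular expression are assumptions derived from the "-98 (Address already in use)" and "-107 (Transport endpoint is not connected)" patterns in the excerpts above, not from any galera tooling.

```python
import re
from collections import Counter

# Matches e.g. "-98 (Address already in use)" inside a WSREP log line.
# Assumption: error codes always appear as "-<number> (<description>)".
CODE_RE = re.compile(r"WSREP:.*?(?P<code>-\d+) \((?P<text>[^)]+)\)")


def wsrep_error_codes(lines):
    """Count (code, description) pairs found in WSREP log lines."""
    counts = Counter()
    for line in lines:
        m = CODE_RE.search(line)
        if m:
            counts[(int(m.group("code")), m.group("text"))] += 1
    return counts


# Log lines copied from the excerpts in this article.
sample = [
    "151203 13:46:38 [Warning] WSREP: Send action {(nil), 409, TORDERED} "
    "returned -107 (Transport endpoint is not connected)",
    "151203 13:57:31 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():202: "
    "Failed to open backend connection: -98 (Address already in use)",
    "151203 13:57:31 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1291: Failed to open "
    "channel 'galera_cluster' at "
    "'gcomm://pcmk-os-controller1,pcmk-os-controller2,pcmk-os-controller3': "
    "-98 (Address already in use)",
    "151203 13:57:31 [ERROR] WSREP: gcs connect failed: Address already in use",
]
codes = wsrep_error_codes(sample)
for (code, text), count in sorted(codes.items()):
    print(code, text, count)
```

Lines that carry no numeric code, such as the final "gcs connect failed" message, are skipped.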

Environment

Red Hat OpenStack 6.0
