RHEL Pacemaker Cluster SAP HANA Resource Showing WAITING4PRIM Status


Hi, I am facing an issue in my RHEL Pacemaker scale-up cluster for SAP HANA.

Has anyone faced a similar issue after a failover?

[root@prd1 ~]# pcs status --full
Cluster name: Sap_hana
Cluster Summary:
* Stack: corosync
* Current DC: prd2.poc.hana.com (2) (version 2.0.5-9.el8_4.1-ba59be7122) - partition with quorum
* Last updated: Mon Sep 13 22:34:11 2021
* Last change: Mon Sep 13 18:13:10 2021 by root via cibadmin on prd1.poc.hana.com
* 2 nodes configured
* 6 resource instances configured

Node List:
* Online: [ prd1.poc.hana.com (1) prd2.poc.hana.com (2) ]

Full List of Resources:
* vmfence (stonith:fence_vmware_rest): Started prd1.poc.hana.com
* Clone Set: SAPHanaTopology_PRD_00-clone [SAPHanaTopology_PRD_00]:
* SAPHanaTopology_PRD_00 (ocf::heartbeat:SAPHanaTopology): Started prd2.poc.hana.com
* SAPHanaTopology_PRD_00 (ocf::heartbeat:SAPHanaTopology): Started prd1.poc.hana.com
* Clone Set: SAPHana_PRD_00-clone [SAPHana_PRD_00] (promotable):
* SAPHana_PRD_00 (ocf::heartbeat:SAPHana): FAILED prd2.poc.hana.com (Monitoring)
* SAPHana_PRD_00 (ocf::heartbeat:SAPHana): Slave prd1.poc.hana.com
* vip_PRD_00 (ocf::heartbeat:IPaddr2): Started prd2.poc.hana.com

Node Attributes:
* Node: prd1.poc.hana.com (1):
* hana_prd_clone_state : WAITING4PRIM
* hana_prd_op_mode : logreplay
* hana_prd_remoteHost : prd2.poc.hana.com
* hana_prd_roles : 1:S:master1::worker:
* hana_prd_site : PR
* hana_prd_srmode : syncmem
* hana_prd_sync_state : SFAIL
* hana_prd_version : 2.00.055.00.1615413201
* hana_prd_vhost : prd1.poc.hana.com
* lpa_prd_lpt : 10
* master-SAPHana_PRD_00 : -INFINITY
* Node: prd2.poc.hana.com (2):
* hana_prd_clone_state : DEMOTED
* hana_prd_op_mode : logreplay
* hana_prd_remoteHost : prd1.poc.hana.com
* hana_prd_roles : 1:N:master1::worker:
* hana_prd_site : DR
* hana_prd_srmode : syncmem
* hana_prd_sync_state : SFAIL
* hana_prd_version : 2.00.055.00.1615413201
* hana_prd_vhost : prd2.poc.hana.com
* lpa_prd_lpt : 10
* master-SAPHana_PRD_00 : 0

Migration Summary:
* Node: prd2.poc.hana.com (2):
* SAPHana_PRD_00: migration-threshold=1000000 fail-count=114 last-failure='Mon Sep 13 22:32:56 2021'

Failed Resource Actions:
* SAPHana_PRD_00_monitor_61000 on prd2.poc.hana.com 'error' (1): call=706, status='complete', exitreason='', last-rc-change='2021-09-13 22:32:56 +08:00', queued=0ms, exec=3668ms

Tickets:

PCSD Status:
prd1.poc.hana.com: Online
prd2.poc.hana.com: Online

Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled

Responses

Was it working before the failover? I mean, a healthy cluster is denoted by several bits from:

crm_mon -A1

Some important bits:

Clone set:  SAPHana_{SID}_{Instance}-clone [SAPHana_{SID}_{Instance}] (promotable)
...
* Node node1.domain:
    + hana_{sid}_clone_state              : PROMOTED  
    + hana_{sid}_op_mode                  : logreplay 
    + hana_{sid}_remoteHost               : node2
    + hana_{sid}_roles                    : 4:P:master1:master:worker:master
    + hana_{sid}_srmode                   : syncmem   
    + hana_{sid}_sync_state               : PRIM      
    + master-SAPHana_{SID}_{Instance}             : 150   
...    
* Node node2.domain:
    + hana_{sid}_clone_state              : DEMOTED   
    + hana_{sid}_op_mode                  : logreplay 
    + hana_{sid}_remoteHost               : node1
    + hana_{sid}_roles                    : 4:S:master1:master:worker:master
    + hana_{sid}_srmode                   : syncmem   
    + hana_{sid}_sync_state               : SOK       
    + master-SAPHana_{SID}_{Instance}             : 100  
...

You should have:

1. The hana_{sid}_roles attribute starting with 4 on each node.
2. master-SAPHana_{SID}_{Instance} scores:
   a. 100 = slave
   b. 140 = master
   c. 150 = promoted master
3. A sync state of SOK (secondary) or PRIM (primary).
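As an illustration of how to read the hana_{sid}_roles string, here is a small, hypothetical shell snippet (not part of the SAPHanaSR tooling) that splits a roles value into its fields; the first field is the landscape state, where 4 means all services are up, and the second is the replication role (P, S, or N):

```shell
#!/bin/bash
# Split a hana_<sid>_roles attribute value into its fields.
# Field 1: landscape state (4 = all services up, 1 = down/starting)
# Field 2: replication role (P = primary, S = secondary, N = none/unregistered)
roles="4:P:master1:master:worker:master"
IFS=: read -r state srrole _rest <<< "$roles"
if [ "$state" = "4" ]; then
  echo "state=$state srrole=$srrole healthy=yes"
else
  echo "state=$state srrole=$srrole healthy=no"
fi
```

In your output, prd1 shows `1:S:master1::worker:` and prd2 shows `1:N:master1::worker:`, i.e. neither node reports state 4 and prd2 has lost its replication role entirely.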

To fix an issue like the one yours displays, I would:

1. disable SAPHana_{SID}_{Instance} and SAPHanaTopology_{SID}_{Instance}
2. verify sync in studio
3. HDB stop
4. clean shared memory
5. enable SAPHana_{SID}_{Instance} and SAPHanaTopology_{SID}_{Instance}
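For reference, the steps above map roughly to the commands below. This is a sketch only: SID PRD and instance number 00 are taken from the status output in this thread, `prdadm` is the assumed <sid>adm OS user, and `cleanipc` is SAP's shared-memory cleanup tool. Adjust the names to your system before running anything.

```shell
# 1. Take the HANA resources out of cluster control
pcs resource disable SAPHana_PRD_00-clone
pcs resource disable SAPHanaTopology_PRD_00-clone

# 2. Verify replication in HANA Studio (or on the command line as prdadm)
su - prdadm -c "hdbnsutil -sr_state"

# 3. Stop the instance
su - prdadm -c "HDB stop"

# 4. Clean leftover shared-memory segments for instance 00
su - prdadm -c "cleanipc 00 remove"

# 5. Hand the resources back to the cluster
pcs resource enable SAPHanaTopology_PRD_00-clone
pcs resource enable SAPHana_PRD_00-clone
```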

Hi John, thanks for your reply. I have tried the steps you suggested and disabled the cluster resources. HANA is up on one node, but the other node has an issue and is not coming up.

[root@prd2 ~]# pcs status
Cluster name: Sap_hana
Cluster Summary:
* Stack: corosync
* Current DC: prd2.poc.hana.com (version 2.0.5-9.el8_4.1-ba59be7122) - partition with quorum
* Last updated: Tue Sep 14 15:30:17 2021
* Last change: Tue Sep 14 11:34:07 2021 by root via cibadmin on prd1.poc.hana.com
* 2 nodes configured
* 6 resource instances configured (5 DISABLED)

Node List:
* Online: [ prd1.poc.hana.com prd2.poc.hana.com ]

Full List of Resources:
* vmfence (stonith:fence_vmware_rest): Stopped (disabled)
* Clone Set: SAPHanaTopology_PRD_00-clone [SAPHanaTopology_PRD_00]:
* Stopped (disabled): [ prd1.poc.hana.com prd2.poc.hana.com ]
* Clone Set: SAPHana_PRD_00-clone [SAPHana_PRD_00] (promotable):
* Stopped (disabled): [ prd1.poc.hana.com prd2.poc.hana.com ]
* vip_PRD_00 (ocf::heartbeat:IPaddr2): Started prd2.poc.hana.com

Failed Resource Actions:
* SAPHana_PRD_00_monitor_61000 on prd2.poc.hana.com 'error' (1): call=2932, status='complete', exitreason='', last-rc-change='2021-09-14 11:16:22 +08:00', queued=0ms, exec=4396ms

Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled

===========================

[root@prd2 ~]# crm_mon -A1
Cluster Summary:
* Stack: corosync
* Current DC: prd2.poc.hana.com (version 2.0.5-9.el8_4.1-ba59be7122) - partition with quorum
* Last updated: Tue Sep 14 15:30:49 2021
* Last change: Tue Sep 14 11:34:07 2021 by root via cibadmin on prd1.poc.hana.com
* 2 nodes configured
* 6 resource instances configured (5 DISABLED)

Node List:
* Online: [ prd1.poc.hana.com prd2.poc.hana.com ]

Active Resources:
* vip_PRD_00 (ocf::heartbeat:IPaddr2): Started prd2.poc.hana.com

Node Attributes:
* Node: prd1.poc.hana.com:
* hana_prd_clone_state : DEMOTED
* hana_prd_op_mode : logreplay
* hana_prd_remoteHost : prd2.poc.hana.com
* hana_prd_site : PR
* hana_prd_srmode : syncmem
* hana_prd_version : 2.00.055.00.1615413201
* hana_prd_vhost : prd1.poc.hana.com
* lpa_prd_lpt : 10
* Node: prd2.poc.hana.com:
* hana_prd_clone_state : UNDEFINED
* hana_prd_op_mode : logreplay
* hana_prd_remoteHost : prd1.poc.hana.com
* hana_prd_roles : 1:N:-:-:-:-
* hana_prd_site : DR
* hana_prd_srmode : syncmem
* hana_prd_sync_state : SFAIL
* hana_prd_version : 2.00.055.00.1615413201
* hana_prd_vhost : prd2.poc.hana.com
* lpa_prd_lpt : 10
* master-SAPHana_PRD_00 : -INFINITY

Failed Resource Actions:
* SAPHana_PRD_00_monitor_61000 on prd2.poc.hana.com 'error' (1): call=2932, status='complete', exitreason='', last-rc-change='2021-09-14 11:16:22 +08:00', queued=0ms, exec=4396ms

The HANA instance will need to be clean in Studio before it can be managed by the Pacemaker services. Is it up and synced when started manually with HDB start?
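To answer that question by hand, something like the following would show whether the failed instance comes up and how HANA itself sees the replication state. The SID PRD, instance 00, and the `prdadm` user are assumptions taken from the output earlier in this thread.

```shell
# As the <sid>adm OS user (prdadm for SID PRD), start the instance manually
su - prdadm -c "HDB start"
su - prdadm -c "HDB info"

# Check replication from HANA's point of view; a healthy secondary
# reports itself as registered against the primary site
su - prdadm -c "hdbnsutil -sr_state"

# Once HANA is healthy again, clear the stale failure history
# (fail-count=114 above) before re-enabling the cluster resources
pcs resource cleanup SAPHana_PRD_00
```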


Hi Ameriprise Unix Unix,

Besides John's great and much appreciated help, you may also want to open a support case.

I hope you are not facing a DNS 'leak', as you seem to use a DNS domain that is claimed by a Japanese online flower shop.

In general it is wise to check that you are not using someone else's domain name. It is always possible to cause trouble once your servers can reach a DNS server on the internet, directly or indirectly.

Regards,

Jan Gerrit Kootstra