pacemaker subprocess pengine terminated
Environment:
OS: 7.5
pacemaker: 1.1.19-8.el7_6.1.x86_64
corosync: 2.4.3-2.el7_5.1.x86_64
An 8-node setup; one of the resources is PostgreSQL 10 with one standby node.
main db node: DB1
secondary db node: DB2
Issue:
PostgreSQL stopped suddenly.
In the corosync logs of DB1:
Dec 17 13:26:08 [435476] DB1 crmd: error: crm_ipc_read: Connection to pengine failed
Dec 17 13:26:08 [435476] DB1 crmd: error: mainloop_gio_callback: Connection to pengine[0x55e53f095dc0] closed (I/O condition=25)
Dec 17 13:26:08 [435476] DB1 crmd: crit: pe_ipc_destroy: Connection to the Policy Engine failed | pid=-1 uuid=562cd7f3-9093-4626-adec-cd8b376e1b13
Dec 17 13:26:08 [435476] DB1 crmd: info: register_fsa_error_adv: Resetting the current action list
Dec 17 13:26:09 [435459] DB1 pacemakerd: warning: pcmk_child_exit: The pengine process (435475) terminated with signal 9 (core=0)
Dec 17 13:26:09 [435476] DB1 crmd: notice: save_cib_contents: Saved Cluster Information Base to /var/lib/pacemaker/pengine/pe-core-562cd7f3-9093-4626-adec-cd8b376e1b13.bz2 after Policy Engine crash
Dec 17 13:26:09 [435459] DB1 pacemakerd: notice: pcmk_process_exit: Respawning failed child process: pengine
Dec 17 13:26:09 [435459] DB1 pacemakerd: info: start_child: Using uid=189 and group=189 for process pengine
Dec 17 13:26:09 [435459] DB1 pacemakerd: info: start_child: Forked child 126688 for process pengine
Dec 17 13:26:09 [435459] DB1 pacemakerd: info: mcp_cpg_deliver: Ignoring process list sent by peer for local node
Dec 17 13:26:09 [435459] DB1 pacemakerd: info: mcp_cpg_deliver: Ignoring process list sent by peer for local node
Dec 17 13:26:09 [126688] DB1 pengine: info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores
Dec 17 13:26:09 [126688] DB1 pengine: info: qb_ipcs_us_publish: server name: pengine
Dec 17 13:26:09 [126688] DB1 pengine: info: main: Starting pengine
Dec 17 13:26:09 [435476] DB1 crmd: error: do_log: Input I_ERROR received in state S_POLICY_ENGINE from save_cib_contents
Dec 17 13:26:09 [435471] DB1 cib: info: cib_process_ping: Reporting our current digest to DB1: 9a96a41da603b14d0edab0b6bc962009 for 3.1248.21817663 (0x56045abbe830 0)
Dec 17 13:26:09 [435476] DB1 crmd: warning: do_state_transition: State transition S_POLICY_ENGINE -> S_RECOVERY | input=I_ERROR cause=C_FSA_INTERNAL origin=save_cib_contents
Dec 17 13:26:09 [435476] DB1 crmd: warning: do_recover: Fast-tracking shutdown in response to errors
Dec 17 13:26:09 [435476] DB1 crmd: warning: do_election_vote: Not voting in election, we're in state S_RECOVERY
Dec 17 13:26:09 [435476] DB1 crmd: info: do_dc_release: DC role released
Dec 17 13:26:09 [435476] DB1 crmd: info: do_te_control: Transitioner is now inactive
Dec 17 13:26:09 [435476] DB1 crmd: error: do_log: Input I_TERMINATE received in state S_RECOVERY from do_recover
Dec 17 13:26:09 [435476] DB1 crmd: info: do_state_transition: State transition S_RECOVERY -> S_TERMINATE | input=I_TERMINATE cause=C_FSA_INTERNAL origin=do_recover
Dec 17 13:26:09 [435476] DB1 crmd: info: do_shutdown: Disconnecting STONITH...
Dec 17 13:26:09 [435476] DB1 crmd: info: tengine_stonith_connection_destroy: Fencing daemon disconnected
Dec 17 13:26:09 [435476] DB1 crmd: info: stop_recurring_actions: Cancelling op 420 for AsapIP (AsapIP:420)
Dec 17 13:26:09 [435473] DB1 lrmd: info: cancel_recurring_action: Cancelling ocf operation AsapIP_monitor_10000
Dec 17 13:26:09 [435476] DB1 crmd: info: stop_recurring_actions: Cancelling op 378 for Ping (Ping:378)
The log states that pengine was killed by signal 9 (SIGKILL).
As checked, there is no crash report in /var/lib/pacemaker/cores.
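Signal 9 cannot be caught and never produces a core dump, so the empty cores directory is expected; a common cause of an unexplained SIGKILL is the kernel OOM killer. Below is a minimal sketch for scanning the kernel journal for OOM activity, assuming systemd-journald still retains kernel messages from the incident window (the match strings are only heuristics):

import subprocess

# Read kernel messages from the journal (-k); --no-pager keeps output plain.
# Assumption: the journal still covers the time of the pengine crash.
kernel_log = subprocess.run(
    ["journalctl", "-k", "--no-pager"],
    capture_output=True, text=True, check=False,
).stdout

# Print any line that looks like OOM-killer activity; an entry naming the
# pengine PID (435475 in the log above) would explain the signal 9.
for line in kernel_log.splitlines():
    lowered = line.lower()
    if "out of memory" in lowered or "oom" in lowered or "killed process" in lowered:
        print(line)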
Pacemaker did recover from this crash:
Dec 17 13:26:09 [126690] DB1 crmd: info: get_cluster_type: Verifying cluster type: 'corosync'
Dec 17 13:26:09 [126690] DB1 crmd: info: get_cluster_type: Assuming an active 'corosync' cluster
Dec 17 13:26:09 [126690] DB1 crmd: info: do_log: Input I_STARTUP received in state S_STARTING from crmd_init
Dec 17 13:26:09 [126690] DB1 crmd: info: do_cib_control: CIB connection established
Afterwards, another node, APP1, became the DC, as confirmed in the corosync logs of that node.
It took the following action:
Dec 17 13:26:12 [379816] APP1 pengine: notice: LogAction: * Promote Postgresql9:0 ( Slave -> Master DB2 )
Dec 17 13:26:12 [379816] APP1 pengine: info: LogActions: Leave Postgresql9:1 (Stopped)
As I understand it, the new DC tried to promote PostgreSQL on node DB2,
but before this action could complete, it took the following actions:
Dec 17 13:26:15 [379816] APP1 pengine: notice: LogAction: * Stop Postgresql9:0 ( Master DB1 ) due to node availability
Dec 17 13:26:15 [379816] APP1 pengine: notice: LogAction: * Promote Postgresql9:1 ( Slave -> Master DB2 )
It asked to stop PostgreSQL on DB1 and promote it to master on DB2.
But according to the corosync logs of DB1, it never tried to stop PostgreSQL when asked to by the DC, APP1.
After this, the same actions appear in every subsequent set of LogAction entries.
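The CIB snapshot saved when pengine crashed (the pe-core file named in the DB1 log above) and the pe-input files written by APP1's pengine record the cluster state each transition was computed from; they can be replayed with crm_simulate, or inspected directly. A minimal sketch of direct inspection, assuming the standard Pacemaker 1.1 CIB status layout (attribute names should be verified against the actual file):

import bz2
import xml.etree.ElementTree as ET

# Path taken from the DB1 log above; a pe-input-*.bz2 from APP1 has the same format.
PE_FILE = "/var/lib/pacemaker/pengine/pe-core-562cd7f3-9093-4626-adec-cd8b376e1b13.bz2"

with bz2.open(PE_FILE, "rt") as fh:
    cib = ET.parse(fh).getroot()

# Membership and join state for every node as the scheduler saw it.
for node in cib.iter("node_state"):
    print(node.get("uname"), "in_ccm=%s" % node.get("in_ccm"),
          "crmd=%s" % node.get("crmd"), "join=%s" % node.get("join"),
          "expected=%s" % node.get("expected"))

# Recorded resource operation history (lrm), including the last known
# result codes for the Postgresql9 clone instances.
for lrm_rsc in cib.iter("lrm_resource"):
    for op in lrm_rsc.iter("lrm_rsc_op"):
        print(lrm_rsc.get("id"), op.get("operation"), "rc=%s" % op.get("rc-code"))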
Unfortunately, I don't have /var/log/messages.
I have three questions:
a) What could be the possible reasons that pengine was killed forcefully (signal 9)?
b) Even though pengine was killed and Pacemaker was able to recover, why did the new DC try to stop PostgreSQL on DB1?
c) Even after the DC scheduled the action to stop PostgreSQL on DB1, DB1 never performed any action. What could be the possible reason?
Thanks in advance for help.
Responses