pacemaker subprocess pengine terminated
Environment:
OS: 7.5
pacemaker: 1.1.19-8.el7_6.1.x86_64
corosync: 2.4.3-2.el7_5.1.x86_64
An 8-node setup; one of the resources is PostgreSQL 10 with one standby node.
main db node: DB1
secondary db node: DB2
Issue:
PostgreSQL stopped suddenly.
In the corosync logs of DB1:
Dec 17 13:26:08 [435476] DB1 crmd: error: crm_ipc_read: Connection to pengine failed
Dec 17 13:26:08 [435476] DB1 crmd: error: mainloop_gio_callback: Connection to pengine[0x55e53f095dc0] closed (I/O condition=25)
Dec 17 13:26:08 [435476] DB1 crmd: crit: pe_ipc_destroy: Connection to the Policy Engine failed | pid=-1 uuid=562cd7f3-9093-4626-adec-cd8b376e1b13
Dec 17 13:26:08 [435476] DB1 crmd: info: register_fsa_error_adv: Resetting the current action list
Dec 17 13:26:09 [435459] DB1 pacemakerd: warning: pcmk_child_exit: The pengine process (435475) terminated with signal 9 (core=0)
Dec 17 13:26:09 [435476] DB1 crmd: notice: save_cib_contents: Saved Cluster Information Base to /var/lib/pacemaker/pengine/pe-core-562cd7f3-9093-4626-adec-cd8b376e1b13.bz2 after Policy Engine crash
Dec 17 13:26:09 [435459] DB1 pacemakerd: notice: pcmk_process_exit: Respawning failed child process: pengine
Dec 17 13:26:09 [435459] DB1 pacemakerd: info: start_child: Using uid=189 and group=189 for process pengine
Dec 17 13:26:09 [435459] DB1 pacemakerd: info: start_child: Forked child 126688 for process pengine
Dec 17 13:26:09 [435459] DB1 pacemakerd: info: mcp_cpg_deliver: Ignoring process list sent by peer for local node
Dec 17 13:26:09 [435459] DB1 pacemakerd: info: mcp_cpg_deliver: Ignoring process list sent by peer for local node
Dec 17 13:26:09 [126688] DB1 pengine: info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores
Dec 17 13:26:09 [126688] DB1 pengine: info: qb_ipcs_us_publish: server name: pengine
Dec 17 13:26:09 [126688] DB1 pengine: info: main: Starting pengine
Dec 17 13:26:09 [435476] DB1 crmd: error: do_log: Input I_ERROR received in state S_POLICY_ENGINE from save_cib_contents
Dec 17 13:26:09 [435471] DB1 cib: info: cib_process_ping: Reporting our current digest to DB1: 9a96a41da603b14d0edab0b6bc962009 for 3.1248.21817663 (0x56045abbe830 0)
Dec 17 13:26:09 [435476] DB1 crmd: warning: do_state_transition: State transition S_POLICY_ENGINE -> S_RECOVERY | input=I_ERROR cause=C_FSA_INTERNAL origin=save_cib_contents
Dec 17 13:26:09 [435476] DB1 crmd: warning: do_recover: Fast-tracking shutdown in response to errors
Dec 17 13:26:09 [435476] DB1 crmd: warning: do_election_vote: Not voting in election, we're in state S_RECOVERY
Dec 17 13:26:09 [435476] DB1 crmd: info: do_dc_release: DC role released
Dec 17 13:26:09 [435476] DB1 crmd: info: do_te_control: Transitioner is now inactive
Dec 17 13:26:09 [435476] DB1 crmd: error: do_log: Input I_TERMINATE received in state S_RECOVERY from do_recover
Dec 17 13:26:09 [435476] DB1 crmd: info: do_state_transition: State transition S_RECOVERY -> S_TERMINATE | input=I_TERMINATE cause=C_FSA_INTERNAL origin=do_recover
Dec 17 13:26:09 [435476] DB1 crmd: info: do_shutdown: Disconnecting STONITH...
Dec 17 13:26:09 [435476] DB1 crmd: info: tengine_stonith_connection_destroy: Fencing daemon disconnected
Dec 17 13:26:09 [435476] DB1 crmd: info: stop_recurring_actions: Cancelling op 420 for AsapIP (AsapIP:420)
Dec 17 13:26:09 [435473] DB1 lrmd: info: cancel_recurring_action: Cancelling ocf operation AsapIP_monitor_10000
Dec 17 13:26:09 [435476] DB1 crmd: info: stop_recurring_actions: Cancelling op 378 for Ping (Ping:378)
The log states that pengine was killed by signal 9 (SIGKILL).
As checked, there is no crash report in /var/lib/pacemaker/cores.
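Signal 9 cannot be caught and never produces a core dump, so the empty cores directory is expected; a common cause of an unexplained SIGKILL is the kernel OOM killer. Below is a minimal sketch for scanning the kernel journal for OOM activity, assuming systemd-journald still retains kernel messages from the incident window (the match strings are only heuristics):

import subprocess

# Read kernel messages from the journal (-k); --no-pager keeps output plain.
# Assumption: the journal still covers the time of the pengine crash.
kernel_log = subprocess.run(
    ["journalctl", "-k", "--no-pager"],
    capture_output=True, text=True, check=False,
).stdout

# Print any line that looks like OOM-killer activity; an entry naming the
# pengine PID (435475 in the log above) would explain the signal 9.
for line in kernel_log.splitlines():
    lowered = line.lower()
    if "out of memory" in lowered or "oom" in lowered or "killed process" in lowered:
        print(line)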
Pacemaker did recover from this crash:
Dec 17 13:26:09 [126690] DB1 crmd: info: get_cluster_type: Verifying cluster type: 'corosync'
Dec 17 13:26:09 [126690] DB1 crmd: info: get_cluster_type: Assuming an active 'corosync' cluster
Dec 17 13:26:09 [126690] DB1 crmd: info: do_log: Input I_STARTUP received in state S_STARTING from crmd_init
Dec 17 13:26:09 [126690] DB1 crmd: info: do_cib_control: CIB connection established
Afterwards, another node, APP1, became the DC, as confirmed in the corosync logs of that node.
It took the following action:
Dec 17 13:26:12 [379816] APP1 pengine: notice: LogAction: * Promote Postgresql9:0 ( Slave -> Master DB2 )
Dec 17 13:26:12 [379816] APP1 pengine: info: LogActions: Leave Postgresql9:1 (Stopped)
As I understand it, the new DC tried to promote PostgreSQL on node DB2,
but before this action could complete, it took the following actions:
Dec 17 13:26:15 [379816] APP1 pengine: notice: LogAction: * Stop Postgresql9:0 ( Master DB1 ) due to node availability
Dec 17 13:26:15 [379816] APP1 pengine: notice: LogAction: * Promote Postgresql9:1 ( Slave -> Master DB2 )
It asked to stop PostgreSQL on DB1 and promote it to master on DB2.
But according to the corosync logs of DB1, it never tried to stop PostgreSQL when asked to by the DC, APP1.
After this, the same actions appear in every subsequent set of LogAction entries.
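The CIB snapshot saved when pengine crashed (the pe-core file named in the DB1 log above) and the pe-input files written by APP1's pengine record the cluster state each transition was computed from; they can be replayed with crm_simulate, or inspected directly. A minimal sketch of direct inspection, assuming the standard Pacemaker 1.1 CIB status layout (attribute names should be verified against the actual file):

import bz2
import xml.etree.ElementTree as ET

# Path taken from the DB1 log above; a pe-input-*.bz2 from APP1 has the same format.
PE_FILE = "/var/lib/pacemaker/pengine/pe-core-562cd7f3-9093-4626-adec-cd8b376e1b13.bz2"

with bz2.open(PE_FILE, "rt") as fh:
    cib = ET.parse(fh).getroot()

# Membership and join state for every node as the scheduler saw it.
for node in cib.iter("node_state"):
    print(node.get("uname"), "in_ccm=%s" % node.get("in_ccm"),
          "crmd=%s" % node.get("crmd"), "join=%s" % node.get("join"),
          "expected=%s" % node.get("expected"))

# Recorded resource operation history (lrm), including the last known
# result codes for the Postgresql9 clone instances.
for lrm_rsc in cib.iter("lrm_resource"):
    for op in lrm_rsc.iter("lrm_rsc_op"):
        print(lrm_rsc.get("id"), op.get("operation"), "rc=%s" % op.get("rc-code"))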
Unfortunately, I don't have /var/log/messages.
I have three questions:
a) What could be the possible reasons that pengine was killed forcefully (signal 9)?
b) Even though pengine was killed and Pacemaker was able to recover, why did the new DC try to stop PostgreSQL on DB1?
c) Even after the DC scheduled the action to stop PostgreSQL on DB1, DB1 never performed any action. What could be the possible reason?
Thanks in advance for help.
Responses