  • pacemaker subprocess pengine terminated

    Environment:
    OS: RHEL 7.5
    pacemaker: 1.1.19-8.el7_6.1.x86_64
    corosync: 2.4.3-2.el7_5.1.x86_64

    An 8-node cluster, with one of the resources being PostgreSQL 10 with one standby node.
    Main DB node: DB1
    Secondary DB node: DB2

    Issue:
    PostgreSQL stopped suddenly.
    In the corosync logs of DB1:

    Dec 17 13:26:08 [435476] DB1       crmd:    error: crm_ipc_read:   Connection to pengine failed
    Dec 17 13:26:08 [435476] DB1       crmd:    error: mainloop_gio_callback:  Connection to pengine[0x55e53f095dc0] closed (I/O condition=25)
    Dec 17 13:26:08 [435476] DB1       crmd:     crit: pe_ipc_destroy: Connection to the Policy Engine failed | pid=-1 uuid=562cd7f3-9093-4626-adec-cd8b376e1b13
    Dec 17 13:26:08 [435476] DB1       crmd:     info: register_fsa_error_adv: Resetting the current action list
    Dec 17 13:26:09 [435459] DB1 pacemakerd:  warning: pcmk_child_exit:        The pengine process (435475) terminated with signal 9 (core=0)
    Dec 17 13:26:09 [435476] DB1       crmd:   notice: save_cib_contents:      Saved Cluster Information Base to /var/lib/pacemaker/pengine/pe-core-562cd7f3-9093-4626-adec-cd8b376e1b13.bz2 after Policy Engine crash
    Dec 17 13:26:09 [435459] DB1 pacemakerd:   notice: pcmk_process_exit:      Respawning failed child process: pengine
    Dec 17 13:26:09 [435459] DB1 pacemakerd:     info: start_child:    Using uid=189 and group=189 for process pengine
    Dec 17 13:26:09 [435459] DB1 pacemakerd:     info: start_child:    Forked child 126688 for process pengine
    Dec 17 13:26:09 [435459] DB1 pacemakerd:     info: mcp_cpg_deliver:        Ignoring process list sent by peer for local node
    Dec 17 13:26:09 [435459] DB1 pacemakerd:     info: mcp_cpg_deliver:        Ignoring process list sent by peer for local node
    Dec 17 13:26:09 [126688] DB1    pengine:     info: crm_log_init:   Changed active directory to /var/lib/pacemaker/cores
    Dec 17 13:26:09 [126688] DB1    pengine:     info: qb_ipcs_us_publish:     server name: pengine
    Dec 17 13:26:09 [126688] DB1    pengine:     info: main:   Starting pengine
    Dec 17 13:26:09 [435476] DB1       crmd:    error: do_log: Input I_ERROR received in state S_POLICY_ENGINE from save_cib_contents
    Dec 17 13:26:09 [435471] DB1        cib:     info: cib_process_ping:       Reporting our current digest to DB1: 9a96a41da603b14d0edab0b6bc962009 for 3.1248.21817663 (0x56045abbe830 0)
    Dec 17 13:26:09 [435476] DB1       crmd:  warning: do_state_transition:    State transition S_POLICY_ENGINE -> S_RECOVERY | input=I_ERROR cause=C_FSA_INTERNAL origin=save_cib_contents
    Dec 17 13:26:09 [435476] DB1       crmd:  warning: do_recover:     Fast-tracking shutdown in response to errors
    Dec 17 13:26:09 [435476] DB1       crmd:  warning: do_election_vote:       Not voting in election, we're in state S_RECOVERY
    Dec 17 13:26:09 [435476] DB1       crmd:     info: do_dc_release:  DC role released
    Dec 17 13:26:09 [435476] DB1       crmd:     info: do_te_control:  Transitioner is now inactive
    Dec 17 13:26:09 [435476] DB1       crmd:    error: do_log: Input I_TERMINATE received in state S_RECOVERY from do_recover
    Dec 17 13:26:09 [435476] DB1       crmd:     info: do_state_transition:    State transition S_RECOVERY -> S_TERMINATE | input=I_TERMINATE cause=C_FSA_INTERNAL origin=do_recover
    Dec 17 13:26:09 [435476] DB1       crmd:     info: do_shutdown:    Disconnecting STONITH...
    Dec 17 13:26:09 [435476] DB1       crmd:     info: tengine_stonith_connection_destroy:     Fencing daemon disconnected
    Dec 17 13:26:09 [435476] DB1       crmd:     info: stop_recurring_actions: Cancelling op 420 for AsapIP (AsapIP:420)
    Dec 17 13:26:09 [435473] DB1       lrmd:     info: cancel_recurring_action:        Cancelling ocf operation AsapIP_monitor_10000
    Dec 17 13:26:09 [435476] DB1       crmd:     info: stop_recurring_actions: Cancelling op 378 for Ping (Ping:378)
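
    The save_cib_contents line above shows that a snapshot of the CIB was written to /var/lib/pacemaker/pengine/pe-core-562cd7f3-9093-4626-adec-cd8b376e1b13.bz2. As a rough sketch (exact options can vary between Pacemaker versions), crm_simulate should be able to replay such a snapshot and show what the policy engine was calculating at the moment it died:

    # Run on DB1: replay the CIB snapshot saved after the pengine crash.
    # --simulate computes the resulting transition in memory only;
    # --show-scores prints the allocation scores used to place each resource.
    crm_simulate --simulate --show-scores --xml-file /var/lib/pacemaker/pengine/pe-core-562cd7f3-9093-4626-adec-cd8b376e1b13.bz2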
    

    The log states that pengine was killed by signal 9, but there is no crash report in /var/lib/pacemaker/cores.
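
    Signal 9 (SIGKILL) cannot be caught by a process and does not produce a core dump, so an empty /var/lib/pacemaker/cores is expected; the kill comes from outside the process, most often the kernel OOM killer or a manual kill. A hedged way to check, assuming the kernel ring buffer or a persistent journal still covers the time of the crash:

    # Look for OOM-killer activity around the time pengine (PID 435475) died.
    dmesg -T | grep -iE 'out of memory|oom|killed process'
    # If the journal is persistent, the kernel messages can also be checked there
    # (narrow the output down to Dec 17 around 13:26).
    journalctl -k | grep -iE 'oom|killed process'
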
    Pacemaker did recover from this crash:

    Dec 17 13:26:09 [126690] DB1       crmd:     info: get_cluster_type:       Verifying cluster type: 'corosync'
    Dec 17 13:26:09 [126690] DB1       crmd:     info: get_cluster_type:       Assuming an active 'corosync' cluster
    Dec 17 13:26:09 [126690] DB1       crmd:     info: do_log: Input I_STARTUP received in state S_STARTING from crmd_init
    Dec 17 13:26:09 [126690] DB1       crmd:     info: do_cib_control: CIB connection established
    

    Afterwards, another node, APP1, became the DC, as seen in the corosync logs of that node.
    It took the following action:

    Dec 17 13:26:12 [379816] APP1    pengine:   notice: LogAction:   * Promote    Postgresql9:0     ( Slave -> Master DB2 )
    Dec 17 13:26:12 [379816] APP1    pengine:     info: LogActions: Leave   Postgresql9:1   (Stopped)
    

    As I understand it, the new DC is trying to promote PostgreSQL on node DB2.

    But before this action could be completed, it took the following actions:

    Dec 17 13:26:15 [379816] APP1    pengine:   notice: LogAction:   * Stop       Postgresql9:0     (              Master DB1 )   due to node availability
    Dec 17 13:26:15 [379816] APP1    pengine:   notice: LogAction:   * Promote    Postgresql9:1     (     Slave -> Master DB2 )
    

    It asked to stop PostgreSQL on DB1 and promote PostgreSQL on DB2.
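
    The input the DC used for each of these decisions is saved on APP1 under /var/lib/pacemaker/pengine/ (the DC's log normally notes which pe-input-*.bz2 file a given transition was saved in). As a sketch, replaying the transition from 13:26:15 should show why Postgresql9:0 on DB1 was scheduled to stop "due to node availability"; NNN below is a placeholder for the pe-input number referenced in APP1's log:

    # Run on APP1: replay the transition input that produced the LogAction lines above.
    crm_simulate --simulate --show-scores --xml-file /var/lib/pacemaker/pengine/pe-input-NNN.bz2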

    But according to the corosync logs of DB1, it never tried to stop PostgreSQL when asked by the DC APP1.
    After this, the same actions appear in every subsequent LogAction entry.
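
    One way to narrow down why DB1 never acted is to check whether the stop was ever initiated by the DC and whether DB1's crmd/lrmd ever executed it. A rough sketch, assuming both nodes log to /var/log/cluster/corosync.log as in the excerpts above (the path and exact message wording can differ between setups):

    # Run on APP1 (the DC): look for the stop/demote operation being initiated against DB1.
    # Run on DB1: look for crmd/lrmd entries showing the operation was received and executed.
    grep -E 'Postgresql9.*(stop|demote)' /var/log/cluster/corosync.log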

    Unfortunately, I don't have /var/log/messages.

    I have three questions:
    a) What could be the possible reasons that pengine was killed forcefully (signal 9)?
    b) Even though it was killed and Pacemaker was able to recover, why did the DC try to stop PostgreSQL on DB1?
    c) Even after the DC scheduled the action to stop PostgreSQL on DB1, DB1 never performed any action. What could be the possible reason?

    Thanks in advance for your help.
