Systemd service fails to start in Pacemaker and reports "error (inactive)"
Issue
When many systemd resources are started in Pacemaker at the same time, some of the resources fail to start with an "inactive" error.
The failure is reported within a few seconds of the start operation being initiated:
$ pcs status
------------------------------------>8----------------------------------------
Migration Summary:
* Node: rhel8-node2 (2):
* test1-service: migration-threshold=1000000 fail-count=1000000 last-failure='Mon May 15 08:59:39 2023'
* test2-service: migration-threshold=1000000 fail-count=1000000 last-failure='Mon May 15 08:59:39 2023'
* test3-service: migration-threshold=1000000 fail-count=1000000 last-failure='Mon May 15 08:59:39 2023'
Failed Resource Actions:
* test1-service_start_0 on rhel8-node2 'error' (1): call=362, status='complete', exitreason='inactive', last-rc-change='Mon May 15 08:59:40 2023', queued=0ms, exec=4064ms
* test2-service_start_0 on rhel8-node2 'error' (1): call=369, status='complete', exitreason='inactive', last-rc-change='Mon May 15 08:59:41 2023', queued=0ms, exec=3820ms
* test3-service_start_0 on rhel8-node2 'error' (1): call=364, status='complete', exitreason='inactive', last-rc-change='Mon May 15 08:59:40 2023', queued=0ms, exec=4027ms
$ cat ./plwmscups02-May11/sos_commands/logs/journalctl_--no-pager
----------------------------------->8-----------------------------------------
May 10 12:21:38 plwmscups02 pacemaker-controld[2825486]: notice: Initiating start operation test1-service_start_0 locally on rhel8-node2
------------------------------------>8----------------------------------------
May 10 12:21:42 rhel8-node2 pacemaker-controld[2825486]: notice: Result of start operation for test1-service on rhel8-node2: error (inactive) <---
May 10 12:21:42 rhel8-node2 pacemaker-controld[2825486]: notice: Result of start operation for test2-service on rhel8-node2: error (inactive) <---
May 10 12:21:42 rhel8-node2 pacemaker-controld[2825486]: notice: Result of start operation for test3-service on rhel8-node2: error (inactive) <---
May 10 12:21:42 rhel8-node2 pacemaker-controld[2825486]: notice: Transition 55 aborted by operation test1-service_start_0 'modify' on rhel8-node2: Event failed
May 10 12:21:42 rhel8-node2 pacemaker-controld[2825486]: notice: Transition 55 action 227 (test1-service_start_0 on rhel8-node2): expected 'ok' but got 'error'
May 10 12:21:42 rhel8-node2 pacemaker-attrd[2825484]: notice: Setting fail-count-test1-service#start_0[rhel8-node2]: (unset) -> INFINITY
May 10 12:21:42 rhel8-node2 pacemaker-attrd[2825484]: notice: Setting last-failure-test1-service#start_0[rhel8-node2]: (unset) -> 1683739302
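To see at a glance which resources are affected, the journal can be filtered for the "error (inactive)" start results shown above. The following is only a minimal sketch: the function name inactive_start_failures is hypothetical, and the pattern assumes the message format matches the log lines in this article.

```shell
# Hypothetical helper: list resources whose start result was "error (inactive)".
# Reads journal text on stdin and prints each affected resource name once.
inactive_start_failures() {
    # Capture the resource name between "for " and " on <node>:".
    sed -n 's/.*Result of start operation for \([^ ]*\) on [^ ]*: error (inactive).*/\1/p' | sort -u
}

# Example usage on a cluster node (assumes the messages are in the journal):
#   journalctl -u pacemaker --no-pager | inactive_start_failures
```

Run against the excerpt above, this would print test1-service, test2-service and test3-service, one per line.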
Despite the failure reported by the cluster, the underlying systemd service may still finish starting shortly afterwards:
$ cat /var/log/messages
----------------------------------->8-----------------------------------------
May 10 12:21:42 rhel8-node2 systemd[1]: Starting Cluster Controlled test1...
----------------------------------->8-----------------------------------------
May 10 12:22:30 rhel8-node2 podman[3255228]: test1
May 10 12:22:30 rhel8-node2 systemd[1]: Started Cluster Controlled test1. <--- service started
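To compare how long systemd actually needed against how quickly the cluster reported the failure, the timestamps from the two logs can be subtracted. This is an illustrative sketch only: the helper names to_seconds and seconds_between are hypothetical, and both timestamps are assumed to fall on the same day.

```shell
# Hypothetical helpers: elapsed seconds between two HH:MM:SS timestamps (same day).
to_seconds() {
    # awk treats the fields as decimal numbers, so values like "08" are handled safely.
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

seconds_between() {
    echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

# systemd "Starting" (12:21:42) to "Started" (12:22:30), taken from the logs above
seconds_between 12:21:42 12:22:30   # prints 48
```

In this example the service took 48 seconds to start, while the cluster recorded the start failure roughly 4 seconds after initiating the operation (12:21:38 to 12:21:42).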
Environment
- Red Hat Enterprise Linux 7, 8 and 9
- High Availability with Pacemaker
- Systemd Pacemaker Resources