Why did the service fail to relocate to another cluster node when the status check on one of its resources failed?


Environment

  • Red Hat Enterprise Linux Server 5 (with the High Availability and Resilient Storage Add-Ons)
  • Red Hat Enterprise Linux Server 6 (with the High Availability and Resilient Storage Add-Ons)

Issue

  • The status check on one of the script resources in a service failed, but the service did not fail over (relocate) to another node. Why?
Aug  9 09:33:01 node1 clurgmgrd: [7060]: <err> script:/etc/init.d/ndD3-1.sh: status of /etc/init.d/ndD3-1.sh failed (returned 1)  <-------- [1]
Aug  9 09:33:01 node1 clurgmgrd[7060]: <notice> status on script "/etc/init.d/ndD3-1.sh" returned 1 (generic error) 
Aug  9 09:33:01 node1 clurgmgrd[7060]: <warning> Some independent resources in service:nd1 failed; Attempting inline recovery 
Aug  9 09:33:01 node1 clurgmgrd: [7060]: <err> script:/etc/init.d/ndD3-1.sh: stop of /etc/init.d/ndD3-1.sh failed (returned 1)  <-------- [2]
Aug  9 09:33:01 node1 clurgmgrd[7060]: <notice> stop on script "/etc/init.d/ndD3-1.sh" returned 1 (generic error) 
Aug  9 09:33:02 node1 nsca[4827]: Caught SIGTERM - shutting down... 
Aug  9 09:33:02 node1 nsca[4827]: Cannot remove pidfile '/var/run/nsca1.pid' - check your privileges.
Aug  9 09:33:02 node1 nsca[4827]: Daemon shutdown 
Aug  9 09:33:07 node1 multipathd: dm-16: umount map (uevent) 
Aug  9 09:33:15 node1 clurgmgrd: [7060]: <notice> Deactivating vg13/nd13 
Aug  9 09:33:15 node1 clurgmgrd: [7060]: <notice> Making resilient : lvchange -an vg13/nd13 
Aug  9 09:33:15 node1 clurgmgrd: [7060]: <notice> Resilient command: lvchange -an vg13/nd13 --config devices{filter=["a|/dev/mapper/mpath0|","a|/dev/mapper/mpath1|","a|/dev/mapper/mpath3|","a|/dev/sda2|","r|.*|"]} 
Aug  9 09:33:15 node1 multipathd: dm-16: remove map (uevent) 
Aug  9 09:33:15 node1 multipathd: dm-16: devmap not registered, can't remove 
Aug  9 09:33:15 node1 clurgmgrd: [7060]: <notice> Removing ownership tag (node1.example.com) from vg13/nd13 
Aug  9 09:33:26 node1 clurgmgrd[7060]: <warning> Inline recovery of service:nd1 failed 
Aug  9 09:33:26 node1 clurgmgrd[7060]: <notice> Stopping service service:nd1 
Aug  9 09:33:26 node1 clurgmgrd: [7060]: <err> script:/etc/init.d/ndD3-1.sh: stop of /etc/init.d/ndD3-1.sh failed (returned 1) <---------- [2]
Aug  9 09:33:26 node1 clurgmgrd[7060]: <notice> stop on script "/etc/init.d/ndD3-1.sh" returned 1 (generic error) 
Aug  9 09:33:26 node1 clurgmgrd: [7060]: <notice> Deactivating vg13/nd13 
Aug  9 09:33:26 node1 clurgmgrd: [7060]: <notice> Making resilient : lvchange -an vg13/nd13 
Aug  9 09:33:26 node1 clurgmgrd: [7060]: <notice> Resilient command: lvchange -an vg13/nd13 --config devices{filter=["a|/dev/mapper/mpath0|","a|/dev/mapper/mpath1|","a|/dev/mapper/mpath3|","a|/dev/sda2|","r|.*|"]} 
Aug  9 09:33:26 node1 clurgmgrd: [7060]: <notice> Removing ownership tag (node1.example.com) from vg13/nd13 
Aug  9 09:33:26 node1 clurgmgrd[7060]: <crit> #12: RG service:nd1 failed to stop; intervention required 
Aug  9 09:33:26 node1 clurgmgrd[7060]: <notice> Service service:nd1 is failed <--------- [3] 

Resolution

When the status check on the script resource failed [1], the stop function on the script resource also failed [2], and so the service was marked as failed [3]. Once a service is marked as failed, manual intervention is required to clear that state: the service must first be disabled and then enabled to start it again. A failed service will not relocate; for relocation to happen, the service must first be stopped cleanly on the source node.

To start the failed service, it must be disabled first and then enabled:

clusvcadm -d service-name   <-- to disable the service
clusvcadm -e service-name   <-- to enable the service
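
For example, to recover the failed service from the log above (the service name nd1 is taken from the log; the member name below is hypothetical):

clusvcadm -d nd1                        # disable the service, clearing the failed state
clusvcadm -e nd1                        # enable it; rgmanager starts it on an allowed node
clusvcadm -e nd1 -m node2.example.com   # or enable it on a specific cluster member
clustat                                 # verify the service state afterwards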

To write a service script that enables a service in a cluster, see:

How to write a service script to enable my service in Red Hat Clustering?

What are the requirements of a "script" resource in Red Hat Enterprise Linux Clusters?

Also see the following article on avoiding failures of the status, start, and stop functions in a script resource:

Why does my cluster service fail when attempting to start, status, or stop a script resource in RHEL?
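
The underlying requirement in those articles is that a script resource must behave like an LSB init script: status must return 0 while the application is running and a non-zero LSB code (such as 3) once it has stopped, and stop must return 0 whenever the application ends up stopped, even if it was not running to begin with. The following is a minimal sketch only, using a hypothetical daemon and pidfile path (none of these names come from the logs above):

#!/bin/sh
# /etc/init.d/myapp-cluster.sh - hypothetical cluster script resource
PIDFILE=/var/run/myapp.pid

case "$1" in
  start)
    # start must return 0 only if the application actually started
    /usr/local/bin/myapp --pidfile "$PIDFILE" || exit 1
    exit 0
    ;;
  stop)
    # stop must exit 0 even if the daemon is already down; a non-zero
    # stop is what marks the whole service as failed in rgmanager
    [ -f "$PIDFILE" ] && kill "$(cat "$PIDFILE")" 2>/dev/null
    rm -f "$PIDFILE"
    exit 0
    ;;
  status)
    # LSB: 0 = running, 3 = stopped; rgmanager treats a non-zero
    # status as a failure and attempts recovery
    if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
      exit 0
    fi
    exit 3
    ;;
  *)
    exit 2
    ;;
esac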

