How to change ethmonitor failure detection interval time in Pacemaker?


Environment

Issue

  • How to change ethmonitor failure detection interval time in Pacemaker?
  • How to shorten the time to fail over the VIP (IPaddr2) resource when the corresponding network link is down?
  • What are 'repeat_count' and 'repeat_interval' options in ethmonitor resource?
  • When the VIP resource fails on one node, how can it be made to fail over to another node more quickly?

Resolution

  • Change repeat_count and repeat_interval in the ethmonitor resource. The resource agent describes them as follows:
[root@node1 ~]# pcs resource describe ethmonitor

<snip>

  repeat_count: Specify how often the interface will be monitored, before the status is set to failed. You need to
                  set the timeout of the monitoring operation to at least repeat_count * repeat_interval
  repeat_interval: Specify how long to wait in seconds between the repeat_counts.

</snip>
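The timeout rule in the description above can be checked with simple arithmetic. The following is a minimal sketch, not part of the resource agent; the function name timeout_is_sufficient is hypothetical, and all values are in seconds.

```shell
# timeout_is_sufficient TIMEOUT REPEAT_COUNT REPEAT_INTERVAL
# Hypothetical helper: succeeds when the monitor operation timeout is
# at least repeat_count * repeat_interval, as the description requires.
timeout_is_sufficient() {
    [ "$1" -ge $(( $2 * $3 )) ]
}

# Example: timeout=60s with repeat_count=3 and repeat_interval=10
if timeout_is_sufficient 60 3 10; then
    echo "timeout OK"
else
    echo "timeout too short"
fi
```

With repeat_count=3 and repeat_interval=10 the product is 30, so a 60s monitor timeout satisfies the rule, while a 20s timeout would not.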

For example, to have ethmonitor declare eth1 failed about 20~30s after the network link goes down, set:

[root@node0 ~]# pcs resource update eth1-monitor repeat_count=3 repeat_interval=10
[root@node0 ~]# pcs resource show eth1-monitor
 Resource: eth1-monitor (class=ocf provider=heartbeat type=ethmonitor)
  Attributes: interface=eth1 repeat_count=3 repeat_interval=10                   <<<---- When monitor function is called, it will check link status.

  Operations: start interval=0s timeout=60s (eth1-monitor-start-interval-0s)
              stop interval=0s timeout=20s (eth1-monitor-stop-interval-0s)
              monitor interval=10s timeout=60s (eth1-monitor-monitor-interval-10s)   <<<---- The monitor function of eth1-monitor would be called every 10s.

In this example, the monitor function of eth1-monitor is called every 10s. Each call checks the link status immediately; if the link is down, it retries (3 - 1 = 2) more times, 10s apart.

So it can take 20~30s after the link goes down for eth1 to be declared failed: up to 10s until the next monitor call, plus 2 x 10s of retries.
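The detection window can be worked out from the three settings. The sketch below assumes the timing model described in this article (the link fails at an arbitrary point between two monitor calls, and no wait follows the final check); the function name detection_window is hypothetical.

```shell
# detection_window MONITOR_INTERVAL REPEAT_COUNT REPEAT_INTERVAL
# Prints "<min> <max>": the shortest and longest time in seconds between
# the link going down and ethmonitor reporting the failure.
detection_window() {
    # Time spent retrying inside a single monitor call.
    retry_time=$(( ($2 - 1) * $3 ))
    # Best case: the link fails just before a monitor call starts.
    # Worst case: it fails just after one, adding a full monitor interval.
    echo "$retry_time $(( $1 + retry_time ))"
}

detection_window 10 3 10   # prints "20 30"
detection_window 10 3 5    # prints "10 20"
```

Lowering repeat_interval or repeat_count shortens the retry part of the window; lowering the monitor interval shortens the wait for the next monitor call.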

Root Cause

First, LRMd calls the monitor function of ethmonitor at every monitor interval.
Then, inside the monitor function, the link status is checked immediately. If the link is detected as down, the monitor function retries the check every repeat_interval seconds until repeat_count is exhausted.
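The retry behaviour described above can be sketched as a small loop. This is a simplified illustration, not the agent's actual code: link_is_up and monitor_sketch are hypothetical names, and reading the operstate file under /sys/class/net is only a stand-in for whatever checks the real agent performs.

```shell
# Hypothetical stand-in for the real link check: treat the interface as
# up only when the kernel reports operstate "up".
link_is_up() {
    [ "$(cat "/sys/class/net/$1/operstate" 2>/dev/null)" = "up" ]
}

# monitor_sketch INTERFACE REPEAT_COUNT REPEAT_INTERVAL
# Returns 0 as soon as one check sees the link up; returns 1 only after
# REPEAT_COUNT consecutive checks have all seen it down.
monitor_sketch() {
    remaining=$2
    while [ "$remaining" -gt 0 ]; do
        if link_is_up "$1"; then
            return 0
        fi
        remaining=$(( remaining - 1 ))
        # Wait before the next retry, but not after the final check.
        if [ "$remaining" -gt 0 ]; then
            sleep "$3"
        fi
    done
    return 1
}
```

For example, monitor_sketch eth1 3 10 would return failure only after three failed checks spread over about 20s, which matches the (repeat_count - 1) * repeat_interval retry time used in this article.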

When Network Link is fine:

+-------------------------------------------+
|             ** LRMd **                    |
|                                           |
| lrmd calls ethmonitor's monitor function  |
|  every 10s (monitor interval).            |
|                                           |
+-------------------------------------------+
                 |
                 |
                 V
+-------------------------------------------+
|           ** ethmonitor **                |
|                                           |
| Report that Network Link is working fine. |
|                                           |
+-------------------------------------------+

When Network Link is down:

    +-------------------------------------------+
    |             ** LRMd **                    |
    |                                           |
    | lrmd calls ethmonitor's monitor function  |------------------
    |  every 10s (monitor interval).            |        ^
    |                                           |        |
    +-------------------------------------------+      Takes
                     |                                0 ~ 10s
                     |                                   |
                     V                                   v
+-----------------------------------------------------------+----------
|                   ** ethmonitor **                        |     ^
|                                                           |     |
| Check and find out Network Link is down. (repeat_count--) |     |
|            |                                              |     |
|            | after 5s (repeat_interval)                   |     |
|            V                                              |     |
| Retry and find out Network Link is down. (repeat_count--) |     |
|            |                                              |    Takes
|            | after 5s (repeat_interval)                   |   10s to
|            V                                              |   claim
| Retry and find out Network Link is down. (repeat_count--) |    fail
|            |                                              |     |
|            |                                              |     |
|            V                                              |     |
|    Report Network Link is down.                           |     |
|                                                           |     v
+-----------------------------------------------------------+---------
                      |
                      |
                      V
 Further operations (eg. Failover the resource to another node)

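In the diagram's example (monitor interval 10s, repeat_count=3, repeat_interval=5), it may take about 10~20s from the moment the network link goes down until ethmonitor reports it as down: 0~10s until the next monitor call, plus 2 x 5s of retries.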


1 Comments

Please refer to if_monitor() in /usr/lib/ocf/resource.d/heartbeat/ethmonitor for the detailed logic.

# rpm -qf /usr/lib/ocf/resource.d/heartbeat/ethmonitor
resource-agents-3.9.5-54.el7.x86_64