4.9. 配置隔离

隔离配置可确保 AWS 集群上的故障节点被自动隔离,这样可防止节点消耗集群的资源或影响集群的功能。

您可以使用多种方法在 AWS 集群上配置隔离。本节提供以下内容:

  • 默认配置的标准过程。
  • 另一种配置过程,用于更高级的配置,专注于自动化。

标准流程

  1. 输入以下 AWS 元数据查询以获取每个节点的实例 ID。您需要这些 ID 来配置隔离设备。如需更多信息,请参阅实例元数据和用户数据

    # echo $(curl -s http://169.254.169.254/latest/meta-data/instance-id)

    例如:

    [root@ip-10-0-0-48 ~]# echo $(curl -s http://169.254.169.254/latest/meta-data/instance-id) i-07f1ac63af0ec0ac6
  2. 输入以下命令配置隔离设备。使用 pcmk_host_map 命令将 RHEL 主机名映射到实例 ID。使用您之前设置的 AWS 访问密钥和 AWS Secret 访问密钥。

    # pcs stonith \
        create <name> fence_aws access_key=access-key secret_key=<secret-access-key> \
        region=<region> pcmk_host_map="rhel-hostname-1:Instance-ID-1;rhel-hostname-2:Instance-ID-2;rhel-hostname-3:Instance-ID-3" \
        power_timeout=240 pcmk_reboot_timeout=480 pcmk_reboot_retries=4

    例如:

    [root@ip-10-0-0-48 ~]# pcs stonith \
    create clusterfence fence_aws access_key=AKIAI123456MRMJA secret_key=a75EYIG4RVL3hdsdAslK7koQ8dzaDyn5yoIZ/ \
    region=us-east-1 pcmk_host_map="ip-10-0-0-48:i-07f1ac63af0ec0ac6;ip-10-0-0-46:i-063fc5fe93b4167b2;ip-10-0-0-58:i-08bd39eb03a6fd2c7" \
    power_timeout=240 pcmk_reboot_timeout=480 pcmk_reboot_retries=4

备用步骤

  1. 获取集群的 VPC ID。

    # aws ec2 describe-vpcs --output text --filters "Name=tag:Name,Values=clustername-vpc" --query 'Vpcs[*].VpcId'
    vpc-06bc10ac8f6006664
  2. 通过使用集群的 VPC ID,获取 VPC 实例。

    $ aws ec2 describe-instances --output text --filters "Name=vpc-id,Values=vpc-06bc10ac8f6006664" --query 'Reservations[*].Instances[*].{Name:Tags[? Key==Name]|[0].Value,Instance:InstanceId}' | grep "\-node[a-c]"
    i-0b02af8927a895137     clustername-nodea-vm
    i-0cceb4ba8ab743b69     clustername-nodeb-vm
    i-0502291ab38c762a5     clustername-nodec-vm
  3. 使用获得的实例 ID 在集群的每个节点中配置隔离。例如:

    [root@nodea ~]# CLUSTER=clustername && pcs stonith create fence${CLUSTER} fence_aws access_key=XXXXXXXXXXXXXXXXXXXX pcmk_host_map=$(for NODE \
    in node{a..c}; do ssh ${NODE} "echo -n \${HOSTNAME}:\$(curl -s http://169.254.169.254/latest/meta-data/instance-id)\;"; done) \
    pcmk_reboot_retries=4 pcmk_reboot_timeout=480 power_timeout=240 region=xx-xxxx-x secret_key=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    
    [root@nodea ~]# pcs stonith config fence${CLUSTER}
    
    Resource: clustername (class=stonith type=fence_aws)
    Attributes: access_key=XXXXXXXXXXXXXXXXXXXX pcmk_host_map=nodea:i-0b02af8927a895137;nodeb:i-0cceb4ba8ab743b69;nodec:i-0502291ab38c762a5;
    pcmk_reboot_retries=4 pcmk_reboot_timeout=480 power_timeout=240 region=xx-xxxx-x secret_key=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    Operations: monitor interval=60s (clustername-monitor-interval-60s)

验证

  1. 测试其中一个集群节点的隔离代理。

    # pcs stonith fence awsnodename
    注意

    这个命令的响应可能需要几分钟时间来显示。如果您监视节点被隔离的活跃终端会话,您会在进入 fence 命令后马上终止终端连接。

    例如:

    [root@ip-10-0-0-48 ~]# pcs stonith fence ip-10-0-0-58
    
    Node: ip-10-0-0-58 fenced
  2. 检查状态以验证该节点是否已隔离。

    # pcs status

    例如:

    [root@ip-10-0-0-48 ~]# pcs status
    
    Cluster name: newcluster
    Stack: corosync
    Current DC: ip-10-0-0-46 (version 1.1.18-11.el7-2b07d5c5a9) - partition with quorum
    Last updated: Fri Mar  2 19:55:41 2018
    Last change: Fri Mar  2 19:24:59 2018 by root via cibadmin on ip-10-0-0-46
    
    3 nodes configured
    1 resource configured
    
    Online: [ ip-10-0-0-46 ip-10-0-0-48 ]
    OFFLINE: [ ip-10-0-0-58 ]
    
    Full list of resources:
    clusterfence  (stonith:fence_aws):    Started ip-10-0-0-46
    
    Daemon Status:
    corosync: active/disabled
    pacemaker: active/disabled
    pcsd: active/enabled
  3. 启动上一步中隔离的节点。

    # pcs cluster start awshostname
  4. 检查状态以验证节点已启动。

    # pcs status

    例如:

    [root@ip-10-0-0-48 ~]# pcs status
    
    Cluster name: newcluster
    Stack: corosync
    Current DC: ip-10-0-0-46 (version 1.1.18-11.el7-2b07d5c5a9) - partition with quorum
    Last updated: Fri Mar  2 20:01:31 2018
    Last change: Fri Mar  2 19:24:59 2018 by root via cibadmin on ip-10-0-0-48
    
    3 nodes configured
    1 resource configured
    
    Online: [ ip-10-0-0-46 ip-10-0-0-48 ip-10-0-0-58 ]
    
    Full list of resources:
    
      clusterfence  (stonith:fence_aws):    Started ip-10-0-0-46
    
    Daemon Status:
      corosync: active/disabled
      pacemaker: active/disabled
      pcsd: active/enabled